How to stop wasting time fixing broken machine learning pipelines

Carlo Provinciali
Booking.com Data Science
8 min read · Aug 8, 2022


Let me set the scene: it’s Monday morning, and you just came back from a relaxing weekend. You turn on your laptop, excited to get back to exploring new models. Suddenly that weekend feeling is crushed when you start getting error notifications. Your teammates start asking what’s wrong, and then your stakeholders get in touch more urgently. You have a machine learning pipeline in production, people are relying on those predictions, and your carefully designed system ground to a halt while you were enjoying some well-deserved time off. And worst of all, you have no idea what is causing the issue.

In this blog post, I am going to talk about a few best practices to help with ML pipeline traceability and debuggability, and share the initiatives that we have started at Booking.com to help ML practitioners seamlessly adopt these in their daily work.

Does this scenario sound all too familiar? Well, rest assured that you are not the only one. Tracking changes and the way components interact is already complex when working with traditional, deterministic software systems, but the situation becomes dramatically more complex with ML systems. With ML models you have to keep track not only of code and configurations, but also of data versions and hyperparameters, and in most cases there are semi-random processes that can produce unforeseeable results.

All these moving components make it even harder to figure out why something is not working, which inevitably causes frustration within the ML community. When you have so many talented and collaborative teams dedicated to solving problems with ML, you don’t want them to spend the majority of their time tracking down bugs rather than working on the next innovative application. For these reasons, I compiled a list of best practices to help speed up issue resolution and prevent these problems from showing up in the first place.

Know what code runs in your production environment

Why should you care?

Imagine something goes wrong with your model while it tries to make predictions. It is throwing an error saying that your input data has an unexpected format. Nothing unusual … except you are convinced that you fixed that same issue last week. But wait a minute, have your changes been deployed? Which version of your code is currently running in your pipeline? If you can’t answer these questions easily, debugging might turn into a painful and time-consuming process.

What can you do about it?

What if you could neatly break down your code into versions and quickly identify the changes introduced with each of them? This makes it much easier to understand why code that was previously working suddenly stopped, and in the worst case you can always revert to a previous working version. It also becomes much easier to ensure that your pipelines use the latest working version, avoiding the scenario described above.

This can be achieved relatively easily by organizing your code into versioned packages and leveraging git tags. Creating packages means grouping all the different modules and functionality that make your ML project work into one unit. Much like you’d run an import statement at the beginning of your script to load libraries like pandas and numpy, you can use import statements to pull in everything your project needs, regardless of whether you are working on your own machine or in another environment. The advantage is that this approach makes it easy to specify which version of your model you want to run in your project’s requirements file (e.g. my-awesome-ml-model==1.0.0), and as long as that file is accessible you immediately know which version of your project is running at the moment.

That’s nice, but you might be wondering how this helps you understand which changes were introduced with each version. That’s where git tags come into play: you can tag each version of your package to a specific commit in Git, so you know exactly which code changes went into each version. You could even automate the process by having a CI pipeline create the corresponding git tag each time you commit a change to your package version, although some might prefer to keep this step manual.

By adopting packages and git tags, you can easily figure out which version of the code is running at any point in time and which specific changes were introduced in that version.
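To make this concrete, here is a minimal sketch of what that looks like in practice. The package name my-awesome-ml-model matches the hypothetical example above and is not a real package; the idea is simply to pin the version in your requirements file and log the installed version when the pipeline starts.

# requirements.txt (hypothetical package name from the example above):
#   my-awesome-ml-model==1.0.0

from importlib.metadata import version

# Look up the installed version of the (hypothetical) package at start-up,
# so every pipeline run records exactly which code it executed.
installed_version = version("my-awesome-ml-model")
print(f"Running my-awesome-ml-model version {installed_version}")

If each released version also carries a matching git tag (e.g. v1.0.0), the version string in the logs points you straight at the commit that produced it.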

What have we done to encourage the use of packages and git tags at Booking.com?

Of course, none of this comes for free, and organizing your code into packages might require you to take off your ML hat and put on a Software Engineer’s hat for a day or two. That is to say, it might take a bit of time to learn how to structure your code in a certain way. To make this shift in mindset a bit easier at Booking.com, we have decided to create a “cookiecutter template” that ML practitioners can use as a reference [NOTE: Our version is customized to work with our systems and as such we cannot share it here. If you want to get started on your own version, here’s a good source of inspiration https://github.com/drivendata/cookiecutter-data-science]. This project template essentially pre-loads your project’s git repository with tooling and structures it in a way that is convenient for creating packages. This includes, for instance, separating the main functionality and unit tests into their own folders, or adding the automation that runs the tests before creating a new commit. Once all these elements are in place, it should only take a quick command to build, test, and deploy your package to our internal distribution channels, from where it can be easily deployed in different environments.

Even if this cookiecutter template makes it easier to structure your project in a package-friendly format, it still requires a fair amount of fine-tuning and configuration, which some practitioners might not have the bandwidth for. Luckily, we have well-maintained mono-repo style projects that provide a lot of functionality out of the box, including git tagging and package creation; all that’s required is to add your ML code to the project, and you should be able to create and deploy packages easily, regardless of whether it is a production pipeline or a one-off exploratory Jupyter notebook [by “monorepo” I mean a single repository that contains different projects not related to each other. Each project can be, and often is (at least in Booking.com’s case), owned by a different team]. This can help you reach the end goal of more traceable code faster, but you lose some control over the versioning logic and other configurations. There is no one-size-fits-all solution, so we provide multiple options to cater to the widest possible audience!

Spend Less Time on Code Maintenance

Why should you care?

ML code can rarely keep functioning without being revisited periodically. Data and product requirements change constantly, and so do the people who own and maintain the code that powers your predictions, rankings, and classifications. You might even shift to other projects and forget all the intermediate steps in your model’s workflow. You don’t want your future self to be miserable trying to read code you haven’t touched in 6+ months. You also don’t want the coworkers who might inherit your model to hate you until the end of time because they can’t make sense of what you built. After all, you want you and your fellow ML practitioners to focus on building cool new things, not babysitting ML pipelines that keep throwing tantrums every now and then.

What can you do about it?

There is an ocean of possibilities for reducing time spent on code maintenance, but the easiest and most effective one I’ve found is this: write testable code.

Sure, you might say, but my pipelines are complex and adding tests for every single piece of functionality might take ages. That might be true, but perfection is the enemy of progress, and covering even part of your code is better than nothing. The benefit is that once you have a suite of tests covering the most common scenarios, you can be relatively confident that any changes you introduce won’t make your ML pipelines grind to a halt. This is immensely useful because it means you don’t have to spend a ton of time monitoring your pipelines after you push your changes, and you can focus on that exciting new model you are building.
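As an illustration, here is a minimal pytest-style sketch. The fill_missing_prices function is an invented stand-in for whatever preprocessing step your own pipeline relies on; the point is that even one small test like this documents the expected behaviour and catches regressions.

import pandas as pd

# Hypothetical preprocessing step: replace missing prices with the median price.
def fill_missing_prices(df: pd.DataFrame) -> pd.DataFrame:
    median_price = df["price"].median()
    return df.assign(price=df["price"].fillna(median_price))

# A pytest-style test covering the most common failure mode: missing values.
def test_fill_missing_prices_handles_nans():
    df = pd.DataFrame({"price": [10.0, None, 30.0]})
    result = fill_missing_prices(df)
    assert result["price"].isna().sum() == 0                # no missing values left
    assert result["price"].tolist() == [10.0, 20.0, 30.0]   # NaN replaced by the median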

Another great practice is to have someone else look at your code. This has a dual effect: not only might the reviewer identify areas for improvement (and/or learn something new in the process), but knowing that someone else will read your code indirectly encourages you to write it better. The latter aspect comes in particularly handy if you hope that someone (even the reviewer themselves) will use or maintain your code in the future. Typically, people ask others to approve or comment on their changes before merging them into the main git branch (a merge request), but there are many other ways to get feedback.

What have we done to encourage the use of automated testing and code review at Booking.com?

Ideally, each time you make a change to your code, you want to automatically run tests to make sure you are not introducing broken code into your main branch or production systems. For this reason, we have equipped the cookiecutter template mentioned earlier with configurations that automatically run tests before changes are merged to the main branch. This has helped me multiple times when working on changes to, say, the model inputs and architecture; I wrote multiple tests to make sure model training would go as expected and the model produced sensible predictions, and those tests were vital for detecting issues before the problematic code started being used more broadly. The same principle holds for our mono-repo style collaboration projects, where new changes need to pass all tests before being pushed to the main branch.
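For instance, a training smoke test might look something like the sketch below. This is a hypothetical example using scikit-learn on synthetic data, not our internal setup; the idea is simply to assert that training completes and that the resulting model behaves sensibly before any change reaches the main branch.

import numpy as np
from sklearn.linear_model import LogisticRegression

def test_training_produces_sensible_predictions():
    # Tiny synthetic dataset with an obvious signal: the label is 1
    # exactly when the first feature is positive.
    rng = np.random.default_rng(seed=42)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] > 0).astype(int)

    model = LogisticRegression().fit(X, y)
    preds = model.predict_proba(X)[:, 1]

    # Sanity checks: probabilities are valid and the model clearly beats random guessing.
    assert np.all((preds >= 0) & (preds <= 1))
    assert model.score(X, y) > 0.8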

For code review, beyond the traditional reviews that happen in the merge request process, we have set up code brainstorming groups where anyone seeking feedback can get paired with volunteer reviewers. The process is pretty smooth: the requester fills in a questionnaire with all the info about their model and a link to their code, and a meeting is scheduled with the reviewers to share advice or feedback.

To sum up

All of these practices sound nice, don’t they? But the main pushback I hear from people is that they don’t have enough time to implement them because they are so focused on making sure their ML models produce tangible results. I partially agree with that, especially for quick, proof-of-concept style projects. But I believe that once these projects become more concrete and start making their way into production, investing some time in these best practices has an immense payout: it frees you from unnecessary, tedious debugging so you can focus on the work that really makes a difference.

Appendix

Here are some useful resources and further reading on the topics discussed in this post:

Unit Tests:

Code reviews:
