Article Review: MLOps: Continuous delivery and automation pipelines in machine learning by Google
Hypergolic (our ML consulting company) is working on its own ML maturity model and ML assessment framework. As part of this work, I am reviewing the literature, starting with five articles from Google.
Our focus with this work is helping companies at an early to moderately advanced stage of ML adoption, and my comments will reflect this. Please subscribe if you would like to be notified of the rest of the series.
MLOps: Continuous delivery and automation pipelines in machine learning
This is one of the key documents about MLOps. It is pretty long, so the review will be similarly long. Like the technical debt articles, it mainly focuses on engineering rather than business and product, so I will make sure our cross-functional perspective is reflected in the comments.
The article immediately starts with a clarification of terms:
CI: continuous integration - changes are merged into the system continuously.
CD: continuous delivery - new versions can be released to production at any time.
CT: continuous training - models can be retrained at any time.
Data scientists can implement and train an ML model with predictive performance on an offline holdout dataset, given relevant training data for their use case. However, the real challenge isn't building an ML model, the challenge is building an integrated ML system and to continuously operate it in production.
I think this is one of the roots of ML’s problems. Data Scientists and (ML) Engineers must work more closely with each other, not just organisationally but also in terms of skills. DSes must learn to write better code, and MLEs should consider more product/business questions. Otherwise, siloing and premature optimisation will happen.
DevOps versus MLOps
I think the difference between the engineering needs of ML systems and SWE systems is overemphasised. In practice, the experimental nature of ML is not that different from running A/B tests and analysing their impact. The real difference is the evaluation of the solutions: in ML, this can only be done statistically, which requires an entirely new set of skills.
They list key differences:
Team skills: DSes are not SWEs.
True, but they can learn to write better code for smoother cooperation.
Development: ML is experimental.
Experimental branches need to be operated at high quality because they need to go into production for real feedback.
Testing: Testing an ML system is more involved.
True, “Analysis Debt” is a significant issue, and that is exactly why DSes are on the team: to mitigate it.
Deployment: In ML systems, deployment isn’t as simple as deploying an offline-trained ML model as a prediction service. ML systems can require you to deploy a multi-step pipeline to retrain and deploy the model automatically.
This is similar to deploying a major feature touching many services. There are engineering principles (versioned APIs, feature flags) to solve that.
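The feature-flag idea can be sketched as a deterministic traffic split between an old and a new model version. This is my illustration, not something from the article; the `model_version_for` helper and the version names are hypothetical:

```python
import hashlib

def model_version_for(request_id: str, rollout_percent: int = 10) -> str:
    """Deterministically bucket requests so a fixed share hits the new model.

    Hash-based bucketing means a given request id always sees the same
    version, which keeps A/B analysis clean. All names are illustrative.
    """
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
    return "v2" if bucket < rollout_percent else "v1"

# Usage: route 10% of traffic to the candidate model.
version = model_version_for("request-42", rollout_percent=10)
```

Raising `rollout_percent` gradually gives the same staged-rollout behaviour as a feature flag on an ordinary service deployment.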
Production: ML models can have reduced performance not only due to suboptimal coding but also due to constantly evolving data profiles.
True, the answer is to test more, not just model performance offline but model performance in production, training/evaluation/online data skew, diverging performance on different slices and anything else that can go wrong.
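One way to operationalise "test more" is a drift check between training and production feature distributions. The Population Stability Index sketch below is my illustration of the idea, not anything prescribed by the article:

```python
import math
from collections import Counter

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample (e.g. training
    data) and a live sample (e.g. production inputs). Larger = more drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against zero-width bins
    def hist(xs):
        # Clamp out-of-range values into the edge bins.
        counts = Counter(max(0, min(int((x - lo) / width), bins - 1)) for x in xs)
        # Laplace smoothing so empty bins don't blow up the log.
        return [(counts.get(i, 0) + 1) / (len(xs) + bins) for i in range(bins)]
    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Usage: compare a training-time feature sample against production values.
training_sample = [i / 100 for i in range(100)]
production_sample = [x + 0.5 for x in training_sample]  # simulated shift
drift = psi(training_sample, production_sample)
```

A common rule of thumb treats PSI above roughly 0.2 as significant drift worth investigating, though thresholds are exactly the part that is hard to set, as discussed below.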
ML model creation steps
An ML project consists of the following steps: data extraction, data analysis, data preparation, model training, model evaluation, model validation, model serving, model monitoring.
The level of automation of these steps defines the maturity of the ML process, which reflects the velocity of training new models given new data or training new models given new implementations.
This statement is too generic in tying maturity to automation. ML systems serve a business problem; that problem is addressed by a broader product, and the ML system is part of that product. The business responds to changes in the environment by raising requests against the product, which results in a need for change in the ML system.
The maturity, therefore, needs to be related to how easily the ML system can incorporate and validate these changes. The ML system’s OODA loop must be faster than the product team’s OODA loop, so it is not a bottleneck.
One must identify bottlenecks in the ML iteration process and remove them to increase the speed. Removal of these potential bottlenecks should be the measure of maturity.
Understanding the cost/benefit of these bottlenecks is the main job to build a better system.
MLOps level 0: Manual process
Many teams have data scientists and ML researchers who can build state-of-the-art models, but their process for building and deploying ML models is entirely manual. This is considered the basic level of maturity, or level 0. The following diagram shows the workflow of this process.
This is the “typical” technical debt-ridden state of ML. Code is in notebooks; everything is manual. Training is infrequent because it is too difficult to push changes through the “pipeline”. Model artefact handoff to engineering (interestingly, no mention of a rewrite). Training/serving skew because of the reimplemented preprocessing pipeline and unverified data distributions in production. No logging and monitoring.
MLOps level 0 is common in many businesses that are beginning to apply ML to their use cases. This manual, data-scientist-driven process might be sufficient when models are rarely changed or trained. In practice, models often break when they are deployed in the real world.
I think this is unfortunately true, but I don’t think this is an acceptable state in 2022. Even at a first-time ML product launch, the system needs to be in a better place.
They recommend the following changes (and recommend moving to Level 1):
Actively monitor the quality of your model in production
Frequently retrain your production models
Continuously experiment with new implementations to produce the model
Two critical processes go on in an ML system from a DS perspective: writing code to change the models’ behaviour (preprocessing, dataset generation and actual model changes) and large-scale evaluation and analysis (the analysis incorporates live data as well). The actual deployment and running in production is primarily an engineering challenge with known solutions, which can be improved independently. This also includes monitoring the DevOps-related properties of the system, like memory and compute consumption, which are not really related to the business problem the ML system solves.
MLOps level 1: ML pipeline automation
The goal of level 1 is to perform continuous training of the model by automating the ML pipeline; this lets you achieve continuous delivery of model prediction service. To automate the process of using new data to retrain models in production, you need to introduce automated data and model validation steps to the pipeline, as well as pipeline triggers and metadata management.
The first maturity level is “automation”. It is pretty tricky to figure out what they mean by this, but I will address it later.
MLOps level 1: Characteristics
Rapid experiment: The steps of the ML experiment are orchestrated.
This comes from codifying everything rather than from automation.
The transition between steps is automated.
This dismisses the fact that the analysis of new models is more involved than what can be easily automated. DSes need to spend a _huge_ amount of time to build a good mental model of why the model works. If this is automated, you almost certainly have “analysis debt”.
CT of the model in production: The model is automatically trained in production using fresh data based on live pipeline triggers.
I don’t see the point of retraining in production. Instead, make the fresh training data available in an offline environment.
Experimental-operational symmetry: The pipeline implementation used in the development or experiment environment is used in the preproduction and production environment, a key aspect of MLOps practice for unifying DevOps.
This is a key feature. DSes need to learn to write good enough code that it can go straight into production. If it is in an unfamiliar language, they should work closely with MLEs/SWEs and figure out the symmetry (usually with heavy functional/integration testing).
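A minimal sketch of what symmetry can look like in practice, assuming Python on both sides: one shared preprocessing function used by both the offline and the online path, so skew cannot creep in through a reimplementation. All function names are illustrative:

```python
def preprocess(record: dict) -> list:
    """Single preprocessing implementation shared by training and serving.

    The feature logic (scaling, encoding) is hypothetical example content.
    """
    return [
        float(record.get("age", 0)) / 100.0,            # scaled numeric feature
        1.0 if record.get("country") == "US" else 0.0,  # one-hot-style flag
    ]

def training_features(rows):
    """Offline / batch path: map the shared function over a dataset."""
    return [preprocess(r) for r in rows]

def serving_features(request):
    """Online path: the very same function, applied to one request."""
    return preprocess(request)
```

A functional test asserting that both paths produce identical vectors for the same record is the cheapest guard against training/serving skew.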
Modularised code for components and pipelines: To construct ML pipelines, components need to be reusable, composable, and potentially shareable across ML pipelines. Therefore, while the EDA code can still live in notebooks, the source code for components must be modularised. In addition, components should ideally be containerised to do the following:
I think code reuse is overrated. The components of data collection and transformation are too custom to be reusable. The solution is Experimental-operational symmetry from the previous point. Write good code even in EDA.
Isolate each component in the pipeline. Components can have their own version of the runtime environment and have different languages and libraries.
This has diminishing returns. In practice, you rarely have different languages, and homogenising the runtime environment helps a lot. Why would you deliberately add extra complexity?
Continuous delivery of models: This is more automation. But if your experiment pipeline can go straight into production, you are pretty much there.
Pipeline deployment: This is also the consequence of the Experimental-operational symmetry.
MLOps level 1: Additional components
When you deploy your ML pipeline to production, one or more of the triggers discussed in the ML pipeline triggers section automatically executes the pipeline.
Trigger-based automation is usually excessive for ML systems. A typical workflow is:
Modify the pipeline/data (implement changes, refactoring, etc.)
Run processing and training (in one go without human interaction)
Massive amount of analysis (primarily manual based on predefined queries)
Compare results to the previous version
Verify if changes were in the right direction
New dataset / new features affected the models in the correct way
Deploy (mostly engineering, by this time, code is in proper shape)
Massive amount of analysis
Compare production results to previous versions
Compare production results to offline results
Analyse input distribution for changes
As you can see, most time is spent on manual workflows, and manually triggering pipelines is good enough. One rule of thumb about retraining: your analysis and new-idea pipeline should be rich enough that you already have a new feature implemented by the time you are supposed to retrain the model because of degradation.
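The workflow above can be sketched as a manually triggered iteration that trains once, evaluates, and flags the result for human analysis instead of auto-deploying. Everything here is illustrative (the training and evaluation stubs are toys):

```python
def evaluate(model, dataset):
    """Toy evaluation: fraction of correct predictions on (x, y) pairs."""
    return sum(model(x) == y for x, y in dataset) / len(dataset)

def run_iteration(train, previous_metric, dataset):
    """One manually triggered cycle: train, evaluate, compare to the
    previous version, and hand off to human analysis (never auto-deploy)."""
    model = train()
    metric = evaluate(model, dataset)
    return {
        "metric": metric,
        "delta_vs_previous": metric - previous_metric,
        # A human reviews the full analysis before any deployment decision.
        "proceed_to_analysis": metric >= previous_metric,
    }

# Usage with a toy dataset and a toy "training" function.
dataset = [(0, 0), (1, 1), (2, 0)]
result = run_iteration(lambda: (lambda x: x % 2), 0.5, dataset)
```

The key design choice is that the return value is an input to manual analysis, not a deployment trigger.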
Automated triggers based on pre-set statistical tests can have unintended consequences. The authors mention various scenarios related to data and model validation, but they are too simplistic to be used in practice, and the thresholds for the triggers are too hard to set to be meaningful.
Most systems end up with periodic training anyway (weekly or monthly), but then what is the point of the automation? Instead, trigger training manually based on new features or extensive analysis.
They mention some MLOps components that are now commonly known:
Feature store: Helps with feature reuse and avoiding training/serving skew. These are fair points, but only in addition to the experimental-operational symmetry that can be achieved without an extra component.
Metadata management: This is the equivalent of what we call a “Model Store”. Data lineage is probably the most important feature of ML systems. Metadata contributes to lineage but is easier to implement with logging alone. Storing performance metrics here is irrelevant, as ML models need to be analysed manually anyway. Tools like W&B and Tensorboard are only used to monitor training progress, not the model’s overall validity.
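A lineage-by-logging sketch, assuming an append-only JSON-lines store: each training run appends a record with a content hash of the data, the code version and the metrics. The `log_run` helper and its fields are hypothetical:

```python
import hashlib
import io
import json

def log_run(store, code_version, data, metrics):
    """Append one training-run record to a JSON-lines store.

    Hashing the (sorted) data gives a cheap lineage key: two runs trained
    on the same rows get the same hash regardless of row order. Assumes
    the rows are sortable; all field names are illustrative.
    """
    record = {
        "code_version": code_version,
        "data_hash": hashlib.sha256(repr(sorted(data)).encode()).hexdigest()[:12],
        "metrics": metrics,
    }
    store.write(json.dumps(record) + "\n")
    return record

# Usage with an in-memory store standing in for a file or object store.
store = io.StringIO()
rec = log_run(store, "abc123", [3, 1, 2], {"auc": 0.91})
```

Because it is just logging, this works without any dedicated metadata component, which is the point being made above.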
ML pipeline triggers
One worry about triggers combined with generic top-level performance metrics from the previous point is that one might think they can launch models automatically. This is almost certainly going to lead to analysis debt. Avoid automatic triggers as much as possible.
MLOps level 1: Challenges
Assuming that new implementations of the pipeline aren't frequently deployed and you are managing only a few pipelines, you usually manually test the pipeline and its components. In addition, you manually deploy new pipeline implementations. You also submit the tested source code for the pipeline to the IT team to deploy to the target environment. This setup is suitable when you deploy new models based on new data, rather than based on new ML ideas.
However, you need to try new ML ideas and rapidly deploy new implementations of the ML components. If you manage many ML pipelines in production, you need a CI/CD setup to automate the build, test, and deployment of ML pipelines.
This is a good summary of the situation. I would add that the new implementations are usually relatively infrequent, so the overhead coming from lack of automation is not very high.
It is essential to mention that the pipeline is _tested_ frequently during refactoring in a test environment (small amount of data).
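Such a smoke test can be as simple as running the full pipeline on a tiny fixture and asserting structural invariants. The toy pipeline below is my illustration, not the article's:

```python
def run_pipeline(rows):
    """Toy pipeline: drop invalid rows, extract a feature, 'train' a
    trivial mean model. Stands in for the real multi-step pipeline."""
    clean = [r for r in rows if r.get("value") is not None]
    features = [float(r["value"]) for r in clean]
    return {"n_rows": len(clean), "model": sum(features) / len(features)}

def smoke_test(sample):
    """Run the full pipeline end to end on a tiny sample and check
    structural invariants, not statistical quality."""
    out = run_pipeline(sample)
    assert out["n_rows"] > 0, "pipeline dropped every row"
    assert out["model"] == out["model"], "model parameter is NaN"  # NaN != NaN
    return out
```

Running this on every refactoring commit catches wiring bugs in seconds, leaving statistical questions to the full-scale analysis.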
MLOps level 2: CI/CD pipeline automation
For a rapid and reliable update of the pipelines in production, you need a robust automated CI/CD system. This automated CI/CD system lets your data scientists rapidly explore new ideas around feature engineering, model architecture, and hyperparameters. They can implement these ideas and automatically build, test, and deploy the new pipeline components to the target environment.
Ok, so my understanding of this system is that a DS can change a pipeline in any way (data collection, transformation, feature engineering and model code). That change flows through the whole system, creating enough logs to allow the extensive analysis needed, and also allows human intervention to stop deploying models that fail the analysis.
Honestly, it took me a side-by-side comparison to see the difference compared to the Level 1 diagram. The only one I found is the No. 2 “CI: Build, test …” box at the top right. So are Level 2 and Level 1 the same? If Level 0 is the absolute unmaintainable, technical debt-ridden base case, then is there really only one step in the maturity model? Automate everything or bust?
MLOps level 2: Characteristics
The following diagram shows the stages of the ML CI/CD automation pipeline:
The data analysis step is still a manual process for data scientists before the pipeline starts a new iteration of the experiment. The model analysis step is also a manual process.
Ok, so everything is automated apart from the two largest timesinks. Excessive automation is doubly problematic: triggers and autonomous retraining can introduce unwanted dynamics that are hard to analyse and lead to adverse situations. Also, to justify the amount of engineering required, you need a lot of models to amortise its cost. Google can justify it, but very few other companies can.
Continuous integration: package and run tests (unit, functional and integration).
The tests described are pretty simplistic; if you have large-scale evaluation on each model, simple standalone tests (like guarding against division by zero) will not add much, as integration tests will trigger these errors anyway, and the analysis phase will pick up an excessive amount of NaNs.
Continuous delivery: deliver the package to the target environment and run integration tests:
Verifying the compatibility
Testing the prediction service by calling the service API
Testing prediction service performance
Validating the data either for retraining or batch prediction
Verifying that models meet the predictive performance targets
Automated deployment to a test environment triggered by pushing code to the development branch
Semi-automated deployment to a preproduction environment triggered by merging code to the main branch
Manual deployment to a production environment
These tests are relevant but are not adding enough value to justify a higher maturity level.
There are two types of problems that can appear: blocking problems, which are software engineering issues, and performance problems, which are statistical issues. Blocking problems need to be tested by comparing the changes individually in the production and development environments. For example, a new feature must be generated in both places and the values compared. If one of the legs breaks down, engineers must find the bug. But this is not a mathematical problem.
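A parity check of this kind can be sketched as a row-by-row comparison of the same feature computed in both environments. `feature_parity` is an illustrative helper, not an API from the article:

```python
def feature_parity(dev_values, prod_values, tol=1e-9):
    """Compare a feature computed in both environments row by row.

    Returns the list of (index, dev, prod) mismatches. A non-empty result
    is a blocking engineering bug to debug, not a statistical question.
    """
    return [
        (i, d, p)
        for i, (d, p) in enumerate(zip(dev_values, prod_values))
        if abs(d - p) > tol
    ]

# Usage: the same feature generated in the dev and prod legs.
mismatches = feature_parity([1.0, 2.0, 3.0], [1.0, 2.5, 3.0])
```

The tolerance parameter only absorbs floating-point noise; any genuine divergence points at a reimplementation bug in one of the legs.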
Statistical changes are evaluated extensively twice for each change. In practice, the two analysis stages merge into one continuous process: the DSes continuously monitor the state of their changes, combining training, evaluation, A/B test, production and previous versions of all of the above.
Summary
This is one of the most influential articles in MLOps. It needs to be scrutinised in detail. One must understand that it was written by one of the most prominent players in the field, and their advice might not be relevant to smaller companies or companies just beginning their journey.
There are key takeaways that I would like to mention:
Experimental-operational symmetry is probably the most important concept in all of MLOps. The code you run your experiments on is the same code you run in production. DS/MLE/SWE teams need to operate in a tight loop, cross-training themselves to reach optimum performance.
Extensive analysis is the biggest timesink of running an ML system. Updating the pipeline is second, and MLOps is only third. One must invest their resources accordingly.
The extensive analyses also act as the primary gatekeeper to any progress towards production. Automated triggering is secondary compared to manual analysis.
The focus should be on maintaining code quality and experimental-operational symmetry rather than excessive automation.
Modularisation, isolation and reuse are not without friction and have limited upside. One must justify their cost.
I hope you enjoyed this rather long summary of this excellent paper, and please subscribe to be notified about future parts of the series: