Article Review: Machine Learning operations maturity model by Microsoft
Hypergolic (our ML consulting company) works on its own ML maturity model and ML assessment framework. In the next phase, I will review three more articles:
Our focus with this work is helping companies at an early to moderately advanced stage of ML adoption, and my comments will reflect this. Please subscribe if you would like to be notified of the rest of the series.
Machine Learning operations maturity model
The MLOps maturity model helps clarify the Development Operations (DevOps) principles and practices necessary to run a successful MLOps environment. It's intended to identify gaps in an existing organization's attempt to implement such an environment. It's also a way to show you how to grow your MLOps capability in increments rather than overwhelm you with the requirements of a fully mature environment.
One of my major critiques of MS’s maturity model is that it is detached from business goals and DS needs. It reads entirely as a roadmap: go through it and implement the features one by one. It doesn’t prioritise between must-haves and good-to-haves. Well, let’s review it anyway.
As with most maturity models, the MLOps maturity model qualitatively assesses people/culture, processes/structures, and objects/technology.
People/process/technology is a popular consulting technique for describing complex systems. We use it frequently as well.
The article is a series of tables with four fixed topics and bullet points describing the area’s given maturity level. In the review, I will go through each topic one by one for all maturity levels and comment on the progress.
One of the biggest low-hanging fruits in MLOps is collaboration. ML is a cross-functional topic, and you can’t achieve it with a disconnected team. As you can see from the table above, you can accelerate maturity in the people component just by making them talk to each other.
The responsibilities are distributed weirdly. For example, in L2, do the DSes work with the DEs to create repeatable scripts? DEs don’t really deal with anything other than ELT. DSes must do all custom work, and they hand it over to the SWEs. The same applies to L3, where DE→SWE is in charge of model integration. Maybe MS uses DE instead of MLE? Even the MLE/DS split is controversial. In an effective team, there is no distinction between MLEs and DSes.
SWEs implementing post-deployment metrics gathering at L4 is way too late.
This is one of my main critiques of this model. Instead of focusing on where you add more business value with MLOps, it focuses on building a pipeline from start to finish and assigns levels based on how far along you are with the build. This is more of a roadmap than a maturity scale.
In practice, the value is added _at the end of the pipeline_. You are supposed to add components at the very end first: deployment, A/B testing, logging. Then walk backwards, scaling and automating early processes.
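As a rough illustration of "end of the pipeline first", here is a minimal sketch that puts A/B assignment and prediction logging in place before any serious model exists. The names `assign_variant`, `serve`, the two placeholder models, and the log fields are all hypothetical; this is one possible shape under those assumptions, not a prescribed design.

```python
import hashlib
import json
import time
from typing import Callable, Dict, List

Model = Callable[[dict], float]


def assign_variant(user_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user so repeat visits get the same variant."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def serve(user_id: str, features: dict, models: Dict[str, Model],
          log: List[str], treatment_share: float = 0.5) -> float:
    """Score one request and append a structured log line for later analysis."""
    variant = assign_variant(user_id, treatment_share)
    prediction = models[variant](features)
    log.append(json.dumps({
        "ts": time.time(),
        "user_id": user_id,
        "variant": variant,
        "features": features,
        "prediction": prediction,
    }))
    return prediction


# Hypothetical models: a hand-written baseline and a candidate to compare.
models = {
    "control": lambda f: 0.0,
    "treatment": lambda f: 0.1 * f.get("visits", 0),
}
request_log: List[str] = []
serve("user-42", {"visits": 3}, models, request_log)
```

Once this serving and logging shell exists, a better model only has to replace one of the callables, and every earlier pipeline stage can be scaled or automated against logged production data.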
Because there is no data component in the model, I will discuss data here. Automated data gathering and version controlling are the key steps here. These are absolute must-haves. Experiment tracking is important, but that should be resolved through logging (even in training/evaluation pipelines).
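To make "experiment tracking through logging" concrete, here is a minimal sketch that fingerprints the training data and writes one structured log line per run. The function names, the record fields, the `abc1234` code version, and the parameter and metric values are placeholders I made up for illustration; a real setup would likely use a dedicated data versioning tool rather than a bare hash.

```python
import hashlib
import json
import logging
import tempfile
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("experiments")


def fingerprint_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_run(data_path: Path, code_version: str, params: dict, metrics: dict) -> dict:
    """Log everything needed to tie a result back to its data and code."""
    record = {
        "data_sha256": fingerprint_file(data_path),
        "code_version": code_version,  # e.g. a git commit hash (assumption)
        "params": params,
        "metrics": metrics,
    }
    logger.info(json.dumps(record, sort_keys=True))
    return record


# Usage with a throwaway file standing in for a real dataset snapshot.
with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "train.csv"
    data_path.write_bytes(b"x,y\n1,2\n3,4\n")
    record = log_run(data_path, "abc1234", {"lr": 0.01}, {"auc": 0.5})
```

Because the same call works in training and evaluation pipelines alike, the log itself becomes the experiment tracker: any result can be traced to an exact data snapshot and code version.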
Automatic retraining based on production triggers is a hyped feature but, in practice, should be avoided. In most cases, the training needs to be supervised by DSes based on extensive evaluation. If versioning and logging are strict enough, launching training scripts manually should not be too much hassle. Managing compute is a good-to-have cost-saving feature.
There is not much on this topic. Version controlling is paramount, which makes all other issues much smaller.
In general, model release can’t be heavily automated. DSes need to spend a lot of time verifying each model to make sure they didn’t create some unwanted effect. Of course, these tests need to be version controlled and well documented so that the business understands the implications of new models.
It’s not stated explicitly, but based on L3, I read this as SWEs/MLEs reimplementing the DSes’ models. I think this is an antipattern. DSes should create models that go into production as-is. Yesterday’s post [link] on Google’s maturity model introduces the concept of “experimental-operational symmetry”: DSes’ pipelines are deployed into production without change so their assumptions can be compared to results in the production environment. Then any performance difference can be attributed to training/serving skew and not to reimplementation errors.
Large-scale evaluation tests act as integration tests, and “experimental-operational symmetry” suggests that these assertions will hold in production as well. Of course, this is a simplistic argument, and edge cases still need to be unit tested. It is important to understand the concept of testability and what precisely the “system under test” is. SWEs need to test the areas that fall into their remit.
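One way to support that manual verification is a version-controlled check that compares the candidate with the current model per segment and produces a report for a human to read. The sketch below assumes classification models exposed as plain callables; `segment_report`, `regressions`, the `country` segment key, and the toy examples are hypothetical names and data chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[dict, int]
Model = Callable[[dict], int]


def segment_report(current: Model, candidate: Model, examples: List[Example],
                   segment_key: str) -> Dict[str, Dict[str, float]]:
    """Compare accuracy of two models separately for each segment value."""
    counts: Dict[str, Dict[str, int]] = {}
    for features, label in examples:
        stats = counts.setdefault(features[segment_key],
                                  {"n": 0, "current": 0, "candidate": 0})
        stats["n"] += 1
        stats["current"] += int(current(features) == label)
        stats["candidate"] += int(candidate(features) == label)
    return {
        segment: {
            "n": stats["n"],
            "current_acc": stats["current"] / stats["n"],
            "candidate_acc": stats["candidate"] / stats["n"],
        }
        for segment, stats in counts.items()
    }


def regressions(report: Dict[str, Dict[str, float]], tolerance: float = 0.0) -> List[str]:
    """List segments where the candidate is worse than the current model."""
    return sorted(segment for segment, row in report.items()
                  if row["candidate_acc"] < row["current_acc"] - tolerance)


# Toy data: the candidate silently breaks one segment.
examples = [({"country": "a"}, 1), ({"country": "a"}, 1), ({"country": "b"}, 1)]
current_model = lambda f: 1
candidate_model = lambda f: 1 if f["country"] == "a" else 0

report = segment_report(current_model, candidate_model, examples, "country")
flagged = regressions(report)
```

The output is deliberately a report rather than a pass/fail gate: the flagged segments are what a DS would investigate and explain to the business before a release decision is made.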
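A minimal sketch of what this symmetry might look like in code, assuming the whole DS pipeline can be exposed as one callable. The `pipeline` rule, the `amount` feature, the 0.9 threshold, and the toy examples are all invented for illustration; the point is only that the serving entry point calls the very same object that was evaluated offline.

```python
from typing import Callable, List, Tuple

Example = Tuple[dict, int]


def pipeline(raw: dict) -> int:
    """Hypothetical DS-owned pipeline: feature preparation plus model in one callable."""
    amount = float(raw.get("amount", 0.0))  # feature preparation
    return 1 if amount > 100.0 else 0       # stand-in for a trained model


def accuracy(predict: Callable[[dict], int], examples: List[Example]) -> float:
    """Share of examples where the prediction matches the label."""
    correct = sum(int(predict(raw) == label) for raw, label in examples)
    return correct / len(examples)


def serve(raw: dict) -> int:
    """Production entry point: calls the same pipeline, no reimplementation."""
    return pipeline(raw)


# Toy stand-ins for a large offline evaluation set and logged production outcomes.
offline_examples = [({"amount": 250}, 1), ({"amount": 20}, 0), ({"amount": 150}, 1)]
production_examples = [({"amount": 300}, 1), ({"amount": 90}, 1), ({}, 0)]

offline_acc = accuracy(pipeline, offline_examples)

# The large-scale evaluation doubles as an integration test of the pipeline.
assert offline_acc >= 0.9, "evaluation below the (hypothetical) release threshold"

production_acc = accuracy(serve, production_examples)
gap = offline_acc - production_acc  # attributable to skew, not to a rewrite
```

Since `serve` adds no logic of its own, any gap between the two numbers points at differences in the data rather than at a bug introduced while porting the model.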
I feel this is weaker content than Google’s, but still worth the read. My major takeaways are:
Start from the end where you add value and work backwards: deployment, A/B testing, logging and then concentrate on earlier automation.
Version control _everything_: In 2022, this is table-stakes.
Experimental-operational symmetry is an incredibly powerful concept that enables a lot of simplification: DSes/MLEs must write code good enough to go into production.
Large-scale evaluation with experimental-operational symmetry brings you free testing. Delegate testing of the infrastructure to DevOps/MLOps.
Analysis done by DSes drives the process; automation has little benefit.
Work together from Day 0. It is the only free lunch in town.
I hope you enjoyed this summary, and please subscribe to be notified about future parts of the series: