Article Review: Hidden Technical Debt in Machine Learning Systems by Google
Hypergolic (our ML consulting company) is working on its own ML maturity model and ML assessment framework. As part of this, I am reviewing the literature, starting with five articles from Google.
Our focus with this work is helping companies at an early to moderately advanced stage of ML adoption, and my comments will reflect this. Please subscribe if you would like to be notified of the rest of the series.
Hidden Technical Debt in Machine Learning Systems
This is the famous article with the box plot demonstrating the small fraction of direct ML code in ML systems. It is an updated version of the previous article, which I reviewed here: https://laszlo.substack.com/p/article-review-machine-learning-the. Purely for the sake of that plot, it is worth going through, but I will only discuss the differences.
2: Complex Models Erode Boundaries
This is the same as in the other article, apart from the “Correction Cascades” part from the “Data Dependencies Cost More than Code Dependencies” section, which was moved here.
I agree that it has a better place here, as crossing multiple abstraction layers to implement a residual model is a clear example of boundary erosion. As I recommended before, at a reasonably scaled company the responsibility for solving these problems should belong to a single team, and they should be solved with organisational change rather than by building cross-team coupling.
The “feedback loop” example from “Undeclared Consumers” was moved into its own section, “Feedback Loops”. More on this there.
3: Data Dependencies Cost More than Code Dependencies
The only difference I found compared to the other article is that “When Correlations No Longer Correlate” moved from “Dealing with Changes in the External World” to the bullet points of “Underutilized Data Dependencies”:
Correlated Features. Often two features are strongly correlated, but one is more directly causal. Many ML methods have difficulty detecting this and credit the two features equally, or may even pick the non-causal one. This results in brittleness if world behavior later changes the correlations.
I think, strictly speaking, these are not data dependencies but dependencies on the behaviour of the data. I feel it was better placed in its original section.
There are two types of data dependencies:
Blocking changes: These will break your pipelines and need immediate correction, or you can’t move forward. Technical debt should be reduced through traditional Software Engineering techniques applied to data: Decoupling and Loose Coupling, Separation of Concerns, Bounded Contexts and Context Mapping.
Behavioural changes: drifts, numerical changes and other non-blocking changes. Your product will keep operating silently, and you can only discover these changes through in-depth end-to-end analysis.
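The distinction can be illustrated with a minimal sketch (the function names and the drift threshold are my own, purely for illustration):

```python
def check_schema(rows, required_columns):
    """Blocking change: a missing column breaks the pipeline immediately."""
    for row in rows:
        missing = required_columns - row.keys()
        if missing:
            raise ValueError(f"blocking data change, missing columns: {missing}")

def check_drift(old_values, new_values, tolerance=0.25):
    """Behavioural change: the pipeline still runs, so we compare distributions."""
    old_mean = sum(old_values) / len(old_values)
    new_mean = sum(new_values) / len(new_values)
    shift = abs(new_mean - old_mean) / (abs(old_mean) + 1e-9)
    return shift <= tolerance  # False means a silent drift worth investigating
```

The first failure is loud and forces an immediate fix; the second returns quietly and only surfaces if someone actually runs the analysis.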
These two problems are entirely different in nature and need to be separated. This stems from the fundamental difference between ML and SWE.
Software Engineering is exact and immediate. If something is broken, you will know it right away (mostly). You will be able to write a test that probes this hypothesis, and you will immediately know if the system under test fulfils this specification or not.
Machine Learning, on the other hand, is statistical and inexact. It takes a while until really serious effects surface, but that also means ML systems have the “Graceful Degradation” property. You have time to fix problems. You have time to write in-depth custom analysis, but you need to invest a lot in this because there is no exact way to test the various convoluted concepts.
I feel this difference is somehow lost in these articles, and the engineering approach is overly emphasised. Production-grade systems are analysed constantly and in detail to make sure they fit the business problem. A significant portion of the risks from these two articles will be identified by regular analysis. If you suffer from these risks, an alternative to heavy engineering can be to increase the DS team’s time spent on analysis. This is necessary for Business Intelligence purposes anyway.
4: Feedback Loops
This is a new section promoted from a small part of “Undeclared Consumers”. Quite rightly so: feedback loops are a massive topic for ML systems.
One of the key features of live ML systems is that they often end up influencing their own behavior if they update over time. This leads to a form of analysis debt, in which it is difficult to predict the behavior of a given model before it is released. These feedback loops can take different forms, but they are all more difficult to detect and address if they occur gradually over time, as may be the case when models are updated infrequently.
I still feel that this is not a technical debt question. Quite rightly, they introduce the term “analysis debt”. My experience is that DS teams don’t spend enough time analysing their own products. This is part of the feature factory/project mindset that pushes them to deliver new elements instead of a holistic data-driven/product mindset that searches for new opportunities from feedback and evidence.
Direct Feedback Loops. A model may directly influence the selection of its own future training data.
Hidden Feedback Loops. […] A more difficult case is hidden feedback loops, in which two systems influence each other indirectly through the world.
Direct feedback loops are the concept I was missing from the other article, so I am glad they are mentioned here; hidden feedback loops are a variant of this. I would think that careful system analysis should address these. The organisation should move responsibility for closely related systems that cause feedback loops to the same team so they can be addressed internally.
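A toy simulation (entirely my own construction, not from the paper) shows the direct case: a model that selects its own future training data reinforces its early choices.

```python
import random

random.seed(0)

true_rate = {"a": 0.50, "b": 0.55}   # item "b" is genuinely better
clicks = {"a": 1, "b": 1}            # optimistic priors
shows = {"a": 2, "b": 2}

for _ in range(1000):
    # The model shows whichever item looks best on its own collected data...
    item = max(shows, key=lambda i: clicks[i] / shows[i])
    shows[item] += 1
    if random.random() < true_rate[item]:
        clicks[item] += 1
    # ...so the losing item collects no new data: a direct feedback loop.
```

Whichever item gets an early lucky streak keeps getting shown, and the model never gathers the evidence it would need to correct itself.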
5: ML-System Anti-Patterns
This was renamed from “System-level Spaghetti”. This is the section that refers to the famous box chart.
It is unfortunately common for systems that incorporate machine learning methods to end up with high-debt design patterns. In this section, we examine several system-design anti-patterns that can surface in machine learning systems and which should be avoided or refactored where possible.
They don’t address _why_ these patterns emerge. Calling all code that is not modelling “plumbing” certainly won’t help.
Here is the famous chart:
Unfortunately, the chart provides no sense of value, hierarchy or dependency. It doesn’t tell you, given the cross-functional nature of DS teams, who is in charge of which part. It is just a bunch of boxes thrown on a canvas. One might even conclude that “ML Code” is unimportant, even though the whole system hangs on its performance.
“Glue Code”, “Pipeline Jungles”, “Dead Experimental Codepaths” are lifted from the original article; I am not going to review them again.
They introduce a new interesting concept:
Abstraction Debt. The above issues highlight the fact that there is a distinct lack of strong abstractions to support ML systems. Zheng recently made a compelling comparison of the state ML abstractions to the state of database technology, making the point that nothing in the machine learning literature comes close to the success of the relational database as a basic abstraction.
Reference: A. Zheng. The challenges of building machine learning tools for the masses. SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop). [Slides]
I think this reflects the prevailing attitude towards enterprise ML systems: an abstraction should separate the DSes from the entire infrastructure. Eight years have passed since the article, and this problem still hasn’t been resolved. This is partly because existing one-stop-shop systems underestimate the complexity of real-world problems and of the solutions and analysis required to solve them. Models are indeed the most important part of these systems, being the highest-added-value component, but all the other parts matter as well. The whole is too convoluted to be solved by a simple SaaS product.
OK, so if the complexity of a typical problem is too high to be solved by a one-stop-shop system, what’s the solution? The abstraction debt doesn’t go away. In any complex system, maintaining “Layers of Abstraction” is the most important design principle. Given that no system provides this, the product teams’ Data Scientists must maintain it themselves by observing and adapting software engineering principles (Decoupling, Dependency Injection, Separation of Concerns, etc.).
For distributed learning in particular, there remains a lack of widely accepted abstractions. It could be argued that the widespread use of Map-Reduce in machine learning was driven by the void of strong distributed learning abstractions.
This is more of an insight into Google in 2014. The authors thought it important to mention distributed learning, which is still not widely practised (though upcoming) at “reasonably scaled” companies even in 2022, partly because of the gap between cost and benefit in practical situations. One must take any ML system recommendations coming from Google with a pinch of salt.
The following subsection is new as well, introducing “Code Smells”, one of my favourite topics that I regularly write about [link]:
Common Smells. In software engineering, a design smell may indicate an underlying problem in a component or system. We identify a few ML system smells, not hard-and-fast rules, but as subjective indicators.
Plain-Old-Data Type Smell (This is just the “Primitive Obsession” code smell - Laszlo)
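As a sketch of the fix for this smell (the `Prediction` type and its fields are my own illustration): instead of passing bare floats whose meaning is implicit, wrap them in a small type that carries that meaning.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Prediction:
    score: float        # raw model output
    is_log_odds: bool   # how the score should be interpreted

def to_probability(p: Prediction) -> float:
    """Convert a prediction to a probability, honouring its declared encoding."""
    if p.is_log_odds:
        return 1.0 / (1.0 + math.exp(-p.score))
    return p.score
```

Consumers can no longer silently misinterpret a raw float, which is exactly the failure mode the Plain-Old-Data smell points at.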
Multiple-Language Smell (Maybe this was different in 2014, but nowadays most of these systems are in Python - Laszlo)
Prototype Smell. It is convenient to test new ideas in small scale via prototypes. However, regularly relying on a prototyping environment may be an indicator that the full-scale system is brittle, difficult to change, or could benefit from improved abstractions and interfaces. Maintaining a prototyping environment carries its own cost, and there is a significant danger that time pressures may encourage a prototyping system to be used as a production solution. Additionally, results found at small scale rarely reflect the reality at full scale.
Prototype smell is a crucial concept. One of the primary problems we identified in DS productivity is “POC thinking”. As I wrote above, productionised ML problems are usually very complex and need fine-tuning from real feedback. To solve this in a POC environment, one must put massive effort into mimicking all the nuances of a real environment without committing to the quality of the solution. Once the POC is moved to production, the whole system needs to be refactored to match the old POC system with its duct-taped assumptions, while writing off all the effort that went into building it. Finding out whether the differences between the POC and the production system are relevant is also a considerable source of confusion that must be investigated.
We recommend that with the right level of abstraction, code and assumptions about the POC level solution can be moved straight to production without translation. This increases the ROI of a project as time and effort spent on POC is reused. Also, this makes moving from POC to Production smoother as there is no big-bang translation effort that needs to happen, taking the process closer to traditional CI/CD solutions. Contact us for more on this: https://hypergolic.co.uk/contact/
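One way to picture this (a minimal sketch with purely illustrative names): if the data source and evaluation are injected rather than hard-coded, the same pipeline code serves both the POC and production.

```python
# The pipeline depends only on injected behaviour, not on any environment.
def run_pipeline(load_data, train, evaluate):
    data = load_data()
    model = train(data)
    return evaluate(model, data)

# POC: an in-memory sample and a trivial "model"...
poc_result = run_pipeline(
    load_data=lambda: [1, 2, 3],
    train=lambda data: sum(data) / len(data),
    evaluate=lambda model, data: model,
)
# ...production would swap in a real loader and trainer
# without touching the pipeline code itself.
```

The abstraction layer is what lets the POC assumptions move to production without a big-bang translation.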
6: Configuration Debt
This has been promoted to its own section.
All this messiness makes configuration hard to modify correctly, and hard to reason about. However, mistakes in configuration can be costly, leading to serious loss of time, waste of computing resources, or production issues. This leads us to articulate the following principles of good configuration systems:
• It should be easy to specify a configuration as a small change from a previous configuration.
• It should be hard to make manual errors, omissions, or oversights.
• It should be easy to see, visually, the difference in configuration between two models.
• It should be easy to automatically assert and verify basic facts about the configuration: number of features used, transitive closure of data dependencies, etc.
• It should be possible to detect unused or redundant settings.
• Configurations should undergo a full code review and be checked into a repository.
Dealing with configuration is a pain that all DS team members suffer from. There is a growing trend to declare solutions through DSL-type languages (which appear as configuration) instead of a real programming language (Python). Some of the “good principles” above are similar to coding principles, which would be satisfied automatically in a coding environment. Our recommendation is to avoid complex DSLs and use Python as much as possible.
For example: “It should be possible to detect unused or redundant settings.” How would one go about this? Configuration is not code, so static analysis or code coverage will not pick this up. Or: “Configurations should undergo a full code review and be checked into a repository.” Code already goes through code review. How would one review a usually not very readable config file?
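A sketch of what configuration-as-Python buys you (the class and field names are illustrative): small deltas, visual diffs and review come for free because configs are ordinary code.

```python
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 0.01
    n_features: int = 100
    dropout: float = 0.0

base = TrainConfig()
# "a small change from a previous configuration" is a one-liner:
experiment = replace(base, learning_rate=0.003)

def config_diff(a: TrainConfig, b: TrainConfig) -> dict:
    """See, visually, the difference between two model configurations."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if da[k] != db[k]}
```

Because the config is code, it is type-checked, reviewable and covered by the same tooling as the rest of the system.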
7: Dealing with Changes in the External World
This is the same as in the other article apart from “Up-Stream Producers”, which should have been in the “Data Dependencies…” section.
Data is often fed through to a learning system from various upstream producers. These up-stream processes should be thoroughly monitored, tested, and routinely meet a service level objective that takes the downstream ML system needs into account. Further any up-stream alerts must be propagated to the control plane of an ML system to ensure its accuracy.
This should be mitigated by adequate end-to-end analysis.
8: Other Areas of ML-related Debt
This is a new section mentioning four critical sources of debt. Interestingly, the authors considered these “additional” sources of debt, while in my opinion they are _core_ sources of debt.
This comes down to the interpretation of what an “ML System” is. Is it the end-to-end system that solves a business problem and is maintained by a cross-functional team, or is it an engineering system that _enables_ DS teams to solve ML problems by creating models? Who is the customer of an “ML System”: the Data Scientist or the Business? Major difference. For a reasonably scaled company, I would recommend the first definition to simplify the organisational structure.
Data Testing Debt. If data replaces code in ML systems, and code should be tested […]
This is mentioned here separately after having had its own section. Testing ML systems is complicated, but a large amount of end-to-end evaluation is the best way to achieve it.
Reproducibility Debt. As scientists, it is important that we can re-run experiments and get similar results, but designing real-world systems to allow for strict reproducibility is a task made difficult by randomized algorithms, non-determinism inherent in parallel learning, reliance on initial conditions, and interactions with the external world.
This is indeed hard, but ML reproducibility can be relaxed to reasonable similarity. If simple retraining on the same data can result in wildly different performance characteristics, one might have bigger problems than reproducibility.
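As a sketch of what “relaxed reproducibility” might look like (the scores, the noise and the tolerance are all made up for illustration): rather than demanding bit-identical retrains, assert that two runs land within a tolerance.

```python
import random

def train_and_score(seed):
    """Stand-in for a full training run: a noisy accuracy estimate."""
    random.seed(seed)
    return 0.90 + random.uniform(-0.01, 0.01)

def reproducible_enough(score_a, score_b, tolerance=0.03):
    """Relaxed reproducibility: runs must agree within a tolerance, not exactly."""
    return abs(score_a - score_b) <= tolerance
```

A check like this belongs in CI: it catches the “wildly different performance” case without failing on harmless run-to-run noise.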
Process Management Debt. […] This raises a wide range of important problems, including the problem of updating many configurations for many similar models safely and automatically, how to manage and assign resources among models with different business priorities, and how to visualize and detect blockages in the flow of data in a production pipeline.
I think this is a problem of scaling. Most companies in an early stage of ML adoption should worry less about scaling and more about reproducibility, CI/CD, data lineage and end-to-end evaluation.
Cultural Debt. There is sometimes a hard line between ML research and engineering, but this can be counter-productive for long-term system health. It is important to create team cultures that reward deletion of features, reduction of complexity, improvements in reproducibility, stability, and monitoring to the same degree that improvements in accuracy are valued. In our experience, this is most likely to occur within heterogeneous teams with strengths in both ML research and engineering.
They left the best one for the end. This is _huge_. ML is a cross-functional sport. Every team member must bring their own unique skills to solve the problem, and these need to fit together perfectly. ML is hard enough already without additional team-level issues. Heterogeneous teams also cross-train themselves, improving each team member individually.
It is highly recommended that you take the authors’ (and our) advice.
9: Conclusions: Measuring Debt and Paying it Off
This section was extended with these excellent questions, which I copy here one-by-one:
A few useful questions to consider are:
• How easily can an entirely new algorithmic approach be tested at full scale?
• What is the transitive closure of all data dependencies?
• How precisely can the impact of a new change to the system be measured?
• Does improving one model or signal degrade others?
• How quickly can new members of the team be brought up to speed?
But these will be discussed in the next part when I review: “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction”.
This article is the big brother of the one I reviewed yesterday and, just like that one, it is excellent. It shows Google’s prowess in the field already eight years ago.
The box chart is reproduced everywhere to demonstrate the importance of MLOps at the expense of modelling. But the value is still generated by the model, so the focus should be on enabling its improvement.
Differences between SWE and ML:
SWE: exact, easier to test, must be fixed immediately.
ML: statistical, harder to test, can be fixed later, “Graceful Degradation”.
Analysis Debt: You must test much more than you would think. Most problems should be detectable with end-to-end evaluation.
Abstraction Debt: Even though the infrastructure is not abstracted away, one must still maintain “Layers of Abstraction”.
Prototype Smell: Using “Layers of Abstraction” to avoid friction of moving from POC to Production.
Configuration Debt as an Anti-Pattern: If you define your system with a significant amount of non-native code, you will have problems.
Culture Debt: ML is cross-functional, so the teams must be as well.
I hope you enjoyed this rather long summary of this excellent paper, and please subscribe to be notified about future parts of the series: