Article Review: Machine Learning: The High-Interest Credit Card of Technical Debt by Google
2022-03-14
Hypergolic (our ML consulting company) is working on its own ML maturity model and ML assessment framework. As part of this, I am reviewing the literature, starting with four articles from Google:
Machine Learning: The High-Interest Credit Card of Technical Debt
The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction
Rules of Machine Learning: Best Practices for ML Engineering
MLOps: Continuous delivery and automation pipelines in machine learning
Our focus with this work is helping companies at an early to moderately advanced stage of ML adoption, and my comments will reflect this. Please subscribe if you would like to be notified of the rest of the series.
Machine Learning: The High-Interest Credit Card of Technical Debt
The article starts with the definition of technical debt. Ward Cunningham established the term as a metaphor for corporate debt (and not household debt). The term later took on a life of its own and went through what Doc Norton calls “metaphorphosis” [YouTube]. Practitioners began to call any kind of bad practice “tech debt”, regardless of what it actually is: a bad practice. We will see that some of the items mentioned below are not tech debt (a well-calculated risk that can be mitigated later) but outright errors that should be avoided.
In this paper, we focus on the system-level interaction between machine learning code and larger systems as an area where hidden technical debt may rapidly accumulate. At a system level, a machine learning model may subtly erode abstraction boundaries. It may be tempting to re-use input signals in ways that create unintended tight coupling of otherwise disjoint systems. Machine learning packages may often be treated as black boxes, resulting in large masses of “glue code” or calibration layers that can lock in assumptions. Changes in the external world may make models or input signals change behaviour in unintended ways, ratcheting up maintenance cost and the burden of any debt. Even monitoring that the system as a whole is operating as intended may be difficult without careful design.
The end of the first section highlights four major categories that will be discussed later in detail. The categories are excellent, covering the entire modelling pipeline well. I keep the original numbering for easier cross-reference.
2: Complex Models Erode Boundaries
Entanglement
Unfortunately, it is difficult to enforce strict abstraction boundaries for machine learning systems by requiring these systems to adhere to specific intended behaviour. Indeed, arguably the most important reason for using a machine learning system is precisely that the desired behaviour cannot be effectively implemented in software logic without dependency on external data. There is little way to separate abstract behavioural invariants from quirks of data. The resulting erosion of boundaries can cause significant increases in technical debt.
ML is precisely about coupling code with data, which can lead to an erosion of boundaries. But current ML practice often abandons abstraction and boundary separation altogether, which leads to an excessive amount of technical debt or even the failure of the project.
Even though some coupling is inevitable, one must still attempt to maintain abstractions through end-to-end testing and refactoring.
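As a minimal sketch of what such an end-to-end test could look like (the `train_pipeline` and `evaluate` functions, the toy data and the metric floor are all hypothetical stand-ins, not from the paper):

```python
# A toy end-to-end regression test (pytest style): the pipeline entry points
# below are stand-ins for a real project's training and evaluation code. The
# point is that the test exercises the whole path from raw rows to a
# business-level metric, guarding the behaviour we care about while refactoring.

def train_pipeline(rows):
    """Stand-in training step: 'learns' a spend threshold from labelled rows."""
    converted_spend = [r["spend"] for r in rows if r["converted"]]
    return {"threshold": min(converted_spend)}

def evaluate(model, holdout):
    """Stand-in evaluation: accuracy of the thresholded prediction on a frozen holdout."""
    hits = sum((r["spend"] >= model["threshold"]) == r["converted"] for r in holdout)
    return hits / len(holdout)

def test_end_to_end_metric_floor():
    train = [{"spend": 10, "converted": False}, {"spend": 50, "converted": True}]
    holdout = [{"spend": 5, "converted": False}, {"spend": 60, "converted": True}]
    metric = evaluate(train_pipeline(train), holdout)
    # The floor encodes the behaviour we refuse to lose during a refactor.
    assert metric >= 0.9, f"end-to-end metric regressed: {metric:.2f}"
```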
No inputs are ever really independent. We refer to this here as the CACE principle:
Changing Anything Changes Everything.
The authors argue that interconnected elements result in CACE. But this is inevitable with pipeline-style products like ML. The goal should be to isolate changes and to monitor their effects, no matter how complicated the required evaluation is.
A frequent trope in this article is that modelling is treated as an engineering problem rather than as a cross-functional business intelligence + data science + software engineering project. Traditional SWE teams are not well equipped to evaluate end-to-end systems, which is why they need functions that can run in-depth analyses of the various data problems.
They also suggest serving ensembles, which is the right solution if the _model_ specification changes.
Hidden Feedback Loops
Another worry for real-world systems lies in hidden feedback loops. Systems that learn from world behaviour are clearly intended to be part of a feedback loop.
They give an example of weekly total CTR being delayed by a week, which is natural for cumulative features over time. They acknowledge that these problems are natural for ML researchers to investigate.
These kinds of ML modelling decisions are higher-level problems that are not related to technical debt. They are “essential complexity” rather than “accidental complexity”. The question is how quickly you can change this feature to another one and still have an evaluated model, but that is not related to the fact that temporal features are difficult to reason about.
Undeclared Consumers
This section starts with a typical case of one model using the output of another model. Business problems are solved with end-to-end, system-level thinking, and this type of behaviour needs to be analysed at evaluation time. Again, I don’t think this is tech debt. Conway’s Law would suggest moving both models under the same owner at a “reasonable scale” company. I appreciate that Google is not a reasonable-scale company and this might not be possible for them, which is why the authors mention this issue.
This section doesn’t mention the feedback loops that arise from recording user behaviour that is affected by the very model under observation and then training on this data. These effects need to be treated with A/B testing and causal modelling.
3: Data Dependencies Cost More than Code Dependencies
Dependency debt is noted as a key contributor to code complexity and technical debt in classical software engineering settings. We argue here that data dependencies in machine learning systems carry a similar capacity for building debt. Furthermore, while code dependencies can be relatively easy to identify via static analysis, linkage graphs, and the like, it is far less common that data dependencies have similar analysis tools. Thus, it can be inappropriately easy to build large data-dependency chains that can be difficult to untangle.
Unstable Data Dependencies
Data signals can change over time purely due to external changes or model changes. Also, the separation of data engineering and model engineering can amplify this. The authors recommend using versioned copies of the data to mitigate this. And they argue that the requirement to maintain multiple versions of the same signal over time contributes to technical debt in its own right.
External change is inevitable in ML, so you don’t fight it but manage it. Data versioning is a good step, but data lineage (which model was in prod at which time and which data it was trained on) is even better. One must have a “time-travel” capability. Maintaining data history is a must in ML problems; therefore, “versioned data sets” will come as a free feature.
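A minimal sketch of what such a lineage record could look like, assuming a flat file as the registry and SHA-256 content hashes (a real setup would use a feature store or an ML metadata service; the paths and field names are illustrative):

```python
import hashlib
import json
from datetime import datetime, timezone

def dataset_fingerprint(path: str) -> str:
    """Content hash of a data file, so a training run can be tied to exact bytes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def record_lineage(model_version: str, train_path: str, registry_path: str) -> dict:
    """Append one lineage entry: which model was trained on which data, and when."""
    entry = {
        "model_version": model_version,
        "train_data": train_path,
        "train_data_sha256": dataset_fingerprint(train_path),
        "trained_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(registry_path, "a") as registry:
        registry.write(json.dumps(entry) + "\n")
    return entry
```

With a registry like this, “time travel” becomes a lookup: find the entry for the model that was in production on a given day and re-evaluate against the exact data it saw.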
This section feels like it focuses only on ML as an extension of SWE, ignoring the Business Intelligence (BI) part of DS, which is inseparable from it. ML products are always evaluated end-to-end through business goals and BI techniques, not just through engineering paradigms.
Underutilised Data Dependencies
Underutilized dependencies can creep into a machine learning model in several ways.
Legacy Features
Bundled Features
Epsilon-Features
These forms of “dead code” need to be removed by pruning features while monitoring end-to-end performance. But one must be careful: a small impact doesn’t mean uselessness, as the feature may be important in some high-risk edge cases. This is why evaluating models is a BI problem and not a purely statistical one.
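A sketch of what careful pruning could look like: a leave-one-feature-out loop that checks both the global metric and a high-risk slice before declaring a feature removable. The `evaluate` callback, the slices and the tolerance are hypothetical placeholders for a project’s own end-to-end evaluation.

```python
def prune_candidates(evaluate, features, all_rows, high_risk_rows, tolerance=0.002):
    """Return features whose removal costs little overall AND little on the
    high-risk slice; `evaluate(feature_subset, rows)` is a stand-in for the
    project's end-to-end evaluation function."""
    baseline_all = evaluate(features, all_rows)
    baseline_risk = evaluate(features, high_risk_rows)
    removable = []
    for feature in features:
        subset = [f for f in features if f != feature]
        drop_all = baseline_all - evaluate(subset, all_rows)
        drop_risk = baseline_risk - evaluate(subset, high_risk_rows)
        # A feature is only a pruning candidate if neither the global metric
        # nor the high-risk segment degrades beyond the tolerance.
        if drop_all <= tolerance and drop_risk <= tolerance:
            removable.append(feature)
    return removable
```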
As an example, suppose that after a team merger, to ease the transition from an old product numbering scheme to new product numbers, both schemes are left in the system as features. New products get only a new number, but old products may have both. The machine learning algorithm knows of no reason to reduce its reliance on the old numbers. A year later, someone acting with good intent cleans up the code that stops populating the database with the old numbers. This change goes undetected by regression tests because no one else is using them any more. This will not be a good day for the maintainers of the machine learning system.
This happens when model evaluation and monitoring are done ad hoc. ML systems are never left alone and are always evaluated end-to-end. Versioning data dependencies and a proper ELT policy will help you mitigate the situation above.
This again sounds like the separation of DS/BI and SWE/ML. Having a cross-functional team will help you deal with this.
Static Analysis of Data Dependencies
This section is about breaking changes caused by data changes. These need to be addressed with defensive programming: external data dependencies should be tested, and sudden changes should be flagged by anomaly detection (again, anomaly detection is an E2E solution). Google is large enough for these dependencies to be incomprehensible to a single person, but most places should be able to resolve them.
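A sketch of such defensive checks on an external feed, assuming a list of dicts as the payload; the column name, thresholds and reference statistics are hypothetical, and a dedicated data-validation tool would do this more thoroughly:

```python
import math

def check_feed(rows, required_columns, reference_mean, column="price",
               max_null_rate=0.01, z_limit=4.0):
    """Defensive checks on an external data feed before it reaches training:
    schema presence, null rate, and a crude mean-shift alarm."""
    problems = []
    for col in required_columns:
        if any(col not in row for row in rows):
            problems.append(f"missing column: {col}")
    values = [row.get(column) for row in rows]
    null_rate = sum(v is None for v in values) / len(values)
    if null_rate > max_null_rate:
        problems.append(f"null rate {null_rate:.2%} exceeds {max_null_rate:.2%}")
    present = [v for v in values if v is not None]
    if present:
        mean = sum(present) / len(present)
        std = math.sqrt(sum((v - mean) ** 2 for v in present) / len(present)) or 1.0
        if abs(mean - reference_mean) / std > z_limit:
            problems.append(f"mean shifted from {reference_mean} to {mean:.2f}")
    return problems  # a non-empty list means: block the pipeline and alert
```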
Correction Cascades
This section describes an error specific to an incorrect implementation of a special ensemble related to residual modelling. It then immediately answers it with the correct solution: all members of the ensemble belong to the same end-to-end project and should be owned and evaluated together.
4: System-level Spaghetti
Glue Code
Using self-contained solutions often results in a glue code system design pattern, in which a massive amount of supporting code is written to get data into and out of general-purpose packages. […] Glue code can be reduced by choosing to re-implement specific algorithms within the broader system architecture.
Managing layers of abstraction in a clean architecture, not reimplementation, should be the primary way to manage external packages. Most programming effort should go into business-problem-specific code, and the solution should connect to external packages through Dependency Inversion. I will write more about this later.
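A minimal sketch of what Dependency Inversion can look like here: the business code owns a small `Ranker` port, and a thin adapter hides the external package behind it (the adapter assumes a scikit-learn-style estimator with `fit`/`predict_proba`; all names are illustrative):

```python
from typing import Protocol, Sequence

class Ranker(Protocol):
    """The interface the business code owns; external packages adapt to it."""
    def fit(self, rows: Sequence[dict], labels: Sequence[int]) -> None: ...
    def score(self, rows: Sequence[dict]) -> list[float]: ...

class SklearnRankerAdapter:
    """Hides a scikit-learn-style estimator behind the Ranker port, so business
    code never imports the library directly."""

    def __init__(self, estimator, feature_order: Sequence[str]):
        self._estimator = estimator
        self._features = list(feature_order)

    def _matrix(self, rows: Sequence[dict]) -> list[list[float]]:
        return [[row[f] for f in self._features] for row in rows]

    def fit(self, rows, labels):
        self._estimator.fit(self._matrix(rows), list(labels))

    def score(self, rows):
        # Probability of the positive class, per the predict_proba convention.
        return [p[1] for p in self._estimator.predict_proba(self._matrix(rows))]
```

Swapping the library then means writing a new adapter, not rewriting the pipeline, and the glue stays in one small, well-tested place.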
They also quote a 5%/95% ratio of ML code to glue code. This is the wrong way to think about ML products. The argument should be about essential versus accidental complexity, treating essential code outside the model as a first-class citizen of the ML product and not just the code of the actual model. If you allow your team to dismiss 95% of the codebase as “glue code”, why do you expect them to maintain it, and why are you surprised that tech debt slows the team down?
Pipeline Jungles
It’s worth noting that glue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated “research” and “engineering” roles. When machine learning packages are developed in an ivory-tower setting, the resulting packages may appear to be more like black boxes to the teams that actually employ them in practice.
Pipeline jungles are a fact of life in ML, though their issues are often overstated; in practice, a typical ML pipeline has only a few steps. One must enable refactoring so that the pipeline converges toward a simple and maintainable logical state as it changes over the product’s lifecycle.
The quote points to the obvious problem of separating research and productionisation. Clean Architecture enables merging these parts into a single continuous process and reduces friction during product delivery.
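One way to keep the jungle visible is to write the pipeline as an explicit, ordered list of small named steps that share a context dict, so each step can be tested, reordered or deleted on its own; a sketch with illustrative steps and data:

```python
import math
from functools import reduce

def drop_missing(ctx):
    """Remove rows with a missing spend value."""
    ctx["rows"] = [r for r in ctx["rows"] if r["spend"] is not None]
    return ctx

def add_log_spend(ctx):
    """Derive a log-transformed feature from spend."""
    for r in ctx["rows"]:
        r["log_spend"] = math.log1p(r["spend"])
    return ctx

PIPELINE = [drop_missing, add_log_spend]  # the whole "jungle", in one place

def run(pipeline, ctx):
    return reduce(lambda acc, step: step(acc), pipeline, ctx)

print(run(PIPELINE, {"rows": [{"spend": 12.0}, {"spend": None}]}))
```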
Dead Experimental Codepaths
This is the same problem as in “traditional” Software Engineering. Clean Architecture, refactoring capability and end-to-end evaluation should enable the creation and the eventual removal of experimental code paths.
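A sketch of an experimental codepath behind a single, well-named switch, so creating and later deleting it is a one-line change (the flag, helper and feature names are hypothetical):

```python
EXPERIMENTS = {"use_embedding_features": False}  # hypothetical experiment flag

def embed(row: dict) -> list[float]:
    """Stand-in for the experimental feature computation."""
    return [0.0] * 8

def build_features(row: dict, experiments: dict = EXPERIMENTS) -> dict:
    features = {"spend": row["spend"]}
    if experiments.get("use_embedding_features"):
        # The experimental path lives here and nowhere else, so concluding the
        # experiment means deleting this branch and the flag.
        features["embedding"] = embed(row)
    return features
```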
Configuration Debt
Consider the following examples. Feature A was incorrectly logged from 9/14 to 9/17. Feature B is not available on data before 10/7.
This is an important section; the example they give is typical and hard to handle. One would hope that end-to-end evaluation and specific, well-named code paths will help to resolve this problem.
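One way to build such a well-named codepath is to write the known data quirks down once, as configuration-as-code, and check them at training time; the dates below mirror the paper’s example, with the year and field names as illustrative assumptions:

```python
from datetime import date

# Hypothetical feature-availability table capturing the paper's example:
# feature A was logged incorrectly between 9/14 and 9/17, feature B only
# exists from 10/7 onwards (the year is chosen for illustration only).
FEATURE_VALIDITY = {
    "feature_a": {"bad_ranges": [(date(2014, 9, 14), date(2014, 9, 17))]},
    "feature_b": {"available_from": date(2014, 10, 7)},
}

def usable(feature: str, day: date) -> bool:
    """True if the feature can be trusted for training rows from `day`."""
    rules = FEATURE_VALIDITY.get(feature, {})
    if day < rules.get("available_from", date.min):
        return False
    return not any(start <= day <= end for start, end in rules.get("bad_ranges", []))
```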
The situation they quote, ML systems containing more configuration lines than code lines, is an antipattern in itself, or perhaps they only count the direct ML code lines as relevant (see the 5/95 ML/glue code split above).
ML systems are always evaluated end-to-end, and configuration is part of this evaluation. This might make understanding cause and effect regarding parts of the code, configuration and data difficult, but that’s reality.
5: Dealing with Changes in the External World
Fixed Thresholds in Dynamic Systems
It is often necessary to pick a decision threshold for a given model to perform some action: to predict true or false, to mark an email as spam or not spam, to show or not show a given ad. One classic approach in machine learning is to choose a threshold from a set of possible thresholds, in order to get good tradeoffs on certain metrics, such as precision and recall. However, such thresholds are often manually set.
These thresholds are usually set by business requirements rather than being the result of an optimisation algorithm. It is crucial to record the historical values of these thresholds, just as one records changes in labelled data or configuration files. Understanding the feedback effects of these changes needs to be incorporated into the end-to-end analysis.
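A sketch of recording threshold history alongside the rest of the configuration, so the end-to-end analysis can line up metric shifts with threshold changes (the values, dates and reasons are invented for illustration):

```python
from datetime import date

# Hypothetical history of a business-driven decision threshold, recorded the
# same way as changes to labelled data or configuration files.
SPAM_THRESHOLD_HISTORY = [
    {"from": date(2021, 6, 1), "value": 0.80, "reason": "initial launch"},
    {"from": date(2021, 11, 15), "value": 0.92, "reason": "cut false positives before holiday season"},
]

def threshold_on(day: date) -> float:
    """Return the threshold that was in force on a given day."""
    applicable = [e for e in SPAM_THRESHOLD_HISTORY if e["from"] <= day]
    if not applicable:
        raise ValueError(f"no threshold recorded on or before {day}")
    return max(applicable, key=lambda e: e["from"])["value"]
```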
When Correlations No Longer Correlate
However, if the world suddenly stops making these features co-occur, prediction behaviour may change significantly.
Again, as before: ML products are never “left alone”; end-to-end analysis and evaluation never stop, precisely for this reason. You always need to keep one eye on a model in production in case something changes unexpectedly.
Monitoring and Testing
Prediction Bias: […] Slicing prediction bias by various dimensions isolate[s] issues quickly, and can also be used for automated alerting.
Action Limits. In systems that are used to take actions in the real world, it can be useful to set and enforce action limits as a sanity check. These limits should be broad enough not to trigger spuriously.
Same as before, one must never stop evaluating.
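A sketch of both checks, assuming per-request records with a predicted probability and an observed outcome; the slice key, field names and limits are hypothetical:

```python
from collections import defaultdict

def prediction_bias_by_slice(records, slice_key):
    """Mean predicted rate minus observed rate per slice; a bias far from zero
    in any slice is a condition worth alerting on."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r[slice_key]].append((r["predicted"], r["observed"]))
    return {
        key: sum(p for p, _ in pairs) / len(pairs) - sum(o for _, o in pairs) / len(pairs)
        for key, pairs in grouped.items()
    }

def within_action_limit(actions_last_hour: int, limit: int = 10_000) -> bool:
    """Broad sanity cap on real-world actions; triggering it means stop and investigate."""
    return actions_last_hour <= limit

print(prediction_bias_by_slice(
    [
        {"country": "US", "predicted": 0.30, "observed": 1},
        {"country": "US", "predicted": 0.20, "observed": 0},
        {"country": "DE", "predicted": 0.90, "observed": 0},
    ],
    slice_key="country",
))  # {'US': -0.25, 'DE': 0.9}
```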
Summary
This was a great article, and I really enjoyed rereading it. As always, one must remember that “You are not Google”, but many excellent points can be distilled from it for “reasonable scale” companies.
In general, I feel the content is very engineering-focused, which misses the cross-functional nature of ML. In a reasonably scaled company, these problems should be resolved by moving ownership of problems to a team that has BI, DS, ML and SWE skills at the same time. Some interdependency problems can be resolved by the “Inverse Conway Manoeuvre”: moving the dependent parts to the same team rather than keeping the risk of an external link.
Also, the article (or, in general, the entire industry) underestimates the amount of end-to-end analysis that needs to go into creating and maintaining an ML product. Some of the risk factors above will be resolved if the BI/DS team evaluates the model frequently and rigorously purely to understand its business impact.
To make this happen, one must maintain a rigorous data lineage facility. Everything related to the model (data, labelled data, configuration, code) and its usage (training, testing and production data) must be recorded and made available for evaluation.
The fourth conclusion is the application of traditional Software Engineering principles to ML projects. One cannot call everything outside the concrete model code “glue code”, or no one will maintain it. ML teams need to be familiar with concepts like refactoring, decoupling, clean code and clean architecture, and they need to practise them.
I hope you enjoyed this rather long summary of this excellent paper, and please subscribe to be notified about future parts of the series.