Article Review: Rules of Machine Learning: Best Practices for ML Engineering by Google
2022-03-17
Hypergolic (our ML consulting company) works on its own ML maturity model and ML assessment framework. As part of it, I review the literature starting with five articles from Google:
Machine Learning: The High-Interest Credit Card of Technical Debt
Hidden Technical Debt in Machine Learning Systems
The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction
Rules of Machine Learning: Best Practices for ML Engineering
MLOps: Continuous delivery and automation pipelines in machine learning
Our focus with this work is helping companies at an early to moderately advanced stage of ML adoption, and my comments will reflect this. Please subscribe if you would like to be notified of the rest of the series.
Rules of Machine Learning: Best Practices for ML Engineering
Forty-three points!!!! This will be a long one, so buckle up.
do machine learning like the great engineer you are, not like the great machine learning expert you aren’t.
OK, but what if you are not a great engineer (yet)? Strive to be better. You can find a lot of content about this on this blog:
How to simplify analysis: [link]
How to structure your teams and fit into your org: [link]
How to connect business value to DS projects: [link], [link] and [link]
Make sure your pipeline is solid end to end.
Start with a reasonable objective.
Add common-sense features in a simple way.
Make sure that your pipeline stays solid.
So this is essentially Test-Driven Development and Continuous Delivery through pipeline-based functional testing. I think we are in the right place, and this article will fit nicely into the general topic of this blog.
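To make that concrete, here is a minimal sketch of what a pipeline-level functional test can look like; the pipeline, fixture data and checks are hypothetical, and pytest plus scikit-learn are assumed:

```python
# Minimal pipeline-level functional test (pytest style); the pipeline and the
# fixture data are hypothetical stand-ins for your own code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline():
    """Stand-in for the real training pipeline: scale features, fit a simple model."""
    return make_pipeline(StandardScaler(), LogisticRegression())


def test_pipeline_end_to_end():
    # Tiny fixture dataset: 20 rows, 3 features, binary label.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    y = (X[:, 0] > 0).astype(int)

    model = build_pipeline().fit(X, y)
    preds = model.predict(X)

    # Functional checks: the pipeline runs end to end, returns one prediction
    # per row, and the labels come from the expected set.
    assert preds.shape == (20,)
    assert set(preds) <= {0, 1}
```

A test like this says nothing about model quality; its job is to prove the pipeline stays solid end to end while you iterate.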
Before Machine Learning
Rule #01: Don’t be afraid to launch a product without machine learning.
Rule #02: First, design and implement metrics.
Rule #03: Choose machine learning over a complex heuristic.
More important than the question of whether to use ML at all is the company's general data culture. The typical situation where ML pays off is one where a large number of micro-decisions is made in data-rich contexts. These are ripe for automation. A company with a good data culture will proactively start collecting data about these situations, giving future ML projects a head start.
Rule #02 is a no-brainer: don't just work on metrics, but place the ML product in the broader context of the business and align it with the company's AI strategy. How does it help the company? How will the project be accelerated by existing ML infrastructure and components, and what future elements will it enable? This can win you support for your project from different parts of the business.
Rule #03 is one of my favourites. Some people think that rule-based systems are somehow simpler than ML solutions. They are not. Above a certain level of complexity, it is simpler to train models than to deal with the convoluted relationships between rules. If you still want some heuristics, you can implement them through feature engineering, as in the sketch below.
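As a hedged illustration of "heuristics as features" (the spam-style message schema and the heuristic itself are made up for the example):

```python
# Rule #03/#07 sketch: a heuristic becomes a feature instead of a hard-coded
# decision. The message schema and the heuristic itself are hypothetical.
def num_exclamation_marks(text: str) -> int:
    return text.count("!")


def make_features(message: dict) -> dict:
    return {
        "length": len(message["body"]),
        # The old rule "more than three exclamation marks means spam" is now a
        # plain numeric feature; the model decides how much weight it deserves.
        "exclamation_marks": num_exclamation_marks(message["body"]),
        "sender_known": int(message["sender"] in message.get("contacts", [])),
    }
```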
ML Phase I: Your First Pipeline
Rule #04: Keep the first model simple and get the infrastructure right.
Rule #05: Test the infrastructure independently from machine learning.
Rule #06: Be careful about dropped data when copying pipelines.
Rule #07: Turn heuristics into features, or handle them externally.
We advocate a system architecture we call "Clean Architecture". It separates the core of the ML model from the infrastructure, hooking it up through adapters to temporary data sources. As the project progresses, these adapters can be replaced with the real ones without touching the ML model. This enables functional testing and performance evaluation throughout the project and allows a smooth transition to production if the early stages are successful. In general, this helps reduce "Prototyping Debt".
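Here is a minimal sketch of the adapter idea, assuming Python with pandas; the class and function names are illustrative, not our actual implementation:

```python
# Clean Architecture sketch: the model core depends only on a narrow interface,
# so a temporary CSV adapter can later be swapped for the production source
# without touching the model code. All names here are illustrative.
from typing import Protocol

import pandas as pd


class TrainingDataSource(Protocol):
    def load(self) -> pd.DataFrame: ...


class CsvDataSource:
    """Temporary adapter used while prototyping."""

    def __init__(self, path: str):
        self.path = path

    def load(self) -> pd.DataFrame:
        return pd.read_csv(self.path)


class WarehouseDataSource:
    """Production adapter, e.g. wrapping a SQL client; body omitted here."""

    def load(self) -> pd.DataFrame:
        raise NotImplementedError


def train_model(source: TrainingDataSource) -> pd.DataFrame:
    df = source.load()
    # ...the model core: feature engineering and fitting happen here, unaware
    # of where the data came from...
    return df
```

During prototyping you would call `train_model(CsvDataSource("sample.csv"))`; in production only the adapter changes, the model core does not.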
Ah, Rule #07 is just what I mentioned under Rule #03.
Monitoring
Rule #08: Know the freshness requirements of your system.
Rule #09: Detect problems before exporting models.
Rule #10: Watch for silent failures.
Rule #11: Give feature columns owners and documentation.
As I mentioned in the earlier reviews, many projects suffer from "Analysis Debt": data scientists do not run enough evaluations to observe the whole system, both in production and offline. Many integration and data issues can be spotted by running a large-scale analysis and then hunting down the root causes. The goal is to be exhaustive in finding sources of potential problems.
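As one hedged example of what such a check might look like, a simple comparison of feature statistics between the training snapshot and recent serving logs already surfaces many silent failures (Rule #10); the thresholds and dataframe names below are assumptions:

```python
# One way to surface silent failures (Rule #10): compare missing-value rates
# and means of each feature between the training snapshot and recent serving
# logs. Thresholds and dataframe names are assumptions.
import pandas as pd


def feature_drift_report(train: pd.DataFrame, serving: pd.DataFrame,
                         max_missing_jump: float = 0.05,
                         max_mean_shift: float = 3.0) -> list:
    alerts = []
    for col in train.columns:
        if col not in serving.columns:
            alerts.append(f"{col}: missing from serving logs")
            continue
        missing_jump = serving[col].isna().mean() - train[col].isna().mean()
        if missing_jump > max_missing_jump:
            alerts.append(f"{col}: missing rate jumped by {missing_jump:.1%}")
        if pd.api.types.is_numeric_dtype(train[col]):
            std = train[col].std() or 1.0  # guard against zero variance
            shift = abs(serving[col].mean() - train[col].mean()) / std
            if shift > max_mean_shift:
                alerts.append(f"{col}: mean shifted by {shift:.1f} training std devs")
    return alerts
```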
Your First Objective
Rule #12: Don’t overthink which objective you choose to directly optimize.
Rule #13: Choose a simple, observable and attributable metric for your first objective.
Rule #14: Starting with an interpretable model makes debugging easier.
Rule #15: Separate Spam Filtering and Quality Ranking in a Policy Layer.
This is an interesting one. It refers forward to Rule #39 ("Launch decisions are a proxy for long-term product goals.") and discusses how the optimised metrics and the actual goals can diverge. This is a crucial point. I would rather focus on aligning the ML product with the business goals from Day 0.
At first, I would use a simple model and iterate with the business team on real impact studies rather than work on a simplified use case. Often, business teams don't have a completely crystallised idea of what they want. Still, once you start showing them concrete impact charts, they will quickly clarify which of the many KPIs an ML product can influence actually matter to them.
Rule #15 refers to a convoluted ML case where different ML problems are mixed into one product. They interact with each other, have different contexts and need to be trained and used differently. On the other hand, it is very Google-centric, so I just link it here for the reader to find analogues to their own work: [link].
ML Phase II: Feature Engineering
Rule #16: Plan to launch and iterate.
Rule #17: Start with directly observed and reported features as opposed to learned features.
Rule #18: Explore with features of content that generalize across contexts.
Rule #19: Use very specific features when you can.
Rule #20: Combine and modify existing features to create new features in human-understandable ways.
Rule #21: The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have.
Rule #22: Clean up features you are no longer using.
Lots of great rules about feature engineering. In general, prepare for change. ML is an iterative discipline, and your codebase should constantly be in a state where you can change it and have confidence that the change has a positive impact, not only on the statistical metrics but also on the business goals.
If you take our advice on Clean Architecture, plenty of functional testing and large-scale evaluation, you will be in a good place to iterate.
Rule #21 is a good heuristic for matching model complexity to the amount of data you have. But also take into account the impact of each feature and simplify as much as you can (see the sketch below). Always link back to the business case and assess real impact, not just model metrics.
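One possible way to assess per-feature impact and simplify, sketched with scikit-learn's permutation importance; the model, the synthetic data and the cut-off are assumptions for illustration:

```python
# Assess per-feature impact with permutation importance, then flag features
# whose removal barely matters. Model, synthetic data, and the 0.005 cut-off
# are assumptions for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

# Candidate features to drop: shuffling them changes validation accuracy by
# less than half a percentage point.
to_drop = [i for i, imp in enumerate(result.importances_mean) if imp < 0.005]
print("low-impact feature indices:", to_drop)
```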
Human Analysis of the System
Rule #23: You are not a typical end-user.
Rule #24: Measure the delta between models.
Rule #25: When choosing models, utilitarian performance trumps predictive power.
Rule #26: Look for patterns in the measured errors, and create new features.
Rule #27: Try to quantify observed undesirable behaviour.
Rule #28: Be aware that identical short-term behaviour does not imply identical long-term behaviour.
One of the main reasons we advocate extremely fast productionisation (even if it is just with domain experts or business units) is to expose the product to someone other than the product team. Even internal users are better judges of your product than you are. Find external design partners who are willing to look at your product at an early stage and gather feedback.
Otherwise, this section is a reflection on "Analysis Debt". More evaluation in a "real-life" context is always the best option.
Rule #24 (measuring the delta) is an excellent point. Pair that with evaluation on different slices, as in Point III.6 from [here], combine it with Clean Architecture based iteration, and you have an extremely powerful workflow (sketched in code after the list below):
1. Look at slices that are somehow different from the average.
2. Make a small change.
3. Run the evaluation.
4. Compare the deltas to the original evaluation (per slice and on average).
5. Reason about the effects.
6. Iterate from step 1.
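A minimal sketch of steps 3 and 4, assuming pandas and a single slice column; the column names and the metric are hypothetical:

```python
# Evaluate the current and the candidate model on the same validation frame,
# overall and per slice, then look at the deltas. Column names and the metric
# are hypothetical; tiny slices with a single label would need extra handling.
import pandas as pd
from sklearn.metrics import roc_auc_score


def evaluate(df: pd.DataFrame, score_col: str, slice_col: str = "country") -> pd.Series:
    overall = pd.Series({"overall": roc_auc_score(df["label"], df[score_col])})
    per_slice = df.groupby(slice_col).apply(
        lambda g: roc_auc_score(g["label"], g[score_col])
    )
    return pd.concat([overall, per_slice])


def delta_report(df: pd.DataFrame) -> pd.DataFrame:
    # df holds the label, the slice column, and scores from both models.
    baseline = evaluate(df, "score_old")
    candidate = evaluate(df, "score_new")
    return pd.DataFrame({"baseline": baseline, "candidate": candidate,
                         "delta": candidate - baseline})
```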
Training-Serving Skew
Rule #29: The best way to make sure that you train as you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.
Rule #30: Importance-weight sampled data, don’t arbitrarily drop it!
Rule #31: Beware that if you join data from a table at training and serving time, the data in the table may change.
Rule #32: Re-use code between your training pipeline and your serving pipeline whenever possible.
Rule #33: If you produce a model based on the data until January 5th, test the model on the data from January 6th and after.
Rule #34: In binary classification for filtering (such as spam detection or determining interesting emails), make small short-term sacrifices in performance for very clean data.
Rule #35: Beware of the inherent skew in ranking problems.
Rule #36: Avoid feedback loops with positional features.
Rule #37: Measure Training/Serving Skew.
Nine rules on a single topic, and most of them are incredibly wordy. I think this shows how big this problem is.
It's hard to add anything here. Rule #29 (log the feature data at serving time and train on it later) is a major piece of advice, and so is Rule #32 (reuse code). One of the main reasons we advocate Clean Architecture is that it enables extensive code reuse. In general, you are in charge of the project, and it is your job to measure and mitigate training/serving skew.
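A hedged sketch of the Rule #29 idea, logging the exact serving-time features as JSON lines; the model API and schema are made up, and `make_features` stands for whatever feature code training and serving share:

```python
# Rule #29 sketch: log the exact features the model saw at serving time and
# build training data from those logs later, instead of recomputing features
# from raw tables. The model API (`predict_one`) and schema are hypothetical;
# `make_features` is whatever feature code training and serving share.
import json
import time


def serve(model, raw_request: dict, feature_log) -> float:
    features = make_features(raw_request)   # same code path as training (Rule #32)
    score = model.predict_one(features)     # hypothetical single-example API
    feature_log.write(json.dumps({
        "ts": time.time(),
        "request_id": raw_request["id"],
        "features": features,               # exactly what the model saw
        "score": score,
    }) + "\n")
    return score

# Training later reads these logged feature rows back, joins them with the
# observed labels, and fits on them, so the training matrix matches serving.
```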
ML Phase III: Slowed Growth, Optimization Refinement, and Complex Models
Rule #38: Don’t waste time on new features if unaligned objectives have become the issue.
Rule #39: Launch decisions are a proxy for long-term product goals.
Rule #40: Keep ensembles simple.
Rule #41: When performance plateaus, look for qualitatively new sources of information to add rather than refining existing signals.
Rule #42: Don’t expect diversity, personalization, or relevance to be as correlated with popularity as you think they are.
Rule #43: Your friends tend to be the same across different products. Your interests tend not to be.
This is the business-as-usual phase of the ML product. Our methodology tries to get to this stage as soon as possible, so that most of the product work happens in a "real-life" environment and gets plenty of business attention, enabling better decision-making based on KPIs rather than statistical metrics.
Rule #40 is a good reminder that most productionised models are ensembles. Evaluating them is an extra hassle, but they allow you to solve more complex problems than a single standard model can.
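A minimal sketch of what "keep ensembles simple" can mean in code: base models see only raw features, the combiner sees only their scores. Everything here is a hypothetical stand-in, and in practice the combiner should be fitted on out-of-fold scores:

```python
# "Keep ensembles simple": base models see only raw features, the combiner
# sees only base-model scores. Everything here is a hypothetical stand-in; in
# practice, fit the combiner on out-of-fold scores to avoid leakage.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression


class SimpleEnsemble:
    def __init__(self):
        self.base_models = [LogisticRegression(max_iter=1000),
                            GradientBoostingClassifier(random_state=0)]
        self.combiner = LogisticRegression()  # only ever sees scores

    def _scores(self, X):
        return np.column_stack([m.predict_proba(X)[:, 1] for m in self.base_models])

    def fit(self, X, y):
        for m in self.base_models:
            m.fit(X, y)
        # Simplification: in-sample scores; use out-of-fold scores in practice.
        self.combiner.fit(self._scores(X), y)
        return self

    def predict_proba(self, X):
        return self.combiner.predict_proba(self._scores(X))
```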
Summary
Just like the previous three, this one was top class. After three "problem statement" articles, we are now in actionable-advice territory. This article was more business-focused, which I always favour. A couple of takeaways:
Get to production as soon as possible, so the business and design partners care about your product, provide feedback, and clarify your goals.
Heavy evaluation helps you understand your product better and reduce “Analysis Debt”.
Clean Architecture helps you iterate on your product and reuse the code in production, reducing "Prototyping Debt" and training/serving skew.
Delta testing, feature engineering and ensemble modelling, together with the above, give you a workflow that is extremely powerful against production-grade problems.
Log _everything_.
I hope you enjoyed this rather long summary of this excellent paper, and please subscribe to be notified about future parts of the series.