Article Review: The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction by Google

2022-03-16

Hypergolic (our ML consulting company) is developing its own ML maturity model and ML assessment framework. As part of that work, I am reviewing the literature, starting with five articles from Google.

Our focus with this work is helping companies at an early to moderately advanced stage of ML adoption, and my comments will reflect this. Please subscribe if you would like to be notified of the rest of the series.

The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction

This one is a natural continuation of the “Hidden Tech Debt” article I already reviewed. Once you have identified the problems, write a checklist to make sure they are mitigated. Classic Google…

Most of the article details the differences between testing SWE and ML systems, focusing in particular on data-related issues, as the following picture demonstrates:

Unfortunately, abstraction levels are mixed between engineering issues, modelling issues and production level monitoring issues. Separating these and addressing them with stock solutions is integral to structuring an ML project. I will make these distinctions in my comments. I will also point out what generic capability is required to make these checks.

I will add the following tags to the points:

[EVAL]: evaluation related question. If you have good data storage, data lineage and a flexible tool to run large scale evaluations on different datasets, you are good to go. Even if this is just an SQL DB and a dashboard.
[MODEL]: modelling, coding related question. Maintaining coding quality and refactoring will enable you to satisfy these points.
[TEST]: unit testing and testability related question. This requires quality code and a way to run the model in a test setup. It must be fast so that it is easy to run frequently; it doesn’t matter if the results are not statistically relevant.

II. TESTS FOR FEATURES AND DATA

1 Feature expectations are captured in a schema. [EVAL]

It is useful to encode intuitions about the data in a schema so they can be automatically checked. For example, an adult human is surely between one and ten feet in height.

Our experience is that, apart from trivial use cases like the ones mentioned in the article, it is very hard to come up with an exhaustive list of tests. Having these tests can also give a false sense of security. Rather than trying to accomplish the impossible, the focus should be on anomaly detection and matching distributions. I would describe this as a “top-down” approach, as opposed to individual “bottom-up” tests. A generic evaluation tool that reads from a data source with data lineage should be helpful here.
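To make the contrast concrete, here is a minimal sketch of both styles. The height range comes from the paper's example; the feature name `height_ft`, the `SCHEMA` layout and the choice of a two-sample Kolmogorov-Smirnov statistic as the distribution check are my assumptions for illustration, not something the paper prescribes.

```python
from bisect import bisect_right

# Hypothetical schema: feature name -> (min, max) plausible range.
SCHEMA = {"height_ft": (1.0, 10.0)}


def schema_violations(rows, schema=SCHEMA):
    """Bottom-up check: list (row_index, feature, value) for missing or out-of-range values."""
    bad = []
    for i, row in enumerate(rows):
        for name, (lo, hi) in schema.items():
            value = row.get(name)
            if value is None or not (lo <= value <= hi):
                bad.append((i, name, value))
    return bad


def ks_statistic(reference, current):
    """Top-down check: largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


rows = [{"height_ft": 5.8}, {"height_ft": 58.0}, {"height_ft": None}]
violations = schema_violations(rows)

train_heights = [5.0, 5.2, 5.5, 5.7, 5.9, 6.1]
similar_heights = [5.1, 5.3, 5.6, 5.8, 6.0, 6.2]
shifted_heights = [6.5, 6.7, 6.9, 7.1, 7.3, 7.5]  # every value still passes the schema

print(violations)
print(ks_statistic(train_heights, similar_heights))
print(ks_statistic(train_heights, shifted_heights))
```

Note that the shifted sample passes the schema check without a single violation, while the distribution comparison flags it immediately. That is the false sense of security in miniature.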

2 All features are beneficial. [MODEL/EVAL]

A kitchen-sink approach to features can be tempting, but every feature added has a software engineering cost.

Models should be as simple as possible but not simpler. These tests are critical and should be part of the modelling process.

3 No feature’s cost is too much. [MODEL/EVAL]

It is not only a waste of computing resources, but also an ongoing maintenance burden…

This is probably less of a problem nowadays. I would lump this together with the previous point. Also, periodically refactoring generated features should mitigate at least part of the problem.

4 Features adhere to meta-level requirements. [EVAL]

It might prohibit features derived from user data, prohibit the use of specific features like age, or simply prohibit any feature that is deprecated.

This is a hugely important factor that must be handled at all times. The recent increase in regulations often makes this a mandatory issue. This concerns not only the use of protected features but also their indirect correlations (e.g. given names or birthplace to ethnicity).
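One rough way to screen for such indirect correlations is to ask how much better you can guess the protected attribute once you know the candidate feature. The sketch below is an illustration under my own assumptions: `proxy_strength` is a hypothetical helper, the data is made up, and the score is computed in-sample, so it is optimistic for features with many distinct values.

```python
from collections import Counter, defaultdict


def proxy_strength(feature_values, protected_values):
    """Return (baseline, with_feature): the share of rows whose protected attribute
    is guessed correctly by majority vote, without and with the candidate feature."""
    n = len(protected_values)
    baseline = Counter(protected_values).most_common(1)[0][1] / n
    by_value = defaultdict(Counter)
    for f, p in zip(feature_values, protected_values):
        by_value[f][p] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return baseline, correct / n


# Made-up data: birthplace leaks the protected group, shoe size does not.
group = ["x", "x", "x", "y", "y", "y", "x", "y"]
birthplace = ["A", "A", "A", "B", "B", "B", "C", "C"]
shoe_size = [40, 40, 41, 40, 40, 41, 41, 41]

print(proxy_strength(birthplace, group))
print(proxy_strength(shoe_size, group))
```

A large jump from the baseline to the second number is a signal to review the feature against the policy; a policy threshold for what counts as "large" has to come from the business, not from the code.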

5 The data pipeline has appropriate privacy controls. [EVAL]

These should be lumped together with the previous point. One major headache when dealing with PII data is that the product team might not be cleared to run experiments on it. Even in an early project phase, the ML infrastructure should be able to connect remotely to protected data (without the involvement of DSes) and create analytics results that the DSes can evaluate. This is one of the reasons we advocate Clean Code/Clean Architecture principles from an early project stage: they enable this inexpensively.

6 New features can be added quickly. [MODEL]

For highly efficient teams, this can be as little as one to two months even for global-scale, high-traffic ML systems.

I just added this quote to demonstrate how far ML has come since this timeframe counted as “highly efficient”. Well, times change. Amp it up (©Frank Slootman)

High-quality coding practices will certainly help to do this. Then the already existing evaluation facilities will help you validate the impact of your changes.

7 All input feature code is tested. [TEST]

There are many ways to do this. Sometimes it is too cumbersome to write unit tests for model pipelines; in that case, use fast functional tests on 0.01% of the dataset. The goal is to have a testing facility that helps with refactoring and quick iterations. Its most important property is that it is _fast_ rather than complete. Once you have refactored the code to a good place, you can run a full-scale evaluation in the usual way.
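A fast functional test of this kind could look like the sketch below. Everything in it is hypothetical: `featurize` stands in for the real feature code, and hashing the record id is just one way to get a small sample that stays the same between runs. The default rate matches the 0.01% mentioned above; the demo uses 1% only because the toy dataset is tiny.

```python
import hashlib


def in_smoke_sample(record_id, rate=0.0001):
    """Deterministically keep roughly `rate` of the records, keyed on their id."""
    digest = hashlib.sha256(str(record_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def featurize(record):
    """Hypothetical stand-in for the real feature code under test."""
    text = record["text"]
    return {
        "id": record["id"],
        "length": len(text),
        "has_digit": any(ch.isdigit() for ch in text),
    }


def smoke_test(records, rate):
    """Run the feature code on the sample and check structure, not statistics."""
    sample = [r for r in records if in_smoke_sample(r["id"], rate)]
    features = [featurize(r) for r in sample]
    assert all(f["length"] >= 0 for f in features)
    assert all(isinstance(f["has_digit"], bool) for f in features)
    return len(sample)


records = [{"id": i, "text": f"item {i}"} for i in range(20000)]
sample_size = smoke_test(records, rate=0.01)
print(sample_size)
```

Because the sample is keyed on the id rather than drawn at random, a failure seen in one run reproduces in the next, which is what you want while refactoring.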

III. TESTS FOR MODEL DEVELOPMENT

1 Model specs are reviewed and submitted. [MODEL]

Every model specification undergoes a code review and is checked in to a repository: It can be tempting to avoid code review out of expediency, and run experiments based on one’s own personal modifications.

Yeah, it’s 2022; all code is version controlled and reviewed. No, it doesn’t make you slower.

2 Offline and online metrics correlate. [EVAL]

The previous article referred to this as “analysis debt”. I feel that DS teams, because of their engineering problems, often don’t spend enough time writing evaluations and finding out whether there is a problem somewhere with the data. These ongoing analyses are mandatory. You never really leave a model on its own; it is never ready; there is always something to improve. The goal is to make these evaluations easy. This includes comparing training data to evaluation data or recorded real data. As I mentioned in the previous article, ML systems have “Graceful Degradation”: you have time to fix them, but you do need to fix them because, unlike SWE systems, they degrade over time.

3 All hyperparameters have been tuned. [MODEL]

Often hyperparameters are set to the right ballpark numbers based on long-time experience. Good modelling practices should at least validate that the current parameters are in the right ballpark.
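A ballpark validation does not need a full tuning run. One possible sketch, with all names being my own assumptions: evaluate the hyperparameter at a few multiples of its current value and flag it only if some alternative is clearly better. `toy_validation_loss` is a stand-in for a real train-and-validate run.

```python
import math


def ballpark_check(evaluate, current, factors=(0.1, 0.5, 1.0, 2.0, 10.0), tolerance=0.01):
    """Evaluate a hyperparameter at multiples of its current value.

    Returns (ok, best_value); ok is True when no alternative beats the
    current value by more than `tolerance` in validation loss.
    """
    losses = {current * f: evaluate(current * f) for f in factors}
    best_value = min(losses, key=losses.get)
    ok = losses[current * 1.0] - losses[best_value] <= tolerance
    return ok, best_value


def toy_validation_loss(learning_rate):
    """Hypothetical stand-in for a real validation run; the optimum is near 0.1."""
    return (math.log10(learning_rate) + 1.0) ** 2


print(ballpark_check(toy_validation_loss, current=0.1))
print(ballpark_check(toy_validation_loss, current=0.001))
```

The multiplicative grid is a deliberate choice: for parameters like learning rates, being off by an order of magnitude matters far more than being off by a few percent.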

4 The impact of model staleness is known. [EVAL]

Many production ML systems encounter rapidly changing, non-stationary data. Examples include content recommendation systems and financial ML applications. For such systems, if the pipeline fails to train and deploy sufficiently up-to-date models, we say the model is stale.

Here the authors are thinking about the model quality staleness and not the training pipeline’s staleness which is a programming question (code rot).

Again, this comes down to how easy it is to evaluate models in a controlled environment. Data lineage and large-scale evaluation capabilities are a must.

5 A simpler model is not better. [EVAL]

This is usually not a real concern despite a generic aversion to using “complex” models. Nowadays, you can have a high-quality baseline performance in no time. The crux of productionised models is the attention to detail. How do you address the latest edge cases by adding a new model to an ensemble?

I added the [EVAL] label to this as the baseline model is trivially available most of the time. Hence, the effort goes into justifying a more complex model through a series of statistical tests.
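One such statistical test is a paired bootstrap over the evaluation set. The sketch below is my own illustration, not a method from the paper: it resamples the same examples for both models and reports how often the complex model comes out strictly more accurate. The function name and the synthetic predictions are hypothetical.

```python
import random


def paired_bootstrap_win_rate(y_true, pred_simple, pred_complex, n_resamples=2000, seed=0):
    """Share of bootstrap resamples in which the complex model is strictly more accurate."""
    rng = random.Random(seed)
    n = len(y_true)
    # Per-example difference in correctness: +1 if only the complex model is right,
    # -1 if only the simple model is right, 0 if they agree.
    diffs = [int(c == y) - int(s == y) for y, s, c in zip(y_true, pred_simple, pred_complex)]
    wins = 0
    for _ in range(n_resamples):
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        wins += total > 0
    return wins / n_resamples


# Synthetic binary task: the simple model errs on every 4th example (75% accuracy),
# the complex model on every 10th (90% accuracy).
y_true = [i % 2 for i in range(200)]
pred_simple = [1 - y if i % 4 == 0 else y for i, y in enumerate(y_true)]
pred_complex = [1 - y if i % 10 == 0 else y for i, y in enumerate(y_true)]

win_rate = paired_bootstrap_win_rate(y_true, pred_simple, pred_complex)
print(win_rate)
```

A win rate close to 1 supports the more complex model; a value near 0.5 or below says the extra complexity has not earned its place on this dataset.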

6 Model quality is sufficient on important data slices. [EVAL]

Slices should distinguish subsets of the data that might behave qualitatively differently, for example, users by country, users by frequency of use, or movies by genre. Examining sliced data avoids having fine-grained quality issues masked by a global summary metric.

This is a crucial production-grade point. Productionised models are not just launched based on a high-level generic metric like the F1 score. They are tested in detail to make sure that a recent change didn’t adversely affect a subgroup of data points.
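A sliced comparison between the current and the candidate model can be sketched in a few lines. The helper names, the `country` slice key and the data are all made up for illustration; accuracy is used instead of F1 only to keep the example short.

```python
from collections import defaultdict


def accuracy_by_slice(examples, predictions, slice_key):
    """Accuracy per value of `slice_key`; each example is a dict with a 'label' field."""
    hits, totals = defaultdict(int), defaultdict(int)
    for example, pred in zip(examples, predictions):
        s = example[slice_key]
        totals[s] += 1
        hits[s] += int(pred == example["label"])
    return {s: hits[s] / totals[s] for s in totals}


def regressed_slices(old, new, tolerance=0.02):
    """Slices whose accuracy dropped by more than `tolerance` between two models."""
    return sorted(s for s in old if s in new and old[s] - new[s] > tolerance)


examples = [{"country": "US", "label": 1}] * 8 + [{"country": "HU", "label": 1}] * 2
old_preds = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1]  # 8 of 10 correct overall
new_preds = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # also 8 of 10 correct overall

old = accuracy_by_slice(examples, old_preds, "country")
new = accuracy_by_slice(examples, new_preds, "country")
print(old, new, regressed_slices(old, new))
```

Both models score 80% globally, yet the new one is wrong on every example in the smaller slice. A global summary metric alone would have waved this change through.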

7 The model is tested for considerations of inclusion. [EVAL]

This is very similar to the PII question above, and most of the same suggestions apply. The company usually sets a fairness policy that needs to be turned into an evaluation policy to ensure the model can be signed off. This is not a nice-to-have feature. An ML product that doesn’t pass the PII/Fairness policy is not a viable product and should not be considered for deployment.

Of course, this requires a fairness policy which must be part of any corporate AI strategy.

IV. TESTS FOR ML INFRASTRUCTURE

1 Training is reproducible. [MODEL/EVAL]

As I wrote above, everything should be version controlled and reviewed in 2022, which is a huge step toward reproducibility. In recent years there has been a lot of effort to eliminate the problems caused by RNG seeds, but this should not be a significant problem.

Still, some solutions are unstable and result in widely different outputs. This is crucial in evaluation as fragile (non-robust) solutions have little commercial value. The reputation risk will be higher than any benefit.

Reproducibility is not an all-or-nothing issue; it should be achieved on a best-effort basis. Don’t go over the top: chasing elusive errors can be a huge time sink with little added value. ML models are only statistically correct anyway.

2 Model specs are unit tested. [TEST]

We have found in practice that a simple unit test to generate random input data, and train the model for a single step of gradient descent is quite powerful for detecting a host of common library mistakes, resulting in a much faster development cycle.

The authors recommend the same functional test we do: run the same pipeline with a fraction of the data.
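The test the authors describe, random input plus a single step of gradient descent, can be sketched without any ML library. The linear model below is a hypothetical stand-in for a real model spec; in practice you would call your own framework's training step instead of `mse_and_grads`.

```python
import math
import random


def mse_and_grads(w, b, xs, ys):
    """Mean squared error of the linear model w.x + b, with gradients for w and b."""
    n = len(xs)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    loss = 0.0
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
        loss += err * err / n
        for j, xj in enumerate(x):
            grad_w[j] += 2 * err * xj / n
        grad_b += 2 * err / n
    return loss, grad_w, grad_b


def single_step_smoke_test(n_features=3, n_rows=32, lr=0.01, seed=0):
    """Train for one gradient step on random data and check basic sanity."""
    rng = random.Random(seed)
    xs = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_rows)]
    ys = [rng.gauss(0, 1) for _ in range(n_rows)]
    w, b = [0.0] * n_features, 0.0

    loss_before, grad_w, grad_b = mse_and_grads(w, b, xs, ys)
    w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    b -= lr * grad_b
    loss_after, _, _ = mse_and_grads(w, b, xs, ys)

    assert all(math.isfinite(g) for g in grad_w + [grad_b])
    assert math.isfinite(loss_after)
    assert loss_after < loss_before  # one small step should not make things worse
    return loss_before, loss_after


print(single_step_smoke_test())
```

The test says nothing about model quality. It only catches shape mismatches, exploding values and wiring mistakes, which is exactly why it can run in milliseconds.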

3 The ML pipeline is integration tested. [TEST]

This is pretty much the same as the previous point. It is essential to mention that the outputs of these tests are not used for statistical purposes. These two tests exist only to speed up refactoring and model creation. They need to be quick to catch relatively trivial bugs, while evaluation tests must be thorough to catch statistical problems.

4 Model quality is validated before serving. [EVAL]

Again this is a three-way reconciliation between training, evaluation and real data. Reducing analysis debt is an important goal. If there are considerable risks in deploying, one must have A/B/Canary testing facilities.

5 The model is debuggable. [MODEL]

This is a question of the programmer’s taste. ML models are notoriously difficult to debug. Clean Architecture can help you isolate the model in a test setup and run it line by line if needed, as this is a question of testability.

6 Models are canaried before serving. [EVAL]

Same as point 4: this needs A/B/Canary testing facilities.

7 Serving models can be rolled back. [MODEL]

Interestingly, they promoted this to its own point. Model serving solutions should solve this problem (and also A/B/Canary testing), but other dependent and changeable components like feature engineering are a different issue. Even if all code is version controlled, this can be a real pain. One must have a deployment policy that can handle this and do enough evaluation that a rollback is rarely needed.

V. MONITORING TESTS FOR ML

1 (Data - L) dependency changes result in notification. [EVAL]

ML systems typically consume data from a wide array of other systems to generate useful features. Partial outages, version upgrades, and other changes in the source system can radically change the feature’s meaning and thus confuse the model’s training or inference, without necessarily producing values that are strange enough to trigger other monitoring.

This is the bread and butter of keeping a model in production. A tremendous amount of effort goes into verifying that the (physical and statistical) assumptions about the data still hold and haven’t changed, and, if they have, into reacting quickly: re-running the evaluations, re-assessing performance and re-deploying the model.

2 Data invariants hold for inputs. [EVAL]

In practice, careful tuning of alerting thresholds is needed to achieve a useful balance between false positive and false negative rates to ensure these alerts remain useful and actionable.

Setting thresholds for alerts is an _incredibly_ difficult task. Make them too sensitive, and you will be constantly harassed by false alarms until you start tuning them out. Make them too lax, and you have a false sense of security.
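If you have a history of monitoring scores and know which days had real incidents, the trade-off can at least be made visible. The sketch below assumes a convention where an alert fires when a drift score is at or above the threshold; the scores and incident labels are invented for illustration.

```python
def alert_tradeoff(scores, is_incident, thresholds):
    """For each threshold, count false alarms and missed incidents
    when an alert fires on score >= threshold."""
    rows = []
    for t in thresholds:
        false_alarms = sum(s >= t and not inc for s, inc in zip(scores, is_incident))
        missed = sum(s < t and inc for s, inc in zip(scores, is_incident))
        rows.append((t, false_alarms, missed))
    return rows


# Hypothetical drift scores for ten past days, with hand-labelled real incidents.
scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.45, 0.50, 0.70, 0.90]
is_incident = [False, False, False, False, True, False, False, True, True, True]

table = alert_tradeoff(scores, is_incident, thresholds=[0.1, 0.4, 0.8])
for threshold, false_alarms, missed in table:
    print(threshold, false_alarms, missed)
```

Under this convention the most sensitive setting produces five false alarms and misses nothing, while the most relaxed one is silent on three of the four incidents. No threshold on this data gets both counts to zero, which is the usual situation.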

Our recommendation is more evaluation, but I don’t think this comes as a surprise. Catching drifts in progress is a difficult task anyway. But again, you don’t have to because ML systems (usually) have “Graceful Degradation”.

3 Training and serving are not skewed. [EVAL]

Same advice: evaluate more. One doesn’t just look at a precision-recall curve and call it a day but constantly looks for ways the model could break because of a change in data distribution. This is also the primary source of ideas for new model features and insights into how the problem and the model interact, so it naturally has a high ROI.

4 Models are not too stale. [MODEL/TEST/EVAL]

This is similar to IV.3 (integration testing). High-quality coding practices are usually helpful to avoid code rot, and frequent testing and evaluation make sure the model is not stale.

Temporal analysis of results (train on the distant past and evaluate on the recent past) can give an idea about the required retraining schedule. Given that most ML models are never done, and the business requires new changes all the time, this is less of a problem in a product-minded DS team.
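The temporal analysis can be sketched as follows: freeze the training window, then measure the error on each later month separately. The mean predictor is a deliberately trivial, hypothetical stand-in for real training, and the drifting toy data is made up.

```python
from statistics import mean


def fit_mean_model(rows):
    """Hypothetical stand-in for training: always predict the mean training target."""
    level = mean(r["y"] for r in rows)
    return lambda row: level


def error_by_model_age(rows, train_until, horizon):
    """Train on months <= train_until, then report the mean absolute error
    for each of the following `horizon` months, keyed by model age in months."""
    model = fit_mean_model([r for r in rows if r["month"] <= train_until])
    report = {}
    for age in range(1, horizon + 1):
        test = [r for r in rows if r["month"] == train_until + age]
        report[age] = mean(abs(model(r) - r["y"]) for r in test)
    return report


# Toy non-stationary data: the target level drifts upward by 0.5 per month.
rows = [{"month": m, "y": 10 + 0.5 * m + d} for m in range(6) for d in (-1.0, 0.0, 1.0)]

report = error_by_model_age(rows, train_until=2, horizon=3)
print(report)
```

Reading the report against the error level the business can tolerate gives a first estimate of how often retraining is needed.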

5 Models are numerically stable. [EVAL]

Invalid or implausible numeric values can potentially crop up during model training without triggering explicit errors, and knowing that they have occurred can speed diagnosis of the problem.

This is a job for anomaly detection. Finding all potential silent failure modes is a difficult task, but they usually show up as low-likelihood events.
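The cheapest version of this check looks for values that are NaN, infinite or implausibly large after each training step. The helper name, the statistic names and the `max_abs` limit below are assumptions for illustration; a real setup would pick limits per quantity.

```python
import math


def numeric_anomalies(named_values, max_abs=1e6):
    """Return the names of values that are NaN, infinite, or implausibly large."""
    bad = []
    for name, value in named_values.items():
        if math.isnan(value) or math.isinf(value) or abs(value) > max_abs:
            bad.append(name)
    return sorted(bad)


# Hypothetical statistics logged after one training step.
step_stats = {
    "loss": 0.42,
    "grad_norm": float("inf"),
    "weight_0": float("nan"),
    "weight_1": 3.0e9,
}

print(numeric_anomalies(step_stats))
```

Logging the offending names rather than just raising an error is what speeds up diagnosis: you know which quantity went wrong and at which step.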

6 Computing performance has not regressed. [???]

I would think this is a DevOps issue rather than ML. DevOps teams have better tools to monitor this and give recommendations to the ML team.

7 Prediction quality has not regressed. [EVAL]

Validation data will always be older than real serving input data, so measuring a model’s quality on that validation data before pushing it to serving is only an estimate of quality metrics on actual live serving inputs. However, it is not always possible to know the correct labels even shortly after serving time, making quality measurement difficult.

This is a common problem, and checking for it should be a standard part of a DS’s job. As always, our recommendation is: evaluate more.

Summary

Another excellent one!

As you can see from the tags, most points are related to writing more evaluation tests. The previous article referred to this as “analysis debt”. Indeed, we have found in our practice that DSes do not run enough high-quality evaluations to inform further decisions. This is usually because the evaluation setup is ad hoc and requires a large amount of effort and coding, which the previous article referred to as “prototype debt”. Our suggestion is to maintain code quality from Day 1. This will also help resolve the [MODEL] and [TEST] related points.

I hope you enjoyed this rather long summary of this excellent paper. Please subscribe to be notified about future parts of the series.
