5 Comments
Phillip Henry:

It's probably horses for courses but I don't like tests that fail just because I've *changed* something. I prefer them to fail when I *break* something. This leads to less cognitive overhead and is much more useful (I know when I changed something because I just executed `git push`).

Also, pickling objects and comparing them can make the test suite very fragile. For example, a schema change will break the tests even if nothing is actually wrong.

I've found a better strategy is to make vague but accurate assertions. For example, I don't know exactly how many p-values are going to be less than 0.01 for a linear regression model on randomly generated, synthetic data, but I expect it to be most of them (depending on how much bias I've added to the test data).
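
As an illustration, here is a minimal sketch of that kind of assertion (assuming numpy and statsmodels; the data, coefficients and thresholds are made up for the example):

```python
# Sketch of a "vague but accurate" assertion: the exact p-values are unknown,
# but on strongly biased synthetic data most coefficients should be significant.
import numpy as np
import statsmodels.api as sm

def test_most_pvalues_are_significant():
    rng = np.random.default_rng(0)
    n, k = 1_000, 10
    X = rng.normal(size=(n, k))
    true_coefs = np.full(k, 2.0)              # deliberately strong signal
    y = X @ true_coefs + rng.normal(size=n)   # noise is small relative to the signal
    model = sm.OLS(y, sm.add_constant(X)).fit()
    pvalues = model.pvalues[1:]               # skip the intercept
    # Vague but accurate: we don't know the exact count, only that most
    # of the k features should come out significant.
    assert (pvalues < 0.01).mean() > 0.8
```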

Interesting stuff. ML has a lot to learn from SWE and this fascinates me.

Laszlo Sragner:

Good comments.

I think you touched on a very important point here (from two directions).

We found that classic unit tests are inconvenient for ML because there is no easy way to get truth values. So comparing current values to the previous run's values is a cheap hack. If your code is deterministic and you didn't make a behaviour change, these should match. If they don't match and you didn't intend a change, that's a problem.

This part only applies to refactoring (moving parts of the code around), not to, say, changing features (which changes the behaviour of your code).
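
A minimal sketch of the compare-to-previous-run idea (assuming pandas and pytest; `build_features` and `load_raw_data` are placeholders for the real pipeline steps):

```python
# Sketch of a compare-to-previous-run test: the first run records a reference
# pickle, later runs must reproduce it exactly.
from pathlib import Path
import pandas as pd

REFERENCE = Path("tests/reference/features.pkl")

def test_features_match_previous_run():
    # build_features / load_raw_data stand in for the real, deterministic pipeline
    df = build_features(load_raw_data())
    if not REFERENCE.exists():
        REFERENCE.parent.mkdir(parents=True, exist_ok=True)
        df.to_pickle(REFERENCE)               # first run: record current output as "truth"
        return
    expected = pd.read_pickle(REFERENCE)
    # Refactoring must not change the output at all; any difference fails.
    pd.testing.assert_frame_equal(df, expected)
```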

So what happens if you _do_ want to change the behaviour? At that point you need to verify that only the right tests are failing and mark the current values as "truth" (by deleting the cached version). In reality this is a much more involved process, as you quickly run out of low-hanging fruit. We used to have a cascade of 0.1%, 1%, 10% and 100% tests and do a detailed comparison to see the reason for any failures. Unfortunately this is often the only mechanical way to find cause and effect in pipelines. Still much better than ad hoc.
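
The cascade can be sketched as a parametrised test over sample fractions (a rough illustration; `load_raw_data` and `run_pipeline` are again placeholders):

```python
# Sketch of the cascade idea: run the same comparison on growing data
# fractions so a failure is localised cheaply before the full run.
from pathlib import Path
import pandas as pd
import pytest

@pytest.mark.parametrize("fraction", [0.001, 0.01, 0.1, 1.0])
def test_pipeline_cascade(fraction):
    raw = load_raw_data()                                  # placeholder for real data loading
    sample = raw.sample(frac=fraction, random_state=42)    # deterministic subsample
    result = run_pipeline(sample)                          # placeholder for the real pipeline
    ref_path = Path(f"tests/reference/pipeline_{fraction}.pkl")
    if not ref_path.exists():
        ref_path.parent.mkdir(parents=True, exist_ok=True)
        result.to_pickle(ref_path)                         # record reference for this fraction
        return
    pd.testing.assert_frame_equal(result, pd.read_pickle(ref_path))
```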

>For example, a schema change will break the tests even if nothing is actually wrong.

I would absolutely want my code to break if a schema changes. Data inside the ML product should be as minimal as possible, so if there is a slack column I want the code to be refactored to remove that column.
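
One cheap way to enforce that, as a sketch (the column list is only an example, and `build_features`/`load_raw_data` are placeholders):

```python
# Make the schema explicit so a slack column breaks the build.
EXPECTED_COLUMNS = ["age", "income", "label"]   # illustrative expected schema

def test_no_slack_columns():
    df = build_features(load_raw_data())        # placeholder pipeline step
    # Exact match: an extra (slack) column is as much a failure as a missing one.
    assert list(df.columns) == EXPECTED_COLUMNS
```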

Yes, you can make assertions on end results; for example, one of the elements we used to pickle was the confusion matrix.

But again this is pretty flexible and you figure out what makes you confident that your code still does what you think it should do. My goal is to provide terminology, methods and tools to achieve this.

Phillip Henry:

> I would absolutely want my code to break if a schema changes.

I think there might be some confusion here. If the schema changes and I refactor the prod code perfectly, I don't want the test to fail just because the test is using some out-of-date, pickled object for comparison. I want the test to fail because the *prod* code is wrong, not the *test* code. False positives can undermine the trust we have in the test suite.

> We found that classic unit tests are inconvenient for ML

We found them enormously helpful even with non-deterministic inputs (a common paradigm in Property Based Testing, which you find in frameworks like ScalaCheck, used by Spark in its builds).

The idea is that you assert various invariants irrespective of the inputs. For example, "given synthetic, biased data, does my model score at least x% above random?" This helped us catch at least one bug, in mapping the age of patients to {0, 1} based on a threshold: we mistakenly applied the mapping twice, resulting in this feature being 0 for everyone. Consequently, although the model was better than chance (there were many other features), it did not meet the x%-above-random threshold, and we caught the bug before merging the code into master.
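
A rough Python analogue of that invariant (scikit-learn rather than ScalaCheck; the synthetic data and the 10-point margin are illustrative):

```python
# Invariant test: on synthetic data with a built-in bias, the model must beat
# random guessing by a clear margin, whatever the exact random draw.
import numpy as np
from sklearn.linear_model import LogisticRegression

def test_model_beats_random_on_biased_synthetic_data():
    rng = np.random.default_rng(1)
    n = 2_000
    age = rng.integers(18, 90, size=n)
    noise = rng.normal(size=n)
    # Bias: older patients are far more likely to be positive.
    y = ((age > 50).astype(float) + 0.3 * noise > 0.5).astype(int)

    over_50 = (age > 50).astype(int)                      # the thresholded age feature
    X = np.column_stack([over_50, rng.normal(size=n)])    # plus a pure-noise feature

    model = LogisticRegression().fit(X, y)
    accuracy = model.score(X, y)
    baseline = max(y.mean(), 1 - y.mean())                # always-guess-the-majority accuracy

    # Invariant: at least 10 percentage points above the majority baseline.
    # A bug like applying the age threshold twice (feature becomes all zeros)
    # drops accuracy to the baseline and fails this assertion.
    assert accuracy > baseline + 0.10
```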

Note that (a) we did not use any difficult-to-maintain pickled data and (b) we can change the code (e.g. adding new features) without this test failing for no good reason.

Laszlo Sragner:

Maybe we mean different things by "schema".

If you pickle a df and refactor the code, the df in the new run must match the previous one. If you make a change that changes that df, then the test should break. It's not a false positive, as you intentionally changed how the code behaves.

Property Based Testing is really interesting; I haven't thought about how to apply it to ML. I understand that in some cases, like the one you mentioned, unit tests can help, but in practice, apart from trivial cases, we always found the elusive bugs through large-scale testing. And the trivial cases were found by breaking the above pipeline (either by failing an assertion or throwing an exception). Statistical changes in our models were, after a point, so convoluted that they needed to be verified by manual analysis (load data into notebooks and compare the two runs' outputs as queries). Given that analysis risk is the largest problem in ML, we considered that time well spent (DSes needed time to figure out what was going on anyway, and they were doing this in parallel with comparing the outputs).

The above framework was there to maintain code quality, not to change the behaviour materially.

Matteo Latinov:

If I may, I think the confusion here stems from the specifics of the testing workflow when behavior changes:

1. Change behavior (e.g. adding features, which is NOT refactoring code).

2. Run the code and expect the right assertion errors to occur, i.e. in the correct location in the codebase. This is like an extra sanity check.

3. If 2 is OK, BE SURE TO delete the current pickle file and rerun the code to replace it with a new pickle file that becomes the new reference output (i.e. with the new features in the data). One way to automate this step is sketched below.
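
A sketch of automating step 3 with pytest (the `--update-reference` flag, paths and fixture name are only an example):

```python
# conftest.py -- opt-in flag so the reference pickle is only replaced on purpose.
from pathlib import Path
import pandas as pd
import pytest

def pytest_addoption(parser):
    parser.addoption("--update-reference", action="store_true",
                     help="overwrite cached reference outputs with the current run")

@pytest.fixture
def reference_frame(request):
    path = Path("tests/reference/features.pkl")

    def check(df: pd.DataFrame):
        if request.config.getoption("--update-reference") or not path.exists():
            path.parent.mkdir(parents=True, exist_ok=True)
            df.to_pickle(path)        # bless the current output as the new truth
        else:
            pd.testing.assert_frame_equal(df, pd.read_pickle(path))

    return check
```

A test that takes the `reference_frame` fixture simply calls it on its output DataFrame; running `pytest --update-reference` then replaces the manual delete-and-rerun in step 3.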
