Can you version control Jupyter notebooks?

2022-09-16


Not a day goes by without a LinkedIn post claiming that "version controlling notebooks is solved" or pitching a "solution" for version controlling them.

These all seem to ignore the killer feature of notebooks and the killer feature of version control. The two paradigms are fundamentally incompatible, so every attempt to reconcile them ends up broken.

But Data Scientists love their notebooks despite the technical debt they cause, and like all good customers, they want a magic solution to resolve it rather than changing their habits. Does a salesman tell a customer that a magic solution doesn't exist? No. A good salesman gives them _a_ solution and tells them it is a magic solution. Hence the proliferation of all these products.

So why are the two paradigms incompatible?

Notebooks' killer feature is the REPL, the "read-eval-print loop": you can see the result of your code interactively and edit it freely. This rightly makes notebooks the No. 1 tool for experimentation (we use them a lot). You don't need to think too much about what you are doing, yet you can be extremely productive. Similar to browsing the internet, you can focus on content, not form.

On the other hand, version control is about deliberate change. Every time you change your code, you create a diff and compare each difference side by side to make sure you changed only the code you meant to. Then someone else checks it as well before the main branch is updated. This keeps everyone's mental model of the codebase intact. The killer feature is that you can move from one coherent state to the next (and back again if needed). It turns code changes from Jeff Bezos's Type 1 (irreversible) decisions into Type 2 (reversible) ones.

These are unresolvable differences. If you prioritise deliberateness, you ruin the benefits of the REPL feature. If you prioritise moving fast, your pull requests will be incomprehensible.

Two usage patterns

There are two patterns for combining them. Neither is satisfying:

Regularly restart the kernel and "Rerun All" cells, and clear all cell outputs before merging: This is the "pretend notebooks are .py files" method, which ruins the benefits of notebooks. At the same time, you still need to deal with accidental cell changes.
Use version control as a backup, a dumping ground: Merge notebooks without any thought. The state of each notebook is a mystery. Code quality is not checked. This removes the benefit of version control (turning Type 1 decisions into Type 2 ones), but you still have your files.
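For illustration, the "clear all cell outputs before merging" step of the first pattern can be sketched in a few lines. This is a minimal sketch, not a recommended tool: it assumes the notebook is stored as nbformat v4 JSON (a top-level "cells" list where code cells carry "outputs" and "execution_count"), and the function name strip_outputs and the in-memory example notebook are hypothetical.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat v4 notebook with code-cell outputs cleared."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via a JSON round trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results
            cell["execution_count"] = None  # drop the run counter
    return cleaned


# Hypothetical in-memory notebook; a real one would be read from an .ipynb file.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Analysis"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 7,
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 7,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        },
    ],
}

cleaned = strip_outputs(example)
print(json.dumps(cleaned["cells"][1], indent=1))
```

Note what this does not do: it removes outputs and execution counters, but an accidental edit to a cell's source still lands in the diff, which is exactly the leftover problem of this pattern.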

What's the solution?

In our practice, we separate code and analysis. Code is written outside of notebooks, then imported into notebooks. This imported code is set up (according to Clean Architecture principles) and then run there for testing and experimentation (later, this can be moved to shell scripts). All analysis and visualisation happen in notebooks. Code is version controlled from Day 0. Notebooks are only stored for safekeeping.

This setup lets us exploit the benefits of both tools, and it pays off even in the EDA/POC phase.
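To make the split concrete, here is a minimal sketch of what it can look like. Everything in it is hypothetical: the module name features.py, the rolling_mean function and the Experiment class are made up for illustration, and passing the data source in as a parameter is only one simple reading of "set up according to Clean Architecture principles". The top half would live in a version-controlled .py file; the bottom half is what a notebook cell would contain.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence

# --- Version-controlled code, e.g. in a hypothetical features.py ---


def rolling_mean(values: Sequence[float], window: int) -> list[float]:
    """Mean of every full run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


@dataclass
class Experiment:
    """Core logic; the data source is injected so the notebook can swap it."""

    load: Callable[[], Sequence[float]]
    window: int = 3

    def run(self) -> list[float]:
        return rolling_mean(self.load(), self.window)


# --- Notebook cell: set up, run, inspect ---
# In a real notebook this would start with: from features import Experiment
experiment = Experiment(load=lambda: [1.0, 2.0, 3.0, 4.0, 5.0], window=2)
result = experiment.run()
print(result)  # [1.5, 2.5, 3.5, 4.5]
```

The notebook only wires things together and looks at results, so losing it or leaving it in a messy state costs little, while every change to rolling_mean goes through a normal diff and review.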

Summary

Notebooks and version control have conflicting killer features:

Notebooks: REPL, fast iteration, experimentation
Version Control: Deliberate changes. Turn Type 1 (irreversible) decisions into Type 2 (reversible) ones.

The solution is to separate the code that matters from analysis code: version control the former and experiment with the latter. And use Clean Architecture: [talk/slides on the topic].

Deliberate Machine Learning
