Data Science code quality hierarchy of needs
Join our Discord community, “Code Quality for Data Science (CQ4DS)”, to learn more about the topic: https://discord.gg/8uUZNMCad2. All DSes are welcome regardless of skill level!
After several lengthy conversations about code quality/production code for data scientists, I realised that most people have a "completionist" view of code quality. They treat the topic as an all-or-nothing exercise where achieving "production code" is a checkboxing exercise (typing? check. docstrings? check. unit tests? check).
This makes investing in the DS codebase an expensive exercise that requires higher-level "business value" arguments. As a checkboxing exercise, the benefits are unclear and the ROI is hard to assess. Let's be honest: most of the advice indeed does not sound useful or beneficial at all.
Code quality: a bridge too far
It is irrelevant whether a better codebase would make the team faster in experimentation or more agile in production. Unless you can express code quality improvements as a series of single steps, each bringing immediate benefit, you will never reach the codebase you need.
Code quality as instant added (business) value
This requires reframing what code quality for data scientists (CQ4DS) means. At any given point in the ML lifecycle, the relevant advice should look immediately beneficial to the DS and bring immediate value. A code quality argument has too many distinct points for all of them to be applicable at the same time; the ML lifecycle is too heterogeneous to apply each of them at every stage.
When does this apply: Analysis vs Creation
ML Data Scientists operate in two primary mindsets: Analysis and Creation.

Analysis means:
- improving their understanding of the context (business, environment, data, product).
- updating their mental model of the world (queries, stats, diagrams).
- short-lived and dynamic work; the emphasis is on speed.

Creation means:
- actively influencing the world by building ML products (code, data transformations, datasets, models). Of course, they then switch back to Analysis mode to understand whether they are influencing it in the right way.
- long-term, iterated work, emphasising agility (the ability to respond to change).
Most code quality advice is for the Creation phase. Code in the Analysis phase simply doesn't live long enough: any investment would need an implausibly high yield to cover its costs. If an analysis is long-lived, it is usually handed over to BI teams, who turn it into dashboards.
On the other hand, when you are building an ML product (even in the EDA/POC phase), you are planning to work on the same codebase for an extended period. I think any code that lives longer than a month is a good candidate. This allows a series of small investments to compound and add up to an outsized ROI.
Trigger warning for Software Engineers
I will recommend an order that is radically different from SWE best practices. This is based on breaking down SWE practices into first principles and finding which ones apply to ML (or, more likely, to each step of the ML lifecycle). The primary difference between SWE and ML is that SWE is highly specified and instant (if something is wrong and breaks the specs, you know immediately), while ML is only statistically correct; you can only validate your work over a longer term. This makes some SWE practices less relevant, because they exist to validate an aspect that is not validatable in ML.
Data Science code quality hierarchy of needs
This list is in an order that aligns with the ML lifecycle: the earlier you start doing these, the more benefit you will get. The benefit is net positive in the actual step of the cycle, not just over the course of the entire project or only if your project is successful. This turns the "business value vs code quality" argument on its head: the early steps provide _more_ business value than skipping them would, so applying them is a net positive regardless of the outcome of the entire project.
You will experiment faster and more confidently, you can take larger risks safely, and you will therefore reach a conclusive answer sooner. You will be quicker to productionise in case of success and quicker to react to changes that appear during deployment.
The top of the list has higher importance, but with practice the later elements become second nature to the practitioner, so you no longer need to weigh whether to use them on a new project. Using all of them will simply be how you do things from then on.
The Code Review step allows the team to improve cohesion (even if you don't yet have a CR policy, the team can still read the repository), which brings benefits not only at the individual level but at the team level, driving more value for the organisation.
A simple git workflow is a safety mechanism that is cheap to practise and brings huge benefits. It turns overwriting your codebase from one of Bezos's Type 1 decisions (irreversible: if you mess up, you can't recover) into a Type 2 decision (reversible: if you mess something up, you can backtrack).
If you are worried about changing your code because it might break, or you have ever needed to rewrite something from scratch because you got lost in a long list of changes, this will be a massive help for you.
We are not talking about code review (yet) or CI/CD. Just a safety mechanism to protect yourself from accidents.
Blogpost: Simple git process
There is no agility without refactoring, and there is no refactoring without testing. Software engineers have unit testing, but does it work in ML? Unit tests are implementations of the specification, but what happens if those specs are just statistical metrics? Is there something simpler? Yes, there is.
If you know that your code produces the right results, save them, change your code, rerun it and compare the output to the saved version. Boom: functional testing. If nothing changed, you probably didn't break anything. Too slow? Save and run it only on a sample (0.1%–1%), compare only that, and run more extensive tests (10%–100% of the dataset) only occasionally.
Blogpost: How can a Data Scientist refactor Jupyter notebooks toward production-quality code?
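As a sketch of this save-and-compare idea (the file name, the toy `pipeline` and the sampling scheme are all hypothetical; adapt them to your own project):

```python
import hashlib
import json
from pathlib import Path

REFERENCE = Path("reference_predictions.json")

def pipeline(rows):
    # Stand-in for your real scoring pipeline (hypothetical logic).
    return [round(r["x"] * 0.5 + 1.0, 6) for r in rows]

def sample(rows, fraction=0.05):
    # Deterministic sampling: hash each row id so the same subset is
    # selected on every run, on every machine.
    def keep(row):
        digest = hashlib.md5(str(row["id"]).encode()).hexdigest()
        return int(digest, 16) % 10_000 < fraction * 10_000
    return [r for r in rows if keep(r)]

def save_reference(rows):
    # Run once, while the code is known to produce the right results.
    REFERENCE.write_text(json.dumps(pipeline(sample(rows))))

def check_against_reference(rows):
    # Run after every change: same input sample must give same output.
    expected = json.loads(REFERENCE.read_text())
    actual = pipeline(sample(rows))
    assert actual == expected, "pipeline output changed - investigate before moving on"
```

Raising the `fraction` occasionally gives you the more extensive test runs mentioned above without any extra machinery.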
Instead of imagining what you would like to do, preparing for it over the long term, and worrying whether it will work, break, or add no value: just do it. You have tests, and you have version control.
Significant changes will become routine, and you will start thinking by typing: you leave an immediate mark of your thoughts that you can later revert, throw away or improve upon.
Refactoring is changing your code without changing its behaviour. Write first, then structure it to enable new changes. All this is in small continuous steps that don't require massive preparations.
Blogpost: Same as above
Change needs optionality, things you can choose from, and decoupling makes these changes cheap. Do you want to try a new algorithm? Decouple the old one behind an interface, wrap the new one with the same interface and swap them. It didn't work out? No worries, just swap the old one back.
Your data comes from a pickle, but the new dataset is in JSON? Wrap the source with an adapter and plug that into your data loading. Are you moving into production? Wrap the database with another adapter and plug it in when your code goes to production.
Blogpost: You only need 2 Design Patterns to improve the quality of your code in a data science project
(this was later revised to 3): Slides for my talk at PyData London 2022: "Clean Architecture: How to structure your ML projects to reduce technical debt."
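One possible sketch of the pickle/JSON adapter idea in Python (the `DataSource` protocol and both adapter classes are illustrative names, not from the blogpost):

```python
import json
import pickle
from pathlib import Path
from typing import Protocol

class DataSource(Protocol):
    # The only interface the rest of the code is allowed to depend on.
    def load(self) -> list[dict]: ...

class PickleSource:
    # Adapter for the legacy pickle file.
    def __init__(self, path: Path) -> None:
        self.path = path

    def load(self) -> list[dict]:
        return pickle.loads(self.path.read_bytes())

class JsonSource:
    # Adapter for the new JSON dataset - same interface, swap at will.
    def __init__(self, path: Path) -> None:
        self.path = path

    def load(self) -> list[dict]:
        return json.loads(self.path.read_text())

def run_experiment(source: DataSource) -> int:
    # The experiment never knows (or cares) where the data came from.
    rows = source.load()
    return len(rows)
```

Swapping in a database-backed adapter for production is then a one-line change at the call site, not a rewrite of the experiment.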
Compose instead of declare
Complex behaviour is composed. If your ML product does multiple things (and it most certainly does: data cleaning, training, evaluation, modelling, features, etc.), you don't want to have the entire process in a single script. You want optionality to try new setups and new datasets. A good setup allows you to use the same code in different contexts.
Blogpost: Clean Architecture in Data Science + the slides from above
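A minimal illustration of composing behaviour from small steps instead of one monolithic script (the step functions here are hypothetical stand-ins for real cleaning/feature code):

```python
from typing import Callable

# A pipeline step: takes a list of values, returns a transformed list.
Step = Callable[[list[float]], list[float]]

def drop_negatives(xs: list[float]) -> list[float]:
    # Hypothetical cleaning step.
    return [x for x in xs if x >= 0]

def normalise(xs: list[float]) -> list[float]:
    # Hypothetical feature step: scale to the [0, 1] range.
    top = max(xs)
    return [x / top for x in xs]

def compose(*steps: Step) -> Step:
    # Chain any number of steps into a single callable pipeline.
    def pipeline(xs: list[float]) -> list[float]:
        for step in steps:
            xs = step(xs)
        return xs
    return pipeline

# The same steps can be recombined for a new setup or a new dataset.
cleaning_pipeline = compose(drop_negatives, normalise)
```

Because each step is an independent function, trying a new setup is just a different `compose(...)` call rather than editing a monolithic script.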
Not declaring your entities as classes leads to several "Code Smells": antipatterns that don't break your code but make your life harder. Primitive Obsession, Data Clumps, Long Parameter List and Shotgun Surgery are all signs that your data model is weakly defined, and better structuring will lead to higher agility. Use dataclasses or pydantic.
Blogpost: What is a Code Smell, and what can you do about it as a Data Scientist?
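A small before/after sketch of what declaring your entities as classes can look like (the `Customer` entity and the scoring logic are invented for illustration):

```python
from dataclasses import dataclass

# Before: Primitive Obsession - id, age and spend travel as loose
# arguments through every function (a classic Data Clump).
def score_loose(customer_id: str, age: int, spend: float) -> float:
    return spend / max(age, 18)

# After: one well-defined entity. Adding a field touches one place
# instead of every call site (no more Shotgun Surgery).
@dataclass(frozen=True)
class Customer:
    customer_id: str
    age: int
    spend: float

def score(customer: Customer) -> float:
    return customer.spend / max(customer.age, 18)
```

The same shape works with pydantic models if you also want validation at the boundaries.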
Code is read ten times more often than it is written. You are helping your future self by investing a bit now so you can read the code more easily (ten times over) later. This is often misinterpreted as "single-line functions", which is a bad idea in general. A well-written function should be readable in one continuous scroll without jumping up and down.
Because refactoring and decoupling make moving and changing code relatively effortless, this is more of an ongoing housekeeping activity. If something takes too much time to change, consider whether it is an actual programming task rather than a readability issue, and approach it as such. But moving a guard clause (see the blogpost for details) to the top of a function and flipping the condition on an if statement should not be a big deal.
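A tiny example of the guard clause move mentioned above (the toy mean-score helper is hypothetical):

```python
# Before: the main logic is buried inside a nested if/else.
def mean_score_nested(scores: list[float]) -> float:
    if scores:
        total = sum(scores)
        return total / len(scores)
    else:
        return 0.0

# After: the guard clause handles the edge case up front, and the
# body reads top to bottom in one continuous scroll.
def mean_score(scores: list[float]) -> float:
    if not scores:
        return 0.0
    return sum(scores) / len(scores)
```

Both versions behave identically; only the readability changes, which is exactly what a refactor should do.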
This is usually seen as a massive issue. "OMG, someone will look at my work; what will they think about me?" (This raises other questions about your work environment, but that's something different.)
In the DS context, the purpose of the CR is, first, to check whether someone else "gets it" and understands what you want to do. This also immediately moves the "bus factor" from 1 to 2.
Second, it publishes your ideas to the broader team, where colleagues can get familiar with your techniques and ideas. It is fundamentally a positive activity that helps team cohesion. It allows new team members to learn faster, take part in updating the codebase, and become valuable sooner.
Unit testing is usually listed first among DS code quality steps. But it is very hard to achieve test coverage of an ML model that gives you the same confidence TDD gives software engineers. So why try at all?
Well, sometimes critical edge cases are so important that you still want to test them explicitly, and unit tests are the best for this.
There is too much context around unit vs functional testing in a fast-changing, high-risk environment to expand on here. We used Kent Beck's experience at Facebook as an inspiration for our paradigm; I recommend watching it yourself: Kent Beck - 3X: Explore, Expand, Extract
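To make the edge-case point concrete, here is a hedged sketch of an explicit unit test for a critical boundary (the `safe_log_odds` helper and its clamping behaviour are hypothetical):

```python
import math
import unittest

def safe_log_odds(p: float, eps: float = 1e-9) -> float:
    # Clamp p away from 0 and 1 so log() never receives zero - the
    # critical edge cases a naive implementation would crash on.
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

class TestSafeLogOdds(unittest.TestCase):
    # Full coverage of an ML model is out of reach, but these two
    # edge cases matter enough to pin down explicitly.
    def test_extremes_are_finite(self):
        self.assertTrue(math.isfinite(safe_log_odds(0.0)))
        self.assertTrue(math.isfinite(safe_log_odds(1.0)))

    def test_symmetry(self):
        self.assertAlmostEqual(safe_log_odds(0.3), -safe_log_odds(0.7))
```

A handful of tests like this document the invariants you actually rely on, without pretending to give TDD-level guarantees.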
The usual adage is, "If you can't express yourself well in code, why do you think you can express yourself better in comments?"
Focus on changing the code and making it readable. To help users, write blog posts detailing how your code works and give them more context and examples. Point them to the code and ask if they “get” it. If not, that's an opportunity for refactoring.
This is the other element that others recommend as a high priority. The problem is that you are supposed to apply it everywhere, and in an otherwise dynamic language that is a lot of places just to pick up a few bugs and misuses.
The only places where we use it are dataclasses and Typer scripts (and, later, FastAPI endpoints).
SOLID (and more acronyms like KISS/DRY/WET)
These acronyms are usually thrown around, but it is tough to make them actionable as they are, and many of them are very software-engineering specific. Better to focus on refactoring, readability and other hands-on steps.
Decoupling (and Dependency Injection (the “D” from SOLID)) will take care of most of these, which can be achieved with three design patterns (Adapter/Factory/Strategy) in an actionable way.
More Design Patterns
Don't use design patterns just for the sake of it. Some patterns exist to implement a feature that a dynamic language like Python already has.
This doesn't mean they are not useful, but they are only selectively impactful.
CI/CD is automation. Automate when you need, but not earlier. If your project is still in a high-risk phase, why would you focus on automating something that might not go anywhere?
Follow Elon Musk's 5 step process: Elon Musk's 5 step process for making things in a better way.
Tying yourself to infrastructure early on will cause more trouble than benefit. You will focus on specific DSL details and not on solving the business problem at hand. The concept of Clean Architecture was created so infrastructure decisions can be held off until the last possible moment, right until you can't move forward without making the call. That is the optimal point to decide, because you will have the most information to decide with (see Elon Musk's five steps above).
Of course, this assumes that you can actually solve the problem. If you need 100 GPUs even to prove feasibility, you will need infrastructure on Day 0.
But just like CI/CD, early infrastructure is a type of premature optimisation, which we all know by now is the root of all evil.
Code quality is a spectrum. For a high-risk enterprise like an ML lifecycle, you need tailored advice at each step. This is particularly true if you are just starting to care about your codebase. Focus on a few key points, get routine, socialise the benefits and then move on to the next step. Don’t try to go all in on the entire menu because you will struggle to remember everything, and the entire activity will be a chore. Don’t forget this is supposed to be a value-added activity at any given time.
I hope you found this article useful. Subscribe now for more similar content: