Discover more from Deliberate Machine Learning
What is a Code Smell and what can you do about it as a Data Scientist? (Part 1)
Join our Discord community, “Code Quality for Data Science (CQ4DS)”, to learn more about the topic: https://discord.gg/8uUZNMCad2. All DSes are welcome regardless of skill level!
The first part of the series introduced how to bring your code from a notebook into a state where you can start refactoring. The second part introduced two important structural design patterns helpful in increasing the code’s modularity by decoupling.
This third part will introduce the concept of code smells. Like the previous parts, I will not give a complete picture but only mention the ones I think are typical or important for Data Science workflows. It is more important to get momentum on the topic than to be complete. I will provide additional material in the comments at the end of the article.
As we already know:
code is read ten times more than written
Usually, the refactoring process starts with reading the code looking for a suitable place to change a new feature. During this, you realise that something is not right. Something stops you from either quickly understanding the code or changing the code.
This can be a sign of code smell.
What is a code smell?
It is essential to understand that code smells are not bugs; the code works as intended but is implemented in a subjectively weak way.
I copy the definition from Wikipedia: “Smells are certain structures in the code that indicate a violation of fundamental design principles and negatively impact design quality” and also “Code smells are an indicator of factors that contribute to technical debt.”
An alternative definition is that code smells are a set of 22 well defined and named patterns that might need refactoring attention. Listing these 22 and naming them enables building a common language and communication around them.
Not every smell needs refactoring, but it is usually worth starting refactoring with them.
Five categories of code smell
Data Scientists regularly end up with mega classes and mega functions that are hard to understand and change.
These smells are collectively called “Bloaters”, code elements that shouldn’t be as big as they are now. Reading a long piece of code is tiring, especially if you need to change it and comprehend more code than necessary.
Three code smells in this category are Long Method, Large Class and Long Parameter List. The treatment for these is usually to extract part of the code into a separate class or function. Move variables and code that belong together closer and make them decoupled from the part where they are right now.
Long methods, classes and parameter lists are usually a sign that a part of the code is doing too much, it has too much responsibility, and multiple functions are coupled. If you need to change any of them, you need to think through how it works to make sure you don’t change anything accidentally.
There are two other important ones for DS projects in this category: Primitive Obsession and Data Clumps. These happen when you use basic types like strings and integers instead of a class. The typical situation is when you return multiple values from a function or pass a pair of variables around, and they always appear in each other’s company (typically input and output values, train and test sets etc.). Treating these can also help the Long Parameter List smell as well.
2. Object-Orientation Abusers
You can find many sources for talking down the four OOP principles, and DSes rarely follow them anyway, so I think they are less relevant in this article. Please check out the linked sources in the comments to learn more about these. My recommendation is don’t overuse the four principles, and you will be fine.
There is one here I mention nevertheless by name. It is: Switch Statements or If Statements in general. This is an interesting case I often see in heavily numerical applications, partly together with long parameter lists.
You can regularly see a pattern that multiple primitive types and boolean variables are passed to a method that does way too many things at once. Inside the function, different behaviours can be selected based on these input parameters by logical statements.
This is very hard to read because the behaviour is dependent on the values, and you need to replay the code in your head to understand which part is relevant for you.
The solution is to decompose the function into multiple parts. Parameters that govern general behaviour should be handled at construction time with Strategy Patterns. Remember that complex behaviour is built, not declared.
3. Change Preventers
I selected Shotgun Surgery from this category, partly because it has a funky name, but also I found this typical in DS projects. You try to change something and realise that you need to change the code in multiple places. This is a sign of coupling, which makes change harder and accelerates the accumulation of tech debt. In code review, you can spot this when many files need to be changed in a single pull request.
To resolve this, you need to think about how and why different parts of the codebase are related and try making these get the information from the same place, so in the future, you only need to change that one place. It can be related to Primitive Obsession.
This is an easy one, something that adds no value to the solution: Duplicate Code, Dead Code, Comments. Of course, as code smells are not bugs, you need to make subjective decisions with these. Comments are probably the most controversial one here. Still, one should aim for self-documenting code unless there is some external pressure (some users can’t access the source code, complicated use, etc.). In general, comments are not tested, so they can diverge from actual behaviour and cause more problems than benefits.
Another worth mentioning is Speculative Generality or YAGNI. Usually, it is not worth writing something that has no immediate utility. Because code is read 10x more than written each time you make some speculative prediction, it really needs to be worth it, or otherwise, it will be ten times read and confuse the reader. It is better to keep the codebase in a state where change is easy and make that change when you actually need it.
These are usually coupled classes for reasons that can be avoided. Two typical ones are Feature Envy and Inappropriate Intimacy. I selected these partly because they are typical in DS project and also again for the funky names.
A class is using another class’s data or fields. Usually, it helps to rethink the relationship between the two classes. For example, if you need to get all the data out of a class, do a calculation and then pass on the result, why is that calculation not part of the original class? Call a member function and get the result straight from the class instead of getting the data out. For Inappropriate Intimacy, why do you need to get access to a class? What are you doing with the data? Why is the class itself not responsible for this behaviour? Restructure to eliminate the access.
Often a refactoring effort restructures code and leaves new places to fix. It is worth taking a big picture view of what exactly happening and fixing newly introduced smells. The goal is to quickly iterate to a structure that solves the problem in its current iteration rather than when the project started.
In the following parts, I will be focusing on concrete examples and methods to resolve some of these in practice. I also plan to write about organising code in the body of a single method to have a clean flow and good readability. Subscribe to be notified and see further readings in the comments.