Refactoring for Data Scientists: How to maintain readability in a single method?
Join our Discord community, “Code Quality for Data Science (CQ4DS)”, to learn more about the topic: https://discord.gg/8uUZNMCad2. All DSes are welcome regardless of skill level!
So far in the series, we covered:
Before we continue our journey towards clean code and discuss how to eliminate code smells, I thought I would write a post on tactical-level thinking:
How to structure your code in a single method?
If you followed the series, you have assertion-based testing, but you might want more.
If you want to write unit tests, it is best to start here to lock down the specification. At this stage, you shouldn’t change your code apart from enabling testing. This means refactoring coupled dependencies as you did in the first part, but now at the method level.
Tests should only test the behaviour of the method under test and not coupled structures, and you should replace these coupled structures with test doubles.
Covering the entire testing literature is out of scope here, but a quick tip is to start writing tests for the shortest branch in your code and move towards the longer ones. The shortest branch is the easiest to test, and it gets you into the habit of decoupling and wiring in test doubles. If you want to read more about this, I recommend searching for “Given-When-Then” style testing (e.g., here).
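As an illustration, here is a minimal sketch of a Given-When-Then test that starts with the shortest branch and replaces a coupled dependency with a test double. The names `convert` and `FakeRateSource` are hypothetical, invented for this example, and the tests are plain functions that a runner such as pytest could pick up.

```python
class FakeRateSource:
    """Hypothetical test double standing in for a coupled dependency (e.g. a database or an API)."""

    def __init__(self, rate):
        self.rate = rate
        self.calls = 0

    def get_rate(self, currency):
        # Record the call so a test can check whether the dependency was touched.
        self.calls += 1
        return self.rate


def convert(amount, currency, rate_source):
    """Hypothetical method under test; the dependency is passed in instead of being hard-wired."""
    # Shortest branch: nothing to convert, so the rate source is never used.
    if amount == 0:
        return 0.0
    return amount * rate_source.get_rate(currency)


def test_zero_amount_returns_zero_without_lookup():
    # Given a rate source
    rates = FakeRateSource(rate=1.25)
    # When converting nothing
    result = convert(0, "EUR", rates)
    # Then the result is zero and the dependency was not called
    assert result == 0.0
    assert rates.calls == 0


def test_amount_is_multiplied_by_rate():
    # Given a rate source with a known rate
    rates = FakeRateSource(rate=1.25)
    # When converting a non-zero amount
    result = convert(8, "EUR", rates)
    # Then the rate is looked up once and applied
    assert result == 10.0
    assert rates.calls == 1
```

The first test covers the shortest branch and needs almost no setup. The second one follows the longer branch and reuses the same test double.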
There is no right or wrong way to write code, just like writing in English.
This is why you need simple principles that you can adhere to and check your refactoring attempts against, asking yourself whether you are doing the right thing: “Why exactly do I want to rename this variable or extract this method?”
There are many principles, but here are the more important ones:
Optimise for readability. Code is read ten times more often than it is written.
Avoid premature optimisation. Focus on readability even in exchange for some performance. You can come back later and deal with performance if it turns out to be an issue. This is usually done with a Strategy Pattern, swapping out the easy-to-read but inefficient legacy solution for a high-performing one.
Keep parts that depend on each other close together. Coherence is key. You might recognise a structure (method or class) that you can extract.
Focus on reducing coupling. Change is inevitable, and coupling makes change difficult.
Campsite Rule (FKA Boy Scout Rule). Leave the codebase in a better shape than you found it. Pay it forward for yourself in six months.
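To make the “avoid premature optimisation” principle concrete, here is a minimal, function-based sketch of the Strategy Pattern. All names (`moving_average_readable`, `moving_average_running_sum`, `smooth_sales`) are hypothetical, and the sketch assumes `window` is a positive integer. The caller stays the same while the easy-to-read implementation is swapped for a faster one.

```python
def moving_average_readable(values, window):
    """Easy to read: recompute the sum of every window from scratch."""
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


def moving_average_running_sum(values, window):
    """Faster on long inputs: update a running sum instead of re-summing each window."""
    if len(values) < window:
        return []
    running = sum(values[:window])
    averages = [running / window]
    for index in range(window, len(values)):
        # Slide the window: add the new value, drop the oldest one.
        running += values[index] - values[index - window]
        averages.append(running / window)
    return averages


def smooth_sales(daily_sales, window, moving_average=moving_average_readable):
    """The caller depends only on the strategy's signature, not on how it works."""
    return moving_average(daily_sales, window)


sales = [1, 2, 3, 4, 5]
default_result = smooth_sales(sales, window=2)
fast_result = smooth_sales(sales, window=2, moving_average=moving_average_running_sum)
```

Start with the readable version as the default. If profiling later shows it is a bottleneck, pass in the faster strategy and keep the readable one around as a reference to test against.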
The refactored method should be readable from top to bottom in one go without jumping back and forth in the codebase. The success of these efforts will be tested during code review by the reviewer.
There are a few general style patterns that you can converge to. These are by no means rules, but they definitely help:
Maintain a shallow structure. Large indentations are very hard to follow; try extracting methods and moving code around to simplify. Eliminate else return blocks by flipping the conditionals (see guard clauses).
Move guard clauses to the top. A guard clause checks whether the code should run at all, typically throwing an exception or returning some default value like 0 or None. If you read your code in six months and want to find out whether an illegal input throws an error or returns a default value, you don’t want to reread the entire method, just the error handling part.
Keep the “happy” path on the left. Your eye should just scroll downwards in the standard branch as you read the code instead of jumping between indentation levels. This will make it much faster to recall what the method does in the future.
The very last line of the method should be the return statement (apart from guard clauses). Of course, this is not always possible, but it definitely helps.
Hopefully, by now, you have (optionally) unit-tested code that meets some of the readability goals above, and now it is time to go deeper.
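The style patterns above can be sketched with a small before-and-after pair. The function names and the shape of the `orders` data (a list of dicts with `value` and `paid` keys) are assumptions made up for this example.

```python
def mean_paid_value_nested(orders):
    """Before: nested conditionals, the error handling is hidden at the bottom."""
    if orders is not None:
        paid = [order["value"] for order in orders if order["paid"]]
        if len(paid) > 0:
            result = sum(paid) / len(paid)
        else:
            result = 0.0
    else:
        raise ValueError("orders must not be None")
    return result


def mean_paid_value(orders):
    """After: guard clauses at the top, happy path on the left, return on the last line."""
    # Guard clause: illegal input raises immediately.
    if orders is None:
        raise ValueError("orders must not be None")

    paid = [order["value"] for order in orders if order["paid"]]

    # Guard clause: nothing to average, return a default value.
    if not paid:
        return 0.0

    return sum(paid) / len(paid)


orders = [
    {"value": 10.0, "paid": True},
    {"value": 30.0, "paid": True},
    {"value": 99.0, "paid": False},
]
```

Both versions behave the same, but in the second one you only need to read the first few lines to see what happens to illegal or empty input.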
The best place to start is in the deepest (most right-indented) part of the method.
A shallow structure is the most challenging goal to achieve, so this is a typical situation to face. The middle of the codebase usually depends on bad practices like global variables and other dependencies. The deepest branch, on the other hand, gets all the external information, does something with it and returns the results, so it is the easiest place to start.
A couple of typical strategies to change the code:
Identify feature envy. Getting the data out of a class, doing some calculations, and passing on the result is a typical (anti-)pattern. Does that code belong here or in the class where the data already is? Move the calculation there and call the class’s new member function.
Inline variables. The fewer variables you have, the fewer things you need to name, which we know is the second hardest thing in Computer Science. Variables also create coupling; if they don’t exist, you can’t depend on them in another part of the code.
Increase coherence. Move creation and usage of a variable close to each other. You might recognise a structure (method or class) that you can extract (e.g. with Strategy Pattern).
Remove code. Try solving the same problem with less code. Combine variables into data classes, eliminating the Primitive Obsession and Data Clumps code smells. Then inline the created variables.
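Here is a minimal sketch of how several of these strategies can combine. A clump of four primitives becomes a data class, the calculation moves next to the data so callers are not tempted into feature envy, and a temporary variable is inlined. `BoundingBox` and both functions are hypothetical names for this example.

```python
from dataclasses import dataclass


# Before: four primitives always travel together (a data clump),
# and the caller does the calculation on them itself.
def count_points_inside_before(points, min_lat, max_lat, min_lon, max_lon):
    count = 0
    for lat, lon in points:
        inside = min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
        if inside:
            count += 1
    return count


# After: the clump is a data class and the calculation lives with the data.
@dataclass(frozen=True)
class BoundingBox:
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat, lon):
        return (
            self.min_lat <= lat <= self.max_lat
            and self.min_lon <= lon <= self.max_lon
        )


def count_points_inside(points, box):
    # The temporary variables are inlined; True counts as 1 when summed.
    return sum(box.contains(lat, lon) for lat, lon in points)


points = [(1.0, 1.0), (5.0, 5.0), (11.0, 1.0)]
box = BoundingBox(min_lat=0.0, max_lat=10.0, min_lon=0.0, max_lon=10.0)
```

The refactored function has two parameters instead of five, and the containment rule now has exactly one home if it ever needs to change.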
There are other principles on how to name things:
Name methods according to what they do, not how they do it. Don’t forget that you might change the implementation later. You don’t want to chase down each use of the method to rename it to reflect the new implementation. The easiest approach is to name methods after their output, as that is their most important feature.
Name things according to what they are, not what you want to use them for. You might use them for something else later.
Use domain and business language. This will help you communicate about your solution as you don’t need to translate conversations with the business into a different language embedded in your codebase.
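A small sketch of these naming principles, using made-up names and a made-up `revenue_by_region` dictionary. The first function is named after its implementation and its parameter after its intended use; the second is named after its output and uses domain language.

```python
# Named after how it works ("loop and add") and what the data will be used for ("plot").
# Both the implementation and the use may change, leaving the names misleading.
def loop_and_add(plot_data):
    total = 0.0
    for value in plot_data.values():
        total += value
    return total


# Named after the output, with a parameter that says what the data is.
# The body can change freely without forcing a rename at every call site.
def total_revenue(revenue_by_region):
    return sum(revenue_by_region.values())


revenue_by_region = {"EMEA": 120.0, "APAC": 80.0}
```

A call such as `total_revenue(revenue_by_region)` reads like a sentence you could say to a business stakeholder, which is the point of using domain language.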
This tactical readability refactoring is a good step towards better-structured code.
Unlike in the previous steps, where you just moved code around, here you actively change working code, so it is riskier. I recommend regularly running the unit tests you wrote (or, in their absence, the functional tests from Part 1) and committing your code frequently to verify that you are still OK. If not, you can always undo your work. Don’t forget that the whole exercise is to turn unrecoverable Type 1 decisions into Type 2 decisions that can be undone.
If you are lost in version control, there is always help at:
In the next part, I will write about how to set up shell scripts, directory structures and tests for convenience. I will continue writing about concrete code smells, and then we can move on to the high-level structure of your code. Subscribe to be notified, and see the sources used in the comments.