[edit]
We updated the exercise based on feedback from the community. The update brings simpler and more educational steps, consistent formatting, and better instructions:
https://github.com/xLaszlo/CQ4DS-notebook-sklearn-refactoring-exercise/pull/21
Community announcement: CQ4DS
We started building a Discord community called “Code Quality for Data Science (CQ4DS)”, dedicated to (self-)educating data scientists to write better code. I am not going to lie to you: we are figuring this out as we go along, but it is much better to do it together than to struggle alone. If you want to join a community where everyone is a beginner, please feel free to do so here:
Entry requirement: Work in a data-related field and be willing to learn and help others, even if just by reading and commenting. No seniority or previous experience is required. Inclusivity is paramount.
Refactoring the Titanic
This blog post provides background information for a new type of content: refactoring the classic Titanic modelling experiment into a structure that has beneficial features (more on that later). Through this, you will learn how to think about and change your own code in your own (professional) environment.
See the repository: https://github.com/xLaszlo/CQ4DS-notebook-sklearn-refactoring-exercise.
Follow the “Howto” section to start the programme.
Read the notebook in Step 0.
Read this blog post in parallel.
Start doing the steps in INSTRUCTIONS.md.
Feel free to compare your progress along the way to the steps in this pull request: https://github.com/xLaszlo/CQ4DS-notebook-sklearn-refactoring-exercise/pull/21
Join the CQ4DS community for more help and comments: https://discord.gg/8uUZNMCad2.
High-level refactoring strategy and plan
Before you start refactoring, decide on the features and capabilities of the refactored code. How would it help you in the future? Typical answers are:
I want to try new models and features while keeping the old ones as well.
I want to get data from different sources.
I want to connect to different existing systems.
I want to do this safely without worrying about ruining my previous work.
I want to react to uncertainty.
The goal is not to have a strict roadmap but a general idea to understand when you can stop refactoring because the code is good enough. Also, there will be a lot of uncertainty. Who knows what hides in the spaghetti?
Setup
The next question is, what do you need to do to achieve your goals while modifying the code but not its behaviour? There are always a couple of sanity steps: move your code into a script, and create a requirements.txt and a virtual environment. [Step 0-4]
Next, decide how you will know that you didn't change the behaviour of your code. Which variables' values need to stay fixed after you change your code? These variables form your test set: save their values, rerun the code and compare the new values to the old ones. These comparisons will be your tests. Be economical about how many variables you choose. Otherwise, you will have a hard time changing and removing variables. This might not be trivial in a data-heavy and computation-heavy environment, so experiment until you are satisfied. Also, fix the seeds of the random number generators. [Step 0 and Step 4]
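A minimal sketch of such a test, assuming the values you want to pin down fit into a small dictionary. The function names, the file name and the toy "experiment" are hypothetical and not taken from the repository; the point is only the save-then-compare mechanism and the fixed seed.

```python
import json
import random
import tempfile
from pathlib import Path


def run_experiment(seed):
    """Hypothetical stand-in for the legacy script; returns the values to pin down."""
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    scores = [rng.random() for _ in range(100)]
    return {"n_rows": len(scores), "mean_score": round(sum(scores) / len(scores), 6)}


def check_against_snapshot(result, snapshot_path):
    """Save the result on the first run; compare against the saved copy afterwards."""
    if not snapshot_path.exists():
        snapshot_path.write_text(json.dumps(result, sort_keys=True))
        return True
    expected = json.loads(snapshot_path.read_text())
    return result == expected


snapshot_file = Path(tempfile.mkdtemp()) / "expected_values.json"

# First run records the values, second run must reproduce them.
first_run_ok = check_against_snapshot(run_experiment(seed=42), snapshot_file)
second_run_ok = check_against_snapshot(run_experiment(seed=42), snapshot_file)

# A behaviour change (simulated here by tampering with one value) is detected.
changed_result = dict(run_experiment(seed=42), n_rows=99)
changed_run_ok = check_against_snapshot(changed_result, snapshot_file)
```

Keeping the snapshot small is what makes it cheap to change or remove variables later.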
Pipelines, pipelines everywhere
Once you have established these, look at the code and think about its general structure. What does it expose to the world, what does it connect to, and where does its output go? What are the big-picture steps? (I usually put comments, TODOs and sometimes just a long line of dashes in a comment to see where the boundaries are between bigger chunks.) For this code, the steps are roughly:
Connects to a database (included in the repository as an SQLite file).
Gathers some data (the classic Titanic example).
Does feature engineering.
Fits a model to estimate survival (sklearn's LogisticRegression).
Evaluates the model.
This code has a generic "pipeline" structure. It has a source (the database), a sink (model performance results), and a lot of code in between. This is not universal but pretty typical for DS projects.
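The source-to-sink shape can be sketched as a chain of small functions. This is an illustration only: the table name, the columns, the rows and the toy majority-vote "model" are all made up, and the toy model merely stands in for sklearn's LogisticRegression so that the sketch runs with the standard library.

```python
import sqlite3


def load_passengers(connection):
    """Source: read raw rows from the database."""
    query = "SELECT sex, age, survived FROM passengers ORDER BY rowid"
    return connection.execute(query).fetchall()


def build_features(rows):
    """Feature engineering: turn raw rows into numeric features and targets."""
    features = [[1.0 if sex == "female" else 0.0, float(age)] for sex, age, _ in rows]
    targets = [survived for _, _, survived in rows]
    return features, targets


def fit_model(features, targets):
    """Toy model: predict the majority outcome seen for each value of the first feature."""
    outcomes = {}
    for row, target in zip(features, targets):
        outcomes.setdefault(row[0], []).append(target)
    majority = {key: int(2 * sum(values) > len(values)) for key, values in outcomes.items()}
    return lambda row: majority[row[0]]


def evaluate(predict, features, targets):
    """Sink: the share of correct predictions."""
    hits = sum(predict(row) == target for row, target in zip(features, targets))
    return hits / len(targets)


# Made-up rows in an in-memory database, only to make the sketch runnable.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE passengers (sex TEXT, age REAL, survived INTEGER)")
connection.executemany(
    "INSERT INTO passengers VALUES (?, ?, ?)",
    [("female", 29, 1), ("male", 40, 0), ("female", 2, 1), ("male", 30, 0), ("male", 25, 1)],
)

rows = load_passengers(connection)
features, targets = build_features(rows)
predict = fit_model(features, targets)
accuracy = evaluate(predict, features, targets)
```

Each function has one input and one output, so you can work on one stage at a time and rerun the tests after each change.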
An important strategy when dealing with a legacy codebase is to enable yourself to focus on only one thing at a time. Fix that and move on. Keep running the tests: they reassure you that changing this one thing didn't alter the output of the rest of the code.
The pipeline structure naturally gives you this breakdown, and the most straightforward place to start is at the beginning. When progressing, look at each section on its own and assess how it blocks you from having the benefits listed at the beginning of the article. Make changes towards having the capability. Move the code around to improve readability.
Design Patterns bring you options
Refactored code gives you optionality. This optionality is easily achieved by decoupling: moving code into a class and plugging the class back in through the constructor (this is Dependency Injection). If you want to change the code, just write a new class and plug that in at runtime. Three Design Patterns will help you with this (Adapter/Strategy/Factory) [see my slides on this, links also in the instructions].
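A minimal sketch of constructor-based Dependency Injection, closest in spirit to the Strategy pattern. All class names and the sample data are hypothetical and are not the classes used in the exercise.

```python
import csv
import io
from typing import Protocol


class PassengerLoader(Protocol):
    """Anything with a load() method returning passenger dicts can be plugged in."""

    def load(self) -> list: ...


class InMemoryLoader:
    """One source: rows that already live in memory."""

    def __init__(self, rows):
        self._rows = rows

    def load(self):
        return list(self._rows)


class CsvTextLoader:
    """A second source with the same interface, reading CSV text."""

    def __init__(self, text):
        self._text = text

    def load(self):
        reader = csv.DictReader(io.StringIO(self._text))
        return [{"name": row["name"], "survived": int(row["survived"])} for row in reader]


class SurvivalReport:
    """Depends only on the loader interface, injected through the constructor."""

    def __init__(self, loader: PassengerLoader):
        self._loader = loader

    def survival_rate(self):
        passengers = self._loader.load()
        return sum(p["survived"] for p in passengers) / len(passengers)


# Made-up sample data for both sources.
memory_rows = [{"name": "A", "survived": 1}, {"name": "B", "survived": 0}]
csv_text = "name,survived\nA,1\nB,0\nC,1\nD,1\n"

# The report code stays the same; only the plugged-in class changes.
rate_from_memory = SurvivalReport(InMemoryLoader(memory_rows)).survival_rate()
rate_from_csv = SurvivalReport(CsvTextLoader(csv_text)).survival_rate()
```

Adding a new data source now means writing one new class, without touching `SurvivalReport` or the old loaders.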
Entanglement, coupling, spaghetti
A pipeline structure has a simple logical framework: each pipe has a beginning and an end. Entangled code has several of these "pipes" (logical programming tasks) mixed together, which makes it hard to read and hard to decouple if you have to. Pipes communicate through variables, and one sign of this entanglement is that the lifecycles of variables are all over the place. They are created in one place and used much later. They should be declared at the latest possible point and abandoned as soon as possible. Can you move their first and last use and the code that affects them closer to each other? This can help you improve structure by shifting code up and down, revealing a previously obscured structure.
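A small, made-up illustration of shortening variable lifecycles. Both functions return the same values; the second one keeps each variable's creation next to its only use, which makes the two independent "pipes" visible.

```python
def summary_entangled(ages, fares):
    # Both totals are created early and used much later,
    # so the two computations are interleaved.
    age_total = 0
    fare_total = 0
    for age in ages:
        age_total += age
    for fare in fares:
        fare_total += fare
    mean_fare = fare_total / len(fares)
    mean_age = age_total / len(ages)
    return mean_age, mean_fare


def summary_untangled(ages, fares):
    # Each value is computed in one place, right before it is returned.
    mean_age = sum(ages) / len(ages)
    mean_fare = sum(fares) / len(fares)
    return mean_age, mean_fare


# Made-up sample values.
sample_ages = [22, 38, 26]
sample_fares = [7.25, 71.25, 8.0]
```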
Readability as tactical moves
At the same time, look out for classic readability issues: confusing names, large logical expressions, for-loops instead of list/dictionary comprehensions, and repeated code blocks. Make these tactical changes to get into the habit of changing code. It is a mental habit similar to breaking writer's block by starting to type about your day. It helps you to get into the flow and trust your tests.
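Two of these tactical moves in one sketch: giving a large logical expression a name and replacing a for-loop with a comprehension. The selection rule and the data are invented for illustration.

```python
# Made-up sample data.
passengers = [
    {"name": "A", "age": 4, "sex": "male", "pclass": 3},
    {"name": "B", "age": 30, "sex": "female", "pclass": 1},
    {"name": "C", "age": 40, "sex": "male", "pclass": 1},
    {"name": "D", "age": 10, "sex": "male", "pclass": 2},
]

# Before: an anonymous condition buried in a loop.
selected_before = []
for p in passengers:
    if (p["age"] < 16 or p["sex"] == "female") and p["pclass"] in (1, 2):
        selected_before.append(p["name"])


# After: the condition has a name, and the loop is a comprehension.
def is_priority_passenger(passenger):
    """Hypothetical rule: a child or a woman travelling in first or second class."""
    is_child_or_woman = passenger["age"] < 16 or passenger["sex"] == "female"
    return is_child_or_woman and passenger["pclass"] in (1, 2)


selected_after = [p["name"] for p in passengers if is_priority_passenger(p)]
```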
If you can't resolve some entanglement by moving code around, you can duplicate code temporarily. You can always remove it later. On the other hand, duplicated code can be a sign that you have a piece of functionality that operates on different data but is not expressed explicitly as a function. It is as if you had a virtual pipe that you use twice, entangled into one pipe. Running the same operations on both training and test data is a typical sign of that. [Step 14 and Step 16, and many others around them].
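A sketch of turning such a repeated block into an explicit function. The imputation step and the rows are made up; the idea is that the same operation is written once and called for both the training and the test data.

```python
def fill_missing_ages(rows, default_age):
    """Return copies of the rows with missing ages replaced by default_age."""
    return [dict(row, age=default_age if row["age"] is None else row["age"]) for row in rows]


# Made-up sample data.
train_rows = [{"age": 20}, {"age": None}, {"age": 40}]
test_rows = [{"age": None}, {"age": 10}]

# The default is estimated on the training data only, then reused for both sets.
known_ages = [row["age"] for row in train_rows if row["age"] is not None]
mean_train_age = sum(known_ages) / len(known_ages)

train_filled = fill_missing_ages(train_rows, mean_train_age)
test_filled = fill_missing_ages(test_rows, mean_train_age)
```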
Another tactical operation is writing things out explicitly instead of implying them. "Explicit is better than implicit". Someone editing the code in the future won't have the same context as you have now, so help them by writing these out. If the external context changes, it is better that the code breaks rather than fails silently by relying on now-invalid assumptions. Avoid default values and implied structure (for example, the ordering of columns in a dataframe or even the column names). Try using dataclasses to pass values around and access dataframes through the itertuples() iterator. That makes the column names explicit, and if someone renames or drops one upstream, the code here will break visibly, and you can investigate why the source changed.
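A minimal sketch of the dataclass plus itertuples() idea, assuming pandas is installed. The `Passenger` fields and the column names (`PassengerId`, `Age`, `Sex`) are assumptions made for this illustration and may differ from the ones in the exercise.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class Passenger:
    passenger_id: int
    age: float
    is_female: bool


def passengers_from_frame(frame):
    """Convert rows into Passenger objects, naming every column that is used."""
    # Attribute access by column name raises AttributeError if a column
    # was renamed or dropped upstream, instead of silently misreading data.
    return [
        Passenger(
            passenger_id=int(row.PassengerId),
            age=float(row.Age),
            is_female=row.Sex == "female",
        )
        for row in frame.itertuples(index=False)
    ]


# Made-up sample data.
frame = pd.DataFrame(
    {"PassengerId": [1, 2], "Age": [22.0, 38.0], "Sex": ["male", "female"]}
)
passengers = passengers_from_frame(frame)
```

Calling `passengers_from_frame(frame.drop(columns=["Age"]))` fails with an `AttributeError`, which is the kind of loud failure the paragraph above asks for.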
Naming variables or not
Try extracting names that are relevant to your project [Step 10]. Even if the technical context changes, the names of entities in the business context are unlikely to change (or if they do, your code should also reflect this). It is good to share the same language as your stakeholders (this is also called "ubiquitous language").
Inline variables (within reason). If a variable is only used once (or twice), copy the constructing code into each place rather than having a variable (make a note to address the code duplication, see above). This can help to reduce entanglement.
Naming is one of the two hardest problems in computer science (the other two are cache invalidation and off-by-one errors, but luckily, these are not in play here). The easiest way to name is not naming at all. Feel free to construct a class or a comprehension in the return statement.
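A small, made-up example of "not naming": the second function builds the result directly in the return statement, so there are no intermediate names to invent or keep track of. `Split` is a hypothetical dataclass used only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Split:
    train: list
    test: list


def split_with_names(rows, n_train):
    # Three names that are each used exactly once.
    train_rows = rows[:n_train]
    test_rows = rows[n_train:]
    result = Split(train=train_rows, test=test_rows)
    return result


def split_inline(rows, n_train):
    # The class is constructed in the return statement.
    return Split(train=rows[:n_train], test=rows[n_train:])


sample_rows = ["a", "b", "c", "d", "e"]
```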
Finishing and next steps
Once you have gone through the whole code, check whether it:
Achieves the benefits that you wanted at the beginning.
Has a clear and well-communicated (through code and structure) intent.
Passes the tests (does the same thing as before refactoring).
If so, the refactoring is done (for now). If you are stuck, move on with what you want to change. Try making the change, and you will see if the code "enables" you to make these changes easily. If not, maybe you left something to clean up that is clearer now. Or perhaps you weren't clear about what benefits you want from your code. Review the list you made before starting refactoring and go through the (now much cleaner) code with this new goal in mind.
Heavy usage (writing and rewriting) will make it much clearer what you want from the code. It is a much better way to figure out structure than rigid planning. Testing and version control should enable you to experiment with different structures (not just other models/sources/etc.), so trying out new ones takes relatively little effort.
I hope you will work your way through the steps and learn refactoring in practice. If you have more questions or comments, please join the CQ4DS Discord: