While working on the “Code Quality for Data Science” project, I collected a tremendous amount of content on the topic. I am planning to share some of it in a new series. Join the Discord channel for discussions (more than 400 members in a month!).
Invite link: https://discord.com/invite/8uUZNMCad2
Adam Tornhill’s video on technical debt is one of my favourite pieces of content on the topic.
Key Points
Version control data is a good source for measuring technical debt.
A good metric is two-dimensional: the length of each file and the number of changes made to it. Files that score high on both are called hotspots.
Then look inside each hotspot file for the places in the code (functions/classes) that are hit by commits more often than others. These are prime candidates for refactoring.
“How big should microservices be?” (For data scientists, probably replace “microservices” with “classes”.) Look at the large ones and how much they change. If they change a lot, they are probably too large.
Temporal coupling: If you change B any time you change A, the two might be coupled, and reorganisation is warranted.
Separation of concerns:
Based on technical layers: This leads to change coupling, where changes go through the entire system (CACE: changing anything changes everything). All coders work on the whole codebase, stepping on each other’s changes.
Feature-based layers: You need an architecture that supports this, or you are back at point one.
Then there is a section about Conway’s law and how it should be reflected in the codebase; otherwise, you get emergent negative patterns like “diffusion of responsibility”, “shotgun surgery”, and “high process loss”. And again, SWEs end up working on the whole codebase all the time.
He has a free visual tool to analyse your repository so that you can test this in practice.
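For a rough first look without a dedicated tool, the hotspot and temporal coupling ideas above can be sketched in a few lines of Python on top of plain `git log` output. This is only a minimal sketch and not how Tornhill's tool works: the function names, the `--commit--` separator, the "changes times length" hotspot score, and the overlap ratio used for coupling are all assumptions made for illustration.

```python
import subprocess
from collections import Counter
from itertools import combinations
from pathlib import Path

COMMIT_MARKER = "--commit--"  # arbitrary separator chosen for this sketch


def read_git_log(repo="."):
    """Raw `git log` output: one marker line per commit, then the files it touched."""
    cmd = ["git", "-C", repo, "log", "--name-only",
           f"--pretty=format:{COMMIT_MARKER}"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_commits(log_text):
    """Split the log into commits, each represented as the set of files it touched."""
    commits = []
    for line in log_text.splitlines():
        line = line.strip()
        if line == COMMIT_MARKER:
            commits.append(set())
        elif line and commits:
            commits[-1].add(line)
    return [files for files in commits if files]


def change_counts(commits):
    """Number of commits that touched each file."""
    return Counter(f for files in commits for f in files)


def line_counts(repo, files):
    """Current length in lines of each file that still exists in the working tree."""
    counts = {}
    for f in files:
        path = Path(repo) / f
        if path.is_file():
            counts[f] = len(path.read_text(errors="ignore").splitlines())
    return counts


def hotspots(changes, lengths, top=10):
    """Rank files by changes x length (an assumed score): high on both floats up."""
    scored = {f: changes[f] * lengths[f] for f in changes if f in lengths}
    return sorted(scored, key=scored.get, reverse=True)[:top]


def temporal_coupling(commits, min_shared=2):
    """For file pairs changed together at least `min_shared` times, return the
    share of commits touching either file that touched both (assumed measure)."""
    changes = change_counts(commits)
    together = Counter()
    for files in commits:
        together.update(combinations(sorted(files), 2))
    return {
        pair: n / (changes[pair[0]] + changes[pair[1]] - n)
        for pair, n in together.items()
        if n >= min_shared
    }
```

To try it, call `parse_commits(read_git_log("path/to/repo"))`, pass the result to `change_counts` and `temporal_coupling`, and feed the change counts plus `line_counts("path/to/repo", changes)` into `hotspots`. Very large commits (bulk renames, reformatting) will inflate the coupling numbers, so it may be worth filtering those out first.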
Interesting content
I plan to write once a week on the topic. Let me know in the comments if you think this is worth it. Also, I am working on a series about Design Patterns in Data Science. Subscribe now if you want to be notified: