Code Quality in Data Science: 17 posts on my substack
2024-06-29 - Clean Architecture, Design Patterns, Code Smells, Technical Debt
I have had to share my articles on code quality in data science so many times that I decided to write a “Best of” post: a single link I can drop on anyone who is interested in the topic.
On Technical Debt
How can Data Scientists sell “Tech Debt” to their managers?
Sourced from the great Adam Tornhill, it’s what it says on the tin.
[Interesting content] Adam Tornhill - A Crystal Ball to Prioritize Technical Debt
Once you have sold it, you need to deal with it. Check out this review of his video.
Can you version control Jupyter notebooks?
No, you can’t, and you don’t have to, but that’s the whole point.
There is a reason this article is in the Technical Debt section in a post about Code Quality in Data Science.
Big picture thinking:
Popular framework to measure high-velocity projects in uncertain environments. The diagram helps you place the framework into a simple mental model.
The application, DS, and Model domains are the three main areas through which your solution needs to join your company’s larger software environment.
It explains how to think in terms of DDD (Domain Driven Design) in the context of ML.
Clean Architecture in Data Science (Part 1)
Super important.
[Edit: The feedback was that this one is dated, too abstract and hard to understand. I will write an update on this and share it here. Until then, read/watch my presentation below at PyData London 2022]
No, there is no Part 2, because pretty much all of my other content advocates this.
You Only Need These 3 Data Roles in a Data-Driven Enterprise
Conway’s Law says that your software will look like your org structure, so here is a post on how to structure that as well.
Also talks about “Self Determination Theory”, an important framework to keep teams motivated.
[Interesting content] 3x Explore, Expand, Extract • Kent Beck • YOW! 2018
Your product itself changes as it grows, and this post describes how code quality will help you in the three stages Kent Beck identified when he was at Facebook. His hilarious presentation on YouTube is included as well.
Structuring code; writing clean code:
“Clean Architecture: How to structure your ML projects to reduce technical debt”
I presented this at the PyData London 2022 conference.
The slides describe the workflow and justification for creating Clean Architecture in high-velocity projects. I also included a link to the presentation video.
“Code Smells in Data Science: What can we do about them?”
I presented this at the PyData London 2023 conference.
If Clean Architecture is “strategy”, then this is “tactics”. Once you structure your project, you need to take care of the actual code blocks.
The recording at the conference was broken, so I included another one from when I presented the same talk at MLOps Bristol at Graphcore’s offices.
I regularly present these two to interested teams at various companies. Get in touch at https://hypergolic.ai/contact-us/ if you would like to join them.
Actual Code
How can a Data Scientist refactor Jupyter notebooks towards production-quality code?
Warmup for the Titanic post.
Go from a notebook to a well-structured system.
Code available on GitHub.
This, the two PyData talks, and the PPE course below form the core of our corporate training program on Code Quality for Data Scientists.
Extended, of course, wildly extended.
8x2 contact hours, 2/3 with hands-on programming for the participants.
How did I change my mind about dataclasses in ML projects?
TL;DR: It’s about pydantic. It is useful.
5 Minimalist Tips for Data Scientists to reduce frustration while working with Pandas
Don’t use pandas, but if you do use it, here are 5 tips to keep your sanity.
Quick content, fast takeaways:
The most important code structuring advice
Spoiler: It’s Dependency Injection.
If you only read one article from this page, read this one. I am kidding; read all of them; this is not a pick-and-mix. If you focus on a single part of a complex system, you won’t reap the benefits.
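For readers who have not met Dependency Injection before, here is a minimal sketch of the idea in a data science setting. It is an illustration under my own assumptions, not code from the article: the names DataSource, InMemorySource and MeanPipeline are hypothetical. The point it tries to show is that the pipeline receives its data source from the outside instead of creating it internally, so the same logic can run against a database in production and a plain list in a test.

```python
from dataclasses import dataclass
from typing import List, Protocol


class DataSource(Protocol):
    """Anything that can hand over a list of numbers (hypothetical interface)."""

    def load(self) -> List[float]: ...


class InMemorySource:
    """A stand-in source for tests; a real one might read a file or a database."""

    def __init__(self, values: List[float]) -> None:
        self._values = values

    def load(self) -> List[float]:
        return list(self._values)


@dataclass
class MeanPipeline:
    # The dependency is passed in (injected) rather than constructed here.
    source: DataSource

    def run(self) -> float:
        values = self.source.load()
        return sum(values) / len(values)


# The caller decides which source to wire in.
pipeline = MeanPipeline(source=InMemorySource([1.0, 2.0, 3.0]))
result = pipeline.run()
```

Swapping InMemorySource for a class that reads from storage would require no change to MeanPipeline, which is the property that makes this kind of code easier to test and refactor.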
Python Project Essentials course for Data Scientists
If you want to learn about packaging, buy my course…
If you can’t afford it, email me for a discount; if you can’t email me, at least read this article.
This post is a product of several years of writing combined with even more years of consulting, writing code, and solving the same problems again and again.
I can’t promise I will be this productive in the future, but if you liked this content, please subscribe. And, please, share this post with interested parties to spread the message.
If you want to contribute, join our Discord: https://cq4ds.com/