Slides for my talk at PyData London 2022

"Clean Architecture: How to structure your ML projects to reduce technical debt"

Jun 20, 2022

[Edit3]

Join our Discord community, “Code Quality for Data Science (CQ4DS)”, to learn more about the topic: https://discord.gg/8uUZNMCad2. All DSes are welcome regardless of skill level!

[Edit2]

PyData London returned with a three-day event at the usual place after a pandemic-related hiatus. I had the opportunity to present on a topic related to our key public mission:

“Increasing Machine Learning productivity through improved coding practices for Data Scientists.”

[Edit]:

The recording of the talk is available on YouTube: https://www.youtube.com/watch?v=QXfsS-ZOeyA

The title of my talk:

Clean Architecture: How to structure your ML projects to reduce technical debt

I’ve been asked for the slides by multiple people. See them at the end of the post. I will share the recording as well when it is available.

I started with a (just to be clear - fake) quote from Clausewitz, to the great amusement of the audience.

Quick summary of slides:

What do we mean by “ML products”?
Why does tech debt matter in ML?
How does the ML lifecycle affect tech debt?
Tech Debt vs Tech Mess (This slide was received with a significant amount of laughter)
What is refactoring?
What is Experimental-Operational Symmetry (EOS)?
What is decoupling?
Inversion of Control and Dependency Injection
Clean Architecture in Production
Three Useful Design Patterns (Adapter/Factory/Strategy)
Workflow for building a system from scratch
Interoperability with Jupyter notebooks
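
To make a few of these topics concrete, here is a minimal sketch of how domain dataclasses, the Adapter, Strategy, and Factory patterns, and Dependency Injection can fit together. It is not taken from the slides: every name in it (`Customer`, `InMemoryCustomerSource`, `ThresholdScoring`, `ScoringPipeline`, `build_pipeline`) and the toy scoring rule are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Protocol


@dataclass(frozen=True)
class Customer:
    """Hypothetical domain entity; the fields are illustrative only."""
    customer_id: str
    monthly_spend: float


class CustomerSource(Protocol):
    """Interface for anything that can supply Customer entities."""
    def load(self) -> List[Customer]: ...


class ScoringStrategy(Protocol):
    """Interface for interchangeable scoring behaviour (Strategy)."""
    def score(self, customer: Customer) -> float: ...


class InMemoryCustomerSource:
    """Adapter: converts raw rows (plain dicts here) into domain entities."""
    def __init__(self, rows: Iterable[dict]) -> None:
        self._rows = list(rows)

    def load(self) -> List[Customer]:
        return [Customer(str(r["id"]), float(r["spend"])) for r in self._rows]


class ThresholdScoring:
    """A toy strategy: 1.0 if spend reaches the threshold, else 0.0."""
    def __init__(self, threshold: float) -> None:
        self._threshold = threshold

    def score(self, customer: Customer) -> float:
        return 1.0 if customer.monthly_spend >= self._threshold else 0.0


class ScoringPipeline:
    """Core logic; its dependencies are injected, never constructed here."""
    def __init__(self, source: CustomerSource, strategy: ScoringStrategy) -> None:
        self._source = source
        self._strategy = strategy

    def run(self) -> Dict[str, float]:
        customers = self._source.load()
        return {c.customer_id: self._strategy.score(c) for c in customers}


def build_pipeline(rows: Iterable[dict], threshold: float = 100.0) -> ScoringPipeline:
    """Factory: the single place where concrete classes are wired together."""
    return ScoringPipeline(InMemoryCustomerSource(rows), ThresholdScoring(threshold))


if __name__ == "__main__":
    pipeline = build_pipeline([{"id": 1, "spend": 150.0}, {"id": 2, "spend": 20}])
    print(pipeline.run())
```

Because `ScoringPipeline` only knows the two interfaces, the in-memory source could in principle be swapped for an adapter that reads from a warehouse table, and the same pipeline object could be driven from a notebook or from a production job without changes to the core logic.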

Slides can be downloaded from here:

PyData London 2022 Slides Substack

3.19MB ∙ PDF file

I write on the topic regularly; if you would like to read more on this, please take a look at my about page: [link]

Gabriel Barros

Sep 4, 2022

Hey Laszlo,

I really enjoyed your presentation, and I've decided to apply some principles in my current workflow, especially following the order stated in the "Workflow from Scratch" slide.

However, I don't know if the first principle is applicable in my use case.

Let's say that my work is "bounded" in a box that receives input data stored in a BigQuery Table (entities and features transformed in DataForm) and also outputs a BigQuery table (reverse ETL for batch style scoring).

Currently, I am just reading this big input table as a pandas dataframe using the BigQuery Python client API, doing tabular ML, and writing to a buffer table that is then tested and copied to its final location.

Questions:

1) In this somewhat simple setup, what would I gain by "decoupling" the rows into domain entities and storing them in Python dataclasses?

2) How would I deal with the fact that I have several features (and a lot more combinations) for each entity? Should I define them all in the dataclass initialization? And then how would I store this? In Python objects using pickle? Right now I have some dataprep steps where I "cache" data in parquets.

It seems to me that my mind is always defaulting to an Airflow DAG style of data preparation steps in which the data modeling (in a star schema sense) is delegated to the decisions made in the Data Warehouse, so I don't know how I can use the first step ("define your dataclasses") in my day-to-day code, or even if I should use it.

Thanks!

2 replies by Laszlo Sragner and others

Matteo Latinov

Jul 14, 2022

Hey Laszlo,

Great talk! I went to see the YouTube video (for anyone interested: https://www.youtube.com/watch?v=QXfsS-ZOeyA). In general, I really appreciate the content you put out: it is very clear and practical, and I feel it fills a knowledge gap I had when developing solid ML systems.

Just a few follow up questions if I may:

1. You mention wrapping data in domain-specific data classes. Could you give us a few examples of how you do this?

2. There was an interesting question regarding the use of pandas for data cleaning, and if I understand correctly, you recommend avoiding dataframes in your ML workflow. Why is this exactly? What do you use in place of pandas for data cleaning operations? I always thought of using the strategy pattern with concrete classes that run pandas chained operations on dataframes, but this might be the wrong way to look at things.

Many thanks for your valuable content!

Deliberate Machine Learning