Hey Laszlo,
I really enjoyed your presentation, and I've decided to apply some principles in my current workflow, especially following the order stated in the "Workflow from Scratch" slide.
However, I don't know if the first principle is applicable in my use case.
Let's say that my work is "bounded" in a box that receives input data stored in a BigQuery table (entities and features transformed in DataForm) and also outputs a BigQuery table (reverse ETL for batch-style scoring).
Currently, I am just reading this big input table as a pandas dataframe using the BigQuery Python client API, doing tabular ML, and writing to a buffer table that is then tested and copied to its final location.
Questions:
1) In this somewhat simple setup, what would I gain by "decoupling" the rows into domain entities and storing them in Python dataclasses?
2) How would I deal with the fact that I have several features (and a lot more combinations) for each entity? Should I define them all in the dataclass initialization? And how would I then store this? As Python objects using pickle? Right now I have some data-prep steps where I "cache" data in Parquet files.
It seems to me that my mind always defaults to an Airflow-DAG style of data preparation steps, in which the data modeling (in a star-schema sense) is delegated to the decisions made in the data warehouse, so I don't know how I can use the first step ("define your dataclasses") in my day-to-day code, or even if I should use it.
Thanks!
What triggers the process? I am fairly sure Airflow is overkill here.
How big are those tables?
Dataclasses help you a lot if you need to do convoluted preprocessing steps. The resulting Python code will look cleaner and more Pythonic, and it will be easier to maintain.
The first step I would take is to switch the data processing to row-based processing using the itertuples() iterator. If you refactor the code this way, the result will be very close to how your code will look once you have a dataclass-defined data model.
(Of course, this is only relevant if you actually have problems with technical debt and a lot of transformation steps; if all you do is put the dataframe into a sklearn pipeline and return the data, then there is little upside.)
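To make this concrete, here is a minimal sketch of the kind of refactor I mean; the table, the columns (customer_id, age, spend) and the toy scoring rule are made up for illustration:
```python
from dataclasses import dataclass

import pandas as pd

# Made-up stand-in for the big BigQuery table read into pandas.
df = pd.DataFrame({"customer_id": [1, 2], "age": [34, 51], "spend": [120.0, 80.5]})

# Step 1: row-based processing with itertuples() instead of column-wise operations.
def score_row(row) -> float:
    # toy logic standing in for the real feature/scoring code
    return row.spend / max(row.age, 1)

scores = [score_row(row) for row in df.itertuples(index=False)]

# Step 2: the same logic once a dataclass-defined data model exists;
# the code barely changes, which is the point of doing step 1 first.
@dataclass
class Customer:
    customer_id: int
    age: int
    spend: float

    def score(self) -> float:
        return self.spend / max(self.age, 1)

customers = [Customer(**rec) for rec in df.to_dict(orient="records")]
scores_again = [c.score() for c in customers]
```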
Those tables are too big to loop over in Python, so I try to use all the vectorized methods that I can, as well as leveraging BigQuery window functions with the RANGE clause (something that is slow in Python/pandas because of the non-fixed window size, especially with non-numerical columns).
Another thing that I must do is target encoding, which is by definition group-based processing, so it would be troublesome to ditch pandas in this step.
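Concretely, the group-based step I mean looks roughly like this (a toy sketch; the column names are made up, not my real schema):
```python
import pandas as pd

# Toy target encoding: replace a categorical column with the mean of the
# target within each category.
df = pd.DataFrame({"city": ["a", "a", "b", "b", "b"], "target": [1, 0, 1, 1, 0]})
df["city_te"] = df.groupby("city")["target"].transform("mean")
```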
Looping through pandas/numpy can be fast if one uses numba, though, so I think it is worth a try. I will resume work from step 2 onwards, and I will comment on your blog as new insights come to the surface.
Hey Laszlo,
Great talk, I went to see the YouTube video (for anyone interested: https://www.youtube.com/watch?v=QXfsS-ZOeyA). In general, I really appreciate the content you put out; it is very clear and practical, and I feel it fills a knowledge gap I had when developing solid ML systems.
Just a few follow up questions if I may:
1. You mention wrapping data in domain-specific data classes. Could you give us a few examples of how you do this?
2. There was an interesting question regarding the use of pandas for data cleaning, and if I understand correctly, you recommend avoiding dataframes in your ML workflow? Why is this exactly? How do you substitute the data-cleaning operations you would otherwise do with pandas? I always thought of using the strategy pattern with concrete classes that run chained pandas operations on dataframes, but this might be the wrong way to look at things.
Many thanks for your valuable content!
Good point, I should edit the article to add the recording. Thanks for the feedback, much appreciated!
1. Often when you do your EDA/POC you think in terms of dataframes or dictionaries, but you still need to write a lot of code that moves the data around. Then in MVP/ABC/BAU you need to rip this out and transform this into deployable code.
Instead, turn this around: treat each row as a (data)class and anything that has an id as an entity. When you want to do analysis, turn these objects into a dataframe, but otherwise your code should depend on the properties of the objects. For example, when we did NLP we had an Article class, a Sentence class, and a Token class, but also Country, Person, Company, and other classes. This made our logistics code much cleaner, which was especially useful when we needed to (constantly) change it over time.
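A minimal sketch of what such classes can look like (the fields here are illustrative, not our actual data model):
```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Token:
    text: str
    lemma: str

@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)

    @property
    def text(self) -> str:
        return " ".join(t.text for t in self.tokens)

@dataclass
class Article:
    article_id: str
    title: str
    sentences: List[Sentence] = field(default_factory=list)

# For analysis you can still flatten entities into a dataframe, e.g.
# pd.DataFrame([{"article_id": a.article_id, "title": a.title} for a in articles]),
# but the rest of the code depends only on the objects' properties.
```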
2. The problem with pandas is that it chains your entities (the rows) together. When you want to ship any logic to prod, the same logic needs to run on individual entities (API calls), so you need to rewrite whatever know-how you gained into that form. And rewrites are very costly and error-prone. It is better to write the logic immediately in the form you will ship it in; the speed benefits of pandas are marginal, and if you have real speed issues you shouldn't be using pandas in the first place.
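Roughly what I mean, with a made-up Order entity and a toy rule: the same per-entity function serves both the prod API call and the batch/EDA path.
```python
from dataclasses import dataclass

import pandas as pd

@dataclass
class Order:
    order_id: int
    amount: float
    country: str

def is_high_risk(order: Order) -> bool:
    # in prod this runs on a single entity per API call
    return order.amount > 1000 and order.country not in {"GB", "DE"}

def score_batch(df: pd.DataFrame) -> pd.Series:
    # in EDA / batch scoring the same function is just looped over entities
    orders = (Order(**rec) for rec in df.to_dict(orient="records"))
    return pd.Series([is_high_risk(o) for o in orders], index=df.index)
```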
One design principle we use is to think about what the end result should be and work backwards to EDA/POC: what is the best course of action now that will be easiest to ship eventually?
Also, one of the reasons we focus on EOS is that being in some "semi-productionised" state (internal launch, working with design partners) is much more common than is currently perceived. So you need to have a productionised mindset much earlier on, which makes the benefit of speed improvements in EDA diminish. I hope this helps.
Thanks for your comment and your attention again!
This is fascinating stuff, eye-opening to say the least. I'm starting to pick up on a few things now, checking out content left and right (including a lot of yours). Just a few remarks:
2. This whole reasoning makes a lot of sense to me. Why would you say that the same logic __needs__ to be run on individual entities (API calls) as opposed to being applied in a vectorized manner, just like in an experimental phase with pandas? Is this simply because we are coupling ourselves to pandas, which is essentially an infrastructure tool?
Thank you Laszlo!