
Hey Laszlo,

I really enjoyed your presentation, and I've decided to apply some of its principles to my current workflow, especially following the order laid out in the "Workflow from Scratch" slide.

However, I don't know if the first principle is applicable in my use case.

Let's say that my work is "bounded" in a box that receives input data stored in a BigQuery table (entities and features transformed in Dataform) and also outputs a BigQuery table (reverse ETL for batch-style scoring).

Currently, I am just reading this big input table into a pandas DataFrame using BigQuery's Python client API, doing tabular ML, and writing the results to a buffer table that is then tested and copied to its final location.
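Roughly, the shape of it is something like this sketch (the table names and the model are placeholders, not my real code):

```python
from google.cloud import bigquery
import joblib

client = bigquery.Client()

# Read the (big) input table into a pandas DataFrame.
features = client.query(
    "SELECT * FROM `my-project.my_dataset.input_features`"  # placeholder table
).result().to_dataframe()

# Tabular ML on the DataFrame (the model file is a placeholder).
model = joblib.load("model.joblib")
features["score"] = model.predict(features.drop(columns=["entity_id"]))

# Write scores to a buffer table; a later step tests it and copies it
# to its final location (the reverse-ETL part).
client.load_table_from_dataframe(
    features[["entity_id", "score"]],
    "my-project.my_dataset.scores_buffer",  # placeholder table
).result()
```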

Questions:

1) In this somewhat simple setup, what would I gain by "decoupling" the rows into domain entities and storing them in Python dataclasses?

2) How would I deal with the fact that I have several features (and a lot more combinations) for each entity? Should I define them all in the dataclass initialization? And how would I then store them? As Python objects using pickle? Right now I have some data-prep steps where I "cache" data in Parquet files. (A sketch of what I imagine is just below.)
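To make question 2 concrete, here is roughly what I picture; the entity and field names are made up, not my real schema:

```python
from dataclasses import dataclass, asdict, fields
import pandas as pd

@dataclass(frozen=True)
class CustomerFeatures:  # made-up entity; my real ones have many more fields
    customer_id: str
    days_since_last_order: int
    lifetime_value: float
    is_active: bool

def cache_to_parquet(entities: list[CustomerFeatures], path: str) -> None:
    """Cache a batch of entities as a Parquet file (my current 'cache' step)."""
    pd.DataFrame([asdict(e) for e in entities]).to_parquet(path)

def load_from_parquet(path: str) -> list[CustomerFeatures]:
    """Load the cached batch back into dataclasses."""
    df = pd.read_parquet(path)
    cols = [f.name for f in fields(CustomerFeatures)]
    return [CustomerFeatures(**row) for row in df[cols].to_dict("records")]
```

(Here the dataclass would act as the contract between the data-prep steps and the model step, with Parquet staying as just a cache format.)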

It seems to me that my mind always defaults to an Airflow-DAG style of data preparation steps, in which the data modeling (in a star-schema sense) is delegated to decisions made in the data warehouse. So I don't know how I can use the first step ("define your dataclasses") in my day-to-day code, or even if I should.

Thanks!


Hey Laszlo,

Great talk! I went and watched the YouTube video (for anyone interested: https://www.youtube.com/watch?v=QXfsS-ZOeyA). In general, I really appreciate the content you put out; it is very clear and practical, and I feel it fills a knowledge gap I had when it comes to developing solid ML systems.

Just a few follow-up questions, if I may:

1. You mention wrapping data in domain-specific dataclasses. Could you give us a few examples of how you do this?

2. There was an interesting question regarding the use of pandas for data cleaning, and if I understand correctly, you recommend avoiding dataframes in your ML workflow? Why is this, exactly? What do you use instead of pandas for data-cleaning operations? I have always thought of using the strategy pattern, with concrete classes that run chained pandas operations on dataframes (roughly the sketch after this question), but this might be the wrong way to look at things.
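To illustrate what I mean by the strategy pattern here (the class and column names are illustrative, not from the talk):

```python
from abc import ABC, abstractmethod
import pandas as pd

class CleaningStrategy(ABC):
    """One interchangeable data-cleaning step."""

    @abstractmethod
    def clean(self, df: pd.DataFrame) -> pd.DataFrame: ...

class DropMissingTargets(CleaningStrategy):
    def __init__(self, target: str) -> None:
        self.target = target

    def clean(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.dropna(subset=[self.target])

class ClipOutliers(CleaningStrategy):
    def __init__(self, column: str, lower: float, upper: float) -> None:
        self.column, self.lower, self.upper = column, lower, upper

    def clean(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.assign(**{self.column: df[self.column].clip(self.lower, self.upper)})

def run_cleaning(df: pd.DataFrame, steps: list[CleaningStrategy]) -> pd.DataFrame:
    """Apply the configured strategies in order, each one a chained pandas op."""
    for step in steps:
        df = step.clean(df)
    return df

# e.g. run_cleaning(raw_df, [DropMissingTargets("churned"), ClipOutliers("age", 0, 100)])
```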

Many thanks for your valuable content!
