6 Comments
Sep 22, 2022 · edited Sep 22, 2022

I don't know whether this is really relevant here, but how do you manage feature engineering with dataclasses? At the beginning of a project you have raw data and you work on defining dataclasses that better represent the domain problem. What's your approach when it comes to the iterative nature of feature engineering? Do you load the dataclasses in a notebook, work on features and modeling there, and later refactor any interesting new features into the existing dataclasses (e.g. add or redefine fields)?

Thanks Laszlo!

Matteo

author

I see I haven't answered the technical parts:

I usually construct the modelling task in a notebook, plugging in the FeatureStrategy that converts domain classes to features. I might implement a custom strategy in the notebook itself and move it to the repo later if it turns out to be useful. (You might need multiple ones at the same time for testing.)

I try not to change the domain classes, as they belong to the domain. If there is an information need that belongs to a domain class, I try to figure out what their combination "means": is it a sign of a new domain entity? Maybe the strategies should take that combined class as an input. But then what are the implications? Do I need new adapters? New factories?

If this gets messy, I do a refactoring loop where I think only about the implementation, not the structure or added value. That's enabled by the refactoring/testing loop.
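One possible reading of the "combined class" idea above, as a minimal sketch: the existing domain classes stay untouched, a new entity wraps their combination, a factory builds it, and a strategy takes it as input. Every name here (`Customer`, `Order`, `CustomerHistory`, `from_parts`, `SpendFeatures`) is hypothetical and not taken from the post or the author's repo.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Customer:  # hypothetical existing domain class, left unchanged
    customer_id: str
    signup_year: int


@dataclass(frozen=True)
class Order:  # hypothetical existing domain class, left unchanged
    customer_id: str
    amount: float


@dataclass(frozen=True)
class CustomerHistory:
    """Hypothetical new domain entity: a customer together with their orders."""

    customer: Customer
    orders: tuple[Order, ...]

    @classmethod
    def from_parts(cls, customer: Customer, orders: Iterable[Order]) -> "CustomerHistory":
        # Factory: keep only the orders that belong to this customer.
        own = tuple(o for o in orders if o.customer_id == customer.customer_id)
        return cls(customer=customer, orders=own)


class SpendFeatures:
    """Hypothetical strategy that takes the combined entity as its input."""

    def to_features(self, history: CustomerHistory) -> list[float]:
        total = sum(o.amount for o in history.orders)
        return [float(history.customer.signup_year), float(len(history.orders)), total]


customer = Customer("c1", 2020)
orders = [Order("c1", 10.0), Order("c2", 99.0), Order("c1", 5.5)]
history = CustomerHistory.from_parts(customer, orders)
features = SpendFeatures().to_features(history)
```

The original `Customer` and `Order` classes are never edited; the new information need lives in the new entity and its factory, which is where the "new adapters, new factories" question shows up.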

Sep 22, 2022 · Liked by Laszlo Sragner

Yep, thanks for taking the time here. I follow a similar pattern: I have a DataLoader class that takes in the domain model and outputs datasets for modeling. It reformats the data (e.g. a list of lists as input to sklearn) and splits the dataset into train and test sets. I guess I could add feature engineering in this layer, possibly in a different class.
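A rough sketch of what such a DataLoader might look like, using only the standard library. The `Listing` class, its fields, the `load` method, and the `test_fraction` and `seed` parameters are all assumptions made for illustration; the comment does not describe the actual implementation.

```python
import random
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Listing:  # hypothetical domain class
    area_m2: float
    rooms: int
    price: float  # used as the prediction target in this sketch


Dataset = tuple[list[list[float]], list[float]]  # (X as list of lists, y)


class DataLoader:
    """Turns domain objects into train/test datasets for modeling."""

    def __init__(self, test_fraction: float = 0.25, seed: int = 0) -> None:
        self.test_fraction = test_fraction
        self.seed = seed

    def load(self, listings: Iterable[Listing]) -> tuple[Dataset, Dataset]:
        rows = list(listings)
        # Seeded shuffle so the split is reproducible between runs.
        random.Random(self.seed).shuffle(rows)
        n_test = int(len(rows) * self.test_fraction)
        test, train = rows[:n_test], rows[n_test:]
        return self._format(train), self._format(test)

    @staticmethod
    def _format(rows: list[Listing]) -> Dataset:
        # One inner list per sample, the row layout sklearn estimators accept.
        X = [[r.area_m2, float(r.rooms)] for r in rows]
        y = [r.price for r in rows]
        return X, y


listings = [Listing(40.0 + 5 * i, 1 + i % 3, 100_000.0 + 10_000 * i) for i in range(8)]
(X_train, y_train), (X_test, y_test) = DataLoader().load(listings)
```

In this sketch `_format` is the natural place where a separate feature-engineering class could be plugged in later.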

author

The key is flexibility where you need flexibility. If you need something in a different context, it is probably worth decoupling. But usually you can't make this decision until you actually need it, so you need to know what has to be done (which pattern, and how to implement it) to make it happen at refactoring time.

author

This is an interesting question. I usually think of that part of the code as two contexts: the Domain Context (DC) and the Model Context (MC). The DC contains the business-specific entities and their transformations, and the MC contains the actual structure that can be fed to a model (i.e. features in array/vector/other form).

The boundary between them is domain class -> features and output -> domain class.

The first can be implemented with Strategies and then swapped in and out depending on your model's needs. These strategies operate on the same Domain model but extract the features in different ways. Once you have isolated it like this and have a couple of examples, better structuring can emerge.
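As an illustration of that boundary, here is a minimal sketch of two interchangeable strategies over the same domain class, plus a mapping from model output back into the domain. Only the name FeatureStrategy appears in this thread; `Customer`, `RawFeatures`, `AverageOrderFeatures`, `ChurnPrediction`, `build_matrix`, `to_domain`, and the 0.5 threshold are made up for the example.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Customer:  # hypothetical domain class (Domain Context)
    age: int
    orders: int
    total_spent: float


@dataclass(frozen=True)
class ChurnPrediction:  # hypothetical domain class for model output
    will_churn: bool
    score: float


class FeatureStrategy(Protocol):
    """Boundary: domain class -> features (Model Context)."""

    def to_features(self, customer: Customer) -> list[float]: ...


class RawFeatures:
    def to_features(self, customer: Customer) -> list[float]:
        return [float(customer.age), float(customer.orders), customer.total_spent]


class AverageOrderFeatures:
    def to_features(self, customer: Customer) -> list[float]:
        # Engineered feature: average spend per order, 0.0 when there are no orders.
        avg = customer.total_spent / customer.orders if customer.orders else 0.0
        return [float(customer.age), avg]


def build_matrix(customers: list[Customer], strategy: FeatureStrategy) -> list[list[float]]:
    # The caller never changes; only the strategy is swapped.
    return [strategy.to_features(c) for c in customers]


def to_domain(score: float, threshold: float = 0.5) -> ChurnPrediction:
    """Boundary: model output -> domain class (threshold is an assumption)."""
    return ChurnPrediction(will_churn=score >= threshold, score=score)


customers = [Customer(30, 4, 200.0), Customer(45, 0, 0.0)]
raw_matrix = build_matrix(customers, RawFeatures())
avg_matrix = build_matrix(customers, AverageOrderFeatures())
prediction = to_domain(0.8)
```

Both strategies read the same `Customer` objects, so trying a new feature idea means writing a new strategy class rather than editing the domain model.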


Ah I see, these are good insights. The strategy can perform feature engineering and data preprocessing/formatting for the models depending on which models are being used (e.g. input formats for sklearn models differ from those of PyTorch models).

So I understand you think of this stage not as part of the innermost layer in your Clean Architecture diagram (https://laszlo.substack.com/p/clean-architecture-in-data-science), but rather as the layer just above/around it (in blue).

Thanks!
