3 Ways Domain Data Models help Data Science…

Laszlo Sragner

Mar 13, 2022


6 Comments

Matteo Latinov

Aug 2, 2022

Hey Laszlo,

I've been thinking a lot about DDD and its application to data science/ML systems as described in your article. You and I have exchanged messages here and there, but I thought that sharing a more comprehensive comment here on your Substack could prove beneficial to others as well!

What I am trying to wrap my head around is how to apply DDD depending on the context at hand. Thanks to the material you've been sharing online, I now understand the benefits of code quality and clean architecture and, more specifically, of using Python dataclasses to work with data and to decouple the DS/ML code from infrastructure/storage details (pandas or other).

However, I am finding it hard to understand how far to take the DDD approach in terms of defining data models (and their behaviours?) in the code. In other words, when is it right to try to model the domain problem in an ML system using concepts like Entities and Value Objects and their relationships/behaviours? How far should one take this approach depending on the context?

To be more specific, a current application I am working on is a surrogate model for predicting emissions in a diesel engine. I am working with a simple .csv file as raw input data (common scenario to have input data in tabular form I reckon). The file contains rows that essentially represent engine operating points (speed and torque setpoints) and columns that represent measurements of various engine parameters (~ 50-100 params) at those given operating points (e.g. fuel flow rate, air flow rate, fuel temperature, emissions, etc.).

This seems to be a kind of problem where there isn't much business logic/interactions to be captured in the code but only data to be stored in dataclasses (i.e. anemic objects). I am essentially working with static measurements that I'd like to make predictions on. I am not really trying to describe and/or map the behaviour of a combustion reaction or anything of the like (should I be?).

A possible code implementation here could be to create a dataclass called OperatingPoint which represents an Entity of the domain. This dataclass, however, would contain 50+ fields, which raises the question: should I be dividing this up into other entities and/or value objects (e.g. Fuel, Engine, ExhaustGas, etc.)? What is the real benefit of doing that? What real advantages might that bring if there is no real behaviour or no real interactions between the Entities that I can reproduce in the code?
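To make the question concrete, here is a minimal sketch of the split version, under stated assumptions: the column names (`speed`, `torque`, `fuel_flow_rate`, `fuel_temperature`, `air_flow_rate`, `emissions`), the `FuelState` and `AirState` value objects, and the sample rows are all hypothetical, and only a handful of the 50+ parameters are shown. The rest of the code only sees `OperatingPoint`, not the .csv layout.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FuelState:
    """Hypothetical value object grouping fuel-related measurements."""
    flow_rate: float
    temperature: float


@dataclass(frozen=True)
class AirState:
    """Hypothetical value object grouping air-related measurements."""
    flow_rate: float


@dataclass(frozen=True)
class OperatingPoint:
    """One row of the input file: setpoints plus grouped measurements."""
    speed: float
    torque: float
    fuel: FuelState
    air: AirState
    emissions: float


def operating_point_from_row(row: Dict[str, str]) -> OperatingPoint:
    # The only place that knows the (assumed) column names of the file.
    return OperatingPoint(
        speed=float(row["speed"]),
        torque=float(row["torque"]),
        fuel=FuelState(
            flow_rate=float(row["fuel_flow_rate"]),
            temperature=float(row["fuel_temperature"]),
        ),
        air=AirState(flow_rate=float(row["air_flow_rate"])),
        emissions=float(row["emissions"]),
    )


def load_operating_points(csv_text: str) -> List[OperatingPoint]:
    # Parse CSV text into domain objects; storage details stop here.
    reader = csv.DictReader(io.StringIO(csv_text))
    return [operating_point_from_row(row) for row in reader]


# Made-up sample data, for illustration only.
SAMPLE_CSV = """speed,torque,fuel_flow_rate,fuel_temperature,air_flow_rate,emissions
1500,200,12.5,40.0,300.0,0.8
2000,250,18.0,42.5,410.0,1.1
"""

points = load_operating_points(SAMPLE_CSV)
```

Even with no behaviour on the objects, the grouping keeps related fields together and confines a renamed column to a single function. Whether that is worth the extra classes is exactly the judgment call raised above.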

It might simply be a question of semantics here: using Python dataclasses to work with data in your code has benefits whether or not you see it as a DDD practice.

(I found an interesting take on the topic: https://softwareengineering.stackexchange.com/questions/411638/applying-domain-driven-design-to-an-analysis-driven-domain)

I hope this can lead to an interesting discussion that could help me and others gain a little more perspective on the application of DDD depending on the problem at hand.

Thanks again for all the content you put out, I truly believe this is a critical topic in our field.


Laszlo Sragner

Aug 7, 2022

Interesting use case. First: these posts are recommendations, not laws, so please treat them as such. Second: the primary goal is to solve the problem; the secondary goal is to keep technical debt under control, which comes down to how quickly you can change your system to try things out (this includes models and the data transformations and evaluations that surround them). A good domain model supports this work: things that belong together and operate on each other are "close" to each other, and parts of the data that change separately are decoupled/isolated.

I recommend making a first attempt and making sure you have testability. Then try to change your code and see where it "resists": that is potential tech debt. Think about refactoring it so that the change becomes easier and this resistance is reduced.

This will allow the code and the needed changes to drive your coding rather than prior cognitive load.
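As a rough illustration of what "testability" could mean for a first attempt, the sketch below pins down one small transformation with plain assertions so that later refactoring can be checked against it. The `Measurement` class, the `air_fuel_ratio` function and the numbers are hypothetical examples, not something taken from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical slice of one operating point's measurements."""
    air_flow_rate: float
    fuel_flow_rate: float


def air_fuel_ratio(m: Measurement) -> float:
    # A small derived feature; isolated so it can change independently.
    if m.fuel_flow_rate <= 0:
        raise ValueError("fuel_flow_rate must be positive")
    return m.air_flow_rate / m.fuel_flow_rate


def test_air_fuel_ratio_known_value():
    # 300.0 / 12.5 is exactly 24.0 in floating point.
    m = Measurement(air_flow_rate=300.0, fuel_flow_rate=12.5)
    assert air_fuel_ratio(m) == 24.0


def test_air_fuel_ratio_rejects_zero_fuel_flow():
    m = Measurement(air_flow_rate=300.0, fuel_flow_rate=0.0)
    try:
        air_fuel_ratio(m)
    except ValueError:
        return
    raise AssertionError("expected ValueError for zero fuel flow")
```

The test functions follow the pytest naming convention but use only plain `assert`, so they can also be called directly. If a planned change forces many such tests to be rewritten at once, that is one sign of the "resistance" described above.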


Matteo Latinov

Aug 13, 2022

Thanks for the feedback, this makes sense; mustn't miss the forest for the trees. I will continue iterating towards the best solution.


Matteo Latinov

Jul 14, 2022

Hey Laszlo,

I am very interested in this topic. Do you have any additional resources to get started on this? Or are you planning on writing an article series on this?

Many thanks for your content as always!

Matteo
