I've been thinking a lot about DDD and its application to data science/ML systems as described in your article. You and I have exchanged messages here and there, but I thought that sharing a more comprehensive comment here on your Substack could prove beneficial to others as well!
What I am trying to wrap my head around is how to apply DDD depending on the context at hand. Thanks to the material you've been sharing online, I now understand the benefits of code quality and clean architecture and, more specifically, of using Python dataclasses to work with data and decouple the DS/ML code from infrastructure/storage details (pandas or other).
However, I am finding it hard to understand how far to take the DDD approach in terms of defining data models (and their behaviours?) in the code. In other words, when is it right to try to model the domain problem in an ML system using concepts like Entities and Value Objects and their relationships/behaviours? How far should one take this approach depending on the context?
To be more specific, a current application I am working on is a surrogate model for predicting emissions in a diesel engine. I am working with a simple .csv file as raw input data (common scenario to have input data in tabular form I reckon). The file contains rows that essentially represent engine operating points (speed and torque setpoints) and columns that represent measurements of various engine parameters (~ 50-100 params) at those given operating points (e.g. fuel flow rate, air flow rate, fuel temperature, emissions, etc.).
This seems to be the kind of problem where there isn't much business logic or interaction to be captured in the code, only data to be stored in dataclasses (i.e. anemic objects). I am essentially working with static measurements that I'd like to make predictions on. I am not really trying to describe and/or map the behaviour of a combustion reaction or anything of the sort (should I be?).
A possible code implementation here could be to create a dataclass called OperatingPoint which represents an Entity of the domain. This dataclass, however, would contain 50+ fields, which raises the question: should I be dividing this up into other entities and/or value objects (e.g. Fuel, Engine, ExhaustGas, etc.)? What is the real benefit of doing that? What real advantages might that bring if there is no real behaviour or no real interactions between the Entities that I can reproduce in the code?
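To make the question concrete, here is a minimal sketch of what that split might look like. Everything in it is an assumption for illustration: the field names, the units, the column headers and the `from_row` helper are hypothetical, only three or four of the 50+ parameters are shown, and Engine is left out for brevity. It is not taken from the article or from my actual dataset.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Fuel:
    """Value object grouping fuel-side measurements (hypothetical fields)."""
    flow_rate_kg_h: float
    temperature_c: float


@dataclass(frozen=True)
class ExhaustGas:
    """Value object grouping emission measurements (hypothetical field)."""
    nox_ppm: float


@dataclass(frozen=True)
class OperatingPoint:
    """One row of the file: speed/torque setpoints plus grouped measurements."""
    speed_rpm: float
    torque_nm: float
    fuel: Fuel
    exhaust: ExhaustGas

    @classmethod
    def from_row(cls, row: dict) -> "OperatingPoint":
        # Map one flat CSV row (strings keyed by column name) onto the nested
        # objects. The column names below are assumed, not real.
        return cls(
            speed_rpm=float(row["speed_rpm"]),
            torque_nm=float(row["torque_nm"]),
            fuel=Fuel(
                flow_rate_kg_h=float(row["fuel_flow_rate_kg_h"]),
                temperature_c=float(row["fuel_temperature_c"]),
            ),
            exhaust=ExhaustGas(nox_ppm=float(row["nox_ppm"])),
        )


# Made-up sample data standing in for the real .csv file.
SAMPLE_CSV = """speed_rpm,torque_nm,fuel_flow_rate_kg_h,fuel_temperature_c,nox_ppm
1500,200,12.5,40.0,310.0
2000,350,21.0,42.5,480.0
"""

points = [OperatingPoint.from_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
```

With this layout the ML code would read `point.fuel.flow_rate_kg_h` rather than a raw column name, so the grouping buys naming and structure, but the objects are still anemic: none of them has any behaviour.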
It might simply be a question of semantics here: using Python dataclasses to work with data in your code has benefits whether or not one sees it as a DDD practice.
(I found an interesting take on the topic: https://softwareengineering.stackexchange.com/questions/411638/applying-domain-driven-design-to-an-analysis-driven-domain)
I hope this can lead to an interesting discussion that could help me and others gain a little more perspective on the application of DDD depending on the problem at hand.
Thanks again for all the content you put out, I truly believe this is a critical topic in our field.
I am very interested in this topic. Do you have any additional resources to get started on this? Or are you planning on writing an article series on this?
Many thanks for your content as always!