Data Science is mostly about handling data (Doh, … Thanks, Captain Obvious…) in productionised environments. Despite this, Data Science teams rarely define their own Domain Data Model. Omitting this crucial step makes their processes brittle and hard to change, leaving them with hard-to-reconcile tech debt at the very beginning of their projects.
What is a Domain Data Model?
I recently gave a presentation about our new course to a group of Data Scientists at a prominent startup. The Domain Data Model (or just Domain Model for short) is a core part of our process and terminology for making DS teams more effective.
About halfway through the presentation, I saw confused faces when I kept saying: “Domain Model this and that”, so I thought I’d ask if they knew what a domain model is.
They didn’t. Lesson learned: Never assume your audience knows the terminology, hence this article.
The concept of the Domain Model comes from Domain Driven Design (DDD) and enables a ubiquitous language (meaning of “ubiquitous” - for non-native speakers like me - “appearing, or found everywhere”). Here it explicitly means that the language is shared by everyone in the project: Software Engineers, Subject Matter Experts, Data Scientists and Executives.
Entities, Values and Aggregates
The Domain Model describes Domain Objects: Entities - something with an ID, Values - a property value wrapped in a class, and Aggregates - a group of related Entities. These together represent the domain in which the problem is solved.
Through these, you explicitly name the domain terms you will use later in your modelling and analysis. You also define how these domain terms are created from the data you have access to. For example, if your company calls users clients, you will have a “Client” class. If you are a food delivery company, the people who move food from A to B will be represented by a class called “Riders”, but if your company calls them “Couriers”, then you will call them “Couriers” as well.
Once you have these Entity classes, you define their properties explicitly with Value classes and their relationships to other Entities. You also define how to build these classes from data sources that usually do not define them explicitly (e.g. logs, database tables, pandas dataframes).
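To make the three kinds of Domain Objects concrete, here is a minimal sketch using Python dataclasses. It assumes a food delivery company that says “Client” and “Courier”; the class names, fields and the simple email check are all hypothetical and only there for illustration, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Email:
    """Value: a property value wrapped in a class, checked on creation."""
    address: str

    def __post_init__(self) -> None:
        # Deliberately naive check, just to show where validation lives.
        if "@" not in self.address:
            raise ValueError(f"invalid email address: {self.address!r}")


@dataclass
class Client:
    """Entity: something with an ID (the business says 'client', so do we)."""
    client_id: int
    email: Email


@dataclass
class Courier:
    """Entity: named after the term the company actually uses."""
    courier_id: int
    name: str


@dataclass
class Delivery:
    """Aggregate: a group of related Entities handled together."""
    delivery_id: int
    client: Client
    courier: Courier


delivery = Delivery(
    delivery_id=1,
    client=Client(client_id=10, email=Email("ana@example.com")),
    courier=Courier(courier_id=7, name="Sam"),
)
```

If the business renamed “Couriers” to “Riders” tomorrow, the rename would happen in one place and the code would keep matching the way people talk in meetings.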
Let’s take a look at the benefits of using a Domain Model.
Benefit 1: Smoother analytics and coding in the Data Science project
One of the immediate effects of using a Domain Model is a conceptual separation between creating data and analysing data. The Domain Model is the interface between the two. Your analysis will depend purely on the Domain Model. You also don’t need to re-validate at each step whether your data specification still holds, because that check happened when the domain objects were created, making your analytics code simpler.
Your data processing code will be simpler because the only thing it needs to produce is domain objects. It creates them from raw data, and if it cannot, it flags it in logs or errors. Instead of the many-to-many source to analytics problem, you define a single common layer.
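As a rough sketch of this separation, the snippet below turns raw records into domain objects, logs the ones it cannot convert, and keeps the analysis function unaware of the raw format. The `Courier` class, the `id` and `name` keys of the raw rows, and the function names are assumptions made up for this example.

```python
import logging
from dataclasses import dataclass
from typing import Iterable

logger = logging.getLogger("domain_loader")


@dataclass(frozen=True)
class Courier:
    courier_id: int
    name: str


def courier_from_row(row: dict) -> Courier:
    """Build one Courier from a raw record; raise if the record is unusable."""
    name = str(row["name"]).strip()
    if not name:
        raise ValueError("empty courier name")
    return Courier(courier_id=int(row["id"]), name=name)


def load_couriers(rows: Iterable[dict]) -> list[Courier]:
    """Data processing side: raw rows in, domain objects out, failures logged."""
    couriers = []
    for row in rows:
        try:
            couriers.append(courier_from_row(row))
        except (KeyError, ValueError, TypeError) as error:
            logger.warning("skipping row %r: %s", row, error)
    return couriers


def courier_names(couriers: Iterable[Courier]) -> list[str]:
    """Analysis side: depends only on the Domain Model, never on raw rows."""
    return sorted(courier.name for courier in couriers)


raw_rows = [
    {"id": "1", "name": "Sam"},
    {"id": "x", "name": "Kim"},   # ID is not a number: logged and skipped
    {"name": "Lee"},              # ID is missing: logged and skipped
    {"id": "2", "name": " Ana "},
]
couriers = load_couriers(raw_rows)
print(courier_names(couriers))
```

A second data source (say, a database table instead of a CSV export) would only need its own `..._from_row` style function; `courier_names` and every other analysis built on `Courier` would stay untouched.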
Benefit 2: Smoother communication with Subject Matter Experts and the Executive
Data Science projects rarely appear out of context.
You will be expected to work closely with SMEs who have little or no technical knowledge. Ubiquitous language means it is shared by the entire team, not just technical personnel. A Domain Model will help you use their terminology, reducing the friction of context switching and of translating their concepts into data science concepts. You can go straight from the meeting to the codebase and implement their feedback in your code.
When you report on your project, executives will be happy to hear that the DS team speaks in the same language as the rest of the business. Having a domain model forces the DS team to be on the same page with the rest of the company.
Benefit 3: Smoother interoperation with Software and Data Engineering
DDD is a popular concept in programming. Just as Data Scientists do not work out of context, neither do Software Engineers. They probably go through the same journey and use the same ubiquitous language. Adopting this practice will reduce translation friction when you productionise your model and keep it low-effort to update based on feedback.
Summary
As you can see, using a Domain Model will be hugely beneficial to any Data Science project. Maintaining it is a high-ROI activity that pays off in communication with all external stakeholders.
Hypergolic (our consulting company) usually starts each engagement by building a domain model and defining data processing and analytics in terms of it. Get in touch to learn more about this at: https://hypergolic.co.uk/contact/
Working with Domain Models has a lot of nuance (Bounded Contexts, Context Mapping, Translation Boundaries) that cannot fit into an introductory article. I intend to write about this in the future. Please subscribe if you would like to be notified:
Hey Laszlo,
I've been thinking a lot about DDD and its application to data science/ML systems as described in your article. You and I have exchanged messages here and there, but I thought that sharing a more comprehensive comment here on your substack could prove beneficial to others as well!
What I am trying to wrap my head around is how to apply DDD depending on the context at hand. Thanks to the material you've been sharing online, I now understand the benefits of code quality, clean architecture and, more specifically, using python dataclasses to work with data and decouple the DS/ML code from infrastructure/storage details (pandas or other) and follow a clean architecture approach.
However, I am finding it hard to understand how far to take the DDD approach in terms of defining data models (and their behaviours?) in the code. In other words, when is it right to try to model the domain problem in an ML system using concepts like Entities and Value Objects and their relationships/behaviours? How far should one take this approach depending on the context?
To be more specific, a current application I am working on is a surrogate model for predicting emissions in a diesel engine. I am working with a simple .csv file as raw input data (common scenario to have input data in tabular form I reckon). The file contains rows that essentially represent engine operating points (speed and torque setpoints) and columns that represent measurements of various engine parameters (~ 50-100 params) at those given operating points (e.g. fuel flow rate, air flow rate, fuel temperature, emissions, etc.).
This seems to be a kind of problem where there isn't much business logic/interactions to be captured in the code but only data to be stored in dataclasses (i.e. anemic objects). I am essentially working with static measurements that I'd like to make predictions on. I am not really trying to describe and/or map the behaviour of a combustion reaction or anything of the like (should I be?).
A possible code implementation here could be to create a dataclass called OperatingPoint which represents an Entity of the domain. This dataclass however would contain 50+ fields, which begs the question: should I be dividing this up into other entities and/or value objects (e.g. Fuel, Engine, ExhaustGas, etc.)? What is the real benefit of doing that? What real advantages might that bring if there is no real behaviour or no real interactions between the Entities that I can reproduce in the code?
It might simply be a question of semantics here: using python dataclasses to work with data in your code has benefits regardless of seeing it as DDD practice or not.
(I found an interesting take on the topic: https://softwareengineering.stackexchange.com/questions/411638/applying-domain-driven-design-to-an-analysis-driven-domain)
I hope this can lead to an interesting discussion that could help me and others gain a little more perspective on the application of DDD depending on the problem at hand.
Thanks again for all the content you put out, I truly believe this is a critical topic in our field.
Hey Laszlo,
I am very interested in this topic. Do you have any additional resources to get started on this? Or are you planning on writing an article series on this?
Many thanks for your content as always!
Matteo