Before Easter, I attended the GAIA 2023 conference in Gothenburg to present on Clean Architecture in Data Science. I was pleased to see how much Swedish DS teams care about code quality in their work. During one of the conversations, Eric Evans’s book on Domain Driven Design was brought up, which reminded me that I never wrote about how we use domain thinking in our workflow. So here it is to fix that:
What are domains?
System decomposition and layers of abstraction are the most important system design principles. This is because humans suffer from bounded rationality: no individual can comprehend the whole system at once, so we decompose it into smaller parts and focus either on the whole system at a generic level or on the details of one component while setting aside all the others.
Domains are a tool to break up large systems at a high abstraction level, focusing on structure and semantics rather than low-level implementation details (or infrastructure and hardware).
Domains are logical parts of the code that are internally coherent and surrounded by a (virtual) boundary. Once you are inside a domain, all components (domain data model, interfaces, services) “mean” the same thing. You don’t have to worry about context switching or ambiguous implementations. Domains are also directly connected to the relevant business function, which simplifies communication with subject matter experts.
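To make this concrete, here is a minimal sketch of what “all components mean the same thing” can look like in code. The entities and field names are invented for a hypothetical churn-prediction team; the point is that the objects use the business vocabulary directly.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical DS-domain entities. The names mirror the business
# vocabulary, not a database table or a model's numerical representation.

@dataclass(frozen=True)
class Customer:
    customer_id: str
    signup_date: date

@dataclass(frozen=True)
class ChurnAssessment:
    customer: Customer
    churn_risk: float  # interpreted probability, 0.0-1.0
    assessed_on: date
```

Anyone inside this domain, whether a DS, an SME, or a piece of code, can read `ChurnAssessment` without translation.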
What happens between domains?
Domains are surrounded by a boundary that limits interaction between different domains. Given the same terminology can mean different things in different domains, this is for the better.
Eventually, communication between domains happens through translation services. These are dedicated programs written purely to ensure that the two teams on the two sides of the boundary agree on what they pass to each other and how they transform from their interpretation of their own data model to the other.
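A translation service can be as small as a single function. The sketch below assumes an imaginary App-domain record and DS-domain object (all field names and the plan-code mapping are illustrative assumptions), and shows how the boundary crossing is made explicit in one dedicated place.

```python
from dataclasses import dataclass

@dataclass
class AppUserRecord:          # how the App domain sees a user
    uid: int
    created_ts: str           # e.g. "2023-04-01T12:00:00"
    plan_code: str            # App-internal enum, e.g. "P2"

@dataclass
class DsCustomer:             # how the DS domain sees the same entity
    customer_id: str
    plan: str                 # human-readable, e.g. "premium"

# The agreement between the two teams, written down as code:
PLAN_NAMES = {"P1": "basic", "P2": "premium"}

def app_to_ds(record: AppUserRecord) -> DsCustomer:
    """Translate one App-domain record into the DS domain's vocabulary."""
    return DsCustomer(
        customer_id=f"cust-{record.uid}",
        plan=PLAN_NAMES.get(record.plan_code, "unknown"),
    )
```

Everything the two teams agreed on lives in this one module, so a change on either side of the boundary has a single place to land.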
Of course, as with any cross-functional activity, this is error-prone, sensitive to changes, requires coordination, and so on. You can imagine what would happen if this were not limited and boundary crossing could happen uninhibited…
The DDD book mentions some strategies for coordination (Part 4, Chapter 14, “Relationships Between BOUNDED CONTEXTS”, huh, it’s a really long book…), which I will detail later applied to the specific SWE-DS context most of us are working in.
What are parts of a domain?
In general, you can think of all code and data as part of a domain. This includes the data model and the data objects using it, as well as the services (service classes/objects) and the interfaces (functions) between them that operate on this data. Database schemas are also understood to be a (different) version of the same domain model.
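The point about database schemas can be shown directly: the same (hypothetical) domain concept expressed once as a data object and once as a schema. The table and fields are invented for illustration.

```python
import sqlite3
from dataclasses import dataclass

# One domain concept, two representations of the same model.

@dataclass
class Dataset:
    dataset_id: str
    name: str

SCHEMA = """
CREATE TABLE dataset (
    dataset_id TEXT PRIMARY KEY,
    name       TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute("INSERT INTO dataset VALUES (?, ?)", ("ds-1", "april_training"))
row = conn.execute("SELECT dataset_id, name FROM dataset").fetchone()
restored = Dataset(*row)  # the schema round-trips back into the domain object
```

Keeping the two in sync is part of maintaining the domain, not a separate “database problem”.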
Three Domains
In general, we split a system into three general domains. Of course, in large organisations, these can be divided further, and we are more likely to speak about three _types_ of domains, but to simplify, I will use singular here:
Application Domain:
You can think of this as the “outside” realm. DSes work in some business environments, which usually already have their own data model to describe their purpose.
This is usually not changeable by the DS team (easily) but can change unannounced (within reason).
These require asynchronous coordination with the SWE team. More on this in the section on translation layers.
Data Science Domains:
This is where DSes operate.
They have full control over how they represent their work, and maintaining it is paramount to their efficiency.
The domain is usually relatively slow-moving for well-established DS teams. New DS teams should expect that efforts in this domain pay off over a long time, so investments have a good ROI.
This domain is for DSes to understand what’s going on: it stores the business-relevant parts of the App domain and completes them with information coming from the Model domain.
This is the DS team’s control plane to coordinate with all stakeholders and components (SMEs, Execs, SWEs, Mathematical models, Data Engineers). It is important to use language as close to the subject matter as possible to simplify this coordination (also known as “ubiquitous language”). We are talking about communication; technical coordination happens through code in the translation layers.
Datasets are stored in this domain to minimise context switching for DSes and make sure that all data used by DSes are cleaned by the App→DS translation.
Model Domain:
Eventually, you need to leave the easily interpretable DS Domain, convert everything into some model-consumable format, feed it to the models, and interpret the results.
This happens in the Model Domain.
The purpose of this domain is to separate the Business relevant DS domain from the mathematical and technical details of actually doing DS.
Because mathematical objects are hard to interpret, work here should be restricted to the minimum and strict lineage must be maintained.
Every step on the path out of the DS domain (feature transformation at the boundary, the model processing, the output decoding and storage) must be tagged with the ids of the components involved (entity ids, dataset ids, git commit hashes); otherwise, the computed elements won’t be interpretable later.
This domain also stores datasets in a computationally efficient format if something needs to be recalculated or updated. (With a lineage, of course)
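A lineage tag can be a very small object. The sketch below shows one way to attach the ids mentioned above to a stored artefact; all field values are illustrative assumptions.

```python
from dataclasses import dataclass, asdict

# Every artefact that leaves the DS domain carries the ids needed to
# interpret it later.

@dataclass(frozen=True)
class Lineage:
    entity_id: str       # which domain entity produced this
    dataset_id: str      # which dataset snapshot it came from
    code_version: str    # git commit hash of the translation-layer code
    model_id: str        # which model consumed or produced it

tag = Lineage("cust-42", "ds-2023-04", "a1b2c3d", "churn-v3")
# Store the raw numbers together with their lineage, never alone:
record = {"features": [0.1, 0.7], **asdict(tag)}
```

With this in place, a bare vector found in storage can always be traced back to the entity, dataset, and code that produced it.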
These three domains enable you to have a clear separation in your head, assign concerns to the right parts, and only focus on what matters when looking at one at a time.
Two translation layers
Three domains require two translation layers, as the diagram shows. Let’s not forget that translation is a two-way process, so you need to think about translating out of each domain, not just into it.
App→DS translation
When someone calls an API endpoint of the company’s system (and requests a response), or when a DS loads data from a database, they are interfacing with the App domain.
These are usually outside of the control of the DS teams, so they need to clarify what kind of data they are getting.
This usually requires some conversation with the SWE teams and an understanding of their data model and their requirements.
This understanding is then codified into the translation layer so that the output is suitable for the DS team. For example, it only contains data relevant to their concerns, not everything else.
Often the translation layer includes code to correct for errors that are irrelevant to the App team but important to the DS team.
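Filtering to relevant fields and correcting known App-side quirks can live in the same small function. The record shape and the empty-country quirk below are invented for illustration.

```python
def translate_user(raw: dict) -> dict:
    """App-to-DS translation: keep only DS-relevant fields, fix known issues."""
    return {
        "customer_id": str(raw["id"]),
        # The App system stores "" for unknown country, which the App team
        # doesn't care about but the DS team does; normalise it here.
        "country": raw.get("country") or "UNKNOWN",
    }

row = translate_user({"id": 9, "country": "", "internal_flag": 3})
# internal_flag is dropped: it is not part of the DS team's concerns
```

Note that the correction is documented in the translation layer itself, so the assumption stays visible at the boundary rather than being buried in downstream analysis code.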
It also involves code to signal changes:
- Required data missing: this can have catastrophic consequences, and DSes need to be alerted immediately
- Incorrect data
- Schema changes
- New data: may be a signal for a relevant change
These can cause the following events:
- Corrective action: if the change is assumed to be recoverable, correct it (alternative source, default values, etc.). We are talking about a piece of code, so “assumed to be recoverable” means that during the Business-SWE-DS conversations, these assumptions were added to the specification, validated, and implemented, essentially written in code.
- Fire a warning: signal the DSes that something potentially relevant was changed.
- Fire an error: halt the processing and signal the requester (SWE for API calls, DS for batch data requests) that their needs can’t be satisfied, and the reasons.
- Log: store anything that can be relevant in the future.
The follow-up is then to fix the issue asynchronously or to contact the SWE team to figure out an action plan. Once that is fixed, run tests to validate that the change was absorbed into the system successfully and carry on.
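The event logic above can be sketched in a few lines. The field names, the recoverable default, and the expected schema are all assumptions for illustration; the structure (corrective action, warning, error, log) is the point.

```python
import logging

logger = logging.getLogger("app_to_ds")

class TranslationError(Exception):
    """Raised when the requester's needs cannot be satisfied."""

DEFAULT_PLAN = "basic"  # assumed recoverable default, agreed in the spec

def validate_and_translate(raw: dict) -> dict:
    if "id" not in raw:                       # required data missing
        logger.error("required field 'id' missing: %r", raw)
        raise TranslationError("cannot identify record without 'id'")
    if "plan" not in raw:                     # recoverable: corrective action
        logger.warning("plan missing for id=%s, using default", raw["id"])
    unknown = set(raw) - {"id", "plan"}
    if unknown:                               # new data: potentially relevant
        logger.warning("new fields seen: %s", sorted(unknown))
    return {"customer_id": str(raw["id"]),
            "plan": raw.get("plan", DEFAULT_PLAN)}
```

Every branch also logs, so anything potentially relevant is stored for the future even when processing continues.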
DS→Model translation
This is the easier one as usually both sides of the boundary are controlled by the same (DS) team (This is not trivial given the proliferation of ML* teams, but certainly easier than an SWE-DS coordination).
The primary concern here is that the Model domain is not easily interpretable. If you have ever stared at embedding vectors, not knowing where they came from or where they are going, seeing only a bunch of floating-point numbers, you know what I mean.
The other issue is that the Model domain is not only cryptic but also needs to be numerically efficient, given that the majority of computational resources will be deployed here.
This translation is also known as Feature Engineering (and Output Decoding, equally important but rarely discussed).
I won’t go into the intricacies of FE here, except for one aspect: once the features are generated, there is really no way to retrieve the original data unless there is strict lineage.
Tagging features with metadata about their creation is important. I don’t mean information about the circumstances of collecting the data or forming the datasets (those are strictly DS domain issues and should stay there) but the details of creating the numerical data itself.
Tag them with (DS) domain model ids, git commit hashes of the translation layer objects, and model ids (for the outputs).
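A minimal sketch of that tagging, with a toy transformation standing in for real feature engineering (the ids and the normalisation are assumptions for illustration):

```python
def engineer_features(values, *, entity_id, dataset_id, commit_hash):
    """DS-to-Model translation: encode values and attach lineage metadata."""
    features = [v / max(values) for v in values]  # toy normalisation
    return {
        "vector": features,          # the cryptic numerical form
        "entity_id": entity_id,      # which domain entity this describes
        "dataset_id": dataset_id,    # which dataset snapshot it came from
        "code_version": commit_hash, # which translation code produced it
    }

out = engineer_features([2.0, 4.0], entity_id="cust-1",
                        dataset_id="ds-7", commit_hash="deadbee")
```

The vector alone is meaningless; the vector plus its tags can be traced back through the boundary.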
Because feature engineering is often model specific, you have many of these translation boundaries, and to a certain extent, you can think of each model as its own context. If similar models operate on the same features (for example, solving different NLP problems), you can put these models into the same domain. The idea is that even in the cryptic numerical form, the domain data objects (vectors) must mean the same thing for each of the models for the models to be in the same context.
Coordination is easier here because it comes naturally from the DS workflow, but maintaining these boundaries helps the DSes to have a high-level mental model of what they are dealing with and decompose their problems into this structure and tackle them individually.
Conclusion
In some sense, this article only renames already existing terminology according to the DDD book, and that is true. But on the other hand, it reflects the first principles of bounded contexts and domain-driven design (coherence, decomposition, ubiquitous language: the same unified terminology across stakeholders in a single domain), which is beneficial for everyone to know.
Maintaining these high-level concerns allows the DS team to invest their efforts into the right area and achieve a high ROI on infrastructure work while at the same time reducing the cost of each experiment by reusing large parts of the existing infrastructure.
If you liked this article, subscribe. Or check out some of my other content: