Data Scientist is an umbrella term with many different interpretations. Recently there has been a proliferation of job titles meant to clarify it, usually based on an arbitrary allocation of skills, functions and responsibilities. In this post, I define three core roles based on a top-down, value-added approach and systematic thinking.
What makes an individual motivated and efficient in a role?
According to Self-Determination Theory, for an individual to be continuously motivated in their role, three factors need to be present: autonomy, mastery (which the theory itself calls competence) and relatedness.
Autonomy means that they can accomplish their goals through a path they decide for themselves. Mastery means that along that path they get better at a set of skills that will enable them to do their job better in the future. Relatedness means that they work as part of a group they feel they belong to and can communicate their work to the group and the wider environment. Achieving all three keeps your team motivated and makes them more efficient in their subsequent projects.
If you fail to structure roles according to these principles, you will demotivate your employees, reduce their growth trajectory and risk losing them.
Three data roles in a data-driven company
As I wrote in a previous post, a data-driven company has three main components: an analytics function, a product function and a data function.
You can justify this categorisation by asking: Who is your customer? If your customer is an executive, you are doing analytics (DSA/BI). If you work on automated models or recommender systems and your customers are your company’s clients, you belong to a data product function (DS/MLE etc.). If your customers are data roles from the previous two categories, then you belong to the data function (DWH/DE).
According to Self-Determination Theory, you need to shape these roles so that they own the entire value chain from a stable infrastructure input down to their end customers. This allows them to solve their problems independently and measurably, leading to a high level of autonomy. By relying on a stable infrastructure and outsourcing rarely used functions to Engineering and DevOps, the team will be able to spend most of their time on their core skills and improve them on the job, leading to mastery. Self-contained ownership and low-friction communication with the rest of the company will lead to relatedness.
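The "who is your customer?" test is simple enough to write down as a lookup. The following is only a minimal sketch of that rule; the names `Customer`, `DataFunction` and `function_for` are made up for illustration and are not part of any existing tool.

```python
from enum import Enum


class Customer(Enum):
    """Who consumes the output of the work (hypothetical labels)."""
    EXECUTIVE = "executive"
    COMPANY_CLIENT = "company client"
    DATA_ROLE = "analytics or product data role"


class DataFunction(Enum):
    """The three functions described in the post."""
    ANALYTICS = "analytics function (DSA/BI)"
    PRODUCT = "data product function (DS/MLE)"
    DATA = "data function (DWH/DE)"


# The categorisation rule: the customer determines the function.
_FUNCTION_BY_CUSTOMER = {
    Customer.EXECUTIVE: DataFunction.ANALYTICS,
    Customer.COMPANY_CLIENT: DataFunction.PRODUCT,
    Customer.DATA_ROLE: DataFunction.DATA,
}


def function_for(customer: Customer) -> DataFunction:
    """Return the data function that serves the given customer."""
    return _FUNCTION_BY_CUSTOMER[customer]


print(function_for(Customer.EXECUTIVE).value)
```

The point of the sketch is that the mapping is one-to-one: a role with two different customers is a sign that it straddles two functions.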
Let’s define each role and its relations to the rest of the enterprise environment:
Data Warehouse Engineers (Data Function)
Their core function is to enter data into the company’s data domain and record it safely.
Their job is to provide a homogeneous data infrastructure that the other two functions can utilise. Their focus is on storing the data, not processing it. This trend was validated when DWHs moved from ETL to ELT, focusing on “Load” instead of “Transform”: transformation is problem-specific, while loading (i.e. storing) has become more straightforward as storage costs have fallen.
The DWH skillset is data engineering and operations, databases and workflow management. They work closely with engineering, and the DevOps teams support their infrastructure.
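To make the ELT split concrete, here is a small sketch using Python's built-in `sqlite3` as a stand-in for a warehouse. The table name `raw_events`, the event fields and both function names are assumptions chosen for illustration. The load step stores events untouched; the transform step is problem-specific and lives with the team that consumes the data.

```python
import json
import sqlite3


def load_raw(conn, events):
    """Load step (data function): store every event as-is, no reshaping."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO raw_events (payload) VALUES (?)",
        [(json.dumps(event),) for event in events],
    )
    conn.commit()


def revenue_by_country(conn):
    """Transform step (consuming team): one problem-specific view of raw data."""
    totals = {}
    for (payload,) in conn.execute("SELECT payload FROM raw_events"):
        event = json.loads(payload)
        if event.get("type") == "purchase":
            country = event["country"]
            totals[country] = totals.get(country, 0.0) + event["amount"]
    return totals


conn = sqlite3.connect(":memory:")
load_raw(conn, [
    {"type": "purchase", "country": "DE", "amount": 10.0},
    {"type": "page_view", "country": "DE"},
    {"type": "purchase", "country": "UK", "amount": 5.5},
    {"type": "purchase", "country": "DE", "amount": 2.5},
])
print(revenue_by_country(conn))  # {'DE': 12.5, 'UK': 5.5}
```

Because the raw table keeps the page view as well, another team can later write a different transform over the same data without asking the data function to change the load.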
Data Science Analysts (Analytics Function)
Data Science Analysts are responsible for extracting meaningful information from collected data and presenting it to executives to aid higher-level decision making.
They take data from the DWH and transform it according to problem-specific needs. This is helped by moving to the ELT paradigm, as it allows better slicing of the Transform step between the three teams. Once they have transformed raw data into a consumable form, they can perform statistical analysis and modelling on it and present the results through custom visualisation. Their whole skillset is put to use on every project, and they are directly responsible for the outcome. They don’t need to wait for external teams to deliver part of their value chain, which might slow down or block their progress.
The DSA’s skillset is data engineering, data visualisation, software engineering (mostly Python), statistical analysis, statistical modelling and presentation. This function goes by many names, such as Business Intelligence Analyst, Data Analyst or Data Engineer, but the defining characteristic should be their end goal and how they best achieve it.
Data Scientists (Product Function)
If you are creating features that must operate without supervision, you are part of the product team. Your clients are the company’s own clients, and you need to perform at a level of professionalism expected from software engineers.
Just like DSAs, you take raw data from the DWH and transform it further. On this more consumable data, you do feature engineering and model building. Your model is deployed straight into production without a rewrite either by you or a third party. The engineering team works closely with you to prepare the containers and interfaces that let the model interoperate with the rest of their system. You are responsible for creating the model in a production-ready form from day one.
The DS’s skillset is data engineering, machine learning and deep learning, and they also need the DSA’s analytics capabilities to monitor model performance and explainability. They work closely with the engineering team and are supported by DevOps/MLOps teams. This function is usually split into overlapping responsibilities like Data Engineering, Machine Learning Engineering, Data Science and MLOps Engineering.
Unfortunately, if you split or combine the role as described above, you violate at least one of the SDT principles, for example:
Making DSes work on Kubernetes (a DevOps skill): As K8S is a rarely used function for them, DSes will struggle to gain mastery in it.
Splitting Data Engineering from Data Science: Because DSes need to wait for DEs to produce the data, DSes will lose the autonomy to reach their goals.
Having DSes build POCs siloed away from engineers: They don’t connect to the customer and the added value, so they lose relatedness to the product team and its goals.
There are many examples of this. The above framework provides you with a system in which you can evaluate problems, skills and responsibilities and allocate them to the tech teams. You can maintain cohesion regarding the three factors of Self Determination across your data org.
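One way to apply the framework is as a checklist per responsibility. The sketch below is an illustration only: the three boolean fields and the function `violated_factors` are assumptions that mirror the three examples above, not an established scoring method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Responsibility:
    """A task assigned to a role, described by three yes/no questions."""
    name: str
    used_regularly: bool         # exercised often enough to get better at it
    waits_on_other_team: bool    # blocked until another team delivers
    connected_to_customer: bool  # the outcome reaches the end customer


def violated_factors(r: Responsibility) -> set[str]:
    """Return which of autonomy, mastery and relatedness the assignment hurts."""
    violations = set()
    if not r.used_regularly:
        violations.add("mastery")
    if r.waits_on_other_team:
        violations.add("autonomy")
    if not r.connected_to_customer:
        violations.add("relatedness")
    return violations


# The three examples from the text, encoded under the assumptions above.
k8s = Responsibility("operate Kubernetes", False, False, True)
split_de = Responsibility("model on data produced by a separate DE team", True, True, True)
siloed_poc = Responsibility("build POCs away from engineers", True, False, False)

for r in (k8s, split_de, siloed_poc):
    print(r.name, "->", sorted(violated_factors(r)))
```

An empty result means the responsibility fits the role; anything else is a candidate for moving to another team.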
Research and experimentation mindset
But wait a second! Isn’t Data Science all about research and experimentation?
Well, not really. Most Data Science teams pride themselves on their ad-hoc POC workflows, but these are detrimental to productivity. Applied Data Science work is not research; it is applying best practices in an industrial setting with professional and ruthless efficiency and validating the results in production.
Of course, execution might fail to yield results. Still, even in this case, the project must be conducted in a validatable and meaningful way. This is the only way to convince the organisation that the opportunity is not there and that it can move on.
Summary
Instead of a plethora of job titles, you need three and only three roles to be a full-stack data-driven organisation: Data Warehouse Engineer, Data Science Analyst and Data Scientist. Self-Determination Theory helps you allocate skills and responsibilities to these three to maintain motivation and efficiency.