Industrial Machine Learning was never model-centric, but it should be.
There are many ways to validate this claim. Industry surveys indicate that practising Data Scientists spend the majority of their working hours on “data cleaning”. The most important skills for Data Scientists are SQL and statistics; neither has much to do with modelling. Indeed, once you start a job, your primary concerns are data, not models.
We can also look at this through the proliferation of MLOps. Both closed-source and open-source solutions primarily focus on simple, easy-to-train models like linear regressions and random forests. For many MLOps tools, modelling is not even part of the value proposition (workflow engines, feature stores, data version control). It is hard to argue that these companies somehow failed to notice that their main customers are obsessed with the latest deep learning research.
Even in Natural Language Processing, the real breakthrough came when Explosion AI, Huggingface and the Allen Institute made their models available to the public; industrial players usually use these models as they are rather than trying to beat them on performance.
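To make the “use it as it is” point concrete, here is a minimal sketch of what that typically looks like in practice, assuming the Hugging Face transformers library (with a backend such as PyTorch) is installed; the task and input text are illustrative, not taken from any of the teams mentioned above.

```python
# Minimal sketch: consuming a publicly released model "as it is",
# with no training or fine-tuning, just the default pre-trained pipeline.
# Assumes the `transformers` library and a backend (e.g. PyTorch) are installed.
from transformers import pipeline

# Downloads and caches a default pre-trained sentiment model on first use.
classifier = pipeline("sentiment-analysis")

result = classifier("The shipment arrived two weeks late and the box was damaged.")
print(result)  # e.g. [{'label': 'NEGATIVE', 'score': 0.99}]
```

The modelling work here has already been done by someone else; the team’s effort goes into the data flowing in and out of the call.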
The teams that are actually focused on models
Members of the following list are NOT typical industrial players:
Research groups at FAANG companies
Main AI research groups (DeepMind, FAIR, OpenAI)
NeurIPS participants
Academic research groups
Kaggle competitors!!!
They operate in a different problem space, with different priorities that have very little to do with a typical company’s problems. They are, of course, very loud on social networks because that is a priority for them. But their popularity shouldn’t be confused with what’s going on elsewhere: “Twitter is not the real world,” as we all know.
Modelling is super relevant but for different reasons
Modelling is the most critical part of the ML value chain. Why? Ask anyone which part of their pipeline they would outsource to someone else. Data collection? Infrastructure? Storage? All of these can be done by someone else. Modelling is the part closest to the customer and the most specific to the business itself (not even specific to the industry, but to the peculiarities of the concrete company). That’s where the value is created, and you don’t want to depend on someone else for it. If you outsource that, you might as well buy the services of an AI-first SaaS company that can sort out the entire problem for you, but that’s a different question.
But industrial ML will never be model-obsessed. Why? Because it is too problem-specific. Academic research concentrates on a handful of shared benchmark problems, so the results are comparable. But what do you compare your solution to if it is specific to your business? It is either feasible or not, and the marginal benefit of improving it with more complex models diminishes quickly. This doesn’t mean that industrial models need to be simple. They need to be as simple as possible, but not simpler, to paraphrase Albert Einstein. The model’s “Essential Complexity” is set by the business problem, not by an external comparison.
Why is the concept of “data-centric AI” so popular?
Machine Learning as a paradigm suffers from a productivity crisis. If you are in charge of a team that is not producing enough value, you want to take action to resolve this. It is easy to declare that too much time spent on models must be the problem. Creating models looks like a dark art to outsiders. Data, on the other hand, looks easy (narrator: it isn’t): it is measurable, it can be visualised, it is an “asset”. Plus, the better part of the last decade was spent building Big Data systems that turned out to be less useful than originally thought. Can we solve two problems at once?
Unfortunately, most Data Science teams’ approach to handling data is just as haphazard as their approach to modelling. Bad habits, hype and broken incentives do not help either.
ML’s productivity problem is not rooted in data. Machine Learning is a new paradigm, and coherent processes for deliberately solving problems with it are still being experimented with. The industry lacks structured thinking on how to work through projects and deliver value.
Our company, Hypergolic, is deeply committed to resolving this problem by adapting techniques from other paradigms (e.g., product and project management, software engineering) to data-intensive products, leading to accelerated performance.
Could not have said it better. At first, everyone was sold on big data; then it dawned on people that maybe that was not sufficient and the elephant in the room was not needed. Then everyone was sold on big models, but hiring a truckload of ML PhDs and building the right tooling is not so easy. Now the focus is on quality data (data-centric), and the selling continues... FOMO and the quest for the silver bullet are great motivators.