Productionised machine learning is a full-stack problem
You can't automate without addressing each component in the ML pipeline, from data collection down to user interactions. Even in offline analysis, ML requires many features, and it's even worse in production. Once an ML model is greenlit (but hopefully earlier), the product team must figure out how to plug the necessary components into the company's existing IT infrastructure.
ML solutions reflect the individual business problems they are solving. Moreover, every company has its own unique tech stack. Add the novelty of ML and MLOps products and the data science (DS) team's relatively low engineering skills, and you have the perfect storm of all tech problems.
One-stop-shop to the rescue
BigTech was the first to experience this, and it threw considerable resources at the problem. Given their scale and their own unique issues, these companies answered it by building their own full-stack solutions. Don't forget that their ML problems are unique to them as well, so unsurprisingly, each solution has a customer of one. Of course, even when they didn't release the solution to the public, they ensured everyone knew what they were doing.
Many tried to follow their path in a strange form of cargo-culting.
"Build vs Buy"
Well, if BigTech doesn't help, let's build our own solution. Considering most data scientists have mathematical rather than engineering training, this is usually a recipe for disaster. MLOps startups answered this by selectively solving single steps of the stack, leaving the difficult task of solution architecture to the still engineering-challenged DS teams.
It looks like "Build vs Buy" turns out to be "Build AND Buy, plus a lot of duct tape".
"Build THEN Buy"
I wish I could give you an easier answer, but I can't. Most data science teams need to learn more software engineering to architect a system at the level of complexity required to manage the continuous integration process of an ML system. I am not talking about whether they know Kubernetes or Docker, but whether they know domain-driven design (DDD) or how to maintain decoupling and levels of abstraction.
You will struggle to move forward unless you train your team to write better code that can be refactored as the ML lifecycle requires new features. Then you can think of your solution and architecture as a continuous process. First, build a POC. Then, in the next stage, identify a need and buy a solution if it makes economic sense, but maintain the same solution throughout. Enable yourself to make conscious decisions rather than just defaulting to the diagram in the latest blog post.
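To make the decoupling point concrete, here is a minimal Python sketch of one way "Build THEN Buy" could look in code. Everything in it is hypothetical and invented for illustration: the `ModelStore` interface, the `InMemoryModelStore` POC, and `FakeVendorClient`, which merely stands in for whatever SDK a purchased product would ship. The idea it shows is that the pipeline depends only on an interface the team owns, so a bought component can replace the built one without touching the rest of the code.

```python
from typing import Protocol


class ModelStore(Protocol):
    """Hypothetical boundary the rest of the pipeline depends on."""

    def save(self, name: str, model: bytes) -> None: ...

    def load(self, name: str) -> bytes: ...


class InMemoryModelStore:
    """'Build' stage: a throwaway POC implementation kept in a dict."""

    def __init__(self) -> None:
        self._models = {}

    def save(self, name: str, model: bytes) -> None:
        self._models[name] = model

    def load(self, name: str) -> bytes:
        return self._models[name]


class FakeVendorClient:
    """Stand-in for a bought product's SDK (assumption, not a real API)."""

    def __init__(self) -> None:
        self._blobs = {}

    def upload_artifact(self, key: str, payload: bytes) -> None:
        self._blobs[key] = payload

    def download_artifact(self, key: str) -> bytes:
        return self._blobs[key]


class VendorModelStore:
    """'Buy' stage: an adapter that maps our interface onto the vendor's."""

    def __init__(self, client: FakeVendorClient) -> None:
        self._client = client

    def save(self, name: str, model: bytes) -> None:
        self._client.upload_artifact(f"models/{name}", model)

    def load(self, name: str) -> bytes:
        return self._client.download_artifact(f"models/{name}")


def train_and_publish(store: ModelStore, name: str) -> bytes:
    """Pipeline code: it knows about ModelStore, never about the vendor."""
    model = b"weights-v1"  # placeholder for a real training step
    store.save(name, model)
    return store.load(name)


# The same pipeline function runs against either implementation.
poc_result = train_and_publish(InMemoryModelStore(), "churn")
bought_result = train_and_publish(VendorModelStore(FakeVendorClient()), "churn")
```

The design choice worth noticing is that the vendor's vocabulary (`upload_artifact`, `download_artifact`) stays inside the adapter. If the purchase stops making economic sense, only that one class changes.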
If you liked this post, take a look at yesterday's post about productivity issues related to teams:
https://laszlo.substack.com/p/causes-of-machine-learnings-productivity-7e2