I was suffering from writer’s block for the last couple of months, hence the long hiatus. But now this is over: we are preparing to revamp our website, and I wrote this post about our core method to focus my thoughts. Please let me know in the comments what you think about it and what you would like to read about next.
What is LeanML?
Lean Machine Learning (LeanML) is an organisational framework that allows Data Science teams to build business-oriented data products through a highly structured process. LeanML is aware of the difficulties of dealing with data-centric workflows and outcomes. It is inspired by techniques and know-how from multiple disciplines such as quant trading in finance, business intelligence, agile software engineering and strategic consulting. It takes into account the team and personal motivational principles of Autonomy, Mastery and Connectivity from Self-Determination Theory (SDT). It builds well-operating teams that can react to unforeseen challenges and deliver value in tough circumstances.
The Planning - Foundation - Iteration phases
LeanML consists of three phases:
A short planning phase to determine the business value and the scope of the product.
A foundation phase to establish the product organisation and deliver an MVP.
Multiple iteration phases to react to discovered insights continuously and improve the product to the required quality.
The first phase establishes that the problem is relevant and valuable enough to solve with ML. It also determines that there are no alternative solutions and that a statistically correct solution is feasible.
The goal of the second phase is to enable the iteration phase. Experience in productionised Machine Learning tells us that you must expect to uncover unforeseen issues after a solution is in production.
The main body of work happens in the third phase. LeanML's primary goal is to enable this phase with an agile, closed-loop operation. This requires all three components of the People, Process, Technology triple to be in place in a structured form tailored to Machine Learning.
People
LeanML is executed by a Machine Learning Product Team consisting of a Product Manager, Machine Learning Engineers, Data Analysts and Domain Experts. If the ML product is part of a larger product or feature, they collaborate with that team, syncing their schedule.
The team enables transparency and communication by sharing their work throughout the process through code reviews and regular updates. Code reviews help new team members upskill themselves in software engineering and enable business continuity by sharing each solution's implementation details. Regular updates can be supplemented by written content and references to the codebase. As in modern software engineering, the code should always be in a state where it is mostly self-explanatory, to reduce documentation needs.
Technology
LeanML embraces MLOps techniques to achieve continuous integration. The team expects each step of model creation and deployment to happen several times throughout the product's lifecycle. They invest in high-quality tooling to simplify this and maintain each step to avoid collecting technical debt.
LeanML is not tied to any tool and is adaptable to the existing infrastructure of the business. When adapted to an existing data organisation, the repeated and iterative deployment of models helps identify the costly and time-consuming parts of the existing process and enables proper MLOps.
The team uses modern software design principles like SOLID and DRY to build a structure where the model design process is abstracted away from the actual ML infrastructure. This enables Data Scientists (DSes) to work on business-facing problems and implement solutions right away.
The team uses generic ML libraries (e.g., TensorFlow and PyTorch) that enable ensemble techniques to add new features to an existing model in the form of sub-models and enable iterative improvements.
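To make the sub-model idea concrete, here is a minimal sketch in plain Python rather than TensorFlow or PyTorch. Everything in it is an assumption made for illustration: the `Ensemble` class, the `add_sub_model` method, the two toy scoring functions and the weighted average used to combine them are hypothetical and not a prescribed part of LeanML.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A sub-model maps a feature dictionary to a score between 0 and 1.
SubModel = Callable[[Dict[str, float]], float]


@dataclass
class Ensemble:
    """Combines sub-models with a weighted average (an assumed combination rule)."""

    members: List[Tuple[str, SubModel, float]] = field(default_factory=list)

    def add_sub_model(self, name: str, model: SubModel, weight: float = 1.0) -> None:
        if weight <= 0:
            raise ValueError("weight must be positive")
        self.members.append((name, model, weight))

    def predict(self, features: Dict[str, float]) -> float:
        if not self.members:
            raise ValueError("the ensemble has no sub-models")
        total_weight = sum(weight for _, _, weight in self.members)
        weighted_sum = sum(weight * model(features) for _, model, weight in self.members)
        return weighted_sum / total_weight


# Hypothetical sub-models, each encoding one insight from the domain experts.
def amount_score(features: Dict[str, float]) -> float:
    return 1.0 if features.get("amount", 0.0) > 1000 else 0.0


def new_account_score(features: Dict[str, float]) -> float:
    return 1.0 if features.get("account_age_days", 365.0) < 30 else 0.0


sample = {"amount": 1500.0, "account_age_days": 400.0}

ensemble = Ensemble()
ensemble.add_sub_model("amount", amount_score)
baseline = ensemble.predict(sample)

# A later iteration adds one sub-model; the predict() interface is unchanged.
ensemble.add_sub_model("new_account", new_account_score)
updated = ensemble.predict(sample)
```

The point of the sketch is the shape of the change: a new insight arrives as one additional member, while callers of `predict` see the same inputs and outputs as before.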
The team uses version control together with code reviews to rigorously maintain high-quality code standards. This helps to onboard new members, upskill DSes without a CS background, and refactor the codebase to simplify future projects.
Process
LeanML relies on low-friction communication with the rest of the organisation. The product team defines a context of responsibility where they are in charge of solving problems and interact with the teams outside of this area.
They interact with business leaders about performance and KPIs. They analyse performance data and report on progress at each iteration. They also report on discovered issues to get additional direction. Machine Learning regularly runs into unforeseen issues, and identifying, eliminating or pivoting around them is a core function of an ML team.
They work closely with domain experts regarding product specification. The ML team's primary mission is to move domain knowledge that can only be defined through data and not code into ML models. They achieve this through regular data labelling and cleaning tasks together with the domain experts.
They interact with the software engineering and DevOps teams about model deployment and interfaces. The LeanML team is responsible for the model's statistical performance and monitoring. This allows quick triaging of problems to identify who is best suited to respond to an incident.
Data Warehousing teams provide logs and data to be used for ground truth. This data is processed into data sets and drives monitoring, analytics, and reports on the project's health. As with everything in LeanML, this is a continuously integrated process. There is no single time when the team relies on monitoring only. Monitoring, data labelling and model evaluation happen so close to each other that they blend together. This technique is regularly used in quant trading as it allows anticipating changes rather than reactively suffering from them.
Workflow - What does everyone actually do?
The well-maintained codebase, sensible abstractions and simplified interactions with other teams allow the LeanML team to iterate in a straightforward process. The evaluation and monitoring provide them with both sample insights to improve the model and a statistical understanding of the impact of those insights. Rigorous version control of both data and code allows them to compare current performance to past performance. Armed with these insights, they propose new updates to the model. The close working relationship with the domain team enables them to validate these updates and materialise them through labelled data.
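Comparing current and past performance presupposes that every evaluation result can be traced back to the exact code and data that produced it. The sketch below shows one possible way to do this, assuming a hypothetical in-memory `EvaluationLog` and a hash-based `data_fingerprint` helper; the version strings and metric values are made up, and a real team would rely on whatever versioning tools it already has.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def data_fingerprint(rows: List[dict]) -> str:
    """Return a short, stable hash of a data set (row order matters)."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


class EvaluationLog:
    """Hypothetical in-memory log of metrics per (code version, data version)."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], Dict[str, float]] = {}

    def record(self, code_version: str, data_version: str, metrics: Dict[str, float]) -> None:
        self._records[(code_version, data_version)] = dict(metrics)

    def compare(self, old: Tuple[str, str], new: Tuple[str, str]) -> Dict[str, float]:
        """Return new minus old for every metric present in both records."""
        old_metrics = self._records[old]
        new_metrics = self._records[new]
        return {k: new_metrics[k] - old_metrics[k] for k in old_metrics if k in new_metrics}


# Made-up data sets: the second adds one newly labelled row.
rows_v1 = [{"text": "a", "label": 0}]
rows_v2 = rows_v1 + [{"text": "b", "label": 1}]

old_key = ("abc123", data_fingerprint(rows_v1))  # (code version, data version)
new_key = ("def456", data_fingerprint(rows_v2))

log = EvaluationLog()
log.record(*old_key, {"accuracy": 0.80})
log.record(*new_key, {"accuracy": 0.85})
delta = log.compare(old_key, new_key)  # improvement per metric
```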
Once the new data is in place and the new feature is implemented as a sub-model in the model ensemble, DSes train the new model. They perform a regression test comparing the candidate's performance against previous versions. This test alone can provide new insights upon which the team can start a new test. If the candidate is validated, the model is deployed, which is a straightforward operation as it shares input and output interfaces with the previous model.
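The regression test described above can be sketched as follows. This is an illustration under assumptions, not the definitive procedure: it uses accuracy as the only metric, treats models as plain functions, and the names `regression_test`, `tolerance`, `previous_model` and `candidate_model` are hypothetical. Alongside the pass or fail verdict, it lists the examples that the previous model got right and the candidate gets wrong, which is where new insights tend to come from.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[Dict[str, float], int]  # (features, expected label)
Model = Callable[[Dict[str, float]], int]


def accuracy(model: Model, examples: Sequence[Example]) -> float:
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)


def regression_test(
    candidate: Model,
    baseline: Model,
    examples: Sequence[Example],
    tolerance: float = 0.0,
) -> dict:
    """Compare a candidate against the previous model on the same examples."""
    if not examples:
        raise ValueError("the evaluation set is empty")
    candidate_accuracy = accuracy(candidate, examples)
    baseline_accuracy = accuracy(baseline, examples)
    # Indices of examples the previous model got right but the candidate gets wrong.
    regressions: List[int] = [
        index
        for index, (features, label) in enumerate(examples)
        if baseline(features) == label and candidate(features) != label
    ]
    return {
        "baseline_accuracy": baseline_accuracy,
        "candidate_accuracy": candidate_accuracy,
        "regressions": regressions,
        "passed": candidate_accuracy >= baseline_accuracy - tolerance,
    }


# Toy models and data, made up for the example.
def previous_model(features: Dict[str, float]) -> int:
    return int(features["x"] > 0.7)


def candidate_model(features: Dict[str, float]) -> int:
    return int(features["x"] > 0.5)


examples = [({"x": 0.2}, 0), ({"x": 0.6}, 1), ({"x": 0.9}, 1), ({"x": 0.4}, 0)]
report = regression_test(candidate_model, previous_model, examples)
```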
The deployed model is released into production through an A/B test or canary mode to make sure that business assumptions are validated as well. Given that this is an incremental improvement, this is a low-risk operation as performance is not expected to radically differ from the previous model.
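A canary release needs a rule for deciding which requests reach the new model. One common approach, shown here as a sketch, is deterministic hash-based bucketing, so that the same key always lands on the same model. The function name `assign_variant`, the `salt` value and the 10% fraction are assumptions for illustration; the actual routing depends on the serving infrastructure.

```python
import hashlib


def assign_variant(request_key: str, canary_fraction: float, salt: str = "canary-v1") -> str:
    """Deterministically route a request key to 'candidate' or 'current'."""
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{request_key}".encode("utf-8")).digest()
    # Use the first 8 bytes as an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    threshold = int(canary_fraction * 2**64)
    return "candidate" if bucket < threshold else "current"


# Route roughly 10% of 10,000 hypothetical user keys to the candidate model.
keys = [f"user-{i}" for i in range(10_000)]
canary_share = sum(assign_variant(k, 0.10) == "candidate" for k in keys) / len(keys)
```

Changing the salt reshuffles the assignment, which is useful when a new test should not reuse the previous canary population.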
This is the benefit of ensemble modelling. The alternative is that the team relies on an entirely new model and a full retrain, which goes against the "small changes" iteration concept of agile software engineering. With an ensemble, the change is small and low risk, so the testing period is shorter, and the team can combine it with the model evaluation that prepares the next iteration.
It is important to note that several of the above processes can run simultaneously, allowing the LeanML team to scale horizontally and attempt multiple insights at the same time. Versioning of data and code allows this, just as CI/CD lets software teams work in parallel on the same codebase.
Summary
As we have seen, LeanML is nothing more than the clever adaptation of production-grade technologies and processes from other industries that have already proved themselves. The only difficulty is to adapt these processes with little friction to the traditional "Waterfall-like" process of Machine Learning.
The process allows teams to work on multiple issues at the same time and get better at executing by repeating the processes, which covers the two main motivational principles of Autonomy and Mastery. Frequent communication within the team and with external teams supports the third principle, Connectivity, as well. LeanML creates a team that feels empowered to solve the task and feels ownership and motivation about their output.
A blog post is too short to write down all our experience and details about the process and justify every part of LeanML. If you would like to try it out yourself or need more convincing about its benefits, we wrote about it in further detail in our free ebook that you can download at https://machinelearningproductmanual.com/.