Data Science Risk Categorisation

And one factor that is suspiciously missing...

Sep 15, 2020

Because Data Science projects are cross-functional, they are exposed to many sources of risk. To paraphrase Tolstoy: “Successful data projects are all alike; every unsuccessful project is unsuccessful in its own way.” The only way to avoid failure is to enumerate, categorise and address the difficulties systematically. Drawing on the experience of others, I maintain a list of reports and articles on the subject on GitHub, and in this blog post I will describe the main themes emerging from it:

Organisational

Dealing with Data Science requires the right environment. Certain conditions must be met even before planning of a DS project can start. Difficulties that fall into the Organisational category are too large or out of scope to be fixed under one project and therefore require a top-down strategic vision from the company: Leadership must embrace and champion Data Science and own projects related to it. A cultural understanding must be built throughout the organisation to reduce resistance against new practices.

This motivates employees to upskill themselves and lets the business build, through hiring, the right human capital to be used across the organisation.

The third category that must be taken as given is infrastructural issues: dependency on legacy systems and a lack of available compute or big data tools can all derail forthcoming projects.

Intermediate

For lack of a better name, I called this category “Intermediate”, as it sits right between organisational and project-related challenges. These issues usually must be dealt with in specific projects to get a good outcome, but they can be overlooked because they seem tangential to the “real” business value. This could not be further from the truth.

Violating the “Legal, Privacy, Bias, Security” factors in this category makes the project immediately unfeasible, regardless of its other outcomes. These factors must be identified and incorporated into the project feasibility analysis, together with their impact on the eventual GO/NOGO decision.
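To make the "immediately unfeasible" point concrete, here is a minimal sketch of how these factors could act as hard gates in a feasibility review. The names `FeasibilityReview`, `HARD_GATES` and `go_nogo` are hypothetical, and the value-versus-cost comparison is a simplifying assumption, not a method taken from the reports.

```python
from dataclasses import dataclass

# Hypothetical hard gates, named after the factors listed in the text.
HARD_GATES = ("legal", "privacy", "bias", "security")


@dataclass
class FeasibilityReview:
    """Illustrative container for the outcome of a feasibility review."""
    expected_value: float  # estimated business value, in any currency unit
    expected_cost: float   # estimated cost, in the same unit
    gate_passed: dict      # e.g. {"legal": True, "privacy": False, ...}


def go_nogo(review):
    """Return ("GO" or "NOGO", list of reasons).

    A gate that failed, or was never reviewed, blocks the project
    no matter how large the expected value is.
    """
    reasons = [
        f"{gate} not cleared"
        for gate in HARD_GATES
        if not review.gate_passed.get(gate, False)
    ]
    if reasons:
        return "NOGO", reasons
    if review.expected_value <= review.expected_cost:
        return "NOGO", ["expected value does not exceed expected cost"]
    return "GO", []
```

The design choice worth noting is that a missing gate counts as a failed gate: a factor nobody looked at is treated the same as one that was violated.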

The other major and “fixable” category is transparency and communication. Problems here might not sink a project on their own, but they will make everyone’s life harder. Standard communication protocols with all stakeholders and project participants must be set up early on and regularly updated.

Planning

Once all the organisational issues are cleared up, the project idea can be turned into a plan. First of all, the way to measure the business value should be identified, and the value itself estimated. Reading numerous reports, I was startled by how many times “lack of business value/business case” came up. Reporting on the project’s progress in terms of business value should be a primary communication and transparency (see above) tool towards employees and leadership, which will increase their involvement and ownership. Business value is also affected by the “Legal, etc.” factors; hence they should be included. A project outcome that doesn’t deal with these will produce zero business value and therefore should not progress to the next stage. Business value was already discussed in a previous post.

The next difficulty is the specification. Existing infrastructure, legal constraints, and constraints from available human and other resources must be taken into account. Hopefully, this happens after the business value question of the project has been cleared. I have no idea how you can create a project specification without clear business goals, but judging by the reports this is apparently a common curse; don’t fall for it. All four major tasks (more on this later) in the project execution must be planned and budgeted for, which can lead to a GO/NOGO decision from leadership. Major risk factors are “BHAGs (Big Hairy Audacious Goals)”, underestimating technological complexity and being too ambitious. Reducing scope and starting small is common advice.

One-Off

Once the project is in full swing, several risk factors can thwart it. Project execution must address coordination and project structure issues (for more on this, see my previous blog on Separation of Concerns); coordination between multiple teams is crucial. The actual implementation is essential as well and could lead to a CHACHE (change anything, change everything) situation, a phenomenon that is well known in “fragile” software engineering projects.

Of course, data issues are numerous as well: noisy data, incorrectly labelled data, disorganised data and silos, to name a few. I would argue that a project shouldn’t reach this stage only to be tripped up by deficient data. This should be handled either at the specification stage (with potential data quality tasks as part of the planning) or by reducing the scope of the project. Run this less ambitious project first; then, armed with a better estimate of the ROI from the reduced project, the larger project can be authorised.
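A data quality task at the specification stage could start with something as small as the audit below. This is a rough sketch under my own assumptions: rows are plain dictionaries, `None` marks a missing value, and the function name `audit_rows` and the three reported metrics are illustrative choices rather than a standard.

```python
from collections import defaultdict


def audit_rows(rows, feature_keys, label_key):
    """Report simple data quality indicators for a labelled dataset.

    rows         : list of dicts, one per record (None marks a missing value)
    feature_keys : names of the feature columns
    label_key    : name of the label column
    """
    n = len(rows)
    if n == 0:
        raise ValueError("no rows to audit")

    # Share of missing values per column, label included.
    columns = list(feature_keys) + [label_key]
    missing_rate = {
        col: sum(1 for r in rows if r.get(col) is None) / n for col in columns
    }

    # Group records by identical feature values to spot duplicates and
    # records that carry the same features but different labels.
    labels_by_features = defaultdict(set)
    for r in rows:
        key = tuple(r.get(k) for k in feature_keys)
        labels_by_features[key].add(r.get(label_key))

    conflicting_groups = sum(
        1 for labels in labels_by_features.values() if len(labels) > 1
    )
    return {
        "missing_rate": missing_rate,
        "duplicate_feature_rate": 1 - len(labels_by_features) / n,
        "conflicting_groups": conflicting_groups,
    }
```

If numbers like these look bad before the project starts, that is an argument for budgeting a data quality task or for cutting the scope, rather than discovering the problem mid-execution.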

Ultimately, modelling itself can fail as well. This is the toughest risk of all because there is no way to estimate in advance whether a good enough model is possible. Various techniques, such as ensembles, continuous improvement, human-in-the-loop and active learning, can be applied to avoid failure at this stage. This topic is so broad and important that I will write about it frequently in the future, so here I only mention it briefly. Please subscribe if you want to read more about it.
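As one small illustration of the human-in-the-loop idea, a model's confident predictions can be accepted automatically while the uncertain ones are sent to a person. The sketch below assumes the model outputs a probability per label; the function name `route_predictions` and the 0.9 threshold are hypothetical and would need tuning for a real project.

```python
def route_predictions(predictions, threshold=0.9):
    """Split predictions into an automated queue and a human review queue.

    predictions : list of (item_id, {label: probability}) pairs
    threshold   : minimum top-label probability to accept without review
    """
    automated, review = [], []
    for item_id, distribution in predictions:
        # Pick the most probable label and its probability.
        label, confidence = max(distribution.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            automated.append((item_id, label))
        else:
            review.append((item_id, label, confidence))

    # Least confident first, so reviewers see the hardest cases first.
    review.sort(key=lambda entry: entry[2])
    return automated, review
```

The labels collected from the review queue can later be fed back as training data, which is where this overlaps with active learning.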

Ongoing

Just because a working model was shipped, it doesn’t mean the project is over. Machine Learning models require ongoing maintenance to counter “drift”: systemic changes in the latent processes underlying the data. A significant number of projects lack the necessary components for proper lifecycle management: monitoring, updates and continuous delivery. These hopefully reuse existing infrastructure from above; if not, they should be part of the specification as well. If you are particularly interested in the ML Operations topic, please join the MLOps Community.
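A monitoring component can begin with a single drift score per feature. The sketch below uses the Population Stability Index (PSI), which is my choice of metric for illustration and not something the reports prescribe; the equal-width binning, the `eps` floor for empty bins and the function name are all assumptions.

```python
import math


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """Compare a current sample of one numeric feature against a reference.

    Both samples are bucketed into equal-width bins spanning the reference
    range. Returns 0.0 for identical distributions; larger means more drift.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range fall into the edge bins.
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a shift worth investigating, but any alerting threshold should be validated against the project's own data.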

Summary

After reading dozens of reports listing more than 300 risk factors, I was puzzled by one curious fact: almost none of the reports mention involving domain experts. It is as if Data Scientists were expected to make judgment calls on any possible topic, drawing conclusions purely from the data available (labelled or otherwise). I consider this a major failure and partially responsible for the 87% failure rate of data projects. In the next couple of issues I will write more about this, so if you don’t want to miss them, please subscribe.

