I wrote on [LinkedIn] about how data scientists struggle to manage complicated projects and sketched a simple framework to help with this, but now I want to expand on the Blueprint and Roadmap parts.
So, the elements of a project are:
➡ Plan -> a webpage summarising what you plan to achieve, how you will measure it, the team, resources, connected stakeholders, meeting notes, unanswered questions, and anything else worth noting. Start with one page and extend later; Confluence/Notion/Gdocs doesn't matter, just start.
➡ Blueprint -> A diagram with boxes (Lucid/Excalidraw/etc)
➡ Roadmap -> A DAG in some plotting tool (Lucid/Excalidraw/etc)
➡ Backlog -> A Kanban board (Trello/JIRA/Linear/Notion) with tasks in priority order (don't overdo it: four columns (Backlog/TODO/In progress/Done) are enough; you can iterate later)
Note from the future: I just came back from writing the entire post without mentioning how to work with others. All the above components can (or should?) be created together with others. They act as reference points for any conversation inside the team or with other stakeholders (technical or not), making conversations about the project more meaningful.
The key aspect of building ML systems is that they are unspecifiable. You will never have enough information to make plans that will survive for long, so that's precisely what you need to plan for.
The further something is in the future, the more likely it will change or become obsolete, so don't try to plan for it (also, don't expect anyone to care about doing it).
So what can be done? That's the Blueprint:
Blueprint
This is a draft of how the system architecture should integrate with the various existing data, ML, and software systems, and with the stakeholders around them.
Imagine that the project has been in production for 12 months and that all the black box models you plan to make have been worked out. What does your system look like?
I prefer pen and paper for the first iterations, as there are usually too many moving parts and too much uncertainty to get a good overview. This is a kind of "active thinking". I am a visual type, so I typically list the things that need to get onto the Blueprint, draw boxes for them, connect them, and take notes. If it gets too messy, I redraw the bits I like on a fresh sheet of paper. After a couple of iterations, I switch to Lucid and draw a wirechart (strictly black and white, no fancy fonts, mostly just rectangular boxes). I imagine how the data will flow between the subsystems and who will "call" whom (a function/API call or people using a UI). This is not a service architecture, as that is an implementation detail, and it's too early for that. It is just a target (a "desired state" in PM speak) to aim at in its general direction.
This is similar to "Event Storming" in SWE, apart from being more data-heavy, speculative, and early-stage. ML systems usually fit into only part of some business process and have data-heavy backends where Event Storming is of little help.
Then comes the roadmap.
Roadmap
Most people think of a roadmap as a "road", a single line of activities, like an assembly line, with milestones and dates. That's not a roadmap (or a map) but a timeline. So here we will talk about what roadmaps actually are:
A directed acyclic graph (DAG) that starts from the current state (the source) and ends in the desired state (the sink, as described by the Blueprint). Between them are tasks that change the current state and branching points where you need to decide which direction to continue in.
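To make the idea concrete, here is a minimal sketch of a roadmap represented as a DAG in plain Python. All node names are invented for illustration (nothing from the post); nodes prefixed `decision:` stand for the branching points.

```python
# A roadmap as an adjacency dict: node -> list of nodes it unblocks.
# Node names are hypothetical; "decision:" marks a branching point.
roadmap = {
    "current_state": ["collect_data", "interview_stakeholders"],
    "collect_data": ["baseline_model"],
    "interview_stakeholders": ["baseline_model"],
    "baseline_model": ["decision:model_good_enough"],
    "decision:model_good_enough": ["feature_engineering", "build_api"],
    "feature_engineering": ["build_api"],
    "build_api": ["desired_state"],
    "desired_state": [],
}

def topological_order(graph):
    """Order the nodes so every edge points forward (Kahn's algorithm)."""
    indegree = {node: 0 for node in graph}
    for targets in graph.values():
        for t in targets:
            indegree[t] += 1
    ready = [n for n, d in indegree.items() if d == 0]  # no prerequisites
    order = []
    while ready:
        node = ready.pop()
        order.append(node)
        for t in graph[node]:  # releasing a node may unblock its successors
            indegree[t] -= 1
            if indegree[t] == 0:
                ready.append(t)
    return order

print(topological_order(roadmap))
```

A topological order like this is what you later flatten into the sequential roadmap; Kahn's algorithm is standard, but the node names and granularity are up to you.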
Armed with the fresh Blueprint, start there and walk backwards. Why?
It is similar to solving a labyrinth: it is easier to maintain direction if you start at the goal.
If you need a new component, how does it integrate with the rest of the system, and what needs to happen for that component to exist? Which hypotheses need to be true and tested? (These are all nodes in the DAG above; hypothesis tests are typically branching decision points.)
This is, again, active thinking, using pen and paper and later moving to some diagramming software.
Once you are in a reasonable state, organise the DAG into a sequential roadmap. Identify the critical path (if anything on it is delayed, the entire project is delayed). Identify the decision points; these are all high-risk activities, as you must regroup each time the critical path changes.
The critical path is an important feature to maintain and be aware of. It is also something you want to communicate to the teams you work with, as it affects their scheduling as well. Having a roadmap benefits them too, and your conversations with them will be more meaningful.
Don’t worry about task sizes; that comes later. As I wrote before, the very end of the roadmap is very fuzzy because although you know what state you need to get to, the path is unclear and changing (again, ML systems are unspecifiable), so there is no point in estimating yet.
The focus should be on branching points as those are high-impact (in terms of project management) uncertainties. These are also usually (or partly) “research” problems rather than “engineering” problems.
Research vs Engineering Tasks
It is important to build a distinction between these two types of tasks: Research and Engineering.
The success of engineering tasks is typically easier to estimate (and if not, are they really engineering tasks?). You know what needs to be done; you need someone with the right skills, and they will get to the result.
On the other hand, a research problem is uncertain. No matter how much skill you throw at it, the person doing it might come back with a negative outcome. Worse, if you don’t allocate enough resources, the hypothesis might remain unvalidated: we tried, and it didn’t work, but we don’t know why. Maybe we didn’t try hard enough.
This is a terrible outcome, as you can’t throw away that branch of your roadmap; it lingers there, suggesting that if you go back and retry, it might succeed. Why not try it seriously on the first go? By the way, this is why our company’s tagline is “Deliberate Machine Learning,” or, in the famous words of Master Yoda: “Do or do not, there is no try.”
The goal is to have fewer research problems (but no fewer than necessary), with their engineering dependencies clearly identified.
The crucial step here is to assess whether a larger problem that branches the roadmap can be broken down into smaller components that depend on each other, and whether any of them are engineering ones.
Engineering tasks are easier to assess, and you want the research tasks to be pure research, not a mix of engineering and research, because the person doing both together will have a hard time justifying the engineering cost: their outcome is the answer to a hypothesis (which helps you choose the right branch on the roadmap and eliminate the others).
This starts to slide into creating tasks, so this will be done in the Backlog:
Backlog
The backlog is a prioritised list of tasks at the current state of the roadmap that will deliver the system in the Blueprint to achieve the plan's goal. (Huh, I managed to put all the components into one sentence.)
This is a continuation of the end of the roadmapping exercise. But here, the goal is to actually decide what to do first.
The roadmap is a DAG, so the dependencies are already explicit. What’s the big deal here, then?
It’s predictability.
No one likes to work with unclear deliveries or goals. At the same time, they also don’t want to work on planning that has little chance of being implemented or take part in exercises (estimation) that have nothing to do with reality.
So what can be done?
Focus on the very beginning of the roadmap. Take the next larger element on the roadmap and assess what needs to be done and how difficult it is. Is it an engineering problem? Is it a research problem? Can you guess how long it will take? For example, 1 to 7 days (let’s think in 1-week sprints). If you can’t be sure it can be done in a week, you probably want to break it down into smaller chunks. Why can’t you tell? Maybe there is a research problem in there? Engineering problems are usually easier to decompose, so look out for this.
The goal is not to be precise with estimation but to reveal whether something is difficult or not.
A more experienced team can be more comfortable making these guesses, and as they get into the flow, they will be able to sign up for larger tasks at once. In my experience, a typical “good chunk of work” is 2-3 days, so a DS can do 2 in a week (this accounts for “unproductive time”, like meetings, holidays, anything).
Call out statements like “I don’t know”, “This can’t be estimated”, and “This is an art, not a science”.
Why can’t it be estimated? Use the 5 Whys to identify the root causes of uncertainty and keep breaking down the problem until the people who will eventually do the task feel comfortable with it. No one size fits all: a junior person or team will need more handholding than a senior one, and new technology is uncertain on its own. Maybe introduce (research) tasks to experiment with it and build confidence.
It is important that you do this only for tasks you are about to do, not for everything on the roadmap. Dealing with far-out tasks is a pointless exercise, as you pretty much don’t know whether they will be relevant at all.
However, analyse the roadmap for critical components even if they only become relevant later. If a lot depends on a subsystem, you might want to plan, research, and eventually build it right now. But again, that’s what DAGs are for: for each subsystem and task, you can see when it can be done and what its implications are. Use this information creatively.
Breaking down larger tasks into smaller engineering vs research tasks often reveals unidentified efficiencies. Sometimes, it turns out that multiple research problems rely on the existence of the same service (e.g. a database, data cleaning, etc.), and you can redraw the roadmap and potentially simplify the critical path.
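A toy sketch of hunting for these shared dependencies: given a (hypothetical) map from research tasks to their prerequisites, count which prerequisites serve several research tasks at once; those are candidates to pull forward on the roadmap. The task names and the `research:` naming convention are invented for the example:

```python
from collections import Counter

# Hypothetical task -> prerequisites map; "research:" marks research tasks.
deps = {
    "research:churn_model": ["clean_customer_db"],
    "research:pricing_model": ["clean_customer_db", "feature_store"],
    "research:ranking_model": ["feature_store"],
    "build_dashboard": ["feature_store"],
}

# Count how many *research* tasks depend on each prerequisite.
shared = Counter(
    dep
    for task, prereqs in deps.items()
    if task.startswith("research:")
    for dep in prereqs
)

# Prerequisites serving 2+ research tasks are candidates to build first.
candidates = [dep for dep, n in shared.items() if n >= 2]
print(candidates)
```

In practice this is usually a look over the roadmap diagram rather than code, but the question is the same: which single engineering deliverable unblocks the most hypotheses?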
This is the main driver of running these projects faster: leveraging engineering results, which are more predictable, to accelerate the research tasks that remove uncertainty, towards the ultimate goal of the project.
The PM’s (or Senior IC’s) job here is to facilitate this conversation so that the team members feel comfortable signing up for these to-dos. If you can’t conclude a research vs engineering question, it is the leader’s job to sign off on the task: “Try it anyway and regroup next week.” Sometimes, you need to stop planning and march forward.
Updating
As I wrote many times before, ML projects are unspecifiable, so updating these documents is a basic daily routine and the main job of the PMs. The typical order is:
Backlog → Roadmap → Blueprint → Plan
You update the Backlog each week as nodes on the Roadmap become available. If you resolve a junction point, you may need to update the Blueprint, but you only very rarely update the Plan: namely, when you realise something is infeasible, or you will need to deliver something else, measured differently. (You should ask: is this still the same project?)
As the project progresses, you update these documents and share them with other teams. You (as a PM) can plan the project better as you know that your team works on deliverable elements, and you can anticipate dependencies in the future and pre-plan for them with other teams (a notoriously complicated part of ML projects).
While the backlog only deals with tasks right now, you know (roughly) which tasks and dependencies will be relevant in the near future so that you can think about them.
Benefits
As I wrote before, the main efficiency gains come from correctly identifying things that are doable and leveraging them for more learning. My experience is that ML projects severely underinvest in engineering tasks (not just MLOps infra but in general EDA/POC stages as well). It is usually justified that there is no clear outcome yet, but it often ends up being the main reason that there will be no outcome at all.
The project management components are also a good reference point to bring everyone to the same page. “Here is my plan; if you don’t like it, how should I change it? But once we agree to it, we should stick to it.”
While this does not directly deal with estimating timelines, I found that, as a side effect, relative timelines fell out of these exercises. Projects often seem more “real” and “progressing” to managers than they would otherwise, which eases the pressure around deadlines (which are usually just arbitrary points, but missing them is a huge loss of face).
The roadmap also allows a more granular approach. By analysing the dependency graph, you can identify tasks that can turn into GO/NOGO decisions (e.g., affect project commercial viability). Bringing these forward (as allowed by the DAG) can help the company make decisions.
Summary
This sounds like a lot of stuff, but the main task of project leaders is to unblock and manage others, so this doesn’t come out of the ICs’ time. It is far riskier to run a project without a good plan in your head of where you are and where you are going.
At the same time, you should be aware of uncertainties and not be dogmatic about plans and execution. No one likes chores, but everyone wants clarity about what they are doing and whether it is feasible and matters.
The majority of this should be done in a short workshop (1-2 days for the PM, much less for everyone else). Fixing the backlog should be a weekly meeting. The rest is the PM’s “active thinking” work.
I hope you enjoyed this post; please consider subscribing for more.