You often hear about attempts to build text classifiers for this and that purpose. But what do you do when a client comes to you with a problem and an unheard-of performance requirement, and your company’s survival depends on the solution?
“It’s 106 miles to Chicago, we’ve got a full tank, half pack of cigarettes, it’s dark out, and we’re wearing sunglasses. Hit it.”
It was a couple of years ago when a large information provider came to us with a problem they had: Find all articles in the world reporting about new ETF (Exchange Traded Fund) launches shortly after they are published.
Being an NLP startup, we couldn’t say no to a request like this. By this time, we had considerable experience in industrial-scale NLP and, of course, a well-maintained web crawler.
This, of course, doesn’t make the problem easy. The topic itself is niche enough that standard NLP classifiers are not going to work. Standard techniques rely on broad topics with a lot of terminology and fuzzy boundaries. This works for topics like “Sport” or “Finance” but certainly not for deciding “Does this article mention that Fund XYZ launched, or that it is actually closing down?”
Did I mention that the SLA stated incredibly strict performance characteristics (both in Precision and in Recall), which, to my knowledge, are unheard of in any NLP-related task?
As a resource-strapped startup, we pretty much only had one shot at this, so we had better bring our A-game.
So here is what we did
Our company had a DQA (Data Quality Assurance) team and a smaller team of SMEs (experts in ETF-related issues) who could orchestrate the domain efforts.
We also had a long list of fund names and an industrial-strength NER tagger.
Gather the sources
It’s pretty easy to determine if an article should be classified as positive when you see it. But how do you classify an article that you have never seen? And with this, you have reached the crux of any information retrieval problem.
First of all, we need to find any sources that talk about any ETFs. We had a considerable corpus (tens of millions of articles) of finance-related content. We searched for any article mentioning any of the funds in the above-mentioned fund list and grouped them by relevance.
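As a rough illustration of that first pass, here is a minimal sketch that searches articles for known fund names and ranks them. The fund names are made up, and using the total number of mentions as the relevance score is an assumption; the actual relevance grouping is not described here.

```python
from collections import Counter

# Hypothetical fund list; the real one was much longer.
FUND_NAMES = ["Alpha Global Equity ETF", "Beta Bond ETF"]


def count_fund_mentions(text, fund_names=FUND_NAMES):
    """Count case-insensitive occurrences of each known fund name."""
    lowered = text.lower()
    counts = Counter()
    for name in fund_names:
        n = lowered.count(name.lower())
        if n:
            counts[name] = n
    return counts


def rank_by_relevance(articles, fund_names=FUND_NAMES):
    """Return (article_id, score) pairs, most fund mentions first.

    Articles without any fund mention are dropped.
    Assumption: relevance == total number of fund mentions.
    """
    scored = []
    for article_id, text in articles.items():
        score = sum(count_fund_mentions(text, fund_names).values())
        if score > 0:
            scored.append((article_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In practice an NER tagger would do a better job than plain substring matching, but the shape of the step is the same: tag, score, rank.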
From these articles, we harvested links pointing to other ETF-related sites and went to Common Crawl to gather more content. We tagged the new content with funds, selected the top N and went back to Common Crawl for more. We kept repeating this until we hit diminishing returns.
We prioritised heavily; otherwise, you quickly end up trying to download the entire internet. Our understanding was that we should have everything major covered. If a fund is launched, it is very unlikely that it won’t eventually be mentioned anywhere major. The point of the exercise was that we shouldn’t have any major blind spots.
So at this point, we were quite confident that if an article mentioned a launch, we would have it in our database. But now we have another problem:
How do you find the relevant articles?
Also known as labelling. The DQA/SME team are a bunch of crack people, but they can’t label the millions of articles that we had.
So what do you do if you can’t label many?
Label a few.
We had about 2000 articles labelled by two people each to build a consensus. If they disagreed, we reviewed the sample. With models at this level, having a clear understanding of the labelling instructions is paramount. We iterated on both the instructions and the actual labels. We started in batches of 200 with three labellers, reviewed them, updated the docs, and fixed the labels. And we kept iterating until we were confident that all labellers understood the instructions and that the 2000 articles were pristine.
This is still nothing if you want to create a good quality model. We also created another 20000 articles labelled only once. These were (just like the above 2k) selected by importance sampling based on the article’s properties: coarse topic (is it about finance at all?), how many funds are named in it, and whether it contains any of a list of relevant terms (checked with regex). So the sample is heavier on the positive labels than a purely random one, which would be only ~0.01% positive.
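The double-labelling review can be sketched roughly as follows. The function names are hypothetical, and Cohen's kappa is used here purely as an illustrative agreement measure; no particular metric is named for the original workflow.

```python
def split_by_agreement(labels_a, labels_b):
    """Split doubly labelled articles into consensus labels and a review queue.

    labels_a, labels_b: dicts mapping article_id -> 0/1 from two labellers.
    """
    consensus, to_review = {}, []
    for article_id in sorted(labels_a.keys() & labels_b.keys()):
        if labels_a[article_id] == labels_b[article_id]:
            consensus[article_id] = labels_a[article_id]
        else:
            to_review.append(article_id)
    return consensus, to_review


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labellers on binary labels."""
    shared = sorted(labels_a.keys() & labels_b.keys())
    n = len(shared)
    if n == 0:
        raise ValueError("no articles were labelled by both labellers")
    observed = sum(labels_a[i] == labels_b[i] for i in shared) / n
    rate_a = sum(labels_a[i] for i in shared) / n
    rate_b = sum(labels_b[i] for i in shared) / n
    # Agreement expected if both labelled at random with their own rates.
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        return 1.0  # both labellers used a single, identical label
    return (observed - expected) / (1 - expected)
```

Tracking an agreement number per batch is one way to tell whether an update to the labelling instructions actually helped.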
Still not enough
Ok, this is still nothing. Mind you, we had millions of articles, a very low hit rate and a 95%+ F1 score SLA.
Time to bring out the big guns.
Label up to a million articles with synthetic labels through heuristics. Components include:
Key phrases
New Funds
Labels from previously trained models
Any heuristics you can think of
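To make the components above concrete, here is a minimal sketch of heuristic labelling by voting. Everything in it is an assumption for illustration: the regex patterns, the article fields (`text`, `funds`, `prev_model_prob`), the cut-offs and the simple vote-summing rule are not the original heuristics.

```python
import re

# Hypothetical key phrases; real lists would be longer and tuned by the SMEs.
LAUNCH_PHRASES = re.compile(
    r"\b(launch(?:es|ed|ing)?|debut(?:s|ed)?)\b", re.IGNORECASE
)
CLOSURE_PHRASES = re.compile(
    r"\b(clos(?:e|es|ed|ing|ure)|liquidat(?:e|es|ed|ion)|delist(?:s|ed|ing)?)\b",
    re.IGNORECASE,
)


def key_phrase_vote(article):
    """+1 for launch wording, -1 for closure wording, 0 to abstain."""
    if CLOSURE_PHRASES.search(article["text"]):
        return -1
    if LAUNCH_PHRASES.search(article["text"]):
        return 1
    return 0


def new_fund_vote(article, known_funds):
    """+1 if the article names a fund that is not on the known list."""
    return 1 if any(f not in known_funds for f in article["funds"]) else 0


def previous_model_vote(article, low=0.2, high=0.8):
    """Vote only when an earlier model was confident either way."""
    prob = article.get("prev_model_prob")
    if prob is None:
        return 0
    if prob >= high:
        return 1
    if prob <= low:
        return -1
    return 0


def synthetic_label(article, known_funds):
    """Sum the votes: 1 = positive, 0 = negative, None = no label assigned."""
    total = (
        key_phrase_vote(article)
        + new_fund_vote(article, known_funds)
        + previous_model_vote(article)
    )
    if total > 0:
        return 1
    if total < 0:
        return 0
    return None
```

Articles that end up with `None` are simply left out of the synthetic training set in this sketch; the individual heuristics are allowed to be noisy because the model trains on their combination at scale.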
Use the synthetic labels to train a (deep) model. Use the 2k articles for performance validation.
Of course, the model won’t be perfect, but it will have a good Recall with a low gating threshold. The model predicts probabilities, and it was up to us to set the threshold for whatever goal we wanted to achieve. If we set the threshold low, we will have high Recall but low Precision. At least we know we have all the relevant articles. Now run this model _on the entire corpus_ to find any article that can ever be relevant.
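Picking such a gating threshold can be sketched like this: on the validation set, find the highest threshold that still meets a target Recall. The function names and the 0.99 default are illustrative assumptions, not the values we used.

```python
def recall_precision_at(threshold, probs, labels):
    """Recall and Precision when predicting positive for prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


def highest_threshold_for_recall(probs, labels, target_recall=0.99):
    """Return the highest threshold whose validation Recall meets the target.

    Falls back to 0.0 (select everything) if no threshold qualifies.
    """
    # Only the observed probabilities can change the outcome, so try those,
    # from the strictest threshold down to the most permissive.
    for threshold in sorted(set(probs), reverse=True):
        recall, _ = recall_precision_at(threshold, probs, labels)
        if recall >= target_recall:
            return threshold
    return 0.0
```

The Precision at that threshold tells you how much irrelevant material comes along for the ride.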
At this point, we were confident that we had all the articles from our corpus and Common Crawl that were relevant to us.
Still need more
Use the 20k article set to find new ideas on how to update the synthetic data generation. Select the articles where the model’s prediction and the 20k set’s label differ. This should be a much smaller set, and it identifies three things:
Incorrectly labelled articles: Either the labelling is plain wrong, the instructions didn’t cover an edge case, or the instructions are wrong. One thing I learned is that nothing is straightforward in NLP, and people are very creative in how they write about the same thing.
Errors in the assumptions of the synthetic labelling.
New ideas.
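Selecting the articles to review can be as simple as the sketch below. The data layout (dicts keyed by article id) and the 0.5 threshold are assumptions; sorting by how far the probability sits from the threshold is just one way to put the most confident disagreements at the top of the review queue.

```python
def find_disagreements(labels, probs, threshold=0.5):
    """Return (article_id, human_label, model_prob) where model and human differ.

    labels: dict article_id -> 0/1 (single human label)
    probs:  dict article_id -> model probability of the positive class
    The most confident disagreements come first.
    """
    rows = []
    for article_id, label in labels.items():
        if article_id not in probs:
            continue  # article was never scored by the model
        prob = probs[article_id]
        predicted = int(prob >= threshold)
        if predicted != label:
            rows.append((abs(prob - threshold), article_id, label, prob))
    rows.sort(key=lambda row: row[0], reverse=True)
    return [(article_id, label, prob) for _, article_id, label, prob in rows]
```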
This turned out to be a very productive process, so we kept growing the original 20k articles (with single labels) and kept reviewing the ones where they differed from the machine’s labels.
At this point, this was still a POC. About 2 DS, 5 DQA+SME and 2 SWEs spent about a month on it. We had a lot of kit to reuse, but most of the modelling and data pipeline code was custom (based on our own in-house tools). As you can imagine, you need to maintain high-quality coding practices to pull off a convoluted workflow like this, or you spend all of your time tearing out your hair struggling with pandas in notebooks. We needed to figure out Common Crawl, and engineering was actually required to set up the crawls for the live sites.
Are we there yet?
So that’s nice, but how do we know we pass the very stringent SLA? Technically the client can’t check it because they don’t have a better model, but they _will_ ask us at the meetings, and we had better have some evidence.
So we took a period of 5 days where we knew it was “Launch Season”, so we expected a lot of relevant articles. We selected all the articles that had “ETF” or “Exchange Traded Fund” (literally not even regex, just plain old Python “in”) and hand-labelled all of them.
This was not that many articles, but still a considerable effort (~50k if I remember well). It was also very biased as they all came from the same period, so we couldn’t use them for future training. We only did this once during the entire project.
The results were beyond our expectations. The performance indicated that we had almost perfect Recall during the period at a reasonable Precision. But we will deal with Precision later.
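For the record, the selection really was about this simple. The sketch below assumes each article is a dict with `id`, `published` and `text` fields; the field names and the date window are hypothetical.

```python
from datetime import date

KEYWORDS = ("ETF", "Exchange Traded Fund")


def mentions_etf(text):
    """Plain substring check, deliberately no regex and no normalisation."""
    return any(keyword in text for keyword in KEYWORDS)


def select_for_hand_labelling(articles, start, end):
    """Return ids of articles published in [start, end] that mention an ETF."""
    return [
        article["id"]
        for article in articles
        if start <= article["published"] <= end and mentions_etf(article["text"])
    ]
```

The filter is intentionally crude: the goal is an evaluation set that does not depend on the model being evaluated.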
Time to go live
From all of the above, you can imagine that our main headache was Recall. You can’t label an article positively that you have never seen. This left Precision as a secondary goal, but one that still needed to be tackled.
We needed much more labelled data to fix the Precision problem (if it is possible at all), and we were pressed to go live ASAP.
Luckily the problem is not very time-sensitive; this is not high-frequency trading. This allowed us to use partial automation, also known as “Human-In-The-Loop”. The model assigns a probability to each article, and we select the ones above a certain (low) threshold. This still selects a lot of articles in order to catch all the positive ones (high Recall), but the set is feasible to label manually each day. Originally this was at most in the low hundreds a day; fund launches are actually pretty rare. We assumed that the human-labelled articles have near 100% Precision. The DQA/SME team were pretty good at this by this time.
Our DSes analysed the daily sets and the articles below the selection threshold. These labelled articles were added to the original 20k articles and used in subsequent synthetic label generation and model iterations.
After a short period of time, we were able to assess the cost of the daily labelling and set it against our business objectives. Setting a low threshold for selection gave high Recall (the key business objective) but selected many irrelevant articles that the DQA team needed to remove manually, leading to higher ongoing labour costs. Through careful analysis, we understood we could raise this threshold without hurting Recall and save the DQA team a lot of time. As the labelled set grew and the DS team understood the edge cases better, the model performance improved (each time it went through the process above).
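The kind of analysis behind that decision can be sketched as a simple sweep over candidate thresholds, reporting the review workload next to the Recall on already labelled articles. The `minutes_per_article` figure is a made-up parameter for illustration, not a number from the project.

```python
def review_load_by_threshold(probs, labels, thresholds, minutes_per_article=1.0):
    """For each threshold, report how many articles need review and the Recall.

    probs:  model probabilities for already labelled articles
    labels: their human labels (0/1)
    """
    total_positives = sum(labels)
    rows = []
    for threshold in thresholds:
        selected = [y for p, y in zip(probs, labels) if p >= threshold]
        found = sum(selected)
        rows.append(
            {
                "threshold": threshold,
                "to_review": len(selected),
                "recall": found / total_positives if total_positives else 0.0,
                "review_minutes": len(selected) * minutes_per_article,
            }
        )
    return rows
```

Reading the rows from low to high threshold shows where the review queue shrinks while Recall stays flat, which is the region where raising the threshold is essentially free.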
What about MLOps?
This was pretty much before the MLOps era. We used a very early stage Prefect+Dask for workflow orchestration on very large instances (on AWS with dozens of cores and hundreds of gigabytes of memory). This allowed us to avoid dealing with orchestration. The workflow was not different from launching a script on your local machine.
All code was versioned and of very high quality, as you can probably figure out by now from the topics in this blog. Data and models were immutable; they were identified by the git hash of the code that generated them. Manually labelled data was treated as “Slowly Changing Dimensions”, essentially as temporal data.
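One way to get that kind of immutability is to key every artefact by the commit that produced it and refuse to overwrite. This is a minimal sketch, not our in-house tooling: the directory layout and the helper names are assumptions.

```python
import subprocess
from pathlib import Path


def current_git_hash():
    """Return the hash of the currently checked-out commit (needs a git repo)."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def artifact_path(root, name, git_hash):
    """Hypothetical layout: <root>/<git hash>/<artefact name>."""
    return Path(root) / git_hash / name


def write_once(path, data):
    """Write bytes to path, raising FileExistsError instead of overwriting."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "xb") as handle:  # "x" = exclusive creation
        handle.write(data)
```

A typical call would be `write_once(artifact_path("artifacts", "model.bin", current_git_hash()), payload)`; rerunning the same code on the same commit then fails loudly rather than silently replacing a model that is already in use.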
The models were handed to Engineering, who deployed them as regular services and logged a lot of information (particularly the git hash of models for data lineage) at our request for performance tracking.
Conclusions
Before the product launch, the client severely underestimated the amount of existing content that’s out there (mind you, this was their job) and the amount of resources they needed. “Two dudes here and in New York” was the exact quote. We estimated that you need at least 20 people to manually review all the major sources, which still restricts you to a small part of the universe. There is just no chance of manually checking hundreds or even thousands of web pages and finding small, detailed pieces of information.
Our partial automation resolved this problem in 1-2 hours of manual work for one person. The client regularly received articles from us that they had never heard of before, identified new opportunities and, in general, had the confidence and peace of mind of a “complete picture” of the topic. The core project took about two months from start to finish. It took longer to negotiate the contract than to actually do the deed.
Once we turned this into a repeatable process, we eventually had many, many of these models on different topics, with similar success. The team’s (including the DSes’) muscle memory on how to solve these problems improved over time until we could pick up these incredibly difficult jobs with a high predicted chance of success.
It was excellent work.