In online conversations, “rule-based systems” are often raised as a good starting point for an ML project, usually in opposition to a “fancy” and “complicated” model.
I understand that anything goes in online conversations, but this advice has troubling implications. We will examine them in this post:
Solution-based rather than problem-based thinking
Data Scientists’ job is to answer questions with statistical tools.
Sometimes these answers take the form of models, but that is not a necessity. When a question is raised, Data Scientists concern themselves first with the problem and not the answer. This requires them to build a framework (out of code and data) in which subsequent answers can be validated.
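As one minimal sketch of what such a framework could look like (the dataset, column names, and thresholds below are invented), the same evaluation code should be able to score any candidate answer, whether it is a hand-written rule or a trained model:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Invented data standing in for the real problem.
rng = np.random.default_rng(0)
X = pd.DataFrame({"price": rng.uniform(0, 2000, 500),
                  "pages": rng.integers(1, 50, 500)})
y = (X["price"] + 20 * X["pages"] + rng.normal(0, 200, 500) > 1200).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def evaluate(predict):
    """Score any candidate answer -- rule or model -- on the same held-out data."""
    return f1_score(y_test, predict(X_test))

def rule(df):
    return (df["price"] > 1000).astype(int)            # a hand-written rule

model = LogisticRegression().fit(X_train, y_train)     # a trained model

print("rule  f1:", evaluate(rule))
print("model f1:", evaluate(model.predict))
```

The framework, not the particular answer, is the deliverable: once it exists, rules and models compete on equal terms.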
Stating upfront that a rule-based system will suffice as a baseline misses the point of building this framework. Just like focusing on “fancy” models, it jumps straight to conclusions.
Rule-based systems are simple
They sound like such a simple idea, don’t they?
You start with something like “if price > $1000 then TRUE else FALSE”. Then you add logical operators like AND, OR and NOT. Then you tweak the constants for a presentation. Then you add loops and branching to simplify the logical expressions.
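To make the escalation concrete, here is a rough Python sketch; the predicates and thresholds are made up for illustration:

```python
# Hypothetical rules for a hypothetical catalogue of items.
def expensive(item):
    return item["price"] > 1000        # the original, honest rule

def long_book(item):
    return item["pages"] > 300         # the constant you tweak before the demo

# The "simple" logical operators arrive shortly after.
def AND(*rules):
    return lambda item: all(rule(item) for rule in rules)

def OR(*rules):
    return lambda item: any(rule(item) for rule in rules)

def NOT(rule):
    return lambda item: not rule(item)

# And soon every rule is built out of other rules.
flag = OR(AND(expensive, NOT(long_book)), long_book)
print(flag({"price": 1200, "pages": 150}))   # True
```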
And Bang! Your DSL (Domain-Specific Language) is Turing complete. Raise your hand if you haven’t done this before. Is this what rule-based systems are? I was promised simplicity and ended up reimplementing Python. I might hand the project over to the Software Engineering team; they are much better at writing code. They will surely be able to resist the urge to write a new language (who am I kidding, they won’t, but then it will be their problem).
Again, Data Science is about statistical analysis, not building a model. No statistical questions? You are probably working on the wrong problem.
Models are hard
OK, let’s say you have prepared and built a framework in which you can validate your models. You start looking for solutions. Surely someone else has had this problem before?
Implementing models from scratch is indeed hard. You need to deal with edge cases, computational performance, interfaces and so on. On the other hand, base models are highly reusable, and over the last 5-10 years a vast amount of resources has gone into implementing them.
The biggest recent advance in the field has been the easy accessibility of high-quality, standardised models.
What are your options then?
You can use pre-trained models, such as those shipped with spaCy, hosted on Hugging Face, or trained on ImageNet, to solve standard tasks. Or you can pick ones primed for transfer learning, where you only need to add a finishing layer to adapt them to your problem.
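As a sketch of the transfer-learning route, here is roughly what “adding a finishing layer” looks like with torchvision’s pre-trained ResNet-18; the class count and the decision to freeze the backbone are assumptions, not a recipe:

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 backbone pre-trained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained weights so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with one sized for your (hypothetical) 3 classes.
num_classes = 3
model.fc = nn.Linear(model.fc.in_features, num_classes)

# From here, train model.fc on your own, much smaller dataset as usual.
```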
You have the *BOOST models (XGBoost, LightGBM, CatBoost) for structured data, and there is nothing wrong with building a baseline model with sklearn’s logistic regression.
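Such a baseline can be a handful of lines; the built-in dataset below is only a stand-in for your own tabular data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; swap in your own features and labels.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling + logistic regression: a perfectly respectable first model.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))
```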
In general, the world has moved on, and something that qualified as a baseline model 5 years ago doesn’t cut it now. In a field progressing as rapidly as machine learning, you need to update not only your skills but also your beliefs about what counts as a reasonable first try.