TDD or not TDD: How to write tests in Data Science

2022-11-29

This post was inspired by Michaël Azerhad's LinkedIn post on TDD [link]. It matches my experience, yet the professional SWEs commenting on it disagree entirely.

When I was working on mobile games, we were trained by a company on the psychology of gaming. That's where my knowledge of Self-Determination Theory comes from.

Self-Determination Theory tells you that to stay motivated on a subject, three things need to happen: Mastery, Autonomy and Relatedness. Mastery means that while you are doing the activity, you need to sense that you are getting better at it (this is what you lose, for example, with frustrating UIs). Autonomy means you can decide how to achieve your goals. Relatedness means that you feel connected to the activity and its purpose and can communicate that to a broader audience: "Be proud of what you are doing".

Now let's take, for example, an open-world RPG. It is very hard to play and needs a lot of knowledge, skills etc. How do you allow new players to experience the three components? They will be frustrated with their incompetence and will drown in the content they have to learn and practice. If you don't keep them motivated early on, they will lose interest and churn before they reach the part of the game where the three components kick in and the richest gaming experience resides.

So what do the game designers do? "Fake it till you make it." They artificially design the early stage of the game (not just the onboarding) so that the stuff to learn arrives in a consumable sequence and achievements are inflated, generating a "fake" Mastery effect.

Why is this relevant to TDD? When someone starts learning modern programming techniques, everything is a problem all at once: testing, architecture, SOLID, decoupling, design patterns, readability, naming, conventions etc. All of them.

This is hugely frustrating and puts off many people, and they just go back to their old habits. After all, technical debt is someone else's problem down the line.

If we want to spread this knowledge, we need to think about Self-Determination Theory. Which parts are the best to start with? Which ones can be omitted until the practitioner matures and can learn them by themselves?

Writing tests is usually the first thing that is introduced. But tests will not deliver value on their own. One must know about layers of abstraction, decoupling and structuring to see immediate, tangible value. So that's where we should start, and "fake" testing by alternative means. [more on that in the next part, follow me to be notified]
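To make "decoupling" a bit more concrete, here is a minimal sketch in Python. The function names, the CSV columns (`user_id`, `amount`) and the calculation are all hypothetical and made up for illustration; this is one possible reading of the idea, not necessarily the "alternative means" the next part will cover.

```python
import csv
from statistics import mean


def load_rows(path):
    """I/O layer (hypothetical): read (user_id, amount) pairs from a CSV file."""
    with open(path, newline="") as f:
        return [(row["user_id"], float(row["amount"])) for row in csv.DictReader(f)]


def average_spend_per_user(rows):
    """Logic layer (hypothetical): a pure function, no files or databases involved."""
    amounts_by_user = {}
    for user_id, amount in rows:
        amounts_by_user.setdefault(user_id, []).append(amount)
    return {user_id: mean(amounts) for user_id, amounts in amounts_by_user.items()}


# Because the logic does not touch the file system, it can be checked
# with a tiny in-memory input instead of a real dataset.
sample = [("a", 10.0), ("a", 20.0), ("b", 5.0)]
result = average_spend_per_user(sample)
assert result == {"a": 15.0, "b": 5.0}
```

Once the calculation is separated from the loading code like this, a quick check on three rows already gives immediate feedback, which is the kind of early "Mastery" effect discussed above.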

In our (free) CQ4DS community (550+ members and growing!), we design study material in a way that immediately brings value to you, and you can pick up the details later. Join today:

https://discord.com/invite/8uUZNMCad2

Chase Sommer

Dec 22, 2022

Not sure if you use dbt much at all, but there’s a decent amount of testing baked into the modeling. I also found a dope package that extends the testing functionality of your data models: https://github.com/calogica/dbt-expectations

Not directly data science per se, but it’s testing the underlying data.

Deliberate Machine Learning