How can a Data Scientist refactor Jupyter…

Oct 24, 2021

The goal of maintaining production-quality code is that you can change your codebase with confidence, knowing it still adheres to already assumed specifications. This allows you to respond to new requests, attempt new solutions and back down from them if you don’t get it right.

Read →

7 Comments

Siggi Smarason

Jun 4, 2022

Thanks for mentioning Typer! I came across it years ago when I was first starting out with FastAPI but never thought about it too much. I should have given it more attention given the quality of what tiangolo does but this was a welcome reminder to re-aquaint myself with it.

Expand full comment

Reply (1)

Laszlo Sragner

Jun 4, 2022

It's the first thing I do in any projects.

Expand full comment

Alex

Jun 11, 2022Edited

Hello, Laszlo. Thank you for the material, I have a couple of questions.

1. What exactly do you assert, comparing "y" with the optimal model results pkl file? Model can produce metrics, feature importances, predicted values, etc. What should I consider?

2. In your assertion code "y" stands for predicted labels, I assume. So it returns false if one of predicted labels is not equal to the one from the ideal-model (pkl file). But how can it help in testing? I won't even know which value is wrong and, especially, why is it so.

3. You have get_names methods in your TestNameLoader and SQLNameLoader classes. I am pretty sure, it breaks one of the OOP principles.

4. Same for the Process's run() and the Typer's run() methods.

5. I don't see the purpose of the TestNameLoader class. It only saves the dataset into a pickle file and returns 0-sampleCount rows from it. Shouldn't you name it "Dataset_file_loader" instead?

Maybe I misconcept something.

Expand full comment

Reply (1)

Laszlo Sragner

Jun 11, 2022

1) Anything you want to make sure that needs to stay the same if you refactor (which means changing the code without changing its behaviour). This also needs to be deterministic and computationally stable. If it is not, pick a different value or even do a processing step on it.

2) I am not comparing the predicted labels to the ground truth but _to a previous run of the same code_ a very important difference. Refactoring aims to make your code more flexible and easier to read, not to improve model performance. Why do you do it anyway? Because it is easier to make _model improvements_ on clean and refactored code.

3) This is the whole point of interface programming that the "Process" class doesn't know which is plugged in. It just knows that it has a "get_names" function that will give it "names" objects. So it is not an error. They _must_ have the same name.

4) This is not mandatory as the two classes have different roles. They are not implementing the same interface. You can argue that it is sloppy to name them the same, but I think it is pretty descriptive of what they do: "They run the main body of each class".

5) The point of TestNameLoader is that you can get names from a file and don't need to go to the database (unless you don't want to). The point of refactoring is to run the refactored code continuously at every minor change. This needs that the main part of the code runs fast. This is impossible if you need to connect to a database each time. So you need to create an alternative source. It also has the additional benefit that the main part of your code will not be dependent on where the data is coming from. This is a desirable property called "decoupling".

Great questions. Thanks for your feedback. The best way to explain these concepts is through reader questions. Unfortunately, I get very few, so I celebrate each one.

Expand full comment

Reply (2)

Alex

Jun 11, 2022

Thank you.

1) Agree

2) I still can't see the usability of the assert_almost_equal() method. This may be a very basic approach, just to know that some results are different. But I suppose, there should be a more sophisticated, quantitative implementation of the comparison (e.g. I have ROCAUC and F1 scores in my pkl file and I can compare how much my new results are different from the baseline).

3) I see now

4) I see your arguments, I think it breaks the "Explicit is better than implicit" principle. No matter.

5) Yes, I can see that. Isn't just the class's name misleading? This class deals with files, not with testing, ain't it?

Expand full comment

Reply (1)

Laszlo Sragner

Jun 11, 2022

2) This is purely to see that between too changes of the code the output didn't change. Just like when you rerun unittests. How do you know it still does what it did before? Because the tests are passing, they create the same output as compared in assertions.

Now if you change your code to improve the behaviour you _expect_ these tests fail. So given which one is expected to fail you can verify that you changed your code just like you wanted and the improvement (or deterioration) was the result of your last change.

Let's say you want to improve results by changing a processing function in post-processing. Suddenly a test comparing inputs after _pre-processing_ fails. That is definitely strange and not as intended. Now you need to investigate what's the reason. (For example your pre and post processing is coupled through some functions).

But if you verify that only the right tests failed and the overall performance improved then you can delete the relevant cached pkl-s and rerun your code (without any changes). This will recreate your reference files to compare against and now you are in a stable state again.

And the cycle continue.

5) Yes, I can name it "MockNameLoader" or "FileNameLoader" these are indeed probably better names. (Which can be easily done if you set this up as renaming is trivial in this framework :)

Expand full comment

Laszlo Sragner

Jun 11, 2022

I share your questions and my answers as a post if you don't mind.

Expand full comment

Deliberate Machine Learning

How can a Data Scientist refactor Jupyter…