CQ4DS - Python project from scratch with poetry, black, ruff, pytest, pre-commit-hooks and GitHub Actions in 15 minutes tops
2023-08-03
Data Scientists often struggle with setting up their own projects and then miss out on the convenience and automation that good tooling brings to their workflow. They also never build the muscle memory that comes from using these tools regularly, which will hinder them in the long term.
So here is our set of tools to set up:
poetry: Sort out virtual environments for good and keep all your project definitions in pyproject.toml.
black: You can have any code style as long as it is black. Stop worrying about formatting your code.
ruff: A new, blazing-fast linter; anything that black doesn't care about gets fixed here.
pytest: Testing, duh...
pre-commit-hooks: Automate all of the above and forget about them.
GitHub Actions: Run these on the remote as well, just to be sure.
It doesn't take more than 15 minutes to set these up, and you will benefit from them for the rest of your project. Don't wait until your project matures and you have more problems.
Poetry - virtual environment
poetry new -n --src <project>
cd <project>
poetry config virtualenvs.in-project true
poetry env use python3.10
source .venv/bin/activate
poetry add black ruff pytest pre-commit
This should sort out the basics and create a virtual environment.
You can run your Python scripts with:
poetry run python <python_script.py>
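As a quick sanity check that `poetry run` really uses the project's virtual environment, a small script along these lines can help. This is a minimal sketch: the file name `check_env.py` and the helper `in_virtualenv` are made up for illustration, and it relies only on the standard library.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print(f"interpreter: {sys.executable}")
    print(f"inside a virtualenv: {in_virtualenv()}")
```

Saved as `check_env.py` (a hypothetical name), it would be run with `poetry run python check_env.py`; with the in-project setting above, the interpreter path should sit under `.venv`.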
Black and Ruff - code formatting and linting
To set up black and ruff, add this to pyproject.toml.
[tool.black]
skip-string-normalization = true
line-length = 120

[tool.ruff]
# Same as Black.
line-length = 120
exclude = ["jupyter_notebook_config.py"]
select = [
    "E",  # pycodestyle errors (settings from FastAPI, thanks, @tiangolo!)
    "W",  # pycodestyle warnings
    "F",  # pyflakes
    "I",  # isort
    "C",  # flake8-comprehensions
    "B",  # flake8-bugbear
]
ignore = [
    "E501",  # line too long, handled by black
    "C901",  # too complex
]

[tool.ruff.isort]
order-by-type = true
relative-imports-order = "closest-to-furthest"
extra-standard-library = ["typing"]
section-order = ["future", "standard-library", "third-party", "first-party", "local-folder"]
known-first-party = []
This should take care of the basics of linting and formatting. Set line length to your convenience. The above shows examples of some specific options; extend them based on reading their documentation.
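To give a feel for what the selected rule families catch, here is a small before/after sketch. The function names are invented for illustration, and the rule codes in the comments (B006 from flake8-bugbear, C400 from flake8-comprehensions) are my reading of those plugins' documentation, so check them against the ruff rule list.

```python
# Before: two patterns the "B" and "C" rule families are meant to flag.
def collect_squares_before(values, seen=[]):  # B006: mutable default argument
    seen.extend(values)
    return list(v * v for v in seen)  # C400: generator passed to list()


# After: no state shared between calls, and a plain list comprehension.
def collect_squares(values, seen=None):
    seen = [] if seen is None else seen
    seen.extend(values)
    return [v * v for v in seen]
```

The "before" version keeps appending to the same default list across calls, which is exactly the kind of quiet bug a linter is good at spotting early.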
Git - version control
Create .gitignore and add this (there are better examples on GitHub).
.env
.venv/
__pycache__/
Then create an empty repository on GitHub and continue with the following commands:
git init
git add .
git commit -m "first commit"
git branch -M main
git remote add origin git@github.com:<username>/<project>.git
git push -u origin main
Pytest - testing
Your GitHub workflow will fail if you have no tests, because pytest exits with a non-zero status when it collects none. So create a “tests” directory with a “test_hello_world.py” file in it and add this placeholder test:
from unittest import TestCase


class TestHelloWorld(TestCase):
    def test_upper(self):
        self.assertEqual("hello world!".upper(), "HELLO WORLD!")
You can run your tests with:
poetry run pytest
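pytest also collects plain functions whose names start with `test_` and reports failures from bare `assert` statements, so the same kind of check can be written without a `TestCase` class. A minimal sketch, where `shout` is a hypothetical helper standing in for your own code:

```python
def shout(text: str) -> str:
    """Hypothetical function under test: upper-case a greeting."""
    return text.upper()


def test_shout_uppercases_text():
    assert shout("hello world!") == "HELLO WORLD!"


def test_shout_leaves_uppercase_unchanged():
    assert shout("HELLO") == "HELLO"
```

Both styles can live in the same “tests” directory, so you can pick whichever reads better to you.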
Pre-commit hooks - automation
Create a “.pre-commit-config.yaml” file in your main repository directory and add this content to it:
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/python-poetry/poetry
    rev: '1.5.1'
    hooks:
      - id: poetry-check
  - repo: local
    hooks:
      - id: black
        name: black
        entry: poetry run black
        language: system
        types: [python]
      - id: ruff
        name: ruff
        entry: poetry run ruff . --fix
        language: system
        types: [python]
This runs black and ruff each time you make a commit and fixes the errors they can fix automatically. It also handles annoying chores like stripping trailing whitespace and fixing end-of-file newlines.
Install these by running this command in your terminal:
pre-commit install
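To make those chores concrete, here is a rough Python sketch of what the trailing-whitespace and end-of-file-fixer hooks do to a file's text. `tidy_text` is a hypothetical helper written for illustration; it is not the hooks' actual implementation, which handles more edge cases.

```python
def tidy_text(text: str) -> str:
    """Strip trailing whitespace per line and end the text with exactly one newline."""
    # Remove spaces and tabs left at the end of each line.
    lines = [line.rstrip() for line in text.splitlines()]
    # Drop blank lines at the very end of the file.
    while lines and lines[-1] == "":
        lines.pop()
    # Re-join and finish with a single newline (an empty file stays empty).
    return "\n".join(lines) + "\n" if lines else ""
```

Having the hooks do this on every commit keeps whitespace-only noise out of your diffs.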
GitHub Actions - automate remote
This is the last step. Once everything is running on your machine, make sure it works on others' machines as well. Create a “.github” directory with a “workflows” directory inside it, then create a “python-app.yml” file in “.github/workflows” and fill it with the following:
name: Python application

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

permissions:
  contents: read

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.10
        uses: actions/setup-python@v3
        with:
          python-version: "3.10"
      - name: Install poetry
        run: |
          python -m pip install poetry
      - name: Configure poetry
        run: |
          poetry config virtualenvs.in-project true
      - name: Cache the virtualenv
        uses: actions/cache@v2
        with:
          path: ./.venv
          key: ${{ runner.os }}-venv-${{ hashFiles('**/poetry.lock') }}
      - name: Install dependencies
        run: |
          poetry install
      - name: Lint with ruff
        run: |
          poetry run ruff .
      - name: Run tests
        run: |
          poetry run pytest
And that’s it. Next time you create a PR or push code, GitHub will run the linter and the tests on its servers to ensure your code is reproducible.
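One detail worth understanding in the workflow is the cache key: it combines the runner OS with a hash of poetry.lock, so the cached “.venv” is reused until your dependencies change. The idea can be sketched in Python as follows; `venv_cache_key` is a hypothetical helper, and the SHA-256 call only approximates what GitHub's `hashFiles` expression does.

```python
import hashlib


def venv_cache_key(runner_os: str, lock_contents: bytes) -> str:
    """Build a cache key that changes whenever the lock file's contents change."""
    # Assumption: a plain SHA-256 of the file stands in for hashFiles('**/poetry.lock').
    digest = hashlib.sha256(lock_contents).hexdigest()
    return f"{runner_os}-venv-{digest}"
```

The same lock file always yields the same key, so unchanged dependencies hit the cache, while any edit to poetry.lock produces a new key and triggers a fresh install.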
Summary
In this short post, you learnt how to set up poetry, black, ruff, git, pytest, pre-commit and GitHub Actions. With some practice, it shouldn’t take more than 15 minutes, and you have the entire project lifecycle to enjoy the benefits.
If you enjoyed this content, take a look at my presentations on code quality for Data Scientists:
PyData London 2023: Code Smells in Data Science: What can we do about them?
PyData London 2022: Clean Architecture: How to structure your ML projects to reduce technical debt
Or join our community on the topic at: https://cq4ds.com/
So very close to how I set things up. I hadn't really considered extending line-length from 88...but I might, now! Lots of little things I hadn't seen, like half the ruff configs.
A few things I do differently:
- install all those tools with `--group dev`; I don't want pytest, ruff etc. in production builds, even if they don't add much
- with pre-commit in the environment, you don't need `poetry run` in the commit-hooks
- does your ruff hook work as intended? does the hook fail if it finds an error it can't autofix? I ended up needing `ruff check --force-exclude --fix --exit-non-zero-on-fix` as well as `require_serial: true` though I can't remember why I needed force-exclude or require_serial
- let unittest stay dead. I still use classes with pytest, but I don't need to inherit a special class, and I can use plain `assert`s instead of the weird unittest functions
You can replace "black" with "ruff-format", as it's built into ruff (can configure in pyproject.toml):
https://docs.astral.sh/ruff/formatter/#black-compatibility
Then you can replace the black pre-commit hook with this one:
```
- id: ruff-format
```