Inspired by the original cookiecutter-data-science template. This template is more opinionated about the tools used. It has been updated to better fit machine learning-based projects and is used as the core template in this MLOps course.
- Python 3.11 or higher
- cookiecutter version 2.4.0 or higher
Start by creating a repository, either through the GitHub web interface or, if you have it set up, with the GitHub command line interface:
gh repo create <repo_name> --public --confirm
Afterwards, on your local machine, run
cookiecutter https://github.com/SkafteNicki/mlops_template
and enter the starting values for the project. When prompted for the repository name, enter the same name you used when creating the repository. When prompted for the project name, enter a valid Python package name: all lowercase, containing only letters, numbers and underscores. The project name is used as the name of the Python package, and the template validates it automatically.
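The naming rule above can be checked by hand before running the template. The following is a minimal sketch of such a check, not the template's actual validation code: the function name `is_valid_project_name` is hypothetical, and rejecting a leading digit and Python keywords are extra assumptions added because package names must be importable identifiers.

```python
import keyword
import re

# Assumed rule, taken from the description above: lowercase letters,
# digits and underscores only. Disallowing a leading digit is an extra
# assumption, since Python identifiers cannot start with a digit.
_PACKAGE_NAME = re.compile(r"[a-z_][a-z0-9_]*")


def is_valid_project_name(name: str) -> bool:
    """Return True if `name` looks usable as a Python package name."""
    if keyword.iskeyword(name):
        # Names such as "class" match the pattern but cannot be imported.
        return False
    return _PACKAGE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["my_project", "My-Project", "2fast", "mlops_2024"]:
        print(f"{candidate!r}: {is_valid_project_name(candidate)}")
```

Names with uppercase letters or hyphens, such as `My-Project`, fail this check and would need to be rewritten, for example as `my_project`.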
To commit the project and push it to the remote repository, execute the following commands:
cd <repo_name>
git init
git add .
git commit -m "init cookiecutter project"
git remote add origin https://github.com/<username>/<repo_name>
git push origin master
When the project is created, the repository will have the following structure:
├── .github/                  # GitHub actions and dependabot
│   ├── dependabot.yaml
│   └── workflows/
│       └── tests.yaml
├── configs/                  # Configuration files
├── data/                     # Data directory
│   ├── processed
│   └── raw
├── dockerfiles/              # Dockerfiles
│   ├── api.Dockerfile
│   └── train.Dockerfile
├── docs/                     # Documentation
│   ├── mkdocs.yml
│   └── source/
│       └── index.md
├── models/                   # Trained models
├── notebooks/                # Jupyter notebooks
├── reports/                  # Reports
│   └── figures/
├── src/                      # Source code
│   └── project_name/
│       ├── __init__.py
│       ├── api.py
│       ├── data.py
│       ├── evaluate.py
│       ├── models.py
│       ├── train.py
│       └── visualize.py
├── tests/                    # Tests
│   ├── __init__.py
│   ├── test_api.py
│   ├── test_data.py
│   └── test_model.py
├── .gitignore
├── .pre-commit-config.yaml
├── LICENSE
├── pyproject.toml            # Python project file
├── README.md                 # Project README
├── requirements.txt          # Project requirements
├── requirements_dev.txt      # Development requirements
└── tasks.py                  # Project tasks
In particular, let's explain the structure of the src folder, as that is arguably the most important part of the repository. The src folder is where the main code of the project is stored. The template divides the code into six files, shown in the diagram below with their respective connections:
- data.py: this file is responsible for everything related to the data. This includes loading, cleaning, and splitting the data. If the data needs to be pre-processed, then running this file should process the raw data in the data/raw folder and save the processed data in the data/processed folder.
- model.py: this file contains one or more model definitions.
- train.py: this file is responsible for training the model. It should import the training/validation data interface from data.py and the model definition from model.py.
- evaluate.py: this file is responsible for evaluating the model. It should import the test data interface from data.py and load the trained model from the models folder. Output should be performance metrics of the trained model.
- api.py: this file is responsible for serving the model. It should import the trained model from the models folder and provide an interface for making predictions.
- visualize.py: this file is responsible for visualizing the data and model. It should import the training/validation/test data interface from data.py and the trained model from the models folder. Output should be visualizations of the data and model.
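To make the division of responsibilities concrete, here is a minimal single-file sketch of how the data, model, training and evaluation pieces could connect. It is an illustration only, not the template's code: every name in it (`load_splits`, `SlopeModel`, `train`, `evaluate`, the `model.json` file) is hypothetical, the "model" is a one-parameter line fitted in plain Python rather than a PyTorch model, and a temporary directory stands in for the models folder.

```python
import json
import tempfile
from pathlib import Path


# --- would live in data.py ---------------------------------------------
def load_splits():
    """Return (train, test) lists of (x, y) pairs; toy data for y = 2x."""
    points = [(float(x), 2.0 * x) for x in range(10)]
    return points[:8], points[8:]


# --- would live in model.py --------------------------------------------
class SlopeModel:
    """Hypothetical one-parameter model: y = weight * x."""

    def __init__(self, weight: float = 0.0):
        self.weight = weight

    def predict(self, x: float) -> float:
        return self.weight * x


# --- would live in train.py --------------------------------------------
def train(model_dir: Path) -> Path:
    """Fit the model on the training split and save it to `model_dir`."""
    train_data, _ = load_splits()
    model = SlopeModel()
    # Closed-form least squares for a line through the origin.
    model.weight = sum(x * y for x, y in train_data) / sum(
        x * x for x, _ in train_data
    )
    path = model_dir / "model.json"
    path.write_text(json.dumps({"weight": model.weight}))
    return path


# --- would live in evaluate.py -----------------------------------------
def evaluate(model_path: Path) -> float:
    """Load the saved model and return its MSE on the test split."""
    _, test_data = load_splits()
    model = SlopeModel(**json.loads(model_path.read_text()))
    return sum((model.predict(x) - y) ** 2 for x, y in test_data) / len(test_data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:  # stands in for models/
        saved = train(Path(tmp))
        print("test MSE:", evaluate(saved))
```

The point of the sketch is the direction of the dependencies: training and evaluation both read from the data interface, and evaluation only sees the model through the file that training saved.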
At some point, one or more of the files may grow too large and complicated. When that happens, it is recommended to split the file into multiple files and move them into a folder of the same name. As an example, consider model.py containing many models. In this case it would be a good idea to refactor into
src/
└── project_name/
    ├── __init__.py
    ├── models/
    │   ├── __init__.py
    │   ├── model1.py
    │   └── model2.py
    ├── ...
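One way to keep existing imports working after such a split is to re-export the models from the new folder's `__init__.py`. The script below is a sketch of that idea under stated assumptions: it builds a throwaway copy of the layout in a temporary directory, and the class names `Model1` and `Model2`, as well as the helper functions, are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_models_package(root: Path) -> None:
    """Write a toy project_name/models/ package under `root`."""
    pkg = root / "project_name" / "models"
    pkg.mkdir(parents=True)
    (root / "project_name" / "__init__.py").write_text("")
    # One module per model (Model1 and Model2 are hypothetical names).
    (pkg / "model1.py").write_text("class Model1:\n    name = 'model1'\n")
    (pkg / "model2.py").write_text("class Model2:\n    name = 'model2'\n")
    # Re-export so `from project_name.models import Model1` still works.
    (pkg / "__init__.py").write_text(
        "from .model1 import Model1\n"
        "from .model2 import Model2\n"
        "\n"
        "__all__ = ['Model1', 'Model2']\n"
    )


def load_models(root: Path):
    """Import the package from `root` and return its two model classes."""
    sys.path.insert(0, str(root))
    importlib.invalidate_caches()
    try:
        models = importlib.import_module("project_name.models")
    finally:
        sys.path.remove(str(root))
    return models.Model1, models.Model2


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_models_package(root)
        Model1, Model2 = load_models(root)
        print(Model1.name, Model2.name)
```

With the re-exports in place, code such as train.py can keep importing from the models package directly instead of reaching into the individual model files.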
🐍 Python projects using pyproject.toml
🔥 Models in PyTorch
📦 Containerized using Docker
📄 Documentation with Material for MkDocs
👕 Linting and formatting with ruff
✅ Checking using pre-commit
🛠️ CI with GitHub Actions
🤖 Automated dependency updates with Dependabot
📝 Project tasks using Invoke
and probably more that I have forgotten...
If you enjoy using the template, please consider giving credit by citing it. You can use the following BibTeX entry:
@misc{skafte_mlops_template,
author = {Nicki Skafte Detlefsen},
title = {MLOps template},
howpublished = {\url{https://github.com/SkafteNicki/mlops_template}},
year = {2024}
}