SkafteNicki/mlops_template

🍪 An up-to-date Cookiecutter template for MLOps

Inspired by the original cookiecutter-data-science template, this template is more opinionated about the tools used. It has been updated to better fit machine learning-based projects and is used as the core template in this MLOps course.

✋ Requirements to use the template:

  • Python 3.11 or higher
  • cookiecutter version 2.4.0 or higher

🆕 Start a new project

Start by creating a repository, either through the GitHub web interface or, if you have it set up, the GitHub command line interface:

gh repo create <repo_name> --public --confirm

Afterwards on your local machine run

cookiecutter https://github.com/SkafteNicki/mlops_template

and fill in the starting values for the project. When asked for the repository name, input the same name you used when creating the repository on GitHub. When asked for the project name, input a valid Python package name: all lowercase, containing only letters, numbers, and underscores. The project name will be used as the name of the Python package and is validated automatically by the template.
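As a rough illustration (this is not the template's actual validation code), the naming rule can be sketched with a small regular-expression check:

```python
import re

def is_valid_package_name(name: str) -> bool:
    """Sketch of the naming rule: all lowercase, only letters, digits and
    underscores, and not starting with a digit (so it stays a valid
    Python identifier)."""
    return re.fullmatch(r"[a-z_][a-z0-9_]*", name) is not None
```

For example, `my_project` passes, while `My-Project` and `2fast` do not.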

To commit to the remote repository afterwards execute the following series of commands:

cd <repo_name>
git init
git add .
git commit -m "init cookiecutter project"
git remote add origin https://github.com/<username>/<repo_name>
git push origin master

🗃️ Repository structure

When the project is created, the repository will have the following structure:

├── .github/                  # Github actions and dependabot
│   ├── dependabot.yaml
│   └── workflows/
│       └── tests.yaml
├── configs/                  # Configuration files
├── data/                     # Data directory
│   ├── processed
│   └── raw
├── dockerfiles/              # Dockerfiles
│   ├── api.Dockerfile
│   └── train.Dockerfile
├── docs/                     # Documentation
│   ├── mkdocs.yml
│   └── source/
│       └── index.md
├── models/                   # Trained models
├── notebooks/                # Jupyter notebooks
├── reports/                  # Reports
│   └── figures/
├── src/                      # Source code
│   ├── project_name/
│   │   ├── __init__.py
│   │   ├── api.py
│   │   ├── data.py
│   │   ├── evaluate.py
│   │   ├── model.py
│   │   ├── train.py
│   │   └── visualize.py
├── tests/                    # Tests
│   ├── __init__.py
│   ├── test_api.py
│   ├── test_data.py
│   └── test_model.py
├── .gitignore
├── .pre-commit-config.yaml
├── LICENSE
├── pyproject.toml            # Python project file
├── README.md                 # Project README
├── requirements.txt          # Project requirements
├── requirements_dev.txt      # Development requirements
└── tasks.py                  # Project tasks

In particular, let's explain the structure of the src folder, as it is arguably the most important part of the repository. The src folder is where the main code of the project is stored. The template divides the code into six files, shown in the diagram below with their respective connections:

(diagram showing the connections between these files)

  • data.py: this file is responsible for everything related to the data. This includes loading, cleaning, and splitting the data. If the data needs to be pre-processed then running this file should process raw data in the data/raw folder and save the processed data in the data/processed folder.
  • model.py: this file contains one or more model definitions.
  • train.py: this file is responsible for training the model. It should import the training/validation data interface from data.py and the model definition from model.py.
  • evaluate.py: this file is responsible for evaluating the model. It should import the test data interface from data.py and load the trained model from the models folder. Output should be performance metrics of the trained model.
  • api.py: this file is responsible for serving the model. It should import the trained model from the models folder and provide an interface for making predictions.
  • visualize.py: this file is responsible for visualizing the data and model. It should import the training/validation/test data interface from data.py and the trained model from the models folder. Output should be visualizations of the data and model.
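To make the wiring concrete, here is a minimal self-contained sketch of how train.py could tie the data interface and the model definition together. All names and the toy "model" are illustrative assumptions, not the template's actual API:

```python
def load_train_val_data():
    """Stand-in for the training/validation interface exposed by data.py."""
    train = [([0.0, 1.0], 1), ([1.0, 0.0], 0)]
    val = [([0.9, 0.1], 0), ([0.1, 0.9], 1)]
    return train, val

class Model:
    """Stand-in for a model definition from model.py."""
    def __init__(self):
        self.weights = [-1.0, 1.0]  # pretend these came from training

    def predict(self, x):
        score = sum(w * xi for w, xi in zip(self.weights, x))
        return 1 if score >= 0 else 0

def train():
    """train.py's job: combine the data interface and the model definition."""
    train_data, val_data = load_train_val_data()
    model = Model()
    # ... a real training loop (e.g. in PyTorch) would update the model here ...
    val_accuracy = sum(model.predict(x) == y for x, y in val_data) / len(val_data)
    return model, val_accuracy
```

evaluate.py and api.py would follow the same pattern, importing the trained model instead of training one.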

At some point, one or more of the files may grow too large and complicated. At that point it is recommended to split the file into multiple files and move them into a folder of the same name. As an example, consider a model.py containing many models. In this case it would be a good idea to refactor it into

src/
└── project_name/
    ├── __init__.py
    ├── models/
    │   ├── __init__.py
    │   ├── model1.py
    │   └── model2.py
    ├── ...
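A re-exporting `__init__.py` keeps such a refactor transparent to callers. The demo below builds the layout in a temporary directory purely for illustration; the file contents are assumptions, not the template's code:

```python
import sys
import tempfile
from pathlib import Path

# Build the refactored layout in a temporary directory so the example
# is self-contained; in a real project these files live under src/.
root = Path(tempfile.mkdtemp())
pkg = root / "project_name" / "models"
pkg.mkdir(parents=True)
(root / "project_name" / "__init__.py").write_text("")
(pkg / "model1.py").write_text("class Model1:\n    pass\n")
(pkg / "model2.py").write_text("class Model2:\n    pass\n")

# The folder's __init__.py re-exports the classes from its submodules,
# so existing imports such as `from project_name.models import Model1`
# keep working after the split.
(pkg / "__init__.py").write_text(
    "from project_name.models.model1 import Model1\n"
    "from project_name.models.model2 import Model2\n"
)

sys.path.insert(0, str(root))
from project_name.models import Model1, Model2  # noqa: E402
```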

📚 The stack

🐍 Python projects using pyproject.toml

🔥 Models in Pytorch

📦 Containerized using Docker

📄 Documentation with Material Mkdocs

👕 Linting and formatting with ruff

✅ Checking using pre-commit

🛠️ CI with GitHub Actions

🤖 Automated dependency updates with Dependabot

📝 Project tasks using Invoke (https://www.pyinvoke.org/)

and probably more that I have forgotten...

❕ License

If you enjoy using the template, please consider giving credit by citing it. You can use the following BibTeX entry:

@misc{skafte_mlops_template,
    author       = {Nicki Skafte Detlefsen},
    title        = {MLOps template},
    howpublished = {\url{https://github.com/SkafteNicki/mlops_template}},
    year         = {2024}
}
