This opinionated template is intended to accelerate your deep learning research. The full stack includes:
- Python >= 3.10
- CUDA >= 12.1
- PyTorch >= 2.2.1
- Hydra >= 1.3.2
- Poetry >= 1.8.1
- wandb >= 0.16.3
Steps to initialize your project:
- Create your project repo using this template.
- `cd` into the project working directory and run `init_proj.py`.
- Respond to the prompts from the CLI. Your responses customize the template to your information and project specifics. You will be prompted for the following:
  - `<<name>>`: Name for git config.
  - `<<email>>`: Email for git config and/or slurm begin/end/fail notices.
  - `<<drac-account>>`: Digital Research Alliance of Canada compute account (def-xxxx). Typically your PI's account.
  - `<<working-dir>>`: Full path to the working directory. Used locally as well as in docker containers.
  - `<<project-name>>`: Name of the project and python package.
  - `<<repo-url>>`: The URL of the project on github (www.github.com/<user>/<repo>).
  - `<<wandb-project>>`: Weights & Biases project name.
  - `<<wandb-entity>>`: Weights & Biases entity name.
  - `<<wandb-api-key>>`: Weights & Biases API key. NOTE: this is stored only in `.env`, which is not tracked by `git`.
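  The resulting `.env` might look something like the sketch below (the variable names are illustrative; `WANDB_API_KEY`, `WANDB_PROJECT`, and `WANDB_ENTITY` are standard `wandb` environment variables):

  ```bash
  # .env (kept out of version control)
  WANDB_API_KEY=<<wandb-api-key>>
  WANDB_PROJECT=<<wandb-project>>
  WANDB_ENTITY=<<wandb-entity>>
  ```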
- Create your working environment using `venv` OR `docker`:
  - `venv`: simply run the following:

    ```bash
    python3 -m venv .venv
    source .venv/bin/activate
    pip3 install --upgrade pip
    pip3 install poetry==${poetry-version}
    poetry install -vvv
    git submodule init && git submodule update
    ```

  - `docker`:
    - vscode:
      - Populate `customizations.mounts` with any data that you want mounted into the dev container.
      - Run `Dev Containers: Reopen in Container` in vscode.
    - CLI: run the following:

      ```bash
      docker build --file ./Dockerfile.dev -t ${your-tag-name} --shm-size=16gb .
      docker create --env-file ./.env --gpus all --shm-size 16G -i -t \
          --mount source=${path-to-scratch},target=/scratch,type=bind \
          --name ${your-container-name} ${your-tag-name}:latest
      docker start -ai ${your-container-name}
      ```
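      Once the container is up, you can attach an additional shell to it (assuming `bash` is available in the image, as it is in the NVIDIA base image):

      ```bash
      docker exec -it ${your-container-name} /bin/bash
      ```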
The template provides the following directory structure:

- `.devcontainer`: Contains the .json file for using vscode's Dev Containers functionality.
- `.github`: CI/CD pipeline definitions.
- `.vscode`: vscode settings to play nice with our dev environment, including integration with `black` and `flake8`.
- `artifacts`: Store your model outputs, logs, and other artifacts generated during training here.
- `configs`: hydra configs for reproducible and extensible experimentation. Config groups match the structure of the python package subdirectories to help users quickly find configs as required. Groups can be extended with multiple configurations and selected using the `config.yaml` defaults list. `configs/sweep` is intended for `wandb` sweep configurations and is not used by `hydra`.
- `data`: Small datasets such as MNIST and CIFAR-10 can go here. Larger datasets are assumed to be stored outside the project directory, e.g. in `/scratch`, and mounted to docker containers.
- `notebooks`: Jupyter notebooks and some suggested style files for nice `matplotlib` plots. Also includes `main.ipynb`, which is intended to be synced with `main.py` for easy debugging and hacking of the main script. Consider using `jupytext` to sync `.py` files to github rather than `.ipynb` (see the example after this list). Note that `main.ipynb` can be regenerated at any time by running `./scripts/generate_main_notebook.sh`.
- `scripts`: Various helper scripts for building your project on a new host or on Digital Research Alliance of Canada nodes. Also includes utilities for downloading `wandb` run tables and `imagenet`. You will also find `init_proj.py` here, which is the starter script for initializing the template.
- `src`: Source files for your project. It is expected that at least one python package will be created in `./src/<<package-name>>`. However, you can have as many packages as you like; simply add any other packages to `pyproject.toml::packages`.
- `tests`: Unit tests using `pytest`. Predefined tags include `slow`, `integration`, and `dist` to help limit your CI/CD to applicable tests.
- `third-party`: Third-party dependencies managed with `git submodule`.
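As mentioned above, `jupytext` can keep `main.py` and `main.ipynb` paired. A minimal sketch of the pairing commands (see the `jupytext` docs for the full set of pairing options):

```bash
# Pair the notebook with a percent-format .py file
jupytext --set-formats ipynb,py:percent notebooks/main.ipynb

# Propagate changes between the paired files whenever either is edited
jupytext --sync notebooks/main.ipynb
```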
The base image is NVIDIA's `nvcr.io/nvidia/pytorch:24.02-py3`. For development, there are a few considerations. First, we need to create a user in the container that matches the UID/GID of our user on the host to avoid breaking file permissions during development. Since we cannot run the container as the root user, we also need to use a `venv` to get around permission issues in `/usr`, where the system python is installed. Finally, resolving dependencies using `poetry` can greatly slow down build times. Therefore, we make sure we mount the `pyproject.toml` and, optionally, `poetry.lock` files separately from the rest of our source code. This ensures that the `docker` builder will cache the poetry install layer and skip the install step, provided that neither file has changed since the last build. To avoid conflicts with a potential local `.venv` directory, we build the container's `venv` in `~/<your-user-name>/build/` rather than in the project working directory. This environment is active at startup because it is prepended to the container's `PATH` env var.
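One way to wire up the UID/GID matching is to pass the host IDs as build args; the argument names below are assumptions, so check `Dockerfile.dev` for the names it actually declares:

```bash
# Build the dev image with the host user's UID/GID so that files
# written to bind mounts keep the correct ownership on the host.
# The UID/GID arg names are illustrative and must match Dockerfile.dev.
docker build --file ./Dockerfile.dev \
    --build-arg UID=$(id -u) \
    --build-arg GID=$(id -g) \
    -t ${your-tag-name} .
```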
For our production container, many of these issues are resolved since we do not bind mount our project source code. Therefore, we can simply install our application dependencies over the existing system python.
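A minimal sketch of that approach, assuming `poetry` is available in the production image: disable virtualenv creation so dependencies land directly in the system python.

```bash
# Install dependencies into the system python (no venv) for production
poetry config virtualenvs.create false
poetry install --only main -vvv
```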
DRAC pre-builds many python packages into wheels that are stored in a local wheelhouse. It is best practice to use these wheels rather than package distributions from PyPI. Therefore, consider pinning dependencies in `pyproject.toml` to match the pre-built wheels available from DRAC. You can search the pre-built wheels with the `avail_wheels` command on DRAC servers.
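For example (these invocations follow the DRAC documentation for `avail_wheels`; adjust names and versions to your stack):

```bash
# List all pre-built torch wheels in the DRAC wheelhouse
avail_wheels "torch*"

# Narrow the search to a specific version
avail_wheels torch --version 2.2
```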
Unfortunately, `poetry` does not yet support using a local directory as a wheel repository. Therefore, export your `poetry` environment to a `requirements.txt` file by running this command:
```bash
poetry export --format "requirements.txt" --without-hashes --without-urls -vvv > requirements.txt
```
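On a DRAC node you can then install from the local wheelhouse by keeping `pip` offline with `--no-index` (a standard DRAC pattern; the python module name below is an assumption, so check `module avail python`):

```bash
# Build a venv from DRAC wheels only; --no-index blocks PyPI access
module load python/3.10
python3 -m venv .venv
source .venv/bin/activate
pip install --no-index --upgrade pip
pip install --no-index -r requirements.txt
```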
For simplicity, a bash script for installing the project and dependencies is included; see `./scripts/build_cc_venv.sh`. Simply run this script from the project working directory after cloning the project from github. You can also use `./scripts/slurm/batch_build.sh` to submit the build as a `slurm` batch job.
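For example (assuming your default slurm account is configured; otherwise pass `--account=${drac-account}` to `sbatch`):

```bash
# Build interactively from the project root...
./scripts/build_cc_venv.sh

# ...or submit the build as a slurm batch job
sbatch ./scripts/slurm/batch_build.sh
```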
This repository uses `pytest`. Run the test suite from the project working directory with `pytest`.
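For example, the predefined tags can be used to select or skip groups of tests:

```bash
# Run the full suite
pytest

# Skip slow and distributed tests for a quick local check
pytest -m "not slow and not dist"

# Run only integration tests
pytest -m integration
```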