Monorepo example for Data Science teams.
- Check out the original article here to understand the underlying principles of the design.
- Check out the getting started guide for more information on how to use this project.

The project is automated using `make` and the `Makefile`. On the first run, use `make setup` to bootstrap the project setup on your local machine. This will install the dependencies using `pipenv` and create the necessary files. It will also run `make all` for you so you can verify that everything is working locally.

This is an ML monorepo. Multiple modules live in the `./src` directory. The current ones are:

```
src
├── config             # Base configuration objects
├── data_access_layer  # Data Access layer helpers
├── datasets           # Schema for different datasets using pandera
├── feature_store      # Common interfaces for model features
└── models             # Code to generate different models
```
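
As a hypothetical illustration of what lives in `datasets` (the column names and checks below are assumptions, not the repository's actual schemas), a pandera schema for the diabetes feature table could look like:

```python
import pandera as pa
from pandera.typing import Series


class DiabetesFeatures(pa.DataFrameModel):
    """Hypothetical schema for the diabetes feature table."""

    age: Series[float] = pa.Field(nullable=False)
    bmi: Series[float] = pa.Field(nullable=False)
    target: Series[float] = pa.Field(nullable=False)

    class Config:
        coerce = True  # cast columns to the declared dtypes on validation


# Raises pandera.errors.SchemaError if the dataframe does not match:
# validated = DiabetesFeatures.validate(features_df)
```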

Each module can implement different runs, but in general they should be runnable as a module. For example:

```shell
python -m models.diabetes.features
```
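
The actual CLI wiring is not shown here; as a rough sketch (the module layout and `argparse` usage are assumptions), a module becomes runnable with `python -m` by exposing a `main()` behind an `__main__` guard:

```python
# src/models/diabetes/features.py (simplified, hypothetical wiring)
import argparse


def main(dst: str) -> None:
    """Extract the diabetes features and write them to `dst`."""
    ...  # feature extraction logic lives here


if __name__ == "__main__":
    # Parsing CLI flags here is what makes `python -m models.diabetes.features`
    # (and the docker-run helper) work; the real project may use a different
    # CLI library, this is only the general pattern.
    parser = argparse.ArgumentParser()
    parser.add_argument("--dst", default="tmp/data/diabetes_features.parquet")
    args = parser.parse_args()
    main(args.dst)
```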

When running the project as a Docker image, you must specify the module to run as the Docker command. We have implemented a helper script that runs the Docker image built with the latest project files and your AWS credentials set up:

```shell
./scripts/docker-run models.diabetes.features
```

- Install the project dependencies in the `pipenv` environment (this is done for you by `make setup`).

- Run `mlflow` locally with the model registry:

  ```shell
  $ pipenv run mlflow
  [INFO] Starting gunicorn 20.1.0
  [INFO] Listening at: http://127.0.0.1:5000
  ```

- Run the feature extraction (minimal Python sketches of this and the following steps are shown after this list):

  ```shell
  $ pipenv run python -m models.diabetes.features \
      --dst tmp/data/diabetes_features.parquet
  features.py:main:14 INFO: Start | run
  features.py:main:14 INFO: End | run | (result=sklearn_dataset='diabetes' dst='tmp/data/diabetes_features.parquet')
  ```

- Run the data preprocessing:

  ```shell
  $ pipenv run python -m models.preprocess \
      --src_features=tmp/data/diabetes_features.parquet \
      --dst_x_train=tmp/data/x_train.parquet \
      --dst_y_train=tmp/data/y_train.parquet \
      --dst_x_test=tmp/data/x_test.parquet \
      --dst_y_test=tmp/data/y_test.parquet
  preprocess.py:<module>:46 INFO: Start | run
  preprocess.py:<module>:46 INFO: End | run | (result=src_features='tmp/data/diabetes_features.parquet' dst_x_train='tmp/data/x_train.parquet' dst_y_train='tmp/data/y_train.parquet' dst_x_test='tmp/data/x_test.parquet' dst_y_test='tmp/data/y_test.parquet')
  ```

- Run the model training (note the output model URI):

  ```shell
  $ pipenv run python -m models.diabetes.train \
      --src_x_train=tmp/data/x_train.parquet \
      --src_y_train=tmp/data/y_train.parquet \
      --src_x_test=tmp/data/x_test.parquet \
      --src_y_test=tmp/data/y_test.parquet
  train.py:<module>:72 INFO: Start | run
  train.py:<module>:72 INFO: End | run
  train.py:<module>:75 INFO: Model saved to runs:/2099249145894ae3b16b7a37653cec06/model
  ```

- Run predictions using the previously logged model in MLflow:

  ```shell
  $ pipenv run python -m models.predict \
      --src_features=tmp/data/x_test.parquet \
      --src_model=runs:/2099249145894ae3b16b7a37653cec06/model \
      --dst_y_hat=tmp/data/y_hat_test.parquet
  predict.py:<module>:126 INFO: src_features='tmp/data/x_test.parquet' src_model='runs:/2099249145894ae3b16b7a37653cec06/model' flavour='sklearn' parallel_backend='threading' n_jobs=-1 batch_predictions=False batch_size=10000 progress_bar=True dst_y_hat='tmp/data/y_hat_test.parquet' mlflow=MLFlowConfig(experiment_name='predict', run_name='run-2023-06-20T18-09-48', tracking_uri=SecretStr('**********'), flavor='sklearn', tags=None) execution_date='2023-06-20T18:09:48Z'
  predict.py:<module>:131 INFO: Reading data from tmp/data/x_test.parquet
  predict.py:<module>:133 INFO: Loading model from runs:/2099249145894ae3b16b7a37653cec06/model
  predict.py:<module>:137 INFO: Writing y_hat to tmp/data/y_hat_test.parquet
  ```
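
For reference, here are minimal sketches of the steps above, assuming the diabetes features come straight from scikit-learn's `load_diabetes` dataset and the target column is named `target`; the function names, column names, and model choice are illustrative, not the project's actual code. Feature extraction and preprocessing could look roughly like this:

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split


def extract_features(dst: str = "tmp/data/diabetes_features.parquet") -> None:
    # Load the sklearn diabetes dataset as a single dataframe and persist it.
    features = load_diabetes(as_frame=True).frame
    features.to_parquet(dst)


def preprocess(src_features: str, dst_x_train: str, dst_y_train: str,
               dst_x_test: str, dst_y_test: str) -> None:
    # Split the feature table into the four train/test parquet files
    # consumed by the training step.
    features = pd.read_parquet(src_features)
    x = features.drop(columns=["target"])
    y = features[["target"]]
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=42
    )
    x_train.to_parquet(dst_x_train)
    y_train.to_parquet(dst_y_train)
    x_test.to_parquet(dst_x_test)
    y_test.to_parquet(dst_y_test)
```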
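
The training step would then fit a model and log it to the local MLflow server, which is where the `runs:/<run_id>/model` URI in the output above comes from (the `Ridge` estimator and the run wiring are assumptions):

```python
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.linear_model import Ridge


def train(src_x_train: str, src_y_train: str) -> str:
    x_train = pd.read_parquet(src_x_train)
    y_train = pd.read_parquet(src_y_train)

    model = Ridge()
    model.fit(x_train, y_train)

    # Point at the server started by `pipenv run mlflow` and log the model;
    # the returned URI is what the predict step consumes.
    mlflow.set_tracking_uri("http://127.0.0.1:5000")
    with mlflow.start_run(run_name="train") as run:
        mlflow.sklearn.log_model(model, artifact_path="model")
        return f"runs:/{run.info.run_id}/model"
```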
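
Finally, prediction loads the logged model back through its `runs:/` URI and scores the feature table (same assumptions as above):

```python
import mlflow
import mlflow.sklearn
import pandas as pd


def predict(src_features: str, src_model: str, dst_y_hat: str) -> None:
    x = pd.read_parquet(src_features)

    # `src_model` is a URI such as runs:/<run_id>/model, as printed by training.
    mlflow.set_tracking_uri("http://127.0.0.1:5000")
    model = mlflow.sklearn.load_model(src_model)

    y_hat = pd.DataFrame({"y_hat": model.predict(x).ravel()}, index=x.index)
    y_hat.to_parquet(dst_y_hat)
```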

Documentation uses `mkdocs`. To run the documentation locally, run `make mkdocs` and open http://localhost:8000.

To expand the documentation, edit the files in the `./docs` directory. Any markdown file can be added there and will be rendered in the documentation with navigation and search support.