Best Practices

Training and Inference Pipeline

| Best Practice | Description |
| --- | --- |
| Separate definition and runtime for training/inference pipelines. | One anti-pattern we have observed is that of the unicorn data scientist: one person who is asked to do everything, from seeking business value, to developing models, to provisioning infrastructure. Creating a separation of concerns between the individuals who engineer the pipeline and the individuals who build the model recognizes the unique value that Data Scientists and ML Engineers each provide, while creating a work environment that stimulates collaboration. |
| Use the same code for pre-processing in training and inference. | The majority of time in most ML projects is allocated to data collection, data engineering, and label and feature engineering. It is vital that both the training and inference pipelines use the same code/logic, so that there is no opportunity for human error or inconsistency between them. |
| Traceability between components of the training and inference pipelines. | A well-governed machine learning pipeline provides clear lineage for all components of the training and inference pipeline definition and runtime. From the runtime perspective, it should be possible to answer questions such as: Which data sources were used for training? What modelling techniques were used? What type of preprocessing was used? Where are the artifacts stored? From the perspective of the pipeline definition, changes applied to pipelines should be tracked using versioning. |
| Code should provide consistent results wherever it is run. | Data Scientists are used to programming in an environment that allows them to interactively explore the data and get rapid results. Data Engineers are used to submitting jobs that run end-to-end, anticipating what might go wrong and including appropriate measures to keep the process robust. It is important to keep both of these divergent approaches in mind when designing a training and inference pipeline. A well-designed training and inference pipeline makes it relatively easy to run the code and get consistent results in both scenarios. |
| Promote modularity in the development of training and inference solutions. | Training and inference pipelines can be broken down into several components, such as data validation, data preprocessing, and data combination. When designing these pipelines, it is important that the components can be developed, tested, and maintained independently of one another. |
| Use on-demand compute, only paying for it when you need it for a specific job. | One anti-pattern we have observed is treating the cloud like an on-premises cluster of compute resources. The cloud, however, allows fine-grained security and cost-control measures that decentralize the use of cloud resources. This enables cost savings from a pay-as-you-go model and empowers a growing Data Science organization to harness the power of cloud computing. |
| Automate the reporting for model, feature, and hyper-parameter searching. | Particularly in regulated environments, there is a requirement to document the search process that led to the final model used for inference. A well-designed ML pipeline tracks the experiments that were run, preserving traceability within the training and inference pipelines and sparing the Data Scientist unnecessary or onerous manual reporting. |
| Provision infrastructure using code and a CI/CD pipeline. | Automation is important to the implementation of the ML pipeline. Although you can set up an ML pipeline manually via the AWS console or CLI, in practice it is recommended to minimize human touch points in the deployment of ML pipelines so that ML models are deployed consistently and repeatably. One option is to use continuous integration (CI)/continuous deployment (CD) to deploy to dev/staging/prod environments. |
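
The practice of using the same pre-processing code in training and inference can be illustrated with a minimal Python sketch. All names here (`ScalerParams`, `fit_scaler`, `preprocess`, `training_pipeline`, `inference_pipeline`) are hypothetical and chosen for illustration; the sketch assumes a single numeric feature that is standardized, and it omits model fitting entirely.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ScalerParams:
    """Statistics learned at training time and reused at inference time."""
    mean: float
    std: float


def fit_scaler(values: Sequence[float]) -> ScalerParams:
    """Learn standardization parameters from the training data only."""
    std = pstdev(values)
    # Guard against constant features to avoid division by zero.
    return ScalerParams(mean=mean(values), std=std if std > 0 else 1.0)


def preprocess(values: Sequence[float], params: ScalerParams) -> List[float]:
    """The single pre-processing routine shared by both pipelines."""
    return [(v - params.mean) / params.std for v in values]


def training_pipeline(raw: Sequence[float]) -> Tuple[ScalerParams, List[float]]:
    """Fit parameters, transform the training data, and return both."""
    params = fit_scaler(raw)
    features = preprocess(raw, params)
    # A model would be fitted on `features` here (omitted in this sketch).
    return params, features


def inference_pipeline(raw: Sequence[float], params: ScalerParams) -> List[float]:
    """Transform new data with the stored parameters; no re-implementation."""
    return preprocess(raw, params)
```

Because both pipelines call the same `preprocess` function with parameters persisted from training, a change to the transformation logic cannot reach one pipeline without also reaching the other.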

Developer Environment

| Best Practice | Description |
| --- | --- |
| Provide a flexible set of baseline services. | Data Scientists should be able to develop code in any environment capable of running a remote Python interpreter: VS Code, PyCharm, Vim, Emacs, or a notebook. There should not be a single, "one-size-fits-all" approach. |
| Utilize basic software development best practices. | Having a branching/collaboration strategy, using semantic versioning, using PRs, organizing the code repository, working effectively with notebooks, writing code that works on dummy data for testing as well as in the pipeline, consistent formatting, and type-checking are some of the techniques that Data Scientists can borrow from software developers. |
| Establish scalable patterns for Data Scientists to access public libraries. | How do you migrate code repositories into your private cloud? One-off requests/tickets for each new package are difficult to scale. Options include creating separate kernels for common ML frameworks and allowing data scientists to install from internal repos. |
| Decentralize the ownership of modelling choices. | The organization that selects the features and modelling techniques should own the results of the model and self-regulate to ensure that it is following best practices for model development. Domain-specific review by outside teams should be optional and at the request of the owner of the model. Auditing capability should be in place so that it is possible to go back and check how models were trained in case of issues. |
| Utilize automation and integrated version control for documentation. | How many disparate sources of information (e.g., Confluence, GitHub, Word documents) would need to be reviewed before a model could be transferred from one data scientist to another? Documentation should be semi-automated from the code itself, and should be contained in a single repository to ensure that it is always up-to-date. |
| Clear and consistent approval mechanism for authorization to use data. | If a data scientist has access to particular data, it should be because they have approval to use that dataset for modelling. In regulated environments it is helpful to have a pre-approved list of features that can be re-used across many projects in a specific domain, e.g., marketing, fraud, or finance. |
| Provision infrastructure using code. | Automation is important to the implementation of the developer environment. Although you can set up a developer environment manually via the AWS console or CLI, in practice it is recommended to minimize human touch points in the deployment of developer environments to ensure consistency and repeatability. |
| Operate in a secure environment (isolated, encrypted, authorized, auditable). | This involves network isolation, encryption of data in transit and at rest, and secure authentication and authorization. It should be possible to provide an audit trail for the work of Data Scientists and Developers. |
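
As one illustration of the software development practices listed above, the following sketch shows type-annotated feature code that runs unchanged on small dummy data in a test and on real records in a pipeline. The feature (`spend_per_visit`), the record fields, and the function names are hypothetical and assumed only for this example.

```python
from typing import Dict, List

Record = Dict[str, float]


def add_ratio_feature(records: List[Record]) -> List[Record]:
    """Add a hypothetical `spend_per_visit` feature to each record.

    Records with zero visits get 0.0 rather than raising ZeroDivisionError.
    The input records are not modified; new dictionaries are returned.
    """
    out: List[Record] = []
    for record in records:
        visits = record["visits"]
        ratio = record["spend"] / visits if visits else 0.0
        out.append({**record, "spend_per_visit": ratio})
    return out


def make_dummy_records() -> List[Record]:
    """Tiny hand-written dataset that covers the normal and the edge case."""
    return [
        {"spend": 10.0, "visits": 2.0},
        {"spend": 0.0, "visits": 0.0},
    ]


def test_add_ratio_feature() -> None:
    """Runs in milliseconds, so it can be part of every pull request."""
    dummy = make_dummy_records()
    result = add_ratio_feature(dummy)
    assert [r["spend_per_visit"] for r in result] == [5.0, 0.0]
    # The original records must be left untouched.
    assert "spend_per_visit" not in dummy[0]
```

The type annotations let a static checker validate the function without running it, and the dummy-data test gives reviewers a fast check that does not depend on access to production data.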

Monitoring and Alerting

| Best Practice | Description |
| --- | --- |
| Check for development/production skew. | The model should get the same accuracy in dev, stage, and production environments. This check should be a part of the normal promotion process. |
| Monitor the data of models in production. | Measure how the distribution of features changes over time compared with the distribution of the data the model was trained on. |
| Monitor the performance of models in production. | Measure concept drift, estimated by the degree to which ongoing model accuracy has decreased relative to the accuracy of the model when it was trained. |
| Provide remediation action for ongoing operation. | Have a strategy for re-training models as fresh data becomes available, including preventative and responsive measures. |
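
Monitoring how a feature's distribution changes relative to the training data can be sketched with the Population Stability Index (PSI). This document does not prescribe a specific metric; PSI is used here only as one common choice. The function name, the equal-width binning, the number of bins, and the `eps` floor are all assumptions of this sketch.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Compare a production sample (`actual`) with the training sample (`expected`).

    Bin edges are derived from the training sample; production values outside
    that range are counted in the first or last bin.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant training feature: avoid division by zero

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each fraction at eps so the logarithm below is defined.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Usage: identical samples give 0.0; a shifted sample gives a larger value.
train_sample = [i / 100 for i in range(100)]
shifted_sample = [v + 0.5 for v in train_sample]
no_drift = population_stability_index(train_sample, train_sample)
drift = population_stability_index(train_sample, shifted_sample)
```

A score computed this way could feed an alert or a re-training trigger; the threshold at which to act is a project-specific choice and would need to be tuned per feature.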