OpenPAI is an open source platform that provides complete AI model training and resource management capabilities. It is easy to extend and supports on-premises, cloud, and hybrid environments at various scales.
OpenPAI v0.14.0 has been released!
- When to consider OpenPAI
- Why choose OpenPAI
- Get started
- Deploy OpenPAI
- Train models
- Administration
- Reference
- Get involved
- How to contribute
- When your organization needs to share powerful AI computing resources (GPU/FPGA farm, etc.) among teams.
- When your organization needs to share and reuse common AI assets like Model, Data, Environment, etc.
- When your organization needs an easy IT ops platform for AI.
- When you want to run a complete training pipeline in one place.
The platform incorporates a mature design that has a proven track record in Microsoft's large-scale production environment.
OpenPAI is a full stack solution. It supports not only on-premises, hybrid, and public cloud deployment, but also single-box deployment for trial users.
Pre-built Docker images for popular AI frameworks. Easy to include heterogeneous hardware. Supports distributed training, such as distributed TensorFlow.
OpenPAI is a complete solution for deep learning: it supports virtual clusters, is compatible with the Kubernetes ecosystem, and runs a complete training pipeline on one cluster. OpenPAI is architected in a modular way, so different modules can be plugged in as appropriate. Here is the architecture of OpenPAI, highlighting technical innovations of the platform.
Aiming at openness and advancing state-of-the-art technology, Microsoft Research (MSR) and Microsoft Search Technology Center (STC) have also released a few other open source projects.
- NNI : An open source AutoML toolkit for neural architecture search and hyper-parameter tuning.
- MMdnn : A comprehensive, cross-framework solution to convert, visualize and diagnose deep neural network models. The "MM" in MMdnn stands for model management and "dnn" is an acronym for deep neural network.
- NeuronBlocks : An NLP deep learning modeling toolkit that helps engineers build DNN models like playing with Lego. The main goal of this toolkit is to minimize the development cost of building NLP deep neural network models, including both training and inference stages.
- SPTAG : Space Partition Tree And Graph (SPTAG) is an open source library for large-scale approximate nearest neighbor search over vectors.
We encourage researchers and students to leverage these projects to accelerate AI development and research.
OpenPAI manages computing resources and is optimized for deep learning. Through Docker technology, the computing hardware is decoupled from software, so it's easy to run distributed jobs, switch between deep learning frameworks, or run other kinds of jobs in consistent environments.
Because OpenPAI is a platform, deploying a cluster is the first step before using it. OpenPAI can also be deployed on a single server to manage its resources.
If the cluster is ready, see Train models to learn how to use it.
Follow this part to check prerequisites, then deploy and validate an OpenPAI cluster. More servers can be added as needed after the initial deployment.
It's highly recommended to try OpenPAI on servers that are not in use and run no other services. Refer to here for hardware specifications.
- Ubuntu 16.04 (18.04 should work, but not fully tested.)
- Assign each server a static IP address, and make sure the servers can communicate with each other.
- Servers can access the internet, and in particular the Docker Hub registry service or a mirror of it. The deployment process pulls the Docker images of OpenPAI.
- SSH service is enabled, and all servers share the same username/password, which has sudo privileges.
- NTP service is enabled.
- Docker is either not installed (recommended), or its version is higher than 1.26.
- OpenPAI reserves memory and CPU for running its services, so make sure there are enough resources left to run machine learning jobs. Check hardware requirements for details.
- Dedicated servers for OpenPAI. OpenPAI manages all CPU, memory and GPU resources of the servers. Any other workload may cause unknown problems due to insufficient resources.
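The prerequisites above can be turned into a simple pre-flight check. The following is a minimal Python sketch and is not part of OpenPAI: the `ServerFacts` record, its field names, and `check_prerequisites` are hypothetical, and the sketch assumes the facts about each server have already been gathered by some other means (for example over SSH).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ServerFacts:
    """Hypothetical facts collected from one server; not an OpenPAI type."""
    hostname: str
    ubuntu_version: str                 # for example "16.04"
    has_static_ip: bool
    can_reach_docker_hub: bool
    ssh_enabled: bool
    ntp_enabled: bool
    # (major, minor) of an installed Docker, or None if Docker is absent.
    docker_version: Optional[Tuple[int, int]] = None


def check_prerequisites(facts: ServerFacts) -> List[str]:
    """Return human-readable problems; an empty list means the server passes."""
    problems = []
    # 16.04 is the tested release; 18.04 is accepted but not fully tested.
    if facts.ubuntu_version not in ("16.04", "18.04"):
        problems.append(f"unsupported Ubuntu version {facts.ubuntu_version}")
    if not facts.has_static_ip:
        problems.append("no static IP address assigned")
    if not facts.can_reach_docker_hub:
        problems.append("cannot reach the Docker Hub registry or a mirror")
    if not facts.ssh_enabled:
        problems.append("SSH service is not enabled")
    if not facts.ntp_enabled:
        problems.append("NTP service is not enabled")
    # Mirrors the rule above: no Docker at all, or a version higher than 1.26.
    if facts.docker_version is not None and facts.docker_version <= (1, 26):
        problems.append("installed Docker version is not higher than 1.26")
    return problems


if __name__ == "__main__":
    node = ServerFacts("node1", "16.04", True, True, True, False)
    for problem in check_prerequisites(node):
        print(f"{node.hostname}: {problem}")
```

Returning a list of problems rather than a single pass/fail flag lets an administrator see everything that needs fixing on a server in one run.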
The Deploy with default configuration part covers the minimum steps to deploy an OpenPAI cluster, and it's suitable for most small and medium-size clusters within 50 servers. Based on the default configuration, a customized deployment can optimize the cluster for different hardware environments and usage scenarios.
For a small or medium-size cluster, which is fewer than 50 servers, it's recommended to deploy with the default configuration. If there is only one powerful server, refer to deploy OpenPAI as a single box.
For a large cluster, this section is still needed to generate the default configuration, which is then customized.
Because hardware environments and usage scenarios vary, the default configuration of OpenPAI may need to be optimized. Follow the Customize deployment part to learn more.
After deployment, it's recommended to validate that the key components of OpenPAI are healthy. After validation succeeds, submit a hello-world job and check that it works end to end.
The common practice on OpenPAI is to submit job requests and wait for the jobs to get computing resources and execute. This is a different experience from assigning dedicated servers to each person. People may feel that computing resources are out of their control, and the learning curve may be steeper than running jobs on dedicated servers. But sharing resources on OpenPAI improves resource utilization and saves time on maintaining environments.
For administrators of OpenPAI, a successful deployment is the first step. The second step is to help users of OpenPAI understand its benefits and know how to use it. Users can learn from Train models, but that section covers various scenarios, and users may not need all of them. So administrators can create simplified documents tailored to their users' actual scenarios.
If any question comes up during deployment, check here first.
If the FAQ doesn't resolve it, refer to here to ask a question or submit an issue.
Like all computing platforms, OpenPAI is a productivity tool meant to maximize the utilization of resources. So it's recommended to submit training jobs and let OpenPAI allocate resources and run them. If there are too many jobs, some may be queued until enough resources are available. This is a different experience from running code on dedicated servers, so it takes a bit more knowledge about how to submit and manage jobs on OpenPAI.
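To make the queuing behavior concrete, here is a small Python sketch of strict first-in, first-out admission on a shared GPU pool. It only illustrates the idea of jobs waiting for resources and is an assumption for teaching purposes: it does not reproduce OpenPAI's actual scheduling policy, and `schedule_fifo` is a hypothetical helper.

```python
from collections import deque
from typing import List, Tuple


def schedule_fifo(total_gpus: int,
                  jobs: List[Tuple[str, int]]) -> Tuple[List[str], List[str]]:
    """Start jobs in submission order until the next one does not fit.

    `jobs` is a list of (job name, GPUs requested). Returns the names of
    the jobs that start running and the names of the jobs left waiting.
    """
    free = total_gpus
    queue = deque(jobs)
    running = []
    # Strict FIFO: stop at the first job that cannot be satisfied, even if
    # a smaller job behind it would fit (no backfilling in this sketch).
    while queue and queue[0][1] <= free:
        name, gpus = queue.popleft()
        free -= gpus
        running.append(name)
    waiting = [name for name, _ in queue]
    return running, waiting


if __name__ == "__main__":
    running, waiting = schedule_fifo(4, [("a", 3), ("b", 2), ("c", 1)])
    print("running:", running)
    print("waiting:", waiting)
```

In the usage example, job `a` takes three of the four GPUs, so `b` has to wait, and `c` waits behind it even though one GPU is free. A real scheduler may make different trade-offs between fairness and utilization.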
Note that besides queuing jobs, OpenPAI also supports allocating dedicated resources. Users can use SSH or Jupyter Notebook as on a physical server; refer to here for details. Although this does not use resources efficiently, it still saves the cost of setting up and managing environments on physical servers.
Follow the job submission tutorial to learn more about how to train models on OpenPAI. It's a good starting point for learning how to use OpenPAI.
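As a rough idea of what a job request contains, the Python sketch below builds a job description and prints it as JSON. The field names (`jobName`, `image`, `taskRoles`, `taskNumber`, `cpuNumber`, `memoryMB`, `gpuNumber`, `command`) reflect our understanding of the job configuration format of this release and should be verified against the Job configuration reference; the helper `build_job_config`, the image name, and the command are placeholders invented for illustration.

```python
import json


def build_job_config(job_name: str, image: str, command: str,
                     gpu_number: int = 1, cpu_number: int = 4,
                     memory_mb: int = 8192, task_number: int = 1) -> dict:
    """Build a job description as a plain dict (hypothetical helper).

    Field names are an assumption about the job configuration format;
    check them against the Job configuration reference before use.
    """
    return {
        "jobName": job_name,
        "image": image,          # Docker image that provides the environment
        "taskRoles": [
            {
                "name": "main",
                "taskNumber": task_number,   # number of instances of this role
                "cpuNumber": cpu_number,
                "memoryMB": memory_mb,
                "gpuNumber": gpu_number,
                "command": command,          # command run inside the container
            }
        ],
    }


# Placeholder image and command; replace them with real values.
config = build_job_config(
    job_name="hello-world",
    image="example/deep-learning-image:latest",
    command="python train.py",
)
print(json.dumps(config, indent=2))
```

Keeping the description as data makes it easy to store alongside the training code and to submit through whichever client is preferred.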
OpenPAI VS Code Client is a friendly, GUI-based client tool for OpenPAI, and it's highly recommended. It's an extension of Visual Studio Code. It can submit jobs, simulate jobs locally, manage multiple OpenPAI environments, and so on.
The Web UI and job logs are helpful for analyzing job failures, and OpenPAI supports SSH for debugging.
Refer to here for more information about troubleshooting job failures.
- Client tool
- Use Storage
- Job configuration
- RESTful API
- Design documents can be found here if you are curious.
- Stack Overflow: If you have questions about OpenPAI, please submit them on Stack Overflow under the tag openpai.
- Gitter chat: You can also ask questions in the Microsoft/pai conversation.
- Create an issue or feature request: If you have an issue, bug, or new feature request, please submit it on GitHub.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
We are working on a set of major feature improvements and refactoring. Anyone who is familiar with the features is encouraged to join the design review and discussion in the corresponding issue ticket.
- OpenPAI virtual cluster design. Issue 1754
- OpenPAI protocol design. Issue 2007
- Folks who want to add support for other ML and DL frameworks
- Folks who want to make OpenPAI a richer AI platform (e.g. support for more ML pipelines, hyperparameter tuning)
- Folks who want to write tutorials/blog posts showing how to use OpenPAI to solve AI problems
One key purpose of PAI is to support the highly diversified requirements of academia and industry. PAI is completely open: it is under the MIT license. This makes PAI particularly attractive for evaluating various research ideas, which include but are not limited to the components.
PAI operates in an open model. It was initially designed and developed by the Microsoft Research (MSR) and Microsoft Search Technology Center (STC) platform team. We are glad to have Peking University, Xi'an Jiaotong University, Zhejiang University, and University of Science and Technology of China join us to develop the platform jointly. Contributions from academia and industry are all highly welcome.