This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Commit

[VS Code] view container (#2301)
* add a dashboard in grafana to list all tasks in node (#2197)

* Fix format in issue templates (#2233)

Fix format in issue templates:
- remove trailing spaces
- change chinese colon into english

* Fix auto retries when out of memory. (#1108)

* Distinguish cgroup OOM from dmesg.

* Remove cgroup OOM detection

Make all OOM cause exiting by 5

* Exit 55 when OOM

* Refine homepage for new users (#2155)

Updated first level bullets, to add more content for administrators and users, who is first time touch OpenPAI, or computing platform.

* Fix yarn container failed when docker container exited quickly. (#2256)

* REST server: remove expires in JWT payload of unit test (#2263)

* Deploy: add explicit config field in webportal  plugin (#2251)

* Deploy: add explicit config field in webportal  plugin

* Fix json.dumps

* t

* fix

* Update PLUGINS.md

* Update webportal.md

* alert on unhealthy gpu (#2209)

* Pylon: fix double start query in yarn redirect (#2258)

* Pylon: fix double start query in yarn redirect

* Hide debug info in docker-compose.yaml

* adapt user transfer script to new config (#2266)

* Webportal: add pai-version attribute to <pai-plugin> (#2245)

* Webportal: add pai-version attribute to <pai-plugin>

* Use preprocess to apply window.PAI_VERSION

* set version in layout.html

* Fix ib drivers bug (#2269)

* FIx ib installation script bug (#2271)

* [BUG] Fix hadoop ai build path (#2262)

* fix hadoop ai build bugs

* refine

* Web portal submit job: support init json from sessionStorage. (#2253)

* YARN and HDFS log persistence  (#2244)

* rm log persist

* change log dir to host

* persist nm log to host

* resolve conflict

* persist namenode log

* persist data node log

* add comments

* move log path to common pai storage

* use twisted in yarn-exporter (#2273)

* [Job Debugging] Basic Implement Of Job Debugging. (#2272)

* Refine document for new user to submit job (#2278)

1. add new guidance to submit job for beginners.
2. refine homepage to connect with new guidance.
3. reorganize content of troubleshooting for next refactoring.
4. fix links in md files.

* [Drivers] Fix the issue when installing IB drivers.  (#2275)

* fix can not report zombie process using gpu error (#2279)

* fix external process error

* add debug log

* fix short ID and long ID do not match

* use time based atomic ref to exchange info between threads

* add test case for AtomicRef

* fix bug in file remove (#2288)

* fix hadoop build error (#2296)

* export vc/node related metrics from yarn (#2289)

* 720

* open hdfs explorer in view container
enable tslint rule "ordered-imports"

* add tslint rule for indent

* add home button to hdfs explorer's navigation;
adjust octicon's color

* fix lint error

* [VS Code] Add job list (#2160)

* add job list view to pai extension

* [VS Code] joblist fix (#2185)

* eager load recent jobs when job submitted

* avoid eager getChildren, and let vscode treeview.reveal do it implicitly

* fix lint error
sunqinzheng authored Mar 20, 2019
1 parent c758e19 commit 021a645
Showing 146 changed files with 4,481 additions and 822 deletions.
10 changes: 5 additions & 5 deletions .github/ISSUE_TEMPLATE/bug-report.md
@@ -8,25 +8,25 @@ about: Report an issue or question while using/operating OpenPAI instance (deplo
<!-- Please use this template while reporting an issue and provide as much info as possible. Not doing so may result in your bug not being addressed in a timely manner. Thanks!-->


**Organization Name**:
**Organization Name**:

<!--This information is optional, but you are highly encourage to leave this reference info for us to know better about the context.!-->

**Short summary about the issue/question**:

**Brief what process you are following**:
**Brief what process you are following**:

<!--deployment related issues
Please fill this for deployment related issues:
Please fill this for deployment related issues:
- Operating type: Initial deployment / upgrading / operating etc.
- Brief what deployment process you are following -->

<!--User job related issues
GitHub is not the right place for support requests.
If you're looking for help, check [Stack Overflow](https://stackoverflow.com/questions/tagged/openpai) and the [troubleshooting guide](https://github.com/Microsoft/pai/blob/master/docs/job_log.md and https://github.com/Microsoft/pai/blob/master/docs/job_tutorial.md#debug).
If you're looking for help, check [Stack Overflow](https://stackoverflow.com/questions/tagged/openpai) and the [troubleshooting guide](https://github.com/Microsoft/pai/blob/master/docs/job_log.md and https://github.com/Microsoft/pai/blob/master/docs/job_tutorial.md#how-to-debug-a-job).
-->

**How to reproduce it**:
**How to reproduce it**:

<!--Fill the following information if your issue need diagnostic support from the team, as minimally and precisely as possible!-->

2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/enhancement.md
@@ -10,6 +10,6 @@ about: Suggest an enhancement to the OpenPAI project

**Why is this needed**:

**Without this feature, how does the current module work**
**Without this feature, how does the current module work**:

**Components that may involve changes**:
9 changes: 9 additions & 0 deletions .travis.yml
@@ -50,6 +50,15 @@ matrix:
script:
- python3 -m unittest discover .

- language: python
python: 3.6
before_install:
- cd src/yarn-exporter/test
install:
- pip install prometheus_client twisted requests
script:
- python3 -m unittest discover .

- language: python
python: 2.7
install:
167 changes: 109 additions & 58 deletions README.md
@@ -9,23 +9,27 @@

OpenPAI is an open source platform that provides complete AI model training and resource management capabilities, it is easy to extend and supports on-premise, cloud and hybrid environments in various scale.

# Table of Contents
## Table of Contents

1. [When to consider OpenPAI](#when-to-consider-openpai)
2. [Why choose OpenPAI](#why-choose-openpai)
3. [How to deploy](#how-to-deploy)
4. [How to use](#how-to-use)
5. [Resources](#resources)
6. [Get Involved](#get-involved)
7. [How to contribute](#how-to-contribute)
1. [Why choose OpenPAI](#why-choose-openpai)
1. [Get started](#get-started)
1. [Deploy OpenPAI](#deploy-openpai)
1. [Train models](#train-models)
1. [Administration](#administration)
1. [Reference](#reference)
1. [Get involved](#get-involved)
1. [How to contribute](#how-to-contribute)

## When to consider OpenPAI

1. When your organization needs to share powerful AI computing resources (GPU/FPGA farm, etc.) among teams.
2. When your organization needs to share and reuse common AI assets like Model, Data, Environment, etc.
3. When your organization needs an easy IT ops platform for AI.
4. When you want to run a complete training pipeline in one place.


## Why choose OpenPAI

The platform incorporates the mature design that has a proven track record in Microsoft's large-scale production environment.

### Support on-premises and easy to deploy
@@ -38,69 +42,112 @@ Pre-built docker for popular AI frameworks. Easy to include heterogeneous hardwa

### Most complete solution and easy to extend

OpenPAI is a most complete solution for deep learning, support virtual cluster, compatible Hadoop / kubernetes eco-system, complete training pipeline at one cluster etc. OpenPAI is architected in a modular way: different module can be plugged in as appropriate.
OpenPAI is a most complete solution for deep learning, support virtual cluster, compatible Hadoop / Kubernetes eco-system, complete training pipeline at one cluster etc. OpenPAI is architected in a modular way: different module can be plugged in as appropriate.

## Related Projects

Targeting at openness and advancing state-of-art technology, [Microsoft Research (MSR)](https://www.microsoft.com/en-us/research/group/systems-research-group-asia/) had also released few other open source projects.

* [NNI](https://github.com/Microsoft/nni) : An open source AutoML toolkit for neural architecture search and hyper-parameter tuning.
We encourage researchers and students leverage these projects to accelerate the AI development and research.
* [MMdnn](https://github.com/Microsoft/MMdnn) : A comprehensive, cross-framework solution to convert, visualize and diagnose deep neural network models. The "MM" in MMdnn stands for model management and "dnn" is an acronym for deep neural network.

## How to deploy
#### 1 Prerequisites <a name="ref_prerequisites"></a>
Before start, you need to meet the following requirements:
## Get started

OpenPAI manages computing resources and is optimized for machine learning. Through Docker technology, the computing hardware is decoupled from the software, so it's easy to run distributed computing, switch between deep learning frameworks, or run jobs in consistent environments.

As OpenPAI is a platform, [deploying a cluster](#deploy-openpai) is the first step before using it. OpenPAI can also be deployed on a single server to manage its resources.

If the cluster is ready, learn from [train models](#train-models) about how to use it.

## Deploy OpenPAI

Follow this part to check prerequisites, then deploy and validate an OpenPAI cluster. More servers can be added as needed after the initial deployment.

It's highly recommended to try OpenPAI on server(s) that run no other workloads or services. Refer to [here](https://github.com/Microsoft/pai/wiki/Resource-Requirement) for hardware specification.

### Prerequisites and preparation

* Ubuntu 16.04 (18.04 should work, but not fully tested.)
* Assign each server a static IP address, and make sure the servers can communicate with each other.
* Servers can access the internet, in particular the Docker Hub registry service or a mirror of it. The deployment process pulls the Docker images of OpenPAI.
* SSH service is enabled on every server, and all servers share the same username/password with sudo privilege.
* NTP service is enabled.
* Docker is preferably not pre-installed; if it is, its API version must be 1.26 or higher.
* OpenPAI reserves memory and CPU for service running, so make sure there are enough resource to run machine learning jobs. Check [hardware requirements](https://github.com/Microsoft/pai/wiki/Resource-Requirement) for details.
* Dedicated servers for OpenPAI. OpenPAI manages all CPU, memory and GPU resources of servers. If there is any other workload, it may cause unknown problem due to insufficient resource.
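
The Docker prerequisite above (API version 1.26 or higher) can be checked before deployment. The following is a minimal sketch, not part of OpenPAI: the function names are hypothetical, and it assumes the `docker` CLI is on the `PATH` and that `docker version --format '{{.Server.APIVersion}}'` reports the daemon's API version. Versions are compared as integer tuples because a plain string comparison would rank `1.9` above `1.26`.

```python
import subprocess

# Minimum from the prerequisite list ("api version >= 1.26").
MIN_DOCKER_API = (1, 26)


def parse_api_version(text):
    """Turn a string such as '1.39' into a tuple of ints, (1, 39)."""
    return tuple(int(part) for part in text.strip().split("."))


def docker_api_ok(version_text, minimum=MIN_DOCKER_API):
    """Return True when the reported API version meets the minimum."""
    return parse_api_version(version_text) >= minimum


def installed_docker_api_version():
    """Ask the local Docker daemon for its API version.

    Returns None when Docker is absent or the daemon is unreachable,
    which the prerequisite list treats as acceptable (no Docker installed).
    """
    try:
        result = subprocess.run(
            ["docker", "version", "--format", "{{.Server.APIVersion}}"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip() or None


if __name__ == "__main__":
    version = installed_docker_api_version()
    if version is None:
        print("Docker not found or not running; nothing to check.")
    elif docker_api_ok(version):
        print("Docker API version", version, "meets the minimum.")
    else:
        print("Docker API version", version, "is older than 1.26.")
```

Running the script on each server gives a quick pass/fail signal for this one prerequisite; the other items in the list (static IP, SSH, NTP) still need their own checks.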

### Deploy

The [Deploy with default configuration](#Deploy-with-default-configuration) part covers the minimum steps to deploy an OpenPAI cluster, and it's suitable for most small and medium-sized clusters of fewer than 50 servers. Based on the default configuration, a customized deployment can optimize the cluster for different hardware environments and usage scenarios.

#### Deploy with default configuration

For a small or medium-sized cluster of fewer than 50 servers, it's recommended to [deploy with default configuration](docs/pai-management/doc/distributed-deploy.md). If there is only one powerful server, refer to [deploy OpenPAI as a single box](docs/pai-management/doc/single-box.md).

For a large size cluster, this section is still needed to generate default configuration, then [customize the deployment](#customize-deployment).

#### Customize deployment

- Ubuntu 16.04
- Assign each server a static IP address. Network is reachable between servers.
- Server can access the external network, especially need to have access to a Docker registry service (e.g., Docker hub) to pull the Docker images for the services to be deployed.
- All machines' SSH service is enabled, share the same username / password and have sudo privilege.
- Need to enable NTP service.
- Recommend no Docker installed or a Docker with api version >= 1.26.
- See [hardware resource requirements](https://github.com/Microsoft/pai/wiki/Resource-Requirement).
Because hardware environments and usage scenarios vary, the default configuration of OpenPAI may need to be updated. Follow the [Customize deployment](docs/pai-management/doc/how-to-generate-cluster-config.md#Optional-Step-3.-Customize-configure-OpenPAI) part to learn more details.

#### 2 Deploy OpenPAI
### Validate deployment

If you have a cluster which contains more than 2 machine and want to deploy pai on it. Please choose ```Distributed deploy``` following.
After deployment, it's recommended to [validate key components of OpenPAI](docs/pai-management/doc/validate-deployment.md) and confirm they are healthy. After validation succeeds, [submit a hello-world job](docs/user/training.md) and check that it works end-to-end.

If you only have one mahince, and want to deploy pai on it. Please choose ```Single Box deploy``` following.
### Train users before "train models"

The common practice on OpenPAI is to submit job requests and wait until the jobs are allocated computing resources and executed. This is a different experience from assigning dedicated servers to each person. People may feel that computing resources are out of their control, and the learning curve may be steeper than running jobs on dedicated servers. But shared resources on OpenPAI can improve productivity significantly and save time on maintaining environments.

##### 2.1 [Distributed deploy](./docs/pai-management/doc/distributed-deploy.md)
##### 2.2 [Single Box deploy](./docs/pai-management/doc/single-box.md)
For administrators of OpenPAI, a successful deployment is the first step; the second step is to help users of OpenPAI understand its benefits and know how to use it. Users of OpenPAI can learn from [Train models](#train-models). But the content below covers various scenarios and may be too much for specific users, so a simplified document based on it is easier to learn from.

### FAQ

If any question arises during deployment, check [here](docs/faq.md#deploy-and-maintenance-related-faqs) first.

## How to use
### How to train jobs
- How to write OpenPAI jobs
- [Quick start: how to write and submit a CIFAR-10 job](./examples/README.md#quickstart)
- [Write job from scratch in depth](./docs/job_tutorial.md)
- [Learn more example jobs](./examples/#offtheshelf)
- How to submit OpenPAI jobs
- [Submit a job in Web Portal](./docs/submit_from_webportal.md)
- [Submit a job in Visual Studio](https://github.com/Microsoft/vs-tools-for-ai/blob/master/docs/pai.md)
- [OpenPAI VS Code Extension](./contrib/pai_vscode/VSCodeExt.md)
- How to run AutoML jobs on OpenPAI
- [Submit a job in Neural Network Intelligence](https://github.com/Microsoft/nni/blob/master/docs/PAIMode.md)
- How to request on-demand resource for in place training
- [Launch a jupyter notebook and work in it](./examples/jupyter/README.md)
If the FAQ doesn't resolve it, refer to [here](#get-involved) to ask a question or submit an issue.

### Cluster administration
- [Deployment infrastructure](./docs/pai-management/doc/distributed-deploy.md)
- [Cluster maintenance](https://github.com/Microsoft/pai/wiki/Maintenance-(Service-&-Machine))
- [Monitoring](./docs/webportal/README.md)
## Train models

## Resources
Like all machine learning platforms, OpenPAI is a productivity tool. To maximize utilization, it's recommended to submit training jobs and let OpenPAI allocate resources and run them. If there are too many jobs, some may be queued until enough resources are available, and OpenPAI chooses the server(s) to run each job. This is different from running code on dedicated servers, and it requires a bit more knowledge about how to submit and manage training jobs on OpenPAI.

- The OpenPAI user [documentation](./docs/documentation.md) provides in-depth instructions for using OpenPAI
- Visit the [release notes](https://github.com/Microsoft/pai/releases) to read about the new features, or download the release today.
- [FAQ](./docs/faq.md)
Note that OpenPAI also supports allocating on-demand resources besides queuing jobs. Users can connect with SSH or Jupyter as if on a physical server; refer to [here](examples/jupyter/README.md) for how to use OpenPAI this way. Though this uses resources less efficiently, it still saves the cost of setting up and managing environments on physical servers.

### Submit training jobs

Follow [submitting a hello-world job](docs/user/training.md) to learn more about training models on OpenPAI. It's a very simple job, used to understand the OpenPAI job definition and get familiar with the Web portal.

### OpenPAI VS Code Client

[OpenPAI VS Code Client](contrib/pai_vscode/VSCodeExt.md) is a friendly, GUI based client tool of OpenPAI. It's an extension of Visual Studio Code. It can submit job, simulate job running locally, manage multiple OpenPAI environments, and so on.

### Troubleshooting job failure

Web portal and job log are helpful to analyze job failure, and OpenPAI supports SSH into environment for debugging.

Refer to [here](docs/user/troubleshooting_job.md) for more information about troubleshooting job failure. It's recommended to get the code working locally before submitting it to OpenPAI, which reduces the need to troubleshoot remotely.

## Administration

* [Manage cluster with paictl](docs/paictl/paictl-manual.md)
* [Monitoring](./docs/webportal/README.md)

## Reference

* [Job definition](docs/job_tutorial.md)
* [RESTful API](docs/rest-server/API.md)
* Design documents could be found [here](docs).

## Get involved

* [Stack Overflow](./docs/stackoverflow.md): If you have questions about OpenPAI, please submit question at Stack Overflow under tag: openpai
* [Gitter chat](https://gitter.im/Microsoft/pai): You can also ask questions in microsoft/pai conversation.
* [Create an issue or feature request](https://github.com/Microsoft/pai/issues/new/choose): If you have issue/ bug/ new feature, please submit it to GitHub.

## Get Involved
- [StackOverflow:](./docs/stackoverflow.md) If you have questions about OpenPAI, please submit question at Stackoverflow under tag: openpai
- [Report an issue:](https://github.com/Microsoft/pai/wiki/Issue-tracking) If you have issue/ bug/ new feature, please submit it at Github
## How to contribute
#### Contributor License Agreement

### Contributor License Agreement

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.microsoft.com.
@@ -113,17 +160,21 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

#### Call for contribution
### Call for contribution

We are working on a set of major features improvement and refactor, anyone who is familiar with the features is encouraged to join the design review and discussion in the corresponding issue ticket.
- PAI virtual cluster design. [Issue 1754](https://github.com/Microsoft/pai/issues/1754)
- PAI protocol design. [Issue 2007](https://github.com/Microsoft/pai/issues/2007)

#### Who should consider contributing to OpenPAI?
- Folks who want to add support for other ML and DL frameworks
- Folks who want to make OpenPAI a richer AI platform (e.g. support for more ML pipelines, hyperparameter tuning)
- Folks who want to write tutorials/blog posts showing how to use OpenPAI to solve AI problems
* PAI virtual cluster design. [Issue 1754](https://github.com/Microsoft/pai/issues/1754)
* PAI protocol design. [Issue 2007](https://github.com/Microsoft/pai/issues/2007)

### Who should consider contributing to OpenPAI

* Folks who want to add support for other ML and DL frameworks
* Folks who want to make OpenPAI a richer AI platform (e.g. support for more ML pipelines, hyperparameter tuning)
* Folks who want to write tutorials/blog posts showing how to use OpenPAI to solve AI problems

### Contributors

#### Contributors
One key purpose of PAI is to support the highly diversified requirements from academia and industry. PAI is completely open: it is under the MIT license. This makes PAI particularly attractive to evaluate various research ideas, which include but not limited to the [components](./docs/research_education.md).

PAI operates in an open model. It is initially designed and developed by [Microsoft Research (MSR)](https://www.microsoft.com/en-us/research/group/systems-research-group-asia/) and [Microsoft Search Technology Center (STC)](https://www.microsoft.com/en-us/ard/company/introduction.aspx) platform team.
1 change: 1 addition & 0 deletions contrib/pai_vscode/VSCodeExt.md
@@ -18,4 +18,5 @@ To install the OpenPAI Client:
![Extension](./assets/ext-install-1.png)

## Next steps

Learn how to [use OpenPAI VS Code Client](./README.md)
20 changes: 18 additions & 2 deletions contrib/pai_vscode/i18n/common.json
@@ -23,10 +23,18 @@
"treeview.node.edit": "Edit Configuration...",
"treeview.node.openhdfs": "Open HDFS...",
"treeview.node.openPortal": "Open Web Portal...",
"treeview.node.listjob": "List Jobs...",
"treeview.node.listjob": "List Jobs Externally...",
"treeview.node.create-config": "Create Job Config...",
"treeview.node.submitjob": "Submit Job...",
"treeview.node.simulate": "Simulate Job Running...",
"treeview.hdfs.select-cluster.label": "Double click to connect to a PAI cluster's HDFS...",
"treeview.joblist.recent": "Recent Submitted Jobs from VS Code",
"treeview.joblist.all": "All Jobs",
"treeview.joblist.view": "View Job Detail",
"treeview.joblist.more": "View More...",
"treeview.joblist.error": "Failed to load job list: {0}",
"container.hdfs.mkdir.prompt": "Please enter a folder name",
"container.hdfs.mkdir.cancelled": "Cancelled creating new folder",
"hdfs.workspace.title": "HDFS Explorer - {0}",
"hdfs.progress": "Transfering file - {0}% ({1} bytes / {2} bytes)",
"hdfs.downloading": "Downloading {0}",
@@ -93,10 +101,18 @@
"treeview.node.edit": "编辑配置...",
"treeview.node.openhdfs": "打开 HDFS...",
"treeview.node.openPortal": "打开 OpenPAI 门户...",
"treeview.node.listjob": "打开任务列表...",
"treeview.node.listjob": "在浏览器里打开任务列表...",
"treeview.node.create-config": "创建任务配置文件...",
"treeview.node.submitjob": "提交任务...",
"treeview.node.simulate": "模拟任务执行...",
"treeview.hdfs.select-cluster.label": "双击以连接到 PAI 集群的 HDFS...",
"treeview.joblist.recent": "近期从 VS Code 提交的任务",
"treeview.joblist.all": "所有任务",
"treeview.joblist.view": "查看任务详情",
"treeview.joblist.more": "显示更多...",
"treeview.joblist.error": "载入任务列表时发生错误:{0}",
"container.hdfs.mkdir.prompt": "请输入文件夹名",
"container.hdfs.mkdir.cancelled": "新建文件夹操作已取消",
"hdfs.workspace.title": "HDFS 浏览器 - {0}",
"hdfs.progress": "正在传输 - {0}% ({1} 字节 / {2} 字节)",
"hdfs.downloading": "正在下载 {0}",
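
The strings in `common.json` use positional placeholders such as `{0}` that are filled in at run time. The extension's own lookup code is TypeScript and is not shown in this diff, so the following is only a language-neutral sketch in Python of how such a table can be resolved; the `MESSAGES` layout, the locale names `en` and `zh`, and the `localize` function are assumptions made for illustration. Python's `str.format` happens to use the same `{0}` syntax.

```python
# Hypothetical two-locale table; the four strings are copied from common.json.
MESSAGES = {
    "en": {
        "treeview.joblist.error": "Failed to load job list: {0}",
        "hdfs.downloading": "Downloading {0}",
    },
    "zh": {
        "treeview.joblist.error": "载入任务列表时发生错误:{0}",
        "hdfs.downloading": "正在下载 {0}",
    },
}


def localize(key, *args, locale="en"):
    """Look up a message and substitute positional arguments.

    Falls back to English when the locale or key is missing there,
    and to the key itself when no translation exists at all.
    """
    table = MESSAGES.get(locale, {})
    template = table.get(key) or MESSAGES["en"].get(key, key)
    return template.format(*args)


print(localize("treeview.joblist.error", "timeout"))
print(localize("hdfs.downloading", "a.txt", locale="zh"))
```

Falling back to the English string keeps the UI readable when a translation is missing, which is one common design choice rather than something this commit prescribes.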