This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Add distributed training examples of PyTorch #4821

Merged
merged 56 commits into microsoft:master from vvfreesoul:master on Sep 10, 2020

Conversation

vvfreesoul
Contributor

First PR

@ghost

ghost commented Aug 18, 2020

CLA assistant check
All CLA requirements met.

@coveralls

coveralls commented Aug 18, 2020


Coverage increased (+0.2%) to 34.801% when pulling 853d112 on vvfreesoul:master into 5eed779 on microsoft:master.

@hzy46 hzy46 changed the title imagenet-nccl for test Add distributed training examples of PyTorch Aug 21, 2020
@scarlett2018 scarlett2018 mentioned this pull request Aug 24, 2020
@@ -0,0 +1,117 @@
import os
Contributor


Should "singal" in the file name be "single"?

Contributor Author


done

print("env rank:",int(os.environ['PAI_TASK_INDEX']) * ngpus_per_node + gpu)
# For multiprocessing distributed training, rank needs to be the
# global rank among all the processes
# rank = int(os.environ['PAI_TASK_INDEX']) * ngpus_per_node + gpu
Contributor


Why is this commented out?

Contributor Author


I wrote the rank calculation directly into the parameters. I have now removed the comment, but I still print the rank.
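As a sketch of what writing the rank calculation directly into the parameters might look like (the helper below is hypothetical; `ngpus_per_node` and `gpu` are assumed to come from the surrounding launch code):

```python
import os

def global_rank(ngpus_per_node, gpu):
    # Global rank = node index (PAI_TASK_INDEX) * GPUs per node + local GPU index.
    return int(os.environ["PAI_TASK_INDEX"]) * ngpus_per_node + gpu

# It can then be passed straight into the launch call, e.g.:
# dist.init_process_group(backend="nccl", init_method=..., world_size=...,
#                         rank=global_rank(ngpus_per_node, gpu))
```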

@scarlett2018 scarlett2018 mentioned this pull request Sep 7, 2020
### Run PyTorch Distributed Jobs in OpenPAI
Example Name | Multi-GPU | Multi-Node | Backend | Apex | Job Protocol |
---|---|---|---|---|---|
Single-Node DataParallel CIFAR-10 | ✓ | x | - | - | [cifar10-single-node-gpus-cpu-DP.yaml](../../../examples/Distributed-example/cifar10-single-node-gpus-cpu-DP.yaml)|
Contributor


This should start with "github.com/....."

@@ -0,0 +1,34 @@
## How OpenPAI Deploy Distributed Jobs
### Taskrole and Instance
When we execute distributed programs on PAI, we can add different task roles to our job. A single-server job has only one task role, while a distributed job may have several. For example, when TensorFlow runs a distributed job, it has two roles: the parameter server and the worker. In a distributed job, each task role may have one or more instances. For example, if a TensorFlow worker role has 8 instances, there will be 8 Docker containers for that role. Please visit [here](how-to-use-advanced-job-settings.html#multiple-task-roles) for specific operations.
Contributor


how-to-use-advanced-job-settings.html#multiple-task-roles -> how-to-use-advanced-job-settings.md#multiple-task-roles


### Environmental variables
In a distributed job, one task might communicate with others (when we say task, we mean a single instance of a task role). So a task needs to be aware of other tasks' runtime information, such as IP, port, etc. The system exposes such runtime information as environment variables to each task's Docker container. For mutual communication, users can write code in the container to access those runtime environment variables. Please visit [here](how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation) for specific operations.
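As a sketch of how a task might look up such runtime information (the helper is hypothetical; it only builds names following the `PAI_PORT_LIST_$taskRole_$taskIndex_$portLabel` pattern this document describes):

```python
import os

def pai_port_env_name(task_role, task_index, port_label):
    # Builds the environment-variable name for a reserved port, following
    # the PAI_PORT_LIST_$taskRole_$taskIndex_$portLabel pattern.
    return f"PAI_PORT_LIST_{task_role}_{task_index}_{port_label}"

# Inside a PAI container, a task can then read a peer's reserved port, e.g.:
peer_port = os.environ.get(pai_port_env_name("work", 0, "SyncPort"))
```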
Contributor


how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation -> how-to-use-advanced-job-settings.md#environmental-variables-and-port-reservation

The single-node program is simple: the program executed on PAI is exactly the same as the one on our own machine. Note that we can apply for a worker in PAI and for instances within that worker, and in a worker we can apply for the GPUs we need. We provide an [example](../../../examples/Distributed-example/cifar10-single-node-gpus-cpu-DP.py) of DP.
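A minimal `nn.DataParallel` sketch (not the linked example itself, just an illustration of the idea): the model code is unchanged, and `DataParallel` splits each input batch across all visible GPUs, falling back to a single device otherwise.

```python
import torch
import torch.nn as nn

# A toy model; DataParallel wraps it without changing the training code.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate the module on every visible GPU
model = model.to(device)

inputs = torch.randn(8, 32, device=device)
outputs = model(inputs)  # each GPU processes a slice of the batch
```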

## DistributedDataParallel
DDP requires users to set a master node IP and port for synchronization in PyTorch. For the port, you can simply use a fixed port, such as `5000`, as your master port. However, this port may conflict with others. To prevent port conflicts, you can reserve a port in OpenPAI, as we mentioned [here](how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation). The port you reserved is available in environment variables like `PAI_PORT_LIST_$taskRole_$taskIndex_$portLabel`, where `$taskIndex` means the instance index of that task role. For example, if your task role name is `work` and the port label is `SyncPort`, you can add the following code in your PyTorch DDP program:
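The code block this paragraph introduces is not reproduced in the excerpt; a minimal sketch of what it might look like follows. The `PAI_HOST_IP_work_0` variable for the master's address is an assumption based on OpenPAI's naming pattern, not something stated in this excerpt:

```python
import os

def pai_ddp_init_method(task_role="work", port_label="SyncPort"):
    # Master address and reserved port of instance 0 of the given task role.
    # PAI_HOST_IP_* is assumed here; PAI_PORT_LIST_* follows the pattern above.
    master_ip = os.environ[f"PAI_HOST_IP_{task_role}_0"]
    master_port = os.environ[f"PAI_PORT_LIST_{task_role}_0_{port_label}"]
    return f"tcp://{master_ip}:{master_port}"

# In the DDP program one would then initialize the process group, e.g.:
# torch.distributed.init_process_group(
#     backend="nccl",
#     init_method=pai_ddp_init_method(),
#     world_size=world_size,
#     rank=int(os.environ["PAI_TASK_INDEX"]),
# )
```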
Contributor


how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation -> how-to-use-advanced-job-settings.md#environmental-variables-and-port-reservation

- 'mount 10.151.40.32:/mnt/zhiyuhe /mnt/data'
- >-
wget
https://raw.githubusercontent.com/vvfreesoul/pai/master/examples/Distributed-example/Lite-imagenet-single-mul-DDP-nccl-gloo.py
Contributor

@hzy46 hzy46 Sep 9, 2020


vvfreesoul -> master, and the same for others

@@ -0,0 +1,20 @@
Index: examples/Distributed-example/Lite-imagenet-singal+mul-DDP-nccl+gloo.py
Contributor


remove this

Contributor


Also remove Add_distributed_training_examples_of_PyTorch.patch

@scarlett2018 scarlett2018 mentioned this pull request Sep 10, 2020
@hzy46 hzy46 merged commit d0d0fc8 into microsoft:master Sep 10, 2020
3 participants