Add distributed training examples of PyTorch #4821
Conversation
@@ -0,0 +1,117 @@
import os
The `singal` in the file name should be `single`?
done
print("env rank:",int(os.environ['PAI_TASK_INDEX']) * ngpus_per_node + gpu) | ||
# For multiprocessing distributed training, rank needs to be the | ||
# global rank among all the processes | ||
# rank = int(os.environ['PAI_TASK_INDEX']) * ngpus_per_node + gpu |
Why is this commented out?
I wrote the rank calculation directly into the parameters. I have now removed the comments, but I still keep a line of code that prints the rank.
### Run PyTorch Distributed Jobs in OpenPAI
Example Name | Multi-GPU | Multi-Node | Backend | Apex | Job protocol |
---|---|---|---|---|---|
Single-Node DataParallel CIFAR-10 | ✓ | x | - | - | [cifar10-single-node-gpus-cpu-DP.yaml](../../../examples/Distributed-example/cifar10-single-node-gpus-cpu-DP.yaml) |
This should start with "github.com/....."
@@ -0,0 +1,34 @@
## How OpenPAI Deploys Distributed Jobs
### Taskrole and Instance
When we execute distributed programs on PAI, we can add different task roles for our job. For single-server jobs, there is only one task role. For distributed jobs, there may be multiple task roles. For example, when TensorFlow is used to run distributed jobs, it has two roles: the parameter server and the worker. In distributed jobs, each role may have one or more instances. For example, if there are 8 instances in a worker role of TensorFlow, there should be 8 Docker containers for the worker role. Please visit [here](how-to-use-advanced-job-settings.html#multiple-task-roles) for specific operations.
how-to-use-advanced-job-settings.html#multiple-task-roles
-> how-to-use-advanced-job-settings.md#multiple-task-roles
### Environmental variables
In a distributed job, one task might communicate with others (when we say task, we mean a single instance of a task role). So a task needs to be aware of other tasks' runtime information such as IP, port, etc. The system exposes such runtime information as environment variables to each task's Docker container. For mutual communication, users can write code in the container to access those runtime environment variables. Please visit [here](how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation) for specific operations.
how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation
-> how-to-use-advanced-job-settings.md#environmental-variables-and-port-reservation
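As a rough sketch of how a task might read such runtime information in Python (only `PAI_TASK_INDEX` appears in this PR's code; the `PAI_HOST_IP_worker_0` name below is an assumption for illustration, and the authoritative variable list is in the linked page):

```python
import os

# Index of this instance within its task role, exposed by OpenPAI at runtime.
task_index = int(os.environ['PAI_TASK_INDEX'])

# Other runtime details (peer IPs, reserved ports, ...) are exposed the same way.
# The variable name below is an assumption used only for illustration; see the
# linked advanced-job-settings page for the exact names.
master_ip = os.environ.get('PAI_HOST_IP_worker_0', '127.0.0.1')

print("task index:", task_index, "master ip:", master_ip)
```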
The single-node program is simple. The program executed in PAI is exactly the same as the program on our own machine. It should be noted that a Worker can be applied for in PAI, and an Instance can be applied for in a Worker. In a worker, we can apply for the GPUs that we need. We provide an [example](../../../examples/Distributed-example/cifar10-single-node-gpus-cpu-DP.py) of DP.
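For reference, a minimal DP sketch (not the linked CIFAR-10 example; the model and batch below are placeholders):

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the CIFAR-10 network in the linked example.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    # DataParallel splits each batch across all visible GPUs on this single node
    # and gathers the outputs back onto the default device.
    model = nn.DataParallel(model)
model.to(device)

# Dummy CIFAR-10-shaped batch: 64 images of 3x32x32.
outputs = model(torch.randn(64, 3, 32, 32, device=device))
print(outputs.shape)  # torch.Size([64, 10])
```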
## DistributedDataParallel
DDP requires users to set a master node IP and port for synchronization in PyTorch. For the port, you can simply pick one fixed port, such as `5000`, as your master port. However, this port may conflict with others. To prevent port conflicts, you can reserve a port in OpenPAI, as we mentioned [here](how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation). The port you reserved is available in environmental variables like `PAI_PORT_LIST_$taskRole_$taskIndex_$portLabel`, where `$taskIndex` means the instance index of that task role. For example, if your task role name is `work` and the port label is `SyncPort`, you can add the following code in your PyTorch DDP program:
how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation
-> how-to-use-advanced-job-settings.md#environmental-variables-and-port-reservation
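The snippet that follows that sentence is not shown in the diff above; a hedged sketch of what reading the reserved port and initializing the process group could look like (the master address and world size are illustrative placeholders, not values OpenPAI provides under these names):

```python
import os
import torch.distributed as dist

# Port reserved under label `SyncPort` for instance 0 of task role `work`.
# PAI_PORT_LIST_* may hold a comma-separated list of ports (an assumption about
# its format here); take the first entry as the DDP master port.
master_port = os.environ['PAI_PORT_LIST_work_0_SyncPort'].split(',')[0]

# The master IP and world size are passed in explicitly in this sketch; in a
# real job they would come from the runtime environment variables documented above.
def init_ddp(master_addr, rank, world_size):
    os.environ['MASTER_ADDR'] = master_addr
    os.environ['MASTER_PORT'] = master_port
    # gloo works in CPU-only containers; nccl is the usual choice when each
    # process owns a GPU.
    dist.init_process_group(backend='gloo', rank=rank, world_size=world_size)

# Example: two instances, one process per instance, rank taken from the task index.
init_ddp(master_addr='10.0.0.1', rank=int(os.environ['PAI_TASK_INDEX']), world_size=2)
```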
- 'mount 10.151.40.32:/mnt/zhiyuhe /mnt/data'
- >-
  wget
  https://raw.githubusercontent.com/vvfreesoul/pai/master/examples/Distributed-example/Lite-imagenet-single-mul-DDP-nccl-gloo.py
`vvfreesoul` -> `master`, and the same for others
imagenet-nccl_for_test.patch
Outdated
@@ -0,0 +1,20 @@
Index: examples/Distributed-example/Lite-imagenet-singal+mul-DDP-nccl+gloo.py |
remove this
Also remove `Add_distributed_training_examples_of_PyTorch.patch`
First pr