Add distributed training examples of PyTorch #4821
Conversation
@@ -0,0 +1,117 @@
import os
The `singal` in the file name should be `single`?
done
print("env rank:",int(os.environ['PAI_TASK_INDEX']) * ngpus_per_node + gpu) | ||
# For multiprocessing distributed training, rank needs to be the | ||
# global rank among all the processes | ||
# rank = int(os.environ['PAI_TASK_INDEX']) * ngpus_per_node + gpu |
Why is this commented out?
I wrote the rank calculation directly into the parameters. I have now removed the comments, but I still keep a line of code that prints the rank.
### Run PyTorch Distributed Jobs in OpenPAI
Example Name | Multi-GPU | Multi-Node | Backend | Apex | Job protocol |
---|---|---|---|---|---|
Single-Node DataParallel CIFAR-10 | ✓ | x | - | - | [cifar10-single-node-gpus-cpu-DP.yaml](../../../examples/Distributed-example/cifar10-single-node-gpus-cpu-DP.yaml) |
This should start with "github.com/....."
@@ -0,0 +1,34 @@
## How OpenPAI Deploys Distributed Jobs
### Taskrole and Instance
When we execute distributed programs on PAI, we can add different task roles for our job. For single-server jobs, there is only one task role. For distributed jobs, there may be multiple task roles. For example, when TensorFlow is used to run distributed jobs, it has two roles: the parameter server and the worker. In distributed jobs, each role may have one or more instances. For example, if there are 8 instances in a worker role of TensorFlow, there should be 8 Docker containers for the worker role. Please visit [here](how-to-use-advanced-job-settings.html#multiple-task-roles) for specific operations.
how-to-use-advanced-job-settings.html#multiple-task-roles
-> how-to-use-advanced-job-settings.md#multiple-task-roles
### Environmental variables
In a distributed job, one task might communicate with others (when we say task, we mean a single instance of a task role). So a task needs to be aware of other tasks' runtime information such as IP, port, etc. The system exposes such runtime information as environment variables to each task's Docker container. For mutual communication, users can write code in the container to access those runtime environment variables. Please visit [here](how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation) for specific operations.
how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation
-> how-to-use-advanced-job-settings.md#environmental-variables-and-port-reservation
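As a rough sketch of how a task might read such runtime information in Python (only `PAI_TASK_INDEX` appears in this PR's code; the `PAI_HOST_IP_worker_0` name below is an assumption for illustration, and the authoritative variable list is in the linked page):

```python
import os

# Index of this instance within its task role, exposed by OpenPAI at runtime.
task_index = int(os.environ['PAI_TASK_INDEX'])

# Other runtime details (peer IPs, reserved ports, ...) are exposed the same way.
# The variable name below is an assumption used only for illustration; see the
# linked advanced-job-settings page for the exact names.
master_ip = os.environ.get('PAI_HOST_IP_worker_0', '127.0.0.1')

print("task index:", task_index, "master ip:", master_ip)
```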
The single-node program is simple. The program executed in PAI is exactly the same as the program on our own machine. It should be noted that a Worker can be applied for in PAI, and an Instance can be applied for in a Worker. In a worker, we can apply for the GPUs that we need. We provide an [example](../../../examples/Distributed-example/cifar10-single-node-gpus-cpu-DP.py) of DP.
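For reference, a minimal DP sketch (not the linked CIFAR-10 example; the model and batch below are placeholders):

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the CIFAR-10 network in the linked example.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    # DataParallel splits each batch across all visible GPUs on this single node
    # and gathers the outputs back onto the default device.
    model = nn.DataParallel(model)
model.to(device)

# Dummy CIFAR-10-shaped batch: 64 images of 3x32x32.
outputs = model(torch.randn(64, 3, 32, 32, device=device))
print(outputs.shape)  # torch.Size([64, 10])
```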
## DistributedDataParallel
DDP requires users to set a master node IP and port for synchronization in PyTorch. For the port, you can simply pick one fixed port, such as `5000`, as your master port. However, this port may conflict with others. To prevent port conflicts, you can reserve a port in OpenPAI, as we mentioned [here](how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation). The port you reserved is available in environmental variables like `PAI_PORT_LIST_$taskRole_$taskIndex_$portLabel`, where `$taskIndex` means the instance index of that task role. For example, if your task role name is `work` and the port label is `SyncPort`, you can add the following code in your PyTorch DDP program:
how-to-use-advanced-job-settings.html#environmental-variables-and-port-reservation
-> how-to-use-advanced-job-settings.md#environmental-variables-and-port-reservation
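The snippet that follows that sentence is not shown in the diff above; a hedged sketch of what reading the reserved port and initializing the process group could look like (the master address and world size are illustrative placeholders, not values OpenPAI provides under these names):

```python
import os
import torch.distributed as dist

# Port reserved under label `SyncPort` for instance 0 of task role `work`.
# PAI_PORT_LIST_* may hold a comma-separated list of ports (an assumption about
# its format here); take the first entry as the DDP master port.
master_port = os.environ['PAI_PORT_LIST_work_0_SyncPort'].split(',')[0]

# The master IP and world size are passed in explicitly in this sketch; in a
# real job they would come from the runtime environment variables documented above.
def init_ddp(master_addr, rank, world_size):
    os.environ['MASTER_ADDR'] = master_addr
    os.environ['MASTER_PORT'] = master_port
    # gloo works in CPU-only containers; nccl is the usual choice when each
    # process owns a GPU.
    dist.init_process_group(backend='gloo', rank=rank, world_size=world_size)

# Example: two instances, one process per instance, rank taken from the task index.
init_ddp(master_addr='10.0.0.1', rank=int(os.environ['PAI_TASK_INDEX']), world_size=2)
```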
- 'mount 10.151.40.32:/mnt/zhiyuhe /mnt/data'
- >-
  wget
  https://raw.githubusercontent.com/vvfreesoul/pai/master/examples/Distributed-example/Lite-imagenet-single-mul-DDP-nccl-gloo.py
`vvfreesoul` -> `master`, and the same for others
imagenet-nccl_for_test.patch
Outdated
@@ -0,0 +1,20 @@
Index: examples/Distributed-example/Lite-imagenet-singal+mul-DDP-nccl+gloo.py |
remove this
Also remove `Add_distributed_training_examples_of_PyTorch.patch`
First pr