-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mpi enabled #9490
Mpi enabled #9490
Conversation
@Yancey1989 @typhoonzero I write a design doc about Mpi enabled, thanks for the review. |
Please fix the style check first |
@@ -0,0 +1,33 @@ | |||
#MPI-enabled PaddlePaddle Design doc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a space after "#"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
#MPI-enabled PaddlePaddle Design doc | ||
|
||
# Background | ||
Now, PaddlePaddle Fluid with Distribution has relatively large network bottleneck, We want to use RDMA and GPUDriect to improve and solve it, so we enabled the features to PaddlePaddle with the help of MPI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we do distribute multi GPU training, the communication overhead between servers become the major bottleneck, because of the following reasons:
- Must copy at least once from GPU to CPU memory so that the data can be ready to transfer. And for pserver side, copy data from CPU to GPU introduce more overhead.
- GPU->CPU data transfer is 10 times slower than data transfer between GPUs or between PCIe devices.
- TCP connections can not make full use of RDMA 100Gb devices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
# Background | ||
Now, PaddlePaddle Fluid with Distribution has relatively large network bottleneck, We want to use RDMA and GPUDriect to improve and solve it, so we enabled the features to PaddlePaddle with the help of MPI. | ||
|
||
We will introduce Open MPI API to PaddlePaddle, which can bring two benefits to PaddlePaddle: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open MPI => OpenMPI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
We will introduce Open MPI API to PaddlePaddle, which can bring two benefits to PaddlePaddle: | ||
1. Enable RDMA with PaddlePaddle, which bring high-performance low latency networks. | ||
2. Enable GPUDriect with PaddlePaddle, which bring the highest throughput and lowest latency GPU read and write. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need details of the design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
### mpi_module | ||
We will build a new module to package MPI send and receive process. MPI send and recvice are defferent to gRPC, the MPI [recvice](https://www.open-mpi.org/doc/v1.8/man3/MPI_Irecv.3.php) must know receive buffer size and receive buffer element. For this reason, We have to make conmunications twice, the first one is to send metadata about gradient through gRPC, the second one is the real conmunications through MPI which send gradient data to mpi_listenandserve_op. | ||
The detail flow is below: | ||
![](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/dist_train/src/mpi_module.png) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
picture is not shown
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Launch the script using the ```mpirun``` launcher, For example: ```mpirun -np 3 -hosts node1,node2,node3 python train.py```. By doing this, We can number the actors (trainer/pserver/master) with o .. (n-1). The node's number is the Rank of the calling process in a group of comm (integer), The MPI processes identify each other using a Rank ID. We have to create a mapping between PaddlePaddle's actors and their Rank ID so that we can communicate with the correct destinations when using MPI operations. | ||
**We have to store the Rank ID and the mapping in global variables.** | ||
|
||
## New OP/MODULE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just list things need to be changed like:
- mpi_send_op
- mpi_serv_op
- modify transpiler to support using MPI or not
- compile arg to enable MPI support
- ...
then discuss the details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -0,0 +1,29 @@ | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate the implement and design to 2 PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have deleted the implement now and will rebuild it in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
design doc about #9405