Multi Graphics Card #6637
Comments
Each …
Response by @tonyyang-svail: a sequence of …
There are N …
Response by @tonyyang-svail: Yes. We have to make …
I think the Python API for parallel.do is hard to use and fragile. We can still expose the parallel.do Python API to the user, just as C allows inline assembly. But in my opinion it should be very low priority; we need to focus more on the transpiler, so the user doesn't have to worry about it.
@helinwang In some implementations, parallel_do_grad may not be fully parallel, because backward can merge a gradient and update the parameter as soon as it gets one, and only then calculate the other gradients; this requires a sync with each card. But we can implement it as parallel: first run all grad blocks in parallel, and then merge the gradients and update the parameters.
I think we need an Op to do the …
Can the consistency of initialized parameters among different GPUs be maintained by random initialization with an identical seed?
On CPU we don't need to run …
So this ProgramDesc is actually targeted at multiple-GPU only, and we can't send the same compiled ProgramDesc to run on both a multiple-GPU cluster and a no-GPU, multiple-core cluster. We really need to consider where the ProgramDesc will run during the compile phase, like a just-in-time compiler.
This issue demonstrates the difficulties in implementing multi-GPU training.
Background
Parallelism: Data Parallel
Communication pattern: ring-based allreduce. http://research.baidu.com/bringing-hpc-techniques-deep-learning/
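The linked post describes ring-based allreduce only in prose. The snippet below is a minimal, framework-free simulation of the scatter-reduce plus allgather schedule, added here for illustration; it is not code from this issue or from Paddle.

```python
import numpy as np

def ring_allreduce(worker_data):
    """Simulate ring-based allreduce (scatter-reduce + allgather) over a
    list of equal-length numpy arrays, one array per virtual worker."""
    n = len(worker_data)
    chunks = [list(np.array_split(d.astype(np.float64), n)) for d in worker_data]

    # Scatter-reduce: at each step, worker i sends chunk (i - step) % n to
    # worker (i + 1) % n, which accumulates it. After n - 1 steps, worker i
    # holds the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # Allgather: at each step, worker i forwards its newest complete chunk
    # (i + 1 - step) % n to worker (i + 1) % n, which overwrites its copy.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    return [np.concatenate(c) for c in chunks]

# Four workers, each holding its own gradient vector; after allreduce every
# worker holds the element-wise sum of all four.
grads = [np.full(8, float(i + 1)) for i in range(4)]
reduced = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in reduced)
```

Each worker only ever talks to its ring neighbor, so per-step traffic stays constant as the number of cards grows, which is why this pattern is attractive for multi-GPU gradient aggregation.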
Python Example
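The code block that originally followed this heading did not survive the page capture. Below is a rough sketch of how the parallel.do style API discussed in this thread was meant to be used; the helper names (get_places, read_input, write_output), the import path, and the exact layer calls are reconstructed assumptions, not the issue's original example.

```python
import paddle.fluid as fluid   # the import path may have differed at the time

x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32')

places = fluid.layers.get_places()        # one place per available GPU card
pd = fluid.layers.ParallelDo(places)

with pd.do():
    x_ = pd.read_input(x)                 # this card's slice of the mini-batch
    y_ = pd.read_input(y)
    y_pred = fluid.layers.fc(input=x_, size=1)
    cost = fluid.layers.square_error_cost(input=y_pred, label=y_)
    pd.write_output(cost)

cost = pd()                               # outputs aggregated from all cards
avg_cost = fluid.layers.mean(x=cost)
fluid.backward.append_backward(loss=avg_cost)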
ParallelDoOp
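The operator description that originally followed this heading is also missing. As a stand-in, here is a framework-free toy simulation (plain numpy, hypothetical helper names) of the semantics the thread attributes to ParallelDoOp and parallel_do_grad: scatter the mini-batch across cards, run the sub-block per card, aggregate the outputs, and sum the per-card parameter gradients back on the host place. It illustrates the discussion, not the actual C++ operator.

```python
import numpy as np

def parallel_do(batch, params, sub_block, num_cards):
    """Forward semantics: scatter the mini-batch across cards, run the
    sub-block on each card with replicated parameters, concatenate outputs."""
    slices = np.array_split(batch, num_cards)                 # scatter input
    outputs = [sub_block(s, dict(params)) for s in slices]    # run block per card
    return np.concatenate(outputs)                            # aggregate output

def parallel_do_grad(batch, out_grad, params, grad_block, num_cards):
    """Backward semantics: run the grad block per card, then sum the
    per-card parameter gradients on the host place, where the parameter
    update (param += grad style) finally happens."""
    x_slices = np.array_split(batch, num_cards)
    g_slices = np.array_split(out_grad, num_cards)
    per_card = [grad_block(x, g, dict(params))
                for x, g in zip(x_slices, g_slices)]
    return {k: sum(g[k] for g in per_card) for k in per_card[0]}

# Example sub-block: y = x @ W, with dW = x^T @ dy.
params = {'W': np.random.randn(4, 2)}
fwd = lambda x, p: x @ p['W']
bwd = lambda x, dy, p: {'W': x.T @ dy}

x = np.random.randn(8, 4)
y = parallel_do(x, params, fwd, num_cards=4)
dW = parallel_do_grad(x, np.ones_like(y), params, bwd, num_cards=4)
assert np.allclose(dW['W'], x.T @ np.ones_like(y))  # matches single-card gradient
```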
ProgramDesc
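The ProgramDesc listing that originally appeared here is gone as well. The outline below is a guess at the block layout under discussion, reconstructed from the mentions of parallel_do, parallel_do_grad, and block3 in the problems that follow; the block numbering and op names are assumptions, not the original figure.

```python
# Hypothetical block layout (not the original ProgramDesc from the issue).
program = {
    'block0': [                        # global block, runs on the host place
        'feed x, y',
        'parallel_do(block1)',         # forward sub-block, once per card
        'mean -> loss',
        'parallel_do_grad(block2)',    # backward sub-block, once per card
        'allreduce / sum of per-card parameter gradients',
        'param += grad',               # parameter update, host place only
    ],
    'block1': ['fc', 'square_error_cost'],   # per-card forward
    'block2': ['gradient ops of block1'],    # per-card backward
}
```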
Problems
- allreduce? Backward will support this?
- parallel_do(block3)? Answer: param += grad only happens on the host place.
- target? parallel_do will aggregate its output.