This repository has been archived by the owner on Nov 16, 2019. It is now read-only.

How does CaffeOnSpark exchange and synchronize each executor's parameters? #262

Open
guyang88 opened this issue Jun 6, 2017 · 11 comments

Comments

@guyang88

guyang88 commented Jun 6, 2017

@anfeng @junshi15 How does CaffeOnSpark exchange and synchronize each executor's parameters?

@junshi15
Collaborator

junshi15 commented Jun 7, 2017

Assuming multiple GPUs per node and multiple nodes, there are two levels of exchange:
Inside a node, each GPU computes its gradients on its own mini-batch and sends them to a root GPU, which averages them.
Across nodes, the root GPUs send their averaged gradients to the master node's root GPU, which averages them and updates the weights. The updated weights are broadcast back to each node's root GPU, and each root GPU then broadcasts them to the other GPUs inside its node.

All of this is done synchronously: no GPU is allowed to start the next batch until everybody has the updated weights.
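
Here is a minimal, single-process sketch of the two levels described above: per-GPU gradients are averaged at a node's root, the node averages are combined at the master and applied to the weights, and the updated weights would then flow back down. The names and structure are illustrative only and do not mirror CaffeOnSpark's actual classes.

```cpp
// Conceptual sketch of the two-level synchronous exchange (not CaffeOnSpark code).
#include <cstdio>
#include <vector>

using Grad = std::vector<float>;

// Level 1: every GPU on a node produces gradients; the node's root GPU averages them.
Grad node_reduce(const std::vector<Grad>& per_gpu_grads) {
    Grad avg(per_gpu_grads[0].size(), 0.0f);
    for (const Grad& g : per_gpu_grads)
        for (size_t i = 0; i < g.size(); ++i) avg[i] += g[i];
    for (float& v : avg) v /= per_gpu_grads.size();
    return avg;
}

// Level 2: the master node's root GPU averages the per-node gradients and updates weights.
void master_update(std::vector<float>& weights,
                   const std::vector<Grad>& per_node_grads, float lr) {
    Grad avg = node_reduce(per_node_grads);  // same averaging, across nodes
    for (size_t i = 0; i < weights.size(); ++i) weights[i] -= lr * avg[i];
    // The updated weights would then be broadcast to every node's root GPU, and from
    // there to every GPU on the node, before anyone is allowed to start the next batch.
}

int main() {
    std::vector<float> weights = {1.0f, 2.0f};
    // Two nodes, two GPUs each, gradients from one synchronous mini-batch:
    std::vector<Grad> node0 = {{0.2f, 0.4f}, {0.4f, 0.6f}};
    std::vector<Grad> node1 = {{0.1f, 0.3f}, {0.3f, 0.5f}};
    std::vector<Grad> per_node = {node_reduce(node0), node_reduce(node1)};
    master_update(weights, per_node, /*lr=*/0.1f);
    std::printf("updated weights: [%.3f, %.3f]\n", weights[0], weights[1]);
    return 0;
}
```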

@guyang88
Author

guyang88 commented Jun 8, 2017

@junshi15 Thanks, but I have a question: why not use a parameter server, whose asynchronous updates could make training faster?

@junshi15
Collaborator

junshi15 commented Jun 9, 2017

The sync version is simple to implement and verify. We do not have a need for async training at the moment. In addition, we are limited by our resources. Your contribution is welcome.

@jacklonghui

@junshi15 @guyang88 Excuse me, I've been paying attention to this problem recently. In the source code (caffe-distri/src/main/cpp/util/socket_sync_cpu.cpp and rdma_sync.cpp), it seems that the data passed from the parameter server is sliced, rather than being the full weights or gradients. Is that so? I'm a little confused now. Can you help me? Thank you!

@junshi15
Collaborator

@jacklonghui Regarding slicing, it is an efficient implementation of all-reduce. If all the clients sent their gradients to one node, that node would become a bottleneck. What's implemented in CaffeOnSpark is a ring algorithm, where each node sends and receives a portion of the entire gradients.
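
For illustration, here is a small single-process simulation of that ring idea: at each step every node sends one slice of its gradient vector to the next node and accumulates the slice arriving from the previous one, so no single node ever receives everyone's full gradient. This is a conceptual sketch, not the socket_sync.cpp implementation.

```cpp
// Single-process simulation of a ring reduce-scatter over gradient slices.
#include <cstdio>
#include <vector>

int main() {
    const int N = 4;            // number of nodes in the ring
    const int len = 8;          // gradient length (assumed divisible by N)
    const int chunk = len / N;  // slice size

    // Each node starts with its own full gradient vector (toy values: node n holds n+1).
    std::vector<std::vector<float>> grad(N, std::vector<float>(len));
    for (int n = 0; n < N; ++n)
        for (int i = 0; i < len; ++i) grad[n][i] = float(n + 1);

    // Reduce-scatter: after N-1 steps, node n holds the fully reduced slice (n+1) % N.
    for (int step = 0; step < N - 1; ++step) {
        // Snapshot what every node sends this step: slice (n - step) mod N.
        std::vector<std::vector<float>> outbox(N, std::vector<float>(chunk));
        for (int n = 0; n < N; ++n) {
            int send_slice = ((n - step) % N + N) % N;
            for (int i = 0; i < chunk; ++i)
                outbox[n][i] = grad[n][send_slice * chunk + i];
        }
        // Every node receives from its left neighbor and accumulates that slice.
        for (int n = 0; n < N; ++n) {
            int src = (n - 1 + N) % N;
            int recv_slice = ((src - step) % N + N) % N;  // the slice the neighbor sent
            for (int i = 0; i < chunk; ++i)
                grad[n][recv_slice * chunk + i] += outbox[src][i];
        }
    }

    // A second ring pass (all-gather) would circulate the completed slices so that
    // every node ends up with the whole reduced gradient.
    for (int n = 0; n < N; ++n) {
        int owned = (n + 1) % N;
        std::printf("node %d owns reduced slice %d: %.0f (expected %d)\n",
                    n, owned, grad[n][owned * chunk], 1 + 2 + 3 + 4);
    }
    return 0;
}
```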

@jacklonghui

@junshi15 OK, thank you! I got it.

@jacklonghui

@junshi15
Hi, as you said above, I have a few questions:
(1) The master node is a single node that is mainly responsible for scheduling across the cluster, and does not do iterative training like the worker nodes?
(2) In the code at https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-distri/src/main/cpp/util/socket_sync.cpp, each node sends and receives a portion of the entire gradient; are the weights handled the same way?
(3) Besides, I'm still a little confused. It seems that the gradients and weights are exchanged by every node sending and receiving in parallel, rather than being broadcast by the master node. Is that so?

@junshi15
Collaborator

  1. Yes, it does training as well.
  2. Everybody's gradients are different, since they are computed on each worker's own mini-batch. The gradients are then aggregated and applied to the weights, so at the end of an iteration everybody has the same weights.
  3. In this implementation, everybody is a master (of a portion of the gradients/weights), and everybody is a slave (for the remaining portion of the gradients/weights).
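
To make point 3 concrete, here is a tiny sketch of that idea: a parameter vector is partitioned so that node n is the master of slice n and a client for every other slice. The parameter count, node count, and partitioning rule below are hypothetical, chosen only for illustration.

```cpp
// Hypothetical partitioning: each node masters one contiguous slice of the parameters.
#include <cstdio>

int main() {
    const int num_params = 10;  // illustrative parameter count
    const int num_nodes  = 3;   // illustrative cluster size
    const int base  = num_params / num_nodes;
    const int extra = num_params % num_nodes;  // spread the remainder over the first slices

    for (int n = 0; n < num_nodes; ++n) {
        int start = n * base + (n < extra ? n : extra);
        int size  = base + (n < extra ? 1 : 0);
        std::printf("node %d: master of params [%d, %d), client for the rest\n",
                    n, start, start + size);
    }
    return 0;
}
```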

@jacklonghui

@junshi15 OK, thank you! About these lines:
"...the root GPUs send their averaged gradients to the master node's root GPU, which averages them and updates the weights. The updated weights are broadcast back to each node's root GPU, and each root GPU then broadcasts them to the other GPUs inside its node..."

Does the "master node" here exist on every node? If not, then there is a single "master node" that collects and processes the gradients everybody sends and broadcasts the weights to everybody. Where does this "master node" live?

@junshi15
Collaborator

The lines you quote are conceptually true, but what's implemented here is different.
In this particular implementation, everybody is a master and a worker, so you can regard every node as both a master node and a worker node. On every node there is a MasterBuffer and a WorkerBuffer. When gradients are ready, this function is called; when an iteration starts, this function is called. Please examine the code for details.
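
As a rough picture of those two entry points, here is a hypothetical sketch: one buffer plays the master role for the slice this node reduces on behalf of the cluster, the other stages data for the slices owned by peers. The names below (NodeSync, on_gradients_ready, on_iteration_start) are invented for illustration and are not the actual CaffeOnSpark API; see socket_sync.cpp for the real functions.

```cpp
// Hypothetical sketch of the two per-node buffers and when they are exercised.
#include <vector>

struct NodeSync {
    std::vector<float> master_buffer;  // slice this node reduces for the whole cluster
    std::vector<float> worker_buffer;  // staging area for slices owned by peer nodes

    // Called once local gradients are ready: ship peer-owned slices out of
    // worker_buffer and accumulate incoming peer contributions into master_buffer.
    void on_gradients_ready() {
        // network send/receive omitted in this sketch
    }

    // Called when the next iteration starts: pull the updated weights for the
    // peer-owned slices back into the local net before computing on the next batch.
    void on_iteration_start() {
        // network receive/copy omitted in this sketch
    }
};
```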

@jacklonghui

@junshi15 ok, thank you!
