Multi Graphics Card #6637

Closed · tonyyang-svail opened this issue Dec 14, 2017 · 8 comments · Fixed by #6730

@tonyyang-svail commented Dec 14, 2017

This issue demonstrates the difficulties in implementing multi-GPU training.

Background

Parallelism: Data Parallel

Communication pattern: ring-based allreduce (http://research.baidu.com/bringing-hpc-techniques-deep-learning/)
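
For orientation, here is a minimal NumPy sketch of the ring-based allreduce (a scatter-reduce phase followed by an allgather phase); ring_allreduce and its chunk bookkeeping are illustrative, not part of Fluid:

import numpy as np

def ring_allreduce(worker_arrays):
    # Simulates N workers on a ring; afterwards every worker holds the
    # elementwise sum of all N input arrays.
    n = len(worker_arrays)
    chunks = [np.array_split(a.astype(np.float64), n) for a in worker_arrays]

    # Scatter-reduce: in step s, worker w sends chunk (w - s) to worker w + 1,
    # which adds it into its own copy of that chunk.
    for s in range(n - 1):
        for w in range(n):
            c = (w - s) % n
            chunks[(w + 1) % n][c] += chunks[w][c]

    # Allgather: circulate the fully reduced chunks around the ring so that
    # every worker ends up with all of them.
    for s in range(n - 1):
        for w in range(n):
            c = (w + 1 - s) % n
            chunks[(w + 1) % n][c] = chunks[w][c].copy()

    return [np.concatenate(c) for c in chunks]

parts = ring_allreduce([np.full(8, float(i)) for i in range(4)])
# every returned array now equals 0 + 1 + 2 + 3 = 6 in each element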

Python Example

data = layer.data()
label = layer.data()
places = layer.get_places(all_gpu=True)

data_array = split_data(data, places)
label_array = split_data(label, places)

with parallel_for(places) as p_for:
  h1 = layer.fc(input=read_from_array(data_array, p_for.i))  # h1 = w1 * data
  h2 = layer.fc(h1)                                          # h2 = w2 * h1
  loss = layer.softmax(h2, read_from_array(label_array, p_for.i))

append_backward(loss)

with parallel_for(places) as p_for:
  append_optimization(loss, Adam())

exe = Executor(CPUPlace())
exe.run(fluid.default_startup_program())
avg_loss_value, = exe.run(fluid.default_main_program())  # TBD: how to aggregate loss
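
The split_data call above is only sketched; as a hypothetical illustration, a data-parallel split could simply slice the mini-batch along the batch dimension, one shard per place:

import numpy as np

def split_data(batch, places):
    # np.array_split tolerates batch sizes that don't divide evenly
    return np.array_split(batch, len(places))

batch = np.random.rand(32, 784)       # 32 samples, 784 features each
places = ["GPU:0", "GPU:1", "GPU:2", "GPU:3"]
shards = split_data(batch, places)    # 4 shards of 8 samples each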

ParallelDoOp

/* ParallelDoOp
 * Input:
 *    places      vector<Place>
 *    Input       Variable
 * Output:
 *    par_scopes  vector<Scope*>
 * Attr:
 *    block       BlockDescBind
 */
class ParallelDoOp : public OperatorBase {
  ...
  void Run(const framework::Scope &scope,
           const platform::DeviceContext &dev_ctx) const override {
    std::vector<std::thread> threads;
    auto &block = attr("block");
    auto &par_scopes = output("par_scopes");
    for (auto &place : input("places")) {
      // Each place gets its own sub-scope and its own executor thread.
      auto *p_scope = &scope.NewScope();
      par_scopes.push_back(p_scope);
      threads.push_back(std::thread([=] {  // capture place/p_scope by value
        auto exe = Executor(place);
        exe.run(p_scope, block->program, block->id);
      }));
    }

    join_all_threads();
  }
};

/* ParallelDoGradOp
 * Input:
 *    places      vector<Place>
 *    Input       Variable
 *    par_scopes  vector<Scope*>
 * Output:
 *    Input_Grad  Variable
 * Attr:
 *    block       BlockDescBind   Note this is the backward block
 */
class ParallelDoGradOp : public OperatorBase {
  ...
  void Run(const framework::Scope &scope,
           const platform::DeviceContext &dev_ctx) const override {
    std::vector<std::thread> threads;
    auto &block = attr("block");
    auto &places = input("places");
    auto &par_scopes = input("par_scopes");
    for (size_t i = 0; i < places.size(); ++i) {
      // Reuse the i-th forward sub-scope on the i-th place.
      auto place = places[i];
      auto *p_scope = par_scopes[i];
      threads.push_back(std::thread([=] {
        auto exe = Executor(place);
        exe.run(p_scope, block->program, block->id);
      }));
    }

    join_all_threads();
  }
};

ProgramDesc

# start_program will be run by Executor(CPUPlace); w1 and w2 will be allocated on the CPU
start_program
{
  vars: w1, w2
  ops: init(w1), init(w2)
}

main_program
{
block0 {
  vars: data, places, w1, w2
  ops: data, get_place, parallel_do(block1),
       parallel_do_grad(block2),      # append_backward
       parallel_do(block3)            # append_optimization
       
}
block1 {
  vars: data, h1, h2, loss            # TBD: do we need to add w1, w2 here?
  ops: fc, fc, softmax
}
block2 {
  vars: data_grad, h1_grad, h2_grad, loss_grad, w1_grad, w2_grad
  ops: softmax_grad,
       fc_grad, allreduce(places, scopes, w1_grad),  # TBD: who adds allreduce?
       fc_grad, allreduce(places, scopes, w2_grad)
}
block3 {
  vars: lr                    # TBD: do we need to add w1, w2 here?
  ops: sgd(w2, w2_grad),
       sgd(w1, w1_grad)
}
}

Problems

  1. At the first iteration, who will copy the initialized parameters (note that some parameters don't need to be copied) to the different GPUs? In later iterations, how do we avoid this copy?
    • Answer: we copy on every iteration. However, we allow parameter sharing if the place is the same.
  2. Who will add allreduce? Will backward support this?
    • Answer: parallel_do will manually accumulate the gradients across all places (see the sketch after this list).
  3. Who will add parallel_do(block3)?
    • Answer: parallel_do outputs the gradients to the host place; all the param += grad updates happen only on the host place.
  4. How do we save the model?
    • Answer: all the parameters will be at the host place.
  5. How does optimization access the forward/backward scope?
    • Answer: all the updates will appear at w and w_grad on the host place.
  6. How do we aggregate the target?
    • Answer: parallel_do will aggregate its output.
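
As a minimal sketch of answers 2 and 3 (the names per_place_grads, host_params, accumulate_to_host and sgd_on_host are hypothetical, not Fluid APIs): the per-place gradients are accumulated into the host place, and the optimizer block only ever touches the host copies.

def accumulate_to_host(per_place_grads):
    # per_place_grads: {place: {param_name: grad}} as produced per card by parallel_do_grad
    host_grads = {}
    for grads in per_place_grads.values():
        for name, g in grads.items():
            host_grads[name] = host_grads.get(name, 0.0) + g
    return host_grads

def sgd_on_host(host_params, host_grads, lr=0.01):
    # param -= lr * grad happens only on the host place (block3)
    for name, g in host_grads.items():
        host_params[name] -= lr * g

# e.g. two cards, one parameter:
params = {"w1": 1.0}
grads = {"GPU:0": {"w1": 0.2}, "GPU:1": {"w1": 0.4}}
sgd_on_host(params, accumulate_to_host(grads))   # params["w1"] -> 1.0 - 0.01 * 0.6
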
@helinwang (Contributor) commented Dec 14, 2017

Each parallel_do or parallel_do_grad is a blocking call; how can we exploit the concurrency between a sequence of parallel_dos?


Response by @tonyyang-svail: a sequence of parallel_do calls should not be parallelized.

parallel_do1()
parallel_do2() # this should wait for parallel_do1

@helinwang (Contributor) commented Dec 14, 2017

There are N losses created (N = the number of places), so are we doing back-propagation N times? If so, why should backward pass #0 wait for all forward passes (it only needs forward pass #0 to complete, but parallel_do is a blocking call, so parallel_do_grad can't happen before parallel_do has finished)? If not, we need the code to merge the loss.


Response by @tonyyang-svail:

There are N losses created (N = the number of places), so are we doing back-propagation N times?

Yes

If so, why should backward pass #0 wait for all forward passes (it only needs forward pass #0 to complete, but parallel_do is a blocking call, so parallel_do_grad can't happen before parallel_do has finished)?

We have to make parallel_do blocking because we don't know whether the merged output of parallel_do is used by another op. Consider the following case:

If not, we need the code to merge the loss.

parallel_do will aggregate its output.
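
Purely as an illustration of that aggregation (whether it should be a sum or a mean over places is left open here), merging the N per-place losses could be as simple as:

per_place_losses = [0.92, 0.88, 0.95, 0.90]                   # one loss per GPU
merged_loss = sum(per_place_losses) / len(per_place_losses)   # -> 0.9125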

@helinwang (Contributor) commented:

I think the Python API for parallel.do is hard to use and fragile. We can still expose the parallel.do Python API to the user, just as C allows inline assembly. But in my opinion it should be a very low priority; we need to focus more on the transpiler, so the user doesn't have to worry about it.

jacquesqiao self-assigned this Dec 15, 2017
@jacquesqiao (Member) commented:

There are N losses created (N = the number of places), so are we doing back-propagation N times? If so, why should backward pass #0 wait for all forward passes (it only needs forward pass #0 to complete, but parallel_do is a blocking call, so parallel_do_grad can't happen before parallel_do has finished)? If not, we need the code to merge the loss.

@helinwang In some implementations, parallel_do_grad may not be fully parallel, because the backward pass can merge a gradient and update the parameter as soon as it gets one, and then calculate the other gradients; this requires a sync with each card. But we can also implement it as fully parallel: first run all the grad blocks in parallel, and then merge the gradients and update the parameters.
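
A rough, runnable sketch of the two orderings described above; every helper name here (grad_on_place, interleaved_update, parallel_then_merge) is hypothetical, and the gradients are dummies:

from concurrent.futures import ThreadPoolExecutor

PLACES = ["GPU:0", "GPU:1"]
LR = 0.1

def grad_on_place(place, name):
    return 0.5  # stand-in for running one parameter's grad op on one card

def interleaved_update(params):
    # (a) sync per parameter: as soon as every card has produced a parameter's
    #     gradient, merge it and update that parameter, then move on.
    for name in reversed(list(params)):                  # backward order
        grads = [grad_on_place(p, name) for p in PLACES]
        params[name] -= LR * sum(grads) / len(grads)

def parallel_then_merge(params):
    # (b) first run every card's whole grad block in parallel,
    #     then merge the gradients and update all parameters at once.
    def grad_block(place):
        return {name: grad_on_place(place, name) for name in params}
    with ThreadPoolExecutor(len(PLACES)) as pool:
        all_grads = list(pool.map(grad_block, PLACES))
    for name in params:
        params[name] -= LR * sum(g[name] for g in all_grads) / len(all_grads)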

@Yancey1989 (Contributor) commented:

Who will copy initialized parameters to different GPUs?

I think we need an Op to do the Broadcast and to assign each parameter to a root device with some method (round-robin/hash/...).
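
As a sketch of that root-device assignment (round_robin_root and hashed_root are hypothetical helpers; the Broadcast op would then copy each parameter from its root device to the other GPUs):

import hashlib

def round_robin_root(param_names, num_gpus):
    return {name: i % num_gpus for i, name in enumerate(param_names)}

def hashed_root(param_names, num_gpus):
    return {name: int(hashlib.md5(name.encode()).hexdigest(), 16) % num_gpus
            for name in param_names}

print(round_robin_root(["w1", "w2", "w3"], 2))   # {'w1': 0, 'w2': 1, 'w3': 0}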

@typhoonzero (Contributor) commented Dec 15, 2017

[figure: multi-gpu merge]

I'm not sure whether this picture captures the general multi-GPU SGD procedure.

@wanghaoshuang (Contributor) commented:

Who will copy initialized parameters to different GPUs?

Could the consistency of the initialized parameters across different GPUs be maintained by random initialization with an identical seed?
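
A quick check of that idea (illustrative only, and it covers initialization but not later updates): if each device runs the same initializer with the same seed, the initial parameters come out identical without any copy.

import numpy as np

def init_on_device(seed, shape=(2, 3)):
    # stand-in for running the init op on one device with a fixed seed
    return np.random.RandomState(seed).normal(size=shape)

w_gpu0 = init_on_device(seed=42)
w_gpu1 = init_on_device(seed=42)
assert np.array_equal(w_gpu0, w_gpu1)   # both devices start from identical weights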

@helinwang (Contributor) commented Dec 15, 2017

On CPU we don't need to run block3 in parallel, because there is only one memory and we don't need to update it multiple times:

block3 {
  vars: lr                    # TBD: do we need to add w1, w2 here?
  ops: sgd(w2, w2_grad),
       sgd(w1, w1_grad)
}

So this ProgramDesc actually targets multi-GPU only, and we can't send the same compiled ProgramDesc to run on both a multi-GPU cluster and a no-GPU, multi-core cluster.

We really need to consider where the ProgramDesc will run during the compilation phase, like a just-in-time compiler does.
