Multi Graphics Card #6637

Closed · tonyyang-svail opened this issue Dec 14, 2017 · 8 comments · Fixed by #6730

@tonyyang-svail commented Dec 14, 2017

This issue demonstrates the difficulties in implementing multi-GPU training.

Background

Parallelism: Data Parallel

Communication pattern: ring-based allreduce (http://research.baidu.com/bringing-hpc-techniques-deep-learning/)
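
For orientation, here is a minimal NumPy sketch of the ring-based allreduce (a scatter-reduce phase followed by an allgather phase); ring_allreduce and its chunk bookkeeping are illustrative, not part of Fluid:

import numpy as np

def ring_allreduce(worker_arrays):
    # Simulates N workers on a ring; afterwards every worker holds the
    # elementwise sum of all N input arrays.
    n = len(worker_arrays)
    chunks = [np.array_split(a.astype(np.float64), n) for a in worker_arrays]

    # Scatter-reduce: in step s, worker w sends chunk (w - s) to worker w + 1,
    # which adds it into its own copy of that chunk.
    for s in range(n - 1):
        for w in range(n):
            c = (w - s) % n
            chunks[(w + 1) % n][c] += chunks[w][c]

    # Allgather: circulate the fully reduced chunks around the ring so that
    # every worker ends up with all of them.
    for s in range(n - 1):
        for w in range(n):
            c = (w + 1 - s) % n
            chunks[(w + 1) % n][c] = chunks[w][c].copy()

    return [np.concatenate(c) for c in chunks]

parts = ring_allreduce([np.full(8, float(i)) for i in range(4)])
# every returned array now equals 0 + 1 + 2 + 3 = 6 in each element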

Python Example

data = layer.data()
label = layer.data()
places = layer.get_places(all_gpu=True)

data_array = split_data(data, places)
label_array = split_data(label, places)

with parallel_for(places) as p_for:
  h1 = layer.fc(input=read_from_array(data_array, p_for.i))  # h1 = w1 * data
  h2 = layer.fc(h1)                                          # h2 = w2 * h1
  loss = layer.softmax(h2, read_from_array(label_array, p_for.i))

append_backward(loss)

with parallel_for(places) as p_for:
  append_optimization(loss, Adam())

exe = Executor(CPUPlace())
exe.run(fluid.default_startup_program())
avg_loss_value, = exe.run(fluid.default_main_program())  # TBD: how to aggregate loss
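
The split_data call above is only sketched; as a hypothetical illustration, a data-parallel split could simply slice the mini-batch along the batch dimension, one shard per place:

import numpy as np

def split_data(batch, places):
    # np.array_split tolerates batch sizes that don't divide evenly
    return np.array_split(batch, len(places))

batch = np.random.rand(32, 784)       # 32 samples, 784 features each
places = ["GPU:0", "GPU:1", "GPU:2", "GPU:3"]
shards = split_data(batch, places)    # 4 shards of 8 samples each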

ParallelDoOp

/* ParallelDoOp
 * Input:
 *    places      vector<Place>
 *    Input       Variable
 * Output:
 *    par_scopes  vector<Scope*>
 * Attr:
 *    block       BlockDescBind
 */
class ParallelDoOp : public OperatorBase {
  ...
  void Run(const framework::Scope &scope,
           const platform::DeviceContext &dev_ctx) const override {
    std::vector<std::thread> threads;
    auto &block = attr("block");
    auto &par_scopes = output("par_scopes");
    for (auto &place : input("places")) {
      // Each place gets its own sub-scope and its own executor thread.
      auto *p_scope = &scope.NewScope();
      par_scopes.push_back(p_scope);
      threads.push_back(std::thread([=] {  // capture place/p_scope by value
        auto exe = Executor(place);
        exe.run(p_scope, block->program, block->id);
      }));
    }

    join_all_threads();
  }
};

/* ParallelDoGradOp
 * Input:
 *    places      vector<Place>
 *    Input       Variable
 *    par_scopes  vector<Scope*>
 * Output:
 *    Input_Grad  Variable
 * Attr:
 *    block       BlockDescBind   Note this is the backward block
 */
class ParallelDoGradOp : public OperatorBase {
  ...
  void Run(const framework::Scope &scope,
           const platform::DeviceContext &dev_ctx) const override {
    std::vector<std::thread> threads;
    auto &block = attr("block");
    auto &places = input("places");
    auto &par_scopes = input("par_scopes");
    for (size_t i = 0; i < places.size(); ++i) {
      // Reuse the i-th forward sub-scope on the i-th place.
      auto place = places[i];
      auto *p_scope = par_scopes[i];
      threads.push_back(std::thread([=] {
        auto exe = Executor(place);
        exe.run(p_scope, block->program, block->id);
      }));
    }

    join_all_threads();
  }
};

ProgramDesc

# start_program will be run by Executor(CPUPlace); w1 and w2 will be allocated on the CPU
start_program
{
  vars: w1, w2
  ops: init(w1), init(w2)
}

main_program
{
block0 {
  vars: data, places, w1, w2
  ops: data, get_place, parallel_do(block1),
       parallel_do_grad(block2),      # append_backward
       parallel_do(block3)            # append_optimization
       
}
block1 {
  vars: data, h1, h2, loss            # TBD: do we need to add w1, w2 here?
  ops: fc, fc, softmax
}
block2 {
  vars: data_grad, h1_grad, h2_grad, loss_grad, w1_grad, w2_grad
  ops: softmax_grad,
       fc_grad, allreduce(places, scopes, w1_grad),  # TBD: who adds allreduce?
       fc_grad, allreduce(places, scopes, w2_grad)
}
block3 {
  vars: lr                    # TBD: do we need to add w1, w2 here?
  ops: sgd(w2, w2_grad),
       sgd(w1, w1_grad)
}
}

Problems

  1. At the first iteration, who will copy the initialized parameters (note that some parameters don't need to be copied) to the different GPUs? In later iterations, how do we avoid this copy?
    • Answer: we copy on every iteration. However, we allow parameter sharing if the place is the same.
  2. Who will add allreduce? Will backward support this?
    • Answer: parallel_do will manually accumulate the gradients across all places (see the sketch after this list).
  3. Who will add parallel_do(block3)?
    • Answer: parallel_do outputs the gradients to the host place; all the param += grad updates happen only on the host place.
  4. How do we save the model?
    • Answer: all the parameters will be at the host place.
  5. How does optimization access the forward/backward scope?
    • Answer: all the updates will appear at w and w_grad on the host place.
  6. How do we aggregate the target?
    • Answer: parallel_do will aggregate its output.
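
As a minimal sketch of answers 2 and 3 (the names per_place_grads, host_params, accumulate_to_host and sgd_on_host are hypothetical, not Fluid APIs): the per-place gradients are accumulated into the host place, and the optimizer block only ever touches the host copies.

def accumulate_to_host(per_place_grads):
    # per_place_grads: {place: {param_name: grad}} as produced per card by parallel_do_grad
    host_grads = {}
    for grads in per_place_grads.values():
        for name, g in grads.items():
            host_grads[name] = host_grads.get(name, 0.0) + g
    return host_grads

def sgd_on_host(host_params, host_grads, lr=0.01):
    # param -= lr * grad happens only on the host place (block3)
    for name, g in host_grads.items():
        host_params[name] -= lr * g

# e.g. two cards, one parameter:
params = {"w1": 1.0}
grads = {"GPU:0": {"w1": 0.2}, "GPU:1": {"w1": 0.4}}
sgd_on_host(params, accumulate_to_host(grads))   # params["w1"] -> 1.0 - 0.01 * 0.6
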
@helinwang (Contributor) commented Dec 14, 2017

Each parallel_do or parallel_do_grad is a blocking call; how can we exploit the concurrency between a sequence of parallel_dos?


Response by @tonyyang-svail: a sequence of parallel_do calls should not be parallelized.

parallel_do1()
parallel_do2() # this should wait for parallel_do1

@helinwang (Contributor) commented Dec 14, 2017

There are N losses created (N = the number of places), so are we doing back-propagation N times? If so, why should backward pass #0 wait for all forward passes (it only needs forward pass #0 to complete, but parallel_do is a blocking call, so parallel_do_grad can't happen before parallel_do has finished)? If not, we need the code to merge the loss.


Response by @tonyyang-svail:

There are N losses created (N = the number of places), so are we doing back-propagation N times?

Yes

If so, why should backward pass #0 wait for all forward passes (it only needs forward pass #0 to complete, but parallel_do is a blocking call, so parallel_do_grad can't happen before parallel_do has finished)?

We have to make parallel_do blocking because we don't know whether the merged output of parallel_do is used by another op. Consider the following case:

If not, we need the code to merge the loss.

parallel_do will aggregate its output.
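
Purely as an illustration of that aggregation (whether it should be a sum or a mean over places is left open here), merging the N per-place losses could be as simple as:

per_place_losses = [0.92, 0.88, 0.95, 0.90]                   # one loss per GPU
merged_loss = sum(per_place_losses) / len(per_place_losses)   # -> 0.9125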

@helinwang (Contributor) commented:

I think the Python API for parallel.do is hard to use and fragile. We can still expose the parallel.do Python API to the user, just as C allows inline assembly. But in my opinion it should be a very low priority; we need to focus more on the transpiler, so the user doesn't have to worry about it.

jacquesqiao self-assigned this Dec 15, 2017
@jacquesqiao (Member) commented:

There are N losses created (N = the number of places), so are we doing back-propagation N times? If so, why should backward pass #0 wait for all forward passes (it only needs forward pass #0 to complete, but parallel_do is a blocking call, so parallel_do_grad can't happen before parallel_do has finished)? If not, we need the code to merge the loss.

@helinwang In some implementations, parallel_do_grad may not be fully parallel, because the backward pass can merge a gradient and update the parameter as soon as it gets one, and then calculate the other gradients; this requires a sync with each card. But we can also implement it as fully parallel: first run all the grad blocks in parallel, and then merge the gradients and update the parameters.
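
A rough, runnable sketch of the two orderings described above; every helper name here (grad_on_place, interleaved_update, parallel_then_merge) is hypothetical, and the gradients are dummies:

from concurrent.futures import ThreadPoolExecutor

PLACES = ["GPU:0", "GPU:1"]
LR = 0.1

def grad_on_place(place, name):
    return 0.5  # stand-in for running one parameter's grad op on one card

def interleaved_update(params):
    # (a) sync per parameter: as soon as every card has produced a parameter's
    #     gradient, merge it and update that parameter, then move on.
    for name in reversed(list(params)):                  # backward order
        grads = [grad_on_place(p, name) for p in PLACES]
        params[name] -= LR * sum(grads) / len(grads)

def parallel_then_merge(params):
    # (b) first run every card's whole grad block in parallel,
    #     then merge the gradients and update all parameters at once.
    def grad_block(place):
        return {name: grad_on_place(place, name) for name in params}
    with ThreadPoolExecutor(len(PLACES)) as pool:
        all_grads = list(pool.map(grad_block, PLACES))
    for name in params:
        params[name] -= LR * sum(g[name] for g in all_grads) / len(all_grads)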

@Yancey1989 (Contributor) commented:

Who will copy initialized parameters to different GPUs?

I think we need an Op to do the Broadcast and to assign each parameter to a root device with some method (round-robin/hash/...).
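
As a sketch of that root-device assignment (round_robin_root and hashed_root are hypothetical helpers; the Broadcast op would then copy each parameter from its root device to the other GPUs):

import hashlib

def round_robin_root(param_names, num_gpus):
    return {name: i % num_gpus for i, name in enumerate(param_names)}

def hashed_root(param_names, num_gpus):
    return {name: int(hashlib.md5(name.encode()).hexdigest(), 16) % num_gpus
            for name in param_names}

print(round_robin_root(["w1", "w2", "w3"], 2))   # {'w1': 0, 'w2': 1, 'w3': 0}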

@typhoonzero (Contributor) commented Dec 15, 2017

[figure: multi-gpu merge]

I'm not sure whether this picture captures the general multi-GPU SGD procedure.

@wanghaoshuang (Contributor) commented:

Who will copy initialized parameters to different GPUs?

Could the consistency of the initialized parameters across different GPUs be maintained by random initialization with an identical seed?
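
A quick check of that idea (illustrative only, and it covers initialization but not later updates): if each device runs the same initializer with the same seed, the initial parameters come out identical without any copy.

import numpy as np

def init_on_device(seed, shape=(2, 3)):
    # stand-in for running the init op on one device with a fixed seed
    return np.random.RandomState(seed).normal(size=shape)

w_gpu0 = init_on_device(seed=42)
w_gpu1 = init_on_device(seed=42)
assert np.array_equal(w_gpu0, w_gpu1)   # both devices start from identical weights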

@helinwang (Contributor) commented Dec 15, 2017

On CPU we don't need to run block3 in parallel, because there is only one memory and we don't need to update it multiple times:

block3 {
  vars: lr                    # TBD: do we need to add w1, w2 here?
  ops: sgd(w2, w2_grad),
       sgd(w1, w1_grad)
}

So this ProgramDesc actually targets multi-GPU only, and we can't send the same compiled ProgramDesc to run on both a multi-GPU cluster and a no-GPU, multi-core cluster.

We really need to consider where the ProgramDesc will run during the compilation phase, like a just-in-time compiler does.
