Survey on Multi GPU Feature

Caffe2 & NCCL library

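Caffe2 offers two ways to use multiple GPUs. The first is a set of operators that wrap the NCCL library to share and update parameters across GPUs. Caffe2 wraps the NCCL context and sends one TensorCUDA per call, and synchronization between GPUs uses a ring communication algorithm. NCCL2 implements the following MPI-like primitives: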

all-reduce
all-gather
reduce-scatter
reduce
broadcast
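
To make the ring communication algorithm concrete, here is a minimal pure-Python simulation of a sum all-reduce over a ring. It is only a sketch: the function name `ring_allreduce` and the list-of-lists representation of per-device buffers are assumptions made for illustration, and it does not call NCCL or Caffe2. It runs a reduce-scatter phase followed by an all-gather phase, two of the primitives listed above.

```python
def ring_allreduce(buffers):
    """Simulate a sum all-reduce over a ring of len(buffers) devices.

    buffers[r] is the list of numbers held by device r. All lists must have
    the same length, and that length must be divisible by the device count.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert all(len(b) == size for b in buffers) and size % n == 0
    chunk = size // n
    data = [list(b) for b in buffers]  # work on copies, leave inputs intact

    def span(c):
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1 (reduce-scatter): in step s, device r sends chunk (r - s) % n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(r, (r - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] += v

    # Device r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2 (all-gather): pass the reduced chunks around the ring.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] = v

    return data
```

Each device sends and receives only one chunk per step, so the per-device traffic stays roughly constant as the number of devices grows; this is the usual motivation for the ring layout.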

The second way is on the Python side: Caffe2 also provides broadcast, allreduce, and allgather functions implemented with the low-level cpu2gpu and gpu2gpu interfaces.
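
The following sketch shows what broadcast, allreduce, and allgather compute when implemented naively with copies between devices. The function names and the use of plain Python lists as per-device buffers are illustrative assumptions; this is not the Caffe2 Python API.

```python
def broadcast(buffers, root=0):
    """Every device ends up with a copy of the root device's buffer."""
    return [list(buffers[root]) for _ in buffers]


def allreduce(buffers):
    """Every device ends up with the element-wise sum of all buffers."""
    total = [sum(column) for column in zip(*buffers)]
    return [list(total) for _ in buffers]


def allgather(buffers):
    """Every device ends up with the concatenation of all buffers."""
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]


# Two simulated devices, each holding a two-element buffer.
devices = [[1, 2], [3, 4]]
summed = allreduce(devices)
copied = broadcast(devices, root=1)
joined = allgather(devices)
```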

tensorflow

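TensorFlow uses tf.device() to specify where parameters are stored. For example, with tf.device("/gpu:0") places a group of parameters and their operations on GPU 0. A name prefix is added with tf.name_scope(). tf.get_variable() retrieves the existing parameter by name, which is how multiple GPUs share the same parameters.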

Example code

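import tensorflow as tf  # TF 1.x graph-mode API
# Assumed to be defined as in the TensorFlow CIFAR-10 multi-GPU example:
# cifar10, tower_loss, opt, num_gpus, image_batch, label_batch.
tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
  for i in range(num_gpus):
    with tf.device('/gpu:%d' % i):
      with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
        loss = tower_loss(scope, image_batch, label_batch)
        # Later towers reuse the variables created by the first tower.
        tf.get_variable_scope().reuse_variables()
        # Collect this tower's gradients; they are combined across towers later.
        tower_grads.append(opt.compute_gradients(loss))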
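
In this tower pattern, each GPU computes gradients for the same shared variables, and the per-tower gradients are usually averaged before one update is applied. The sketch below shows that averaging step in plain Python, under the assumption that each tower contributes a list of (gradient, variable name) pairs in the same variable order. The function name `average_gradients` and the use of float lists instead of tensors are illustrative choices, not TensorFlow API.

```python
def average_gradients(tower_grads):
    """Average gradients across towers.

    tower_grads[t] is a list of (gradient, variable_name) pairs for tower t,
    in the same variable order for every tower. Each gradient is a list of
    floats standing in for a tensor.
    """
    averaged = []
    # zip(*tower_grads) yields, for each shared variable, one pair per tower.
    for pairs in zip(*tower_grads):
        grads = [g for g, _ in pairs]
        name = pairs[0][1]  # the variable is shared, so any tower's name works
        mean = [sum(vals) / len(vals) for vals in zip(*grads)]
        averaged.append((mean, name))
    return averaged


# Two towers, two shared variables "w" and "b".
example = [
    [([1.0, 2.0], "w"), ([4.0], "b")],
    [([3.0, 6.0], "w"), ([0.0], "b")],
]
result = average_gradients(example)
```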

TensorFlow users have contributed wrappers for NCCL device-communication interfaces such as AllReduce and Broadcast.

mxnet

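MXNet natively supports multiple GPUs on a single machine; multi-machine, multi-GPU training requires rebuilding with a compile-time option. Among MXNet's modules, Module provides the multi-GPU workload configuration, which is used to balance the computation across GPUs. The KVStore option --kv_store selects the type: local aggregates on the CPU of a single machine, and device aggregates on the GPUs of a single machine. For multi-machine, multi-GPU training, rebuild with USE_DIST_KVSTORE=1 and then set the type to dist_device_sync.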
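
The selection rule described above can be written down as a small helper. This is a hypothetical function (`choose_kvstore_type` is not part of MXNet) that only encodes the three cases covered in this survey; returning local for a single GPU is an assumption, since no cross-GPU aggregation is needed there.

```python
def choose_kvstore_type(num_machines, num_gpus_per_machine):
    """Pick a KVStore type string for the setups covered in this survey.

    Hypothetical helper, not an MXNet function.
    """
    if num_machines < 1 or num_gpus_per_machine < 0:
        raise ValueError("invalid machine or GPU count")
    if num_machines > 1:
        if num_gpus_per_machine < 1:
            raise ValueError("multi-machine CPU-only setups are not covered here")
        # Requires an MXNet build with USE_DIST_KVSTORE=1.
        return "dist_device_sync"
    if num_gpus_per_machine > 1:
        return "device"
    return "local"
```

The returned string is the value that would be passed as the KVStore type.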

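MXNet does not expose GPU communication interfaces in its API. They are wrapped uniformly behind the Push/Pull interface of KVStore, and the single-machine and multi-machine versions are implemented as different KVStore subclasses. BroadCast and AllReduce are implemented in the C++ backend, but no Broadcast or AllReduce operators are provided to users.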

Example code:

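#include <vector>

// Simplified sketch of the single-machine KVStore. KVStore, NDArray and Comm
// are MXNet types; the arguments of Broadcast() and Reduce() are omitted.
class KVStoreLocal : public KVStore {
public:
    // Pull: broadcast the stored values to the requesting devices.
    void Pull(const std::vector<int>& keys,
              const std::vector<NDArray*>& values,
              int priority) override {
        comm_->Broadcast();
    }
    // Push: reduce the values pushed from the devices.
    void Push(const std::vector<int>& keys,
              const std::vector<NDArray>& values,
              int priority) override {
        comm_->Reduce();
    }

private:
    Comm* comm_;  // may target GPU/CPU devices or nodes
};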
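
To show how Push maps to a reduce and Pull maps to a broadcast, here is a toy key-value store in plain Python. The class `ToyKVStore`, its method signatures, and the list-based values are assumptions for illustration; it is not the MXNet KVStore API and performs no device communication.

```python
class ToyKVStore:
    """Toy single-process stand-in for a local key-value store."""

    def __init__(self):
        self._store = {}

    def init(self, key, value):
        # Register a key with its initial value.
        self._store[key] = list(value)

    def push(self, key, device_values):
        # Reduce: sum the values pushed by all devices and store the result.
        self._store[key] = [sum(column) for column in zip(*device_values)]

    def pull(self, key, num_devices):
        # Broadcast: hand one independent copy of the value to each device.
        return [list(self._store[key]) for _ in range(num_devices)]


kv = ToyKVStore()
kv.init("weight", [0.0, 0.0])
kv.push("weight", [[1.0, 2.0], [3.0, 4.0]])  # gradients from two devices
copies = kv.pull("weight", num_devices=2)
```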
