
Cluster train job will hang if there are too many parameter servers or ports #2224

Closed
typhoonzero opened this issue May 22, 2017 · 4 comments

@typhoonzero
Contributor

typhoonzero commented May 22, 2017

If there are too many parameter servers or too many parameter server ports (or sparse ports), some parameter servers will wait forever.

When a parameter server starts up, it says:

W0522 12:00:09.495564 35864 ParameterServer2.cpp:269] --ports_num or --ports_num_for_sparse might be too large, or total dense parameter size or sparse parameters size might be too small, this psever doesn't store any parameter.

In ParameterServer2.cpp:

void ParameterServer2::setParameter(const SendParameterRequest& request,
                                    std::vector<Buffer>& inputBuffers,
                                    SendParameterResponse* response,
                                    std::vector<Buffer>* outputBuffers) {
  ...
  if (!request.blocks().size()) {
    LOG(WARNING)
        << "--ports_num or --ports_num_for_sparse might be too large, "
        << "or total dense parameter size or sparse parameters size "
        << "might be too small, this psever doesn't store any parameter.";
    return;
  }
  ...


void ParameterServer2::addGradient(const SendParameterRequest& request,
                                   std::vector<Buffer>& inputBuffers,
                                   SendParameterResponse* response,
                                   std::vector<Buffer>* outputBuffers) {
  ...
  if (!numPassFinishClients_) {
    REGISTER_BARRIER_DELTA_SERVER_SET(
        *statSet_,
        "forwardbackwardDelta",
        FLAGS_num_gradient_servers,
        request.trainer_id(),
        request.forwardbackward_time(),
        isSparseServer_ ? "_sparseUpdater" : "_denseUpdater");
  }
  ...

It seems that the hanging problem is due to some other reason, but I still need to figure out the details of the case where there are more parameter blocks than pserver instances. A rough sketch of how a fixed-count barrier like the one above can hang is below.
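
For context on the barrier above: it is keyed on FLAGS_num_gradient_servers, so it can only release once every expected trainer has checked in on this pserver. A minimal sketch of that behaviour (the SimpleBarrier class below is hypothetical, not the real REGISTER_BARRIER_DELTA_SERVER_SET macro):

#include <condition_variable>
#include <mutex>

// Hypothetical count-based barrier, only to illustrate the failure mode.
class SimpleBarrier {
 public:
  explicit SimpleBarrier(int expected) : expected_(expected) {}

  // Blocks until `expected_` callers have arrived. If one expected caller
  // never arrives, everyone who did call wait() blocks forever.
  void wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    ++arrived_;
    if (arrived_ >= expected_) {
      cv_.notify_all();
    } else {
      cv_.wait(lock, [this] { return arrived_ >= expected_; });
    }
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  const int expected_;
  int arrived_ = 0;
};

If some trainer never reaches a given pserver, a barrier of this shape would be one way for a server-side wait to never return; whether that is what actually happens here still needs to be confirmed.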

@jacquesqiao
Member

How many ports would count as "too many"?

@typhoonzero
Contributor Author

@jacquesqiao When (dense) parameter_block > pserver count * num_ports. See the client side:

else {  /// parameter set for dense and sparse
      real* buf =
          sendingPara ? parameter->getBuf(parameterType)->getPoint(0) : nullptr;
      uint64_t endDim = 0;
      for (uint64_t beginDim = 0; beginDim < paraSize; beginDim = endDim) {
        endDim = std::min<int64_t>(beginDim + blockSize, paraSize);
        int64_t blockId = beginDim / blockSize;
        int serverId = std::abs((blockId + nameHash) % serviceNum_);

        auto& request = sendJob->parallelRequests[serverId];
        ParameterBlock* block = request.add_blocks();
        block->set_para_id(segments.id);
        block->set_block_id(blockId);
        block->set_begin_pos(beginDim);
        block->set_block_size(endDim - beginDim);
        if (buf) {
          sendJob->parallelInputIovs[serverId].push_back(
              {buf + beginDim, sizeof(real) * ((size_t)(endDim - beginDim))});
        }
      }
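
To make that concrete, here is a small standalone sketch of the same mapping, serverId = abs((blockId + nameHash) % serviceNum_), with made-up sizes (a 1M-element dense parameter, 256K block size, 4 pservers * --ports_num=2). Once serviceNum_ exceeds the number of blocks, some servers receive nothing and log the warning quoted above:

#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <vector>

int main() {
  // All values below are assumptions for illustration only.
  const uint64_t paraSize = 1 << 20;   // dense parameter size (elements)
  const uint64_t blockSize = 1 << 18;  // block size -> only 4 blocks in total
  const int serviceNum = 8;            // e.g. 4 pservers * --ports_num=2
  const int64_t nameHash = 0;          // stand-in for the parameter name hash

  std::vector<int> blocksPerServer(serviceNum, 0);
  for (uint64_t beginDim = 0; beginDim < paraSize; beginDim += blockSize) {
    int64_t blockId = beginDim / blockSize;
    int serverId = std::abs((blockId + nameHash) % serviceNum);
    ++blocksPerServer[serverId];  // same mapping as the client code above
  }

  for (int s = 0; s < serviceNum; ++s) {
    std::cout << "server " << s << ": " << blocksPerServer[s] << " block(s)"
              << (blocksPerServer[s] == 0 ? "  <-- stores no parameter" : "")
              << std::endl;
  }
  return 0;
}

With these numbers servers 4-7 get zero blocks, which is exactly the situation the setParameter warning reports.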

@dzhwinter
Contributor

dzhwinter commented May 22, 2017

Is blocks() == 0, i.e. there is not a single ParameterBlock at all, or is there already a problem on the client side when sending, e.g. paraSize == 0 @typhoonzero, so the server holds no parameters? But why does it break when there are more ports? Even with more ports there should still be at least one block, right?

@typhoonzero
Contributor Author

The warning message doesn't seem to be the cause of the job hanging; as I recall, the cause was a misconfigured job. Closing this for now.
