Cluster train job will hang if there are too many parameter server or ports #2224
Comments
How many ports count as too many? |
@jacquesqiao (dense) parameter_block > pserver count * num_ports. See the client side:
    else { /// parameter set for dense and sparse
      real* buf =
          sendingPara ? parameter->getBuf(parameterType)->getPoint(0) : nullptr;
      uint64_t endDim = 0;
      for (uint64_t beginDim = 0; beginDim < paraSize; beginDim = endDim) {
        endDim = std::min<int64_t>(beginDim + blockSize, paraSize);
        int64_t blockId = beginDim / blockSize;
        int serverId = std::abs((blockId + nameHash) % serviceNum_);
        auto& request = sendJob->parallelRequests[serverId];
        ParameterBlock* block = request.add_blocks();
        block->set_para_id(segments.id);
        block->set_block_id(blockId);
        block->set_begin_pos(beginDim);
        block->set_block_size(endDim - beginDim);
        if (buf) {
          sendJob->parallelInputIovs[serverId].push_back(
              {buf + beginDim, sizeof(real) * ((size_t)(endDim - beginDim))});
        }
      } |
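To make the block-to-server mapping in the loop above concrete, here is a minimal standalone sketch, not the Paddle code itself. It reuses only the rule `serverId = abs((blockId + nameHash) % serviceNum_)` from the quoted snippet. The function name `blocksPerServer` and its plain integer inputs are hypothetical, and reading `serviceNum` as pserver count times ports is an assumption.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper: counts how many blocks of ONE dense parameter land on
// each service, using the same assignment rule as the quoted client loop:
//   serverId = abs((blockId + nameHash) % serviceNum)
// Assumes blockSize > 0 and serviceNum > 0.
std::vector<int> blocksPerServer(uint64_t paraSize, uint64_t blockSize,
                                 int64_t nameHash, int serviceNum) {
  std::vector<int> counts(serviceNum, 0);
  uint64_t endDim = 0;
  for (uint64_t beginDim = 0; beginDim < paraSize; beginDim = endDim) {
    // The last block may be shorter than blockSize.
    endDim = std::min<uint64_t>(beginDim + blockSize, paraSize);
    int64_t blockId = static_cast<int64_t>(beginDim / blockSize);
    int serverId =
        static_cast<int>(std::abs((blockId + nameHash) % serviceNum));
    ++counts[serverId];
  }
  return counts;
}
```

For example, `blocksPerServer(2500, 1000, 0, 8)` splits the parameter into 3 blocks, so services 0, 1 and 2 each receive one block and the other five receive none for that parameter. Whether a service without blocks is what makes a job hang is not established in this thread.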
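Is block() == 0, or are there no ParameterBlocks at all, or is something already wrong when the client sends, e.g. paraSize == 0? @typhoonzero, there are no parameters on the server at all. But why does it fail just because there are more ports? However many there are, there should still be at least one block, right? |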
The warning message doesn't seem to be the reason the job hangs; as I recall, the cause was a misconfigured job. Closing this for now. |
If there are too many parameter servers or too many parameter server ports (or sparse ports), some parameter servers will wait forever.
When the parameter server starts up, it says:
In ParameterServer2.cpp:
It seems that the hanging problem is due to some other reason, but I still need to figure out the details of what happens when there are more parameter blocks than pserver instances.