[rabit harden] fix rabit tests #81
You could temporarily point dmlc-core in this PR to your private branch to cross-check before we merge any of them.
This reverts commit 0eba614.
Tests pass locally with gcc-7/g++-7 on macOS 10.14.3.
^^ @CodingCat @hcho3
Passed integration tests on the XGBoost side.
Passing XGBoost tests.
@trivialfis Can you help merge this? It should be a straightforward PR.
Note: this PR will fail tests until the dmlc-core local tracker PR lands: dmlc/dmlc-core#510
This is the first PR in a series to harden rabit and make sure the tests run with meaningful coverage. The goal of this PR is to enable the model recovery tests, which simulate worker failure and resume (catching up with the next allreduce), the expected behavior described in the guide doc.
This PR works alongside the local tracker PR. It passes the latest command into rabit init, removes the duplicated ntrail value overwrite from DMLC_NUM_ATTEMPT, and adds console output when a worker recovers or catches up in allreduce.
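For readers unfamiliar with the failure-and-resume behavior these tests exercise, here is a rough, single-process sketch of the idea. It is a toy simulation, not rabit code: `train`, `allreduce_sum`, `local_value`, and the log message format are all hypothetical names made up for illustration, and the sketch assumes a checkpoint is taken after every iteration. Real recovery in rabit is more involved than the plain checkpoint reload shown here.

```python
# Toy single-process simulation of checkpoint-based recovery.
# Every name here is hypothetical; this is not rabit's API.

NUM_WORKERS = 3
NUM_ITERS = 4


def local_value(rank, iteration):
    """Deterministic per-worker contribution, so a replay gives the same result."""
    return (rank + 1) * (iteration + 1)


def allreduce_sum(iteration):
    """Stand-in for an allreduce: every worker receives the same global sum."""
    return sum(local_value(rank, iteration) for rank in range(NUM_WORKERS))


def train(fail_rank=None, fail_iter=None):
    """Run NUM_ITERS lockstep iterations, optionally crashing one worker once."""
    checkpoint = {"version": 0, "model": 0}  # last globally saved state
    workers = [dict(checkpoint) for _ in range(NUM_WORKERS)]
    log = []
    for iteration in range(NUM_ITERS):
        if fail_rank is not None and iteration == fail_iter:
            # Crash: the worker loses its in-memory state ...
            workers[fail_rank] = None
            # ... then restarts and reloads the last checkpoint.
            workers[fail_rank] = dict(checkpoint)
            log.append("worker %d recovered at version %d"
                       % (fail_rank, checkpoint["version"]))
        total = allreduce_sum(iteration)
        for state in workers:
            state["model"] += total
            state["version"] += 1
        checkpoint = dict(workers[0])  # assumed: checkpoint after each iteration
    return workers, log


baseline, _ = train()
recovered, messages = train(fail_rank=1, fail_iter=2)
```

The property a recovery test would check is that the run with an injected failure ends in the same state as the run without one, and that the recovery is visible in the output.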