[rabit harden] fix rabit tests #81

Merged on Mar 14, 2019 (20 commits)

@chenqin chenqin commented Mar 6, 2019

Note: this PR will fail tests until the dmlc-core local tracker PR lands: dmlc/dmlc-core#510

This is the first in a series of PRs to harden rabit and make sure its tests run with meaningful coverage. The goal of this PR is to enable the model recovery tests, which simulate a worker failure followed by a resume (catching up with the next allreduce), the expected behavior described in the guide doc.

This PR works alongside the local tracker PR. It passes the latest command into rabit init, removes the duplicated ntrail value overwrite from DMLC_NUM_ATTEMPT, and adds console output when a worker recovers or catches up in allreduce.
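
For readers unfamiliar with the behavior under test, the fail-then-catch-up pattern can be sketched as a small self-contained simulation. This is only an illustration and does not use rabit itself: `CheckpointStore`, `RunWorker`, `RunWithRecovery`, and the `fail_at` parameter are hypothetical names, and the real mechanism involves the tracker and peer workers rather than a local struct.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for checkpointed state (not a rabit type).
struct CheckpointStore {
  int version = 0;  // number of iterations completed and checkpointed
  long model = 0;   // toy "model": running sum of iteration indices
};

// Runs iterations [store->version, max_iter), checkpointing after each one.
// If fail_at >= 0 and that iteration is reached, throws to simulate a worker
// crash before the iteration is checkpointed. Pass -1 for no failure.
inline void RunWorker(CheckpointStore* store, int max_iter, int fail_at) {
  assert(store != nullptr);
  for (int iter = store->version; iter < max_iter; ++iter) {
    if (iter == fail_at) {
      throw std::runtime_error("simulated worker failure");
    }
    store->model += iter;       // the work of this iteration
    store->version = iter + 1;  // checkpoint: iteration is now durable
  }
}

// Starts a worker, and if it dies, restarts it once from the last checkpoint
// so that it catches up with the remaining iterations.
inline long RunWithRecovery(int max_iter, int fail_at) {
  CheckpointStore store;
  try {
    RunWorker(&store, max_iter, fail_at);
  } catch (const std::runtime_error&) {
    RunWorker(&store, max_iter, -1);  // restarted worker resumes, no failure
  }
  return store.model;
}
```

Under these assumptions, a run that fails partway and is restarted should end with the same model as a run that never failed, which is the property a recovery test would check.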

@chenqin changed the title from "enable model_recover_10_10k test" to "[rabit harden] fix rabit recovery tests" Mar 7, 2019
@chenqin changed the title from "[rabit harden] fix rabit recovery tests" to "[rabit harden] fix rabit model recovery tests" Mar 7, 2019
@CodingCat (Member) commented

You could temporarily point dmlc-core in this PR to your private branch to cross-check before we merge either of them.

Chen Qin added 2 commits March 7, 2019 18:08
Resolved review threads: src/allreduce_mock.h (outdated), test/test.mk (outdated), test/model_recover.cc (4 threads, 3 outdated)
@chenqin changed the title from "[rabit harden] fix rabit model recovery tests" to "[rabit harden] fix rabit tests" Mar 9, 2019

chenqin commented Mar 10, 2019

Tests pass locally with gcc-7/g++-7 on macOS 10.14.3.

chenqin commented Mar 10, 2019

^^ @CodingCat @hcho3

chenqin commented Mar 12, 2019

Resolved review threads (all outdated): Makefile, src/allreduce_robust.cc (4 threads)

chenqin commented Mar 13, 2019

XGBoost tests are passing:
https://travis-ci.org/chenqin/xgboost/builds/505580249

chenqin commented Mar 14, 2019

@trivialfis Can you help merge this? It should be a straightforward PR.

@trivialfis (Member) commented

@chenqin No problem. Just one small question: does this compile on Windows? The last time I tried to upgrade rabit in XGBoost, the compilation failed (#80).

chenqin commented Mar 14, 2019

@trivialfis trivialfis merged commit ed06e0c into dmlc:master Mar 14, 2019