[native xgb distributed training] allow failed worker retry #4769

Closed


Contributor

@chenqin chenqin commented Aug 14, 2019

The goal of this PR is to enable native xgb worker retry for both the approx and fast hist tree_method settings; see dmlc/rabit#98.

  • with the rabit_cache=1 setting enabled, users can retry a failed xgb worker without restarting the entire cluster
  • add a rabit_recovery test case to Travis
  • add missing parameters to the checkpoint payload
  • detect older models to stay backward compatible
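
As a rough illustration of the ideas in the list above (resuming a failed worker from its last checkpoint, carrying parameters in the checkpoint payload, and accepting older payloads), here is a minimal, self-contained Python sketch. It is not the rabit or xgboost implementation: the payload layout, `PAYLOAD_VERSION`, `DEFAULT_PARAMS`, and every function name below are hypothetical, and an in-memory dict stands in for whatever storage survives a worker failure.

```python
import pickle

# Hypothetical payload format version; not taken from xgboost or rabit.
PAYLOAD_VERSION = 2
# Assumed defaults applied when an older payload carries no parameters.
DEFAULT_PARAMS = {"tree_method": "approx"}


def save_checkpoint(store, round_no, model, params):
    """Serialize the model together with the parameters needed to resume."""
    store["latest"] = pickle.dumps(
        {"version": PAYLOAD_VERSION, "round": round_no,
         "model": model, "params": params}
    )


def load_checkpoint(store):
    """Return (round, model, params), or None if nothing was saved yet."""
    blob = store.get("latest")
    if blob is None:
        return None
    payload = pickle.loads(blob)
    # Older payloads (assumed to lack "version" and "params") get defaults.
    if payload.get("version", 1) < PAYLOAD_VERSION:
        payload.setdefault("params", dict(DEFAULT_PARAMS))
    return payload["round"], payload["model"], payload["params"]


def run_worker(store, total_rounds, params, fail_at=None):
    """Train for total_rounds, resuming from the last checkpoint if present."""
    restored = load_checkpoint(store)
    if restored is None:
        start, model = 0, []
    else:
        last_round, model, params = restored
        start = last_round + 1
    for round_no in range(start, total_rounds):
        if round_no == fail_at:
            raise RuntimeError(f"simulated worker failure in round {round_no}")
        model = model + [f"tree-{round_no}"]  # stand-in for one boosting round
        save_checkpoint(store, round_no, model, params)
    return model


def train_with_retry(total_rounds, params, fail_at=None, max_retries=3):
    """Restart only the failed worker; the checkpoint store outlives it."""
    store = {}
    for attempt in range(max_retries + 1):
        try:
            # Inject the simulated failure on the first attempt only.
            return run_worker(store, total_rounds, params,
                              fail_at if attempt == 0 else None)
        except RuntimeError:
            continue
    raise RuntimeError("worker did not recover within max_retries")
```

Calling `train_with_retry(5, {"tree_method": "hist"}, fail_at=3)` fails once in round 3, then resumes from the checkpoint written after round 2 instead of starting over from round 0.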

Misc: I will update gitmodules once the dependency PR has landed.

@trivialfis
Member

Is it possible to use CMake instead of Make?

@chenqin
Contributor Author

chenqin commented Aug 15, 2019

> Is it possible to use CMake instead of Make?

It looks like librabit_mock.a does not build correctly with rabit's CMake; xgboost's CMake has separate logic to build librabit.a. Looking into it...

@chenqin
Contributor Author

chenqin commented Aug 24, 2019

> Is it possible to use CMake instead of Make?

@trivialfis
Hm, I switched to CMake and it passes Travis CI. Jenkins is still failing, not sure why.

@trivialfis
Member

@chenqin clang-tidy?

@chenqin
Contributor Author

chenqin commented Aug 29, 2019

Moved to #4808.

@chenqin chenqin deleted the native_dist branch August 29, 2019 05:53
@lock lock bot locked as resolved and limited conversation to collaborators Nov 27, 2019