
[rabit improvement] support rabit worker set/get configs from tracker #94

Closed · wants to merge 10 commits
Conversation

@chenqin (Contributor) commented Jun 13, 2019

Native rabit checkpoint restore was failing in XGBoost.

To let a restarted worker avoid running allreduce before loadcheckpoint, we plan to save those configs in dmlc-tracker when the training job starts for the first time (e.g. the number of columns of the partitioned training data set).

If a worker fails, instead of calling allreduce and breaking the rabit recovery assumption, it can fetch the configs from the tracker. This allows the checkpoint to load correctly and training to resume at the right iteration number. A rough sketch of the intended worker-side flow is below.

If the tracker dies, the training job dies anyway; we might leverage the Spark HDFS checkpoint and recover the entire cluster from there.
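
As a minimal sketch of the idea (not the actual API added by this PR): `SetTrackerConfig` and `GetTrackerConfig` below are hypothetical placeholders for the proposed worker set/get calls, and the `Model` struct is illustrative. The first start agrees on the config and records it on the tracker; a restarted worker reads it back instead of issuing an allreduce before `LoadCheckPoint`.

```cpp
// Sketch only: SetTrackerConfig/GetTrackerConfig are hypothetical placeholders
// for the worker set/get calls proposed here; they are not existing rabit API.
#include <string>
#include <rabit/rabit.h>

// Hypothetical helpers that would round-trip a key/value pair through dmlc-tracker.
void SetTrackerConfig(const std::string &key, const std::string &value);
std::string GetTrackerConfig(const std::string &key);

// Minimal global model carrying the config-dependent state (illustrative).
struct Model : public rabit::Serializable {
  int num_col = 0;
  void Load(rabit::Stream *fi) override { fi->Read(&num_col, sizeof(num_col)); }
  void Save(rabit::Stream *fo) const override { fo->Write(&num_col, sizeof(num_col)); }
};

int main(int argc, char *argv[]) {
  rabit::Init(argc, argv);
  Model model;

  if (rabit::VersionNumber() == 0) {
    // First start: agree on the config once (allreduce is safe here because
    // no checkpoint exists yet) and record the result on the tracker.
    model.num_col = 0;  // columns of this worker's data partition
    rabit::Allreduce<rabit::op::Max>(&model.num_col, 1);
    SetTrackerConfig("num_col", std::to_string(model.num_col));
  } else {
    // Restarted worker: read the agreed value back from the tracker instead of
    // calling allreduce before LoadCheckPoint, which would break the rabit
    // recovery assumption.
    model.num_col = std::stoi(GetTrackerConfig("num_col"));
  }

  int start_iter = rabit::LoadCheckPoint(&model);
  for (int iter = start_iter; iter < 10; ++iter) {
    // ... allreduce-based training step ...
    rabit::CheckPoint(&model);
  }
  rabit::Finalize();
  return 0;
}
```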

More detail here
dmlc/xgboost#4250 (comment)

@chenqin closed this Jun 21, 2019