[rabit_bootstrap_cache] failed xgb worker recover from other workers #4808
Conversation
Running tests locally, for faster experimentation: https://xgboost.readthedocs.io/en/latest/contrib/unit_tests.html
Force-pushed from 9134b63 to df6f715
Some investigation around the checkpoint-restore build failure: there seems to be an issue with the XGB CLI loadcheckpoint, where we lose some training parameters that were initially set at learner creation. This is because we overwrite the learner with the checkpoint payload, and not all settings were saved in it. In essence, we want to keep those parameters outside of the checkpoint payload. Since the restarted worker can determine the proper config for each environment (GPU/CPU/distributed, etc.), we can simply merge those configs back into the learner before starting/resuming training.
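To illustrate the parameter-splitting idea above, a minimal sketch (not the actual patch), assuming the xgboost C++ Learner interface of this era; `Learner::SetParam`, the parameter names, and the resume loop are illustrative and differ slightly across versions:

```cpp
// Sketch: keep environment-derived parameters outside the checkpoint payload
// and merge them back into the learner after rabit::LoadCheckPoint() has
// overwritten it with the checkpoint state.
#include <memory>
#include <string>
#include <utility>
#include <vector>
#include <rabit/rabit.h>
#include <xgboost/data.h>
#include <xgboost/learner.h>

int ResumeFromCheckpoint(
    std::vector<std::shared_ptr<xgboost::DMatrix>> const& cache,
    std::vector<std::pair<std::string, std::string>> const& env_params) {
  std::unique_ptr<xgboost::Learner> learner(xgboost::Learner::Create(cache));
  // Overwrites the learner with the last checkpoint held by rabit;
  // returns the iteration to resume from (0 on a fresh start).
  int version = rabit::LoadCheckPoint(learner.get());
  // Re-apply settings decided from the (possibly restarted) worker's
  // environment that are intentionally not checkpointed,
  // e.g. tree_method, gpu_id, nthread.
  for (auto const& kv : env_params) {
    learner->SetParam(kv.first, kv.second);
  }
  // ... continue the usual UpdateOneIter() / rabit::CheckPoint() loop
  //     starting at `version` ...
  return version;
}
```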
I tested commenting out the extra config in the learner, and the previously failing JVM tests on model save/load now pass. So it looks like the JVM saved models were impacted by the extra configs. @CodingCat FYI. Update: this is actually xgboost4j, not xgboost4j-spark; I can help clean it up in an upcoming JVM-focused PR.
@trivialfis since you were working on organizing trainer parameters, can you help review this change? We allow a failed xgb worker to retry with additional saved_params IF the user opted in to rabit_bootstrap_cache. cc @hcho3 @CodingCat. Update: accuracy is the same with and without recovery.
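For context, a minimal sketch of how a worker could opt in, assuming rabit_bootstrap_cache is accepted as a key=value argument to rabit::Init() (in practice it may also be supplied through the tracker or environment):

```cpp
#include <vector>
#include <rabit/rabit.h>

int main(int argc, char* argv[]) {
  // rabit_bootstrap_cache is the opt-in flag discussed in this PR; without it
  // a failed worker falls back to the existing fail-all recovery path.
  char opt_in[] = "rabit_bootstrap_cache=1";
  std::vector<char*> args(argv, argv + argc);
  args.push_back(opt_in);
  rabit::Init(static_cast<int>(args.size()), args.data());
  // ... bootstrap allreduces and the training loop go here ...
  rabit::Finalize();
  return 0;
}
```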
Yes. I've also been playing with rabit recently. Will review soon. Thanks for mentioning me.
@chenqin Is this currently critical? I wrote the CMake file before; the reason I build rabit inside XGBoost is that it fails to build on Windows. Would it be inconvenient for you if I take some time to open a PR for the rabit CMake build file?
Looks good overall. Please address the comments about SIMD and the TODO item.
Addressed the feedback.
Do we need a secondary reviewer before merging?
Considering I only started doing distributed computing less than a month ago ...
LGTM, however I'm not exactly an expert in distributed training either.
LGTM, but I am not an expert in this part of xgboost and I am not sure I understand what this pull request achieves. More specifically, I do not quite get where you added the recovery from another worker in the code.
@trams Full context can be found in dmlc/rabit#98. The goal is to "implement immutable cache in rabit to help failed worker recover not synced allreduces in bootstrap time." Most of the recovery logic is found in dmlc/rabit#98, and this PR modifies Rabit calls to make use of the new recovery logic.
Force-pushed from b9f5b7d to 3f2306f
I'll add here my understanding of the original PR in rabit; hopefully it helps other maintainers review and understand the changes. @CodingCat and @chenqin can correct any mistakes in my explanation.

After going through the design doc, my understanding of the purpose of dmlc/rabit#98 is to allow single-worker recovery instead of the current fail-all recovery (explained below). The doc also mentions handling cases where workers fail before the first checkpoint, during bootstrap actions such as getting the number of features (done through an allreduce at the beginning of training) or broadcasting the column sampler seed value. It's not 100% clear to me why these are special cases; that section of the doc is hard to grok (and I'm jet lagged).

If my understanding is correct, in the current implementation, when a failure happens all workers are shut down and restarted ("fail-all" in the doc), and learning resumes from the last checkpoint. This involves requesting resources from the scheduler (e.g. YARN) and shuffling all the data again from scratch. Putting my PhD hat on, I'll note that I don't fully agree with the doc's definition of a shuffle, which treats each partition (i.e. worker) as a "shuffle". A "shuffle" (in Spark terms) is a single distribution of the data amongst all workers in the cluster. Each distinct failure currently requires one shuffle, so I'd rather say that F failures require F shuffles, rather than F*W. However, we need the per-worker granularity later, so I'd say that currently each failure requires us to re-send all W partitions.

In any case, when we have massive datasets and hundreds of workers, both of these operations, requesting resources and shuffling the data across the network, can be very costly and block training for extended periods of time, especially in multi-tenant clusters. The proposed solution, then, is not to kill all workers and restart training from the last checkpoint, but rather to do a single-node recovery: when a node fails, only that one is restarted; the rest of the cluster waits for it to be bootstrapped, and then learning continues. The doc goes on to explain how the recovery is handled, but I haven't gone through that part in detail.

Hope this helps explain the purpose of the PR!
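For reviewers less familiar with rabit, a sketch of the existing checkpoint-based pattern referred to above (the current pattern, not the new code in this PR); the Model type and the constants are toy stand-ins. The bootstrap-time collectives run before the first CheckPoint(), which is the window the bootstrap cache targets: a worker restarted in that window has no checkpoint to resume from, while its peers have already moved past those allreduce/broadcast calls.

```cpp
#include <cstdint>
#include <vector>
#include <rabit/rabit.h>

// Toy serializable model used only to show the checkpoint calls.
struct Model : public rabit::Serializable {
  std::vector<float> weights;
  void Load(rabit::Stream* fi) override { fi->Read(&weights); }
  void Save(rabit::Stream* fo) const override { fo->Write(weights); }
};

int main(int argc, char* argv[]) {
  rabit::Init(argc, argv);
  // Bootstrap collectives (examples mentioned in the thread): agree on the
  // global number of features, and broadcast the column sampler seed.
  unsigned num_feature = 42;  // pretend this is the local feature count
  rabit::Allreduce<rabit::op::Max>(&num_feature, 1);
  uint64_t seed = 2019;
  rabit::Broadcast(&seed, sizeof(seed), /*root=*/0);

  Model model;
  int version = rabit::LoadCheckPoint(&model);  // 0 on a fresh start
  for (int iter = version; iter < 10; ++iter) {
    // ... per-iteration allreduces over gradient statistics ...
    rabit::CheckPoint(&model);  // after this, peers can serve state for iter
  }
  rabit::Finalize();
  return 0;
}
```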
Appreciate your help explaining things in such detail! I agree the wording may need more thought. Yes, it is strictly W full reshuffles, assuming every failure causes an entire job retry (before loadcheckpoint) and we have W failures, one at a time, over the job's life cycle. In real life, those datasets may not be saved in HDFS; instead they are generated from various table joins and feature-extraction stages. In this PR, along with techniques such as an external shuffle service and a deterministic partitioning scheme, we are moving towards limiting the impact of a single failure to less than a full shuffle. So yes, the comparison is a lower-bound estimate, where ideally it is also W reshuffles of 1/M of the dataset. The worst case should be the same as the current approach: when the cluster loses track of a critical mass of the dataset, it needs to redo everything from the beginning.
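To make the cost comparison concrete, one illustrative way to write it down (my notation, not taken from the design doc): let D be the total dataset size, W the number of failures over the job's lifetime, and M the number of partitions, so a single worker holds roughly D/M of the data.

```latex
\underbrace{W \cdot D}_{\text{fail-all: each failure reshuffles everything}}
\quad\text{vs.}\quad
\underbrace{W \cdot \tfrac{D}{M}}_{\text{single-worker recovery: only the failed worker's share}}
```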
Can we rerun the JVM tests? They seem flaky. Are we good to merge this change?
Restarted. Will merge once the tests pass.
@chenqin Merged, big thanks!
[copy from xgboost/pull/4769] The goal of this PR is to enable a failed native xgb worker to retry and restore when using the approx tree_method. Details of the underlying implementation can be found in dmlc/rabit#98.
Summary:
Note: per a conversation with @CodingCat, fast histogram in master is still not ready to be supported by this.