
[rabit_bootstrap_cache] failed xgb worker recover from other workers #4808

Merged: 39 commits into dmlc:master on Sep 17, 2019

Conversation

chenqin
Contributor

@chenqin chenqin commented Aug 25, 2019

[copy from xgboost/pull/4769] The goal of this PR is to enable failed native xgb workers to retry and restore when training with the approx tree_method. Details of the underlying implementation can be found in dmlc/rabit#98.

Summary:

  • By enabling the rabit_bootstrap_cache=1 setting, users can retry a failed xgb worker without restarting the entire job
  • Add an xgb_recovery test case to Travis that simulates multiple xgb worker failures with the approx tree_method, with prediction accuracy on par
  • Add the needed cfg_ entries to the native checkpoint payload when the user sets rabit_bootstrap_cache=1
  • Backward compatible with old models and with the case where rabit_bootstrap_cache is disabled or not set.

Note: per conversation with @CodingCat, the fast histogram method in master is not yet supported.
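As a rough illustration of the opt-in path, here is a minimal sketch assuming the standard rabit C++ API (rabit::Init/Broadcast/Finalize) and that rabit_bootstrap_cache can be forwarded to rabit::Init as a key=value argument; the exact wiring inside the xgboost CLI and trackers differs, so treat this as a sketch rather than the PR's code.

```cpp
// Sketch: a worker opting in to the bootstrap cache and performing a
// bootstrap-time sync. The parameter forwarding shown here is an assumption.
#include <rabit/rabit.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main(int argc, char* argv[]) {
  // rabit reads key=value arguments; rabit_bootstrap_cache=1 asks rabit to cache
  // bootstrap-time collective results so a restarted worker can replay them.
  std::vector<char*> args(argv, argv + argc);
  char opt[] = "rabit_bootstrap_cache=1";
  args.push_back(opt);
  rabit::Init(static_cast<int>(args.size()), args.data());

  // Example of a bootstrap action: rank 0 broadcasts a column-sampler seed.
  uint64_t seed = (rabit::GetRank() == 0) ? 2019u : 0u;
  rabit::Broadcast(&seed, sizeof(seed), 0);
  std::printf("[%d/%d] seed = %llu\n", rabit::GetRank(), rabit::GetWorldSize(),
              static_cast<unsigned long long>(seed));

  rabit::Finalize();
  return 0;
}
```

With the cache disabled (the default), a worker restarted before the first checkpoint has no way to recover the results of such bootstrap collectives; with it enabled, the surviving workers can serve them back.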

@hcho3
Collaborator

hcho3 commented Aug 26, 2019

Running tests locally, for faster experimentation: https://xgboost.readthedocs.io/en/latest/contrib/unit_tests.html

@chenqin chenqin force-pushed the rabit_dist branch 3 times, most recently from 9134b63 to df6f715 on August 27, 2019 07:39
@chenqin
Contributor Author

chenqin commented Aug 27, 2019

Some investigation into the build failure caused by checkpoint restore: there seems to be an issue with the XGB CLI LoadCheckPoint where we lose some training parameters that were initially set at learner creation. This happens because we overwrite the learner with the checkpoint payload, in which not all settings were saved.

In essence, we want to keep such parameters outside of the checkpoint payload. Since the restarted worker can determine the proper config for its environment (gpu/cpu/distributed etc.), we can simply merge those configs back into the learner before starting/resuming training.

@trivialfis
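A minimal, self-contained sketch of the idea described above, using toy types rather than the actual xgboost Learner API (ToyLearner and ResolveLocalConfig are hypothetical): environment-specific parameters stay out of the checkpoint payload and are merged back into the learner after restore, since the restarted worker can re-derive them on its own.

```cpp
// Toy illustration only: merge locally derived config back into the learner
// after the checkpoint payload has been restored, before resuming training.
#include <iostream>
#include <map>
#include <string>

struct ToyLearner {
  std::map<std::string, std::string> cfg;
  void SetParam(const std::string& k, const std::string& v) { cfg[k] = v; }
};

// Hypothetical helper: settings derived from this worker's environment
// (gpu/cpu/distributed etc.), not taken from the checkpoint.
std::map<std::string, std::string> ResolveLocalConfig() {
  return {{"tree_method", "approx"}, {"nthread", "4"}};
}

int main() {
  ToyLearner learner;
  // ... assume the checkpoint payload has already been restored into `learner` ...
  for (const auto& kv : ResolveLocalConfig()) {
    learner.SetParam(kv.first, kv.second);  // merge local config back before resuming
  }
  for (const auto& kv : learner.cfg) {
    std::cout << kv.first << "=" << kv.second << "\n";
  }
  return 0;
}
```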

@chenqin
Contributor Author

chenqin commented Aug 29, 2019

I tested commenting out the extra configs in the learner, and the previously failing JVM model save/load tests pass. So it looks like JVM-saved models were affected by the extra configs. @CodingCat FYI
std::set<std::string> saved_configs_ = {};
// {"max_depth", "tree_method", "dsplit",
//  "seed", "silent", "gamma", "min_child_weight"};

Update: this is actually xgboost4j, not xgboost4j-spark. I can help clean it up in an upcoming JVM-focused PR.

@chenqin chenqin changed the title from "[WIP] rebase and pull rabit with latest feature" to "[rabit_bootstrap_cache] failed xgb worker recover from other workers" Aug 30, 2019
@chenqin
Contributor Author

chenqin commented Aug 30, 2019

@trivialfis since you have been working on organizing trainer parameters, can you help review this change? We allow a failed xgb worker to retry with additional saved parameters IF the user opted in via rabit_bootstrap_cache.

cc @hcho3 @CodingCat

Update: accuracy is the same.

Without recovery:
2019-08-30 14:37:43,758 INFO [14:37:43] [11] test-rmse:0.026854
2019-08-30 14:37:44,772 INFO @tracker All nodes finishes job
2019-08-30 14:37:44,772 INFO @tracker 1.1630983352661133 secs between node start and job finish

With recovery:
2019-08-30 01:14:27,988 INFO [01:14:27] [11] test-rmse:0.026854
2019-08-30 01:14:29,012 INFO @tracker All nodes finishes job
2019-08-30 01:14:29,013 INFO @tracker 1.644543170928955 secs between node start and job finish

@trivialfis
Member

Yes. I've also been playing with rabit recently. Will review soon. Thanks for mentioning me.

@chenqin
Contributor Author

chenqin commented Sep 2, 2019

(image: a rabbit; source: https://en.wikipedia.org/wiki/File:Oryctolagus_cuniculus_Rcdo.jpg)

@trivialfis
Member

@chenqin Is this currently critical? I wrote the CMake file before; the reason I build rabit inside XGBoost is that it fails to build on Windows otherwise. Would it be inconvenient for you if I take some time to open a PR for a rabit CMake build file?

Member

@trivialfis trivialfis left a comment

Looks good overall. Please address the comments on SIMD and the TODO item.

Review comments on: src/common/hist_util.cc (2), src/common/quantile.h, src/learner.cc, tests/cli/runxgb.sh (all resolved)
@chenqin
Contributor Author

chenqin commented Sep 12, 2019

Addressed the feedback:

  • Removed the OpenMP directive
  • Explained the TODO; a follow-up will move histogram initialization before LoadCheckPoint.
  • Explained the allreduce in DMatrix::Load; a follow-up will remove the duplicated column size check across the train/eval datasets (see the sketch below).
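For the third point, here is a sketch of what such a column-count sync looks like, assuming the rabit max-allreduce; this is the general shape, not the actual DMatrix::Load code.

```cpp
// Sketch: agree on the global number of columns across workers via a max-allreduce,
// so a shard that happens to miss trailing columns still sees the full width.
#include <rabit/rabit.h>
#include <cstdio>

int main(int argc, char* argv[]) {
  rabit::Init(argc, argv);
  unsigned local_num_col = 100;                 // columns observed in the local shard
  unsigned global_num_col = local_num_col;
  rabit::Allreduce<rabit::op::Max>(&global_num_col, 1);
  std::printf("[%d] local=%u global=%u\n",
              rabit::GetRank(), local_num_col, global_num_col);
  rabit::Finalize();
  return 0;
}
```

Doing this once, rather than separately for the train and eval DMatrix, is what the proposed follow-up would address.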

@chenqin
Contributor Author

chenqin commented Sep 12, 2019

Do we need a secondary reviewer before merging?

@trivialfis
Member

Considering I only started doing distributed computing less than a month ago ...

Collaborator

@hcho3 hcho3 left a comment

LGTM, however I'm not exactly an expert in distributed training either.

@CodingCat @thvasilo @trams

Contributor

@trams trams left a comment

LGTM, but I am not an expert in this part of xgboost and I am not sure I understand what this pull request achieves. More specifically, I do not quite see where recovery from another worker is added in the code.

Review comments on: src/common/hist_util.cc, tests/cli/runxgb.sh (2), tests/travis/setup.sh, .travis.yml (all resolved)
@hcho3
Collaborator

hcho3 commented Sep 12, 2019

@trams Full context can be found in dmlc/rabit#98. The goal is to "implement immutable cache in rabit to help failed worker recover not synced allreduces in bootstrap time." Most of the recovery logic is found in dmlc/rabit#98, and this PR modifies Rabit calls to make use of the new recovery logic.
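For readers less familiar with rabit, the checkpoint/recovery pattern the PR builds on looks roughly like the following. This is a sketch in the style of the rabit tutorial, not the PR's code; the Model type here is a stand-in.

```cpp
// Sketch of rabit's fault-tolerant loop: a restarted worker calls LoadCheckPoint,
// learns which version it last committed, and resumes from there while healthy
// workers keep going. The bootstrap cache extends recovery to collectives that
// happen before the first CheckPoint.
#include <rabit/rabit.h>
#include <cstdio>
#include <vector>

class Model : public rabit::Serializable {
 public:
  std::vector<float> weights;
  void Load(rabit::Stream* fi) override { fi->Read(&weights); }
  void Save(rabit::Stream* fo) const override { fo->Write(weights); }
};

int main(int argc, char* argv[]) {
  rabit::Init(argc, argv);
  Model model;
  int start_iter = rabit::LoadCheckPoint(&model);  // 0 on fresh start, >0 after recovery
  if (start_iter == 0) model.weights.assign(16, 0.0f);

  const int kNumIter = 10;
  for (int iter = start_iter; iter < kNumIter; ++iter) {
    // One round of synced work, e.g. gradient or histogram aggregation.
    rabit::Allreduce<rabit::op::Sum>(model.weights.data(), model.weights.size());
    rabit::CheckPoint(&model);  // commit this version so recovery can resume here
  }
  std::printf("[%d] finished at version %d\n", rabit::GetRank(), rabit::VersionNumber());
  rabit::Finalize();
  return 0;
}
```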

@chenqin chenqin force-pushed the rabit_dist branch 3 times, most recently from b9f5b7d to 3f2306f on September 13, 2019 05:09
@thvasilo
Contributor

thvasilo commented Sep 13, 2019

I'll add here my understanding of the original PR in rabit, hopefully it helps other maintainers with reviewing and understanding the changes. @CodingCat and @chenqin can correct any mistakes in my explanation.

After going through the design doc, my understanding of the purpose of dmlc/rabit#98 is allowing for single-worker recovery vs. the current fail-all recovery (explained below). The doc also mentions handling cases where workers fail before the first checkpoint, during bootstrap actions like getting the number of features (which is done through an allreduce at the beginning of training) or broadcasting the column sampler seed value. It's not 100% clear to me why these are special cases; it's hard to grok the doc for that section (and I'm jet lagged).

If my understanding is correct, in the current implementation, in case a failure happens all workers are shut down and restarted ("fail-all" in the doc), and learning resumes from the last checkpoint. This involves requesting resources from the scheduler (e.g. Yarn) and shuffling all the data again from scratch.
If we have W workers and F failures happen (where a failure is defined as at least one worker failing during a distinct iteration), we would need to request W*(F+1) new instances: W workers to start the job, then W workers again after each failure.
Since all workers, including healthy ones are shut down in case of a failure, we need to transfer the data again to the new workers. In case of W workers and F failures that would mean shuffling data W*F times, according to the doc.

Putting my PhD hat on, I'll note here that I don't fully agree with this definition of a shuffle in the doc, which treats each partition (i.e. worker) as a "shuffle". A "shuffle" (in Spark terms) is a single distribution of the data amongst all workers in the cluster. Each distinct failure currently requires one shuffle, so I'd rather say that F failures require F shuffles, rather than F*W. However we need the per worker granularity later, so I'd say that currently each failure requires us to send W data partitions over the network.

In any case, when we have massive datasets and hundreds of workers, both of these operations, requesting resources and shuffling the data across the network, can be very costly and block training for extended periods of time, especially in multi-tenant clusters.

The proposed solution then is not to kill all workers and start training from the last checkpoint, but rather do a single node recovery: when a node fails, only that one is restarted, the rest of the cluster waits for it to be bootstrapped, and then continues learning.
Compared to the existing approach, a failure would only mean requesting f new instances from the scheduler, where f is the number of failed workers in the current iteration. Accordingly we would only need to send f data partitions over the network (one for each failed worker), which can be highly beneficial when f << W.

The doc continues to explain how the recovery is handled, but I haven't gone through that part in detail. Hope this helps explain the purpose of the PR!
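A worked cost comparison under the doc's accounting, with purely illustrative numbers not taken from the thread (W, F, and f_i as defined above):

```latex
% Illustrative: W = 100 workers, F = 3 distinct failures, one failed worker each (f_i = 1).
\begin{align*}
\text{fail-all: instances requested}      &= W\,(F+1) = 100 \cdot 4 = 400,\\
\text{fail-all: partitions resent}        &= W\,F = 300,\\
\text{single-worker: instances requested} &= W + \sum_{i=1}^{F} f_i = 100 + 3 = 103,\\
\text{single-worker: partitions resent}   &= \sum_{i=1}^{F} f_i = 3.
\end{align*}
```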

@chenqin
Contributor Author

chenqin commented Sep 13, 2019

I appreciate your help explaining things in such a detailed way!

I agree the wording may need more thought. Yes, it is strictly W full reshuffles, assuming every failure causes an entire job retry (before LoadCheckPoint) and we have W failures, one at a time, over the job's life cycle. In real life, those datasets may not be saved in HDFS; instead they are generated from various table-join and feature-extraction stages.

With this PR, along with techniques such as an external shuffle service and a deterministic partitioning scheme, we are moving towards limiting the impact of a single failure to less than a full shuffle. So yes, the comparison is a lower-bound estimate, where ideally there would also be W reshuffles, each on 1/M of the dataset. The worst case should be the same as the current approach: when the cluster loses track of a critical mass of the dataset, it needs to redo everything from the beginning.


@chenqin
Contributor Author

chenqin commented Sep 14, 2019

Can we rerun the JVM tests? They seem flaky.

Are we good to merge this change?

@trivialfis
Member

Restarted. Will merge once the tests pass.

@trivialfis trivialfis merged commit 512f037 into dmlc:master Sep 17, 2019
@trivialfis
Member

@chenqin Merged, big thanks!

@lock lock bot locked as resolved and limited conversation to collaborators Dec 16, 2019