[HOTFIX] distributed training with hist method #4716

sperlingxx · 2019-07-30T02:58:28Z

Problem:

Distributed training with hist method will stop working when synchronizing histograms of different machines.

Debug Stack:

Size of expand_nodes in different workers are not equal, because one node may be a leaf node on one worker, but it is still splitable on other workers.
Results of SplitEvaluation are inconsistent on different workers (machines).
Inconsistent split results will be observed only when nthread > 1.
When I changed OPENMP schedule policy of EvaluateSplitsBatch from schedule(guided) to schedule(dynamic) num_threads(nthread), everything works.

CodingCat · 2019-07-30T04:04:47Z

is it only happening in master branch?

sperlingxx · 2019-07-30T04:11:28Z

is it only happening in master branch?

@CodingCat Yes, I have tested branch release_0.9, everything is all right. I think some modifications in PR #4529 cause the problem.

CodingCat · 2019-07-30T04:13:38Z

is it only happening in master branch?

I think it is related to issue at #4679

in 0.9 branch, the node stats is synced for only once when working on the root and left/right should be calculated from cache (check #4140) so I think everything should be fine

@hcho3 @trivialfis @RAMitchell I believe it is a blocking issue for 1.0?

CodingCat · 2019-07-30T04:14:01Z

is it only happening in master branch?

@CodingCat Yes, I have tested branch release_0.9, everything is all right. I think some modifications in PR #4529 cause the problem.

Bingo! check my previous comment

trivialfis · 2019-07-30T04:57:28Z

I believe it is a blocking issue for 1.0?

Could you please add this to the roadmap as you have better idea for what's happening.

hcho3 · 2019-07-30T07:19:49Z

@CodingCat Yes, I think this is blocking.

@sperlingxx Thanks for the report. FYI, see https://xgboost.readthedocs.io/en/latest/contrib/unit_tests.html for locally running tests

CodingCat · 2019-07-30T15:13:37Z

I have updated 1.0.0 roadmap

sperlingxx · 2019-07-30T15:25:39Z

@hcho3 Thanks for doc link, I will tune unit tests in local.

sperlingxx · 2019-08-03T09:00:28Z

@hcho3 @trivialfis @CodingCat
I've tested codes of master branch with below script:

import xgboost as xgb
import unittest
import numpy as np


class TestOMP(unittest.TestCase):
    def test_omp(self):
        dpath = 'demo/data/'
        dtrain = xgb.DMatrix(dpath + 'agaricus.txt.train')
        dtest = xgb.DMatrix(dpath + 'agaricus.txt.test')

        param = {'booster': 'gbtree',
                 'objective': 'binary:logistic',
                 'grow_policy': 'depthwise',
                 'tree_method': 'hist',
                 'eval_metric': 'error',
                 'max_depth': 5,
                 'min_child_weight': 0}

        watchlist = [(dtest, 'eval'), (dtrain, 'train')]
        num_round = 5

        def run_trial():
            res = {}
            bst = xgb.train(param, dtrain, num_round, watchlist, evals_result=res)
            metrics = [res['train']['error'][-1], res['eval']['error'][-1]]
            preds = bst.predict(dtest)
            return metrics, preds

        def consist_test(title, n):
            auc, pred = run_trial()
            for i in range(n-1):
                auc2, pred2 = run_trial()
                try:
                    assert auc == auc2
                    assert np.array_equal(pred, pred2)
                except Exception as e:
                    print('-------test %s failed, num_trial: %d-------' % (title, i))
                    raise e
                auc, pred = auc2, pred2
            return auc, pred

        print('test approx ...')
        param['tree_method'] = 'approx'

        param['nthread'] = 1
        auc_1, pred_1 = consist_test('approx_thread_1', 100)

        param['nthread'] = 2
        auc_2, pred_2 = consist_test('approx_thread_2', 100)

        param['nthread'] = 3
        auc_3, pred_3 = consist_test('approx_thread_3', 100)

        assert auc_1 == auc_2 == auc_3
        assert np.array_equal(auc_1, auc_2)
        assert np.array_equal(auc_1, auc_3)

        print('test hist ...')
        param['tree_method'] = 'hist'

        param['nthread'] = 1
        auc_1, pred_1 = consist_test('hist_thread_1', 100)

        param['nthread'] = 2
        auc_2, pred_2 = consist_test('hist_thread_2', 100)

        param['nthread'] = 3
        auc_3, pred_3 = consist_test('hist_thread_3', 100)

        assert auc_1 == auc_2 == auc_3
        assert np.array_equal(auc_1, auc_2)
        assert np.array_equal(auc_1, auc_3)

approx method passed all the tests, but hist method always failed in consistent tests when thread num > 1. (If we reduce trial num, sometimes test of hist method will be passed).

After I changed all OPENMP scheduling policys in updater_quantile_hist.cc to static (from guided , consistent tests of hist method can be passed.

There exists three call of OPENMP parallel for in updater_quantile_hist.cc :

xgboost/src/tree/updater_quantile_hist.cc

Line 559 in cb9a80c

#pragma omp parallel for schedule(guided)

xgboost/src/tree/updater_quantile_hist.cc

Line 859 in cb9a80c

#pragma omp parallel for schedule(guided)

xgboost/src/tree/updater_quantile_hist.cc

Line 1082 in cb9a80c

#pragma omp parallel for schedule(guided)

trivialfis · 2019-08-03T19:35:19Z

@hcho3 Do you have time to investigate impact from previous optimization PR? I have a feeling that we might need to revert some problematic parts of it.

hcho3 · 2019-08-06T06:57:57Z

@trivialfis I'll probably have to take time to investigate. I'll let you know when I do so. My org has interest in fixing this problem as well.

hcho3

Let's merge this for now. I would like to re-visit the previous optimization PR within the next 2 weeks and address #4679

cc @trivialfis @sperlingxx

sperlingxx added 3 commits July 30, 2019 10:44

add parallel test for hist.EvalualiteSplit

9127d2a

update test_openmp.py

d09451d

update test_openmp.py

967f808

update test_openmp.py

f34d43b

sperlingxx added 2 commits July 30, 2019 12:25

update test_openmp.py

010fe08

update test_openmp.py

0f1ca2e

sperlingxx added 6 commits July 30, 2019 13:23

fix OMP schedule policy

2ac45d2

fix clang-tidy

cab0d7d

add logging: total_num_bins

884233e

fix

e7b72fa

fix

805c657

test

df7e9b8

This was referenced Jul 30, 2019

[Roadmap] XGBoost 1.0.0 Roadmap #4680

Closed

[BLOCKING] Per-node sync slows down distributed training with 'hist' #4679

Closed

replace guided OPENMP policy with static in updater_quantile_hist.cc

41fc3d3

hcho3 approved these changes Aug 12, 2019

View reviewed changes

sperlingxx changed the title ~~[WIP][HOTFIX] distributed training with hist method~~ [HOTFIX] distributed training with hist method Aug 13, 2019

hcho3 merged commit ef9af33 into dmlc:master Aug 13, 2019

hcho3 mentioned this pull request Aug 13, 2019

Fix unstable results in hist method #4767

Closed

CodingCat mentioned this pull request Sep 4, 2019

Optimizations of distributed 'hist' mode for CPUs #4824

Closed

sperlingxx mentioned this pull request Sep 11, 2019

[blocking] fix parallel eval_split of hist updater #4851

Merged

lock bot locked as resolved and limited conversation to collaborators Nov 11, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[HOTFIX] distributed training with hist method #4716

[HOTFIX] distributed training with hist method #4716

sperlingxx commented Jul 30, 2019 •

edited

Loading

CodingCat commented Jul 30, 2019 •

edited

Loading

sperlingxx commented Jul 30, 2019

CodingCat commented Jul 30, 2019

CodingCat commented Jul 30, 2019

trivialfis commented Jul 30, 2019

hcho3 commented Jul 30, 2019

CodingCat commented Jul 30, 2019

sperlingxx commented Jul 30, 2019

sperlingxx commented Aug 3, 2019

trivialfis commented Aug 3, 2019

hcho3 commented Aug 6, 2019

hcho3 left a comment

[HOTFIX] distributed training with hist method #4716

[HOTFIX] distributed training with hist method #4716

Conversation

sperlingxx commented Jul 30, 2019 • edited Loading

CodingCat commented Jul 30, 2019 • edited Loading

sperlingxx commented Jul 30, 2019

CodingCat commented Jul 30, 2019

CodingCat commented Jul 30, 2019

trivialfis commented Jul 30, 2019

hcho3 commented Jul 30, 2019

CodingCat commented Jul 30, 2019

sperlingxx commented Jul 30, 2019

sperlingxx commented Aug 3, 2019

trivialfis commented Aug 3, 2019

hcho3 commented Aug 6, 2019

hcho3 left a comment

Choose a reason for hiding this comment

sperlingxx commented Jul 30, 2019 •

edited

Loading

CodingCat commented Jul 30, 2019 •

edited

Loading