[HOTFIX] distributed training with hist method #4716
Conversation
Is it only happening in the master branch?
@CodingCat Yes, I have tested branch
I think it is related to the issue at #4679 in the 0.9 branch. The node stats are synced only once, when working on the root, and the left/right children should be calculated from cache (check #4140), so I think everything should be fine. @hcho3 @trivialfis @RAMitchell I believe this is a blocking issue for 1.0?
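The cache-based calculation mentioned above can be sketched as follows. This is a hypothetical simplification in Python, not the actual xgboost C++ code; the names `GradStats` and `right_child_stats` are illustrative assumptions. The idea is that once the parent node's gradient statistics have been allreduced across workers, a sibling's statistics can be derived by subtraction instead of a second sync.

```python
# Hypothetical sketch: deriving sibling statistics from the synced parent.
from dataclasses import dataclass


@dataclass
class GradStats:
    # Summed gradient and hessian over the rows assigned to a node.
    sum_grad: float
    sum_hess: float


def right_child_stats(parent: GradStats, left: GradStats) -> GradStats:
    # The parent's stats are synced (allreduced) once; the left child's stats
    # are computed locally from the row partition, so the right child needs no
    # extra synchronization: right = parent - left.
    return GradStats(parent.sum_grad - left.sum_grad,
                     parent.sum_hess - left.sum_hess)


parent = GradStats(sum_grad=10.0, sum_hess=8.0)
left = GradStats(sum_grad=4.0, sum_hess=3.0)
right = right_child_stats(parent, left)
print(right)  # GradStats(sum_grad=6.0, sum_hess=5.0)
```

As long as every worker sees identical parent and left-child statistics, this subtraction keeps all workers consistent without additional communication.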
Bingo! Check my previous comment.
Could you please add this to the roadmap, as you have a better idea of what's happening?
@CodingCat Yes, I think this is blocking. @sperlingxx Thanks for the report. FYI, see https://xgboost.readthedocs.io/en/latest/contrib/unit_tests.html for instructions on running the tests locally.
I have updated the 1.0.0 roadmap.
@hcho3 Thanks for the doc link; I will run the unit tests locally.
@hcho3 @trivialfis @CodingCat Here is the consistency test I am running (the final assertions now compare the prediction arrays `pred_1`/`pred_2`/`pred_3`, which the original version computed but never checked):

```python
import unittest

import numpy as np
import xgboost as xgb


class TestOMP(unittest.TestCase):
    def test_omp(self):
        dpath = 'demo/data/'
        dtrain = xgb.DMatrix(dpath + 'agaricus.txt.train')
        dtest = xgb.DMatrix(dpath + 'agaricus.txt.test')

        param = {'booster': 'gbtree',
                 'objective': 'binary:logistic',
                 'grow_policy': 'depthwise',
                 'tree_method': 'hist',
                 'eval_metric': 'error',
                 'max_depth': 5,
                 'min_child_weight': 0}
        watchlist = [(dtest, 'eval'), (dtrain, 'train')]
        num_round = 5

        def run_trial():
            res = {}
            bst = xgb.train(param, dtrain, num_round, watchlist, evals_result=res)
            metrics = [res['train']['error'][-1], res['eval']['error'][-1]]
            preds = bst.predict(dtest)
            return metrics, preds

        def consist_test(title, n):
            # Train n times with identical inputs and assert that the error
            # metrics and predictions never change between runs.
            auc, pred = run_trial()
            for i in range(n - 1):
                auc2, pred2 = run_trial()
                try:
                    assert auc == auc2
                    assert np.array_equal(pred, pred2)
                except Exception as e:
                    print('-------test %s failed, num_trial: %d-------' % (title, i))
                    raise e
                auc, pred = auc2, pred2
            return auc, pred

        print('test approx ...')
        param['tree_method'] = 'approx'
        param['nthread'] = 1
        auc_1, pred_1 = consist_test('approx_thread_1', 100)
        param['nthread'] = 2
        auc_2, pred_2 = consist_test('approx_thread_2', 100)
        param['nthread'] = 3
        auc_3, pred_3 = consist_test('approx_thread_3', 100)
        assert auc_1 == auc_2 == auc_3
        assert np.array_equal(pred_1, pred_2)
        assert np.array_equal(pred_1, pred_3)

        print('test hist ...')
        param['tree_method'] = 'hist'
        param['nthread'] = 1
        auc_1, pred_1 = consist_test('hist_thread_1', 100)
        param['nthread'] = 2
        auc_2, pred_2 = consist_test('hist_thread_2', 100)
        param['nthread'] = 3
        auc_3, pred_3 = consist_test('hist_thread_3', 100)
        assert auc_1 == auc_2 == auc_3
        assert np.array_equal(pred_1, pred_2)
        assert np.array_equal(pred_1, pred_3)
```
There are three calls to OpenMP parallel for in updater_quantile_hist.cc:

- xgboost/src/tree/updater_quantile_hist.cc, line 559 in cb9a80c
- xgboost/src/tree/updater_quantile_hist.cc, line 859 in cb9a80c
- xgboost/src/tree/updater_quantile_hist.cc, line 1082 in cb9a80c

After I changed all of the OpenMP scheduling policies in updater_quantile_hist.cc to `schedule(dynamic) num_threads(nthread)`, the test passes.
@hcho3 Do you have time to investigate the impact of the previous optimization PR? I have a feeling that we might need to revert some of its problematic parts.
@trivialfis I'll have to take some time to investigate. I'll let you know when I do. My org is interested in fixing this problem as well.
Let's merge this for now. I would like to revisit the previous optimization PR within the next 2 weeks and address #4679.
Problem:

Debug Stack:

- `expand_nodes` in different workers are not equal, because one node may be a leaf node on one worker but still splittable on other workers.
- `SplitEvaluation` results are inconsistent on different workers (machines) when `nthread > 1`.
- After changing `EvaluateSplitsBatch` from `schedule(guided)` to `schedule(dynamic) num_threads(nthread)`, everything works.
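The sensitivity to the OpenMP schedule is consistent with floating-point non-associativity: if the schedule changes how partial sums (e.g. gradient statistics or split gains) are grouped across threads, the rounded results can differ, and different workers can then pick different splits. A minimal Python illustration of this effect, entirely hypothetical and unrelated to the actual xgboost code paths:

```python
# Floating-point addition is not associative: regrouping the same terms,
# as a different OpenMP schedule effectively does, can change the result.
vals = [1e16, 1.0, -1e16, 1.0]

# Sequential left-to-right summation: the first 1.0 is absorbed into 1e16
# (it is below 1e16's rounding granularity), so only the second 1.0 survives.
left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]

# A different grouping of the same four terms: the large values cancel
# first, so both 1.0 terms survive.
regrouped = (vals[0] + vals[2]) + (vals[1] + vals[3])

print(left_to_right)  # 1.0
print(regrouped)      # 2.0
```

Pinning the schedule (or the number of threads) fixes the grouping and hence the rounding, which matches the observation that `schedule(dynamic) num_threads(nthread)` makes the runs consistent.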