Training works fine without early stopping.
But when the early stopping callback is enabled, it seems to stop one of the workers early, which causes the error.
So far, DaskLGBMRegressor triggers this issue every time.
I also tried changing make_regression to make_classification and lgb.DaskLGBMRegressor to lgb.DaskLGBMClassifier.
With the classifier the issue is still reproducible, but it does not trigger on every run.
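To illustrate the hypothesis above (one worker stopping before the others), here is a minimal sketch of a patience-based stopping rule applied to two made-up validation curves. It is not LightGBM's implementation; the function name `stopping_round`, the worker names, and the scores are all hypothetical. It only shows how workers that score different slices of the validation data could decide to stop at different rounds, which would leave the remaining workers waiting on a peer that has already exited.

```python
from typing import List, Optional


def stopping_round(scores: List[float], patience: int) -> Optional[int]:
    """Return the 0-based round at which a patience rule stops, or None.

    Lower scores are better (for example l2). Training stops at the first
    round that is `patience` or more rounds past the best round so far.
    """
    best = float("inf")
    best_round = -1
    for rnd, score in enumerate(scores):
        if score < best:
            best, best_round = score, rnd
        elif rnd - best_round >= patience:
            return rnd
    return None


# Hypothetical per-worker validation losses (assumption: each worker
# evaluates only its own slice of the eval set, so the curves differ).
worker_scores = {
    "worker-a": [1.00, 0.90, 0.85, 0.86, 0.87, 0.88, 0.89],
    "worker-b": [1.00, 0.92, 0.88, 0.84, 0.81, 0.79, 0.78],
}
stops = {name: stopping_round(s, patience=3) for name, s in worker_scores.items()}
print(stops)  # {'worker-a': 5, 'worker-b': None}
```

In this toy run worker-a would stop at round 5 while worker-b would keep training, which is the kind of mismatch that could plausibly surface as a reset connection on the worker still running.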
Reproducible example
# start Dask cluster like this
dask-ssh 192.168.222.{235,236,237} --scheduler 192.168.222.236
import dask.array as da
import lightgbm as lgb
from sklearn.datasets import make_regression
from distributed import Client, wait

client = Client(address="tcp://192.168.222.236:8786")

# starting with clean workers
client.restart()

EARLY_STOP_ROUND = 20
NUM_ITERATION = 1000
LEARNING_RATE = 0.01

# adding callbacks
callbacks = []
eval_result = {}
record_evaluation_callback = lgb.record_evaluation(eval_result)
callbacks.append(record_evaluation_callback)
log_evaluation_callback = lgb.log_evaluation()
callbacks.append(log_evaluation_callback)
early_stopping_callback = lgb.early_stopping(EARLY_STOP_ROUND)
callbacks.append(early_stopping_callback)

# creating sample regression data
X_np, y_np = make_regression(n_samples=1000, n_features=10)
row_chunks = (100, 100, 100, 100, 100, 100, 100, 100, 100, 100)
X = da.from_array(X_np, chunks=(row_chunks, (10,)))
y = da.from_array(y_np, chunks=(row_chunks))
X_test_np, y_test_np = make_regression(n_samples=300, n_features=10)
test_row_chunks = (100, 100, 100)
X_test = da.from_array(X_test_np, chunks=(test_row_chunks, (10,)))
y_test = da.from_array(y_test_np, chunks=(test_row_chunks))

# persist() + wait() + rebalance() to get an even spread of the data across workers
X = X.persist()
y = y.persist()
X_test = client.persist(X_test)
y_test = client.persist(y_test)
_ = wait([X, y, X_test, y_test])
client.rebalance()

# training and get socket recv error code 104
model = lgb.DaskLGBMRegressor(num_iterations=NUM_ITERATION, learning_rate=LEARNING_RATE).fit(
    X, y, eval_set=[(X_test, y_test)], eval_names=['test'], callbacks=callbacks)
Finding random open ports for workers
Traceback (most recent call last):
  File "lightgbm_reproduce_socket_error.py", line 49, in <module>
    model = lgb.DaskLGBMRegressor(num_iterations=NUM_ITERATION, learning_rate=LEARNING_RATE).fit(
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/dask.py", line 1406, in fit
    self._lgb_dask_fit(
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/dask.py", line 1082, in _lgb_dask_fit
    model = _train(
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/dask.py", line 818, in _train
    results = client.gather(futures_classifiers)
  File "/home/lidawei/.local/lib/python3.8/site-packages/distributed/client.py", line 2361, in gather
    return self.sync(
  File "/home/lidawei/.local/lib/python3.8/site-packages/distributed/utils.py", line 351, in sync
    return sync(
  File "/home/lidawei/.local/lib/python3.8/site-packages/distributed/utils.py", line 418, in sync
    raise exc.with_traceback(tb)
  File "/home/lidawei/.local/lib/python3.8/site-packages/distributed/utils.py", line 391, in f
    result = yield future
  File "/home/lidawei/.local/lib/python3.8/site-packages/tornado/gen.py", line 767, in run
    value = future.result()
  File "/home/lidawei/.local/lib/python3.8/site-packages/distributed/client.py", line 2224, in _gather
    raise exception.with_traceback(traceback)
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/dask.py", line 313, in _train_part
    model.fit(
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/sklearn.py", line 1049, in fit
    super().fit(
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/sklearn.py", line 842, in fit
    self._Booster = train(
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/engine.py", line 276, in train
    booster.update(fobj=fobj)
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/basic.py", line 3658, in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/basic.py", line 242, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Socket recv error, Connection reset by peer (code: 104)
Hey @daviddwlee84, thanks for using LightGBM and for the excellent report. The Dask interface doesn't support early stopping yet; that's being tracked in #3712.
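Until early stopping is supported, one possible workaround is to drop the `lgb.early_stopping` callback, train for the full number of iterations, and pick the best round afterwards from the recorded evaluation history. The sketch below is plain Python and does not call LightGBM. The helper name `best_iteration` and the sample history are hypothetical; the only assumption about LightGBM is that the history has the nested `{eval_name: {metric_name: [scores...]}}` shape that `lgb.record_evaluation` fills in.

```python
from typing import Dict, List, Tuple


def best_iteration(
    evals_result: Dict[str, Dict[str, List[float]]],
    eval_name: str,
    metric: str,
    higher_is_better: bool = False,
) -> Tuple[int, float]:
    """Return (1-based number of iterations, score) for the best recorded round."""
    history = evals_result[eval_name][metric]
    if not history:
        raise ValueError("no recorded scores for this eval set and metric")
    pick = max if higher_is_better else min
    # On ties, min()/max() keep the earliest round, i.e. the smaller model.
    idx = pick(range(len(history)), key=history.__getitem__)
    return idx + 1, history[idx]


# Hypothetical history, shaped like the dict passed to lgb.record_evaluation.
recorded = {"test": {"l2": [4.0, 2.5, 1.9, 2.1, 2.3]}}
n_trees, score = best_iteration(recorded, "test", "l2")
print(n_trees, score)  # 3 1.9
```

The resulting count could then be used at prediction time, for example `model.to_local().predict(X_test_np, num_iteration=n_trees)`. Whether the history recorded during Dask training reflects the whole validation set or only one worker's share is not something this sketch settles, so treat the chosen round as approximate.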
Environment info
LightGBM version or commit hash: 4.1.0
Command(s) you used to install LightGBM
Dependencies (pip list)
Additional Comments
Client-side logs
Server-side logs
Some issues and pull requests might be related