[tests][dask] fix argument names in custom eval function in Dask test #4833
Conversation
Thanks for looking into this! But I don't think the proposed change is correct.

When using, for example, `DaskLGBMRegressor`, `lightgbm.dask` will create one task running `LGBMRegressor.fit()` on each Dask worker.
LightGBM/python-package/lightgbm/dask.py, lines 765 to 767 in 2f5d898:

```python
futures_classifiers = [
    client.submit(
        _train_part,
```

LightGBM/python-package/lightgbm/dask.py, line 164 in 2f5d898:

```python
def _train_part(
```

LightGBM/python-package/lightgbm/dask.py, lines 319 to 329 in 2f5d898:

```python
model.fit(
    data,
    label,
    sample_weight=weight,
    init_score=init_score,
    eval_set=local_eval_set,
    eval_sample_weight=local_eval_sample_weight,
    eval_init_score=local_eval_init_score,
    eval_names=local_eval_names,
    **kwargs
)
```
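(As a miniature analogy of this dispatch pattern, assuming nothing about lightgbm's internals: each submitted task sees only a plain local partition of the data. The helper names below are illustrative only.)

```python
# Miniature analogy of the per-worker dispatch (NOT lightgbm's actual
# code): each submitted task receives only its plain local partition,
# just as _train_part runs fit() on local data inside each Dask worker.
from concurrent.futures import ThreadPoolExecutor

def train_part(partition):
    # inside the task, the data is an ordinary local list
    assert isinstance(partition, list)
    return sum(partition) / len(partition)

def run():
    data = list(range(100))
    partitions = [data[:50], data[50:]]  # one partition per "worker"
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(train_part, p) for p in partitions]
        return [f.result() for f in futures]

print(run())  # [24.5, 74.5]
```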
At that moment, that evaluation function will be called only on local data in each worker process, which will be numpy arrays, not Dask arrays.

Here's some sample code proving that. Note that it includes an `assert` which will fail if `y_true` is anything other than a numpy array.
Sample code:

```python
import dask.array as da
import numpy as np
from dask.distributed import Client, LocalCluster, wait
from lightgbm.dask import DaskLGBMRegressor
from sklearn.datasets import make_regression

# set up Dask cluster
n_workers = 2
cluster = LocalCluster(n_workers=n_workers)
client = Client(cluster)
client.wait_for_workers(n_workers)
print(f"View the dashboard: {cluster.dashboard_link}")

# create training data and an additional eval set
def _make_dataset(n_samples):
    X, y = make_regression(n_samples=n_samples)
    dX = da.from_array(X, chunks=(1000, X.shape[1]))
    dy = da.from_array(y, chunks=1000)
    return dX, dy

# training data
dX, dy = _make_dataset(10_000)

# eval data
dX_e, dy_e = _make_dataset(2_000)

def _custom_metric(y_true, y_pred):
    metric_name = "custom_metric"
    is_higher_better = False
    metric_value = 0.708
    assert str(type(y_true)) == "<class 'numpy.ndarray'>"
    return metric_name, metric_value, is_higher_better

dask_reg = DaskLGBMRegressor(
    client=client,
    objective="regression_l2",
    tree_learner="data"
)
dask_reg.fit(
    X=dX,
    y=dy,
    eval_set=[
        (dX, dy),
        (dX_e, dy_e)
    ],
    eval_metric=["rmse", _custom_metric]
)

# confirm that custom metric function was actually called
# (should be an array of value 0.708)
print(dask_reg.evals_result_["valid_0"]["custom_metric"])
```
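For reference, a standalone sketch of the eval-metric contract the sample above relies on: a callable `(y_true, y_pred)` returning `(name, value, is_higher_better)`. The function name and the hand-rolled RMSE here are illustrative only, not code from this PR; in lightgbm, `y_true` and `y_pred` arrive as numpy arrays.

```python
# Hedged sketch of the custom eval metric contract:
# callable (y_true, y_pred) -> (metric_name, metric_value, is_higher_better).
# "custom_rmse" is a hypothetical name; the RMSE is hand-rolled for
# illustration and works on any element-wise iterable inputs.
import math

def custom_rmse(y_true, y_pred):
    # element-wise squared errors, then the root of their mean
    sq_err = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return "custom_rmse", math.sqrt(sum(sq_err) / len(sq_err)), False

name, value, higher_better = custom_rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(name, round(value, 4), higher_better)
```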
Ah, thank you very much! I was confused by the metrics in our tests: LightGBM/tests/python_package_test/test_dask.py, lines 222 to 236 in 2f5d898. They all take arguments named …

ah yep, you're right! That is confusing. I'd support removing the …

Great! I reverted the docstring changes and pushed a fix for the argument names.
RTD builds for visual checks: https://lightgbm.readthedocs.io/en/note_shape/.