
[tests][dask] fix argument names in custom eval function in Dask test #4833

Merged (4 commits, Dec 2, 2021)

Conversation

StrikerRUS
Collaborator

RTD builds for visual checks: https://lightgbm.readthedocs.io/en/note_shape/.

Collaborator

@jameslamb jameslamb left a comment


Thanks for looking into this! But I don't think the proposed change is correct.

When using, for example, DaskLGBMRegressor, lightgbm.dask will create one task running LGBMRegressor.fit() on each Dask worker.

# excerpt from lightgbm.dask: one _train_part task is submitted per worker
futures_classifiers = [
    client.submit(
        _train_part,
        ...

# and inside _train_part, the local sklearn-style model's fit() is called:
def _train_part(
    ...
    model.fit(
        data,
        label,
        sample_weight=weight,
        init_score=init_score,
        eval_set=local_eval_set,
        eval_sample_weight=local_eval_sample_weight,
        eval_init_score=local_eval_init_score,
        eval_names=local_eval_names,
        **kwargs
    )

At that moment, that evaluation function will be called on only local data in each worker process, which will be numpy arrays, not Dask arrays.
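As a side illustration (generic Dask, not LightGBM code): any function Dask executes inside a task sees concrete local chunks rather than Dask collections. A minimal sketch using `dask.array.map_blocks`, which exercises the same pattern that `client.submit(_train_part, ...)` relies on:

```python
import dask.array as da
import numpy as np

darr = da.from_array(np.arange(10), chunks=5)

def _show_chunk_type(block):
    # inside each Dask task, the chunk is a plain numpy array
    assert isinstance(block, np.ndarray)
    return block

result = darr.map_blocks(_show_chunk_type).compute()
print(type(result).__name__)  # ndarray
```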

Here's some sample code demonstrating that. Note that it includes an assert that will fail if y_true is anything other than a numpy array.

import dask.array as da
import numpy as np
from dask.distributed import Client, LocalCluster, wait
from lightgbm.dask import DaskLGBMRegressor
from sklearn.datasets import make_regression

# set up Dask cluster
n_workers = 2
cluster = LocalCluster(n_workers=n_workers)
client = Client(cluster)
client.wait_for_workers(n_workers)
print(f"View the dashboard: {cluster.dashboard_link}")

# create training data and an additional eval set
def _make_dataset(n_samples):
    X, y = make_regression(n_samples=n_samples)
    dX = da.from_array(X, chunks=(1000, X.shape[1]))
    dy = da.from_array(y, chunks=1000)
    return dX, dy

# training data
dX, dy = _make_dataset(10_000)

# eval data
dX_e, dy_e = _make_dataset(2_000)

def _custom_metric(y_true, y_pred):
    metric_name = "custom_metric"
    is_higher_better = False
    metric_value = 0.708
    assert isinstance(y_true, np.ndarray)
    return metric_name, metric_value, is_higher_better

dask_reg = DaskLGBMRegressor(
    client=client,
    objective="regression_l2",
    tree_learner="data"
)
dask_reg.fit(
    X=dX,
    y=dy,
    eval_set=[
        (dX, dy),
        (dX_e, dy_e)
    ],
    eval_metric=["rmse", _custom_metric]
)

# confirm that custom metric function was actually called
# (should be a list filled with the value 0.708, one entry per boosting round)
print(dask_reg.evals_result_["valid_0"]["custom_metric"])

@StrikerRUS
Collaborator Author

Ah, thank you very much! I was confused by the metrics in our tests:

def _r2_score(dy_true, dy_pred):
    numerator = ((dy_true - dy_pred) ** 2).sum(axis=0, dtype=np.float64)
    denominator = ((dy_true - dy_true.mean(axis=0)) ** 2).sum(axis=0, dtype=np.float64)
    return (1 - numerator / denominator).compute()


def _accuracy_score(dy_true, dy_pred):
    return da.average(dy_true == dy_pred).compute()


def _constant_metric(dy_true, dy_pred):
    metric_name = 'constant_metric'
    value = 0.708
    is_higher_better = False
    return metric_name, value, is_higher_better

They all take arguments named dy_true and dy_pred, where d stands for "Dask", I guess.
I think it's better to remove the d prefix from the arguments of the _constant_metric() custom eval function. WDYT?

@jameslamb
Collaborator

where d stands for "Dask", I guess.

ah yep, you're right! That is confusing. I'd support removing the d from the arguments in _constant_metric().

@StrikerRUS StrikerRUS changed the title [docs][dask] fix argument types in custom eval function for Dask estimators [tests][dask] fix argument names in custom eval function in Dask test Nov 30, 2021
@StrikerRUS
Collaborator Author

@jameslamb

ah yep, you're right! That is confusing. I'd support removing the d from the arguments in _constant_metric().

Great! I reverted docstring changes and pushed a fix for argument names.

@StrikerRUS StrikerRUS requested a review from jameslamb November 30, 2021 23:42
@StrikerRUS StrikerRUS merged commit b31d5a4 into master Dec 2, 2021
@StrikerRUS StrikerRUS deleted the note_shape branch December 2, 2021 01:57
@StrikerRUS StrikerRUS mentioned this pull request Jan 6, 2022
@jameslamb jameslamb mentioned this pull request Oct 7, 2022
@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023