[dask] lightgbm + dask generates crazy predictions #4695

Closed
szhang5947 opened this issue Oct 19, 2021 · 4 comments

szhang5947 commented Oct 19, 2021

Description

We are experimenting with distributed learning using lightgbm + dask, and noticed that the predictions can be obviously wrong (crazy numbers). For the reproducible example below, the r_squared of the in-sample prediction using lightgbm + dask is -1.5e+56. By comparison, training on a local machine without distributed learning produces a reasonable prediction with r_squared ~ 0.01.

Reproducible example

Given that a dask client has already been created,

import numpy as np
import pandas as pd
import dask.dataframe
import lightgbm as lgb
import sklearn.metrics

# Prepare data
train_data = pd.read_csv("train_data.csv")
train_data_dask = dask.dataframe.from_pandas(train_data, npartitions=2)
X_train = train_data_dask[["x0", "x1"]].to_dask_array(lengths=True)
y_train = train_data_dask["y"].to_dask_array(lengths=True)
w_train = train_data_dask["weight"].to_dask_array(lengths=True)

# Model training and in-sample prediction
model = lgb.DaskLGBMRegressor(
    client=client,
    max_depth=8,
    learning_rate=0.01,
    tree_learner="data",
    n_estimators=500,
)

model.fit(X_train, y_train, sample_weight=w_train)
y_pred = model.predict(X=X_train)

# Measure the result using r_squared
y_local = y_train.compute()
w_local = w_train.compute()
y_pred_local = y_pred.compute()

r_squared = sklearn.metrics.r2_score(y_local, y_pred_local, sample_weight=w_local)
print(f"r_squared: {r_squared}")

Output:

r_squared: -1.4827682862501415e+56
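
For reference, the local (non-distributed) baseline mentioned above was not included in the report. A minimal sketch of it, assuming the same hyperparameters as the Dask run, could look like this:

import pandas as pd
import lightgbm as lgb
from sklearn.metrics import r2_score

# Plain single-process LGBMRegressor on the same data,
# assuming the same hyperparameters as the Dask run above.
train_data = pd.read_csv("train_data.csv")
X = train_data[["x0", "x1"]]
y = train_data["y"]
w = train_data["weight"]

local_model = lgb.LGBMRegressor(max_depth=8, learning_rate=0.01, n_estimators=500)
local_model.fit(X, y, sample_weight=w)

print(f"r_squared: {r2_score(y, local_model.predict(X), sample_weight=w)}")
# expected to be roughly 0.01, per the description above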

Environment info

dask version: 2021.05.1
lightgbm version: 3.2.1

Dataset

train_data.csv

jmoralez (Collaborator) commented:

Could you upgrade lightgbm to 3.3.0 and try it? I believe this has been fixed.

jameslamb changed the title from "lightgbm + dask generates crazy predictions" to "[dask] lightgbm + dask generates crazy predictions" on Oct 25, 2021
jameslamb (Collaborator) commented Oct 25, 2021

Thanks very much for your interest in LightGBM, and for all the effort you put into this very clear write-up!

Could you upgrade lightgbm to 3.3.0 and try it? I believe this has been fixed.

To add more detail, I believe @jmoralez is referencing the fix from #4185 (which addressed #4026).

Prior to lightgbm 3.3.0, LightGBM's distributed training contained a bug: if partitions of the data had non-overlapping distributions of a feature, the global histograms were incorrect after sync-up. Mistakes in that sync-up can distort the boosting process quite significantly, which is how lightgbm can end up producing predictions that seem unrelated to the training data.

It looks like the feature x0 in the provided training data has that characteristic.
Its maximum in one partition is less than its minimum in another partition.

print(dask.array.nanmin(X_train.partitions[0], 0).compute())
print(dask.array.nanmax(X_train.partitions[0], 0).compute())

# [-0.40703553 -0.22746199]
# [ 1.45464902 21.05879944]

print(dask.array.nanmin(X_train.partitions[1], 0).compute())
print(dask.array.nanmax(X_train.partitions[1], 0).compute())

# [ 1.45535379 -0.22753906]
# [25.88824717 25.7793674 ]
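
For datasets with more features, a small helper along these lines can flag every column whose per-partition ranges fail to overlap. This is just an illustrative sketch (the helper name find_non_overlapping_features is made up here), and it assumes row-wise partitions with a single column chunk, as produced by to_dask_array(lengths=True) above:

import numpy as np
import dask.array as da

def find_non_overlapping_features(X):
    # Per-partition min and max of every feature.
    n_parts = X.numblocks[0]
    mins = np.stack([da.nanmin(X.partitions[i], axis=0).compute() for i in range(n_parts)])
    maxs = np.stack([da.nanmax(X.partitions[i], axis=0).compute() for i in range(n_parts)])
    # A feature's ranges overlap across all partitions only if the largest
    # per-partition minimum is <= the smallest per-partition maximum.
    return [col for col in range(mins.shape[1]) if mins[:, col].max() > maxs[:, col].min()]

# With X_train from the snippet above, this should flag column 0 (feature "x0").
print(find_non_overlapping_features(X_train))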

I tested tonight, and lightgbm 3.3.0 does seem to fix the issue reported here.

Testing code:

Note that the code sample in the issue description cannot be copied and run directly, since it does not include code defining client.
Below, I used a distributed.LocalCluster and a distributed.Client for it.

I also changed the pd.read_csv() call to read directly from the CSV attached to this issue, instead of relying on a local file.

import numpy as np
import pandas as pd
import dask.array
import dask.dataframe
import lightgbm as lgb
import sklearn.metrics

from distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2)
client = Client(cluster)

# Prepare data
train_data = pd.read_csv("https://github.com/microsoft/LightGBM/files/7369639/train_data.csv")
train_data_dask = dask.dataframe.from_pandas(train_data, npartitions=2)
X_train = train_data_dask[["x0", "x1"]].to_dask_array(lengths=True)
y_train = train_data_dask["y"].to_dask_array(lengths=True)
w_train = train_data_dask["weight"].to_dask_array(lengths=True)

# Model training and in-sample prediction
model = lgb.DaskLGBMRegressor(
    client=client,
    max_depth=8,
    learning_rate=0.01,
    tree_learner="data",
    n_estimators=500,
)

model.fit(X_train, y_train, sample_weight=w_train)
y_pred = model.predict(X=X_train)

# Measure the result using r_squared
y_local = y_train.compute()
w_local = w_train.compute()
y_pred_local = y_pred.compute()

r_squared = sklearn.metrics.r2_score(y_local, y_pred_local, sample_weight=w_local)
print(f"r_squared: {r_squared}")

# --- check if any features have non-overlapping distributions across partitions --- #
print(dask.array.nanmin(X_train.partitions[0], 0).compute())
print(dask.array.nanmax(X_train.partitions[0], 0).compute())

# [-0.40703553 -0.22746199]
# [ 1.45464902 21.05879944]

print(dask.array.nanmin(X_train.partitions[1], 0).compute())
print(dask.array.nanmax(X_train.partitions[1], 0).compute())

# [ 1.45535379 -0.22753906]
# [25.88824717 25.7793674 ]

Output:

r_squared: 0.06957108377003818

Given this investigation, I feel confident closing this issue.

Either of the following approaches should be sufficient to avoid this bug:

  • upgrade to lightgbm 3.3.0
  • randomly shuffle the training data so that the feature distributions across partitions are more similar (see the sketch below)
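
A minimal sketch of the shuffling approach (illustrative only; the random_state value is arbitrary):

import pandas as pd
import dask.dataframe as dd

train_data = pd.read_csv("train_data.csv")

# Shuffle the rows before partitioning so that each Dask partition
# sees a similar range of every feature.
train_data = train_data.sample(frac=1.0, random_state=42).reset_index(drop=True)

train_data_dask = dd.from_pandas(train_data, npartitions=2)
X_train = train_data_dask[["x0", "x1"]].to_dask_array(lengths=True)
y_train = train_data_dask["y"].to_dask_array(lengths=True)
w_train = train_data_dask["weight"].to_dask_array(lengths=True)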

szhang5947 (Author) commented:

Thanks for the explanation. That makes a lot of sense.
I tried 3.3.0 and it did fix the crazy prediction issue.

github-actions (bot) commented:

This issue has been automatically locked because there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues and include a reference to this one.

github-actions bot locked as resolved and limited conversation to collaborators on Aug 23, 2023