[dask] lightgbm + dask generates crazy predictions #4695

Closed
szhang5947 opened this issue Oct 19, 2021 · 4 comments

szhang5947 commented Oct 19, 2021

Description

We are experimenting with distributed learning using lightgbm + dask, and noticed that the predictions can be obviously wrong (crazy numbers). For the reproducible example below, the r_squared of the in-sample prediction using lightgbm + dask is -1.5e+56. By comparison, training on a local machine without distributed learning produces a reasonable prediction with r_squared ~ 0.01.

Reproducible example

Given that a dask client has already been created,

import numpy as np
import pandas as pd
import dask.dataframe
import lightgbm as lgb
import sklearn.metrics

# Prepare data
train_data = pd.read_csv("train_data.csv")
train_data_dask = dask.dataframe.from_pandas(train_data, npartitions=2)
X_train = train_data_dask[["x0", "x1"]].to_dask_array(lengths=True)
y_train = train_data_dask["y"].to_dask_array(lengths=True)
w_train = train_data_dask["weight"].to_dask_array(lengths=True)

# Model training and in-sample prediction
model = lgb.DaskLGBMRegressor(
    client=client,
    max_depth=8,
    learning_rate=0.01,
    tree_learner="data",
    n_estimators=500,
)

model.fit(X_train, y_train, sample_weight=w_train)
y_pred = model.predict(X=X_train)

# Measure the result using r_squared
y_local = y_train.compute()
w_local = w_train.compute()
y_pred_local = y_pred.compute()

r_squared = sklearn.metrics.r2_score(y_local, y_pred_local, sample_weight=w_local)
print(f"r_squared: {r_squared}")

Output:

r_squared: -1.4827682862501415e+56
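
For reference, the local (non-distributed) baseline mentioned above was not included in the report. A minimal sketch of it, assuming the same hyperparameters as the Dask run, could look like this:

import pandas as pd
import lightgbm as lgb
from sklearn.metrics import r2_score

# Plain single-process LGBMRegressor on the same data,
# assuming the same hyperparameters as the Dask run above.
train_data = pd.read_csv("train_data.csv")
X = train_data[["x0", "x1"]]
y = train_data["y"]
w = train_data["weight"]

local_model = lgb.LGBMRegressor(max_depth=8, learning_rate=0.01, n_estimators=500)
local_model.fit(X, y, sample_weight=w)

print(f"r_squared: {r2_score(y, local_model.predict(X), sample_weight=w)}")
# expected to be roughly 0.01, per the description above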

Environment info

dask version: 2021.05.1
lightgbm version: 3.2.1

Dataset

train_data.csv

jmoralez (Collaborator) commented:

Could you upgrade lightgbm to 3.3.0 and try it? I believe this has been fixed.

jameslamb changed the title from "lightgbm + dask generates crazy predictions" to "[dask] lightgbm + dask generates crazy predictions" on Oct 25, 2021
jameslamb (Collaborator) commented Oct 25, 2021

Thanks very much for your interest in LightGBM, and for all the effort you put into this very clear write-up!

Could you upgrade lightgbm to 3.3.0 and try it? I believe this has been fixed.

To add more detail, I believe @jmoralez is referencing the fix from #4185 (which addressed #4026).

Prior to lightgbm 3.3.0, LightGBM's distributed training contained a bug: if partitions of the data had non-overlapping distributions of a feature, the global histograms were incorrect after sync-up. Mistakes in that sync-up can distort the boosting process quite significantly, which is how lightgbm can end up producing predictions that seem unrelated to the training data.

It looks like the feature x0 in the provided training data has that characteristic.
Its maximum in one partition is less than its minimum in another partition.

print(dask.array.nanmin(X_train.partitions[0], 0).compute())
print(dask.array.nanmax(X_train.partitions[0], 0).compute())

# [-0.40703553 -0.22746199]
# [ 1.45464902 21.05879944]

print(dask.array.nanmin(X_train.partitions[1], 0).compute())
print(dask.array.nanmax(X_train.partitions[1], 0).compute())

# [ 1.45535379 -0.22753906]
# [25.88824717 25.7793674 ]
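
For datasets with more features, a small helper along these lines can flag every column whose per-partition ranges fail to overlap. This is just an illustrative sketch (the helper name find_non_overlapping_features is made up here), and it assumes row-wise partitions with a single column chunk, as produced by to_dask_array(lengths=True) above:

import numpy as np
import dask.array as da

def find_non_overlapping_features(X):
    # Per-partition min and max of every feature.
    n_parts = X.numblocks[0]
    mins = np.stack([da.nanmin(X.partitions[i], axis=0).compute() for i in range(n_parts)])
    maxs = np.stack([da.nanmax(X.partitions[i], axis=0).compute() for i in range(n_parts)])
    # A feature's ranges overlap across all partitions only if the largest
    # per-partition minimum is <= the smallest per-partition maximum.
    return [col for col in range(mins.shape[1]) if mins[:, col].max() > maxs[:, col].min()]

# With X_train from the snippet above, this should flag column 0 (feature "x0").
print(find_non_overlapping_features(X_train))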

I tested tonight, and lightgbm 3.3.0 does seem to fix the issue reported here.

Testing code:

Note that the code sample in the issue description cannot be copied and run directly, since it does not include code defining client.
Below, I used a distributed.LocalCluster and a distributed.Client for it.

I also changed the pd.read_csv() call to read directly from the CSV attached to this issue, instead of relying on a local file.

import numpy as np
import pandas as pd
import dask.array
import dask.dataframe
import lightgbm as lgb
import sklearn.metrics

from distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2)
client = Client(cluster)

# Prepare data
train_data = pd.read_csv("https://github.com/microsoft/LightGBM/files/7369639/train_data.csv")
train_data_dask = dask.dataframe.from_pandas(train_data, npartitions=2)
X_train = train_data_dask[["x0", "x1"]].to_dask_array(lengths=True)
y_train = train_data_dask["y"].to_dask_array(lengths=True)
w_train = train_data_dask["weight"].to_dask_array(lengths=True)

# Model training and in-sample prediction
model = lgb.DaskLGBMRegressor(
    client=client,
    max_depth=8,
    learning_rate=0.01,
    tree_learner="data",
    n_estimators=500,
)

model.fit(X_train, y_train, sample_weight=w_train)
y_pred = model.predict(X=X_train)

# Measure the result using r_squared
y_local = y_train.compute()
w_local = w_train.compute()
y_pred_local = y_pred.compute()

r_squared = sklearn.metrics.r2_score(y_local, y_pred_local, sample_weight=w_local)
print(f"r_squared: {r_squared}")

# --- check if any features have non-overlapping distributions across partitions --- #
print(dask.array.nanmin(X_train.partitions[0], 0).compute())
print(dask.array.nanmax(X_train.partitions[0], 0).compute())

# [-0.40703553 -0.22746199]
# [ 1.45464902 21.05879944]

print(dask.array.nanmin(X_train.partitions[1], 0).compute())
print(dask.array.nanmax(X_train.partitions[1], 0).compute())

# [ 1.45535379 -0.22753906]
# [25.88824717 25.7793674 ]

Output:

r_squared: 0.06957108377003818

Given this investigation, I feel confident closing this issue.

Either of the following approaches should be sufficient to avoid this bug:

  • upgrade to lightgbm 3.3.0
  • randomly shuffle the training data so that the feature distributions across partitions are more similar (see the sketch below)
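
A minimal sketch of the shuffling approach (illustrative only; the random_state value is arbitrary):

import pandas as pd
import dask.dataframe as dd

train_data = pd.read_csv("train_data.csv")

# Shuffle the rows before partitioning so that each Dask partition
# sees a similar range of every feature.
train_data = train_data.sample(frac=1.0, random_state=42).reset_index(drop=True)

train_data_dask = dd.from_pandas(train_data, npartitions=2)
X_train = train_data_dask[["x0", "x1"]].to_dask_array(lengths=True)
y_train = train_data_dask["y"].to_dask_array(lengths=True)
w_train = train_data_dask["weight"].to_dask_array(lengths=True)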

szhang5947 (Author) commented:

Thanks for the explanation. That makes a lot of sense.
I tried 3.3.0 and it did fix the crazy prediction issue.

github-actions (bot) commented:

This issue has been automatically locked because there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues and include a reference to this one.

github-actions bot locked as resolved and limited conversation to collaborators on Aug 23, 2023