Performance Gap between Non-Distributed and Distributed Lightgbm when Data is Sorted on Label #5025
Thanks for raising this @rudra0713. I can reproduce the issue and I actually sometimes get an error. I used the following:

```python
import dask.dataframe as dd
import lightgbm as lgb
import numpy as np
import pandas as pd
from dask.distributed import Client
from sklearn.metrics import accuracy_score

if __name__ == '__main__':
    results = {}
    client = Client()
    n_workers = len(client.scheduler_info()['workers'])
    train_data_len = 10000
    rng = np.random.RandomState(0)
    X = pd.DataFrame(rng.rand(train_data_len, 4), columns=list('ABCD'))
    for order in ('sorted', 'scrambled'):
        y = (rng.rand(train_data_len) < 0.5).astype('int')
        if order == 'sorted':
            y = np.sort(y)

        # Non-distributed training
        reg = lgb.LGBMClassifier(verbosity=-1).fit(X, y)
        pred = reg.predict(X)
        results[f'acc_{order}'] = accuracy_score(y, pred)

        # Distributed training on the same data
        df = X.copy()
        df['y'] = y
        ddf = dd.from_pandas(df, npartitions=n_workers)
        dX, dy = ddf.drop(columns='y'), ddf['y']
        dreg = lgb.DaskLGBMClassifier().fit(dX, dy)
        dpred = dreg.predict(dX).compute()
        results[f'dacc_{order}'] = accuracy_score(dy.compute(), dpred)

    print(results)
```

And when I try to predict with the distributed model trained on the sorted label I sometimes get:
One thing that immediately seems odd is that I see:
when there should actually be 4,970 positive. Versions:
Thanks for raising this, and for the tight reproducible example @jmoralez! In my opinion, in cases where the data are pre-partitioned (like when using the Dask interface), I'd support changes (at the C++ level, not in the Python package) to raise a more informative error when the distributions of the target have no overlap (for regression), or when a partition does not have sufficient data about each of the target classes (for classification). For cases where you've set `pre_partition`, the relevant handling is here:

LightGBM/src/io/dataset_loader.cpp, lines 542 to 576 in 83a41da
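A user-side version of that check (a hypothetical helper, not part of LightGBM's API) could look like this sketch:

```python
import numpy as np

def check_partition_labels(label_partitions, n_classes):
    """Return indices of partitions missing at least one of the n_classes labels.

    Hypothetical user-side guard; LightGBM does not expose this check itself.
    """
    return [
        i for i, labels in enumerate(label_partitions)
        if len(np.unique(labels)) < n_classes
    ]

# Fully sorted binary labels (5,030 zeros then 4,970 ones, matching the
# positive count mentioned above) split into 4 contiguous partitions:
y = np.concatenate([np.zeros(5030, dtype=int), np.ones(4970, dtype=int)])
parts = np.array_split(y, 4)
print(check_partition_labels(parts, n_classes=2))  # → [0, 1, 3]
```

Only the partition that spans the 0/1 boundary contains both classes; the other three would each train on a single label.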
Thanks a lot to both of you. @jmoralez, I have also seen that exception. @jameslamb, I have not experimented with `pre_partition=False` before. Currently, I use the following line to create as many partitions as the number of workers:

```python
ddf = dd.from_pandas(df, npartitions=n_workers)
```

I wonder if I use
I just want to be sure you understand: setting
Before asking about that... Now, regarding
Correct, it's expected behavior. @shiyu1994 or @guolinke please correct me if I'm not right about that.
Never use this parameter with... You might use... See #3835 (comment) for more details. And since you are just trying to learn about distributed LightGBM, I recommend that you read all of the discussion in #3835.
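A practical workaround independent of any LightGBM parameter is to shuffle the rows once before creating the Dask partitions, so that every partition contains both classes. A minimal sketch with pandas/numpy (the final `from_pandas` call is shown as a comment, as in the reproducible example above):

```python
import numpy as np
import pandas as pd

# Fully sorted binary labels, as in the issue.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(10000, 4), columns=list('ABCD'))
df['y'] = np.sort((rng.rand(10000) < 0.5).astype(int))

# Shuffle the rows once before partitioning, so every contiguous chunk
# (and therefore every Dask partition) sees both classes.
shuffled = df.sample(frac=1.0, random_state=0).reset_index(drop=True)

for chunk in np.array_split(shuffled['y'].to_numpy(), 4):
    assert len(np.unique(chunk)) == 2  # both labels present in each chunk

# ddf = dd.from_pandas(shuffled, npartitions=n_workers)  # then train as before
```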
Thanks for the clarification @jameslamb.
@rudra0713 Thanks for using LightGBM, and for the reproducible example. I think it is necessary to support uniform bin boundaries across different processes in distributed training. But I don't think that is the root cause of the significant gap in your example: the features are generated purely at random, so the distribution of feature values across processes should be similar. This should be investigated further.
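To illustrate the uniform-bin-boundaries point, here is a simplified quantile-binning sketch (not LightGBM's actual binning code): boundaries computed independently on different data shards generally come out slightly different, which is why consistent boundaries across processes matter.

```python
import numpy as np

rng = np.random.RandomState(0)
feature = rng.exponential(size=10000)  # a skewed feature column

# Simplified stand-in for histogram binning (NOT LightGBM's real code):
# quantile-style boundaries computed independently on two disjoint halves.
q = np.linspace(0, 1, 6)[1:-1]  # 4 interior boundaries for 5 bins
b_half1 = np.quantile(feature[:5000], q)
b_half2 = np.quantile(feature[5000:], q)

# The two sets of boundaries are similar but not identical, so a value
# near a boundary could land in different bins on different processes.
print(np.round(b_half1, 3))
print(np.round(b_half2, 3))
```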
Hi, I have a binary classification dataset where the labels are sorted (I know it's against standard ML practice to have data sorted, but the question is in the spirit of understanding distributed LightGBM better). When I trained a non-distributed LightGBM and a distributed LightGBM on this dataset, I observed a large gap in accuracy when testing on the same dataset (0.68 vs 0.5). I checked the data partitions for the distributed LightGBM: since the labels are fully sorted, almost all of the partitions have only one label. However, when I shuffle the dataset, performance is quite similar between the two models.
If this is not the expected behavior, I can share reproducible code. But if this is the expected behavior, how would distributed LightGBM deal with highly imbalanced datasets? For example, with a dataset of 10k rows where 9k rows have label 0 and only 1k rows have label 1, it is possible that many partitions will end up with only one of the labels.
The following is a snippet of how I am creating the data:
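A minimal sketch consistent with the description above (random features, fully sorted binary labels; hypothetical, as the original snippet is not shown here):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n_rows = 10000

# Random features, with binary labels sorted so all 0s precede all 1s.
X = pd.DataFrame(rng.rand(n_rows, 4), columns=list('ABCD'))
y = np.sort((rng.rand(n_rows) < 0.5).astype(int))

df = X.copy()
df['y'] = y
# Splitting this frame into contiguous partitions leaves most
# partitions with a single label value.
```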