
Feature Importance changing drastically with shuffling of data in LightGBM binary classifier. #5887

Closed
sahilkgit opened this issue May 12, 2023 · 7 comments

@sahilkgit

Description:

Model feature importance changes drastically after I shuffle the training data.

How I observed this behaviour:

  1. Processed the data (applied some transformations).
  2. Dumped this data into Parquet files on the machine (400,000 rows per file, e.g. 000.parquet, 001.parquet).
  3. Trained a model, m1, on the data prepared in step 1.
  4. Read the data dumped in step 2 back from the machine (using Python's glob module).
  5. The row order of the data read back is different.
  6. Trained a model, m2, on this data.
  7. Compared the feature importance of m1 and m2.

Environment details: Python 3.8, pandas 1.2.4, NumPy 1.19.2, lightgbm 3.2.1, Ubuntu.
Hyperparameters used:
n_estimators=541, num_leaves=592, colsample_bytree=0.52, min_data_in_leaf=50, min_split_gain=0.00005, bagging_fraction=0.978, lambda_l1=0.31, lambda_l2=0.4, cat_l2=0.18, max_cat_threshold=225, cat_smooth=120, max_depth=21, min_data_per_group=100, learning_rate=0.0911, min_child_weight=0.00029, metric=["binary_logloss"], boosting_type="gbdt", random_state=42, n_jobs=24, verbose=-1, objective="binary", boost_from_average=True, min_data_in_bin=80, max_bin=100, bagging_freq=3, feature_fraction_bynode=0.278107895754091, bin_construct_sample_cnt=0.752595265801746 * train_data.shape[0]
Feature Importance - m1
[screenshot of m1's feature importance]

Feature Importance - m2
[screenshot of m2's feature importance]
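
For reference, roughly how the two importance tables above were compared (a sketch, not my exact code; m1 and m2 stand for the two trained LGBMClassifier models from steps 3 and 6):

```python
import pandas as pd

# m1 and m2 are the two trained LGBMClassifier models described above (placeholders here)
comparison = pd.DataFrame({
    "feature": m1.booster_.feature_name(),
    "importance_m1": m1.booster_.feature_importance(importance_type="split"),
    "importance_m2": m2.booster_.feature_importance(importance_type="split"),
}).sort_values("importance_m1", ascending=False)

print(comparison.head(20))
```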

@jameslamb
Collaborator

Thanks for using LightGBM.

1. The parameters you're using introduce the possibility of randomness.

  • colsample_bytree=0.52
  • bagging_fraction=0.978
  • feature_fraction_bynode=0.278

These all lead LightGBM to randomly sample from the rows and columns during training.

Setting random_state=42 means that LightGBM will select the same row indices on successive training runs, but if you're loading data in a way that those indices point to different actual samples, then the set of samples considered by LightGBM will be different between training runs.

Suggestion: Either set these to 1.0, or accept that shuffling the data can result in different models.
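
For example, disabling the subsampling through the scikit-learn interface would look roughly like this (a sketch, not something I've run against your data):

```python
import lightgbm as lgb

# every iteration sees all rows and all columns, so row order no longer
# affects which samples get drawn
model = lgb.LGBMClassifier(
    colsample_bytree=1.0,         # use all columns for each tree
    bagging_fraction=1.0,         # use all rows (alias of subsample)
    feature_fraction_bynode=1.0,  # use all columns at each node
    random_state=42,
)
```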

2. Because you're using bin_construct_sample_cnt < #data, the bin boundaries could be changing

In addition, I believe bin_construct_sample_cnt refers to the number of rows from the beginning of the data that will be sampled. (although I couldn't find that in code ... @guolinke can you confirm that?)

If, after shuffling, the distribution of any of the features over the first bin_construct_sample_cnt rows is very different from the distribution in the original dataset you trained on, that could result in different bin boundaries being drawn and therefore a different trained model.

Suggestion: Either set bin_construct_sample_cnt to the number of rows in the entire training dataset, or accept that shuffling the data can result in different models.
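
Through the scikit-learn interface that would look roughly like this (train_data refers to the DataFrame from your report; a sketch, not tested):

```python
import lightgbm as lgb

model = lgb.LGBMClassifier(
    # sample every row when constructing the feature histograms,
    # so bin boundaries do not depend on which rows happen to come first
    bin_construct_sample_cnt=train_data.shape[0],
    random_state=42,
)
```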

3. You need to supply some other parameters to get exactly-identical models between training runs on the same data

I recommend adding the following to your configuration.

# turn off multi-threading for some operations
deterministic = true

# always use the same multithreading strategy for bin construction
force_row_wise = true

Suggestion: Add these settings to eliminate some sources of randomness, if you're willing to accept slower training.
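
In the scikit-learn interface, those settings can be passed alongside your existing parameters, for example (a sketch):

```python
import lightgbm as lgb

model = lgb.LGBMClassifier(
    deterministic=True,   # turn off multi-threading for some operations
    force_row_wise=True,  # always use the same strategy for bin construction
    random_state=42,
)
```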

4. Other suggestions

I noticed you're on LightGBM v3.2.1. If it's possible, please upgrade to the latest version (v3.3.5), as it contains some bugfixes which might be affecting the behavior you're seeing, for example #4450 and #4234.

If you need additional help, please try to reduce this report to a reproducible example (working code + data we could use to replicate the issue).
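
Something in roughly this shape would be enough for us to run (sklearn's make_classification here is just a stand-in for data that reproduces what you're seeing):

```python
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_classification

# stand-in data; a real report would use (a sample of) the actual training data
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

def importances(X, y):
    model = lgb.LGBMClassifier(
        n_estimators=100,
        colsample_bytree=0.52,
        bin_construct_sample_cnt=50_000,  # deliberately < number of rows
        random_state=42,
    )
    model.fit(X, y)
    return model.booster_.feature_importance(importance_type="split")

# same data, original order vs. shuffled order
idx = np.random.default_rng(0).permutation(len(X))
print(np.abs(importances(X, y) - importances(X[idx], y[idx])).sum())
```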

@sahilkgit
Author

@jameslamb Thanks for the reply!

I also ran the same experiment, in the same environment, with the set of hyperparameters below.
The feature importance did not vary when using these.

Hyperparameters:
n_estimators=475, num_leaves=320, colsample_bytree="0.33", min_data_in_leaf=90, min_split_gain="0.00005", bagging_fraction=0.992, lambda_l1=0.11, lambda_l2=1.01, cat_l2=0.42, max_cat_threshold=300, cat_smooth=120, max_depth=22, min_data_per_group=90, learning_rate=0.1101, min_child_weight=0.20797, metric=["binary_logloss"], boosting_type="gbdt", random_state=42, n_jobs=24, verbose=-1, objective="binary", boost_from_average=True, min_data_in_bin=140, max_bin=225, bagging_freq=3, bin_construct_sample_cnt=0.7874

@jameslamb
Collaborator

I don't understand your response @sahilkgit . Do you still need help?

@sahilkgit
Author

Yes, @jameslamb, as you mentioned earlier:

My argument is based on your first response, where you suggested setting colsample_bytree and bagging_fraction to 1.0 (and doing the same with bin_construct_sample_cnt), because these might lead to different models when the data is shuffled.
However, even though I am not following those suggestions and am using the other values listed in my second response, I get similar models (with almost identical feature importance) when I rerun the same experiment described in my first response.

So I am not clear: what causes completely different models (completely different feature importance) with the first set of hyperparameters, but not with the second set?

@jameslamb
Collaborator

> even though I am not following those suggestions and am using the other values listed in my second response, I get similar models

It isn't guaranteed that you'll get a different model if you don't follow every one of my suggestions from #5887 (comment). The impact of those different settings is dependent on the size and distribution of your training data.

Without a reproducible example (code + data that exactly demonstrates the behavior you're seeing), there's not much else we can do to help here. By providing only parameters and your subjective judgment of models' feature importance being "similar" or not, you are asking for a significant amount of guessing by myself and others trying to help.

@github-actions

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 20, 2023