
Feature Importance changing drastically with shuffling of data in LightGBM binary classifier. #5887

Closed
sahilkgit opened this issue May 12, 2023 · 7 comments

@sahilkgit

Description:

Model feature importance changes drastically after I shuffle the training data.

How I observed this behaviour:

  1. Processed the data (applied some transformations).
  2. Dumped this data into Parquet files on the machine (400,000 rows per file, e.g. 000.parquet, 001.parquet).
  3. Trained a model, m1, on the data prepared in step 1.
  4. Read the data dumped in step 2 back from the machine (using Python's glob module).
  5. The row order of the data read back is different.
  6. Trained a model, m2, on this data.
  7. Compared the feature importance of m1 and m2.

Environment details: Python 3.8, pandas 1.2.4, NumPy 1.19.2, lightgbm 3.2.1, Ubuntu.
Hyperparameters used:
n_estimators=541, num_leaves=592, colsample_bytree=0.52, min_data_in_leaf=50, min_split_gain=0.00005, bagging_fraction=0.978, lambda_l1=0.31, lambda_l2=0.4, cat_l2=0.18, max_cat_threshold=225, cat_smooth=120, max_depth=21, min_data_per_group=100, learning_rate=0.0911, min_child_weight=0.00029, metric=["binary_logloss"], boosting_type="gbdt", random_state=42, n_jobs=24, verbose=-1, objective="binary", boost_from_average=True, min_data_in_bin=80, max_bin=100, bagging_freq=3, feature_fraction_bynode=0.278107895754091, bin_construct_sample_cnt=0.752595265801746 * train_data.shape[0]
Feature Importance - m1
[screenshot of m1's feature importance]

Feature Importance - m2
[screenshot of m2's feature importance]
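
For reference, roughly how the two importance tables above were compared (a sketch, not my exact code; m1 and m2 stand for the two trained LGBMClassifier models from steps 3 and 6):

```python
import pandas as pd

# m1 and m2 are the two trained LGBMClassifier models described above (placeholders here)
comparison = pd.DataFrame({
    "feature": m1.booster_.feature_name(),
    "importance_m1": m1.booster_.feature_importance(importance_type="split"),
    "importance_m2": m2.booster_.feature_importance(importance_type="split"),
}).sort_values("importance_m1", ascending=False)

print(comparison.head(20))
```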

@jameslamb
Collaborator

Thanks for using LightGBM.

1. The parameters you're using introduce the possibility of randomness.

  • colsample_bytree=0.52
  • bagging_fraction=0.978
  • feature_fraction_bynode=0.278

These all lead LightGBM to randomly sample from the rows and columns during training.

Setting random_state=42 means that LightGBM will select the same row indices on successive training runs, but if you're loading data in a way that those indices point to different actual samples, then the set of samples considered by LightGBM will be different between training runs.

Suggestion: Either set these to 1.0, or accept that shuffling the data can result in different models.
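
For example, disabling the subsampling through the scikit-learn interface would look roughly like this (a sketch, not something I've run against your data):

```python
import lightgbm as lgb

# every iteration sees all rows and all columns, so row order no longer
# affects which samples get drawn
model = lgb.LGBMClassifier(
    colsample_bytree=1.0,         # use all columns for each tree
    bagging_fraction=1.0,         # use all rows (alias of subsample)
    feature_fraction_bynode=1.0,  # use all columns at each node
    random_state=42,
)
```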

2. Because you're using bin_construct_sample_cnt < #data, the bin boundaries could be changing

In addition, I believe bin_construct_sample_cnt refers to the number of rows from the beginning of the data that will be sampled. (although I couldn't find that in code ... @guolinke can you confirm that?)

If, after shuffling, the distribution of any of the features over the first bin_construct_sample_cnt rows is very different from the distribution in the original dataset you trained on, that could result in different bin boundaries being drawn and therefore a different trained model.

Suggestion: Either set bin_construct_sample_cnt to the number of rows in the entire training dataset, or accept that shuffling the data can result in different models.
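
Through the scikit-learn interface that would look roughly like this (train_data refers to the DataFrame from your report; a sketch, not tested):

```python
import lightgbm as lgb

model = lgb.LGBMClassifier(
    # sample every row when constructing the feature histograms,
    # so bin boundaries do not depend on which rows happen to come first
    bin_construct_sample_cnt=train_data.shape[0],
    random_state=42,
)
```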

3. You need to supply some other parameters to get exactly-identical models between training runs on the same data

I recommend adding the following to your configuration.

# turn off multi-threading for some operations
deterministic = true

# always use the same multithreading strategy for bin construction
force_row_wise = true

Suggestion: Add these settings to eliminate some sources of randomness, if you're willing to accept slower training.
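
In the scikit-learn interface, those settings can be passed alongside your existing parameters, for example (a sketch):

```python
import lightgbm as lgb

model = lgb.LGBMClassifier(
    deterministic=True,   # turn off multi-threading for some operations
    force_row_wise=True,  # always use the same strategy for bin construction
    random_state=42,
)
```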

4. Other suggestions

I noticed you're on LightGBM v3.2.1. If it's possible, please upgrade to the latest version (v3.3.5), as it contains some bugfixes which might be affecting the behavior you're seeing, for example #4450 and #4234.

If you need additional help, please try to reduce this report to a reproducible example (working code + data we could use to replicate the issue).
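
Something in roughly this shape would be enough for us to run (sklearn's make_classification here is just a stand-in for data that reproduces what you're seeing):

```python
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_classification

# stand-in data; a real report would use (a sample of) the actual training data
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

def importances(X, y):
    model = lgb.LGBMClassifier(
        n_estimators=100,
        colsample_bytree=0.52,
        bin_construct_sample_cnt=50_000,  # deliberately < number of rows
        random_state=42,
    )
    model.fit(X, y)
    return model.booster_.feature_importance(importance_type="split")

# same data, original order vs. shuffled order
idx = np.random.default_rng(0).permutation(len(X))
print(np.abs(importances(X, y) - importances(X[idx], y[idx])).sum())
```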

@sahilkgit
Author

@jameslamb Thanks for the reply!

I also ran the same experiment, in the same environment, with the set of hyperparameters below.
The feature importance did not vary when using these.

Hyperparameters:
n_estimators=475, num_leaves=320, colsample_bytree="0.33", min_data_in_leaf=90, min_split_gain="0.00005", bagging_fraction=0.992, lambda_l1=0.11, lambda_l2=1.01, cat_l2=0.42, max_cat_threshold=300, cat_smooth=120, max_depth=22, min_data_per_group=90, learning_rate=0.1101, min_child_weight=0.20797, metric=["binary_logloss"], boosting_type="gbdt", random_state=42, n_jobs=24, verbose=-1, objective="binary", boost_from_average=True, min_data_in_bin=140, max_bin=225, bagging_freq=3, bin_construct_sample_cnt=0.7874

@jameslamb
Collaborator

I don't understand your response @sahilkgit . Do you still need help?

@sahilkgit
Author

Yes, @jameslamb, as you mentioned earlier:

My argument is based on your first response, where you suggested setting colsample_bytree and bagging_fraction to 1.0 (and doing the same with bin_construct_sample_cnt), because these might lead to different models when the data is shuffled.
However, even though I am not following those suggestions and am using the other values listed in my second response, I get similar models (with almost identical feature importance) when I rerun the same experiment described in my first response.

So I am not clear: what causes completely different models (completely different feature importance) with the first set of hyperparameters, but not with the second set?

@jameslamb
Collaborator

> even though I am not following those suggestions and am using the other values listed in my second response, I get similar models

It isn't guaranteed that you'll get a different model if you don't follow every one of my suggestions from #5887 (comment). The impact of those different settings is dependent on the size and distribution of your training data.

Without a reproducible example (code + data that exactly demonstrates the behavior you're seeing), there's not much else we can do to help here. By providing only parameters and your subjective judgment of models' feature importance being "similar" or not, you are asking for a significant amount of guessing by myself and others trying to help.

@github-actions

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 20, 2023