
[python-package] Difference between feature importance while changing the order of features #6362

Closed
ashad-media opened this issue Mar 15, 2024 · 1 comment


ashad-media commented Mar 15, 2024

Description

There are two numerical features: a, r
The remaining features are categorical: 'a', 'v', 's', 'a', 'r', 'q', 'w', 'e', 'r', 't', 'y', 'u'

There is a significant difference between the feature importance when:
Case 1: valid_features = valid_feat_categorical + valid_feat_numerical
Case 2: valid_features = valid_feat_numerical + valid_feat_categorical

Feature importance for Case 1: valid_features = valid_feat_categorical + valid_feat_numerical

Feature importance for Case 2: valid_features = valid_feat_numerical + valid_feat_categorical

Question: Does changing the order of features change the feature importance so drastically?
Note: in the first case, ad_keyword_ed (a categorical feature) was on top; in the second case, rate_acdk (a numerical feature) was on top.

Reproducible example

While training the model, I use this function:

import lightgbm as lgb

# `target` (the label column name) and `prep_data` are defined elsewhere.
def train_lightgbm(data, weight, valid_feat_numerical, valid_feat_categorical, parameters, test_data=None):
    # Case 1:
    # valid_features = valid_feat_categorical + valid_feat_numerical

    # Case 2:
    valid_features = valid_feat_numerical + valid_feat_categorical

    data[target] = data[target].astype('float64')
    data[weight] = data[weight].astype('float64')
    data = prep_data(data, valid_feat_numerical, valid_feat_categorical)
    train_dataset = lgb.Dataset(data[valid_features], label=data[target], weight=data[weight], free_raw_data=False)
    if test_data is not None:
        test_data = prep_data(test_data, valid_feat_numerical, valid_feat_categorical)
        test_dataset = lgb.Dataset(test_data[valid_features], label=test_data[target], weight=test_data[weight], free_raw_data=False)
    parameters["bin_construct_sample_cnt"] = int(data.shape[0] * 0.90)
    if test_data is not None:
        model_lgb = lgb.train(parameters, train_dataset, valid_sets=[train_dataset, test_dataset])
    else:
        model_lgb = lgb.train(parameters, train_dataset, valid_sets=[train_dataset])
    return model_lgb


valid_numerical_features = ['a', 'b']
valid_categorical_features = ['q', 'w', 'e', 'r', 't', 'y', 'u', 'i', 'o', 'p', 's', 'd']

train_lightgbm(data, weight, valid_numerical_features, valid_categorical_features, params, test_df)

Environment info

LightGBM version or commit hash:

Version: 3.1.0

@jameslamb (Collaborator) commented:

Thanks for using LightGBM.

> Does changing the order of features change the feature importance so drastically?

It's not common, but yes it is possible. For example, if 2 features are very similar then they may offer very similar explanatory power, and LightGBM will tie-break by choosing the one that appears earlier in the column order: #1294 (comment).

But I strongly suspect that the difference you're observing is mostly attributable to randomness between training runs.

Try running your code twice consecutively with 0 changes to the feature order, and checking whether the models produced are identical. If they aren't, you aren't yet controlling for randomness and need to address those issues before you can investigate these feature-importance changes.

If you can provide a minimal, reproducible example, we might be able to help more. Right now you've omitted significant details, like the definition of the prep_data() function and how you are doing train-test splitting.

See these related discussions:

> Version: 3.1.0

If you can, please consider updating to lightgbm>=4.3.0. v3.1.0 is about 3.5 years old, and there have been significant changes and improvements to this project since then.

@jameslamb jameslamb changed the title Difference between feature importance while changing the order of features [python-package] Difference between feature importance while changing the order of features Mar 16, 2024