
[python-package] Difference between feature importance while changing the order of features #6362

Closed
ashad-media opened this issue Mar 15, 2024 · 1 comment


ashad-media commented Mar 15, 2024

Description

There are two numerical features: a, r
The remaining features are categorical: 'a', 'v', 's', 'a', 'r', 'q', 'w', 'e', 'r', 't', 'y', 'u'

There is a significant difference between the feature importance when:
Case 1: valid_features = valid_feat_categorical + valid_feat_numerical
Case 2: valid_features = valid_feat_numerical + valid_feat_categorical

Feature importance for Case 1: valid_features = valid_feat_categorical + valid_feat_numerical

Feature importance for Case 2: valid_features = valid_feat_numerical + valid_feat_categorical

Question: Does changing the order of features change the feature importance so drastically?
Note: in the first case, ad_keyword_ed (a categorical feature) was on top; in the second case, rate_acdk (a numerical feature) was on top.

Reproducible example

While training the model, I use this function:

import lightgbm as lgb

# `target` (the label column name) and `prep_data` are defined elsewhere.
def train_lightgbm(data, weight, valid_feat_numerical, valid_feat_categorical, parameters, test_data=None):
    # Case 1:
    # valid_features = valid_feat_categorical + valid_feat_numerical

    # Case 2:
    valid_features = valid_feat_numerical + valid_feat_categorical

    data[target] = data[target].astype('float64')
    data[weight] = data[weight].astype('float64')
    data = prep_data(data, valid_feat_numerical, valid_feat_categorical)
    train_dataset = lgb.Dataset(data[valid_features], label=data[target], weight=data[weight], free_raw_data=False)
    if test_data is not None:
        test_data = prep_data(test_data, valid_feat_numerical, valid_feat_categorical)
        test_dataset = lgb.Dataset(test_data[valid_features], label=test_data[target], weight=test_data[weight], free_raw_data=False)
    parameters["bin_construct_sample_cnt"] = int(data.shape[0] * 0.90)
    if test_data is not None:
        model_lgb = lgb.train(parameters, train_dataset, valid_sets=[train_dataset, test_dataset])
    else:
        model_lgb = lgb.train(parameters, train_dataset, valid_sets=[train_dataset])
    return model_lgb


valid_numerical_features = ['a', 'b']
valid_categorical_features = ['q', 'w', 'e', 'r', 't', 'y', 'u', 'i', 'o', 'p', 's', 'd']

train_lightgbm(data, weight, valid_numerical_features, valid_categorical_features, params, test_df)

Environment info

LightGBM version or commit hash:

Version: 3.1.0

@jameslamb (Collaborator) commented:

Thanks for using LightGBM.

> Does changing the order of features change the feature importance so drastically?

It's not common, but yes it is possible. For example, if 2 features are very similar then they may offer very similar explanatory power, and LightGBM will tie-break by choosing the one that appears earlier in the column order: #1294 (comment).

But I strongly suspect that the difference you're observing is mostly attributable to randomness between training runs.

Try running your code twice consecutively with 0 changes to the feature order, and checking whether the models produced are identical. If they aren't, you aren't yet controlling for randomness and need to address those issues before you can investigate these feature-importance changes.

If you can provide a minimal, reproducible example, we might be able to help more. Right now you've omitted significant details, like the definition of the prep_data() function and how you are doing train-test splitting.

See these related discussions:

> Version: 3.1.0

If you can, please consider updating to lightgbm>=4.3.0. v3.1.0 is about 3.5 years old, and there have been significant changes and improvements to this project since then.

@jameslamb jameslamb changed the title Difference between feature importance while changing the order of features [python-package] Difference between feature importance while changing the order of features Mar 16, 2024