
Different feature importances on different feature order even with deterministic params #6069

Open
upwindowship opened this issue Aug 29, 2023 · 4 comments


upwindowship commented Aug 29, 2023

Description

We've run into an issue where identical input data produces different feature importances when the column order differs. This happens even with 'feature_fraction': 1.0, 'deterministic': True, 'force_row_wise': True, so it doesn't seem to be an issue of subsampling.

Reproducible example

Here is example data on which we were able to reproduce it: https://github.com/upgini/upgini/blob/add-lgbm-example/notebooks/lgbm_example_data.csv.zip

import lightgbm as lgb
import numpy as np
import pandas as pd

data = pd.read_csv("lgbm_example_data.csv.zip")
params = {'objective': 'huber', 'verbosity': -1, 'random_seed': 10, 'feature_fraction': 1.0, 'deterministic': True, 'force_row_wise': True}
train_columns = sorted([f for f in data.columns if f.startswith("f_")])

def train(x_train, columns):
    # Train on the given column order and return {feature_name: importance}.
    d_train = lgb.Dataset(x_train[columns], label=x_train["target"])
    bst = lgb.train(params, train_set=d_train)
    return dict(zip(bst.feature_name(), bst.feature_importance()))

# First run: columns in sorted order.
splits1 = train(data, train_columns)

# Second run: the same columns, shuffled.
rng = np.random.default_rng(42)
train_columns_shuffled = train_columns.copy()
rng.shuffle(train_columns_shuffled)

splits2 = train(data, train_columns_shuffled)

# Print every feature whose importance differs between the two runs.
for k in set(splits1.keys()).union(set(splits2.keys())):
    v1 = splits1.get(k)
    v2 = splits2.get(k)
    if v1 != v2:
        print(f"{k}: {v1} vs {v2}")

-----
f_139: 491 vs 485
f_110: 15 vs 5
f_188: 290 vs 23
f_189: 13 vs 296

This param set produces fewer variations, but the results still differ:

params = {'objective': 'huber', 'verbosity': -1, 'random_seed': 10, 'max_depth': 4, 'num_leaves': 16, 'max_cat_threshold': 80, 'min_data_per_group': 25, 'cat_l2': 10, 'cat_smooth': 12, 'num_boost_round': 100, 'learning_rate': 0.1, 'min_sum_hessian_in_leaf': 5, 'feature_fraction': 1.0, 'deterministic': True, 'force_row_wise': True}
---
f_188: 27 vs 14
f_189: 0 vs 13
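
As a sanity check (illustrative, reusing train(), train_columns, and splits1 from the snippet above): with the column order held fixed, a repeated run should produce identical importances, which would isolate column order as the only variable:

# Re-train with the original sorted column order; with deterministic
# params, this run is expected to match the first run exactly.
repeat = train(data, train_columns)
assert repeat == splits1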

Environment info

LightGBM version or commit hash:
Tested on both 3.3.5 and 4.0.0

Command(s) you used to install LightGBM:

pip install lightgbm

x86 build


@jameslamb (Collaborator) commented:

Thanks for using LightGBM.

Before we investigate this... please see the very similar discussions in this project's issue tracker:

Once you've read those, if you're certain none of the advice there applies to your situation, let us know and someone will take a closer look.

@upwindowship (Author) commented:

I'd seen those issues before writing; sadly, none of them relates to our case. What matters to us here is feature importance stability rather than scores, because we use feature importance in our feature selection algorithm. The example data is not duplicated, it has 11k rows, and we don't use parameters that introduce randomness.
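
To illustrate the kind of workaround we'd otherwise have to bolt on (a rough sketch reusing data, train(), and train_columns from the repro above; n_runs is an arbitrary choice, not tuned): averaging importances over several shuffled column orders.

# Hypothetical mitigation sketch, not a fix: average importances over
# several shuffled column orders to damp the order-dependence.
from collections import defaultdict

n_runs = 5  # assumed number of shuffles
rng = np.random.default_rng(0)
avg_importance = defaultdict(float)
for _ in range(n_runs):
    cols = train_columns.copy()
    rng.shuffle(cols)
    for name, imp in train(data, cols).items():
        avg_importance[name] += imp / n_runs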

@upwindowship (Author) commented:

@jameslamb Is there any news on this? It seems like a bug, at least given what the documentation led me to expect.

@jameslamb (Collaborator) commented:

Please don't leave "any updates on this?"-type comments in this project. If you're interested in investigating this and trying to find and fix the root cause, or if you have new information to add, we'd be grateful for the help.

Otherwise, being subscribed to the issue is a sufficient guarantee that you'll be notified if anything about it changes.
