
Different feature importances on different feature order even with deterministic params #6069

Open
upwindowship opened this issue Aug 29, 2023 · 4 comments


upwindowship commented Aug 29, 2023

Description

We've run into an issue where identical input data produces different feature importances when the column order differs. This happens even with 'feature_fraction': 1.0, 'deterministic': True, 'force_row_wise': True, so it doesn't seem to be an issue of subsampling.

Reproducible example

Here is example data on which we were able to reproduce it: https://github.com/upgini/upgini/blob/add-lgbm-example/notebooks/lgbm_example_data.csv.zip

import lightgbm as lgb
import numpy as np
import pandas as pd

data = pd.read_csv("lgbm_example_data.csv.zip")
params = {'objective': 'huber', 'verbosity': -1, 'random_seed': 10, 'feature_fraction': 1.0, 'deterministic': True, 'force_row_wise': True}
train_columns = sorted([f for f in data.columns if f.startswith("f_")])

def train(x_train, columns):
    # Train on the given column order and return {feature_name: importance}.
    d_train = lgb.Dataset(x_train[columns], label=x_train["target"])
    bst = lgb.train(params, train_set=d_train)
    return dict(zip(bst.feature_name(), bst.feature_importance()))

# First run: columns in sorted order.
splits1 = train(data, train_columns)

# Second run: the same columns, shuffled.
rng = np.random.default_rng(42)
train_columns_shuffled = train_columns.copy()
rng.shuffle(train_columns_shuffled)

splits2 = train(data, train_columns_shuffled)

# Print every feature whose importance differs between the two runs.
for k in set(splits1.keys()).union(set(splits2.keys())):
    v1 = splits1.get(k)
    v2 = splits2.get(k)
    if v1 != v2:
        print(f"{k}: {v1} vs {v2}")

-----
f_139: 491 vs 485
f_110: 15 vs 5
f_188: 290 vs 23
f_189: 13 vs 296

This param set produces fewer variations, but the results still differ:

params = {'objective': 'huber', 'verbosity': -1, 'random_seed': 10, 'max_depth': 4, 'num_leaves': 16, 'max_cat_threshold': 80, 'min_data_per_group': 25, 'cat_l2': 10, 'cat_smooth': 12, 'num_boost_round': 100, 'learning_rate': 0.1, 'min_sum_hessian_in_leaf': 5, 'feature_fraction': 1.0, 'deterministic': True, 'force_row_wise': True}
---
f_188: 27 vs 14
f_189: 0 vs 13
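
As a sanity check (illustrative, reusing train(), train_columns, and splits1 from the snippet above): with the column order held fixed, a repeated run should produce identical importances, which would isolate column order as the only variable:

# Re-train with the original sorted column order; with deterministic
# params, this run is expected to match the first run exactly.
repeat = train(data, train_columns)
assert repeat == splits1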

Environment info

LightGBM version or commit hash:
Tested on both 3.3.5 and 4.0.0

Command(s) you used to install LightGBM:

pip install lightgbm

x86 build


@jameslamb (Collaborator) commented:

Thanks for using LightGBM.

Before we investigate this... please see the very similar discussions in this project's issue tracker:

Once you've read those, if you're certain none of the advice there applies to your situation, let us know and someone will take a closer look.

@upwindowship (Author) commented:

I'd seen those issues before writing; sadly, none of them relates to our case. What matters to us here is feature importance stability rather than scores, because we use feature importance in our feature selection algorithm. The example data is not duplicated, it has 11k rows, and we don't use parameters that introduce randomness.
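
To illustrate the kind of workaround we'd otherwise have to bolt on (a rough sketch reusing data, train(), and train_columns from the repro above; n_runs is an arbitrary choice, not tuned): averaging importances over several shuffled column orders.

# Hypothetical mitigation sketch, not a fix: average importances over
# several shuffled column orders to damp the order-dependence.
from collections import defaultdict

n_runs = 5  # assumed number of shuffles
rng = np.random.default_rng(0)
avg_importance = defaultdict(float)
for _ in range(n_runs):
    cols = train_columns.copy()
    rng.shuffle(cols)
    for name, imp in train(data, cols).items():
        avg_importance[name] += imp / n_runs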

@upwindowship (Author) commented:

@jameslamb Is there any news on this? It seems like a bug, at least given what the documentation led me to expect.

@jameslamb (Collaborator) commented:

Please don't leave "any updates on this?"-type comments in this project. If you're interested in investigating this and trying to find and fix the root cause, or if you have new information to add, we'd be grateful for the help.

Otherwise, being subscribed to the issue is a sufficient guarantee that you'll be notified if anything about it changes.
