
[python] suppress the warning about categorical feature override #3379

Closed
guolinke opened this issue Sep 11, 2020 · 14 comments · Fixed by #4768
@guolinke
Collaborator

C:\ProgramData\Anaconda3\lib\site-packages\lightgbm\basic.py:1555: UserWarning: categorical_feature in Dataset is overridden.
New categorical_feature is ['C01', 'C02', 'C03', 'C04', 'C05', 'C06', 'C07', 'C08', 'C09', 'C10', 'C11', 'C12', 'C13', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'C22', 'C23', 'C24', 'C25', 'C26']
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))
C:\ProgramData\Anaconda3\lib\site-packages\lightgbm\basic.py:1286: UserWarning: Overriding the parameters from Reference Dataset.
  warnings.warn('Overriding the parameters from Reference Dataset.')
C:\ProgramData\Anaconda3\lib\site-packages\lightgbm\basic.py:1098: UserWarning: categorical_column in param dict is overridden.
  warnings.warn('{} in param dict is overridden.'.format(cat_alias))

categorical_feature can be set in both lgb.train and lgb.Dataset, but this warning always seems to show up when it is set. I think this is quite annoying.

@onshek

onshek commented Sep 23, 2020

I'm willing to take this as my first contribution to the repo.
It seems a new parameter ignore_warnings, defaulting to True, could be added to the related functions, right?

@XiaozhouWang85

From reading other issues, it seems the "correct" way of defining categorical columns is via the Dataset. This might only be a problem when the input is a pandas DataFrame.

The problem is that when calling train, the categorical feature parameter defaults to auto:
categorical_feature (list of strings or int, or 'auto', optional (default="auto"))

If we had an option of None, that might be a way of getting rid of the conflicting settings that trigger the warning.

@istavnit

There seems to be no friendly way to suppress these warnings. This code:

XX_train = XX_train[keeperCols]
XX_valid = XX_valid[keeperCols]
# I have only one categorical feature, 'DOW'
trainData = Dataset(XX_train, yy_train, feature_name=keeperCols, categorical_feature=['DOW'])
valid_data = trainData.create_valid(XX_valid, label=yy_valid)
params["seed"] = theSeed
bst = lgb.train(params, trainData, valid_sets=[valid_data], categorical_feature=['DOW'], feature_name=keeperCols,
                num_boost_round=BOOSTROUNDS, early_stopping_rounds=EARLYSTOPROUNDS, verbose_eval=VERBOSEEVAL)

results in these warnings:

C:\Users\meanc\anaconda3\envs\keras\lib\site-packages\lightgbm\basic.py:1551: UserWarning: Using categorical_feature in Dataset.
  warnings.warn('Using categorical_feature in Dataset.')
C:\Users\meanc\anaconda3\envs\keras\lib\site-packages\lightgbm\basic.py:1555: UserWarning: categorical_feature in Dataset is overridden.
New categorical_feature is ['DOW']
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))
C:\Users\meanc\anaconda3\envs\keras\lib\site-packages\lightgbm\basic.py:1286: UserWarning: Overriding the parameters from Reference Dataset.
  warnings.warn('Overriding the parameters from Reference Dataset.')
C:\Users\meanc\anaconda3\envs\keras\lib\site-packages\lightgbm\basic.py:1098: UserWarning: categorical_column in param dict is overridden.
  warnings.warn('{} in param dict is overridden.'.format(cat_alias))

@memeplex

memeplex commented Dec 10, 2020

When I do a grid search using the sklearn API and pass eval_set to fit, I get this warning for every element in the grid (many times!).

I'm just passing a DataFrame with categorical features as the training X, and the same for the eval set, never explicitly passing categorical_feature.

I don't think this behavior is desirable.

@memeplex

Indeed it's not necessary to use the sklearn API in order to reproduce the above. I've provided simple instructions in #3640.

@tripti0125

I get this warning when using the scikit-learn wrapper of LightGBM. The dataset passed to LightGBM goes through a scikit-learn pipeline which preprocesses the data in a pandas DataFrame and produces a NumPy array, so the input the model receives is NOT a pandas DataFrame but a NumPy array. I set the feature_name and categorical_feature parameters in the fit() method, as this is the only place they can be set if you're not using LightGBM's native Dataset creation.

I think the warning is useful in some situations but superfluous in the case mentioned above.

C:..\anaconda3\lib\site-packages\lightgbm\basic.py:1286: UserWarning: Overriding the parameters from Reference Dataset.
warnings.warn('Overriding the parameters from Reference Dataset.')
C:..\anaconda3\lib\site-packages\lightgbm\basic.py:1098: UserWarning: categorical_column in param dict is overridden.
warnings.warn('{} in param dict is overridden.'.format(cat_alias))

@shiyu1994 shiyu1994 self-assigned this Mar 24, 2021
@ThomasBourgeois

ThomasBourgeois commented Jun 23, 2021

Hi all,
Same issue here: I'm specifying categorical features pretty much everywhere (in the train and val datasets, plus in train), and I still get warnings. Very strange.
Warnings obtained:
"Overriding the parameters from Reference Dataset.
categorical_column in param dict is overridden."

Code :

train_data = lgb.Dataset(train[feats], label=train[target],
                         feature_name=feats,
                         categorical_feature=cat_feats)
val_data = lgb.Dataset(val[feats], label=val[target], reference=train_data,
                          feature_name=feats,
                          categorical_feature=cat_feats)

params = {'objective': 'mean_squared_error',
          'metric': 'rmse', 'eta': 0.1, 'bagging_fraction': 0.5}
num_round = 300
bst = lgb.train(params, train_data, num_round, valid_sets=[val_data], early_stopping_rounds=20, 
               categorical_feature=cat_feats, feature_name=feats
               )

@memeplex

With multiple processes in a grid search it's not even possible to use a context manager to suppress this warning during fit; it seems the context state is lost somehow, and my notebook gets literally flooded with:

/Users/carlos/Base/Environments/Jampp/lib/python3.9/site-packages/lightgbm/basic.py:1702: UserWarning: Using categorical_feature in Dataset.
  _log_warning('Using categorical_feature in Dataset.')
/Users/carlos/Base/Environments/Jampp/lib/python3.9/site-packages/lightgbm/basic.py:1702: UserWarning: Using categorical_feature in Dataset.
  _log_warning('Using categorical_feature in Dataset.')
....

@memeplex

memeplex commented Aug 28, 2021

As a workaround I did the following:

import warnings

import lightgbm as lgb


class SilentRegressor(lgb.LGBMRegressor):
    def fit(self, *args, **kwargs):
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=UserWarning)
            return super().fit(*args, verbose=False, **kwargs)

@hzy46 hzy46 self-assigned this Oct 29, 2021
@hzy46
Contributor

hzy46 commented Nov 1, 2021

I did some investigation into this issue. There are mainly two sources for these warnings regarding categorical features.

If the dataset doesn't have a reference, the warnings only come from here:

def set_categorical_feature(self, categorical_feature):
    """Set categorical features.

    Parameters
    ----------
    categorical_feature : list of int or str
        Names or indices of categorical features.

    Returns
    -------
    self : Dataset
        Dataset with set categorical features.
    """
    if self.categorical_feature == categorical_feature:
        return self
    if self.data is not None:
        if self.categorical_feature is None:
            self.categorical_feature = categorical_feature
            return self._free_handle()
        elif categorical_feature == 'auto':
            _log_warning('Using categorical_feature in Dataset.')
            return self
        else:
            _log_warning('categorical_feature in Dataset is overridden.\n'
                         f'New categorical_feature is {sorted(list(categorical_feature))}')
            self.categorical_feature = categorical_feature
            return self._free_handle()
    else:
        raise LightGBMError("Cannot set categorical feature after freed raw data, "
                            "set free_raw_data=False when construct Dataset to avoid this.")

This function is called before Dataset.construct() is called.

One can use the following code to reproduce:

import random
import numpy as np
import pandas as pd
import lightgbm as lgb


Categorical_Feature_When_Construct_Dataset = ["a", "b", "d"]
Categorical_Feature_When_Train = 'auto'


def get_data(N):
    data = []
    labels = []
    for i in range(N):
        sample = {
            "a": random.choice([100, 200, 300, 400]),
            "b": random.choice([222, 333]),
            "c": random.random(),
        }
        if sample["a"] == 200 or sample["a"] == 300:
            if sample["b"] == 333:
                label = 1
            else:
                label = 0
        else:
            label = 0
        labels.append(label)
        data.append(sample)
    features = pd.DataFrame(data)
    features["d"] = pd.Categorical(
        [random.choice(["x", "y", "z"]) for i in range(N)], categories=["x", "y", "z"], ordered=False
    )
    labels = pd.Series(labels)
    return features, labels
 
N = 1000
train_features, train_labels = get_data(N)
test_features, test_labels = get_data(N)
 
lgb_train = lgb.Dataset(train_features, train_labels, categorical_feature=Categorical_Feature_When_Construct_Dataset)


params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 4,
    'learning_rate': 0.5,
    'verbose': 0,
}


gbm = lgb.train(params,
                lgb_train,
                num_boost_round=1,
                categorical_feature=Categorical_Feature_When_Train,
)
 

Here, if Categorical_Feature_When_Construct_Dataset doesn't equal Categorical_Feature_When_Train, a warning will be reported. These two parameters have the same default value: auto. It becomes confusing when the user only sets Categorical_Feature_When_Construct_Dataset and doesn't set Categorical_Feature_When_Train. The expectation would be that the dataset's categorical feature setting is used, but a warning is reported because Categorical_Feature_When_Train has a default value of auto, which doesn't align with Categorical_Feature_When_Construct_Dataset.

If we use cf1 to represent Categorical_Feature_When_Construct_Dataset, and cf2 to represent Categorical_Feature_When_Train, we can have the following behavior:

| case | cf1 | cf2 | Current behavior | Warning |
| ---- | --- | --- | ---------------- | ------- |
| 1 | auto | auto | auto | no |
| 2 | auto | specific columns | use cf2 | yes |
| 3 | specific columns | auto | use cf1 | yes |
| 4 | specific columns | specific columns | use cf2 | yes, if cf1 and cf2 are different |

For this first source, my proposal is:

If the user is using specific columns to override "auto", we don't report the warning, because the user is just overriding the default parameter.

This aligns with the current behavior. What we need to do is remove the warning for case 2 and case 3 in the table.
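That rule can be sketched as a small standalone predicate (a sketch only, not LightGBM's actual code; the name should_warn_on_override is hypothetical): warn only when both settings are explicit lists and they actually differ.

```python
def should_warn_on_override(cf_dataset, cf_train):
    """Return True only for case 4 in the table above:
    both settings are explicit lists and they differ."""
    # Cases 1-3: at least one side is the default 'auto', so the
    # user is merely overriding a default value -> no warning.
    if cf_dataset == 'auto' or cf_train == 'auto':
        return False
    # Case 4: warn only if the explicit lists actually differ.
    return sorted(cf_dataset) != sorted(cf_train)


print(should_warn_on_override('auto', 'auto'))          # case 1: False
print(should_warn_on_override('auto', ['a', 'b']))      # case 2: False
print(should_warn_on_override(['a', 'b'], 'auto'))      # case 3: False
print(should_warn_on_override(['a', 'b'], ['a', 'c']))  # case 4: True
```

Note that comparing the sorted lists also keeps case 4 silent when the same columns are given in a different order.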

The second source comes from the dataset with a reference:

reference_params = self.reference.get_params()
if self.get_params() != reference_params:
    _log_warning('Overriding the parameters from Reference Dataset.')
    self._update_params(reference_params)

This is always reported if the referenced dataset has any categorical features. For the referenced dataset, its self.params is changed here:

if categorical_feature is not None:
    categorical_indices = set()
    feature_dict = {}
    if feature_name is not None:
        feature_dict = {name: i for i, name in enumerate(feature_name)}
    for name in categorical_feature:
        if isinstance(name, str) and name in feature_dict:
            categorical_indices.add(feature_dict[name])
        elif isinstance(name, int):
            categorical_indices.add(name)
        else:
            raise TypeError(f"Wrong type({type(name).__name__}) or unknown name({name}) in categorical_feature")
    if categorical_indices:
        for cat_alias in _ConfigAliases.get("categorical_feature"):
            if cat_alias in params:
                _log_warning(f'{cat_alias} in param dict is overridden.')
                params.pop(cat_alias, None)
        params['categorical_column'] = sorted(categorical_indices)
params_str = param_dict_to_str(params)
self.params = params

For this one, my suggestion is to ignore categorical features when comparing the params.
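That comparison could look roughly like the sketch below. The alias set here is an assumption inferred from the warnings quoted above (the real list comes from _ConfigAliases.get("categorical_feature")), and params_differ_ignoring_categorical is a hypothetical helper, not LightGBM's API.

```python
# Aliases assumed from the warning messages above; in LightGBM the
# authoritative list comes from _ConfigAliases.get("categorical_feature").
CATEGORICAL_ALIASES = {"categorical_feature", "cat_feature",
                       "categorical_column", "cat_column"}


def params_differ_ignoring_categorical(params, reference_params):
    """Compare two param dicts while ignoring categorical-feature aliases,
    so a Dataset built with reference= does not warn merely because
    'categorical_column' was injected into its reference's params."""
    def _strip(p):
        return {k: v for k, v in p.items() if k not in CATEGORICAL_ALIASES}
    return _strip(params) != _strip(reference_params)


a = {"verbose": 0, "categorical_column": [0, 1]}
b = {"verbose": 0}
print(params_differ_ignoring_categorical(a, b))  # differ only in categorical -> False
```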

@shiyu1994
Collaborator

Thanks for your detailed analysis. I think the proposed solution is feasible!

@thisisreallife

As a workaround I did the following:

class SilentRegressor(lgb.LGBMRegressor):
    def fit(self, *args, **kwargs):
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=UserWarning)
            return super().fit(*args, verbose=False, **kwargs)

The following code is fine if we do not want to create a new class:

import warnings
warnings.filterwarnings("ignore", category=UserWarning)
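If ignoring every UserWarning globally feels too broad, the standard warnings filter also accepts a message regex, so a narrower variant of the same idea could target just these messages (a sketch; the patterns are copied from the warnings quoted in this thread, and this only helps for warnings emitted through Python's warnings module):

```python
import warnings

# Suppress only the categorical-feature warnings seen in this thread;
# each pattern is matched against the start of the warning message.
for pattern in (
    r"Using categorical_feature in Dataset\.",
    r"categorical_feature in Dataset is overridden\.",
    r"Overriding the parameters from Reference Dataset\.",
    r".* in param dict is overridden\.",
):
    warnings.filterwarnings("ignore", message=pattern, category=UserWarning)
```

Other UserWarnings still get through, which is the point of filtering by message instead of by category alone.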

@Nevermetyou65

Is this problem solved??
I still get this warning even when I set auto both when constructing the dataset and in train. It's so annoying.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023