Support complex data types in categorical columns of pandas DataFrame #2134

Closed
mallibus opened this issue Apr 29, 2019 · 7 comments

@mallibus

mallibus commented Apr 29, 2019

I am converting one or more float64 columns into categorical bins to speed up convergence and to force the boundaries of the decision points. Attempting to bin the float columns with pd.cut or pd.qcut makes LightGBM's fit fail with the error below.

Environment info

Operating System: Windows 10

CPU/GPU model: Intel Core i7

C++/Python/R version: Python 3.6, Anaconda, Jupyter Notebook, pandas 0.24.2

LightGBM version or commit hash: LightGBM 2.2.2

Error message

ValueError: Circular reference detected

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-207-751795a98846> in <module>
      4                            metric         = ['binary_logloss'])
      5 
----> 6 lgbmc.fit(X_train,y_train)
      7 
      8 prob_pred = lgbmc.predict(X_test)

~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\sklearn.py in fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)
    740                                         verbose=verbose, feature_name=feature_name,
    741                                         categorical_feature=categorical_feature,
--> 742                                         callbacks=callbacks)
    743         return self
    744 

~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\sklearn.py in fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)
    540                               verbose_eval=verbose, feature_name=feature_name,
    541                               categorical_feature=categorical_feature,
--> 542                               callbacks=callbacks)
    543 
    544         if evals_result:

~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    238         booster.best_score[dataset_name][eval_name] = score
    239     if not keep_training_booster:
--> 240         booster.model_from_string(booster.model_to_string(), False).free_dataset()
    241     return booster
    242 

~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\basic.py in model_to_string(self, num_iteration, start_iteration)
   2064                 ptr_string_buffer))
   2065         ret = string_buffer.value.decode()
-> 2066         ret += _dump_pandas_categorical(self.pandas_categorical)
   2067         return ret
   2068 

~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\basic.py in _dump_pandas_categorical(pandas_categorical, file_name)
    299     pandas_str = ('\npandas_categorical:'
    300                   + json.dumps(pandas_categorical, default=json_default_with_numpy)
--> 301                   + '\n')
    302     if file_name is not None:
    303         with open(file_name, 'a') as f:

~\AppData\Local\conda\conda\envs\py36\lib\json\__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    236         check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    237         separators=separators, default=default, sort_keys=sort_keys,
--> 238         **kw).encode(obj)
    239 
    240 

~\AppData\Local\conda\conda\envs\py36\lib\json\encoder.py in encode(self, o)
    197         # exceptions aren't as detailed.  The list call should be roughly
    198         # equivalent to the PySequence_Fast that ''.join() would do.
--> 199         chunks = self.iterencode(o, _one_shot=True)
    200         if not isinstance(chunks, (list, tuple)):
    201             chunks = list(chunks)

~\AppData\Local\conda\conda\envs\py36\lib\json\encoder.py in iterencode(self, o, _one_shot)
    255                 self.key_separator, self.item_separator, self.sort_keys,
    256                 self.skipkeys, _one_shot)
--> 257         return _iterencode(o, 0)
    258 
    259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

ValueError: Circular reference detected

Reproducible example

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

rows = 100
fcols = 5
ccols = 5
# Let's define some ASCII-readable names for convenience
fnames = ['Float_' + str(chr(97 + n)) for n in range(fcols)]
cnames = ['Cat_' + str(chr(97 + n)) for n in range(ccols)]

# The dataset is built by concatenation of the float and the int blocks
dff = pd.DataFrame(np.random.rand(rows,fcols),columns=fnames)
dfc = pd.DataFrame(np.random.randint(0,20,(rows,ccols)),columns=cnames)
df = pd.concat([dfc,dff],axis=1)
# Target column with random output
df['Target'] = (np.random.rand(rows)>0.5).astype(int)

# Conversion into categorical
df[cnames] = df[cnames].astype('category')
df['Float_a'] = pd.cut(x=df['Float_a'],bins=10)

# Dataset split
X = df.drop('Target',axis=1)
y = df['Target'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Model instantiation
lgbmc = lgb.LGBMClassifier(objective='binary',
                           boosting_type='gbdt',
                           is_unbalance=True,
                           metric=['binary_logloss'])

lgbmc.fit(X_train,y_train)

Steps to reproduce

  1. Copy code in a Jupyter Notebook cell
  2. Execute
  3. Remove the line df['Float_a'] = pd.cut(x=df['Float_a'],bins=10) and the error disappears
@mallibus
Author

I realized that the JSON serializer has some issue with the Interval dtype. Changing the categories to strings, e.g. df['Float_a'].cat.categories = ["%6.3f-%6.3f"%(x.left,x.right) for x in df['Float_a'].cat.categories], makes the issue disappear.
This is strange because the pandas JSON serializer has no problem with Interval types.
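
The workaround above can be sketched in a self-contained form. This is a minimal sketch of the idea, not LightGBM code; it uses rename_categories() instead of assigning to .cat.categories, since the latter is deprecated in recent pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Float_a': np.random.rand(100)})
df['Float_a'] = pd.cut(df['Float_a'], bins=10)  # categories are pd.Interval objects

# Rename the Interval categories to plain strings so that LightGBM's
# JSON dump of pandas categories can serialize them:
df['Float_a'] = df['Float_a'].cat.rename_categories(
    ["%6.3f-%6.3f" % (iv.left, iv.right) for iv in df['Float_a'].cat.categories]
)
```

The bin boundaries are preserved in the labels, so the categories remain human-readable while being JSON-serializable.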

@guolinke
Collaborator

guolinke commented Aug 1, 2019

I tried pd.cut, and it returns the interval ranges, not the binned int values, so we cannot use it for training.
However, the error message is a little bit confusing.
@StrikerRUS Can we refine the error message when meeting unexpected pandas DataFrames?

@mallibus
Author

mallibus commented Aug 1, 2019 via email

@StrikerRUS
Collaborator

@mallibus

Actually my thinking was that interval range would work as a category, not as a value.

Yeah, you're right! Interval values are treated as categorical. Unfortunately, LightGBM supports only simple types of categories, e.g. int, float, string.

During training, LightGBM dumps pandas categories to JSON. It uses the standard json.dumps() function with our simple numpy serializer:

pandas_categorical = [list(data[col].cat.categories) for col in cat_cols]

def _dump_pandas_categorical(pandas_categorical, file_name=None):
    pandas_str = ('\npandas_categorical:'
                  + json.dumps(pandas_categorical, default=json_default_with_numpy)
                  + '\n')
    if file_name is not None:
        with open(file_name, 'a') as f:
            f.write(pandas_str)
    return pandas_str

def json_default_with_numpy(obj):
    """Convert numpy classes to JSON serializable objects."""
    if isinstance(obj, (np.integer, np.floating, np.bool_)):
        return obj.item()
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return obj

In your case, categories are pandas.Interval objects, which cannot be serialized in this manner.
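
The misleading "Circular reference detected" message can be reproduced with json alone, independently of LightGBM (a minimal sketch): because json_default_with_numpy returns unknown objects unchanged, the encoder sees the same object id again after calling the default hook and reports a circular reference instead of an unsupported type:

```python
import json

import pandas as pd

iv = pd.Interval(0.0, 1.0)

# Without a `default` hook, json.dumps rejects the Interval outright:
try:
    json.dumps([iv])
    plain_error = None
except TypeError as e:
    plain_error = e

# A pass-through `default` hook (as in LightGBM's serializer for
# non-numpy objects) makes the encoder revisit the same object and
# misreport a circular reference:
try:
    json.dumps([iv], default=lambda obj: obj)
    passthrough_error = None
except ValueError as e:
    passthrough_error = e
```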

For instance, the same error can be reproduced by passing data whose categories are Timestamps.

import numpy as np
import pandas as pd
import lightgbm as lgb

df = pd.DataFrame([pd.to_datetime('01/{0}/2019'.format(i % 12 + 1)) for i in range(100)], columns=['a'])
df['Target'] = (np.random.rand(100) > 0.5).astype(int)
df['a'] = df['a'].astype('category')
X = df.drop('Target', axis=1)
y = df['Target'].astype(int)
lgb_data = lgb.Dataset(X, y)
lgb.train({}, lgb_data)  # raises ValueError: Circular reference detected
print(type(lgb_data.pandas_categorical[0][0]))
# <class 'pandas._libs.tslibs.timestamps.Timestamp'>

Removing pd.to_datetime() results in successful training.

I see several ways to fix this issue.

  • Leave everything as is 😄 .
  • Raise a more user-friendly error for unsupported types in category (I have no idea what types we should check).
  • Replace complicated objects with their __repr__ string during dumping, so that categories become simple strings. However, this will not allow us to restore the original objects during loading, and the original DataFrame will be modified.
  • Utilize pandas' to_json() method. It would bring an unwanted dependency into the library and the need to carefully maintain it. BTW, the pandas team is planning a huge refactoring of JSON support: add indent support to to_json method pandas-dev/pandas#12004 (comment).
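
The __repr__ option could look roughly like this. This is a sketch of the idea only; json_default_with_repr is a hypothetical name, not part of LightGBM:

```python
import json

import numpy as np
import pandas as pd

def json_default_with_repr(obj):
    """Variant of LightGBM's json_default_with_numpy that falls back to
    repr() instead of returning the unknown object unchanged."""
    if isinstance(obj, (np.integer, np.floating, np.bool_)):
        return obj.item()
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return repr(obj)  # e.g. pd.Interval becomes a plain string

# Interval categories now serialize as strings instead of failing:
dumped = json.dumps([[pd.Interval(0.0, 1.0)]], default=json_default_with_repr)
```

As noted above, the trade-off is that the round trip is lossy: the strings cannot be turned back into Interval objects when the model is loaded.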

Maybe someone else has other ideas?

@mallibus
Author

mallibus commented Aug 13, 2019 via email

@StrikerRUS
Collaborator

@mallibus Thanks a lot for your feedback!

@StrikerRUS
Collaborator

Closing in favor of #2302. We decided to keep all feature requests in one place.

Contributions implementing this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on it.

@StrikerRUS StrikerRUS changed the title LightGBM fit throws “ValueError: Circular reference detected” with categorical feature from pd.cut Support complex data types in categorical columns of pandas DataFrame Nov 11, 2019