Support complex data types in categorical columns of pandas DataFrame #2134

Closed
mallibus opened this issue Apr 29, 2019 · 7 comments

@mallibus

mallibus commented Apr 29, 2019

I am converting one or more float64 columns into categorical bins to speed up convergence and to force the boundaries of the decision points. Attempting to bin the float columns with pd.cut or pd.qcut makes LightGBM's fit fail with the error below.

Environment info

Operating System: Windows 10

CPU/GPU model: Intel Core i7

C++/Python/R version: Python 3.6, Anaconda, Jupyter Notebook, pandas 0.24.2

LightGBM version or commit hash: LightGBM 2.2.2

Error message

ValueError: Circular reference detected

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-207-751795a98846> in <module>
      4                            metric         = ['binary_logloss'])
      5 
----> 6 lgbmc.fit(X_train,y_train)
      7 
      8 prob_pred = lgbmc.predict(X_test)

~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\sklearn.py in fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)
    740                                         verbose=verbose, feature_name=feature_name,
    741                                         categorical_feature=categorical_feature,
--> 742                                         callbacks=callbacks)
    743         return self
    744 

~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\sklearn.py in fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)
    540                               verbose_eval=verbose, feature_name=feature_name,
    541                               categorical_feature=categorical_feature,
--> 542                               callbacks=callbacks)
    543 
    544         if evals_result:

~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    238         booster.best_score[dataset_name][eval_name] = score
    239     if not keep_training_booster:
--> 240         booster.model_from_string(booster.model_to_string(), False).free_dataset()
    241     return booster
    242 

~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\basic.py in model_to_string(self, num_iteration, start_iteration)
   2064                 ptr_string_buffer))
   2065         ret = string_buffer.value.decode()
-> 2066         ret += _dump_pandas_categorical(self.pandas_categorical)
   2067         return ret
   2068 

~\AppData\Local\conda\conda\envs\py36\lib\site-packages\lightgbm\basic.py in _dump_pandas_categorical(pandas_categorical, file_name)
    299     pandas_str = ('\npandas_categorical:'
    300                   + json.dumps(pandas_categorical, default=json_default_with_numpy)
--> 301                   + '\n')
    302     if file_name is not None:
    303         with open(file_name, 'a') as f:

~\AppData\Local\conda\conda\envs\py36\lib\json\__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    236         check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    237         separators=separators, default=default, sort_keys=sort_keys,
--> 238         **kw).encode(obj)
    239 
    240 

~\AppData\Local\conda\conda\envs\py36\lib\json\encoder.py in encode(self, o)
    197         # exceptions aren't as detailed.  The list call should be roughly
    198         # equivalent to the PySequence_Fast that ''.join() would do.
--> 199         chunks = self.iterencode(o, _one_shot=True)
    200         if not isinstance(chunks, (list, tuple)):
    201             chunks = list(chunks)

~\AppData\Local\conda\conda\envs\py36\lib\json\encoder.py in iterencode(self, o, _one_shot)
    255                 self.key_separator, self.item_separator, self.sort_keys,
    256                 self.skipkeys, _one_shot)
--> 257         return _iterencode(o, 0)
    258 
    259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

ValueError: Circular reference detected

Reproducible example

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

rows = 100
fcols = 5
ccols = 5
# Let's define some ASCII-readable names for convenience
fnames = ['Float_' + str(chr(97 + n)) for n in range(fcols)]
cnames = ['Cat_' + str(chr(97 + n)) for n in range(ccols)]

# The dataset is built by concatenation of the float and the int blocks
dff = pd.DataFrame(np.random.rand(rows,fcols),columns=fnames)
dfc = pd.DataFrame(np.random.randint(0,20,(rows,ccols)),columns=cnames)
df = pd.concat([dfc,dff],axis=1)
# Target column with random output
df['Target'] = (np.random.rand(rows)>0.5).astype(int)

# Conversion into categorical
df[cnames] = df[cnames].astype('category')
df['Float_a'] = pd.cut(x=df['Float_a'],bins=10)

# Dataset split
X = df.drop('Target',axis=1)
y = df['Target'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Model instantiation
lgbmc = lgb.LGBMClassifier(objective='binary',
                           boosting_type='gbdt',
                           is_unbalance=True,
                           metric=['binary_logloss'])

lgbmc.fit(X_train,y_train)

Steps to reproduce

  1. Copy code in a Jupyter Notebook cell
  2. Execute
  3. Remove the line df['Float_a'] = pd.cut(x=df['Float_a'],bins=10) and the error disappears
@mallibus
Author

I realized that the JSON serializer has some issue with the Interval dtype. Changing the categories to strings, e.g. df['Float_a'].cat.categories = ["%6.3f-%6.3f"%(x.left,x.right) for x in df['Float_a'].cat.categories], makes the issue disappear.
This is strange because the pandas JSON serializer has no problem with Interval types.
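
The workaround above can be sketched in a self-contained form. This is a minimal sketch of the idea, not LightGBM code; it uses rename_categories() instead of assigning to .cat.categories, since the latter is deprecated in recent pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Float_a': np.random.rand(100)})
df['Float_a'] = pd.cut(df['Float_a'], bins=10)  # categories are pd.Interval objects

# Rename the Interval categories to plain strings so that LightGBM's
# JSON dump of pandas categories can serialize them:
df['Float_a'] = df['Float_a'].cat.rename_categories(
    ["%6.3f-%6.3f" % (iv.left, iv.right) for iv in df['Float_a'].cat.categories]
)
```

The bin boundaries are preserved in the labels, so the categories remain human-readable while being JSON-serializable.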

@guolinke
Collaborator

guolinke commented Aug 1, 2019

I tried pd.cut, and it returns the interval ranges, not the binned int values, so we cannot use it for training.
However, the error message is a little bit confusing.
@StrikerRUS Can we refine the error message when meeting unexpected pandas DataFrames?

@mallibus
Author

mallibus commented Aug 1, 2019 via email

@StrikerRUS
Collaborator

@mallibus

Actually my thinking was that interval range would work as a category, not as a value.

Yeah, you're right! Interval values are treated as categorical. Unfortunately, LightGBM supports only simple types of categories, e.g. int, float, string.

During training, LightGBM dumps pandas categories to JSON. It uses the standard json.dumps() function with our simple numpy serializer:

pandas_categorical = [list(data[col].cat.categories) for col in cat_cols]

def _dump_pandas_categorical(pandas_categorical, file_name=None):
    pandas_str = ('\npandas_categorical:'
                  + json.dumps(pandas_categorical, default=json_default_with_numpy)
                  + '\n')
    if file_name is not None:
        with open(file_name, 'a') as f:
            f.write(pandas_str)
    return pandas_str

def json_default_with_numpy(obj):
    """Convert numpy classes to JSON serializable objects."""
    if isinstance(obj, (np.integer, np.floating, np.bool_)):
        return obj.item()
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return obj

In your case, categories are pandas.Interval objects, which cannot be serialized in this manner.
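
The misleading "Circular reference detected" message can be reproduced with json alone, independently of LightGBM (a minimal sketch): because json_default_with_numpy returns unknown objects unchanged, the encoder sees the same object id again after calling the default hook and reports a circular reference instead of an unsupported type:

```python
import json

import pandas as pd

iv = pd.Interval(0.0, 1.0)

# Without a `default` hook, json.dumps rejects the Interval outright:
try:
    json.dumps([iv])
    plain_error = None
except TypeError as e:
    plain_error = e

# A pass-through `default` hook (as in LightGBM's serializer for
# non-numpy objects) makes the encoder revisit the same object and
# misreport a circular reference:
try:
    json.dumps([iv], default=lambda obj: obj)
    passthrough_error = None
except ValueError as e:
    passthrough_error = e
```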

For instance, the same error can be reproduced by passing data whose categories are Timestamps.

import numpy as np
import pandas as pd
import lightgbm as lgb

df = pd.DataFrame([pd.to_datetime('01/{0}/2019'.format(i % 12 + 1)) for i in range(100)], columns=['a'])
df['Target'] = (np.random.rand(100) > 0.5).astype(int)
df['a'] = df['a'].astype('category')
X = df.drop('Target', axis=1)
y = df['Target'].astype(int)
lgb_data = lgb.Dataset(X, y)
lgb.train({}, lgb_data)  # raises ValueError: Circular reference detected
print(type(lgb_data.pandas_categorical[0][0]))
# <class 'pandas._libs.tslibs.timestamps.Timestamp'>

Removing pd.to_datetime() results in successful training.

I see several ways to fix this issue.

  • Leave everything as is 😄 .
  • Raise a more user-friendly error for unsupported types in category (I have no idea what types we should check).
  • Replace complicated objects with their __repr__ string during dumping, so that categories become simple strings. However, this will not allow us to restore the original objects during loading, and the original DataFrame will be modified.
  • Utilize pandas' to_json() method. It would bring an unwanted dependency into the library and the need to carefully maintain it. BTW, the pandas team is planning a huge refactoring of JSON support: add indent support to to_json method pandas-dev/pandas#12004 (comment).
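
The __repr__ option could look roughly like this. This is a sketch of the idea only; json_default_with_repr is a hypothetical name, not part of LightGBM:

```python
import json

import numpy as np
import pandas as pd

def json_default_with_repr(obj):
    """Variant of LightGBM's json_default_with_numpy that falls back to
    repr() instead of returning the unknown object unchanged."""
    if isinstance(obj, (np.integer, np.floating, np.bool_)):
        return obj.item()
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return repr(obj)  # e.g. pd.Interval becomes a plain string

# Interval categories now serialize as strings instead of failing:
dumped = json.dumps([[pd.Interval(0.0, 1.0)]], default=json_default_with_repr)
```

As noted above, the trade-off is that the round trip is lossy: the strings cannot be turned back into Interval objects when the model is loaded.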

Maybe someone else has other ideas?

@mallibus
Author

mallibus commented Aug 13, 2019 via email

@StrikerRUS
Collaborator

@mallibus Thanks a lot for your feedback!

@StrikerRUS
Collaborator

Closing in favor of #2302. We decided to keep all feature requests in one place.

Contributions implementing this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on it.

@StrikerRUS StrikerRUS changed the title LightGBM fit throws “ValueError: Circular reference detected” with categorical feature from pd.cut Support complex data types in categorical columns of pandas DataFrame Nov 11, 2019