Support complex data types in categorical columns of pandas DataFrame #2134
Comments
I realized that the JSON serializer has some issue with the Interval dtype. Changing the index of the category to string like |
I tried pd.cut, and it returns the interval ranges, not the binned int values. So we cannot use it for the training. |
Actually my thinking was that interval range would work as a category not
as a value.
On Thu, 1 Aug 2019, 07:38 Guolin Ke <notifications@github.com> wrote:
… I tried pd.cut, and it returns the interval ranges, not the binned int
values. So we cannot use it for the training.
However, the error message is a little bit confusing.
@StrikerRUS <https://github.com/StrikerRUS> Can we refine the error
message when encountering unexpected pandas DataFrames?
|
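Yeah, you're right! Interval values are treated as categorical. Unfortunately, LightGBM supports only simple types of categories, e.g. int, float, string.
During training LightGBM dumps pandas categories to JSON. It uses the standard json.dumps() function with our simple numpy array serializer:
LightGBM/python-package/lightgbm/basic.py Line 264 in 5cff4e8
LightGBM/python-package/lightgbm/basic.py Lines 310 to 317 in 5cff4e8
LightGBM/python-package/lightgbm/compat.py Lines 52 to 59 in 5cff4e8
In your case, categories are pandas.Interval objects, which cannot be serialized in such manner.
For instance, the same error can be reproduced by trying to pass the data where categories are Timestamps. Removing pd.to_datetime() results in successful training.
I see several ways to fix this issue.
- Leave everything as is.
- Raise a more user-friendly error for unsupported types in category (I have no idea what types we should check).
- Replace complicated objects by their __repr__ string during dumping, so that categories become simple strings. However, it will not allow us to restore original objects during loading, and the original DataFrame will be modified.
- Utilize the pandas to_json() method. It will bring an unwanted dependency to the library and the need to carefully maintain it.
Maybe someone else has other ideas? |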
As a user, my preference would be to combine two of the proposals:
- convert complex categories into their string representation, possibly
keeping their ordering properties (e.g. with leading zeros in numbers so
that the strings sort in the same order as the numbers).
- raise a warning to inform about the conversion, explaining a bit of
background.
Otherwise, the minimum would be a better error message with the suggestion
to convert categories into strings.
Thank you!
Marcello
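A minimal sketch of the combined proposal above, assuming plain pandas and no changes to LightGBM itself. The helper name `stringify_categories` is hypothetical (it is not part of LightGBM or pandas); it renames complex categories to their string form and warns about the conversion.

```python
import warnings

import pandas as pd


def stringify_categories(series):
    """Return a categorical Series whose categories are plain strings.

    Hypothetical helper for illustration; not part of LightGBM or pandas.
    """
    simple_types = (str, int, float)
    cats = series.cat.categories
    if all(isinstance(c, simple_types) for c in cats):
        return series  # nothing to convert
    warnings.warn(
        "Converting complex categories to their string representation.",
        UserWarning,
    )
    # rename_categories keeps the category order and the underlying codes.
    return series.cat.rename_categories([str(c) for c in cats])


# pd.cut produces an ordered categorical with pandas.Interval categories.
binned = pd.cut(pd.Series([0.1, 0.5, 0.9, 0.3]), bins=2)
converted = stringify_categories(binned)
```

Because the category order and codes are left unchanged here, the ordering survives inside pandas without zero-padding; whether lexical ordering of the strings matters further downstream is a separate question.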
On Sun, 11 Aug 2019, 23:16 Nikita Titov <notifications@github.com> wrote:
… @mallibus <https://github.com/mallibus>
Actually my thinking was that interval range would work as a category not
as a value.
Yeah, you're right! Interval values are treated as categorical.
Unfortunately, LightGBM supports only simple types of categories, e.g. int,
float, string.
During training LightGBM dumps pandas categories to json. It uses standard
json.dumps() function with our simple numpy array serializer:
https://github.com/microsoft/LightGBM/blob/5cff4e8e2fb2280ed302ed73c08bd95d035c7889/python-package/lightgbm/basic.py#L264
https://github.com/microsoft/LightGBM/blob/5cff4e8e2fb2280ed302ed73c08bd95d035c7889/python-package/lightgbm/basic.py#L310-L317
https://github.com/microsoft/LightGBM/blob/5cff4e8e2fb2280ed302ed73c08bd95d035c7889/python-package/lightgbm/compat.py#L52-L59
In your case, categories are pandas.Interval objects
<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Interval.html>
which cannot be serialized in such manner.
For instance, the same error can be reproduced by trying to pass the data
where categories are Timestamps.
import numpy as np
import pandas as pd
import lightgbm as lgb
# Build a column whose categories are pandas Timestamp objects.
df = pd.DataFrame([pd.to_datetime('01/{0}/2019'.format(i % 12 + 1)) for i in range(100)], columns=['a'])
df['Target'] = (np.random.rand(100) > 0.5).astype(int)
df['a'] = df['a'].astype('category')
X = df.drop('Target', axis=1)
y = df['Target'].astype(int)
lgb_data = lgb.Dataset(X, y)
lgb.train({}, lgb_data)
# Inspect the type of the categories stored on the Dataset.
print(type(lgb_data.pandas_categorical[0][0]))
# Output: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Removing pd.to_datetime() results in successful training.
I see several ways to fix this issue.
- Leave everything as is 😄 .
- Raise a more user-friendly error for unsupported types in category (I
have no idea what types we should check).
- Replace complicated objects by their __repr__ string during dumping,
so that categories become simple strings. However, it will not allow us to
restore original objects during loading, and the original DataFrame will be
modified.
- Utilize the pandas to_json() method
<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html>.
It will bring an unwanted dependency to the library and the need to carefully
maintain it. BTW, the pandas team is planning a huge refactoring of JSON
support: pandas-dev/pandas#12004 (comment).
Maybe someone else has other ideas?
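To illustrate where a "Circular reference detected" message can come from, here is a standard-library-only sketch. It assumes a fallback serializer that hands unsupported objects back unchanged; `passthrough_default` is a hypothetical stand-in, not the actual LightGBM function, and `Bin` is a made-up class standing in for a complex category such as pandas.Interval. The second fallback, `repr_default`, sketches the `__repr__` option from the list above.

```python
import json


class Bin:
    """Made-up stand-in for a complex category such as pandas.Interval."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    def __repr__(self):
        return "({}, {}]".format(self.left, self.right)


def passthrough_default(obj):
    # Assumed behaviour: return unsupported objects unchanged.
    return obj


def repr_default(obj):
    # __repr__ option: fall back to the object's repr string.
    return repr(obj)


categories = [[Bin(0, 1), Bin(1, 2)]]

try:
    json.dumps(categories, default=passthrough_default)
    error_message = None
except ValueError as exc:
    # json sees the same unserializable object again and reports a cycle.
    error_message = str(exc)

# With the repr fallback the categories are dumped as plain strings.
dumped = json.dumps(categories, default=repr_default)
```

As noted in the list, loading `dumped` back yields strings, not the original objects, so this route is lossy.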
|
@mallibus Thanks a lot for your feedback! |
Closing in favor of being in #2302. We decided to keep all feature requests in one place. Welcome to contribute this feature! Please re-open (or post a comment if you are not a topic starter) this issue if you are actively working on implementing this feature. |
I am converting one or more columns of float64 into categorical bins to speed up the convergence and force the boundaries of the decision points. Attempting to bin the float columns with pd.cut or pd.qcut
Environment info
Operating System: Windows 10
CPU/GPU model: Intel Core i7
C++/Python/R version: Python 3.6, Anaconda, Jupyter Notebook, pandas 0.24.2
LightGBM version or commit hash: LightGBM 2.2.2
Error message
ValueError: Circular reference detected
Reproducible examples
Steps to reproduce
df['Float_a'] = pd.cut(x=df['Float_a'],bins=10)
there is no error
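A possible user-side workaround, sketched under the assumption that only pandas is involved (the column name `Float_a` follows the snippet above; the data is made up): `pd.cut` can return integer bin indices instead of Interval objects via `labels=False`, or the integer codes can be taken from an existing categorical with `.cat.codes`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"Float_a": rng.random(100)})  # made-up data

# labels=False returns integer bin indices instead of Interval categories.
codes = pd.cut(x=df["Float_a"], bins=10, labels=False)

# Alternative route: bin first, then take the categorical's integer codes.
binned = pd.cut(x=df["Float_a"], bins=10)
codes_from_cat = binned.cat.codes

# Store the bins as a categorical column with plain integer categories.
df["Float_a"] = codes.astype("category")
```

Whether the binned column is then passed as a categorical feature or as an ordinary integer feature is a modelling choice left to the user.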