[python] Faster categorical column names selection #4787
* Faster categorical column names selection: replace the slow and redundant dataframe query via select_dtypes with a list comprehension over dataframe.dtypes
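To make the change concrete, here is a minimal sketch comparing the two selection strategies on a small frame. The DataFrame and its column names are made up for illustration. The assumption behind the speed-up is that select_dtypes builds a new DataFrame holding the selected columns, whereas the comprehension only inspects dtype metadata.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical example data: two categorical columns and one integer column.
df = pd.DataFrame({
    "city": pd.Series(["a", "b", "a"], dtype="category"),
    "size": [1, 2, 3],
    "color": pd.Series(["r", "g", "r"], dtype="category"),
})

# Old approach: select_dtypes returns a sub-frame, of which only .columns is used.
old_cat_cols = list(df.select_dtypes(include=["category"]).columns)

# New approach: walk (column name, dtype) pairs and check the dtype object itself.
new_cat_cols = [
    col for col, dtype in zip(df.columns, df.dtypes)
    if isinstance(dtype, CategoricalDtype)
]
```

Both lists should contain the same column names in the same order; only the amount of intermediate work differs.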
Thank you for your contribution, @Neronuser! Since pandas isn't a hard dependency, we use a compat module to import things from it. I've suggested the changes to the basic.py file. You'd also have to extend the existing import in the compat module,
from pandas.api.types import is_sparse as is_dtype_sparse
to add is_categorical_dtype, and define it as None after
is_dtype_sparse = None
python-package/lightgbm/basic.py
@@ -566,7 +567,7 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorica
             raise ValueError('Input data must be 2 dimensional and non empty.')
         if feature_name == 'auto' or feature_name is None:
             data = data.rename(columns=str)
-        cat_cols = list(data.select_dtypes(include=['category']).columns)
+        cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if isinstance(dtype, CategoricalDtype)]
-        cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if isinstance(dtype, CategoricalDtype)]
+        cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if is_categorical_dtype(dtype)]
Thanks @jmoralez! I updated the PR with the required changes, but instead of importing is_categorical_dtype from pandas I created a dummy CategoricalDtype object. I did this because is_categorical_dtype also checks whether arrays have a categorical dtype, which introduces minor overhead in this case. Is that OK, or do you insist on using is_categorical_dtype?
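A rough sketch of the dummy-class idea described above, not the code that was merged: when pandas is missing, a placeholder class stands in for CategoricalDtype so that the isinstance check still runs and simply never matches a real dtype. PANDAS_INSTALLED and categorical_columns are hypothetical names used only for this illustration.

```python
try:
    from pandas.api.types import CategoricalDtype as pd_CategoricalDtype
    PANDAS_INSTALLED = True  # hypothetical flag, for illustration only
except ImportError:
    PANDAS_INSTALLED = False

    class pd_CategoricalDtype:
        """Placeholder used when pandas is not available."""

        def __init__(self, *args, **kwargs):
            pass


def categorical_columns(columns, dtypes):
    """Return the names whose dtype object is a CategoricalDtype instance.

    Hypothetical helper: a plain isinstance check on each dtype object,
    without inspecting any array data.
    """
    return [
        col for col, dtype in zip(columns, dtypes)
        if isinstance(dtype, pd_CategoricalDtype)
    ]
```

With a DataFrame df, the call would be categorical_columns(df.columns, df.dtypes). The same code path works whether or not pandas is installed, which is the point of routing the import through a compat module.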
That's ok. Thank you for the explanation!
Thank you @Neronuser, looks good to me! Gently pinging @StrikerRUS for a review as well.
LGTM! Thanks for this improvement!
I also checked that CategoricalDtype has been importable in this way at least since pandas version 0.20, which was released in 2017. So I believe we are good with backward compatibility.
https://github.com/pandas-dev/pandas/blob/0.20.x/pandas/api/types/__init__.py#L4
Sorry, I was wrong in my previous comment. You can import CategoricalDtype directly from the root only since version 0.24. For previous versions you should specify the full path:
from pandas.api.types import CategoricalDtype
python-package/lightgbm/compat.py
@@ -6,6 +6,7 @@
     from pandas import DataFrame as pd_DataFrame
     from pandas import Series as pd_Series
     from pandas import concat
+    from pandas.api.types import CategoricalDtype as pd_CategoricalDtype
It's fine that this import works with the latest version. But would it make sense to first try an import like from pandas import CategoricalDtype as pd_CategoricalDtype, in case they change the internal structure of the modules in the future?
Refer to LightGBM/python-package/lightgbm/compat.py, lines 68 to 73 in 99e0a4b:
try:
    from sklearn.exceptions import NotFittedError
    from sklearn.model_selection import GroupKFold, StratifiedKFold
except ImportError:
    from sklearn.cross_validation import GroupKFold, StratifiedKFold
    from sklearn.utils.validation import NotFittedError
WDYT?
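Applied to pandas, the same try/except pattern might look like the sketch below. It is an illustration of the suggestion, not the code that ended up in compat.py. IMPORT_SOURCE is a hypothetical variable added only to show which path succeeded, and setting the name to None when pandas is absent is an assumption made for this example.

```python
try:
    # Preferred: top-level name, importable since pandas 0.24 according to this thread.
    from pandas import CategoricalDtype as pd_CategoricalDtype
    IMPORT_SOURCE = "pandas"  # hypothetical, for illustration only
except ImportError:
    try:
        # Fallback: full path, needed for older pandas versions.
        from pandas.api.types import CategoricalDtype as pd_CategoricalDtype
        IMPORT_SOURCE = "pandas.api.types"
    except ImportError:
        # Neither path worked: pandas is missing or too old.
        pd_CategoricalDtype = None
        IMPORT_SOURCE = None
```

Trying the top-level import first means the fallback only runs on older installations, so a future reshuffle of pandas.api.types would not affect users on recent versions.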
Yes, this makes sense, thank you. I'm not sure, but it feels more likely that they will change pandas.api.types than the top-level import of their types, from pandas import CategoricalDtype, especially given that their current top-level init goes into pandas.core.api for CategoricalDtype.
LGTM, thank you very much!
I'd like to ask why this change was merged without a subsequent release. I am currently dealing with a large dataset that contains multiple categorical features, and the implementation in version 3.3.5 results in an unnecessary increase in memory usage. It would greatly benefit me to have this change included in a released version.
Subscribe to #5153 to be notified of the next release. There's nothing specific to this change keeping it out of releases. In general, we have some challenges with maintainer availability in this project that have led to such a long delay between releases. We're trying to get a release out in the next few months; sorry for the inconvenience.
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.