[python-package] Better column dtype logging when column has "bad dtype" #5065


hsorsky
Contributor

@hsorsky hsorsky commented Mar 10, 2022

Closes #5064

@hsorsky
Contributor Author

hsorsky commented Mar 10, 2022

Current failing tests look (seemingly) unrelated to this PR AFAICT. It's failing on test_create_tree_digraph

@jmoralez
Collaborator

Thank you for your contribution! What do you think about changing the _get_bad_pandas_dtypes function instead? It could return something like [f'{name}: {dtype}' for name, dtype in dtypes.iteritems() if not is_allowed_numpy_dtype(dtype.type)], and we could then use that list like this:

bad_dtypes = _get_bad_pandas_dtypes(df.dtypes)
if bad_dtypes:
    raise ValueError(f"Bad dtypes: {', '.join(bad_dtypes)}")

@hsorsky
Contributor Author

hsorsky commented Mar 10, 2022

> What do you think about changing the _get_bad_pandas_dtypes function instead?

The reason I didn't initially do that is that _get_bad_pandas_dtypes is used both on the output of DataFrame.dtypes (i.e. a Series) and on [Series.dtypes] (i.e. a list of dtype objects); in the latter case there is, unfortunately, no attached column name. I wanted to start off with the PR being as unintrusive to the rest of the code as possible, but if maintainers are happy to update either _get_bad_pandas_dtypes to handle that case, or the code that calls it like _get_bad_pandas_dtypes([Series.dtypes]) to ensure the data is passed in as a Series of dtypes (with just one element), then I'm happy to make those changes.

My opinion is that the latter is cleaner and that we should integrate that with your approach.

My thinking for making the func get called on a series would be to do _get_bad_pandas_dtypes(Series.to_frame().dtypes) instead of _get_bad_pandas_dtypes([Series.dtypes])
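
To illustrate the difference between the two call styles, here is a small sketch. The Series name `target` is made up for the example.

```python
import pandas as pd

label = pd.Series([0, 1, 0], name="target")

# A one-element list holding the dtype: the Series name is not attached.
as_list = [label.dtypes]

# A one-element Series of dtypes, indexed by the original Series name.
as_dtypes_series = label.to_frame().dtypes

for name, dtype in as_dtypes_series.items():
    print(f"{name}: {dtype}")
```

If the Series has no name, `to_frame()` labels the single column `0`, so that is what would appear in place of the name.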

@jmoralez
Collaborator

jmoralez commented Mar 10, 2022

I'm +1 on using Series.to_frame().dtypes for the label case; it seems to be very fast and allows the function to stay the same. Let's wait for another opinion though.

@StrikerRUS
Collaborator

I'm for moving duplicated post-processing code into the _get_bad_pandas_dtypes() function.

@StrikerRUS
Collaborator

@hsorsky

> Current failing tests look (seemingly) unrelated to this PR AFAICT. It's failing on test_create_tree_digraph

Sorry for the inconvenience! The failures should be fixed by #5068.

@hsorsky
Contributor Author

hsorsky commented Mar 11, 2022

> I'm for moving duplicated post-processing code into the _get_bad_pandas_dtypes() function.

Just to confirm before I do any more work on this tomorrow: do you mean moving everything, including error raising, inside the function? I'd be fine with that, as it removes a large amount of similar code and we don't lose much in the error message (we can treat them all like the DataFrame-based error messages, and it should be obvious from the stack trace that the input was in fact a Series in the single-Series case).

@StrikerRUS
Collaborator

Sounds like a good plan! Also, to avoid confusing users with the words DataFrame/Series, maybe we can refer to dtypes as "pandas dtypes" or something similar?

Co-authored-by: José Morales <jmoralz92@gmail.com>
Collaborator

@StrikerRUS StrikerRUS left a comment


LGTM, thank you so much for the help!
Just one nit below:

python-package/lightgbm/basic.py (outdated, resolved)
@jameslamb jameslamb changed the title from Better column dtype logging when column has "bad dtype" to [python-package] Better column dtype logging when column has "bad dtype" on Mar 13, 2022
Collaborator

@jameslamb jameslamb left a comment


Very nice change, thanks for the help with this! I also approve, and think this should be merged once @StrikerRUS's one additional comment has been addressed.

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Collaborator

@jmoralez jmoralez left a comment


Thank you for your contribution!

@hsorsky
Contributor Author

hsorsky commented Mar 14, 2022

thanks for the reviews!

@StrikerRUS StrikerRUS merged commit c043be1 into microsoft:master Mar 15, 2022
@hsorsky hsorsky deleted the add-better-error-messages-for-bad-pandas-dtypes branch March 15, 2022 15:09
@jameslamb jameslamb mentioned this pull request Oct 7, 2022
@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Development

Successfully merging this pull request may close these issues.

[Python] Improve bad pandas dtype error messages
4 participants