-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better error for str.cat with listlike of wrong dtype. #26607
Conversation
Codecov Report
@@ Coverage Diff @@
## master #26607 +/- ##
==========================================
- Coverage 91.85% 91.84% -0.01%
==========================================
Files 174 174
Lines 50707 50709 +2
==========================================
- Hits 46578 46576 -2
- Misses 4129 4133 +4
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26607 +/- ##
==========================================
- Coverage 91.86% 91.85% -0.01%
==========================================
Files 179 179
Lines 50700 50709 +9
==========================================
+ Hits 46574 46579 +5
- Misses 4126 4130 +4
Continue to review full report at Codecov.
|
pandas/core/strings.py
Outdated
for x in others): | ||
# data has already been checked by str-accessor | ||
raise TypeError('Can only concatenate list-likes containing only ' | ||
'strings (or missing values)!') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No exclamation points
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@h-vetinari my main concern is the complexity of the condition.
pandas/tests/test_strings.py
Outdated
@@ -420,6 +420,16 @@ def test_str_cat_categorical(self, box, dtype_caller, dtype_target, sep): | |||
result = s.str.cat(t, sep=sep) | |||
assert_series_or_index_equal(result, expected) | |||
|
|||
@pytest.mark.parametrize('box', [Series, Index, np.array, list]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do these cover the different conditions for raising?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO yes. It would be possible to add more dtypes, but I don't consider this a great improvement.
Happy to add if you don't mind combinatorially more tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so if i modify the condition, we have tests that fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what you mean here. Adding different containers (as long as they are legal inputs to .str.cat
) is possible, but I've already added the most import ones.
It would be possible add a further parametrization over the dtype and not just test that integer Series fail (as currently), but also floats, complex, datetime etc. This would work without failure.
You'd have to be more concrete with "modify the condition" for me to answer more clearly. This test serves as a relatively minimal example from #22722. If you want more dtypes, I'm happy to add them. Anything else you'll need to explain to me in more detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not raising the message on an invalid case would just result in the old message be raised. rasing on valid cases is a regression.
would it not be simpler to catch the message and reraise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@h-vetinari thanks for the explanation. i'm not convinced that for the sake of a better error message we should be adding this complexity, especially if the old message is still raised in some cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not convinced that for the sake of a better error message we should be adding this complexity, especially if the old message is still raised in some cases.
Looking at #22722, I think this is a clear win - it covers 99% (guesstimate) of real-world cases where this error happens (putting a Series with a transparently incompatible dtype into .str.cat
). If someone actually has mixed strings / integers in their Series, it just won't fail as gracefully. But that doesn't imply (to me) that we shouldn't aim to do better in the 99%.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note: if your only worry is an inconsistent error message, I could have the 99%-fail-early-check and catch-and-reraise afterwards (with the same error message).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
best to wait for feedback from other maintainers before you make any changes on my account.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too late. ;-)
pandas/core/strings.py
Outdated
# no NaNs - can just concatenate | ||
result = cat_core(all_cols, sep) | ||
except TypeError as exc: | ||
if re.match((r'can only concatenate str \(not [\w\"]+) to str' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we reach here, we already have a user-facing TypeError? and we are just replacing the message.
are any other TypeErrors expected, i'd be inclined to not check the message check.
if it is necessary, maybe have a _is_not_string_exception
in pandas\core\dtypes\common.py similar to _is_unorderable_exception
rather than have the condition here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are any other TypeErrors expected, i'd be inclined to not check the message check.
Fair enough, that's a not gonna happen.
pandas/core/strings.py
Outdated
|
||
not_masked = ~union_mask | ||
result[not_masked] = cat_core([x[not_masked] | ||
for x in all_cols], sep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you limit the try/except to the relevant code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am; cat_core
is called in all three branches, and may raise in any one of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so rather than having a giant try/except, either add this check in the cat_core, or write a function which calls cat_core and catches and formats a nicer error
maybe try to combine the _legal_dtype check with this, IOW. you can just try to cat them then catch an error, at which point you can then infer and give a nice message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback: so rather than having a giant try/except, either add this check in the cat_core, or write a function which calls cat_core and catches and formats a nicer error
maybe try to combine the _legal_dtype check with this, IOW. you can just try to cat them then catch an error, at which point you can then infer and give a nice message.
Moved try/except to cat_safe
, which wraps around cat_core
to do what you describe in the second sentence.
pandas/core/strings.py
Outdated
# but others could still have Series of dtypes (e.g. integers) which | ||
# will necessarily fail in concatenation. To avoid deep and confusing | ||
# traces, we raise here for anything that's not object or all-NA float. | ||
def _legal_dtype(series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is way complicated, what exactly are you trying to do here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally had an inline condition within any
, but @simonjayhawkins found this too complex, so I broke out that condition into a function.
Basically, I want to fail early for any Series that will necessarily fail concatenation (based on dtype). Object must obviously be allowed, but also all-NA float (which can happen if two Series completely misalign), plus needs handling of categorical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you not already have the inferred types at this point? and if you don't, why not just infer them, then this condition becomes easier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally had an inline condition within
any
, but @simonjayhawkins found this too complex, so I broke out that condition into a function.
the any is applied to a generator expesssion with a for loop and then raising. so i'm not sure the use of any here is a benefit. could you not just use a for loop and do away with the separate function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback: do you not already have the inferred types at this point? and if you don't, why not just infer them, then this condition becomes easier
The dtype has only been inferred for data
at this point, not for others
. I want to avoid inferring for other
, as that would lead to another (not insignificant) perf-hit, whereas reading out the dtypes is trivial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Responses
pandas/core/strings.py
Outdated
# but others could still have Series of dtypes (e.g. integers) which | ||
# will necessarily fail in concatenation. To avoid deep and confusing | ||
# traces, we raise here for anything that's not object or all-NA float. | ||
def _legal_dtype(series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally had an inline condition within any
, but @simonjayhawkins found this too complex, so I broke out that condition into a function.
Basically, I want to fail early for any Series that will necessarily fail concatenation (based on dtype). Object must obviously be allowed, but also all-NA float (which can happen if two Series completely misalign), plus needs handling of categorical.
pandas/core/strings.py
Outdated
|
||
not_masked = ~union_mask | ||
result[not_masked] = cat_core([x[not_masked] | ||
for x in all_cols], sep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am; cat_core
is called in all three branches, and may raise in any one of them.
pandas/core/strings.py
Outdated
# but others could still have Series of dtypes (e.g. integers) which | ||
# will necessarily fail in concatenation. To avoid deep and confusing | ||
# traces, we raise here for anything that's not object or all-NA float. | ||
def _legal_dtype(series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you not already have the inferred types at this point? and if you don't, why not just infer them, then this condition becomes easier
pandas/core/strings.py
Outdated
|
||
not_masked = ~union_mask | ||
result[not_masked] = cat_core([x[not_masked] | ||
for x in all_cols], sep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so rather than having a giant try/except, either add this check in the cat_core, or write a function which calls cat_core and catches and formats a nicer error
maybe try to combine the _legal_dtype check with this, IOW. you can just try to cat them then catch an error, at which point you can then infer and give a nice message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the try/except + raise into a separate wrapper, as @jreback suggested. This allows to remove the extra check block I had introduced.
However, removing that extra check has the edge case that wrongly dtyped data that is completely misaligned will slip through (i.e. not raise):
>>> s = pd.Series(['a', 'b', 'c'])
>>> t = pd.Series([0, 1, 2], index=[10, 11, 12])
>>> s.str.cat(t, join='left')
0 NaN
1 NaN
2 NaN
dtype: object
It is a downside, but I would be fine with it, not least considering the much less complicated code now.
pandas/core/strings.py
Outdated
# but others could still have Series of dtypes (e.g. integers) which | ||
# will necessarily fail in concatenation. To avoid deep and confusing | ||
# traces, we raise here for anything that's not object or all-NA float. | ||
def _legal_dtype(series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback: do you not already have the inferred types at this point? and if you don't, why not just infer them, then this condition becomes easier
The dtype has only been inferred for data
at this point, not for others
. I want to avoid inferring for other
, as that would lead to another (not insignificant) perf-hit, whereas reading out the dtypes is trivial.
pandas/core/strings.py
Outdated
|
||
not_masked = ~union_mask | ||
result[not_masked] = cat_core([x[not_masked] | ||
for x in all_cols], sep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback: so rather than having a giant try/except, either add this check in the cat_core, or write a function which calls cat_core and catches and formats a nicer error
maybe try to combine the _legal_dtype check with this, IOW. you can just try to cat them then catch an error, at which point you can then infer and give a nice message.
Moved try/except to cat_safe
, which wraps around cat_core
to do what you describe in the second sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks much more reasonable. typing & doc-string comments, ping on green.
pandas/core/strings.py
Outdated
@@ -53,6 +53,27 @@ def cat_core(list_of_columns, sep): | |||
return np.sum(list_with_sep, axis=0) | |||
|
|||
|
|||
def cat_safe(list_of_columns, sep): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type the args & return value & add a Parameters / Returns section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's your expectation for typing the args? Just List
? It would strictly speaking be List[np.array]
, but AFAICT, mypy resp. the typing module doesn't yet support numpy stubs natively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that would be fine
pandas/core/strings.py
Outdated
Same signature as cat_core, but handles TypeErrors in concatenation, which | ||
happen if the Series in list_of columns have the wrong dtypes or content. | ||
""" | ||
# if there are any non-string values (wrong dtype or hidden behind object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move the comment to the except
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -607,7 +607,7 @@ Strings | |||
^^^^^^^ | |||
|
|||
- Bug in the ``__name__`` attribute of several methods of :class:`Series.str`, which were set incorrectly (:issue:`23551`) | |||
- | |||
- Improved error message when passing ``Series`` of wrong dtype to :meth:`Series.str.cat` (:issue:`22722`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use :class:`Series`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typing comments
pandas/core/strings.py
Outdated
@@ -53,6 +53,27 @@ def cat_core(list_of_columns, sep): | |||
return np.sum(list_with_sep, axis=0) | |||
|
|||
|
|||
def cat_safe(list_of_columns, sep): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's your expectation for typing the args? Just List
? It would strictly speaking be List[np.array]
, but AFAICT, mypy resp. the typing module doesn't yet support numpy stubs natively.
@h-vetinari if you merge master the checks ci failure should be fixed. I think i've seen the Windows py37_np141 failure before. probably a flaky test. likely pass next time around. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typing request, otherwise lgtm. ping on green.
pandas/core/strings.py
Outdated
@@ -53,6 +53,27 @@ def cat_core(list_of_columns, sep): | |||
return np.sum(list_with_sep, axis=0) | |||
|
|||
|
|||
def cat_safe(list_of_columns, sep): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that would be fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
much more comfortable without the pre-check to fail early and simpler code.
Thanks @h-vetinari
raise TypeError( | ||
'Concatenation requires list-likes containing only ' | ||
'strings (or missing values). Offending values found in ' | ||
'column {}'.format(dtype)) from None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it worth having a raise
outside the for loop to ensure we don't slip through, or is that not going to happen?
thanks @h-vetinari |
git diff upstream/master -u -- "*.py" | flake8 --diff
This had been blocked on #23167.