Better error for str.cat with listlike of wrong dtype. #26607

h-vetinari · 2019-06-01T15:51:00Z

closes Improve TypeError message for str.cat #22722
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This had been blocked on #23167.

codecov · 2019-06-01T16:46:07Z

Codecov Report

Merging #26607 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26607      +/-   ##
==========================================
- Coverage   91.85%   91.84%   -0.01%     
==========================================
  Files         174      174              
  Lines       50707    50709       +2     
==========================================
- Hits        46578    46576       -2     
- Misses       4129     4133       +4

Flag	Coverage Δ
#multiple	`90.39% <100%> (ø)`	⬆️
#single	`41.76% <0%> (-0.09%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/strings.py	`98.92% <100%> (ø)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`97% <0%> (-0.12%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0dbb99e...81451bd.

codecov · 2019-06-01T16:46:07Z

Codecov Report

Merging #26607 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26607      +/-   ##
==========================================
- Coverage   91.86%   91.85%   -0.01%     
==========================================
  Files         179      179              
  Lines       50700    50709       +9     
==========================================
+ Hits        46574    46579       +5     
- Misses       4126     4130       +4

Flag	Coverage Δ
#multiple	`90.44% <100%> (ø)`	⬆️
#single	`41.09% <21.42%> (-0.1%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/strings.py	`98.93% <100%> (+0.01%)`	⬆️
pandas/io/gbq.py	`88.88% <0%> (-11.12%)`	⬇️
pandas/core/frame.py	`96.88% <0%> (-0.12%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a60888c...9752aa7.

gfyoung · 2019-06-01T17:31:31Z

pandas/core/strings.py

+               for x in others):
+            # data has already been checked by str-accessor
+            raise TypeError('Can only concatenate list-likes containing only '
+                            'strings (or missing values)!')


No exclamation points

pandas/core/strings.py

pandas/tests/test_strings.py

doc/source/whatsnew/v0.25.0.rst

simonjayhawkins

@h-vetinari my main concern is the complexity of the condition.

pandas/core/strings.py

pandas/tests/test_strings.py

simonjayhawkins · 2019-06-01T22:04:41Z

pandas/tests/test_strings.py

@@ -420,6 +420,16 @@ def test_str_cat_categorical(self, box, dtype_caller, dtype_target, sep):
            result = s.str.cat(t, sep=sep)
            assert_series_or_index_equal(result, expected)

+    @pytest.mark.parametrize('box', [Series, Index, np.array, list])


do these cover the different conditions for raising?

IMO yes. It would be possible to add more dtypes, but I don't consider this a great improvement.

Happy to add if you don't mind combinatorially more tests.

so if i modify the condition, we have tests that fail?

I don't understand what you mean here. Adding different containers (as long as they are legal inputs to .str.cat) is possible, but I've already added the most important ones.

It would be possible to add a further parametrization over the dtype and not just test that integer Series fail (as currently), but also floats, complex, datetime etc. This would work without failure.

You'd have to be more concrete with "modify the condition" for me to answer more clearly. This test serves as a relatively minimal example from #22722. If you want more dtypes, I'm happy to add them. Anything else you'll need to explain to me in more detail.

not raising the message on an invalid case would just result in the old message being raised. raising on valid cases is a regression.

would it not be simpler to catch the message and reraise?

@h-vetinari thanks for the explanation. i'm not convinced that for the sake of a better error message we should be adding this complexity, especially if the old message is still raised in some cases.

@jreback

i'm not convinced that for the sake of a better error message we should be adding this complexity, especially if the old message is still raised in some cases.

Looking at #22722, I think this is a clear win - it covers 99% (guesstimate) of real-world cases where this error happens (putting a Series with a transparently incompatible dtype into .str.cat). If someone actually has mixed strings / integers in their Series, it just won't fail as gracefully. But that doesn't imply (to me) that we shouldn't aim to do better in the 99%.

Side note: if your only worry is an inconsistent error message, I could have the 99%-fail-early-check and catch-and-reraise afterwards (with the same error message).

best to wait for feedback from other maintainers before you make any changes on my account.

Too late. ;-)

simonjayhawkins · 2019-06-02T16:58:26Z

pandas/core/strings.py

+                # no NaNs - can just concatenate
+                result = cat_core(all_cols, sep)
+        except TypeError as exc:
+            if re.match((r'can only concatenate str \(not [\w\"]+) to str'


if we reach here, we already have a user-facing TypeError? and we are just replacing the message.

are any other TypeErrors expected? if not, i'd be inclined to drop the message check.

if it is necessary, maybe have a _is_not_string_exception in pandas\core\dtypes\common.py similar to _is_unorderable_exception rather than have the condition here.

are any other TypeErrors expected? if not, i'd be inclined to drop the message check.

Fair enough, that's a not gonna happen.
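
For context on the message check discussed in this thread, here is a minimal sketch of what such a helper could look like. The name `is_not_string_exception` is hypothetical (modelled on the `_is_not_string_exception` suggested above), and the pattern assumes the wording CPython 3.7+ uses when a non-str is added to a str. Neither is taken from the pandas code base.

```python
import re

# Assumption: CPython 3.7+ wording, e.g.
# 'can only concatenate str (not "int") to str'.
_NOT_STR_PATTERN = re.compile(r'can only concatenate str \(not "?\w+"?\) to str')


def is_not_string_exception(exc: BaseException) -> bool:
    """Hypothetical helper: True if exc looks like a failed str + non-str."""
    return isinstance(exc, TypeError) and bool(_NOT_STR_PATTERN.search(str(exc)))


try:
    "a" + 1  # deliberately invalid concatenation
except TypeError as err:
    caught = err  # keep a reference; `err` is cleared after the except block

print(is_not_string_exception(caught))
print(is_not_string_exception(TypeError("unrelated")))
```

The interpreter's wording is an implementation detail, so matching on it is brittle. That is consistent with the preference expressed above for not checking the message at all.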

doc/source/whatsnew/v0.25.0.rst

jreback · 2019-06-02T23:36:40Z

pandas/core/strings.py

+
+                not_masked = ~union_mask
+                result[not_masked] = cat_core([x[not_masked]
+                                               for x in all_cols], sep)


can you limit the try/except to the relevant code

I am; cat_core is called in all three branches, and may raise in any one of them.

so rather than having a giant try/except, either add this check in the cat_core, or write a function which calls cat_core and catches and formats a nicer error

maybe try to combine the _legal_dtype check with this, IOW. you can just try to cat them then catch an error, at which point you can then infer and give a nice message.

@jreback: so rather than having a giant try/except, either add this check in the cat_core, or write a function which calls cat_core and catches and formats a nicer error
maybe try to combine the _legal_dtype check with this, IOW. you can just try to cat them then catch an error, at which point you can then infer and give a nice message.

Moved try/except to cat_safe, which wraps around cat_core to do what you describe in the second sentence.
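
The wrapper pattern agreed on here can be sketched in plain Python. This is a simplified stand-in, not the code from the PR: columns are ordinary lists rather than object arrays, `infer_kind` is a hypothetical replacement for pandas' dtype inference, and missing values are assumed to have been masked out beforehand.

```python
from typing import List


def cat_core(list_of_columns: List, sep: str) -> List[str]:
    """Join equally long columns element-wise (pure-Python stand-in)."""
    result = []
    for row in zip(*list_of_columns):
        joined = row[0]
        for value in row[1:]:
            # str + non-str raises TypeError here
            joined = joined + sep + value
        result.append(joined)
    return result


def infer_kind(column) -> str:
    """Hypothetical, heavily simplified stand-in for dtype inference."""
    kinds = {type(value).__name__ for value in column if value is not None}
    if not kinds:
        return "empty"
    if kinds == {"str"}:
        return "string"
    return "mixed" if len(kinds) > 1 else kinds.pop()


def cat_safe(list_of_columns: List, sep: str) -> List[str]:
    """Same signature as cat_core, but re-raises with a clearer message."""
    try:
        return cat_core(list_of_columns, sep)
    except TypeError:
        # Only inspect the columns once concatenation has already failed.
        for column in list_of_columns:
            kind = infer_kind(column)
            if kind not in ("string", "empty"):
                raise TypeError(
                    "Concatenation requires list-likes containing only "
                    "strings (or missing values). Offending values found in "
                    "column {}".format(kind)
                ) from None
        raise  # no offending column identified: keep the original error


print(cat_safe([["a", "b"], ["c", "d"]], "-"))
```

The happy path pays no extra cost, since inference only runs after a failure. The trailing bare `raise` keeps the original error if no column can be blamed.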

jreback · 2019-06-02T23:36:55Z

pandas/core/strings.py

+        # but others could still have Series of dtypes (e.g. integers) which
+        # will necessarily fail in concatenation. To avoid deep and confusing
+        # traces, we raise here for anything that's not object or all-NA float.
+        def _legal_dtype(series):


this is way complicated, what exactly are you trying to do here

I originally had an inline condition within any, but @simonjayhawkins found this too complex, so I broke out that condition into a function.

Basically, I want to fail early for any Series that will necessarily fail concatenation (based on dtype). Object must obviously be allowed, but also all-NA float (which can happen if two Series completely misalign), plus needs handling of categorical.

do you not already have the inferred types at this point? and if you don't, why not just infer them, then this condition becomes easier

I originally had an inline condition within any, but @simonjayhawkins found this too complex, so I broke out that condition into a function.

the any is applied to a generator expression with a for loop and then raising. so i'm not sure the use of any here is a benefit. could you not just use a for loop and do away with the separate function?

@jreback: do you not already have the inferred types at this point? and if you don't, why not just infer them, then this condition becomes easier

The dtype has only been inferred for data at this point, not for others. I want to avoid inferring for other, as that would lead to another (not insignificant) perf-hit, whereas reading out the dtypes is trivial.

h-vetinari

Responses

doc/source/whatsnew/v0.25.0.rst

h-vetinari

@jreback @simonjayhawkins

I moved the try/except + raise into a separate wrapper, as @jreback suggested. This allows removing the extra check block I had introduced.

However, removing that extra check has the edge case that wrongly dtyped data that is completely misaligned will slip through (i.e. not raise):

>>> s = pd.Series(['a', 'b', 'c'])
>>> t = pd.Series([0, 1, 2], index=[10, 11, 12])
>>> s.str.cat(t, join='left')
0    NaN
1    NaN
2    NaN
dtype: object

It is a downside, but I would be fine with it, not least considering the much less complicated code now.
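
Why the misaligned case slips through can be illustrated with a small pure-Python model. This is only a sketch under simplifying assumptions: `reindex` and `cat_left` are hypothetical helpers, `None` stands in for NaN, and real alignment in pandas is considerably more involved.

```python
def reindex(values, index, target_index):
    """Look up each target label; labels not present become None (missing)."""
    lookup = dict(zip(index, values))
    return [lookup.get(label) for label in target_index]


def cat_left(data, data_index, other, other_index, sep=""):
    """Left-join `other` onto `data` by label, then concatenate row-wise."""
    aligned = reindex(other, other_index, data_index)
    result = []
    for left, right in zip(data, aligned):
        if left is None or right is None:
            # Missing on either side: the row is skipped, so nothing
            # ever tries to add the integers to a string.
            result.append(None)
        else:
            result.append(left + sep + right)
    return result


# Completely misaligned labels: every row is missing, no TypeError is raised.
print(cat_left(["a", "b", "c"], [0, 1, 2], [0, 1, 2], [10, 11, 12]))
```

With matching labels the same call reaches `left + sep + right` with an integer on the right and fails with a TypeError, which is the case the wrapper catches.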

jreback

looks much more reasonable. typing & doc-string comments, ping on green.

jreback · 2019-06-12T14:41:58Z

pandas/core/strings.py

@@ -53,6 +53,27 @@ def cat_core(list_of_columns, sep):
    return np.sum(list_with_sep, axis=0)


+def cat_safe(list_of_columns, sep):


can you type the args & return value & add a Parameters / Returns section

What's your expectation for typing the args? Just List? It would strictly speaking be List[np.array], but AFAICT, mypy (or rather the typing module) doesn't yet support numpy stubs natively.

yes that would be fine

jreback · 2019-06-12T14:42:20Z

pandas/core/strings.py

+    Same signature as cat_core, but handles TypeErrors in concatenation, which
+    happen if the Series in list_of columns have the wrong dtypes or content.
+    """
+    # if there are any non-string values (wrong dtype or hidden behind object


move the comment to the except

jreback · 2019-06-12T14:43:17Z

doc/source/whatsnew/v0.25.0.rst

@@ -607,7 +607,7 @@ Strings
 ^^^^^^^

 - Bug in the ``__name__`` attribute of several methods of :class:`Series.str`, which were set incorrectly (:issue:`23551`)
-
+- Improved error message when passing ``Series`` of wrong dtype to :meth:`Series.str.cat` (:issue:`22722`)


use :class:`Series`

pandas/core/strings.py

h-vetinari

Typing comments

simonjayhawkins · 2019-06-13T11:44:21Z

@h-vetinari if you merge master the checks ci failure should be fixed. I think i've seen the Windows py37_np141 failure before. probably a flaky test. likely pass next time around.

jreback

typing request, otherwise lgtm. ping on green.

simonjayhawkins

much more comfortable without the pre-check to fail early and simpler code.

Thanks @h-vetinari

simonjayhawkins · 2019-06-14T11:10:47Z

pandas/core/strings.py

+                raise TypeError(
+                    'Concatenation requires list-likes containing only '
+                    'strings (or missing values). Offending values found in '
+                    'column {}'.format(dtype)) from None


is it worth having a raise outside the for loop to ensure we don't slip through, or is that not going to happen?

jreback · 2019-06-14T12:27:05Z

thanks @h-vetinari

gfyoung added the Error Reporting and Series labels Jun 1, 2019

gfyoung reviewed Jun 1, 2019

pandas/core/strings.py

gfyoung reviewed Jun 1, 2019

pandas/tests/test_strings.py

gfyoung reviewed Jun 1, 2019

doc/source/whatsnew/v0.25.0.rst

h-vetinari force-pushed the str_cat_err branch from 48ff187 to 55817c7 on June 1, 2019 19:46

simonjayhawkins requested changes Jun 1, 2019

pandas/core/strings.py

pandas/tests/test_strings.py

pandas/tests/test_strings.py

simonjayhawkins added this to the 0.25.0 milestone Jun 1, 2019

simonjayhawkins reviewed Jun 1, 2019

simonjayhawkins reviewed Jun 2, 2019

Better TypeError for wrong dtype in str.cat

cd9aa24

h-vetinari force-pushed the str_cat_err branch from 0057702 to cd9aa24 on June 2, 2019 17:35

jreback requested changes Jun 2, 2019

jreback removed this from the 0.25.0 milestone Jun 2, 2019

h-vetinari commented Jun 3, 2019

jreback requested changes Jun 6, 2019

h-vetinari added 2 commits June 10, 2019 14:53

Merge remote-tracking branch 'upstream/master' into str_cat_err

fee9612

Review (jreback)

fd710de

h-vetinari commented Jun 10, 2019

Merge remote-tracking branch 'upstream/master' into str_cat_err

e7f0d7e

jreback requested changes Jun 12, 2019

simonjayhawkins reviewed Jun 12, 2019

pandas/core/strings.py

Merge remote-tracking branch 'upstream/master' into str_cat_err

bfca6d1

h-vetinari commented Jun 12, 2019

Review (jreback & simonjayhawkins)

02f6429

Merge remote-tracking branch 'upstream/master' into str_cat_err

cb73704

jreback requested changes Jun 13, 2019

jreback added this to the 0.25.0 milestone Jun 13, 2019

h-vetinari added 2 commits June 13, 2019 21:59

Merge remote-tracking branch 'upstream/master' into str_cat_err

3fb1411

Add typing to cat_core/cat_safe (review jreback)

9752aa7

simonjayhawkins approved these changes Jun 14, 2019

jreback approved these changes Jun 14, 2019

jreback merged commit 5d0ff69 into pandas-dev:master Jun 14, 2019

h-vetinari deleted the str_cat_err branch June 14, 2019 18:16

h-vetinari mentioned this pull request Jul 26, 2019

DEPR: execute deprecations for str.cat in v1.0 #27611

Merged

3 tasks

