
CLN/ERR: str.cat internals #22725

Merged
merged 9 commits into from
Oct 14, 2018
Conversation

h-vetinari
Contributor

@h-vetinari h-vetinari commented Sep 16, 2018

This is mainly a clean-up of internal methods for str.cat that I didn't want to touch within #20347.

As a side benefit of changing the implementation, this also solves #22721. Finally, I've also added a better message for TypeErrors (closes #22722)

closes #22721
closes #22722

  • tests modified / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Here's the ASV output. (The original implementation of this PR (see first commit) used more higher-level pandas functions like fillna, dropna, etc. and was up to three times slower, so I tweaked it some more; I actually believe that the last solution with interleave_sep is the most elegant anyway.)

       before           after         ratio
     [37455764]       [4d1710f1]
         10.9±1ms         9.11±1ms    ~0.83  strings.Cat.time_cat(0, ',', '-', 0.0)
         9.55±1ms         10.9±0ms    ~1.15  strings.Cat.time_cat(0, ',', '-', 0.001)
         12.5±2ms         14.1±2ms    ~1.12  strings.Cat.time_cat(0, ',', '-', 0.15)
         9.94±1ms       9.23±0.7ms     0.93  strings.Cat.time_cat(0, ',', None, 0.0)
         14.3±2ms         8.68±1ms    ~0.61  strings.Cat.time_cat(0, ',', None, 0.001)
-        13.7±1ms       11.7±0.8ms     0.86  strings.Cat.time_cat(0, ',', None, 0.15)
       9.11±0.7ms         7.81±2ms    ~0.86  strings.Cat.time_cat(0, None, '-', 0.0)
       9.38±0.8ms         10.9±1ms    ~1.17  strings.Cat.time_cat(0, None, '-', 0.001)
         11.4±2ms         10.9±2ms     0.96  strings.Cat.time_cat(0, None, '-', 0.15)
         13.4±2ms       9.38±0.6ms    ~0.70  strings.Cat.time_cat(0, None, None, 0.0)
         9.23±2ms         11.7±1ms    ~1.27  strings.Cat.time_cat(0, None, None, 0.001)
         10.2±2ms         10.9±1ms     1.08  strings.Cat.time_cat(0, None, None, 0.15)
-        70.3±4ms         54.7±4ms     0.78  strings.Cat.time_cat(3, ',', '-', 0.0)
         62.5±4ms        46.9±20ms    ~0.75  strings.Cat.time_cat(3, ',', '-', 0.001)
        93.8±10ms        66.4±10ms    ~0.71  strings.Cat.time_cat(3, ',', '-', 0.15)
         62.5±4ms         62.5±4ms     1.00  strings.Cat.time_cat(3, ',', None, 0.0)
+        46.9±4ms         85.9±4ms     1.83  strings.Cat.time_cat(3, ',', None, 0.001)
         52.1±2ms         50.8±6ms     0.97  strings.Cat.time_cat(3, ',', None, 0.15)
         50.8±8ms         54.7±6ms     1.08  strings.Cat.time_cat(3, None, '-', 0.0)
         54.7±4ms         62.5±4ms    ~1.14  strings.Cat.time_cat(3, None, '-', 0.001)
         62.5±5ms         54.7±3ms    ~0.88  strings.Cat.time_cat(3, None, '-', 0.15)
         46.9±4ms         39.1±0ms    ~0.83  strings.Cat.time_cat(3, None, None, 0.0)
         46.9±8ms         58.6±6ms    ~1.25  strings.Cat.time_cat(3, None, None, 0.001)
         46.9±9ms         54.7±4ms    ~1.17  strings.Cat.time_cat(3, None, None, 0.15)
                                                                  ^  ^     ^     ^
                                                                  |  |     |     |
                                                         other_cols  |   na_rep  |
                                                                     |           |
                                                                    sep        na_frac

There's a bunch of noise in there, but by and large, things don't look so bad IMO, especially when one excludes the not-so-common worst-case scenario of a very small but non-zero fraction of NaNs (na_frac=0.001):

       before           after         ratio
     [37455764]       [4d1710f1]
         10.9±1ms         9.11±1ms    ~0.83  strings.Cat.time_cat(0, ',', '-', 0.0)
         12.5±2ms         14.1±2ms    ~1.12  strings.Cat.time_cat(0, ',', '-', 0.15)
         9.94±1ms       9.23±0.7ms     0.93  strings.Cat.time_cat(0, ',', None, 0.0)
-        13.7±1ms       11.7±0.8ms     0.86  strings.Cat.time_cat(0, ',', None, 0.15)
       9.11±0.7ms         7.81±2ms    ~0.86  strings.Cat.time_cat(0, None, '-', 0.0)
         11.4±2ms         10.9±2ms     0.96  strings.Cat.time_cat(0, None, '-', 0.15)
         13.4±2ms       9.38±0.6ms    ~0.70  strings.Cat.time_cat(0, None, None, 0.0)
         10.2±2ms         10.9±1ms     1.08  strings.Cat.time_cat(0, None, None, 0.15)
-        70.3±4ms         54.7±4ms     0.78  strings.Cat.time_cat(3, ',', '-', 0.0)
         62.5±4ms        46.9±20ms    ~0.75  strings.Cat.time_cat(3, ',', '-', 0.001)
        93.8±10ms        66.4±10ms    ~0.71  strings.Cat.time_cat(3, ',', '-', 0.15)
         62.5±4ms         62.5±4ms     1.00  strings.Cat.time_cat(3, ',', None, 0.0)
         52.1±2ms         50.8±6ms     0.97  strings.Cat.time_cat(3, ',', None, 0.15)
         50.8±8ms         54.7±6ms     1.08  strings.Cat.time_cat(3, None, '-', 0.0)
         62.5±5ms         54.7±3ms    ~0.88  strings.Cat.time_cat(3, None, '-', 0.15)
         46.9±4ms         39.1±0ms    ~0.83  strings.Cat.time_cat(3, None, None, 0.0)
         46.9±9ms         54.7±4ms    ~1.17  strings.Cat.time_cat(3, None, None, 0.15)
                                                                  ^  ^     ^     ^
                                                                  |  |     |     |
                                                         other_cols  |   na_rep  |
                                                                     |           |
                                                                    sep        na_frac

@pep8speaks

pep8speaks commented Sep 16, 2018

Hello @h-vetinari! Thanks for updating the PR.

Comment last updated on September 17, 2018 at 09:16 Hours UTC

@gfyoung gfyoung added Strings String extension data type and string data Error Reporting Incorrect or improved errors from pandas Clean labels Sep 16, 2018
@gfyoung
Member

gfyoung commented Sep 16, 2018

@WillAyd @jreback : The conversations went a little stale in the original issues, and I'm not sure how well aligned these changes are with what you guys were suggesting or saying in them.

@h-vetinari
Contributor Author

h-vetinari commented Sep 16, 2018

@gfyoung @WillAyd @jreback
I opened #22721 because the behaviour was changed through this refactor. As I said in the issue, I'm happy to disable it, but that would need adding in some check against binary data (because np.sum -- contrary to the previous implementation -- doesn't throw an error for binary data).

I do however stand by:

Beyond that, it's IMO the .str-accessor that should be enforcing the correct types, but since there's no dedicated string dtype yet, the methods that do work for other object data (e.g. lists) are also used like that (e.g. people use .str.len() to get the lengths of the elements of a Series of lists).

In other words, I'm +epsilon on allowing (consistent, reasonable) off-label use of the .str-accessor until there is a string dtype.

#22722 is equally something that I'm not going to fight for. I got some less than ideal error messages while testing, and decided this could/should be improved. If you disagree, it's easy to remove the offending lines.

@gfyoung
Member

gfyoung commented Sep 16, 2018

@h-vetinari : Not a problem. I was only reading through the issues and your PR and matching up what you were doing with what was being said. Just pinging to get their eyes on this.

@WillAyd
Member

WillAyd commented Sep 17, 2018

I'm still very much against setting the expectation that the .str accessor will work with byte objects.

Also, I generally don't think that a "cleanup" should be performed in the same PR that changes the expected functionality of the codebase. It would be much easier to stick to one thing at a time, i.e. a clean-up that doesn't introduce or change any existing behavior.

@h-vetinari
Contributor Author

h-vetinari commented Sep 17, 2018

@WillAyd

Would be much easier to stick to one thing at a time, i.e. a clean up that doesn't introduce or change any existing behavior

I reverted the (unrelated) fix for #22721, but the situation for #22722 is more delicate. The genesis was as follows:

  1. refactor
  2. see that behaviour changed
  3. (sorta) agree with new behaviour
  4. open issue so that changed behaviour in PR is explained

It turns out that the situation is even a bit more delicate, as currently on master, bytes can be successfully concatenated as long as sep is explicitly set:

(pandas-dev) C:\Users\[...]\pddev>python
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> pd.__version__
'0.24.0.dev0+586.g8a1c8ad4b'
>>> s = pd.Series(np.array(list('abc'), 'S1').astype(object))
>>> t = pd.Series(np.array(list('def'), 'S1').astype(object))
>>> s.str.cat(t)
TypeError: sequence item 0: expected str instance, bytes found
>>> s.str.cat(t, sep=b'')
0    b'ad'
1    b'be'
2    b'cf'
dtype: object
>>> s.str.cat(t, sep=b',')
0    b'a,d'
1    b'b,e'
2    b'c,f'
dtype: object

The problem was that tests/frame/test_strings.test_method_on_bytes only tests sep=None.

This is the sort of thing I was talking about with the missing string dtype. Without one, there are legitimate off-label uses for the .str-accessor, and concatenating bytes works like a charm already (not to mention several other methods of .str). The only real change here then would be that sep=None does not automatically trigger the TypeError.
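For context, the asymmetry above comes straight from Python's own join semantics, independent of pandas: a str separator rejects bytes items, while a bytes separator accepts them. A minimal illustration:

```python
# str.join refuses bytes elements, which is where the TypeError above
# originates; bytes.join happily concatenates them.
pieces = [b'a', b'd']

try:
    ''.join(pieces)  # str separator: TypeError on bytes items
    failed = False
except TypeError:
    failed = True

joined = b''.join(pieces)   # bytes separator: works
print(failed, joined)       # True b'ad'
```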

@h-vetinari h-vetinari force-pushed the cln_str_cat branch 2 times, most recently from a0975fd to f79c707 Compare September 17, 2018 09:16
@jreback
Contributor

jreback commented Sep 18, 2018

@h-vetinari pls pls pls, one thing per PR. We do NOT handle bytes in .str; if you want to add tests and raise, pls do so, but we're not going to 'make it work better'. It is amazingly confusing and causes all sorts of errors. We probably don't have explicit checks on this (though I thought that we always infer on the strings that they must be string/unicode and never bytes).

@jreback jreback left a comment

comments

@h-vetinari
Contributor Author

@jreback

We do NOT handle bytes in .str

Yes you (currently) do. Just try the code I posted above.

pls pls pls 1 thing per PR.

The whatsnew-note notwithstanding, this PR only changes the implementation (you'll see in the test that I've not changed anything substantial)

I understand that you don't want people using .str for byte data, but it works currently. The problem is that there's no good dtype distinction, and inspecting every element of a Series when calling .str would come with a big perf hit.

@jreback
Contributor

jreback commented Sep 18, 2018

Yes you (currently) do. Just try the code I posted above.

It may happen to work. Instead of refactoring this as I said above, would prefer tests / and better error messages with bytes inputs.

@h-vetinari
Contributor Author

@jreback

It may happen to work. Instead of refactoring this as I said above, would prefer tests / and better error messages with bytes inputs.

The point is that this PR does not change the current behaviour and should stand on its merits, unrelated to the fact that you'd like to disallow .str on bytes.

@jreback
Contributor

jreback commented Sep 18, 2018

@h-vetinari you removed tests, so clearly you are changing things.

@h-vetinari
Contributor Author

h-vetinari commented Sep 18, 2018

@jreback

you removed tests, so clearly you are changing things.

I removed one test for the internal method that has been factored away. Furthermore, this removed test (test_cat) is exactly replicated in the test directly below (test_str_cat).

@codecov

codecov bot commented Sep 19, 2018

Codecov Report

Merging #22725 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #22725      +/-   ##
==========================================
- Coverage    92.2%   92.19%   -0.01%     
==========================================
  Files         169      169              
  Lines       50924    50900      -24     
==========================================
- Hits        46952    46928      -24     
  Misses       3972     3972
Flag Coverage Δ
#multiple 90.61% <100%> (-0.01%) ⬇️
#single 42.32% <6.45%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/core/strings.py 98.58% <100%> (-0.05%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c8ce3d0...e58ec9d. Read the comment docs.

@h-vetinari h-vetinari left a comment

@jreback

PTAL

two = np.array(['a', NA, 'b', 'd', 'foo', NA], dtype=np.object_)

# single array
result = strings.str_cat(one)
Contributor Author

@jreback

I removed one test for the internal method that has been factored away

Please have a look here - this is directly importing the internal method and testing it (not str.cat)

Contributor

ok

rgx = 'All arrays must be same length'
three = Series(['1', '2', '3'])

with tm.assert_raises_regex(ValueError, rgx):
Contributor Author

Furthermore, this removed test (test_cat) is exactly replicated in the test directly below (test_str_cat).

I can't mark lines that are not in the diff, but check out

result = s.str.cat()

I replicated the removed test (acting on strings.test_cat) as a test acting on str.cat within #20347.

@h-vetinari
Contributor Author

@jreback
While you're at it with the reviewing, please don't forget this one. :)

def str_cat(arr, others=None, sep=None, na_rep=None):
"""
def interleave_sep(all_cols, sep):
'''
Contributor

use triple-double quotes

def str_cat(arr, others=None, sep=None, na_rep=None):
"""
def interleave_sep(all_cols, sep):
'''
Contributor

all_cols -> list_of_columns

result = str_cat(data, others=others, sep=sep, na_rep=na_rep)
return self._wrap_result(result,
use_codes=(not self._is_categorical))
data = data.astype(object).values
Contributor

why is this astype needed?

Contributor Author

I used it because data may be categorical, and then values is not necessarily a numpy array. Changed to ensure_object which you mentioned below, hope this is better.

if na_rep is None:
return sep.join(data[~mask])
return sep.join(np.where(mask, na_rep, data))
return sep.join(data)
Contributor

can we do a single sep.join, and just have the branches mask the data as needed

Contributor Author

Done
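The single-join branch discussed in this thread can be sketched as follows (a rough reconstruction, not the PR's exact code; the x != x check stands in for pandas' internal isna):

```python
import numpy as np

def join_with_na(data, sep='', na_rep=None):
    # NaN-aware join of a single object column: drop missing values when
    # na_rep is None, otherwise substitute na_rep, then do one sep.join.
    mask = np.array([x != x for x in data])  # True only for NaN entries
    if na_rep is None:
        return sep.join(data[~mask])
    return sep.join(np.where(mask, na_rep, data))

arr = np.array(['a', np.nan, 'c'], dtype=object)
print(join_with_na(arr, sep=','))              # 'a,c'
print(join_with_na(arr, sep=',', na_rep='-'))  # 'a,-,c'
```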

data, others = data.align(others, join=join)
others = [others[x] for x in others] # again list of Series

# str_cat discards index
res = str_cat(data, others=others, sep=sep, na_rep=na_rep)
all_cols = [x.astype(object).values for x in [data] + others]
Contributor

why do you need the astype, much prefer ensure_object generally

Contributor Author

Done

masks = np.array([isna(x) for x in all_cols])
union_mask = np.logical_or.reduce(masks, axis=0)

if na_rep is None and union_mask.any():
Contributor

comment on these cases

Contributor Author

Added comments
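The masking step this thread refers to can be illustrated roughly like this (names are illustrative; x != x again stands in for isna):

```python
import numpy as np

# One boolean mask per column, then their union across columns: a row is
# considered missing if any participating column has a NaN there.
cols = [np.array(['a', np.nan, 'c'], dtype=object),
        np.array(['d', 'e', np.nan], dtype=object)]

masks = np.array([[x != x for x in col] for col in cols])  # shape (2, 3)
union_mask = np.logical_or.reduce(masks, axis=0)

print(union_mask.tolist())  # [False, True, True]
```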

result[not_masked] = np.sum(all_cols, axis=0)
elif na_rep is not None and union_mask.any():
# fill NaNs
all_cols = [np.where(masks[i], na_rep, all_cols[i])
Contributor

use zip(masks, all_cols)

return all_cols
result = [sep] * (2 * len(all_cols) - 1)
result[::2] = all_cols
return result
Contributor

I would simply do np.sum(result) here, no?

Contributor Author

OK, that's reasonable. Refactored the function as necessary
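The resulting interleave-then-sum approach can be sketched like so (an approximation of the diff's idea, not the merged code; broadcasting the scalar sep to a full column is my adaptation, since modern numpy rejects ragged input to np.sum):

```python
import numpy as np

def cat_core_sketch(list_of_columns, sep):
    # Elementwise string concatenation: np.sum over equal-length object
    # arrays adds (i.e. concatenates) the strings row by row.
    if not sep:
        return np.sum(list_of_columns, axis=0)
    sep_col = np.full(len(list_of_columns[0]), sep, dtype=object)
    interleaved = [sep_col] * (2 * len(list_of_columns) - 1)
    interleaved[::2] = list_of_columns  # columns at even slots, sep between
    return np.sum(interleaved, axis=0)

a = np.array(['a', 'b'], dtype=object)
b = np.array(['c', 'd'], dtype=object)
print(cat_core_sketch([a, b], ','))  # ['a,c' 'b,d']
```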

two = np.array(['a', NA, 'b', 'd', 'foo', NA], dtype=np.object_)

# single array
result = strings.str_cat(one)
Contributor

ok

@@ -3136,7 +3089,7 @@ def test_method_on_bytes(self):
lhs = Series(np.array(list('abc'), 'S1').astype(object))
rhs = Series(np.array(list('def'), 'S1').astype(object))
if compat.PY3:
pytest.raises(TypeError, lhs.str.cat, rhs)
pytest.raises(TypeError, lhs.str.cat, rhs, sep=',')
Contributor

is this the bytes concat?

Contributor Author

Yes

@h-vetinari h-vetinari left a comment

Thanks for review; pushed new commits

@h-vetinari
Contributor Author

@jreback

PTAL


@h-vetinari
Contributor Author

@WillAyd @jreback
Unfortunately, I have bad news. I started out in this PR with a very idiomatic solution (see the first couple commits), and it was just too slow.

Here's the ASV for the last commit:

All benchmarks:

       before           after         ratio
     [b28cf5aa]       [a97fe67e]
     <master>         <cln_str_cat>
         9.38±0ms         9.38±0ms     1.00  strings.Cat.time_cat(0, ',', '-', 0.0)
       9.38±0.8ms         10.9±0ms    ~1.17  strings.Cat.time_cat(0, ',', '-', 0.001)
         10.9±0ms       12.5±0.6ms    ~1.14  strings.Cat.time_cat(0, ',', '-', 0.15)
       9.38±0.6ms       8.59±0.8ms     0.92  strings.Cat.time_cat(0, ',', None, 0.0)
-      12.5±0.8ms         10.9±0ms     0.87  strings.Cat.time_cat(0, ',', None, 0.001)
-        14.1±0ms         10.9±0ms     0.78  strings.Cat.time_cat(0, ',', None, 0.15)
         9.38±0ms       9.38±0.8ms     1.00  strings.Cat.time_cat(0, None, '-', 0.0)
         9.38±0ms       9.38±0.8ms     1.00  strings.Cat.time_cat(0, None, '-', 0.001)
       10.9±0.8ms       12.5±0.8ms    ~1.14  strings.Cat.time_cat(0, None, '-', 0.15)
         10.9±2ms       7.81±0.8ms    ~0.71  strings.Cat.time_cat(0, None, None, 0.0)
         7.81±0ms       10.9±0.8ms    ~1.40  strings.Cat.time_cat(0, None, None, 0.001)
         10.9±2ms       10.9±0.8ms     1.00  strings.Cat.time_cat(0, None, None, 0.15)
         78.1±8ms         93.8±8ms    ~1.20  strings.Cat.time_cat(3, ',', '-', 0.0)
+        62.5±8ms          109±0ms     1.75  strings.Cat.time_cat(3, ',', '-', 0.001)
+        78.1±8ms          125±8ms     1.60  strings.Cat.time_cat(3, ',', '-', 0.15)
+        46.9±8ms         93.8±0ms     2.00  strings.Cat.time_cat(3, ',', None, 0.0)
         62.5±8ms          109±0ms    ~1.75  strings.Cat.time_cat(3, ',', None, 0.001)
+        46.9±8ms          102±8ms     2.17  strings.Cat.time_cat(3, ',', None, 0.15)
         62.5±0ms         78.1±6ms    ~1.25  strings.Cat.time_cat(3, None, '-', 0.0)
+        46.9±8ms         93.8±0ms     2.00  strings.Cat.time_cat(3, None, '-', 0.001)
+        62.5±8ms          125±6ms     2.00  strings.Cat.time_cat(3, None, '-', 0.15)
+        62.5±8ms         85.9±8ms     1.38  strings.Cat.time_cat(3, None, None, 0.0)
         46.9±6ms         93.8±0ms    ~2.00  strings.Cat.time_cat(3, None, None, 0.001)
         46.9±0ms         93.8±0ms    ~2.00  strings.Cat.time_cat(3, None, None, 0.15)

So especially when others is not None (all the pd.concat and dealing with DataFrames) we lose perf.
As a comparison, here's the ASV before the idiomatic changes @WillAyd requested:

All benchmarks:

       before           after         ratio
     [b28cf5aa]       [0d3c6d21]
     <master>         <cln_str_cat~2>
         9.38±0ms         9.38±0ms     1.00  strings.Cat.time_cat(0, ',', '-', 0.0)
         9.38±0ms         10.9±0ms    ~1.17  strings.Cat.time_cat(0, ',', '-', 0.001)
         10.9±0ms         12.5±0ms    ~1.14  strings.Cat.time_cat(0, ',', '-', 0.15)
       9.38±0.6ms         9.38±0ms     1.00  strings.Cat.time_cat(0, ',', None, 0.0)
       14.1±0.8ms         10.9±0ms    ~0.78  strings.Cat.time_cat(0, ',', None, 0.001)
       14.1±0.6ms       11.7±0.8ms    ~0.83  strings.Cat.time_cat(0, ',', None, 0.15)
         9.38±0ms         9.38±0ms     1.00  strings.Cat.time_cat(0, None, '-', 0.0)
       9.38±0.6ms       10.9±0.6ms    ~1.17  strings.Cat.time_cat(0, None, '-', 0.001)
         10.9±0ms         12.5±0ms    ~1.14  strings.Cat.time_cat(0, None, '-', 0.15)
         10.9±0ms         7.81±2ms    ~0.71  strings.Cat.time_cat(0, None, None, 0.0)
         9.38±0ms       10.9±0.8ms    ~1.17  strings.Cat.time_cat(0, None, None, 0.001)
       10.9±0.8ms       11.7±0.8ms     1.07  strings.Cat.time_cat(0, None, None, 0.15)
         78.1±0ms         78.1±8ms     1.00  strings.Cat.time_cat(3, ',', '-', 0.0)
         70.3±8ms         78.1±0ms    ~1.11  strings.Cat.time_cat(3, ',', '-', 0.001)
         93.8±8ms         93.8±0ms     1.00  strings.Cat.time_cat(3, ',', '-', 0.15)
         46.9±6ms         78.1±0ms    ~1.67  strings.Cat.time_cat(3, ',', None, 0.0)
         46.9±8ms         78.1±0ms    ~1.67  strings.Cat.time_cat(3, ',', None, 0.001)
         46.9±6ms         62.5±8ms    ~1.33  strings.Cat.time_cat(3, ',', None, 0.15)
         54.7±8ms         54.7±8ms     1.00  strings.Cat.time_cat(3, None, '-', 0.0)
         46.9±8ms         62.5±8ms    ~1.33  strings.Cat.time_cat(3, None, '-', 0.001)
         62.5±0ms        78.1±10ms    ~1.25  strings.Cat.time_cat(3, None, '-', 0.15)
         62.5±6ms         46.9±8ms    ~0.75  strings.Cat.time_cat(3, None, None, 0.0)
         46.9±6ms         62.5±0ms    ~1.33  strings.Cat.time_cat(3, None, None, 0.001)
         46.9±0ms         62.5±6ms    ~1.33  strings.Cat.time_cat(3, None, None, 0.15)

This isn't great, but not too bad IMO. Obviously it costs us to uselessly add in sep='' just to catch TypeErrors that should already be caught in the .str-accessor. I have something in mind there as well.
The direct comparison:

                        HEAD~2 vs. master  HEAD vs. master HEAD vs. HEAD~2 HvH2 increase
(3, ',', '-', 0.0)                   1.00             1.20            1.20       +20.00%
(3, ',', '-', 0.001)                 1.11             1.75            1.58       +57.66%
(3, ',', '-', 0.15)                  1.00             1.60            1.60       +60.00%
(3, ',', None, 0.0)                  1.67             2.00            1.20       +19.76%
(3, ',', None, 0.001)                1.67             1.75            1.05        +4.79%
(3, ',', None, 0.15)                 1.33             2.17            1.63       +63.16%
(3, None, '-', 0.0)                  1.00             1.25            1.25       +25.00%
(3, None, '-', 0.001)                1.33             2.00            1.50       +50.38%
(3, None, '-', 0.15)                 1.25             2.00            1.60       +60.00%
(3, None, None, 0.0)                 0.75             1.38            1.84       +84.00%
(3, None, None, 0.001)               1.33             2.00            1.50       +50.38%
(3, None, None, 0.15)                1.33             2.00            1.50       +50.38%

Finally, as a general warning about the results right after the run with SOME BENCHMARKS CHANGED SIGNIFICANTLY etc.: all benchmarks with a ~ in their ratio are falsely omitted from those results. I've opened airspeed-velocity/asv#752 for that.

@WillAyd
Member

WillAyd commented Oct 11, 2018

IIUC you are saying the last batch of changes requested are causing performance to suffer anywhere between 20-80%? I've been wrong before but at the same time I've never seen instances where applying a function via list comprehension would be significantly faster than applying to the entire frame. Would be helpful if you could profile and debug further

@h-vetinari
Contributor Author

IIUC you are saying the last batch of changes requested are causing performance to suffer anywhere between 20-80%? I've been wrong before but at the same time I've never seen instances where applying a function via list comprehension would be significantly faster than applying to the entire frame.

You do understand correctly. The list comps themselves (i.e. not counting what goes on inside) are pretty fast, but the more important part is staying in numpy-land and only going to pandas-land where absolutely necessary. You can check some of the earlier commits (and the ASVs at the top) yourself. In short: working with pandas objects like DataFrame and pd.concat is expensive compared to pure numpy.

Would be helpful if you could profile and debug further

I did it in the beginning of this PR already, with the above conclusion. Since IMO "Non-idiomatic" < PERF (by a large margin), this case is settled for me.
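To make the numpy-land vs. pandas-land distinction concrete, here's a toy equivalence check of the two strategies (illustrative only; the actual difference the PR cares about is runtime, which would be measured with %timeit or ASV rather than asserted here):

```python
import numpy as np
import pandas as pd

n = 10_000
s = pd.Series(np.array(['x'] * n, dtype=object))
t = pd.Series(np.array(['y'] * n, dtype=object))

# numpy-land: operate on the underlying object arrays directly
np_result = np.sum([s.values, t.values], axis=0)

# pandas-land: build a DataFrame first, then combine its columns
df = pd.concat([s, t], axis=1)
pd_result = (df.iloc[:, 0] + df.iloc[:, 1]).values

print((np_result == pd_result).all())  # both give 'xy' per row
```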

@h-vetinari
Contributor Author

h-vetinari commented Oct 11, 2018

I've never seen instances where applying a function via list comprehension would be significantly faster than applying to the entire frame.

It's not just that single comprehension either; we're concatenating more often (before, it was just for the alignment) to always get a DataFrame.

@h-vetinari h-vetinari left a comment

@WillAyd
Some further explanation in the diff of the last commit:
https://github.com/pandas-dev/pandas/pull/22725/commits/e58ec9dfa82a459d9b316b678b77d50fc4901e9e

# concatenate others into DataFrame; need to add keys for uniqueness in
# case of duplicate columns (for join is None, all indexes are already
# the same after _get_series_list, which forces alignment in this case)
others = concat(others, axis=1,
Contributor Author

@WillAyd, this is the main reason for the slow-down. For working with a DataFrame below (as you wished), we first need to create it with pd.concat (expensive). Before, we were only using pd.concat if the indices need to be aligned (which they don't in the benchmarks).

@h-vetinari
Contributor Author

@jreback @WillAyd
Can we sacrifice the idiomatic code for perf? Or how do we proceed here?

@WillAyd
Member

WillAyd commented Oct 12, 2018

pd.concat is not expensive. In fact here's a small comparison of the initial part of both code branches:

In [50]: sers = [pd.Series(np.arange(100_000)) for x in range(10)] 
 
In [57]: %%timeit  
    ...: all_cols = [ensure_object(x) for x in sers]                                                                     
38.2 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [60]: %%timeit  
    ...: df = pd.concat(sers, axis=1) 
    ...: all_cols_df = ensure_object(df)                                                                 
37.8 ms ± 419 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I think the problem may be that ensure_object against a DataFrame returns an ndarray of ndarrays with shape (100_000, 10), whereas all_cols has shape (10, 100_000). If you can inspect more closely, that would be helpful.

Unless one of the other devs objects, I would really prefer this to be idiomatic from a pandas perspective. And just to be clear on what that actually means, list comprehensions over 2D data are NOT idiomatic when operations can be performed against a DataFrame instead. We can always optimize operations with the latter but are limited in regards to the former.
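The shape difference pointed out above can be verified directly (using the public astype(object).values as a stand-in for the internal ensure_object):

```python
import numpy as np
import pandas as pd

sers = [pd.Series(np.arange(5)) for _ in range(3)]

# Whole-frame route: one 2-D object array of shape (nrows, ncols)
as_frame = pd.concat(sers, axis=1).astype(object).values

# List-of-columns route: ncols separate 1-D arrays of length nrows
as_list = [s.astype(object).values for s in sers]

print(as_frame.shape)                  # (5, 3)
print(len(as_list), as_list[0].shape)  # 3 (5,)
```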

@h-vetinari
Contributor Author

There's not much to go on - the last commit shows how little changed: https://github.com/pandas-dev/pandas/pull/22725/commits/e58ec9dfa82a459d9b316b678b77d50fc4901e9e (I formatted this as code to prevent GitHub from mangling the actual comparison URL).

  • cat_core doesn't have to np.split
  • one less pd.concat (because it's now only necessary for different indexes)
  • list comp instead of DataFrame operations

I honestly don't see where it's coming from if not the useless concatenating and then splitting again (because, for interleaving sep in cat_core, we need a list of Series anyway). Of course it'd be nicer to have idiomatic code (I tried that right off the bat, but it was 2-3x slower), but ultimately perf should dictate this. All of the cython code isn't pandas-idiomatic either. ;-)

@jreback
Contributor

jreback commented Oct 12, 2018

@h-vetinari code maintainability is actually the most important property of changes;
pls use more idiomatic constructions as @WillAyd indicates

@jreback
Contributor

jreback commented Oct 14, 2018

looks ok to me; @WillAyd merge when satisfied

@WillAyd WillAyd merged commit f9d237b into pandas-dev:master Oct 14, 2018
@WillAyd
Member

WillAyd commented Oct 14, 2018

Thanks @h-vetinari !

Labels
Clean Error Reporting Incorrect or improved errors from pandas Strings String extension data type and string data
Successfully merging this pull request may close these issues.

  • Improve TypeError message for str.cat
  • str.cat not working with binary data on Python3
5 participants