PERF: Add if branch for empty sep in str.cat #26605

h-vetinari · 2019-06-01T15:32:46Z

Follow-up to #23167 and to dropping Python 2. The branch I'm re-adding here originally had to be deleted to pass some Python 2 bytes tests. When there is no separator, we can skip all the list operations and speed up cat_core by a fair bit.

jreback · 2019-06-01T15:35:34Z

pandas/core/strings.py

@@ -48,6 +48,8 @@ def cat_core(list_of_columns, sep):
    nd.array
        The concatenation of list_of_columns with sep
    """
+    if sep == '':


jreback · Jun 1, 2019

can you show the benchmarks which change here?

h-vetinari · Jun 1, 2019

See the middle block of this comment

jreback · Jun 1, 2019

that was 8 months ago
pls show a new one

h-vetinari · Jun 1, 2019

This is a clear and unambiguous reduction in necessary computation; I fail to see how an ASV run is necessary (beyond what was done already, and especially since that code hasn't changed since then).

Maybe I'll get around to it in the next few weeks.

WillAyd · Jun 2, 2019

I don't consider this totally unambiguous - is this only applicable to extremely wide data frames? Benchmarks and context would be very useful

jreback · Jun 2, 2019

this adds complexity. unless this is like 2x faster would close this.

h-vetinari · Jun 3, 2019

@WillAyd: I don't consider this totally unambiguous - is this only applicable to extremely wide data frames? Benchmarks and context would be very useful

Maybe the issue is not as apparent as I thought. Assuming list_of_columns=[s, t, u, v] (where s, t, u, v are Series), cat_core takes that list, interleaves sep to arrive at [s, sep, t, sep, u, sep, v], and pushes the whole thing into np.sum.

However, if the sep='', there is effectively nothing to add, and we're just uselessly wasting cycles to interleave (and then add) an empty sep.
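To illustrate the interleaving described above, here is a minimal sketch. It is not the pandas implementation: plain Python lists stand in for the Series/object arrays, and functools.reduce with a hypothetical helper `_add` stands in for np.sum, so the example runs without NumPy. The name `cat_core_sketch` is likewise made up for this sketch.

```python
from functools import reduce


def _add(left, right):
    """Element-wise string addition; a bare str acts like a broadcast scalar."""
    if isinstance(left, str):
        return [left + r for r in right]
    if isinstance(right, str):
        return [l + right for l in left]
    return [l + r for l, r in zip(left, right)]


def cat_core_sketch(list_of_columns, sep):
    """Concatenate the columns element-wise, joined by sep."""
    if sep == "":
        # Fast path: an empty separator contributes nothing, so skip the
        # interleaving and reduce over the columns directly.
        return reduce(_add, list_of_columns)
    # General path: build [s, sep, t, sep, u, ...] and reduce over all of it.
    list_with_sep = [sep] * (2 * len(list_of_columns) - 1)
    list_with_sep[::2] = list_of_columns
    return reduce(_add, list_with_sep)
```

With s = ["a", "b"], t = ["c", "d"], u = ["e", "f"], calling cat_core_sketch([s, t, u], "-") gives ["a-c-e", "b-d-f"] and cat_core_sketch([s, t, u], "") gives ["ace", "bdf"]. In this sketch the second call needs two element-wise additions instead of four, which is the saving the empty-sep branch is after.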

@jreback: this adds complexity. unless this is like 2x faster would close this.

.str.cat has about 200 LoC (not counting docstrings). Requiring 2 lines (1% of the total) to achieve a 100% speedup sounds like an impossible standard for any perf-related change.

Based on the last runs in October (will try to do new ones as I said), I expect a 20-30% speedup for the (very common) case that sep=''. That should be more than enough to justify two trivial lines of code, IMO.

jreback · Jun 27, 2019

@h-vetinari: Based on the last runs in October (will try to do new ones as I said), I expect a 20-30% speedup for the (very common) case that sep=''. That should be more than enough to justify two trivial lines of code, IMO.

if that's true, then great. pls show benchmarks.

h-vetinari · Jun 28, 2019

Running the ASV is on my todo-list...

codecov · 2019-06-01T16:15:04Z

Codecov Report

Merging #26605 into master will decrease coverage by 1.17%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26605      +/-   ##
==========================================
- Coverage   93.02%   91.84%   -1.18%     
==========================================
  Files         182      174       -8     
  Lines       50253    50709     +456     
==========================================
- Hits        46746    46576     -170     
- Misses       3507     4133     +626

Flag	Coverage Δ
#multiple	`90.39% <100%> (-1.3%)`	⬇️
#single	`41.76% <0%> (-0.75%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/strings.py	`98.92% <100%> (-0.02%)`	⬇️
pandas/plotting/_misc.py	`38.23% <0%> (-26.63%)`	⬇️
pandas/io/gbq.py	`78.94% <0%> (-21.06%)`	⬇️
pandas/io/gcs.py	`80% <0%> (-20%)`	⬇️
pandas/io/s3.py	`89.47% <0%> (-10.53%)`	⬇️
pandas/core/groupby/base.py	`91.83% <0%> (-8.17%)`	⬇️
pandas/io/excel/_xlrd.py	`94.54% <0%> (-5.46%)`	⬇️
pandas/util/_decorators.py	`91.34% <0%> (-4.01%)`	⬇️
pandas/plotting/_core.py	`83.77% <0%> (-3.88%)`	⬇️
pandas/io/formats/printing.py	`85.56% <0%> (-3.28%)`	⬇️
... and 171 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0fd888c...2e0d03a.

h-vetinari · 2019-07-30T06:52:51Z

Finally got around to running the ASVs:

>asv continuous -f 1.1 upstream/master HEAD -b ^strings.Cat
[...]
       before           after         ratio
     [0fd888c8]       [2e0d03a9]
     <master>         <str_cat_perf>
-      59.9±0.6ms       53.0±0.4ms     0.88  strings.Cat.time_cat(3, None, '-', 0.0)
-      60.0±0.2ms       52.7±0.1ms     0.88  strings.Cat.time_cat(3, None, None, 0.0)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Note that this is for N = 10 ** 5 (100,000):

pandas/asv_bench/benchmarks/strings.py

Line 116 in 0fd888c

N = 10 ** 5

I believe the gains would be even better for larger arrays.

jreback · 2019-07-31T12:30:02Z

k thanks

jreback requested changes Jun 1, 2019

jreback added the Performance and Strings labels Jun 1, 2019

h-vetinari mentioned this pull request Jun 3, 2019

Possible performance regression in 9a42cbe85461c28417a5130bc80b035044c5575a #26639

Closed

h-vetinari mentioned this pull request Jul 26, 2019

DEPR: execute deprecations for str.cat in v1.0 #27611

Merged

3 tasks

PERF: Add if branch for empty sep in str.cat

2e0d03a

h-vetinari force-pushed the str_cat_perf branch from f7bf5ab to 2e0d03a Compare July 30, 2019 06:48

h-vetinari added 2 commits July 30, 2019 09:49

blackify

c1083f4

add comment

637686b

jreback added this to the 1.0 milestone Jul 31, 2019

jreback approved these changes Jul 31, 2019

jreback merged commit c046dfb into pandas-dev:master Jul 31, 2019

h-vetinari deleted the str_cat_perf branch July 31, 2019 13:26

quintusdias pushed a commit to quintusdias/pandas_dev that referenced this pull request Aug 16, 2019

PERF: Add if branch for empty sep in str.cat (pandas-dev#26605)

aaf4ea3
