
BENCH: asv csv reading benchmarks now rewind StringIO objects #21807

Merged Jul 28, 2018 (1 commit)

Conversation

tylerjereddy
Contributor

In short, the asv benchmarks used for timing read_csv are currently reading empty file-like objects most of the time. This is because the setup() method of a benchmark is only called between repeats, not between iterations within a repeat--see the asv writing benchmarks docs.

You can confirm this by printing the size of the dataframes returned by read_csv in any StringIO-dependent benchmark--the first iteration of a repeat should have the correct size, but the rest will all be 0. You could also print the memory address of the StringIO object--it will be the same after the first iteration of a repeat. I did this with @stefanv and @mattip in the NumPy context.

This can be quite confusing--a bit different from the paradigm you might expect in, e.g., a unit-test context--for more gory details see my related comment in the NumPy loadtxt() asv benchmarks. As noted there, one could probably avoid this by using an actual file object instead, if preferred, but realistically this just boils down to how timeit works. Confounding the results with the seek is unfortunate, so working with file objects may be preferred if dumping files in the bench environment is acceptable.
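The failure mode described above can be reproduced in a few lines with nothing but the standard library (the buffer contents here are illustrative):

```python
from io import StringIO

data = StringIO("a,b\n1,2\n3,4\n")

# The first consumer reads the whole buffer...
first = data.read()
assert len(first) > 0

# ...but subsequent reads see nothing, because the stream position is
# left at the end. This is what repeated timing iterations measure when
# setup() is not re-run between them.
second = data.read()
assert second == ""

# seek(0) rewinds, so the next iteration sees the full contents again.
data.seek(0)
third = data.read()
assert third == first
```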

@codecov

codecov bot commented Jul 8, 2018

Codecov Report

Merging #21807 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21807   +/-   ##
=======================================
  Coverage   92.05%   92.05%           
=======================================
  Files         170      170           
  Lines       50708    50708           
=======================================
  Hits        46677    46677           
  Misses       4031     4031
Flag       Coverage Δ
#multiple  90.45% <ø> (ø) ⬆️
#single    42.36% <ø> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7a2fbce...90d3dda.

@@ -69,6 +69,7 @@ def setup(self, infer_datetime_format, format):
self.data = StringIO('\n'.join(rng.strftime(dt_format).tolist()))

def time_read_csv(self, infer_datetime_format, format):
self.data.seek(0)
Contributor


hmm, can we instead make data a property which does this? otherwise this is likely to be forgotten the next time a benchmark is written. can you show a before/after on these benchmarks?
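One reading of this suggestion, as a hedged sketch (the class and attribute names are illustrative, not the merged implementation): store the raw buffer privately and rewind it on every access through a `data` property.

```python
from io import StringIO


class ReadCSVBenchmark(object):
    # Hypothetical sketch of the property idea: setup() builds the
    # buffer once per repeat, and every access to `data` rewinds it,
    # so each timing iteration starts at the beginning of the stream.
    def setup(self):
        self._data = StringIO("a,b\n1,2\n3,4\n")

    @property
    def data(self):
        self._data.seek(0)
        return self._data


bench = ReadCSVBenchmark()
bench.setup()
# Each access rewinds, so two consecutive full reads are identical.
assert bench.data.read() == bench.data.read()
```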

@jreback jreback added the Performance Memory or execution speed performance label Jul 8, 2018
@jreback jreback added this to the 0.24.0 milestone Jul 8, 2018
@pep8speaks

pep8speaks commented Jul 12, 2018

Hello @tylerjereddy! Thanks for updating the PR.

Line 184:80: E501 line too long (87 > 79 characters)

Comment last updated on July 26, 2018 at 20:11 UTC

@tylerjereddy
Contributor Author

Revised accordingly, with before/after log files from asv run -e -b "ReadCSV*" >& bench_results.txt below. Note that this regular expression won't select all of the affected benchmarks, because not all CSV reading benchmarks follow a consistent naming scheme, but it should pull in many of them. Not sure if pandas CI does a quick (--quick) run of the asv suite to make sure there are no benchmark errors on PRs (proposals to do this are in progress for NumPy / SciPy).

Before rewinding StringIO objects in csv reading benchmarks (commit hash: 7829ad3): old_bench_results.txt

After rewinding StringIO objects in csv reading benchmarks: bench_results.txt

I agree that the property approach may alleviate future forgetting-to-rewind issues, though you'll likely notice that the one class which uses two data1/data2 objects instead of parametrization becomes slightly more complicated to reason about.

Most csv reading benchmarks are now a bit slower (based on visual inspection of the above logs), but that is expected once the benchmarks stop reading empty file objects by mistake for most iterations, and the added rewinds also contribute a little time.

There's another alternative if you prefer--setting number = 1 on each benchmark class will force setup() to run between each iteration. I've preferred not to do that, and instead let asv and timeit decide on the appropriate number of repeats and iterations under the hood.
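For reference, the number = 1 alternative would look roughly like this on an asv benchmark class (an illustrative sketch, not what this PR does):

```python
from io import StringIO


class ReadCSVWithNumberOne(object):
    # asv class attribute: take exactly one iteration per timing
    # sample, so setup() runs before every measurement and the
    # StringIO object is always fresh. The trade-off is that
    # asv/timeit can no longer pick a larger iteration count to
    # reduce timer noise on very short benchmarks.
    number = 1

    def setup(self):
        self.data = StringIO("a,b\n1,2\n3,4\n")

    def time_read(self):
        self.data.read()


bench = ReadCSVWithNumberOne()
bench.setup()
contents = bench.data.read()
```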

class ReadCSVDInferDatetimeFormat(object):
class StringIORewind(object):

@property
Contributor


just make this a function I think and pass in the StringIO object
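The function variant being suggested might look like the following sketch, with a shared base class whose method rewinds whatever StringIO object is passed in (names are illustrative; the merged code may differ):

```python
from io import StringIO


class StringIORewind(object):
    # Shared helper: benchmarks pass their StringIO attribute through
    # data(...) before reading, so every timing iteration starts at
    # the beginning of the buffer.
    def data(self, stringio_object):
        stringio_object.seek(0)
        return stringio_object


class ReadCSVExample(StringIORewind):
    def setup(self):
        self.StringIO_input = StringIO("a,b\n1,2\n3,4\n")

    def time_read_csv(self):
        # rewind, then read
        return self.data(self.StringIO_input).read()


bench = ReadCSVExample()
bench.setup()
first = bench.time_read_csv()
second = bench.time_read_csv()  # rewound, so identical to the first read
```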

@jbrockmendel
Member

Travis failure looks unrelated. @tylerjereddy this is almost over the finish line.

… the end

* benchmarks for read_csv() now properly rewind StringIO objects prior to
reading them in; previously, all iterations of an asv repeat timing run
would read in no data because the StringIO object was pointing to its end after
the first iteration--setup() only runs between repeats, not iterations within
repeats of timeit
@tylerjereddy
Contributor Author

Revised / squashed / rebased / force pushed. Hopefully better now.

@jreback jreback merged commit 0b7a08b into pandas-dev:master Jul 28, 2018
@jreback
Contributor

jreback commented Jul 28, 2018

thanks @tylerjereddy

minggli added a commit to minggli/pandas that referenced this pull request Jul 28, 2018
* master:
  BENCH: asv csv reading benchmarks no longer read StringIO objects off the end (pandas-dev#21807)
  BUG: df.agg, df.transform and df.apply use different methods when axis=1 than when axis=0 (pandas-dev#21224)
  BUG: bug in GroupBy.count where arg minlength passed to np.bincount must be None for np<1.13 (pandas-dev#21957)
  CLN: Vbench to asv conversion script (pandas-dev#22089)
  consistent docstring (pandas-dev#22066)
  TST: skip pytables test with not-updated pytables conda package (pandas-dev#22099)
  CLN: Remove Legacy MultiIndex Index Compatibility (pandas-dev#21740)
  DOC: Reword doc for filepath_or_buffer in read_csv (pandas-dev#22058)
  BUG: rolling with MSVC 2017 build (pandas-dev#21813)
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
… the end (pandas-dev#21807)

Labels
Performance Memory or execution speed performance
4 participants