
BENCH: asv csv reading benchmarks now rewind StringIO objects #21807

Merged Jul 28, 2018 (1 commit)

Conversation

tylerjereddy
Contributor

In short, the asv benchmarks used for timing read_csv are currently reading empty file-like objects most of the time. This is because the setup() method of a benchmark is only called between repeats, not between iterations within a repeat--see the asv writing benchmarks docs.

You can confirm this by printing the size of the dataframes returned by read_csv in any StringIO-dependent benchmark--the first iteration of a repeat should have the correct size, but the rest will all be 0. You could also print the memory address of the StringIO object--it will be the same after the first iteration of a repeat. I did this with @stefanv and @mattip in the NumPy context.

This can be quite confusing--a bit different from the paradigm you might expect in, e.g., a unit-test context--for more gory details see my related comment in the NumPy loadtxt() asv benchmarks. As noted there, one could probably avoid this by using an actual file object instead, if preferred, but realistically this just boils down to how timeit works. Confounding the results with the seek is unfortunate, so working with file objects may be preferred if dumping files in the bench environment is acceptable.
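The failure mode described above can be reproduced in a few lines with nothing but the standard library (the buffer contents here are illustrative):

```python
from io import StringIO

data = StringIO("a,b\n1,2\n3,4\n")

# The first consumer reads the whole buffer...
first = data.read()
assert len(first) > 0

# ...but subsequent reads see nothing, because the stream position is
# left at the end. This is what repeated timing iterations measure when
# setup() is not re-run between them.
second = data.read()
assert second == ""

# seek(0) rewinds, so the next iteration sees the full contents again.
data.seek(0)
third = data.read()
assert third == first
```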

@codecov

codecov bot commented Jul 8, 2018

Codecov Report

Merging #21807 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21807   +/-   ##
=======================================
  Coverage   92.05%   92.05%           
=======================================
  Files         170      170           
  Lines       50708    50708           
=======================================
  Hits        46677    46677           
  Misses       4031     4031
Flag       Coverage Δ
#multiple  90.45% <ø> (ø) ⬆️
#single    42.36% <ø> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7a2fbce...90d3dda.

@@ -69,6 +69,7 @@ def setup(self, infer_datetime_format, format):
self.data = StringIO('\n'.join(rng.strftime(dt_format).tolist()))

def time_read_csv(self, infer_datetime_format, format):
self.data.seek(0)
Contributor


hmm, can we instead make data a property which does this? otherwise this is likely to be forgotten the next time a benchmark is written. can you show a before/after on these benchmarks?
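One reading of this suggestion, as a hedged sketch (the class and attribute names are illustrative, not the merged implementation): store the raw buffer privately and rewind it on every access through a `data` property.

```python
from io import StringIO


class ReadCSVBenchmark(object):
    # Hypothetical sketch of the property idea: setup() builds the
    # buffer once per repeat, and every access to `data` rewinds it,
    # so each timing iteration starts at the beginning of the stream.
    def setup(self):
        self._data = StringIO("a,b\n1,2\n3,4\n")

    @property
    def data(self):
        self._data.seek(0)
        return self._data


bench = ReadCSVBenchmark()
bench.setup()
# Each access rewinds, so two consecutive full reads are identical.
assert bench.data.read() == bench.data.read()
```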

@jreback jreback added the Performance Memory or execution speed performance label Jul 8, 2018
@jreback jreback added this to the 0.24.0 milestone Jul 8, 2018
@pep8speaks

pep8speaks commented Jul 12, 2018

Hello @tylerjereddy! Thanks for updating the PR.

Line 184:80: E501 line too long (87 > 79 characters)

Comment last updated on July 26, 2018 at 20:11 UTC

@tylerjereddy
Contributor Author

Revised accordingly, with before/after log files from asv run -e -b "ReadCSV*" >& bench_results.txt below. Note that this regular expression won't select all of the affected benchmarks, because not all CSV reading benchmarks follow a consistent naming scheme, but it should pull in many of them. Not sure if pandas CI does a quick (--quick) run of the asv suite to make sure there are no benchmark errors on PRs (proposals to do this are in progress for NumPy / SciPy).

Before rewinding StringIO objects in csv reading benchmarks (commit hash: 7829ad3): old_bench_results.txt

After rewinding StringIO objects in csv reading benchmarks: bench_results.txt

I agree that the property approach may alleviate future forgetting-to-rewind issues, though you'll likely notice that the one class which uses two data1/data2 objects instead of parametrization becomes slightly more complicated to reason about.

Most csv reading benchmarks are now a bit slower (based on visual inspection of the above logs), but that is expected once the benchmarks stop reading empty file objects by mistake for most iterations, and the added rewinds also contribute a little time.

There's another alternative if you prefer--setting number = 1 on each benchmark class will force setup() to run between each iteration. I've preferred not to do that, and instead let asv and timeit decide on the appropriate number of repeats and iterations under the hood.
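For reference, the number = 1 alternative would look roughly like this on an asv benchmark class (an illustrative sketch, not what this PR does):

```python
from io import StringIO


class ReadCSVWithNumberOne(object):
    # asv class attribute: take exactly one iteration per timing
    # sample, so setup() runs before every measurement and the
    # StringIO object is always fresh. The trade-off is that
    # asv/timeit can no longer pick a larger iteration count to
    # reduce timer noise on very short benchmarks.
    number = 1

    def setup(self):
        self.data = StringIO("a,b\n1,2\n3,4\n")

    def time_read(self):
        self.data.read()


bench = ReadCSVWithNumberOne()
bench.setup()
contents = bench.data.read()
```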

class ReadCSVDInferDatetimeFormat(object):
class StringIORewind(object):

@property
Contributor


just make this a function I think and pass in the StringIO object
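The function variant being suggested might look like the following sketch, with a shared base class whose method rewinds whatever StringIO object is passed in (names are illustrative; the merged code may differ):

```python
from io import StringIO


class StringIORewind(object):
    # Shared helper: benchmarks pass their StringIO attribute through
    # data(...) before reading, so every timing iteration starts at
    # the beginning of the buffer.
    def data(self, stringio_object):
        stringio_object.seek(0)
        return stringio_object


class ReadCSVExample(StringIORewind):
    def setup(self):
        self.StringIO_input = StringIO("a,b\n1,2\n3,4\n")

    def time_read_csv(self):
        # rewind, then read
        return self.data(self.StringIO_input).read()


bench = ReadCSVExample()
bench.setup()
first = bench.time_read_csv()
second = bench.time_read_csv()  # rewound, so identical to the first read
```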

@jbrockmendel
Member

Travis failure looks unrelated. @tylerjereddy this is almost over the finish line.

… the end

* benchmarks for read_csv() now properly rewind StringIO objects prior to
reading them in; previously, all iterations of an asv repeat timing run
would read in no data because the StringIO object was pointing to its end after
the first iteration--setup() only runs between repeats, not iterations within
repeats of timeit
@tylerjereddy
Contributor Author

Revised / squashed / rebased / force pushed. Hopefully better now.

@jreback jreback merged commit 0b7a08b into pandas-dev:master Jul 28, 2018
@jreback
Contributor

jreback commented Jul 28, 2018

thanks @tylerjereddy

minggli added a commit to minggli/pandas that referenced this pull request Jul 28, 2018
* master:
  BENCH: asv csv reading benchmarks no longer read StringIO objects off the end (pandas-dev#21807)
  BUG: df.agg, df.transform and df.apply use different methods when axis=1 than when axis=0 (pandas-dev#21224)
  BUG: bug in GroupBy.count where arg minlength passed to np.bincount must be None for np<1.13 (pandas-dev#21957)
  CLN: Vbench to asv conversion script (pandas-dev#22089)
  consistent docstring (pandas-dev#22066)
  TST: skip pytables test with not-updated pytables conda package (pandas-dev#22099)
  CLN: Remove Legacy MultiIndex Index Compatibility (pandas-dev#21740)
  DOC: Reword doc for filepath_or_buffer in read_csv (pandas-dev#22058)
  BUG: rolling with MSVC 2017 build (pandas-dev#21813)
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
… the end (pandas-dev#21807)

Labels
Performance Memory or execution speed performance
4 participants