BENCH: asv csv reading benchmarks now rewind StringIO objects #21807
Conversation
Codecov Report
@@           Coverage Diff           @@
##           master   #21807   +/-  ##
=======================================
  Coverage   92.05%   92.05%
=======================================
  Files         170      170
  Lines       50708    50708
=======================================
  Hits        46677    46677
  Misses       4031     4031
Continue to review full report at Codecov.
asv_bench/benchmarks/io/csv.py
Outdated
@@ -69,6 +69,7 @@ def setup(self, infer_datetime_format, format):
         self.data = StringIO('\n'.join(rng.strftime(dt_format).tolist()))

     def time_read_csv(self, infer_datetime_format, format):
+        self.data.seek(0)
Hmm, can we instead make `data` a property which does this? Otherwise this is likely to be forgotten the next time a benchmark is written. Can you show a before/after on these benchmarks?
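For illustration, the property idea might look something like this (a minimal sketch with hypothetical names, not necessarily the code that was merged):

```python
from io import StringIO

class ReadCSVBenchSketch(object):
    # Hypothetical benchmark class sketching the reviewer's suggestion:
    # expose the buffer through a property that rewinds on every access,
    # so no individual time_* method can forget to call seek(0).

    def setup(self):
        # asv calls setup() once per repeat, not once per timed iteration
        self._data = StringIO("a,b\n1,2\n3,4\n")

    @property
    def data(self):
        self._data.seek(0)  # rewind on every access
        return self._data

bench = ReadCSVBenchSketch()
bench.setup()
first = bench.data.read()
second = bench.data.read()  # identical to first: the property rewound the buffer
```

Because the rewind lives in the property, every timed method that touches `self.data` gets a freshly rewound buffer for free.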
Hello @tylerjereddy! Thanks for updating the PR.
Comment last updated on July 26, 2018 at 20:11 UTC.
Revised accordingly, with before/after log files: Before rewinding / After rewinding.

I agree that using the property approach may alleviate future forgetting-to-rewind issues, though you'll likely notice that one of the classes where two … Most …

There's another alternative here if you prefer--setting …
asv_bench/benchmarks/io/csv.py
Outdated
class ReadCSVDInferDatetimeFormat(object):
class StringIORewind(object):

    @property
Just make this a function, I think, and pass in the StringIO object.
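That suggestion is roughly the shape of a small mixin whose method takes the StringIO object explicitly and rewinds it; a hedged sketch (names illustrative, possibly differing from the merged code):

```python
from io import StringIO

class StringIORewind(object):
    # Mixin sketch: a plain method that receives the StringIO object,
    # rewinds it, and hands it back for reading.
    def data(self, stringio_object):
        stringio_object.seek(0)
        return stringio_object

class ReadCSVSketch(StringIORewind):
    # Hypothetical benchmark class using the mixin.
    def setup(self):
        self.stringio = StringIO("x,y\n1,2\n")

    def time_read_csv(self):
        # every timed iteration reads from a freshly rewound buffer
        return self.data(self.stringio).read()

bench = ReadCSVSketch()
bench.setup()
out1 = bench.time_read_csv()
out2 = bench.time_read_csv()  # same content on every call
```

Compared with the property variant, the explicit argument makes it obvious at the call site which buffer is being rewound.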
Travis failure looks unrelated. @tylerjereddy this is almost over the finish line.
tylerjereddy force-pushed from 7d606f4 to 90d3dda:

BENCH: asv csv reading benchmarks no longer read StringIO objects off the end

* benchmarks for read_csv() now properly rewind StringIO objects prior to reading them in; previously, all iterations of an asv repeat timing run would read in no data because the StringIO object was pointing to its end after the first iteration--setup() only runs between repeats, not between iterations within a repeat of timeit
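The commit message's point--that `setup()` runs once per repeat, not once per timed iteration--mirrors how Python's own `timeit` behaves, which can be checked directly (the counters here are purely illustrative):

```python
import timeit

calls = {"setup": 0, "stmt": 0}

def fake_setup():
    # stands in for an asv benchmark's setup() method
    calls["setup"] += 1

def fake_stmt():
    # stands in for the timed statement (e.g. a read_csv call)
    calls["stmt"] += 1

# 3 repeats of 5 iterations each: setup runs 3 times, the statement runs
# 15 times, so per-iteration state (like a StringIO read position) leaks
# across the iterations inside each repeat.
timeit.repeat(fake_stmt, setup=fake_setup, repeat=3, number=5)
```

This is why rewinding must happen inside the timed method (or a helper it calls), not only in `setup()`.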
Revised / squashed / rebased / force-pushed. Hopefully better now.

thanks @tylerjereddy
* master:
  - BENCH: asv csv reading benchmarks no longer read StringIO objects off the end (pandas-dev#21807)
  - BUG: df.agg, df.transform and df.apply use different methods when axis=1 than when axis=0 (pandas-dev#21224)
  - BUG: bug in GroupBy.count where arg minlength passed to np.bincount must be None for np<1.13 (pandas-dev#21957)
  - CLN: Vbench to asv conversion script (pandas-dev#22089)
  - consistent docstring (pandas-dev#22066)
  - TST: skip pytables test with not-updated pytables conda package (pandas-dev#22099)
  - CLN: Remove Legacy MultiIndex Index Compatibility (pandas-dev#21740)
  - DOC: Reword doc for filepath_or_buffer in read_csv (pandas-dev#22058)
  - BUG: rolling with MSVC 2017 build (pandas-dev#21813)
In short, the `asv` benchmarks used for timing `read_csv` are actually reading in empty file-like objects most of the time at the moment. This is because the `setup()` method of the benchmarks is only called between repeats, not between the iterations within a repeat--see the asv writing-benchmarks docs.

You can confirm this by simply printing the size of the DataFrames you get from `read_csv` in any StringIO-dependent benchmark--the first iteration of a repeat will have the correct size, but the rest will all be 0. You could also print the memory address of the StringIO object--it will be the same after the first iteration of a repeat. I did this with @stefanv and @mattip in the NumPy context.

This can be quite confusing--a bit different from the paradigm you might expect in, e.g., a unit test context. For more gory details, see my related comment on the NumPy `loadtxt()` asv benchmarks. As noted there, one could probably avoid this by using an actual file object instead, if preferred, but realistically this just boils down to how `timeit` works. Confounding the results with the seek is unfortunate, so working with file objects may be preferred if dumping files in the bench env is acceptable.