Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: vectorize _interp_limit #16592

Merged
merged 3 commits into from
Jun 4, 2017

Conversation

TomAugspurger
Copy link
Contributor

xref #16584

Here are the timings vs. master

       before           after         ratio
     [ff0d1f4d]       [6b1552f1]
-      12.3±0.3ms      2.28±0.04ms     0.18  frame_methods.Interpolate.time_interpolate_some_good_infer
-      12.6±0.7ms      1.15±0.01ms     0.09  frame_methods.Interpolate.time_interpolate_some_good
-           1.30s         53.8±1ms     0.04  frame_methods.Interpolate.time_interpolate

I still need to time vs 0.20.1 (will post those timings later, have to run for a bit now). I may have one more spot to optimize.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jun 3, 2017

Also, the master and 0.20.1 versions scale poorly with limit, but that's not benchmarked anywhere. Here's some timings

limit / version 0.20.1 master interp-perf
None 42.2 ms 1.37 s 43.8 ms
50 1.45 s 1.33 s 66.2 ms
500 1.40 s 1.36 s 62.5 ms

So we're slightly slower than 0.20.1 for the default, but much faster with any limit (notice the units).

I'll add a benchmark for limit and direction. Anyone know if there's a way to run new benchmarks against old commits in asv?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jun 3, 2017

Just finished vs. 0.20.1, and they're essentially the same (I think the first one is my branch, 0.20.1 is second).

· Running 6 total benchmarks (2 commits * 1 environments * 3 benchmarks)
[  0.00%] · For pandas commit hash b2bbd1bf:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...........................................................................................
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 16.67%] ··· Running frame_methods.Interpolate.time_interpolate                                                                                  44.8±0.3ms
[ 33.33%] ··· Running frame_methods.Interpolate.time_interpolate_some_good                                                                       1.09±0.03ms
[ 50.00%] ··· Running frame_methods.Interpolate.time_interpolate_some_good_infer                                                                 2.41±0.01ms
[ 50.00%] · For pandas commit hash 20baf972:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 66.67%] ··· Running frame_methods.Interpolate.time_interpolate                                                                                  46.7±0.9ms
[ 83.33%] ··· Running frame_methods.Interpolate.time_interpolate_some_good                                                                       1.08±0.02ms
[100.00%] ··· Running frame_methods.Interpolate.time_interpolate_some_good_infer                                                                 2.44±0.07ms
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

def _interp_limit0(invalid, fw_limit, bw_limit):
"Get idx of values that won't be filled b/c they exceed the limits."
for x in np.where(invalid)[0]:
if invalid[max(0, x - fw_limit):x + bw_limit + 1].all():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should rewrite this as well

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jun 3, 2017 via email

@codecov
Copy link

codecov bot commented Jun 3, 2017

Codecov Report

Merging #16592 into master will increase coverage by 0.17%.
The diff coverage is 90.9%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16592      +/-   ##
==========================================
+ Coverage   90.75%   90.92%   +0.17%     
==========================================
  Files         161      161              
  Lines       51097    49265    -1832     
==========================================
- Hits        46372    44795    -1577     
+ Misses       4725     4470     -255
Flag Coverage Δ
#multiple 88.68% <90.9%> (+0.09%) ⬆️
#single 40.22% <9.09%> (+0.05%) ⬆️
Impacted Files Coverage Δ
pandas/core/missing.py 84.54% <90.9%> (+0.27%) ⬆️
pandas/core/computation/expressions.py 0% <0%> (-90.48%) ⬇️
pandas/io/msgpack/_version.py 44.65% <0%> (-0.22%) ⬇️
pandas/io/formats/style.py 100% <0%> (+4.07%) ⬆️
pandas/core/util/hashing.py 98.34% <0%> (+5.37%) ⬆️
pandas/io/json/json.py 100% <0%> (+9.63%) ⬆️
pandas/util/testing.py 100% <0%> (+19.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 31e67d5...b2bbd1b. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 3, 2017

Codecov Report

Merging #16592 into master will increase coverage by 0.17%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16592      +/-   ##
==========================================
+ Coverage   90.75%   90.93%   +0.17%     
==========================================
  Files         161      161              
  Lines       51097    49261    -1836     
==========================================
- Hits        46372    44794    -1578     
+ Misses       4725     4467     -258
Flag Coverage Δ
#multiple 88.69% <100%> (+0.1%) ⬆️
#single 40.22% <6.89%> (+0.05%) ⬆️
Impacted Files Coverage Δ
pandas/core/missing.py 85.25% <100%> (+0.97%) ⬆️
pandas/core/computation/expressions.py 0% <0%> (-90.48%) ⬇️
pandas/io/msgpack/_version.py 44.65% <0%> (-0.22%) ⬇️
pandas/io/formats/style.py 100% <0%> (+4.07%) ⬆️
pandas/core/util/hashing.py 98.34% <0%> (+5.37%) ⬆️
pandas/io/json/json.py 100% <0%> (+9.63%) ⬆️
pandas/util/testing.py 100% <0%> (+19.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 31e67d5...bf88fc7. Read the comment docs.

@TomAugspurger TomAugspurger added this to the 0.20.2 milestone Jun 3, 2017
@TomAugspurger TomAugspurger added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance labels Jun 3, 2017
return f_idx & b_idx


def rolling_window(a, window):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe move this inside _interp_limit to avoid confusion that this is public and/or used externally

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this all looks fine
i suspect a pure cython impl might be easier to read though (maybe even faster)

can tackle later if interest

@TomAugspurger
Copy link
Contributor Author

i suspect a pure cython impl might be easier to read though (maybe even faster)

Yeah, it's kind of unreadable, and I have no idea what that striding stuff is doing.

I'll leave this open if you want to take a look. I'll tag 0.20.2 early tomorrow morning.
We can also do the cython for 0.21

@jreback
Copy link
Contributor

jreback commented Jun 4, 2017

no this is all fine maybe make an issue for 0.21 to have a look

@TomAugspurger TomAugspurger merged commit 473615e into pandas-dev:master Jun 4, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Jun 4, 2017
* PERF: vectorize _interp_limit

* CLN: remove old implementation

* fixup! CLN: remove old implementation

(cherry picked from commit 473615e)
TomAugspurger added a commit that referenced this pull request Jun 4, 2017
* PERF: vectorize _interp_limit

* CLN: remove old implementation

* fixup! CLN: remove old implementation

(cherry picked from commit 473615e)
@TomAugspurger TomAugspurger deleted the interp-perf branch June 4, 2017 20:29
Kiv pushed a commit to Kiv/pandas that referenced this pull request Jun 11, 2017
* PERF: vectorize _interp_limit

* CLN: remove old implementation

* fixup! CLN: remove old implementation
stangirala pushed a commit to stangirala/pandas that referenced this pull request Jun 11, 2017
* PERF: vectorize _interp_limit

* CLN: remove old implementation

* fixup! CLN: remove old implementation
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Jul 12, 2017
Version 0.20.2

* tag 'v0.20.2': (68 commits)
  RLS: v0.20.2
  DOC: Update release.rst
  DOC: Whatsnew fixups (pandas-dev#16596)
  ERRR: Raise error in usecols when column doesn't exist but length matches (pandas-dev#16460)
  BUG: convert numpy strings in index names in HDF pandas-dev#13492 (pandas-dev#16444)
  PERF: vectorize _interp_limit (pandas-dev#16592)
  DOC: whatsnew 0.20.2 edits (pandas-dev#16587)
  API: Make is_strictly_monotonic_* private (pandas-dev#16576)
  BUG: reimplement MultiIndex.remove_unused_levels (pandas-dev#16565)
  Strictly monotonic (pandas-dev#16555)
  ENH: add .ngroup() method to groupby objects (pandas-dev#14026) (pandas-dev#14026)
  fix linting
  BUG: Incorrect handling of rolling.cov with offset window (pandas-dev#16244)
  BUG: select_as_multiple doesn't respect start/stop kwargs GH16209 (pandas-dev#16317)
  return empty MultiIndex for symmetrical difference on equal MultiIndexes (pandas-dev#16486)
  BUG: Bug in .resample() and .groupby() when aggregating on integers (pandas-dev#16549)
  BUG: Fixed tput output on windows (pandas-dev#16496)
  Strictly monotonic (pandas-dev#16555)
  BUG: fixed wrong order of ordered labels in pd.cut()
  BUG: Fixed to_html ignoring index_names parameter
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants