Loc enhancements #22826
Use binary search instead of re-indexing if the iterable key length is small enough
Adds benchmarks for non-unique, sorted indices in NumericSeriesIndexing and NonNumericSeriesIndexing classes
Hello @rtlee9! Thanks for submitting the PR.
Codecov Report
@@ Coverage Diff @@
## master #22826 +/- ##
==========================================
+ Coverage 92.18% 92.19% +<.01%
==========================================
Files 169 169
Lines 50820 50827 +7
==========================================
+ Hits 46850 46860 +10
+ Misses 3970 3967 -3
asv_bench/benchmarks/indexing.py
(Int64Index, Float64Index),
('unique_monotonic_inc', 'nonunique_monotonic_inc'),
]
param_names = ['index dtype', 'index structure']
Did you actually try this? I am shocked if this works with spaces in the names, or maybe the names are just mapped by position. In any event, these need underscores.
Yeah I tried this and you can see it working in this sample output from the asv run:
[ 97.63%] ··· indexing.NumericSeriesIndexing.time_loc_list_like ok
[ 97.63%] ··· ========================================== ========================= ============
              index dtype                                index structure
              ------------------------------------------ ------------------------- ------------
              pandas.core.indexes.numeric.Int64Index     unique_monotonic_inc      379±2μs
              pandas.core.indexes.numeric.Int64Index     nonunique_monotonic_inc   69.3±0.5ms
              pandas.core.indexes.numeric.Float64Index   unique_monotonic_inc      170±1ms
              pandas.core.indexes.numeric.Float64Index   nonunique_monotonic_inc   65.7±0.3ms
              ========================================== ========================= ============
I've added underscores in commit 4ad3006.
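For readers unfamiliar with how asv parametrizes a benchmark, here is a minimal, standard-library-only sketch of the pattern under review. The class name, the `"int"`/`"float"` stand-ins for the index classes, and the way the duplicate label is built are all hypothetical and are not taken from `asv_bench/benchmarks/indexing.py`; only the `params`/`param_names` convention and the underscore naming come from the discussion above.

```python
import itertools


class SeriesLocListLike:
    # asv runs the benchmark once per combination of these parameter values.
    params = [
        ("int", "float"),  # hypothetical stand-ins for Int64Index / Float64Index
        ("unique_monotonic_inc", "nonunique_monotonic_inc"),
    ]
    # One name per parameter axis, written with underscores as requested in review.
    param_names = ["index_dtype", "index_structure"]

    def setup(self, index_dtype, index_structure):
        n = 1000
        cast = int if index_dtype == "int" else float
        keys = [cast(i) for i in range(n)]
        if index_structure == "nonunique_monotonic_inc":
            # Repeat one label while keeping the keys sorted and the length fixed.
            keys = keys[:55] + [keys[54]] + keys[55:-1]
        self.keys = keys
        self.targets = [keys[10], keys[20]]

    def time_loc_list_like(self, index_dtype, index_structure):
        # Plain-Python stand-in for a list-like .loc lookup (full scan).
        wanted = set(self.targets)
        [i for i, k in enumerate(self.keys) if k in wanted]


# Walk the parameter grid by hand, roughly as the asv runner would.
for dtype, structure in itertools.product(*SeriesLocListLike.params):
    bench = SeriesLocListLike()
    bench.setup(dtype, structure)
    bench.time_loc_list_like(dtype, structure)
```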
asv_bench/benchmarks/indexing.py
('string', 'datetime'),
('unique_monotonic_inc', 'nonunique_monotonic_inc'),
]
param_names = ['index dtype', 'index structure']
same
added underscores in commit 4ad3006
if val not in d:
    d[val] = []
d[val].append(i)
# map each starget to its position in the index
What if you drop the len(stargets) < 5 check and just use it if it's monotonic_increasing? Does the small case actually make any difference here?
If you drop the len(stargets) < 5 check, then we'd be running a binary search against the index for each item in a potentially large set of targets. Runtime should be O(m log n), where m is the number of items in the set and n is the length of the index. Presumably, when m is large enough, this would be slower than the current behavior, which is to run through each item in the index and check whether it is in the set of targets; that should be O(n), assuming constant-time checks for whether an item is in the set of targets. Thoughts?
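The trade-off can be made concrete with a rough operation count. This is only a back-of-the-envelope sketch: it assumes one unit of work per comparison or set lookup and ignores constant factors, so the crossover it computes is illustrative rather than a measured threshold. The function names are hypothetical.

```python
import math


def binary_search_cost(m: int, n: int) -> float:
    """Approximate comparisons for m binary searches over n sorted labels."""
    return m * math.log2(n)


def full_scan_cost(n: int) -> float:
    """Approximate set-membership checks for one pass over n labels."""
    return float(n)


def crossover(n: int) -> int:
    """Smallest m for which m binary searches cost at least one full scan."""
    return math.ceil(n / math.log2(n))


n = 10 ** 7
print(binary_search_cost(5, n), full_scan_cost(n))
print(crossover(n))
```

Under this model a handful of targets is far cheaper to look up by binary search on a large index, while the scan only wins once m approaches n / log2(n).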
can you add a case where this is true in the asv's and compare?
Yes, in each of the following asv's stargets is either length one or length two:
  [ab9dbd64]    [b704c5bb]
  <master>      <loc-enhancements>
- 383±4ms       211±3ms    0.55  indexing.NonNumericSeriesIndexing.time_getitem_list_like('datetime', 'nonunique_monotonic_inc')
- 59.0±2ms      11.9±1ms   0.20  indexing.CategoricalIndexIndexing.time_get_indexer_list('monotonic_incr')
- 69.4±0.6ms    445±3μs    0.01  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
- 66.3±0.3ms    423±1μs    0.01  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
- 66.1±0.6ms    320±2μs    0.00  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
- 69.2±0.4ms    330±3μs    0.00  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
- 65.7±0.3ms    286±3μs    0.00  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
- 69.3±0.5ms    295±2μs    0.00  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
Thanks for the feedback. This PR only improves performance if the index is sorted, so I'll have to take a look at those issues in more detail later tonight or tomorrow.
Would this also increase the speed of multi-index loc calls (on sorted multi-index dfs)?
@jreback here's an example:
import pandas as pd
import numpy as np
# create a dataframe of length 10^7 with a non-unique, monotonically increasing index
N = 10 ** 7
repeat_loc = 362
df = pd.DataFrame(np.random.random((N, 1)))
df.index = list(range(N))
df.index = df.index.insert(item=df.index[repeat_loc], loc=repeat_loc)[:-1]
# benchmark performance
%timeit df.loc[repeat_loc] # 225 µs
%timeit df.loc[[repeat_loc]] # 633 ms using pandas 0.23.4, 354 µs using this commit
My understanding is that issue #15364 isn't specific to sorted indices (based on this example), in which case this PR would not close that issue. Same thing for #19464 -- for this issue I confirmed by pulling down the asv benchmarks from their commits and found no difference in performance. Please let me know if I'm misunderstanding anything.
Thanks @rtlee9. If you would like to have a look at those referenced issues to see if they are closable now, that would be great.
No problem. I took another look at those issues and I don't think they're closable -- here's why:
Please let me know if you think otherwise though.
git diff upstream/master -u -- "*.py" | flake8 --diff
Improves performance of IndexEngine.get_indexer_non_unique by using binary search when:
For now I've conservatively set the loc key size threshold to 5 items -- any keys larger than this will fall back to the current full index scan. It would probably make sense to increase this threshold for larger indexes, but that might require further analysis. Any feedback appreciated.
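To illustrate the dispatch being proposed, here is a minimal pure-Python sketch. It is not the Cython code in `IndexEngine.get_indexer_non_unique`; the function names and the `SMALL_KEY_THRESHOLD` constant are hypothetical, and the only details taken from the discussion are the `len(stargets) < 5` cut-off, the monotonic-increasing requirement, and the fallback to a full scan.

```python
from bisect import bisect_left, bisect_right

SMALL_KEY_THRESHOLD = 5  # hypothetical name for the conservative cut-off of 5


def positions_by_scan(index, stargets):
    """Full scan: one O(n) pass with O(1) set-membership checks."""
    found = {}
    for i, val in enumerate(index):
        if val in stargets:
            found.setdefault(val, []).append(i)
    return found


def positions_by_bisect(sorted_index, stargets):
    """Binary search: two O(log n) probes per target on a sorted index."""
    found = {}
    for target in stargets:
        lo = bisect_left(sorted_index, target)
        hi = bisect_right(sorted_index, target)
        if lo < hi:  # the target occurs at positions lo .. hi - 1
            found[target] = list(range(lo, hi))
    return found


def positions(index, targets, is_monotonic_increasing):
    """Map each target present in the index to all of its positions."""
    stargets = set(targets)
    if is_monotonic_increasing and len(stargets) < SMALL_KEY_THRESHOLD:
        return positions_by_bisect(index, stargets)
    return positions_by_scan(index, stargets)


index = [0, 1, 2, 2, 3, 5]  # sorted, with one repeated label
print(positions(index, [2, 4], is_monotonic_increasing=True))
```

Both paths return the same mapping on a sorted index, so the threshold only affects speed, which is why it can be tuned later without changing results.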