Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loc enhancements #22826

Merged
merged 3 commits into from
Oct 1, 2018
Merged

Loc enhancements #22826

merged 3 commits into from
Oct 1, 2018

Conversation

rtlee9
Copy link
Contributor

@rtlee9 rtlee9 commented Sep 25, 2018

Improves performance of IndexEngine.get_indexer_non_unique by using binary search when:

  • the index is monotonically increassing, and
  • the length of the iterable loc key is sufficiently small

For now I've conservatively set the loc key size threshold to 5 items -- any keys larger than this will resort to the current full index scan. It would probably make sense to increase this threshold for larger indexes, but that might require further analysis. Any feedback appreciated.

     [ab9dbd64]       [b704c5bb]
     <master>         <loc-enhancements>
-         383±4ms          211±3ms     0.55  indexing.NonNumericSeriesIndexing.time_getitem_list_like('datetime', 'nonunique_monotonic_inc')
-        59.0±2ms         11.9±1ms     0.20  indexing.CategoricalIndexIndexing.time_get_indexer_list('monotonic_incr')
-      69.4±0.6ms          445±3μs     0.01  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-      66.3±0.3ms          423±1μs     0.01  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      66.1±0.6ms          320±2μs     0.00  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      69.2±0.4ms          330±3μs     0.00  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-      65.7±0.3ms          286±3μs     0.00  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      69.3±0.5ms          295±2μs     0.00  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Use binary search instead of re-indexing if the iterable key length
is small enough
Adds benchmarks for non-unique, sorted indices in NumericSeriesIndexing
and NonNumericSeriesIndexing classes
@pep8speaks
Copy link

Hello @rtlee9! Thanks for submitting the PR.

@codecov
Copy link

codecov bot commented Sep 25, 2018

Codecov Report

Merging #22826 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22826      +/-   ##
==========================================
+ Coverage   92.18%   92.19%   +<.01%     
==========================================
  Files         169      169              
  Lines       50820    50827       +7     
==========================================
+ Hits        46850    46860      +10     
+ Misses       3970     3967       -3
Flag Coverage Δ
#multiple 90.61% <ø> (ø) ⬆️
#single 42.37% <ø> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/formats/format.py 98.35% <0%> (-0.01%) ⬇️
pandas/core/strings.py 98.63% <0%> (ø) ⬆️
pandas/core/indexes/multi.py 95.45% <0%> (ø) ⬆️
pandas/core/frame.py 97.2% <0%> (ø) ⬆️
pandas/core/generic.py 96.67% <0%> (ø) ⬆️
pandas/io/formats/style.py 96.43% <0%> (ø) ⬆️
pandas/core/series.py 93.74% <0%> (ø) ⬆️
pandas/core/nanops.py 95.14% <0%> (+0.01%) ⬆️
pandas/core/algorithms.py 94.72% <0%> (+0.02%) ⬆️
pandas/io/parsers.py 95.6% <0%> (+0.05%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ab9dbd...4ad3006. Read the comment docs.

(Int64Index, Float64Index),
('unique_monotonic_inc', 'nonunique_monotonic_inc'),
]
param_names = ['index dtype', 'index structure']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you actually try this? I am shocked if this works when it has spaces in the name, or maybe the names are just mapped by postion. in any event these need underscores

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I tried this and you can see it working in this sample output from the asv run:

[ 97.63%] ··· indexing.NumericSeriesIndexing.time_loc_list_like                                                     ok
[ 97.63%] ··· ========================================== ========================= ============
                             index dtype                      index structure                  
              ------------------------------------------ ------------------------- ------------
                pandas.core.indexes.numeric.Int64Index      unique_monotonic_inc     379±2μs   
                pandas.core.indexes.numeric.Int64Index    nonunique_monotonic_inc   69.3±0.5ms 
               pandas.core.indexes.numeric.Float64Index     unique_monotonic_inc     170±1ms   
               pandas.core.indexes.numeric.Float64Index   nonunique_monotonic_inc   65.7±0.3ms 
              ========================================== ========================= ============

I've added underscores in commit 4ad3006.

('string', 'datetime'),
('unique_monotonic_inc', 'nonunique_monotonic_inc'),
]
param_names = ['index dtype', 'index structure']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added underscores in commit 4ad3006

if val not in d:
d[val] = []
d[val].append(i)
# map each starget to its position in the index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if you drop the len(stargets) < 5 and just use it if its monotonic_increasing? does the small case actually make any difference here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you drop the len(stargets) < 5, then we'd be running a binary search against the index for each item in a potentially large set of targets -- runtime should be O(m log n) where m is the number of items in the set and n is the length of the index. Presumably, this would be slower when m is large enough compared to the current behavior which is to run through each item in the index and check if it is in the set of targets, which should be O(n) assuming constant time checks for whether an item is in the set of targets. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a case where this is true in the asv's and compare?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, in each of the following asv's stargets is either length one or length two

     [ab9dbd64]       [b704c5bb]
     <master>         <loc-enhancements>
-         383±4ms          211±3ms     0.55  indexing.NonNumericSeriesIndexing.time_getitem_list_like('datetime', 'nonunique_monotonic_inc')
-        59.0±2ms         11.9±1ms     0.20  indexing.CategoricalIndexIndexing.time_get_indexer_list('monotonic_incr')
-      69.4±0.6ms          445±3μs     0.01  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-      66.3±0.3ms          423±1μs     0.01  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      66.1±0.6ms          320±2μs     0.00  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      69.2±0.4ms          330±3μs     0.00  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-      65.7±0.3ms          286±3μs     0.00  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      69.3±0.5ms          295±2μs     0.00  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Sep 25, 2018
@jreback
Copy link
Contributor

jreback commented Sep 25, 2018

I think this would close #15364 as well (add in the whatsnew and at the top), and pls add that example if its needed

might close this #19464

@rtlee9
Copy link
Contributor Author

rtlee9 commented Sep 26, 2018

Thanks for the feedback. This PR only improves performance if the index is sorted, so I'll have to take a look at those issues in more detail later tonight or tomorrow.

@M00NSH0T
Copy link

would this also increase the speed of multi-index loc calls (on sorted multi-index dfs)?

@rtlee9
Copy link
Contributor Author

rtlee9 commented Sep 30, 2018

@jreback here's an example:

import pandas as pd
import numpy as np

# create a dataframe of length 10^7 with a non-unique, monotonically increasing index
N = 10 ** 7
repeat_loc = 362
df = pd.DataFrame(np.random.random((N, 1)))
df.index = list(range(N))
df.index = df.index.insert(item=df.index[repeat_loc], loc=repeat_loc)[:-1]

# benchmark performance
%timeit df.loc[repeat_loc]  # 225 µs
%timeit df.loc[[repeat_loc]]  # 633 ms using pandas 0.23.4, 354 µs using this commit

@rtlee9
Copy link
Contributor Author

rtlee9 commented Sep 30, 2018

My understanding is that issue #15364 isn't specific to sorted indices (based on this example), in which case this PR would not close that issue. Same thing for #19464 -- for this issue I confirmed by pulling down the asv benchmarks from their commits and found no difference in performance. Please let me know if I'm misunderstanding anything.

@jreback jreback added this to the 0.24.0 milestone Oct 1, 2018
@jreback jreback merged commit b92b043 into pandas-dev:master Oct 1, 2018
@jreback
Copy link
Contributor

jreback commented Oct 1, 2018

thanks @rtlee9 if you would like to have a look at those referenced issues to see if they are closable now would be great.

@rtlee9 rtlee9 deleted the loc-enhancements branch October 2, 2018 03:31
@rtlee9
Copy link
Contributor Author

rtlee9 commented Oct 2, 2018

No problem. I took another look at those issues and I don't think they're closable -- here's why:

My understanding is that issue #15364 isn't specific to sorted indices (based on this example), in which case this PR would not close that issue. Same thing for #19464 -- for this issue I confirmed by pulling down the asv benchmarks from their commits and found no difference in performance. Please let me know if I'm misunderstanding anything.

Please let me know if you think otherwise though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

loc very slow on sorted, non-unique index with list of labels ar argument
4 participants