PERF: don't call RangeIndex._data unnecessarily #26565

Merged (4 commits) on Jun 1, 2019

@topper-123 topper-123 commented May 29, 2019

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

I've looked into RangeIndex and found that it creates and caches an int64 array whenever the RangeIndex._data property is accessed. This means that in many cases a RangeIndex has the same memory consumption and the same speed as an Int64Index.

This PR improves the situation by giving RangeIndex custom .get_loc and ._format_with_header methods. This avoids calling ._data in some cases, which improves both speed and memory consumption (see the performance improvements below). There are probably other cases where RangeIndex._data can be avoided, which I'll investigate over the coming days.

>>> %timeit pd.RangeIndex(1_000_000).get_loc(900_000)
8.95 ms ± 485 µs per loop  # master
4.31 µs ± 303 ns per loop  # this PR
>>> rng = pd.RangeIndex(1_000_000)
>>> %timeit rng.get_loc(900_000)
17.3 µs ± 392 ns per loop  # master
547 ns ± 8.26 ns per loop  # this PR; get_loc is now lightning fast
>>> df = pd.DataFrame({'a': range(1_000_000)})
>>> %timeit df.loc[800_000: 900_000]
132 µs ± 5.79 µs per loop  # master
89 µs ± 2.95 µs per loop  # this PR
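
To illustrate why a lookup does not need the materialized array, here is a minimal sketch in plain Python. It is not the code from this PR: `range_get_loc` is a hypothetical helper, and the sketch only relies on Python 3's `range.index`, which finds the position of an integer arithmetically rather than by scanning.

```python
def range_get_loc(rng: range, key: int) -> int:
    """Return the position of ``key`` in ``rng`` without building an array.

    Hypothetical helper for illustration only; not the pandas implementation.
    """
    try:
        # For int keys, range.index works from start/stop/step directly.
        return rng.index(key)
    except ValueError as err:
        # Index-style lookups conventionally signal a missing label with KeyError.
        raise KeyError(key) from err


position = range_get_loc(range(1_000_000), 900_000)
stepped_position = range_get_loc(range(0, 100, 10), 30)
```

A key that falls between steps (for example 35 in `range(0, 100, 10)`) raises `KeyError` in this sketch.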

@topper-123 topper-123 force-pushed the range_index_calls_data branch from 3e29889 to 8e4c734 Compare May 29, 2019 19:52
@@ -64,6 +65,8 @@ class RangeIndex(Int64Index):
    _typ = 'rangeindex'
    _engine_type = libindex.Int64Engine

    # check whether self._data has been called
    _has_called_data = False  # type: bool

Contributor Author

This is added to check whether ._data has been called, without actually calling it.

@@ -215,6 +221,9 @@ def _format_data(self, name=None):
        # we are formatting thru the attributes
        return None

    def _format_with_header(self, header, na_rep='NaN', **kwargs):
        return header + [pprint_thing(x) for x in self._range]

Contributor Author

@topper-123 topper-123 May 29, 2019

Without this I found that reprs of small DataFrames call RangeIndex.values and therefore RangeIndex._data. This avoids that.

Contributor

could do
header + list(map(pprint_thing, self._range))
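
For reference, the two spellings discussed here produce the same list. The sketch below is a standalone illustration, not pandas code: `format_with_header` and `format_with_header_map` are hypothetical free functions, and the built-in `str` stands in for the pandas-internal `pprint_thing`.

```python
def format_with_header(header, rng, formatter=str):
    """List-comprehension form, mirroring the line in the diff."""
    return header + [formatter(x) for x in rng]


def format_with_header_map(header, rng, formatter=str):
    """map() form, mirroring the reviewer's suggestion."""
    return header + list(map(formatter, rng))


# Both iterate over the range directly, so no integer array is created.
rows = format_with_header(["idx"], range(0, 6, 2))
rows_map = format_with_header_map(["idx"], range(0, 6, 2))
```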

@topper-123 topper-123 added the labels Index (related to the Index class or subclasses) and Performance (memory or execution speed performance) May 29, 2019
@topper-123 topper-123 added this to the 0.25.0 milestone May 29, 2019
@topper-123 topper-123 changed the title from "PERF: don't call RangeIndex._data unneccesary" to "PERF: don't call RangeIndex._data unneccesarily" May 29, 2019
@topper-123 topper-123 force-pushed the range_index_calls_data branch from 8e4c734 to 7bc8655 Compare May 29, 2019 22:15

codecov bot commented May 29, 2019

Codecov Report

Merging #26565 into master will decrease coverage by 50.09%.
The diff coverage is 92.85%.

@@            Coverage Diff             @@
##           master   #26565      +/-   ##
==========================================
- Coverage   91.77%   41.68%   -50.1%     
==========================================
  Files         174      174              
  Lines       50649    50663      +14     
==========================================
- Hits        46483    21118   -25365     
- Misses       4166    29545   +25379

| Flag | Coverage Δ |
|---|---|
| #multiple | ? |
| #single | 41.68% <92.85%> (-0.08%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/indexes/range.py | 53.76% <92.85%> (-44.22%) ⬇️ |
| pandas/io/formats/latex.py | 0% <0%> (-100%) ⬇️ |
| pandas/io/sas/sas_constants.py | 0% <0%> (-100%) ⬇️ |
| pandas/core/groupby/categorical.py | 0% <0%> (-100%) ⬇️ |
| pandas/tseries/plotting.py | 0% <0%> (-100%) ⬇️ |
| pandas/tseries/converter.py | 0% <0%> (-100%) ⬇️ |
| pandas/io/formats/html.py | 0% <0%> (-99.37%) ⬇️ |
| pandas/io/sas/sas7bdat.py | 0% <0%> (-91.16%) ⬇️ |
| pandas/io/sas/sas_xport.py | 0% <0%> (-90.1%) ⬇️ |
| pandas/core/tools/numeric.py | 10.14% <0%> (-89.86%) ⬇️ |
| ... and 129 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a91da0c...7bc8655.


codecov bot commented May 29, 2019

Codecov Report

Merging #26565 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26565      +/-   ##
==========================================
- Coverage   91.84%   91.84%   -0.01%     
==========================================
  Files         174      174              
  Lines       50644    50660      +16     
==========================================
+ Hits        46516    46527      +11     
- Misses       4128     4133       +5

| Flag | Coverage Δ |
|---|---|
| #multiple | 90.38% <100%> (ø) ⬆️ |
| #single | 41.71% <100%> (-0.09%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/indexes/range.py | 98.06% <100%> (+0.08%) ⬆️ |
| pandas/io/gbq.py | 78.94% <0%> (-10.53%) ⬇️ |
| pandas/core/frame.py | 97% <0%> (-0.12%) ⬇️ |
| pandas/util/testing.py | 90.81% <0%> (-0.11%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7f31865...ff805da.

@topper-123 topper-123 force-pushed the range_index_calls_data branch 2 times, most recently from 6e71708 to a5cad77 Compare May 29, 2019 23:05

pep8speaks commented May 29, 2019

Hello @topper-123! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-01 16:07:40 UTC

@topper-123 topper-123 force-pushed the range_index_calls_data branch from a5cad77 to a293738 Compare May 29, 2019 23:09
@@ -164,6 +168,8 @@ def _simple_new(cls, start, stop=None, step=None, name=None,
        for k, v in kwargs.items():
            setattr(result, k, v)

        result._range = range(result._start, result._stop, result._step)

Contributor

we could actually remove the _start, _stop, _step properties as well?

Contributor Author

@topper-123 topper-123 May 30, 2019

Yes, I'm planning to do that in an upcoming PR.

Python 3's range supports slicing, which Python 2's xrange didn't, so this refactoring will also allow dropping the custom slicing operations in RangeIndex.
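
As a quick illustration of the slicing behaviour mentioned above (plain Python 3, independent of pandas): slicing a `range` returns another `range` with adjusted start, stop and step, so no elements are materialized.

```python
rng = range(0, 100, 10)

# Slicing yields a new range object rather than a list.
sliced = rng[2:5]          # range(20, 50, 10)
reversed_rng = rng[::-1]   # range(90, -10, -10)
every_other = rng[::2]     # range(0, 100, 20)

# The underlying values are only produced when the range is iterated.
sliced_values = list(sliced)
```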

# Calling RangeIndex._data caches an array of the same length.
# This tests whether RangeIndex._data has been called when calling methods
idx = RangeIndex(0, 100, 10)
assert idx._has_called_data is False

Contributor

I would suggest that you monkeypatch the class here, a bit cleaner as the code then doesn't have this attribute

Contributor Author

@topper-123 topper-123 May 31, 2019

The _data attribute is a property (previously cache_readonly), and in neither case is it technically possible to monkey-patch _data dynamically. I could subclass RangeIndex and add a new property, but I'm not sure that's better than this.
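
For comparison, the subclassing alternative mentioned above could look roughly like the sketch below. Everything in it is hypothetical: `ToyRangeIndex` is a tiny stand-in for RangeIndex, not pandas code, and `TrackingRangeIndex` overrides the `_data` property only to record that it was accessed.

```python
class ToyRangeIndex:
    """Hypothetical stand-in for RangeIndex; not the pandas class."""

    def __init__(self, start, stop, step=1):
        self._range = range(start, stop, step)

    @property
    def _data(self):
        # Materializes all values; this is the expensive path to avoid.
        return list(self._range)

    def get_loc(self, key):
        # Uses the range directly, so _data is never touched.
        return self._range.index(key)


class TrackingRangeIndex(ToyRangeIndex):
    """Test-only subclass that records whether _data was accessed."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.data_accessed = False

    @property
    def _data(self):
        self.data_accessed = True
        return super()._data


idx = TrackingRangeIndex(0, 100, 10)
loc = idx.get_loc(30)
accessed_after_get_loc = idx.data_accessed   # get_loc did not need _data
_ = idx._data
accessed_after_data = idx.data_accessed      # explicit access is recorded
```

The trade-off is the one raised in the thread: the tracking lives in test code rather than as an attribute on the real class, at the cost of testing a subclass instead of the class itself.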

@topper-123 topper-123 changed the title from "PERF: don't call RangeIndex._data unneccesarily" to "PERF: don't call RangeIndex._data unnecessarily" May 30, 2019
@topper-123 topper-123 force-pushed the range_index_calls_data branch 3 times, most recently from 8f65498 to ff805da Compare May 31, 2019 14:27
@codecov-io

Codecov Report

Merging #26565 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26565      +/-   ##
==========================================
- Coverage   91.84%   91.84%   -0.01%     
==========================================
  Files         174      174              
  Lines       50644    50659      +15     
==========================================
+ Hits        46516    46527      +11     
- Misses       4128     4132       +4

| Flag | Coverage Δ |
|---|---|
| #multiple | 90.38% <100%> (ø) ⬆️ |
| #single | 41.73% <100%> (-0.07%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/indexes/range.py | 98.05% <100%> (+0.08%) ⬆️ |
| pandas/io/gbq.py | 78.94% <0%> (-10.53%) ⬇️ |
| pandas/core/frame.py | 97% <0%> (-0.12%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7f31865...ff805da.


codecov-io commented May 31, 2019

Codecov Report

Merging #26565 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26565      +/-   ##
==========================================
- Coverage   91.85%   91.85%   -0.01%     
==========================================
  Files         174      174              
  Lines       50707    50722      +15     
==========================================
+ Hits        46578    46589      +11     
- Misses       4129     4133       +4

| Flag | Coverage Δ |
|---|---|
| #multiple | 90.39% <100%> (ø) ⬆️ |
| #single | 41.78% <100%> (-0.07%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/indexes/range.py | 98.05% <100%> (+0.08%) ⬆️ |
| pandas/io/gbq.py | 78.94% <0%> (-10.53%) ⬇️ |
| pandas/core/frame.py | 97% <0%> (-0.12%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0dbb99e...c72758b.

@topper-123 topper-123 force-pushed the range_index_calls_data branch from ff805da to 618f63f Compare May 31, 2019 15:29
Contributor

@jreback jreback left a comment

small comment, merge on green.

def _data(self):
return np.arange(self._start, self._stop, self._step, dtype=np.int64)
if self._cached_data is None:
Contributor

can you give this a doc-string (e.g. that cached_data is actually an int array and is constructed only if necessary, for performance reasons)
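
A documented lazy property along these lines might look like the following sketch. It is an assumption-laden illustration, not the merged code: `LazyRange` is a hypothetical class, and the standard library's `array("q", ...)` (signed 64-bit integers) stands in for the `np.arange(..., dtype=np.int64)` call shown in the diff.

```python
from array import array


class LazyRange:
    """Hypothetical stand-in showing the lazy-caching pattern."""

    def __init__(self, start, stop, step=1):
        self._range = range(start, stop, step)
        self._cached_data = None  # nothing is materialized up front

    @property
    def _data(self):
        """
        A 64-bit integer array of the values in the range.

        The array is constructed only when first needed and is cached
        afterwards; avoiding this property keeps memory use low.
        """
        if self._cached_data is None:
            self._cached_data = array("q", self._range)
        return self._cached_data


lazy = LazyRange(0, 10, 2)
cache_before = lazy._cached_data   # still None: no array built yet
values = list(lazy._data)          # first access builds and caches the array
same_object = lazy._data is lazy._data
```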

@topper-123 topper-123 force-pushed the range_index_calls_data branch 2 times, most recently from 6e037ac to 61e93e5 Compare June 1, 2019 15:13