PERF: don't call RangeIndex._data unnecessarily #26565

topper-123 · 2019-05-29T19:50:55Z

closes #xxxx
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

I've looked into RangeIndex and found that the index type creates and caches an int64 array if/when the RangeIndex._data property is accessed. This basically means that in many cases, a RangeIndex has the same memory consumption and the same speed as an Int64Index.
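
To make the memory argument concrete, here is a rough sketch using only the standard library. `array("q", ...)` stands in for the cached int64 NumPy array (an assumption made to keep the example self-contained; pandas itself uses NumPy), and the exact `sys.getsizeof` value is specific to the Python implementation.

```python
import sys
from array import array

rng = range(1_000_000)

# A range object stores only start, stop and step, so its size does not
# grow with the number of elements it represents.
range_bytes = sys.getsizeof(rng)

# Materializing the values as 8-byte integers (stand-in for an int64
# array) costs 8 bytes per element.
materialized = array("q", rng)
data_bytes = materialized.itemsize * len(materialized)

print(range_bytes, data_bytes)
```

Once the array has been built and cached, the index carries that per-element cost for the rest of its lifetime, which is why avoiding the first access matters.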

This PR improves on that situation by giving RangeIndex custom .get_loc and ._format_with_header methods. These avoid calling ._data in some cases, which improves both speed and memory consumption (see the performance improvements below). There are probably other cases where RangeIndex._data can be avoided, which I'll investigate over the coming days.

>>> %timeit pd.RangeIndex(1_000_000).get_loc(900_000)
8.95 ms ± 485 µs per loop  # master
4.31 µs ± 303 ns per loop  # this PR
>>> rng = pd.RangeIndex(1_000_000)
>>> %timeit rng.get_loc(900_000)
17.3 µs ± 392 ns per loop  # master
547 ns ± 8.26 ns per loop  # this PR. get_loc is now lightning fast
>>> df = pd.DataFrame({'a': range(1_000_000)})
>>> %timeit df.loc[800_000: 900_000]
132 µs ± 5.79 µs per loop  # master
89 µs ± 2.95 µs per loop  # this PR
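
The get_loc speedup comes from answering the lookup from start/stop/step instead of materializing an array first. A minimal sketch of that idea, using the built-in `range.index`; `range_get_loc` is a hypothetical helper for illustration and not necessarily the code in this PR:

```python
def range_get_loc(rng: range, key: int) -> int:
    """Return the position of ``key`` in ``rng`` without building an array."""
    try:
        # For integer keys, range.index derives the position from
        # start/stop/step rather than scanning the values.
        return rng.index(key)
    except ValueError:
        # Follow the index convention of raising KeyError for a missing label.
        raise KeyError(key) from None


print(range_get_loc(range(1_000_000), 900_000))
print(range_get_loc(range(0, 100, 10), 30))
```

A key that falls between steps, such as 35 in `range(0, 100, 10)`, raises `KeyError` in this sketch.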

topper-123 · 2019-05-29T19:53:07Z

pandas/core/indexes/range.py

@@ -64,6 +65,8 @@ class RangeIndex(Int64Index):
    _typ = 'rangeindex'
    _engine_type = libindex.Int64Engine

+    # check whether self._data has been called
+    _has_called_data = False  # type: bool


This is added to check whether ._data has been called, without actually calling it.

topper-123 · 2019-05-29T19:54:51Z

pandas/core/indexes/range.py

@@ -215,6 +221,9 @@ def _format_data(self, name=None):
        # we are formatting thru the attributes
        return None

+    def _format_with_header(self, header, na_rep='NaN', **kwargs):
+        return header + [pprint_thing(x) for x in self._range]


Without this I found that reprs of small DataFrames call RangeIndex.values and therefore RangeIndex._data. This avoids that.

codecov · 2019-05-29T22:15:24Z

Codecov Report

Merging #26565 into master will decrease coverage by 50.09%.
The diff coverage is 92.85%.

@@            Coverage Diff             @@
##           master   #26565      +/-   ##
==========================================
- Coverage   91.77%   41.68%   -50.1%     
==========================================
  Files         174      174              
  Lines       50649    50663      +14     
==========================================
- Hits        46483    21118   -25365     
- Misses       4166    29545   +25379

Flag	Coverage Δ
#multiple	`?`
#single	`41.68% <92.85%> (-0.08%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/range.py	`53.76% <92.85%> (-44.22%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.37%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.16%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.1%)`	⬇️
pandas/core/tools/numeric.py	`10.14% <0%> (-89.86%)`	⬇️
... and 129 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a91da0c...7bc8655. Read the comment docs.

codecov · 2019-05-29T22:15:29Z

Codecov Report

Merging #26565 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26565      +/-   ##
==========================================
- Coverage   91.84%   91.84%   -0.01%     
==========================================
  Files         174      174              
  Lines       50644    50660      +16     
==========================================
+ Hits        46516    46527      +11     
- Misses       4128     4133       +5

Flag	Coverage Δ
#multiple	`90.38% <100%> (ø)`	⬆️
#single	`41.71% <100%> (-0.09%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/range.py	`98.06% <100%> (+0.08%)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`97% <0%> (-0.12%)`	⬇️
pandas/util/testing.py	`90.81% <0%> (-0.11%)`	⬇️

Powered by Codecov. Last update 7f31865...ff805da. Read the comment docs.

pep8speaks · 2019-05-29T23:05:03Z

Hello @topper-123! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-01 16:07:40 UTC

jreback · 2019-05-30T01:09:44Z

pandas/core/indexes/range.py

@@ -164,6 +168,8 @@ def _simple_new(cls, start, stop=None, step=None, name=None,
        for k, v in kwargs.items():
            setattr(result, k, v)

+        result._range = range(result._start, result._stop, result._step)


we could actually remove the _start, _stop, _step properties as well?

Yes, I'm planning to do that in an upcoming PR.

Python 3's range supports slicing, which Python 2's xrange didn't, so this refactoring will also allow dropping the custom slicing operations in RangeIndex.
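
For reference, a short sketch of the built-in behavior being relied on here: slicing a `range` yields another `range`, so start, stop and step are recomputed without materializing any values.

```python
rng = range(0, 100, 10)

# Slicing a range returns another range; no values are materialized.
sliced = rng[2:5]
print(sliced)        # range(20, 50, 10)
print(list(sliced))  # [20, 30, 40]

# Negative steps work as well.
reversed_rng = rng[::-1]
print(reversed_rng)  # range(90, -10, -10)
```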

jreback · 2019-05-30T01:10:24Z

pandas/core/indexes/range.py

@@ -215,6 +221,9 @@ def _format_data(self, name=None):
        # we are formatting thru the attributes
        return None

+    def _format_with_header(self, header, na_rep='NaN', **kwargs):
+        return header + [pprint_thing(x) for x in self._range]


could do
header + list(map(pprint_thing, self._range))
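
Both spellings iterate over the range object directly and build the same list. A quick sketch, with `str` standing in for pandas' `pprint_thing` (an assumption to keep this self-contained) and a made-up header:

```python
rng = range(0, 30, 10)
header = ["idx"]  # hypothetical header, for illustration only

# List-comprehension form, as in the diff.
with_comprehension = header + [str(x) for x in rng]

# map() form, as suggested in the review comment.
with_map = header + list(map(str, rng))

print(with_comprehension)  # ['idx', '0', '10', '20']
```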

jreback · 2019-05-30T01:12:37Z

pandas/tests/indexes/test_range.py

+        # Calling RangeIndex._data caches an array of the same length.
+        # This tests whether RangeIndex._data has been called by doing methods
+        idx = RangeIndex(0, 100, 10)
+        assert idx._has_called_data is False


I would suggest that you monkeypatch the class here; that's a bit cleaner, as the code then doesn't need this attribute.

The _data attribute is a property (previously cache_readonly), and in neither case is it technically possible to dynamically monkey-patch _data. I could subclass RangeIndex and add a new property, but I'm not sure that's better than this?
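
The subclassing alternative mentioned in this reply could look roughly like the sketch below. `LazyRange` is a hypothetical stand-in for RangeIndex (not pandas code), and `SpyLazyRange` is a test-only subclass whose overriding property records each access.

```python
class LazyRange:
    """Hypothetical stand-in for an index backed by a range."""

    def __init__(self, start, stop, step=1):
        self._range = range(start, stop, step)

    @property
    def _data(self):
        # Expensive path: materializes every value.
        return list(self._range)

    def get_loc(self, key):
        # Cheap path: answered from the range alone.
        return self._range.index(key)


class SpyLazyRange(LazyRange):
    """Test-only subclass that counts how often ``_data`` is read."""

    data_calls = 0

    @property
    def _data(self):
        self.data_calls += 1
        return super()._data


idx = SpyLazyRange(0, 100, 10)
idx.get_loc(20)
assert idx.data_calls == 0  # get_loc did not touch _data
```

The trade-off is that the test then exercises a subclass rather than the class itself, whereas the flag approach in the diff keeps the check on the real class at the cost of an extra attribute.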

codecov-io · 2019-05-31T15:17:43Z

Codecov Report

Merging #26565 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26565      +/-   ##
==========================================
- Coverage   91.84%   91.84%   -0.01%     
==========================================
  Files         174      174              
  Lines       50644    50659      +15     
==========================================
+ Hits        46516    46527      +11     
- Misses       4128     4132       +4

Flag	Coverage Δ
#multiple	`90.38% <100%> (ø)`	⬆️
#single	`41.73% <100%> (-0.07%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/range.py	`98.05% <100%> (+0.08%)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`97% <0%> (-0.12%)`	⬇️

Powered by Codecov. Last update 7f31865...ff805da. Read the comment docs.

codecov-io · 2019-05-31T15:17:46Z

Codecov Report

Merging #26565 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26565      +/-   ##
==========================================
- Coverage   91.85%   91.85%   -0.01%     
==========================================
  Files         174      174              
  Lines       50707    50722      +15     
==========================================
+ Hits        46578    46589      +11     
- Misses       4129     4133       +4

Flag	Coverage Δ
#multiple	`90.39% <100%> (ø)`	⬆️
#single	`41.78% <100%> (-0.07%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/range.py	`98.05% <100%> (+0.08%)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`97% <0%> (-0.12%)`	⬇️

Powered by Codecov. Last update 0dbb99e...c72758b. Read the comment docs.

jreback

small comment, merge on green.

jreback · 2019-06-01T14:41:32Z

pandas/core/indexes/range.py

    def _data(self):
-        return np.arange(self._start, self._stop, self._step, dtype=np.int64)
+        if self._cached_data is None:


Can you give this a doc-string (e.g. that cached_data is actually an int array and is constructed only if necessary, for performance reasons)?
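
A minimal sketch of the lazily cached property with the kind of doc-string requested here. `LazyRange` is a hypothetical stand-in class, and `array("q", ...)` replaces `np.arange(..., dtype=np.int64)` so the example runs with the standard library alone; the actual wording and code merged in the PR may differ.

```python
from array import array


class LazyRange:
    """Hypothetical stand-in for RangeIndex, for illustration only."""

    def __init__(self, start, stop, step=1):
        self._range = range(start, stop, step)
        self._cached_data = None

    @property
    def _data(self):
        """
        Integer array of the values in the range.

        The array is constructed only on first access, for performance
        reasons, and is cached for later use.
        """
        if self._cached_data is None:
            # array("q") stands in for an int64 NumPy array here.
            self._cached_data = array("q", self._range)
        return self._cached_data


idx = LazyRange(0, 100, 10)
print(idx._cached_data is None)  # True: nothing materialized yet
first = idx._data
print(idx._data is first)        # True: the cached array is reused
```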

topper-123 force-pushed the range_index_calls_data branch from 3e29889 to 8e4c734 Compare May 29, 2019 19:52

topper-123 commented May 29, 2019

topper-123 added the Index (Related to the Index class or subclasses) and Performance (Memory or execution speed performance) labels May 29, 2019

topper-123 added this to the 0.25.0 milestone May 29, 2019

topper-123 changed the title ~~PERF: don't call RangeIndex._data unneccesary~~ PERF: don't call RangeIndex._data unneccesarily May 29, 2019

topper-123 force-pushed the range_index_calls_data branch from 8e4c734 to 7bc8655 Compare May 29, 2019 22:15

topper-123 force-pushed the range_index_calls_data branch 2 times, most recently from 6e71708 to a5cad77 Compare May 29, 2019 23:05

topper-123 force-pushed the range_index_calls_data branch from a5cad77 to a293738 Compare May 29, 2019 23:09

jreback requested changes May 30, 2019

topper-123 changed the title ~~PERF: don't call RangeIndex._data unneccesarily~~ PERF: don't call RangeIndex._data unnecessarily May 30, 2019

This was referenced May 30, 2019

CLN: use RangeIndex._range instead of RangeIndex._start etc. #26578

Closed

use range in RangeIndex instead of _start etc. #26581

Merged

topper-123 force-pushed the range_index_calls_data branch 3 times, most recently from 8f65498 to ff805da Compare May 31, 2019 14:27

topper-123 force-pushed the range_index_calls_data branch from ff805da to 618f63f Compare May 31, 2019 15:29

jreback approved these changes Jun 1, 2019

topper-123 added 3 commits June 1, 2019 16:51

PERF: don't call RangeIndex._data unneccesary

eadbdf3

guard against invalid key

803a97d

changes

a74db3f

topper-123 force-pushed the range_index_calls_data branch 2 times, most recently from 6e037ac to 61e93e5 Compare June 1, 2019 15:13

Doc string changes

c72758b

topper-123 force-pushed the range_index_calls_data branch from 61e93e5 to c72758b Compare June 1, 2019 16:07

topper-123 merged commit 437efa6 into pandas-dev:master Jun 1, 2019

topper-123 deleted the range_index_calls_data branch June 1, 2019 17:04

topper-123 mentioned this pull request Jun 2, 2019

PERF: custom ops for RangeIndex.[all|any|__contains__] #26617

Merged

vaibhavhrt pushed a commit to vaibhavhrt/pandas that referenced this pull request Jun 6, 2019

PERF: don't call RangeIndex._data unnecessarily (pandas-dev#26565)

68c6766

topper-123 mentioned this pull request Jun 6, 2019

PERF: use python int in RangeIndex.get_loc #26697

Merged

topper-123 mentioned this pull request Jul 28, 2020

CLN/PERF: move RangeIndex._cached_data to RangeIndex._cache #35432

Merged

topper-123 mentioned this pull request Aug 11, 2020

PERF: make RangeIndex iterate over ._range #35676

Merged
