-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: don't call RangeIndex._data unnecessarily #26565
PERF: don't call RangeIndex._data unnecessarily #26565
Conversation
3e29889
to
8e4c734
Compare
pandas/core/indexes/range.py
Outdated
@@ -64,6 +65,8 @@ class RangeIndex(Int64Index): | |||
_typ = 'rangeindex' | |||
_engine_type = libindex.Int64Engine | |||
|
|||
# check whether self._data has benn called | |||
_has_called_data = False # type: bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is added to check if ._data
has been called, without actually calling it..
pandas/core/indexes/range.py
Outdated
@@ -215,6 +221,9 @@ def _format_data(self, name=None): | |||
# we are formatting thru the attributes | |||
return None | |||
|
|||
def _format_with_header(self, header, na_rep='NaN', **kwargs): | |||
return header + [pprint_thing(x) for x in self._range] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this I found that reprs of small DataFrames call RangeIndex.values
and therefore RangeIndex._data
. This avoids that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could do
header + list(map(pprint_thing, self._range))
8e4c734
to
7bc8655
Compare
Codecov Report
@@ Coverage Diff @@
## master #26565 +/- ##
==========================================
- Coverage 91.77% 41.68% -50.1%
==========================================
Files 174 174
Lines 50649 50663 +14
==========================================
- Hits 46483 21118 -25365
- Misses 4166 29545 +25379
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26565 +/- ##
==========================================
- Coverage 91.84% 91.84% -0.01%
==========================================
Files 174 174
Lines 50644 50660 +16
==========================================
+ Hits 46516 46527 +11
- Misses 4128 4133 +5
Continue to review full report at Codecov.
|
6e71708
to
a5cad77
Compare
Hello @topper-123! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-06-01 16:07:40 UTC |
a5cad77
to
a293738
Compare
@@ -164,6 +168,8 @@ def _simple_new(cls, start, stop=None, step=None, name=None, | |||
for k, v in kwargs.items(): | |||
setattr(result, k, v) | |||
|
|||
result._range = range(result._start, result._stop, result._step) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could actually remove the _start, _stop, _step properties as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm planning to do that in an upcoming PR.
Python3's range
accepts slicing, which Python2's xrange
didn't, so this refactoring will also allow dropping doing custom slicing operations in RangeIndex.
pandas/core/indexes/range.py
Outdated
@@ -215,6 +221,9 @@ def _format_data(self, name=None): | |||
# we are formatting thru the attributes | |||
return None | |||
|
|||
def _format_with_header(self, header, na_rep='NaN', **kwargs): | |||
return header + [pprint_thing(x) for x in self._range] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could do
header + list(map(pprint_thing, self._range))
pandas/tests/indexes/test_range.py
Outdated
# Calling RangeIndex._data caches a array of the same length. | ||
# This tests whether RangeIndex._data has been called by doing methods | ||
idx = RangeIndex(0, 100, 10) | ||
assert idx._has_called_data is False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest that you monkeypatch the class here, a bit cleaner as the code then doesn't have this attribute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _data
attribute is a property (previously cache_readonly
) and in neither cases is it technically possible to dynamically monkey-patch _data
. I could subclass RangeIndex
and add a new property, but not sure if that's better than this?
8f65498
to
ff805da
Compare
Codecov Report
@@ Coverage Diff @@
## master #26565 +/- ##
==========================================
- Coverage 91.84% 91.84% -0.01%
==========================================
Files 174 174
Lines 50644 50659 +15
==========================================
+ Hits 46516 46527 +11
- Misses 4128 4132 +4
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26565 +/- ##
==========================================
- Coverage 91.85% 91.85% -0.01%
==========================================
Files 174 174
Lines 50707 50722 +15
==========================================
+ Hits 46578 46589 +11
- Misses 4129 4133 +4
Continue to review full report at Codecov.
|
ff805da
to
618f63f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comment, merge on green.
def _data(self): | ||
return np.arange(self._start, self._stop, self._step, dtype=np.int64) | ||
if self._cached_data is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you give this a doc-string (e.g. that cached_data is actually an int array and be constructed only if necessary for performance reasons
6e037ac
to
61e93e5
Compare
61e93e5
to
c72758b
Compare
git diff upstream/master -u -- "*.py" | flake8 --diff
I've looked into
RangeIndex
and found that the index type creates and caches a int64 array if/whenRangeIndex._data
property is being called. This basically means that in many cases, aRangeIndex
has the same memory consumption and the same speed as anInt64Index
.This PR improves on that situation by giving
RangeIndex
custom.get_loc
and._format_with_header
methods. This avoids the calls to._data
in some cases, which helps on the speed and memory consumption (see performance improvements below). There are probably other case whereRangeIndex._data
can be avoided, which I'll investigate over the coming days.