-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix Series.get() for ExtensionArray and Categorical #20885
Conversation
Taking a look now. |
OK, so something that confuses me here. I initially hacked this in with a That's mainly because this line failed:
diff --git a/pandas/core/indexes/base.py b/pandas/core/indexes/base.py
index 2ceec1592..b5b175c11 100644
--- a/pandas/core/indexes/base.py
+++ b/pandas/core/indexes/base.py
@@ -36,6 +36,7 @@ from pandas.core.dtypes.common import (
is_period_dtype,
is_bool,
is_bool_dtype,
+ is_extension_array_dtype,
is_signed_integer_dtype,
is_unsigned_integer_dtype,
is_integer_dtype, is_float_dtype,
@@ -3068,21 +3069,21 @@ class Index(IndexOpsMixin, PandasObject):
# if we have something that is Index-like, then
# use this, e.g. DatetimeIndex
s = getattr(series, '_values', None)
- if isinstance(s, (ExtensionArray, Index)) and is_scalar(key):
- try:
- return s[key]
- except (IndexError, ValueError):
+ is_extension = is_extension_array_dtype(series)
- # invalid type as an indexer
- pass
+ if is_extension:
+ s = np.arange(len(series))
+ else:
+ s = com._values_from_object(series)
- s = com._values_from_object(series)
k = com._values_from_object(key)
-
k = self._convert_scalar_indexer(k, kind='getitem')
try:
- return self._engine.get_value(s, k,
- tz=getattr(series.dtype, 'tz', None))
+ result = self._engine.get_value(
+ s, k, tz=getattr(series.dtype, 'tz', None))
+ if is_extension:
+ result = series._values[result]
+ return result
except KeyError as e1:
if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
raise The basic idea is to index into the positions, and then do a secondary ExtensionArray.getitem once you know the positions. Does that make any sense? |
|
||
s = pd.Series(data[:6], index=list('abcdef')) | ||
assert s.get('c') == s.iloc[2] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add a test for a slice like s.get(slice(2))
? That seems to be valid as far as .get
is concerned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding your suggestion, I don't think you need all of those changes. (I also don't think they will work because self._engine.get_value()
returns an item from the from the passed ndarray
and you are using that to index into the series itself). Things work fine in how I did it when specifying a slice or multiple indices. The issue is when the key is a single value, which is what I took care of with my fix.
I'll push some additional tests.
FYI, I think this can go in after the RC. |
Codecov Report
@@ Coverage Diff @@
## master #20885 +/- ##
==========================================
+ Coverage 91.81% 91.81% +<.01%
==========================================
Files 153 153
Lines 49481 49483 +2
==========================================
+ Hits 45430 45434 +4
+ Misses 4051 4049 -2
Continue to review full report at Codecov.
|
pandas/core/indexes/base.py
Outdated
@@ -3068,13 +3068,23 @@ def get_value(self, series, key): | |||
# if we have something that is Index-like, then | |||
# use this, e.g. DatetimeIndex | |||
s = getattr(series, '_values', None) | |||
if isinstance(s, (ExtensionArray, Index)) and is_scalar(key): | |||
try: | |||
return s[key] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you just use .get_loc(key)
on an Index as well does this work? (perf should be similar). That way we can avoid separating this.
see also #14865 I believe this will also be solved by the same fix. If so, pls add that as a test (its for Categorical) |
@jreback with respect to #14865, as @jorisvandenbossche stated in the original issue #20885 for this PR, I think they are separate issues, so I will not add additional tests. |
@jreback with respect to your suggested change of using If you guys want me to do anything else with this PR, let me know, but IMHO, you ought to just merge it in. |
what does it break? I really dont' like treating ExtensionArray different from Index, if that's the case its a bug in the API (for indexing). |
@jreback The tests that fail are
I've investigated the first one and here is what I have found out. (I think the second one is the same problem). If we use the way you propose, where we use
Here's an easier-to-read stack trace of calls for
So what may be the error here is that if the values of a Maybe when Or maybe you know how to fix things for |
this #20885 (comment) is a bug and should be fixed as a pre-cursor to this PR, @Dr-Irv pls make a separate issue (I thought we had one, but couldn't fin it). |
@jreback I'm not sure how to report the issue. The only way to replicate it is if we implement this PR as you suggested. Should I create the issue that way? I can't figure out how to demonstrate the bad behavior on 0.23rc2, or, equivalently, what test I could write that would verify that the bug illustrated above exists (since the bug only occurs if the PR is implemented in a different way). |
@jreback I've pushed a new commit here, as I found another bug in how I did things, so the new implementation looks more different for In the case of If the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks good to me
pandas/core/indexes/base.py
Outdated
iloc = self.get_loc(key) | ||
return s[iloc] | ||
except KeyError: | ||
if isinstance(key, (int, np.integer)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use is_integer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why doesn’t the pass work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I've just pushed a new commit that uses is_integer and adds comments.
pass
doesn't work because we are catching a different exception. So the way I unified the code between Index
and ExtensionArray
was by handling the exception differently in the case that the key
was a scalar.
pandas/core/indexes/base.py
Outdated
# invalid type as an indexer | ||
pass | ||
if is_scalar(key): | ||
if isinstance(s, (Index, ExtensionArray)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could be an and here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pandas/core/indexes/base.py
Outdated
try: | ||
iloc = self.get_loc(key) | ||
return s[iloc] | ||
except KeyError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happened to the Indexerror case? is that not possible now?
result = s.get(slice('b', 'd')) | ||
expected = s.iloc[[1, 2, 3]] | ||
self.assert_series_equal(result, expected) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some cases with an out-of-range integer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@Dr-Irv ok this change looks ok. can you run the indexing asv's and report if anything is changing? this is a central piece of code so need to check. |
@jreback I hope I did this right. I'm doing this on my 16GB 4 core laptop on Windows. I compared upstream/master (last commit on May 5) to my latest commit. Here are the results:
So many things got faster. I'm not surprised by this, as we're not falling through exceptions any more when indexing by a scalar. I also reran just the one that got slower, and when I did that, it flipped to 62.3±4μs for master and 55.6±0μs for the new version, showing an improvement. I then ran the benchmarks again to better understand variability:
And once more:
Given the inconsistent results (probably affected by other background processes happening on my laptop), I think we're OK. |
@Dr-Irv IIRC there is a flag for affinity to help prevent this kind of variable, also I thin we have some asv options to run this many times. but close enough for now. |
thanks! |
git diff upstream/master -u -- "*.py" | flake8 --diff