Series values cannot be accessed by numeric categorical index #14865
Comments
this is raising on a validation check and hitting positional indexing, rather than label indexing:
https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/base.py#L1128
something like might work (though might cause other things to break)
|
I was mistaken regarding positional indexing. It seems that category codes are used as the index:
>>> s3 = pd.Series(['a', 'b', 'c'], index=pd.Series([3, 1, 2], dtype='category'))
>>> s3.get(3) is None
True
>>> s3.get(0)
'b'
>>> s3.get(1)
'c'
>>> s3.get(2)
'a'
>>> s3.index.codes
array([2, 0, 1], dtype=int8)
Tried to override |
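To make the codes-versus-labels observation concrete, here is a minimal sketch that rebuilds the same index and spells out both mappings by hand. It deliberately avoids `Series.get`, whose fallback behavior is the subject of this thread; the names `value_by_code` and `value_by_label` are illustrative only and not part of pandas.

```python
import pandas as pd

# Same data as the s3 example: labels 3, 1, 2 in that order.
idx = pd.CategoricalIndex([3, 1, 2])
s3 = pd.Series(['a', 'b', 'c'], index=idx)

categories = [int(c) for c in idx.categories]  # sorted unique labels
codes = [int(c) for c in idx.codes]            # position of each label in `categories`

# What a lookup keyed by code would return, versus one keyed by label.
value_by_code = {code: value for code, value in zip(codes, s3.tolist())}
value_by_label = {categories[code]: value for code, value in zip(codes, s3.tolist())}

print(categories)      # [1, 2, 3]
print(codes)           # [2, 0, 1]
print(value_by_code)   # {2: 'a', 0: 'b', 1: 'c'}
print(value_by_label)  # {3: 'a', 1: 'b', 2: 'c'}
```

The `value_by_code` mapping matches the `s3.get(0)`, `s3.get(1)` and `s3.get(2)` results reported above, which is consistent with the lookup going through the codes rather than the labels.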
I was looking for a first issue-- I can take a look at this. |
Cool, thanks. If making that change to |
I've been playing around with this to understand the behavior, and the category codes can be used as the index for non-numeric categorical columns as well:
This makes me wonder whether this is a bug at all-- should the category codes just be the default here? The behavior is different than you might expect at first, but it is consistent. I'm going to look for a fix to see if we can default to using the category over the category code, but if it breaks other things, it might make sense to document the behavior rather than changing it. |
Also, I get different behavior from @alex-filatov here.
|
Maybe there's been a version release where this got fixed? I can't replicate the behavior. |
For your first example, For your second example, yes I'm seeing that too. And I think that's the correct behavior. To summarize, I think the correct behavior is that:
Or, to put it another way, these two should be identical w.r.t. indexing:
In [43]: s = pd.Series(['a', 'b', 'c'], index=pd.CategoricalIndex([1, 2, 3]))
In [44]: s2 = pd.Series(s.values, s.index.get_values())
In [45]: s.get(0)
Out[45]: 'a'
In [46]: s2.get(0)
The difference is that a CategoricalIndex with integer categories falls back to positional indexing when it shouldn't. |
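The "these two should be identical" rule can be checked without relying on `Series.get` at all. The sketch below compares label resolution on the two indexes through `Index.get_loc` and the `in` operator. Two assumptions: `np.asarray(s.index)` stands in for the `get_values()` call used in the thread, and `label_positions` is a hypothetical helper, not a pandas function.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=pd.CategoricalIndex([1, 2, 3]))
# Assumption: np.asarray(...) plays the role of the thread's index.get_values().
s2 = pd.Series(s.values, index=np.asarray(s.index))


def label_positions(series, labels):
    """Map each label to its position in the index, or None when it is absent."""
    out = {}
    for label in labels:
        if label in series.index:
            out[label] = int(series.index.get_loc(label))
        else:
            out[label] = None
    return out


keys = [0, 1, 2, 3]
same = label_positions(s, keys) == label_positions(s2, keys)
print(label_positions(s, keys))
print(same)
```

Under pure label semantics the key `0` resolves to nothing on either index, which is the outcome the comment above argues `s.get(0)` should also have.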
[Edit: changing this comment, now that I think I understand better] So, if I understand correctly, the changes to be made are:
|
I think that's correct (though @jreback will know better). This should also fix this error, which I think is a bug:
In [1]: import pandas as pd
In [2]: s = pd.Series(['a', 'b', 'c'], index=pd.CategoricalIndex([1, 2, 3]))
In [3]: s
Out[3]:
1 a
2 b
3 c
dtype: object
In [4]: s.loc[1]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-609ae8a0f3fe> in <module>()
----> 1 s.loc[1]
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexing.py in __getitem__(self, key)
1371
1372 maybe_callable = com._apply_if_callable(key, self.obj)
-> 1373 return self._getitem_axis(maybe_callable, axis=axis)
1374
1375 def _is_scalar_access(self, key):
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1624
1625 # fall thru to straight lookup
-> 1626 self._has_valid_type(key, axis)
1627 return self._get_label(key, axis=axis)
1628
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexing.py in _has_valid_type(self, key, axis)
1502
1503 try:
-> 1504 key = self._convert_scalar_indexer(key, axis)
1505 if not ax.contains(key):
1506 error()
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexing.py in _convert_scalar_indexer(self, key, axis)
254 ax = self.obj._get_axis(min(axis, self.ndim - 1))
255 # a scalar
--> 256 return ax._convert_scalar_indexer(key, kind=self.name)
257
258 def _convert_slice_indexer(self, key, axis):
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexes/category.py in _convert_scalar_indexer(self, key, kind)
573
574 return super(CategoricalIndex, self)._convert_scalar_indexer(
--> 575 key, kind=kind)
576
577 @Appender(_index_shared_docs['_convert_list_indexer'])
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexes/base.py in _convert_scalar_indexer(self, key, kind)
1391 elif kind in ['loc'] and is_integer(key):
1392 if not self.holds_integer():
-> 1393 return self._invalid_indexer('label', key)
1394
1395 return key
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexes/base.py in _invalid_indexer(self, form, key)
1575 "indexers [{key}] of {kind}".format(
1576 form=form, klass=type(self), key=key,
-> 1577 kind=type(key)))
1578
1579 def get_duplicates(self):
TypeError: cannot do label indexing on <class 'pandas.core.indexes.category.CategoricalIndex'> with these indexers [1] of <class 'int'>
That should return |
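The traceback ends in a `holds_integer()` check on the index. As far as I can tell, that check looked at the index's `inferred_type`, and a CategoricalIndex reports `'categorical'` regardless of what its categories contain. The sketch below illustrates the difference; both helper functions are hypothetical stand-ins written for this example, not the pandas implementation.

```python
import pandas as pd

INTEGER_KINDS = ('integer', 'mixed-integer')


def index_looks_integer(index):
    # Hypothetical stand-in for the holds_integer() check seen in the traceback.
    return index.inferred_type in INTEGER_KINDS


def categories_look_integer(index):
    # Hypothetical variant: judge a CategoricalIndex by its categories instead.
    target = index.categories if isinstance(index, pd.CategoricalIndex) else index
    return target.inferred_type in INTEGER_KINDS


plain = pd.Index([1, 2, 3])
cat = pd.CategoricalIndex([1, 2, 3])

print(plain.inferred_type)           # integer
print(cat.inferred_type)             # categorical
print(cat.categories.inferred_type)  # integer
print(index_looks_integer(cat), categories_look_integer(cat))  # False True
```

A check based on the categories would let the integer key `1` through as a label here, while still rejecting it for an index whose categories are strings.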
#17569 for that second bug, if it's actually a bug. Posting a comment there. |
Ok, just posting here, since there are other related issues: (#15470). Sorry you've walked into a bit of a minefield @elsander, but this will be a good issue to learn on :)
How about this: For a CategoricalIndex with integer categories, the behavior or
# x has a CategoricalIndex with integer categories
y = x.copy()
y.index = x.index.get_values()
all indexing is identical. Hopefully that doesn't break anything. cc @jorisvandenbossche thoughts? |
@elsander if that summary is correct, the fix @jreback suggested up above of overriding
|
Okay, yeah, this is more complicated than I originally thought! Still happy to work on it as long as I know what the expected behavior is. I wrote the following tests to demonstrate the expected behavior:
Does this still match the behavior you expect? I can also add the tests that you suggest. |
I get an |
Sorry, I should have said holds_integer |
I reported this issue with pandas 0.19.1; the current version is 0.20.3 and it indeed seems to be partially fixed, but still:
>> s3 = pd.Series(['a', 'b', 'c'], index=pd.Series([3, 1, 2], dtype='category'))
>> s3.get(1)
'b'
>> s3.get(0)
'b'
Makes sense to me. |
Let's keep the two issues discussed here separate: a) the fact that codes are used to index and b) whether a categorical index with integers should fall back to positional or not.
a) the fact that codes are used to index is clearly a bug:
This is a clear bug and can be fixed separately I think.
b) whether a categorical index with integers should fall back to positional or not is less clear to me. This is what is discussed in #15470 (although not much discussion there ..). At the time, I argued there that I think CategoricalIndex with numerical categories should not be regarded as a numerical index (in the sense: it should use integers for positional indexing in So it comes back to: how do we see a CategoricalIndex with numerical categories? But in the end, |
@jorisvandenbossche where do you come down on this? I think the "numericness" of a CategoricalIndex should depend on the type of the categories, instead of always treating it as non-numeric like we are now. I don't like the type-specific behavior with respect to falling back to positional indexing, but that's where we are. And I think it's more consistent for the indexing behavior to depend on the type of the values here. |
I'm still happy to work on this, but I'm going to hold off for now until there's a consensus on the expected behavior. As a user, I would find falling back to positional indexing for numeric CategoricalIndexes confusing (especially if the positional indices overlapped partially with the categories), but happy to implement it however the maintainers prefer. |
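The overlap concern can be illustrated with a small sketch in which every label is also a valid position. The data is hypothetical, and the label lookup goes through `Index.get_loc` so that it does not depend on whichever fallback rule a given pandas version applies; `by_label` and `by_position` are illustrative helpers.

```python
import pandas as pd

# Hypothetical data: the labels 0, 1, 2 are also valid positions.
s = pd.Series(['a', 'b', 'c'], index=pd.CategoricalIndex([2, 0, 1]))


def by_label(series, key):
    # Resolve the key as a label, then read the value at that position.
    return series.iloc[series.index.get_loc(key)]


def by_position(series, key):
    # Treat the key as a plain offset into the series.
    return series.iloc[key]


for key in range(3):
    print(key, by_label(s, key), by_position(s, key))
# 0 b a
# 1 c b
# 2 a c
```

Every key gives a different answer under the two readings, so a silent fallback from one to the other cannot be detected from the result alone.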
Actually I find this a really easy issue from the API.
|
No, it is de-facto
Again, this discussion is not about
Yeah, maybe that is the more sensible thing to do. But to explain my reservation about changing this, let's explain it in a different way. |
To keep the discussions a bit separate, I opened an issue about the "should |
In #20882, I was asked to see if this was the same. So I collected all of the examples listed above here, used v0.22.0, and relabeled them to be able to keep track. This might help anyone who decides to address these issues in the future.
>>> import pandas as pd
>>> pd.__version__
'0.22.0'
>>>
>>> s1 = pd.Series(['a', 'b', 'c'], index=pd.Series([1, 2, 3]))
>>> s1.get(3)
'c'
>>>
>>>
>>> s2 = pd.Series(['a', 'b', 'c'], index=pd.Series([1, 2, 3], dtype='category'))
>>> s2.get(3)
'c'
>>> s2.get(0)
'a'
>>>
>>> s3 = pd.Series(['a', 'b', 'c'], index=pd.Series([3, 1, 2], dtype='category'))
>>> [s3.get(i) for i in range(4)]
['b', 'b', 'c', 'a']
>>>
>>> s4 = pd.Series(['a', 'b', 'c'], index=pd.Series(['hello', 'there', 'world'],
... dtype='category'))
>>> s4.get(2)
'c'
>>>
>>> s5 = pd.Series(['a', 'b', 'c'], index=pd.CategoricalIndex([1, 2, 3]))
>>> s6 = pd.Series(s5.values, s5.index.get_values())
>>> s5.get(0)
'a'
>>> s6.get(0) is None
True
>>>
>>> s7 = pd.Series(['a', 'b', 'c'], index=pd.CategoricalIndex([1, 2, 3]))
>>> try:
... print('s7.loc[1] ', s7.loc[1])
... except TypeError as e:
... print('Expected TypeError raised ', e)
... except Exception as e:
... print('Unexpected exception raised: ', e)
...
Expected TypeError raised cannot do label indexing on <class 'pandas.core.indexes.category.CategoricalIndex'> with these indexers [1] of <class 'int'>
>>> s8 = pd.Series([1, 2, 3, 4], index=pd.Categorical(['b', 'a', 'b', 'c']))
>>> s8[0]
2
>>>
>>> s8.iloc[0]
1
>>>
I think the results for I think the ones that have to be dealt with are the results related to And there are 2 overriding issues (using the example |
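For the `s8` case in the transcript above, the integer key `0` has three possible readings: a position, a category code, or a label. The sketch below works each one out by hand on the same data, without calling `s8[0]`, since its result depends on the pandas version. The variable names are illustrative.

```python
import pandas as pd

cat = pd.Categorical(['b', 'a', 'b', 'c'])
values = [1, 2, 3, 4]
key = 0

codes = [int(c) for c in cat.codes]      # [1, 0, 1, 2]

positional = values[key]                 # row 0, as s8.iloc[0] returned
via_code = values[codes.index(key)]      # first row whose category code equals the key
is_label = key in list(cat.categories)   # 0 is not one of 'a', 'b', 'c'

print(positional, via_code, is_label)    # 1 2 False
```

The `s8[0]` result of `2` reported for v0.22.0 matches the code-based reading, while `s8.iloc[0]` matches the positional one.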
Code Sample
Problem description
In my opinion, the behavior in the second case is error-prone (when the positional indices overlap with the categorical ones) and inconvenient (it forces the use of positional indices).
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.1.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None