Series values cannot be accessed by numeric categorical index #14865

alex-filatov · 2016-12-12T11:12:32Z

Code Sample

>>> s1 = pd.Series(['a', 'b', 'c'], index=pd.Series([1, 2, 3]))
>>> s1.get(3)
'c'

>>> s2 = pd.Series(['a', 'b', 'c'], index=pd.Series([1, 2, 3], dtype='category'))
>>> s2.get(3) is None
True
>>> s2.get(0)
'a'

Problem description

In my opinion behavior in the second case can be error-prone (when there is an overlap between positional index and categorical one) and inconvenient (forces to use positional index).

Expected Output

>>> s2.get(3)
'c'

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.1.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jreback · 2016-12-12T11:28:50Z

this is raising on a validation check and hitting positional indexing, rather than label indexing:

In [12]: s.index.holds_integer()
Out[12]: True

In [13]: s.index.is_categorical()
Out[13]: False

In [14]: s.index.is_integer()
Out[14]: True

In [15]: s.index.holds_integer()
Out[15]: True

In [16]: s2.index.is_categorical()
Out[16]: True

In [17]: s2.index.holds_integer()
Out[17]: False

https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/base.py#L1128

    def holds_integer(self):
        return self.inferred_type in ['integer', 'mixed-integer']

something like might work (though might cause other things to break)
In pandas/indexes/category.py override to

    def holds_integer(self):
        return self.categories.holds_integer()

alex-filatov · 2016-12-12T18:12:18Z

I was mistaken regarding positional indexing. It seems that category codes are used as the index:

>>> s3 = pd.Series(['a', 'b', 'c'], index=pd.Series([3, 1, 2], dtype='category'))
>>> s3.get(3) is None
True
>>> s3.get(0)
'b'
>>> s3.get(1)
'c'
>>> s3.get(2)
'a'
>>> s3.index.codes
array([2, 0, 1], dtype=int8)

Tried to override holds_integer as advised - didn't resolve the issue. Not sure the method is invoked in this case.

elsander · 2017-10-18T18:41:56Z

I was looking for a first issue-- I can take a look at this.

TomAugspurger · 2017-10-18T18:58:14Z

Cool, thanks. If making that change to indexes/category.py does end up breaking other things, then this could turn out to be difficult.

elsander · 2017-10-18T20:38:47Z

I've been playing around with this to understand the behavior, and the category codes can be used as the index for non-numeric categorical columns as well:

In [35]: s4 = pd.Series(['a', 'b', 'c'], index=pd.Series(['hello', 'there', 'world'], dtype='category
    ...: '))

In [36]: s4.get(2)
Out[36]: 'c'

This makes me wonder whether this is a bug at all-- should the category codes just be the default here? The behavior is different than you might expect at first, but it is consistent. I'm going to look for a fix to see if we can default to using the category over the category code, but if it breaks other things, it might make sense to document the behavior rather than changing it.

elsander · 2017-10-18T20:44:13Z

Also, I get different behavior from @alex-filatov here.

In [59]: s3 = pd.Series(['a', 'b', 'c'], index=pd.Series([3, 1, 2], dtype='category'))

In [60]: s3.get(3)
Out[60]: 'a' # not None!

In [61]: s3.get(2)
Out[61]: 'c'

In [62]: s3.get(1)
Out[62]: 'b'

In [63]: s3.get(0)
Out[63]: 'b'

elsander · 2017-10-18T21:15:26Z

Maybe there's been a version release where this got fixed? I can't replicate the behavior.

TomAugspurger · 2017-10-18T21:19:30Z

For you first example, s4.get(2), that's correctly falling back to positional indexing (I think). There's only ambiguity when the categories are themselves integers, so there shouldn't be any positional indexing fallback (again, I could be wrong here).

For your second example, yes I'm seeing that too. And I think that's the correct behavior.

To summarize, I think the correct behavior is that:

indexing into a CategoricalIndex should always use the categories, not the codes
whether or not a CategoricalIndex falls back to positional indexing should depend on whether the categories or integers.

Or, to put it another way these two should be identical w.r.t. indexing:

In [43]: s = pd.Series(['a', 'b', 'c'], index=pd.CategoricalIndex([1, 2, 3]))

In [44]: s2 = pd.Series(s.values, s.index.get_values())

In [45]: s.get(0)
Out[45]: 'a'

In [46]: s2.get(0)

The difference is that a Categoricalndex with integer categories falls back to positional indexing when it shouldn't.

elsander · 2017-10-18T21:26:48Z

[Edit: changing this comment, now that I think I understand better]

So, if I understand correctly, the changes to be made are:

CategoricalIndex should no longer use the category codes for indexing
If the categories are integers, we should not fall back to anything. Otherwise, fall back to positional indexing.

TomAugspurger · 2017-10-18T21:32:21Z

I think that's correct (though @jreback will know better). This should also fix this error, which I think is a bug:

In [1]: import pandas as pd

In [2]: s = pd.Series(['a', 'b', 'c'], index=pd.CategoricalIndex([1, 2, 3]))

In [3]: s
Out[3]:
1    a
2    b
3    c
dtype: object

In [4]: s.loc[1]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-609ae8a0f3fe> in <module>()
----> 1 s.loc[1]

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexing.py in __getitem__(self, key)
   1371
   1372             maybe_callable = com._apply_if_callable(key, self.obj)
-> 1373             return self._getitem_axis(maybe_callable, axis=axis)
   1374
   1375     def _is_scalar_access(self, key):

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1624
   1625         # fall thru to straight lookup
-> 1626         self._has_valid_type(key, axis)
   1627         return self._get_label(key, axis=axis)
   1628

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexing.py in _has_valid_type(self, key, axis)
   1502
   1503             try:
-> 1504                 key = self._convert_scalar_indexer(key, axis)
   1505                 if not ax.contains(key):
   1506                     error()

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexing.py in _convert_scalar_indexer(self, key, axis)
    254         ax = self.obj._get_axis(min(axis, self.ndim - 1))
    255         # a scalar
--> 256         return ax._convert_scalar_indexer(key, kind=self.name)
    257
    258     def _convert_slice_indexer(self, key, axis):

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexes/category.py in _convert_scalar_indexer(self, key, kind)
    573
    574         return super(CategoricalIndex, self)._convert_scalar_indexer(
--> 575             key, kind=kind)
    576
    577     @Appender(_index_shared_docs['_convert_list_indexer'])

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexes/base.py in _convert_scalar_indexer(self, key, kind)
   1391             elif kind in ['loc'] and is_integer(key):
   1392                 if not self.holds_integer():
-> 1393                     return self._invalid_indexer('label', key)
   1394
   1395         return key

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexes/base.py in _invalid_indexer(self, form, key)
   1575                         "indexers [{key}] of {kind}".format(
   1576                             form=form, klass=type(self), key=key,
-> 1577                             kind=type(key)))
   1578
   1579     def get_duplicates(self):

TypeError: cannot do label indexing on <class 'pandas.core.indexes.category.CategoricalIndex'> with these indexers [1] of <class 'int'>

That should return 'a'

TomAugspurger · 2017-10-18T21:34:04Z

#17569 for that second bug, if it's actually a bug. Posting a comment there.

TomAugspurger · 2017-10-18T21:41:01Z

Ok, just posting here, since there are other related issues: (#15470). Sorry you've walked into a bit of a minefield @elsander, but this will be a good issue to learn on :)

How about this: For a CategoricalIndex with integer categories, the behavior or __getitem__, .loc, and iloc should be identical to a regular Integer index. In particular, no fallback to positional indexing. Then we have the simple rule that for two series x and y where

# x has a CategoricalIndex with integer categories
y = x.copy()
y.index = x.index.get_values()

all indexing is identical. Hopefully that doesn't break anything.

cc @jorisvandenbossche thoughts?

TomAugspurger · 2017-10-18T21:45:42Z

@elsander if that summary is correct, the fix @jreback suggested up above of overriding holds_integer in pandas/core/indexes/category.py should work. Then we'll want tests that check

CategoricalIndex.holds_category for integer / non-integer categories
Tests that exercise the simple rule above for
- loc
- iloc
- at
- iat
- get
  and for indexers that are in the Index and not in the Index

elsander · 2017-10-18T22:59:44Z

Okay, yeah, this is more complicated than I originally thought! Still happy to work on it as long as I know what the expected behavior is. I wrote the following tests to demonstrate the expected behavior:

    def test_fallback_to_positional(self):
        # test that CategorialIndex falls back to positional
        # indexing rather than category code indexing for
        # non-numeric categories
        s = pd.Series(['a', 'b', 'c'], index=pd.Series(
            ['three', 'one', 'two'], dtype='category'))
        assert s.get(0) == 'a'
        assert s.get(1) == 'b'
        assert s.get(2) == 'c'

    def test_numeric_categorical_index(self):
        # test that CategorialIndex uses categories and does
        # not fall back to positional indexing when categories
        # are numeric
        s = pd.Series(['a', 'b', 'c'], index=pd.Series([3, 1, 2],
                                                       dtype='category'))
        assert s.get(0) is None
        assert s.get(1) == 'b'
        assert s.get(2) == 'c'
        assert s.get(3) == 'a'

Does this still match the behavior you expect? I can also add the tests that you suggest.

elsander · 2017-10-19T00:42:33Z

I get an AttributeError on s.holds_category(). Is that a method you wanted to have implemented? It doesn't look like it's a method in the base Index class either.

TomAugspurger · 2017-10-19T01:04:47Z

Sorry, I should have said holds_integer

alex-filatov · 2017-10-19T08:01:07Z

Also, I get different behavior from @alex-filatov here.

Maybe there's been a version release where this got fixed? I can't replicate the behavior.

I reported this issue with pandas 0.19.1, current version is 0.20.3 and it's indeed seems to be partially fixed, but still:

>> s3 = pd.Series(['a', 'b', 'c'], index=pd.Series([3, 1, 2], dtype='category'))
>> s3.get(1)
'b'
>> s3.get(0)
'b'

So, if I understand correctly, the changes to be made are:

CategoricalIndex should no longer use the category codes for indexing
If the categories are integers, we should not fall back to anything. Otherwise, fall back to positional indexing.

Makes sense to me.

jorisvandenbossche · 2017-10-19T08:28:25Z

Let's keep the two issues discussed here separate: a) the fact that codes are used to index and b) whether categorical index with integers should fallback to positional or not.

a) the fact that codes are used to index is clearly a bug:

In [116]: s = pd.Series([1, 2, 3, 4], index=pd.Categorical(['b', 'a', 'b', 'c']))

In [117]: s.index.codes
Out[117]: array([1, 0, 1, 2], dtype=int8)

In [118]: s[0]
Out[118]: 2    <---- since it is a string categorical index, s[0] should be positional 
                     (so like iloc below), but it somehow uses the codes

In [119]: s.iloc[0]
Out[119]: 1

This is a clear bug and can be fixed separately I think.

b) whether categorical index with integers should fallback to positional or not is less clear to me. This is what is discussed in #15470 (although not much discussion there ..). At the time, I argued there that I think CategoricalIndex with numerical categories should not be regarded as a numerical index (in the sense: it should use integers for positional indexing in __getitem__, and not as labels)

So it comes back to: how do we see a CategoricalIndex with numerical categories?
Do we see it as a numerical index? Then what @TomAugspurger says above (#14865 (comment)) is correct and the behavior of __getitem__, .loc, and iloc should be identical to a regular Integer index (which also means: no fallback to positional indexing in []/__getitem_)
Or do we see it as a non-numerical index? (which is somehow supported in the sense that numerical operations like for integer index are not supported, but that's also not super convincing argument) In that case, integers in []/__getitem__ should rather be positional.

But in the end, __getitem__ will always be confusing/dubious in those cases, and it is maybe more important to fix the use of loc with integer categories (#17569) to at least allow the user to be more specific.

TomAugspurger · 2017-10-19T11:43:43Z

how do we see a CategoricalIndex with numerical categories?

@jorisvandenbossche where do you come down on this?

I think the "numericness" of a CategoricalIndex should depend on the type of the categories, instead of always treating it as non-numeric like we are now.

I don't like the type-specific behavior with respect to falling back to positional indexing, but that's where we are. And I think it's more consistent for the indexing behavior to depend on the type of the values here.

elsander · 2017-10-19T14:24:39Z

I'm still happy to work on this, but I'm going to hold off for now until there's a consensus on the expected behavior. As a user, I would find falling back to positional indexing for numeric CategoricalIndexes confusing (especially if the positional indices overlapped partially with the categories), but happy to implement it however the maintainers prefer.

jreback · 2017-10-20T13:27:00Z

Actually I find this a really easy issue from the API.

.get is de-facto .loc, so all indexing should be label based.
.codes are completely irrelevant and simply an implementation detail (so this is the bug part)
we should by-definition (as we are using .loc) never fall back or anything else again as its label based (with the labels being the categories).

jorisvandenbossche · 2017-10-20T13:58:53Z

.get is de-facto .loc, so all indexing should be label based.

No, it is de-facto __getitem__/[]. I agree it would be logical to have it only label based (but that is not how it currently is). For another issue, but we could deprecate the positional behaviour of get (or have a keyword to ask for it explicitly) ?

we should by-definition (as we are using .loc) never fall back or anything else again as its label based (with the labels being the categories).

Again, this discussion is not about loc[], it is about []. As I said above, we can discuss that we rather want to use .loc[] for get, which would solve the problem for get, but we still need to decide what the behaviour of [] on a numerical categorical index should be.

how do we see a CategoricalIndex with numerical categories?

@jorisvandenbossche where do you come down on this?

I think the "numericness" of a CategoricalIndex should depend on the type of the categories, instead of always treating it as non-numeric like we are now.

I don't like the type-specific behavior with respect to falling back to positional indexing, but that's where we are. And I think it's more consistent for the indexing behavior to depend on the type of the values here.

Yeah, maybe that is the more sensible thing to do. But to explain my reservation about changing this, let's explain it in a different way.
The behaviour of []: do we see it as 1) "try label based, if raises KeyError, then use positional (fallback) (except don't fall back for numerical)" or rather 2) "label based for Int64Index, positional for any other type of Index" ? Both are more or less equivalent (apart from special cases like the one we are discussing here, or a mixed type object dtype Index). In practice it is currently 1) (except for the case what we are discussing), but I personally would find 2) cleaner (and thus, this would be moving from 2) to 1) .., but OK, that's how it currenlty is for all others)

jorisvandenbossche · 2017-10-20T14:09:40Z

To keep the discussions a bit separate, I opened an issue about the "should get be only positional or not" (so use loc instead of []) at #17928.
Let's keep it here about the behaviour of [] for CateogoricalIndex with numerical categories.

Dr-Irv · 2018-05-01T15:59:16Z

In #20882, I was asked to see if this was the same. So I collected all of the examples listed above here, used v0.22.0, and relabeled them to be able to keep track. This might help anyone who decides to address these issues in the future.

>>> import pandas as pd
>>> pd.__version__
'0.22.0'
>>>
>>> s1 = pd.Series(['a', 'b', 'c'], index=pd.Series([1, 2, 3]))
>>> s1.get(3)
'c'
>>>
>>>
>>> s2 = pd.Series(['a', 'b', 'c'], index=pd.Series([1, 2, 3], dtype='category'))
>>> s2.get(3)
'c'
>>> s2.get(0)
'a'
>>>
>>> s3 = pd.Series(['a', 'b', 'c'], index=pd.Series([3, 1, 2], dtype='category'))
>>> [s3.get(i) for i in range(4)]
['b', 'b', 'c', 'a']
>>>
>>> s4 = pd.Series(['a', 'b', 'c'], index=pd.Series(['hello', 'there', 'world'],

...                                                 dtype='category'))
>>> s4.get(2)
'c'
>>>
>>> s5 = pd.Series(['a', 'b', 'c'], index=pd.CategoricalIndex([1, 2, 3]))
>>> s6 =  pd.Series(s5.values, s5.index.get_values())
>>> s5.get(0)
'a'
>>> s6.get(0) is None
True
>>>
>>> s7 = pd.Series(['a', 'b', 'c'], index=pd.CategoricalIndex([1, 2, 3]))
>>> try:
...     print('s7.loc[1] ', s7.loc[1])
... except TypeError as e:
...     print('Expected TypeError raised ', e)
... except Exception as e:
...     print('Unexpected exception raised: ', e)
...
Expected TypeError raised  cannot do label indexing on <class 'pandas.core.indexes.category.CategoricalIndex'> with these indexers [1] of <class 'int'>
>>> s8 = pd.Series([1, 2, 3, 4], index=pd.Categorical(['b', 'a', 'b', 'c']))
>>> s8[0]
2
>>>
>>> s8.iloc[0]
1
>>>

I think the results for s1, s2, s3, s4, and s5 are all correct. If the semantics of Series.get() change as discussed in #17928, then those will need to be revisited.

I think the ones that have to be dealt with are the results related to s6, s7 and s8[0]. For s7, this is #17569.

And there are 2 overriding issues (using the example s8 here) discussed in this comment above #14865 (comment) by @jorisvandenbossche about CategoricalIndex behavior -- one case is a bug, the other is a philosophical one (all well described in the referenced comment).

jreback added Bug Categorical Categorical Data Type Indexing Related to indexing on series/frames, not to indexes themselves Difficulty Novice labels Dec 12, 2016

jreback added this to the Next Major Release milestone Dec 12, 2016

cpcloud added the HackIllinois 2017 label Feb 26, 2017

TomAugspurger added the good first issue label Oct 11, 2017

jorisvandenbossche modified the milestones: Next Major Release, 0.22.0 Oct 27, 2017

jreback removed the Difficulty Novice label Dec 15, 2017

cpcloud added the HackIllinois 2018 label Feb 23, 2018

jorisvandenbossche mentioned this issue Apr 9, 2018

Series.at and DataFrame.at crash with CategoricalIndex #20629

Closed

jreback modified the milestones: 0.23.0, Next Major Release Apr 9, 2018

This was referenced May 1, 2018

BUG: Series.get() on ExtensionArray series (and Categorical) indexed by integer returns incorrect result #20882

Closed

BUG: Fix Series.get() for ExtensionArray and Categorical #20885

Merged

TomAugspurger removed HackIllinois 2018 labels Jul 6, 2018

jreback mentioned this issue May 7, 2019

BUG-20629 allow .at accessor with CategoricalIndex #26298

Merged

4 tasks

jbrockmendel removed the Effort Low label Oct 21, 2019

jorisvandenbossche mentioned this issue Feb 5, 2020

Why does CategoricalIndex.get_value call _convert_scalar_indexer? #31683

Closed

mroeschke removed the good first issue label May 2, 2021

jbrockmendel mentioned this issue Dec 23, 2021

BUG: Series.__getitem__ with CategoricalIndex[ints] listlike inconsistent with scalar #45023

Merged

5 tasks

jreback modified the milestones: Contributions Welcome, 1.4 Dec 23, 2021

jreback closed this as completed in #45023 Dec 23, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Series values cannot be accessed by numeric categorical index #14865

Series values cannot be accessed by numeric categorical index #14865

alex-filatov commented Dec 12, 2016

INSTALLED VERSIONS

jreback commented Dec 12, 2016 •

edited

Loading

alex-filatov commented Dec 12, 2016

elsander commented Oct 18, 2017

TomAugspurger commented Oct 18, 2017

elsander commented Oct 18, 2017 •

edited

Loading

elsander commented Oct 18, 2017

elsander commented Oct 18, 2017

TomAugspurger commented Oct 18, 2017

elsander commented Oct 18, 2017 •

edited

Loading

TomAugspurger commented Oct 18, 2017 •

edited

Loading

TomAugspurger commented Oct 18, 2017

TomAugspurger commented Oct 18, 2017

TomAugspurger commented Oct 18, 2017

elsander commented Oct 18, 2017 •

edited

Loading

elsander commented Oct 19, 2017

TomAugspurger commented Oct 19, 2017

alex-filatov commented Oct 19, 2017

jorisvandenbossche commented Oct 19, 2017 •

edited

Loading

TomAugspurger commented Oct 19, 2017

elsander commented Oct 19, 2017

jreback commented Oct 20, 2017

jorisvandenbossche commented Oct 20, 2017

jorisvandenbossche commented Oct 20, 2017

Dr-Irv commented May 1, 2018

Series values cannot be accessed by numeric categorical index #14865

Series values cannot be accessed by numeric categorical index #14865

Comments

alex-filatov commented Dec 12, 2016

Code Sample

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

jreback commented Dec 12, 2016 • edited Loading

alex-filatov commented Dec 12, 2016

elsander commented Oct 18, 2017

TomAugspurger commented Oct 18, 2017

elsander commented Oct 18, 2017 • edited Loading

elsander commented Oct 18, 2017

elsander commented Oct 18, 2017

TomAugspurger commented Oct 18, 2017

elsander commented Oct 18, 2017 • edited Loading

TomAugspurger commented Oct 18, 2017 • edited Loading

TomAugspurger commented Oct 18, 2017

TomAugspurger commented Oct 18, 2017

TomAugspurger commented Oct 18, 2017

elsander commented Oct 18, 2017 • edited Loading

elsander commented Oct 19, 2017

TomAugspurger commented Oct 19, 2017

alex-filatov commented Oct 19, 2017

jorisvandenbossche commented Oct 19, 2017 • edited Loading

TomAugspurger commented Oct 19, 2017

elsander commented Oct 19, 2017

jreback commented Oct 20, 2017

jorisvandenbossche commented Oct 20, 2017

jorisvandenbossche commented Oct 20, 2017

Dr-Irv commented May 1, 2018

Output of `pd.show_versions()`

jreback commented Dec 12, 2016 •

edited

Loading

elsander commented Oct 18, 2017 •

edited

Loading

elsander commented Oct 18, 2017 •

edited

Loading

TomAugspurger commented Oct 18, 2017 •

edited

Loading

elsander commented Oct 18, 2017 •

edited

Loading

jorisvandenbossche commented Oct 19, 2017 •

edited

Loading