Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Series values cannot be accessed by numeric categorical index #14865

Closed
alex-filatov opened this issue Dec 12, 2016 · 24 comments · Fixed by #45023
Closed

Series values cannot be accessed by numeric categorical index #14865

alex-filatov opened this issue Dec 12, 2016 · 24 comments · Fixed by #45023
Labels
Bug Categorical Categorical Data Type Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@alex-filatov
Copy link

Code Sample

>>> s1 = pd.Series(['a', 'b', 'c'], index=pd.Series([1, 2, 3]))
>>> s1.get(3)
'c'

>>> s2 = pd.Series(['a', 'b', 'c'], index=pd.Series([1, 2, 3], dtype='category'))
>>> s2.get(3) is None
True
>>> s2.get(0)
'a'

Problem description

In my opinion behavior in the second case can be error-prone (when there is an overlap between positional index and categorical one) and inconvenient (forces to use positional index).

Expected Output

>>> s2.get(3)
'c'

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.1.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Dec 12, 2016

this is raising on a validation check and hitting positional indexing, rather than label indexing:

In [12]: s.index.holds_integer()
Out[12]: True

In [13]: s.index.is_categorical()
Out[13]: False

In [14]: s.index.is_integer()
Out[14]: True

In [15]: s.index.holds_integer()
Out[15]: True

In [16]: s2.index.is_categorical()
Out[16]: True

In [17]: s2.index.holds_integer()
Out[17]: False

https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/base.py#L1128

    def holds_integer(self):
        return self.inferred_type in ['integer', 'mixed-integer']

something like might work (though might cause other things to break)
In pandas/indexes/category.py override to

    def holds_integer(self):
        return self.categories.holds_integer()

@jreback jreback added Bug Categorical Categorical Data Type Indexing Related to indexing on series/frames, not to indexes themselves Difficulty Novice labels Dec 12, 2016
@jreback jreback added this to the Next Major Release milestone Dec 12, 2016
@alex-filatov
Copy link
Author

I was mistaken regarding positional indexing. It seems that category codes are used as the index:

>>> s3 = pd.Series(['a', 'b', 'c'], index=pd.Series([3, 1, 2], dtype='category'))
>>> s3.get(3) is None
True
>>> s3.get(0)
'b'
>>> s3.get(1)
'c'
>>> s3.get(2)
'a'
>>> s3.index.codes
array([2, 0, 1], dtype=int8)

Tried to override holds_integer as advised - didn't resolve the issue. Not sure the method is invoked in this case.

@elsander
Copy link

I was looking for a first issue-- I can take a look at this.

@TomAugspurger
Copy link
Contributor

Cool, thanks. If making that change to indexes/category.py does end up breaking other things, then this could turn out to be difficult.

@elsander
Copy link

elsander commented Oct 18, 2017

I've been playing around with this to understand the behavior, and the category codes can be used as the index for non-numeric categorical columns as well:

In [35]: s4 = pd.Series(['a', 'b', 'c'], index=pd.Series(['hello', 'there', 'world'], dtype='category
    ...: '))

In [36]: s4.get(2)
Out[36]: 'c'

This makes me wonder whether this is a bug at all-- should the category codes just be the default here? The behavior is different than you might expect at first, but it is consistent. I'm going to look for a fix to see if we can default to using the category over the category code, but if it breaks other things, it might make sense to document the behavior rather than changing it.

@elsander
Copy link

Also, I get different behavior from @alex-filatov here.

In [59]: s3 = pd.Series(['a', 'b', 'c'], index=pd.Series([3, 1, 2], dtype='category'))

In [60]: s3.get(3)
Out[60]: 'a' # not None!

In [61]: s3.get(2)
Out[61]: 'c'

In [62]: s3.get(1)
Out[62]: 'b'

In [63]: s3.get(0)
Out[63]: 'b'

@elsander
Copy link

Maybe there's been a version release where this got fixed? I can't replicate the behavior.

@TomAugspurger
Copy link
Contributor

For you first example, s4.get(2), that's correctly falling back to positional indexing (I think). There's only ambiguity when the categories are themselves integers, so there shouldn't be any positional indexing fallback (again, I could be wrong here).

For your second example, yes I'm seeing that too. And I think that's the correct behavior.

To summarize, I think the correct behavior is that:

  1. indexing into a CategoricalIndex should always use the categories, not the codes
  2. whether or not a CategoricalIndex falls back to positional indexing should depend on whether the categories or integers.

Or, to put it another way these two should be identical w.r.t. indexing:

In [43]: s = pd.Series(['a', 'b', 'c'], index=pd.CategoricalIndex([1, 2, 3]))

In [44]: s2 = pd.Series(s.values, s.index.get_values())

In [45]: s.get(0)
Out[45]: 'a'

In [46]: s2.get(0)

The difference is that a Categoricalndex with integer categories falls back to positional indexing when it shouldn't.

@elsander
Copy link

elsander commented Oct 18, 2017

[Edit: changing this comment, now that I think I understand better]

So, if I understand correctly, the changes to be made are:

  • CategoricalIndex should no longer use the category codes for indexing
  • If the categories are integers, we should not fall back to anything. Otherwise, fall back to positional indexing.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 18, 2017

I think that's correct (though @jreback will know better). This should also fix this error, which I think is a bug:

In [1]: import pandas as pd

In [2]: s = pd.Series(['a', 'b', 'c'], index=pd.CategoricalIndex([1, 2, 3]))

In [3]: s
Out[3]:
1    a
2    b
3    c
dtype: object

In [4]: s.loc[1]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-609ae8a0f3fe> in <module>()
----> 1 s.loc[1]

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexing.py in __getitem__(self, key)
   1371
   1372             maybe_callable = com._apply_if_callable(key, self.obj)
-> 1373             return self._getitem_axis(maybe_callable, axis=axis)
   1374
   1375     def _is_scalar_access(self, key):

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1624
   1625         # fall thru to straight lookup
-> 1626         self._has_valid_type(key, axis)
   1627         return self._get_label(key, axis=axis)
   1628

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexing.py in _has_valid_type(self, key, axis)
   1502
   1503             try:
-> 1504                 key = self._convert_scalar_indexer(key, axis)
   1505                 if not ax.contains(key):
   1506                     error()

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexing.py in _convert_scalar_indexer(self, key, axis)
    254         ax = self.obj._get_axis(min(axis, self.ndim - 1))
    255         # a scalar
--> 256         return ax._convert_scalar_indexer(key, kind=self.name)
    257
    258     def _convert_slice_indexer(self, key, axis):

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexes/category.py in _convert_scalar_indexer(self, key, kind)
    573
    574         return super(CategoricalIndex, self)._convert_scalar_indexer(
--> 575             key, kind=kind)
    576
    577     @Appender(_index_shared_docs['_convert_list_indexer'])

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexes/base.py in _convert_scalar_indexer(self, key, kind)
   1391             elif kind in ['loc'] and is_integer(key):
   1392                 if not self.holds_integer():
-> 1393                     return self._invalid_indexer('label', key)
   1394
   1395         return key

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexes/base.py in _invalid_indexer(self, form, key)
   1575                         "indexers [{key}] of {kind}".format(
   1576                             form=form, klass=type(self), key=key,
-> 1577                             kind=type(key)))
   1578
   1579     def get_duplicates(self):

TypeError: cannot do label indexing on <class 'pandas.core.indexes.category.CategoricalIndex'> with these indexers [1] of <class 'int'>

That should return 'a'

@TomAugspurger
Copy link
Contributor

#17569 for that second bug, if it's actually a bug. Posting a comment there.

@TomAugspurger
Copy link
Contributor

Ok, just posting here, since there are other related issues: (#15470). Sorry you've walked into a bit of a minefield @elsander, but this will be a good issue to learn on :)

How about this: For a CategoricalIndex with integer categories, the behavior or __getitem__, .loc, and iloc should be identical to a regular Integer index. In particular, no fallback to positional indexing. Then we have the simple rule that for two series x and y where

# x has a CategoricalIndex with integer categories
y = x.copy()
y.index = x.index.get_values()

all indexing is identical. Hopefully that doesn't break anything.

cc @jorisvandenbossche thoughts?

@TomAugspurger
Copy link
Contributor

@elsander if that summary is correct, the fix @jreback suggested up above of overriding holds_integer in pandas/core/indexes/category.py should work. Then we'll want tests that check

  • CategoricalIndex.holds_category for integer / non-integer categories
  • Tests that exercise the simple rule above for
    • loc
    • iloc
    • at
    • iat
    • get
      and for indexers that are in the Index and not in the Index

@elsander
Copy link

elsander commented Oct 18, 2017

Okay, yeah, this is more complicated than I originally thought! Still happy to work on it as long as I know what the expected behavior is. I wrote the following tests to demonstrate the expected behavior:

    def test_fallback_to_positional(self):
        # test that CategorialIndex falls back to positional
        # indexing rather than category code indexing for
        # non-numeric categories
        s = pd.Series(['a', 'b', 'c'], index=pd.Series(
            ['three', 'one', 'two'], dtype='category'))
        assert s.get(0) == 'a'
        assert s.get(1) == 'b'
        assert s.get(2) == 'c'

    def test_numeric_categorical_index(self):
        # test that CategorialIndex uses categories and does
        # not fall back to positional indexing when categories
        # are numeric
        s = pd.Series(['a', 'b', 'c'], index=pd.Series([3, 1, 2],
                                                       dtype='category'))
        assert s.get(0) is None
        assert s.get(1) == 'b'
        assert s.get(2) == 'c'
        assert s.get(3) == 'a'

Does this still match the behavior you expect? I can also add the tests that you suggest.

@elsander
Copy link

I get an AttributeError on s.holds_category(). Is that a method you wanted to have implemented? It doesn't look like it's a method in the base Index class either.

@TomAugspurger
Copy link
Contributor

Sorry, I should have said holds_integer

@alex-filatov
Copy link
Author

Also, I get different behavior from @alex-filatov here.

Maybe there's been a version release where this got fixed? I can't replicate the behavior.

I reported this issue with pandas 0.19.1, current version is 0.20.3 and it's indeed seems to be partially fixed, but still:

>> s3 = pd.Series(['a', 'b', 'c'], index=pd.Series([3, 1, 2], dtype='category'))
>> s3.get(1)
'b'
>> s3.get(0)
'b'

So, if I understand correctly, the changes to be made are:

CategoricalIndex should no longer use the category codes for indexing
If the categories are integers, we should not fall back to anything. Otherwise, fall back to positional indexing.

Makes sense to me.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Oct 19, 2017

Let's keep the two issues discussed here separate: a) the fact that codes are used to index and b) whether categorical index with integers should fallback to positional or not.

a) the fact that codes are used to index is clearly a bug:

In [116]: s = pd.Series([1, 2, 3, 4], index=pd.Categorical(['b', 'a', 'b', 'c']))

In [117]: s.index.codes
Out[117]: array([1, 0, 1, 2], dtype=int8)

In [118]: s[0]
Out[118]: 2    <---- since it is a string categorical index, s[0] should be positional 
                     (so like iloc below), but it somehow uses the codes

In [119]: s.iloc[0]
Out[119]: 1

This is a clear bug and can be fixed separately I think.

b) whether categorical index with integers should fallback to positional or not is less clear to me. This is what is discussed in #15470 (although not much discussion there ..). At the time, I argued there that I think CategoricalIndex with numerical categories should not be regarded as a numerical index (in the sense: it should use integers for positional indexing in __getitem__, and not as labels)

So it comes back to: how do we see a CategoricalIndex with numerical categories?
Do we see it as a numerical index? Then what @TomAugspurger says above (#14865 (comment)) is correct and the behavior of __getitem__, .loc, and iloc should be identical to a regular Integer index (which also means: no fallback to positional indexing in []/__getitem_)
Or do we see it as a non-numerical index? (which is somehow supported in the sense that numerical operations like for integer index are not supported, but that's also not super convincing argument) In that case, integers in []/__getitem__ should rather be positional.

But in the end, __getitem__ will always be confusing/dubious in those cases, and it is maybe more important to fix the use of loc with integer categories (#17569) to at least allow the user to be more specific.

@TomAugspurger
Copy link
Contributor

how do we see a CategoricalIndex with numerical categories?

@jorisvandenbossche where do you come down on this?

I think the "numericness" of a CategoricalIndex should depend on the type of the categories, instead of always treating it as non-numeric like we are now.

I don't like the type-specific behavior with respect to falling back to positional indexing, but that's where we are. And I think it's more consistent for the indexing behavior to depend on the type of the values here.

@elsander
Copy link

I'm still happy to work on this, but I'm going to hold off for now until there's a consensus on the expected behavior. As a user, I would find falling back to positional indexing for numeric CategoricalIndexes confusing (especially if the positional indices overlapped partially with the categories), but happy to implement it however the maintainers prefer.

@jreback
Copy link
Contributor

jreback commented Oct 20, 2017

Actually I find this a really easy issue from the API.

  • .get is de-facto .loc, so all indexing should be label based.
  • .codes are completely irrelevant and simply an implementation detail (so this is the bug part)
  • we should by-definition (as we are using .loc) never fall back or anything else again as its label based (with the labels being the categories).

@jorisvandenbossche
Copy link
Member

.get is de-facto .loc, so all indexing should be label based.

No, it is de-facto __getitem__/[]. I agree it would be logical to have it only label based (but that is not how it currently is). For another issue, but we could deprecate the positional behaviour of get (or have a keyword to ask for it explicitly) ?

we should by-definition (as we are using .loc) never fall back or anything else again as its label based (with the labels being the categories).

Again, this discussion is not about loc[], it is about []. As I said above, we can discuss that we rather want to use .loc[] for get, which would solve the problem for get, but we still need to decide what the behaviour of [] on a numerical categorical index should be.

how do we see a CategoricalIndex with numerical categories?

@jorisvandenbossche where do you come down on this?

I think the "numericness" of a CategoricalIndex should depend on the type of the categories, instead of always treating it as non-numeric like we are now.

I don't like the type-specific behavior with respect to falling back to positional indexing, but that's where we are. And I think it's more consistent for the indexing behavior to depend on the type of the values here.

Yeah, maybe that is the more sensible thing to do. But to explain my reservation about changing this, let's explain it in a different way.
The behaviour of []: do we see it as 1) "try label based, if raises KeyError, then use positional (fallback) (except don't fall back for numerical)" or rather 2) "label based for Int64Index, positional for any other type of Index" ? Both are more or less equivalent (apart from special cases like the one we are discussing here, or a mixed type object dtype Index). In practice it is currently 1) (except for the case what we are discussing), but I personally would find 2) cleaner (and thus, this would be moving from 2) to 1) .., but OK, that's how it currenlty is for all others)

@jorisvandenbossche
Copy link
Member

To keep the discussions a bit separate, I opened an issue about the "should get be only positional or not" (so use loc instead of []) at #17928.
Let's keep it here about the behaviour of [] for CateogoricalIndex with numerical categories.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented May 1, 2018

In #20882, I was asked to see if this was the same. So I collected all of the examples listed above here, used v0.22.0, and relabeled them to be able to keep track. This might help anyone who decides to address these issues in the future.

>>> import pandas as pd
>>> pd.__version__
'0.22.0'
>>>
>>> s1 = pd.Series(['a', 'b', 'c'], index=pd.Series([1, 2, 3]))
>>> s1.get(3)
'c'
>>>
>>>
>>> s2 = pd.Series(['a', 'b', 'c'], index=pd.Series([1, 2, 3], dtype='category'))
>>> s2.get(3)
'c'
>>> s2.get(0)
'a'
>>>
>>> s3 = pd.Series(['a', 'b', 'c'], index=pd.Series([3, 1, 2], dtype='category'))
>>> [s3.get(i) for i in range(4)]
['b', 'b', 'c', 'a']
>>>
>>> s4 = pd.Series(['a', 'b', 'c'], index=pd.Series(['hello', 'there', 'world'],

...                                                 dtype='category'))
>>> s4.get(2)
'c'
>>>
>>> s5 = pd.Series(['a', 'b', 'c'], index=pd.CategoricalIndex([1, 2, 3]))
>>> s6 =  pd.Series(s5.values, s5.index.get_values())
>>> s5.get(0)
'a'
>>> s6.get(0) is None
True
>>>
>>> s7 = pd.Series(['a', 'b', 'c'], index=pd.CategoricalIndex([1, 2, 3]))
>>> try:
...     print('s7.loc[1] ', s7.loc[1])
... except TypeError as e:
...     print('Expected TypeError raised ', e)
... except Exception as e:
...     print('Unexpected exception raised: ', e)
...
Expected TypeError raised  cannot do label indexing on <class 'pandas.core.indexes.category.CategoricalIndex'> with these indexers [1] of <class 'int'>
>>> s8 = pd.Series([1, 2, 3, 4], index=pd.Categorical(['b', 'a', 'b', 'c']))
>>> s8[0]
2
>>>
>>> s8.iloc[0]
1
>>>

I think the results for s1, s2, s3, s4, and s5 are all correct. If the semantics of Series.get() change as discussed in #17928, then those will need to be revisited.

I think the ones that have to be dealt with are the results related to s6, s7 and s8[0]. For s7, this is #17569.

And there are 2 overriding issues (using the example s8 here) discussed in this comment above #14865 (comment) by @jorisvandenbossche about CategoricalIndex behavior -- one case is a bug, the other is a philosophical one (all well described in the referenced comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
9 participants