Overview of [] (getitem) API #9595

jorisvandenbossche · 2015-03-05T12:58:01Z

some examples (on Series only) in #12890

I started making an overview of the indexing semantics with http://nbviewer.ipython.org/gist/jorisvandenbossche/7889b389a21b41bc1063 (only for series/frame, not for panel)

Conclusion: it is a mess :-)

Summary for slicing

Slicing with integer labels is:
- always integer location based
- except for a float index, where it is label based
Slicing with other types of labels is always label based, provided the label is of an appropriate type for the index.

So you can say that the behaviour is equivalent to .ix, except that for integer labels on an integer index the behaviour is swapped: [] is integer location based, whereas .ix on an integer axis is always label based, with no fallback to integer location based indexing.
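A minimal sketch of the slicing rules above, using made-up data. It only exercises the integer-index and string-index cases; the float-index case is left out on purpose, since (as an assumption about later releases) that behaviour is not the same in every pandas version.

```python
import pandas as pd

# Integer-labelled Series whose labels differ from the positions.
s_int = pd.Series([1, 2, 3, 4], index=[10, 20, 30, 40])

# [] with an integer slice is positional: positions 1 and 2, end excluded.
by_position = s_int[1:3]

# .loc with an integer slice is label based, and the end label is included.
by_label = s_int.loc[10:30]

# String-labelled Series: a string slice through [] is label based.
s_str = pd.Series([1, 2, 3, 4], index=list("abcd"))
by_str_label = s_str["b":"c"]

print(by_position.tolist())
print(by_label.tolist())
print(by_str_label.tolist())
```

Note that the label based slices include the end label, while the positional slice does not.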

Summary for single label

Indexing with a single label is always label based.
However, there is a fallback to integer location based indexing, except for integer and float indexes.
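A small sketch of the single-label rules, with made-up data. The last part probes the positional fallback on a string index; as an assumption about newer releases, pandas may warn about that lookup or reject it, so the outcome is recorded rather than taken for granted.

```python
import warnings

import pandas as pd

s_str = pd.Series([10, 20, 30], index=["a", "b", "c"])
s_int = pd.Series([10, 20, 30], index=[2, 1, 0])

label_hit = s_str["b"]             # plain label lookup
int_label_hit = s_int[0]           # 0 is looked up as a label, not as a position
first_by_position = s_int.iloc[0]  # explicit positional access for comparison

# Integer key on a non-numeric index: the fallback described above.
# Warnings are silenced and errors caught because the outcome depends on the pandas version.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        fallback = s_str[0]
    except (KeyError, IndexError, TypeError) as exc:
        fallback = type(exc).__name__

print(label_hit, int_label_hit, first_by_position, fallback)
```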

Summary for indexing with list of labels

It is primarily label based, but:
- There is a fallback to integer location based indexing, except for an integer or float axis
- It is a pure reindex: even if none of the labels in the list is found, you just get an all-NaN Series (in contrast to loc, where at least one label must be found)
- String parsing for a datetime index does not seem to work

This mainly follows ix, apart from points 2 and 3
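To make the "pure reindex" point concrete, here is a hedged sketch with made-up data. `reindex` is the explicit spelling of that behaviour; whether `[]` with a missing label still reindexes or raises is treated as version dependent (an assumption), so the code records the outcome instead of asserting one.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# All labels present: a plain label based selection.
present = s[["a", "c"]]

# Explicit reindex: the missing label "z" becomes NaN.
reindexed = s.reindex(["a", "z"])

# The same list through []: either a reindex-like result or a KeyError,
# depending on the pandas version in use.
try:
    s[["a", "z"]]
    outcome = "reindexed"
except KeyError:
    outcome = "KeyError"

print(present.tolist(), reindexed.isna().tolist(), outcome)
```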

Summary for boolean indexing

This is simple, it just works as expected
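For completeness, a short sketch of boolean indexing on a Series with made-up data; the mask can be a boolean Series or a plain list of booleans of the same length.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=list("abcd"))

# Boolean Series as mask: keeps the entries where the mask is True.
mask = s > 2
selected = s[mask]

# A plain list of booleans of matching length selects the same entries.
selected_from_list = s[[False, False, True, True]]

print(selected.index.tolist())
```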

Summary for DataFrames

It uses the 'information' axis (axis 1) for:
- single labels
- list of labels
It uses the rows (axis 0) for:
- slicing
- boolean indexing

This is as documented (only the boolean case is not explicitly documented, I think).
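The axis choice described above can be checked with a small made-up frame. This is only an illustration of the four cases (single label, list of labels, slice, boolean mask), not an exhaustive test.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(9).reshape(3, 3), columns=list("xyz"), index=list("abc")
)

single = df["x"]            # axis 1: one column, returned as a Series
several = df[["x", "y"]]    # axis 1: two columns, returned as a DataFrame
row_slice = df[0:2]         # axis 0: the first two rows
row_mask = df[df["x"] > 0]  # axis 0: rows where column x is positive

print(single.tolist())
print(several.columns.tolist())
print(row_slice.index.tolist())
print(row_mask.index.tolist())
```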

For the rest (on the chosen axis), it follows the same semantics as [] on a Series, but:

- for a list of labels, all labels must now be present (no pure reindex as with Series)
- for single labels, there is no fallback to integer location based indexing for a non-numeric index (but there is such a fallback for a list of labels ...)
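A sketch of the "all labels must be present" point for DataFrames, with made-up column names. `reindex(columns=...)` is shown as the explicit way to get the NaN-filling behaviour instead.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# A list containing a column that does not exist is rejected by [].
try:
    df[["x", "q"]]
    missing_outcome = "no error"
except KeyError:
    missing_outcome = "KeyError"

# Explicit reindex on the columns: the missing column "q" is filled with NaN.
filled = df.reindex(columns=["x", "q"])

print(missing_outcome)
print(filled["q"].isna().all())
```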

Questions are here:

- Are there things we can change? (that would not be too disruptive .. maybe not?) And want change?
- How do we document this best?
  - Now you have the "basics" section (http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics) and the slicing section (http://pandas.pydata.org/pandas-docs/stable/indexing.html#slicing-ranges), but these do not cover all cases.


jreback · 2015-03-05T23:20:46Z

@jorisvandenbossche this is a really nice summary.

I think in general we can move []/.ix closer (maybe can get identical), so as not to have any confusion. (of course we may have to eliminate fallback which is not a bad thing anyhow).

I suppose we should prepare any changes for 0.17.0 as these will technically be API changes.

jreback · 2015-03-06T00:36:10Z

xref #7501 , #8976, #7187

shoyer · 2015-03-06T06:47:15Z

xref #9213, CC @hugadams @dandavison

@jorisvandenbossche Indeed, this is a nice summary of current behavior. Thanks!

I think we should consider radical API changes for __getitem__ if we want pandas to have a lasting influence.

My two cents on indexing is that "fallback indexing" is a really bad idea. It starts with the best of intentions, but leads to special cases such as the distinction between integer and float indexes (e.g., see #9213). In the face of ambiguity, refuse the temptation to guess.

So if I were reinventing indexing rules from scratch, I would consider something like this (for DataFrame):

- Indexing with a string or list of strings does label based selection on columns.
- All other indexing is position based, NumPy style. (This includes indexing with a boolean array.)

That's it. Two simple rules that probably cover 90% of existing uses of __getitem__, at least the only ones that I could ever keep straight (string column labels and boolean arrays). Importantly, indexing would never depend on the type of the index and there would be no reindexing/NaN-filling behavior. We could also eliminate the need for .iloc as a separate indexer entirely.
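To make the two rules concrete, here is a rough sketch of what such a lookup could look like. `strict_getitem` is a hypothetical helper written for illustration, not a pandas API, and applying the positional rule along the rows is an assumption.

```python
import numpy as np
import pandas as pd


def strict_getitem(df, key):
    """Hypothetical helper (not part of pandas) sketching the two rules."""
    # Rule 1: a string or a list of strings selects columns by label.
    if isinstance(key, str):
        return df.loc[:, key]
    if isinstance(key, list) and key and all(isinstance(k, str) for k in key):
        return df.loc[:, key]
    # Rule 2: everything else is positional along the rows, NumPy style.
    if isinstance(key, slice):
        for bound in (key.start, key.stop, key.step):
            if bound is not None and not isinstance(bound, (int, np.integer)):
                raise TypeError("slice bounds must be integers or None")
        return df.iloc[key]
    if isinstance(key, (int, np.integer)):
        return df.iloc[key]
    # Boolean or integer arrays (including a boolean Series) go through NumPy.
    return df.iloc[np.asarray(key)]


df = pd.DataFrame(
    np.arange(9).reshape(3, 3), columns=list("xyz"), index=list("xyz")
)
col = strict_getitem(df, "x")             # column "x" by label
rows = strict_getitem(df, slice(0, 2))    # first two rows by position
masked = strict_getitem(df, df["x"] > 0)  # boolean mask over the rows

# A label slice is not a valid positional key under these rules.
try:
    strict_getitem(df, slice("x", "y"))
    label_slice = "allowed"
except TypeError:
    label_slice = "TypeError"
```

In this sketch nothing depends on the type of the index, and there is no fallback: a key either matches one of the two rules or fails.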

This sort of change would require a serious deprecation cycle or perhaps need to wait until pandas 1.0 (likely both), but something needs to change. The fact that even pandas developers need to run extensive experiments to figure out how __getitem__ works indicates just how wrong things are. Indexing should be simple enough that its behavior can be relied on in production code. The current state of indexing is, frankly, embarrassing.

shoyer · 2016-09-09T04:40:38Z

@jorisvandenbossche Did you ever figure out how __setitem__ works? :)

jorisvandenbossche · 2016-09-09T08:38:30Z

@shoyer nope :-) I would suspect it is largely the same, but you never know ... Will try to look at it next week

matthewgilbert · 2017-08-28T15:28:22Z

I wanted to add this here since it is somewhat related to "String parsing for a datetime index does not seem to work" mentioned above and I have not seen it come up anywhere else. For a MultiIndex, string parsing for a datetime index with a scalar does not result in dropping the MultiIndex level.

In [2]: dfm = pd.DataFrame([1, 2, 3], index=pd.MultiIndex.from_arrays([pd.date_range("2015-01-01", "2015-01-03"), ['A', 'A', 'B']]))

In [3]: dfm.loc["2015-01-01"]
Out[3]: 
              0
2015-01-01 A  1

In [4]: dfm.loc[pd.Timestamp("2015-01-01")]
Out[4]: 
   0
A  1

this seems like somewhat unintuitive behaviour (to me at least)

jreback · 2017-08-29T12:45:30Z

@matthewgilbert this is just how partial string indexing works, see the docs here. The first is treated as a slice, while the second is an exact match.
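A simplified illustration of the slice versus exact-match distinction, on a flat DatetimeIndex rather than the MultiIndex above (an assumption made to keep the example small). A string that is coarser than the index resolution selects a range, while a Timestamp selects a single entry.

```python
import pandas as pd

# Daily index that crosses a month boundary: Jan 30, Jan 31, Feb 1, Feb 2.
s = pd.Series(range(4), index=pd.date_range("2015-01-30", periods=4))

# Month-resolution string: treated as a slice covering all of January 2015.
january = s.loc["2015-01"]

# Timestamp: an exact match, which returns the scalar value.
one_day = s.loc[pd.Timestamp("2015-01-30")]

print(len(january), one_day)
```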

aavanian · 2017-09-18T10:35:58Z

I came across this, and it seems related, but it could also be a bug in the above interacting with the CategoricalIndex. Using the same example as #15470:

pandas 0.20.3

s = pd.Series([2, 1, 0], index=pd.CategoricalIndex([2, 1, 0]))
s[2]  # works (interpreting as label)
s.loc[2]  # fails with TypeError: cannot do label indexing on <class 'pandas.core.indexes.category.CategoricalIndex'> with these indexers [2] of <class 'int'>

# of course the below works!
s = pd.Series([2, 1, 0], index=[2, 1, 0])
s[2]  # works (interpreting as label)
s.loc[2]  # works (interpreting as label)

TomAugspurger · 2017-09-18T10:51:28Z

@aavanian that looks like a bug. Could you open a separate issue for it?

aavanian · 2017-09-18T10:58:30Z

Sure, done in #17569

tdpetrou · 2017-11-27T00:27:37Z

If I were to rebuild pandas, I would make indexing as simple as possible and only use .loc and .iloc. I would not implement __getitem__. There would be no ambiguity. I also wouldn't allow attribute access to columns. It would be a pain to select a single column df.loc[:, 'col'] but pandas really needs to focus on being explicit.
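For reference, a tiny sketch comparing the explicit spellings with the shorthands mentioned here, on a made-up frame. All four select the same column today.

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3], "other": [4, 5, 6]}, index=list("abc"))

by_label = df.loc[:, "col"]   # explicit, label based
by_position = df.iloc[:, 0]   # explicit, position based
shorthand = df["col"]         # __getitem__
attribute = df.col            # attribute access

print(by_label.equals(shorthand), by_label.equals(attribute))
```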

sgpinkus · 2019-06-23T06:28:12Z

I came here just for @jorisvandenbossche's:

Summary for DataFrames

* It uses the 'information' axis (axis 1) for:
  
  * single labels
  * list of labels

* It uses the rows (axis 0) for:
  
  * slicing
  * boolean indexing

Thanks for the rest of the analysis! Agree it's a mess. @shoyer:

* All other indexing is position based, NumPy style. (This includes indexing with a boolean array.)

I think I disagree:

In [1]: df = pd.DataFrame(np.arange(9).reshape((3,3)), columns=list('xyz'), index=list('xyz'))                                                                                                   
In [2]: df                                                                                                                                                                                       
Out[2]: 
   x  y  z
x  0  1  2
y  3  4  5
z  6  7  8
In [3]: df['x']  # By columns                                                                                                                                                 
Out[3]: 
x    0
y    3
z    6
Name: x, dtype: int64
In [4]: df[['x', 'y']]  # By columns                                                                                                                                                                 
Out[4]: 
   x  y
x  0  1
y  3  4
z  6  7
In [5]: df['x':'y'] # By rows now!?                                                                                                                                                                     
Out[5]: 
   x  y  z
x  0  1  2
y  3  4  5

Not intuitive, and it is even more confusing to the beginner when you cross-reference this against the behavior of df.loc[:,<X>], which works the same as df[<X>] for the first two cases but not the third. IMO df[<X>] should be identical or as close as possible to df.loc[:,<X>].

In general a "[] is for cols, .loc[] is for rows" convention would be most intuitive, if [] is not dropped completely.

shoyer · 2019-06-23T07:08:58Z

@sam-at-github in my suggested model, indexing like df['x':'y'] would actually trigger an exception (because strings are not valid positional indexes). You'd have to use .loc if you wanted that sort of indexing.

sgpinkus · 2019-06-23T07:30:00Z

Oh OK, wasn't sure what you meant. I still don't think I like that much. For the second point I would prefer "everything else fails" over switching the behavior of [] from selection on col labels to index-only based selection on rows (I'm presuming you mean on rows). In my mind that doesn't address the main inconsistency: switching from a col primary op to a row primary op depending on the operand, especially given the existence of .loc[], which is already for row primary stuff. I'd prefer anything consistent with "[] for cols, loc[] for rows".

Update: Aside, to only allow positional slicing and not "label" based is probably even more confusing since your labels can be numerical anyway:

In [8]: df = pd.DataFrame(np.arange(9).reshape((3,3)))                                                                                                          

In [9]: df[0] # By columns                                                                                                                           
Out[9]: 
0    0
1    3
2    6
Name: 0, dtype: int64

In [10]: df[[0,1]] # By columns                                                                                                                                
Out[10]: 
   0  1
0  0  1
1  3  4
2  6  7

In [11]: df[0:1] # By rows now?!
Out[11]: 
   0  1  2
0  0  1  2

jbrockmendel · 2020-01-17T04:21:05Z

Are there things we can change? (that would not be too disruptive .. maybe not?) And want change?

I'd also like to know the answer to this question.

The behavior that surprised me today was the few cases where DataFrame.__getitem__[key] does a row-based lookup rather than a column-based lookup. If deprecating any of this behavior is an option, I advocate starting with making DataFrame.__getitem__ always column-based.

jreback · 2020-01-17T04:32:41Z

Are there things we can change? (that would not be too disruptive .. maybe not?) And want change?

I'd also like to know the answer to this question.

The behavior that surprised me today was the few cases where DataFrame.__getitem__[key] does a row-based lookup rather than a column-based lookup. If deprecating any of this behavior is an option, I advocate starting with making DataFrame.__getitem__ always column-based.

I believe we have an issue for this; I would be +1 on deprecation.

jorisvandenbossche · 2020-01-27T20:31:48Z

@jbrockmendel can you first open (or search) an issue for this to have a discussion about it?

jbrockmendel · 2020-02-08T19:28:33Z

@jorisvandenbossche I'm putting together an overview of the state of the indexing code. Is the description of the API here still accurate/complete?

FluorineDog · 2020-06-01T03:09:13Z

Hey, I'm working on a join-like API ClosestItem(left: Series, right: Series, max_distance: float) which returns [index_of_right(v) for v in left] now. I would love to return Series, please give me some suggestions on:

- How should I deal with the labels of left and right, if they are available?
- When nothing is within max_distance, what should I return? Currently it is just -1. Should I use None or anything else?

Thanks

jorisvandenbossche · 2020-06-07T09:34:25Z

@FluorineDog that doesn't really seem related to this issue. Can you please open a new issue about it?

jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves API Design labels Mar 5, 2015

shoyer mentioned this issue Apr 27, 2015

Towards "pandas 1.0" #10000

Closed

TomAugspurger mentioned this issue Jul 26, 2015

BUG: DataFrames with integer column names dask/dask#480

Closed

louispotok mentioned this issue Sep 15, 2015

Slicing multiple DataFrame columns doesn't work with boolean column names #11119

Closed

max-sixty mentioned this issue Oct 30, 2015

Partial indexing of a Panel #8906

Closed

TomAugspurger mentioned this issue Apr 13, 2016

The API to retrieve serie elements presents some inconsistencies #12890

Closed

shoyer mentioned this issue Sep 9, 2016

Simplifying indexing (DataFrame.__getitem__) wesm/pandas2#22

Open

mbauman mentioned this issue Sep 20, 2016

Move more logic into Axis type? JuliaArrays/AxisArrays.jl#15

Open

jorisvandenbossche mentioned this issue Feb 21, 2017

On Series with CategoricalIndex, __getitem__ not equal to .loc #15470

Closed

jreback mentioned this issue Mar 20, 2017

Proposal to change behaviour with .loc and missing keys #15747

Closed

aavanian mentioned this issue Sep 18, 2017

Can't .loc[label] on a CategoricalIndex with labels being integer #17569

Closed

toobaz mentioned this issue Oct 15, 2017

ERR: setting a column with a scalar and no index should raise #16823

Closed

shoyer mentioned this issue Nov 25, 2017

adding static type checking with mypy #14468

Closed

jorisvandenbossche mentioned this issue Nov 27, 2017

DEPR: let's deprecate #18262

Closed

34 tasks

chris-b1 mentioned this issue Dec 14, 2017

Intermittent KeyErrors for Series with duplicates in string index #18782

Closed

toobaz mentioned this issue Jun 2, 2018

Dict/dict keys in DataFrame.__getitem__ #21294

Closed

TomAugspurger mentioned this issue Mar 12, 2019

df[i, j] as an alias to df.loc[i, j] #25677

Closed

jbrockmendel mentioned this issue Jan 27, 2020

DEPR: row-based indexing in DataFrame.__getitem__ #31334

Closed

jorisvandenbossche mentioned this issue Jan 30, 2020

REGR: __setitem__ with integer slices on Int/RangeIndex is broken (label instead of positional) #31469

Closed

jbrockmendel mentioned this issue Jan 31, 2020

DEPR: DataFrame.__getitem__[str] sometimes slices on index #31476

Closed

jbrockmendel mentioned this issue Feb 20, 2020

Indexing Code Roundup #32135

Closed

galipremsagar mentioned this issue May 31, 2020

[REVIEW] Disable iteration in cudf objects and add support for DataFrame initialization with list of Series rapidsai/cudf#5340

Merged

3 tasks

jreback mentioned this issue Aug 10, 2020

ENH: Making Pandas dataframe slicing syntax match R dataframe syntax. #35659

Closed

mroeschke added the Needs Discussion Requires discussion from core team before further action label Apr 12, 2021

jbrockmendel mentioned this issue Sep 29, 2022

DISC: pd.DataFrame methods we specifically _don't_ want included data-apis/dataframe-api#83

Closed
