The API to retrieve serie elements presents some inconsistencies #12890

sylvaticus · 2016-04-13T13:18:01Z

Hello, I personally feel that the ways to select elements in a Series are a bit of a mess :-)

The general idea is that .iloc and .loc behave consistently when asked for a position-based or an index (label)-based value respectively, but are a bit slower than .ix and plain [], whose behaviour is not always consistent.
But I found even these methods a bit inconsistent, also in terms of what is returned when the labels are not found or the requested positions are out of range in the looked-up Series.

I compiled the following tables, which summarise the behaviour of these 4 lookup methods depending on (a) whether the Series being looked up has an integer or a string index (I do not consider date indexes for the moment), (b) whether the requested data is a single element, a slice or a list (yes, the behaviour changes!) and (c) whether the index is found in the data or not.

The following tables were produced with pandas 0.17.1, NumPy 1.10.4, Python 3.4.3.

Case 1: Series with Integer index

s = pd.Series(np.arange(100,105), index=np.arange(10,15))
s
10    100
11    101
12    102
13    103
14    104

** Single element **             ** Slice **                                       ** List **
s[0]       -> LAB -> KeyError    s[0:2]        -> POS -> {10:100, 11:101}          s[[1,3]]        -> LAB -> {1:NaN, 3:NaN}
s[13]      -> LAB -> 103         s[10:12]      -> POS -> empty Series              s[[12,14]]      -> LAB -> {12:102, 14:104}
---                              ---                                               ---
s.ix[0]    -> LAB -> KeyError    s.ix[0:2]     -> LAB -> empty Series              s.ix[[1,3]]     -> LAB -> {1:NaN, 3:NaN}
s.ix[13]   -> LAB -> 103         s.ix[10:12]   -> LAB -> {10:100, 11:101, 12:102}  s.ix[[12,14]]   -> LAB -> {12:102, 14:104}
---                              ---                                               ---
s.iloc[0]  -> POS -> 100         s.iloc[0:2]   -> POS -> {10:100, 11:101}          s.iloc[[1,3]]   -> POS -> {11:101, 13:103}
s.iloc[13] -> POS -> IndexError  s.iloc[10:12] -> POS -> empty Series              s.iloc[[12,14]] -> POS -> IndexError
---                              ---                                               ---
s.loc[0]   -> LAB -> KeyError    s.loc[0:2]    -> LAB -> empty Series              s.loc[[1,3]]    -> LAB -> KeyError
s.loc[13]  -> LAB -> 103         s.loc[10:12]  -> LAB -> {10:100, 11:101, 12:102}  s.loc[[12,14]]  -> LAB -> {12:102, 14:104}
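
The .loc and .iloc rows of the table above can be re-checked with a short sketch like the one below. It deliberately leaves out [] and .ix, and the `probe` helper is hypothetical (it is not part of pandas): it only turns an exception into its name so that all outcomes can be collected side by side.

```python
import numpy as np
import pandas as pd


def probe(lookup):
    """Hypothetical helper: run a lookup and return its result or the exception name."""
    try:
        return lookup()
    except (KeyError, IndexError, TypeError) as exc:
        return type(exc).__name__


s = pd.Series(np.arange(100, 105), index=np.arange(10, 15))

# Single element: .iloc counts positions, .loc matches labels.
single = {
    "iloc[0]": probe(lambda: s.iloc[0]),    # first position
    "iloc[13]": probe(lambda: s.iloc[13]),  # position out of range
    "loc[0]": probe(lambda: s.loc[0]),      # no such label
    "loc[13]": probe(lambda: s.loc[13]),    # label 13
}

# Slices: bounds that match nothing give an empty Series instead of an error.
slices = {
    "iloc[0:2]": list(s.iloc[0:2].index),
    "iloc[10:12]": list(s.iloc[10:12].index),
    "loc[0:2]": list(s.loc[0:2].index),
    "loc[10:12]": list(s.loc[10:12].index),
}

print(single)
print(slices)
```

Collecting the index labels of each slice makes it easy to see both the empty results and the extra end label that .loc includes.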

Case 2: Series with string index

s = pd.Series(np.arange(100,105), index=['a','b','c','d','e'])
s
a    100
b    101
c    102
d    103
e    104

** Single element **                             ** Slice **                                           ** List **
s[0]        -> POS -> 100                        s[0:2]          -> POS -> {'a':100,'b':101}           s[[0,2]]          -> POS -> {'a':100,'c':102}
s[10]       -> LAB, POS -> KeyError, IndexError  s[10:12]        -> POS -> Empty Series                s[[10,12]]        -> POS -> IndexError
s['a']      -> LAB -> 100                        s['a':'c']      -> LAB -> {'a':100,'b':101, 'c':102}  s[['a','c']]      -> LAB -> {'a':100,'c':102}
s['g']      -> POS,LAB -> TypeError, KeyError    s['f':'h']      -> LAB -> Empty Series                s[['f','h']]      -> LAB -> {'f':NaN, 'h':NaN}
---                                              ---                                                   ---
s.ix[0]     -> POS -> 100                        s.ix[0:2]       -> POS -> {'a':100,'b':101}           s.ix[[0,2]]       -> POS -> {'a':100,'c':102}
s.ix[10]    -> POS -> IndexError                 s.ix[10:12]     -> POS -> Empty Series                s.ix[[10,12]]     -> POS -> IndexError
s.ix['a']   -> LAB -> 100                        s.ix['a':'c']   -> LAB -> {'a':100,'b':101, 'c':102}  s.ix[['a','c']]   -> LAB -> {'a':100,'c':102}
s.ix['g']   -> POS, LAB -> TypeError, KeyError   s.ix['f':'h']   -> LAB -> Empty Series                s.ix[['f','h']]   -> LAB -> {'f':NaN, 'h':NaN}
---                                              ---                                                   ---
s.iloc[0]   -> POS -> 100                        s.iloc[0:2]     -> POS -> {'a':100,'b':101}           s.iloc[[0,2]]     -> POS -> {'a':100,'c':102} 
s.iloc[10]  -> POS -> IndexError                 s.iloc[10:12]   -> POS -> Empty Series                s.iloc[[10,12]]   -> POS -> IndexError 
s.iloc['a'] -> LAB -> TypeError                  s.iloc['a':'c'] -> POS -> ValueError                  s.iloc[['a','c']] -> POS -> TypeError    
s.iloc['g'] -> LAB -> TypeError                  s.iloc['f':'h'] -> POS -> ValueError                  s.iloc[['f','h']] -> POS -> TypeError
---                                              ---                                                   ---
s.loc[0]    -> LAB -> KeyError                   s.loc[0:2]      -> LAB -> TypeError                   s.loc[[0,2]]      -> LAB -> KeyError
s.loc[10]   -> LAB -> KeyError                   s.loc[10:12]    -> LAB -> TypeError                   s.loc[[10,12]]    -> LAB -> KeyError
s.loc['a']  -> LAB -> 100                        s.loc['a':'c']  -> LAB -> {'a':100,'b':101, 'c':102}  s.loc[['a','c']]  -> LAB -> {'a':100,'c':102}
s.loc['g']  -> LAB -> KeyError                   s.loc['f':'h']  -> LAB -> Empty Series                s.loc[['f','h']]  -> LAB -> KeyError

As you can see, there are several inconsistencies, some of them even when using .iloc and .loc.

Lookups that find nothing, or that index out of range, are handled in three different ways: an exception is raised, an empty Series is returned, or a Series with the requested keys mapped to NaN values is returned. For example, s.loc['f':'h'] returns an empty Series while s.loc[['f','h']] raises a KeyError. There should be a single way to handle missing elements, and possibly an optional parameter to say what to do when missing elements are encountered.
When slicing, if the lookup is by position the end element is excluded, but when the lookup is by label the end element is included!
.ix is redundant. There should be .iloc[] and .loc[] for a guaranteed lookup by position and by label respectively, plus one faster way with more complicated (but still well documented) logic for when performance is a priority. s[] is just quicker to type than s.ix[], so for me the latter is redundant.
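
The first two points can be reproduced in a few lines. This is only a sketch using .loc, .iloc and Series.reindex; the variable names are made up for illustration, and reindex is shown as one existing way to ask for the NaN-filled result explicitly, not as the fix proposed here.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100, 105), index=["a", "b", "c", "d", "e"])

# Label slice that matches nothing: an empty Series, no exception.
by_slice = s.loc["f":"h"]

# List of labels that matches nothing: KeyError.
try:
    s.loc[["f", "h"]]
    list_outcome = "no error"
except KeyError:
    list_outcome = "KeyError"

# reindex returns the requested labels mapped to NaN.
by_reindex = s.reindex(["f", "h"])

# Endpoints: positional slices exclude the stop, label slices include it.
pos_labels = list(s.iloc[0:2].index)
lab_labels = list(s.loc["a":"c"].index)

print(len(by_slice), list_outcome, by_reindex.isna().all())
print(pos_labels, lab_labels)
```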


TomAugspurger · 2016-04-13T13:46:57Z

xref #9595

Haven't had a chance to read yours closely yet. Is there anything different from #9595? Actually, do you mind closing this issue and moving your post there to avoid fragmenting the discussion? I'll follow up there later.

sylvaticus · 2016-04-13T13:56:05Z

Hi, thanks for the quick response. #9595 refers in detail to the []/.ix issue, while my post is a bit more general, but yes, I guess they are closely related, so feel free to merge them... thank you.

jreback · 2016-04-13T15:34:39Z

@sylvaticus virtually everything you showed is well-documented and expected, IOW,

.iloc does not include the right bound, as it's a positional indexer
.loc DOES include the right bound, as it's a label-based indexer
.ix remains (and is not deprecated) mainly because it offers some slight syntactic convenience on multi-axis indexing (IOW you can do combined label and positional indexing on different axes)
[] tries to be smart, so there are lots of issues, as indicated in Overview of [] (__getitem__) API #9595
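
The combined label and positional case mentioned for .ix can also be written without it, by translating one axis first. A minimal sketch, with a frame and labels made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["x", "y"],
    columns=["a", "b", "c"],
)

# Row by label, columns by position: turn the positions into labels, then use .loc.
via_loc = df.loc["y", df.columns[[0, 2]]]

# Same selection the other way round: turn the row label into a position, then use .iloc.
# (get_loc returns a plain integer here because the index labels are unique.)
via_iloc = df.iloc[df.index.get_loc("y"), [0, 2]]

print(via_loc.tolist())   # [3, 5]
print(via_iloc.tolist())  # [3, 5]
```

Both forms are more verbose than a single mixed indexer, which is the convenience being described.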

There is an issue somewhere where we discuss the handling of missing indexers when using a list-like (and when you should raise versus behave reindex-like). At some point I think we need an option for this, e.g.

.loc(errors='raise' or 'ignore')[....], to handle both of these cases, which are both valid and used.
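
To make the two cases concrete: the `errors=` option above is only a proposal, not an existing pandas API, but its behaviour can be approximated with a small wrapper. The function below is a hypothetical helper written for illustration; it assumes a Series with unique index labels and uses reindex for the 'ignore' case.

```python
import numpy as np
import pandas as pd


def select_labels(s, labels, errors="raise"):
    """Hypothetical helper (not a pandas API): list lookup with an explicit missing-label policy."""
    if errors not in ("raise", "ignore"):
        raise ValueError("errors must be 'raise' or 'ignore'")
    missing = [label for label in labels if label not in s.index]
    if missing and errors == "raise":
        raise KeyError(f"labels not found: {missing}")
    # reindex maps absent labels to NaN (integer data becomes float when any label is absent).
    return s.reindex(labels)


s = pd.Series(np.arange(100, 105), index=["a", "b", "c", "d", "e"])

filled = select_labels(s, ["a", "f"], errors="ignore")  # 'f' is mapped to NaN

try:
    select_labels(s, ["a", "f"], errors="raise")
    raised = False
except KeyError:
    raised = True

print(filled)
print(raised)
```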

As far as performance goes: well, correctness matters first. You should not be using these indexers repeatedly in a loop. If you are, then it's a user error.
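
As a small illustration of that point (a sketch with made-up names, and no timing claims), the same values can be fetched with one indexer call instead of one call per label:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100, 105), index=["a", "b", "c", "d", "e"])
wanted = ["e", "a", "c"]

# Element by element: one indexer call per label.
looped = [s.loc[label] for label in wanted]

# One call with the whole list of labels.
vectorized = s.loc[wanted]

print(vectorized.tolist())  # [104, 100, 102]
```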

TomAugspurger added the Indexing Related to indexing on series/frames, not to indexes themselves label Apr 13, 2016

jreback closed this as completed Apr 13, 2016

jorisvandenbossche mentioned this issue Apr 13, 2016

Overview of [] (__getitem__) API #9595

Open
