Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The API to retrieve serie elements presents some inconsistencies #12890

Closed
sylvaticus opened this issue Apr 13, 2016 · 3 comments
Closed

The API to retrieve serie elements presents some inconsistencies #12890

sylvaticus opened this issue Apr 13, 2016 · 3 comments
Labels
Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@sylvaticus
Copy link

Hello, I personally feel there is a bit of mess in the way to select elements in a Series :-)

The general idea is that .iloc and .loc have consistent behaviour for respectively demanding a position-based or a index(label)-based value, but are a bit slower than .ix and using directly [] which behaviour is not always consistent.
But I found these methods a bit inconsistent, also in terms of what to return if the labels are not found or the required position are out or range in the looked-up Series.

I compiled the following tables, that summarises the behaviour of these 4 methods of lookup depending (a) if the Series to look-up has an integer or a string index (I do not consider for the moment the date index), (b) if the required data is a single element, a slice index or a list (yes, the behaviour change!) and (c) if the index is found or not in the data.

The following tables works with pandas 0.17.1, NumPy 1.10.4, Python 3.4.3.

Case 1: Series with Integer index

s = pd.Series(np.arange(100,105), index=np.arange(10,15))
s
10    100
11    101
12    102
13    103
14    104
** Single element **             ** Slice **                                       ** Tuple **
s[0]       -> LAB -> KeyError    s[0:2]        -> POS -> {10:100, 11:101}          s[[1,3]]        -> LAB -> {1:NaN, 3:Nan}
s[13]      -> LAB -> 103         s[10:12]      -> POS -> empty Series              s[[12,14]]      -> LAB -> {12:102, 14:104}
---                              ---                                               ---
s.ix[0]    -> LAB -> KeyError    s.ix[0:2]     -> LAB -> empty Series              s.ix[[1,3]]     -> LAB -> {1:NaN, 3:Nan}
s.ix[13]   -> LAB -> 103         s.ix[10:12]   -> LAB -> {10:100, 11:101, 12:102}  s.ix[[12,14]]   -> LAB -> {12:102, 14:104}
---                              ---                                               ---
s.iloc[0]  -> POS -> 100         s.iloc[0:2]   -> POS -> {10:100, 11:101}          s.iloc[[1,3]]   -> POS -> {11:101, 13:103}
s.iloc[13] -> POS -> IndexError  s.iloc[10:12] -> POS -> empty Series              s.iloc[[12,14]] -> POS -> IndexError
---                              ---                                               ---
s.loc[0]   -> LAB -> KeyError    s.loc[0:2]    -> LAB -> empty Series              s.loc[[1,3]]    -> LAB -> KeyError
s.loc[13]  -> LAB -> 103         s.loc[10:12]  -> LAB -> {10:100, 11:101, 12:102}  s.loc[[12,14]]  -> LAB -> {12:102, 14:104}

Case 2: Series with string index

s = pd.Series(np.arange(100,105), index=['a','b','c','d','e'])
s
a    100
b    101
c    102
d    103
e    104
** Single element **                             ** Slice **                                           ** Tuple **
s[0]        -> POS -> 100                        s[0:2]          -> POS -> {'a':100,'b':101}           s[[0,2]]          -> POS -> {'a':100,'c':102} 
s[10]       -> LAB, POS -> KeyError, IndexError  s[10:12]        -> POS -> Empty Series                s[[10,12]]        -> POS -> IndexError 
s['a']      -> LAB -> 100                        s['a':'c']      -> LAB -> {'a':100,'b':101, 'c':102}  s[['a','c']]      -> LAB -> {'a':100,'b':101, 'c':102} 
s['g']      -> POS,LAB -> TypeError, KeyError    s['f':'h']      -> LAB -> Empty Series                s[['f','h']]      -> LAB -> {'f':NaN, 'h':NaN}
---                                              ---                                                   ---
s.ix[0]     -> POS -> 100                        s.ix[0:2]       -> POS -> {'a':100,'b':101}           s.ix[[0,2]]       -> POS -> {'a':100,'c':102} 
s.ix[10]    -> POS -> IndexError                 s.ix[10:12]     -> POS -> Empty Series                s.ix[[10,12]]     -> POS -> IndexError 
s.ix['a']   -> LAB -> 100                        s.ix['a':'c']   -> LAB -> {'a':100,'b':101, 'c':102}  s.ix[['a','c']]   -> LAB -> {'a':100,'b':101, 'c':102} 
s.ix['g']   -> POS, LAB -> TypeError, KeyError   s.ix['f':'h']   -> LAB -> Empty Series                s.ix[['f','h']]   -> LAB -> {'f':NaN, 'h':NaN}
---                                              ---                                                   ---
s.iloc[0]   -> POS -> 100                        s.iloc[0:2]     -> POS -> {'a':100,'b':101}           s.iloc[[0,2]]     -> POS -> {'a':100,'c':102} 
s.iloc[10]  -> POS -> IndexError                 s.iloc[10:12]   -> POS -> Empty Series                s.iloc[[10,12]]   -> POS -> IndexError 
s.iloc['a'] -> LAB -> TypeError                  s.iloc['a':'c'] -> POS -> ValueError                  s.iloc[['a','c']] -> POS -> TypeError    
s.iloc['g'] -> LAB -> TypeError                  s.iloc['f':'h'] -> POS -> ValueError                  s.iloc[['f','h']] -> POS -> TypeError
---                                              ---                                                   ---
s.loc[0]    -> LAB -> KeyError                   s.loc[0:2]     -> LAB -> TypeError                   s.loc[[0,2]]     -> LAB -> KeyError 
s.loc[10]   -> LAB -> KeyError                   s.loc[10:12]   -> LAB -> TypeError                   s.loc[[10,12]]   -> LAB -> KeyError 
s.loc['a']  -> LAB-> 100                         s.loc['a':'c'] -> LAB -> {'a':100,'b':101, 'c':102}  s.loc[['a','c']] -> LAB -> {'a':100,'c':102}    
s.loc['g']  -> LAB -> KeyError                   s.loc['f':'h'] -> LAB -> Empty Series                s.loc[['f','h']] -> LAB -> KeyError

As you can see there are several inconsistencies, some of them even using .iloc and .loc.

  1. The event of not founding the elements/indexing out of range is managed in three different ways: an exception is thrown, a null Series is returned or a Series with the demanded keys associated to NaN values is returned. For example s.loc['f':'h'] returns an Empty Series when s.loc[['f','h']] returns instead a KeyError. There should be a single way to handle missing elements, and eventually an optional parameter should say what to do when missing elements are encountered.

  2. When using slicers, if the lookup is by position, the end element is excluded, but when the lookup is by label the final element is included!

  3. .ix is redundant. There should be .iloc[] and .loc[] to have a guaranteed query by position and label respectively, and a faster way with a more complicated logic (but still well documented) when performance is a priority. s[] is just quicker to type than s.ix[], so for me the latter method is redundant.

@TomAugspurger
Copy link
Contributor

xref #9595

Haven't had a chance to read yours closely yet. Is there anything different from #9595? Actually, do you mind closing this issue and moving your post there to avoid fragmenting the discussion? I'll followup there later.

@TomAugspurger TomAugspurger added the Indexing Related to indexing on series/frames, not to indexes themselves label Apr 13, 2016
@sylvaticus
Copy link
Author

Hi, thanks for the quick response. #9595 refers in detail to the []/.ix issue, while my post is a bit more general, but yes, I guess they are strictly related, so feel free to merge them... thank you..

@jreback
Copy link
Contributor

jreback commented Apr 13, 2016

@sylvaticus virtually everything you showed is well-documented and expected, IOW,

  • .iloc does not include the right bound as its a positional indexer
  • .loc DOES include the right bound as its a label based indexer
  • .ix remains (and is not deprecated) mainly because it offers some slightly syntactic convenience on multi-axis indexing (IOW you can do combined label and positional indexing on different axes)
  • [] tries to be smart so lots of issues as indicated in Overview of [] (__getitem__) API #9595

There is an issue somewhere where we discuss the handling of missing indexers when using a list-like (and whether you should raise or reindex-like when). At some point I think we need an option for this, e.g.

.loc(errors='raise' or 'ignore')[....], to handle both of these cases, which are both valid and used.

as far as performance. Well correctness matters first. You should not be using these indexers repeatedly in a loop. If you are then its a user error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

No branches or pull requests

3 participants