Dict/dict keys in DataFrame.getitem #21294

toobaz · 2018-06-02T08:06:50Z

Code Sample, a copy-pastable example if possible

In [2]: pd.Series(index=range(10))[{1:2,3:4}]
Out[2]: 
1   NaN
3   NaN
dtype: float64

In [3]: pd.Series(index=range(10))[{1:2,3:4}.keys()]
Out[3]: 
1   NaN
3   NaN
dtype: float64

In [4]: pd.DataFrame(index=range(10), columns=range(10))[{1:2,3:4}]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-5b29d5d42cf2> in <module>()
----> 1 pd.DataFrame(index=range(10), columns=range(10))[{1:2,3:4}]

~/nobackup/repo/pandas/pandas/core/frame.py in __getitem__(self, key)
   2683             return self._getitem_multilevel(key)
   2684         else:
-> 2685             return self._getitem_column(key)
   2686 
   2687     def _getitem_column(self, key):

~/nobackup/repo/pandas/pandas/core/frame.py in _getitem_column(self, key)
   2690         # get column
   2691         if self.columns.is_unique:
-> 2692             return self._get_item_cache(key)
   2693 
   2694         # duplicate columns & possible reduce dimensionality

~/nobackup/repo/pandas/pandas/core/generic.py in _get_item_cache(self, item)
   2482         """Return the cached item, item represents a label indexer."""
   2483         cache = self._item_cache
-> 2484         res = cache.get(item)
   2485         if res is None:
   2486             values = self._data.get(item)

TypeError: unhashable type: 'dict'

In [5]: pd.DataFrame(index=range(10), columns=range(10))[{1:2,3:4}.keys()]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-030f516d9637> in <module>()
----> 1 pd.DataFrame(index=range(10), columns=range(10))[{1:2,3:4}.keys()]

~/nobackup/repo/pandas/pandas/core/frame.py in __getitem__(self, key)
   2683             return self._getitem_multilevel(key)
   2684         else:
-> 2685             return self._getitem_column(key)
   2686 
   2687     def _getitem_column(self, key):

~/nobackup/repo/pandas/pandas/core/frame.py in _getitem_column(self, key)
   2690         # get column
   2691         if self.columns.is_unique:
-> 2692             return self._get_item_cache(key)
   2693 
   2694         # duplicate columns & possible reduce dimensionality

~/nobackup/repo/pandas/pandas/core/generic.py in _get_item_cache(self, item)
   2482         """Return the cached item, item represents a label indexer."""
   2483         cache = self._item_cache
-> 2484         res = cache.get(item)
   2485         if res is None:
   2486             values = self._data.get(item)

TypeError: unhashable type: 'dict_keys'

Problem description

I know that DataFrame.__getitem__ is a mess ( #9595 ), but I don't see why dicts and dict keys shouldn't be just considered list-likes as it happens with Series.

Expected Output

In [6]: pd.DataFrame(index=range(10), columns=range(10))[list({1:2,3:4})]
Out[6]: 
     1    3
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN
6  NaN  NaN
7  NaN  NaN
8  NaN  NaN
9  NaN  NaN

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-6-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.24.0.dev0+25.gcd0447102
pytest: 3.5.0
pip: 9.0.1
setuptools: 39.2.0
Cython: 0.25.2
numpy: 1.14.3
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.2.2.post1153+gff6786446
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

The text was updated successfully, but these errors were encountered:

close pandas-dev#21294

close pandas-dev#21294 close pandas-dev#21428

jorisvandenbossche · 2018-06-12T17:06:18Z

I don't see why dicts and dict keys shouldn't be just considered list-likes as it happens with Series.

Note that in many place we see Series and dicts as equivalents (in many method parameters that expect a mapping, eg fillna, astype, ..), but in those cases they are interpreted the same (I mean passing dict or pd.Series(dict) will do the same).
In this case they are not the same, as the Series' values are used, but the dict's keys.

Personally I don't really see a good reason to allow this (and for allowing this complexity). Do you have a practical use case?
People who want this can always explicitly do dict.keys() or list(dict) to be explicit about using the keys here.

(if we want to fix the inconsistency, I would rather disable using dicts in other places in indexing)

toobaz · 2018-06-12T18:59:14Z

Personally I don't really see a good reason to allow this (and for allowing this complexity). Do you have a practical use case?

Well, yes, but...

People who want this can always explicitly do dict.keys() or list(dict) to be explicit about using the keys here.

... sure

(if we want to fix the inconsistency, I would rather disable using dicts in other places in indexing)

I wouldn't have major objections if we were doing it from scratch. But:

there is no real "complexity" issue. The diff is large just because I refactored code. Deciding whether to consider dicts and dict keys as list-like or not is a matter of two trivial lines of code inside is_list_like
we should try to be as coherent as possible on what is a "list-like" for pandas, and have is_list_like reflect that. So implementation wise, your suggestion basically reflects to "change this PR just by editing is_list_like".
This would disable dicts and dict keys as list-likes everywhere. It's my turn to ask: what is a good reason to do so? Python users know that list({1:2}) results in [1], so it's not so mysterious that a list can be replaced with a dict and that this means looking at its values. (What might be mysterious is that a list can't be replaced with a str, but they learn this soon)
My main issue is then that of backward compatibility.

I think it's way simpler to tell users that every iterable is a possible list-like, with the exception of strings (because they are scalar) and tuples (because they are MultiIndex keys), and have this rule apply to all the API (with the obvious exception of places where we expect a mapping, as in the initialization of a Series).

jorisvandenbossche · 2018-06-12T21:33:48Z

there is no real "complexity" issue.

I understand that the actual fix might be two lines of code in is_list_like (and that it might actually gives simpler code to just coerce it to list than to explicitly disallow it), but with "added complexity" I also mean: 1) complexity for the user: the more possibilities we provide, the more complex it becomes (if you code written that indexes with a dict, then you need to know whether it uses the values or the keys to understand the code) and 2) maintenance overhead: regardless of the tiny code change, this gives rise to extra tests, extra questions if we should do it in other places as well to keep consistency, extra usage questions on the issue tracker, ...

so it's not so mysterious that a list can be replaced with a dict and that this means looking at its values.

you mean its "keys" .. :-)

I think it's way simpler to tell users that every iterable is a possible list-like ... and have this rule apply to all the API (with the obvious exception of places where we expect a mapping ..)

To get more concrete, what would be examples of usage cases that get changed by taking on this rule? (as I completely agree consistent rules how things are interpreted are good).

For example, to_numeric accepts list-likes (list, array, series (which is seen here as list-like and not mapping), ... but does it also need to accept dicts? (btw, it also does not accept a generator, while Series does)

toobaz · 2018-06-12T21:50:28Z

you mean its "keys" .. :-)

:-)

To get more concrete, what would be examples of usage cases that get changed by taking on this rule? (as I completely agree consistent rules how things are interpreted are good).

Hard to answer, because they are not particularly related to pandas itself, but to the code surrounding pandas calls.

In my case, I was working with a df of which I wanted to select columns and change their appearance. So I had a dict translation = {old_name1 : new_name1, old_name2 : new_name2 ...} and wanted to do my_df[translation].rename(translation, axis=1).

Again, I could certainly use my_df[list(translation)].rename(translation, axis=1), it's just that I had got used to the Series behavior.

For example, to_numeric accepts list-likes (list, array, series (which is seen here as list-like and not mapping), ... but does it also need to accept dicts? (btw, it also does not accept a generator, while Series does)

... and it accepts tuples. I do see this as a discrepancy, but a minor one, compared to consistency in the same type of indexer across different objects.

toobaz · 2018-06-12T21:58:41Z

I also think - but I will be happy to be proven wrong - that in Python libraries accepting any iterable wherever lists are expected is the norm.

toobaz · 2018-06-13T08:52:01Z

complexity for the user: the more possibilities we provide, the more complex it becomes (if you code written that indexes with a dict, then you need to know whether it uses the values or the keys to understand the code)

Just thinking out loud: this does not apply to e.g. dict.keys() or dict.values() (which currently exhibit the same behavior). So another solution would be to exclude mappings in general, but accept these (implemented as "things that have __iter__ but do not have keys()").

(I don't necessarily mean this is a better solution - my main worry, of breaking existing code, stays mostly unchanged)

jorisvandenbossche · 2018-06-13T12:02:42Z

I also think - but I will be happy to be proven wrong - that in Python libraries accepting any iterable wherever lists are expected is the norm.

Yep, that is true and a good point. Eg looking at the "built-in functions", they use consistent terminology of "iterable" in some of the functions, and they all accept a dict, eg:

In [50]: d = {1: 2, 2: 3}

In [51]: sum(d)
Out[51]: 3

jorisvandenbossche · 2018-06-13T12:05:36Z

I see the point of "allowing dicts actually creates less special cases". Which is a good one (only of course that we already make an exception on the rule for tuples in certain cases, as you mentioned)

It's only that I would personally never want to see such usage in code I review or maintain. I think I would always ask to change to be more explicit on it :-)

toobaz · 2018-06-13T21:20:39Z

we already make an exception on the rule for tuples in certain cases

By the way: I think over time it should become "in all cases" (in which we declare to accept "list-likes", which might not necessarily be all cases in which we accept lists)

Another thing: the obvious alternative to "accept anything which can be casted to a list and is not a tuple or string" is "accept only lists". But in principle there can be good reasons to pass generators to save RAM (e.g. indexing a RangeIndex with a range in Python 3 could already avoids casting to list).

I see the point of "allowing dicts actually creates less special cases".

So would you say #21313 is OK?

toobaz · 2018-06-20T15:52:20Z

@jorisvandenbossche if you think we need more time to decide, I will change #21313 so that it temporarily keeps the actual behavior

jorisvandenbossche · 2018-06-29T12:52:21Z

Given the discussion above, I am fine with going forward and handling dict here not as a special case but just as any other iterable.

close pandas-dev#21294 close pandas-dev#21428

) * BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes close #21294 close #21428

…das-dev#21313) * BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes close pandas-dev#21294 close pandas-dev#21428

toobaz added Indexing Related to indexing on series/frames, not to indexes themselves Effort Low labels Jun 2, 2018

toobaz added a commit to toobaz/pandas that referenced this issue Jun 4, 2018

BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes

8b34417

close pandas-dev#21294

toobaz mentioned this issue Jun 4, 2018

BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes #21313

Merged

5 tasks

toobaz added a commit to toobaz/pandas that referenced this issue Jun 11, 2018

BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes

52c5267

close pandas-dev#21294 close pandas-dev#21428

jreback added this to the 0.24.0 milestone Jun 12, 2018

toobaz added a commit to toobaz/pandas that referenced this issue Jun 12, 2018

BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes

43327c2

close pandas-dev#21294 close pandas-dev#21428

toobaz added a commit to toobaz/pandas that referenced this issue Jun 30, 2018

BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes

0e06ffd

close pandas-dev#21294 close pandas-dev#21428

toobaz added a commit to toobaz/pandas that referenced this issue Jul 3, 2018

BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes

b0f1f7d

close pandas-dev#21294 close pandas-dev#21428

toobaz added a commit to toobaz/pandas that referenced this issue Jul 4, 2018

BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes

b28afdb

close pandas-dev#21294 close pandas-dev#21428

toobaz added a commit to toobaz/pandas that referenced this issue Jul 7, 2018

BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes

2e63f5b

close pandas-dev#21294 close pandas-dev#21428

jreback closed this as completed in #21313 Jul 7, 2018

jreback pushed a commit that referenced this issue Jul 7, 2018

BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes (#21313

32ee973

) * BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes close #21294 close #21428

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this issue Oct 1, 2018

BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes (pan…

70d8be2

…das-dev#21313) * BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes close pandas-dev#21294 close pandas-dev#21428

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dict/dict keys in DataFrame.getitem #21294

Dict/dict keys in DataFrame.getitem #21294

toobaz commented Jun 2, 2018

INSTALLED VERSIONS

jorisvandenbossche commented Jun 12, 2018

toobaz commented Jun 12, 2018

jorisvandenbossche commented Jun 12, 2018

toobaz commented Jun 12, 2018 •

edited

Loading

toobaz commented Jun 12, 2018

toobaz commented Jun 13, 2018 •

edited

Loading

jorisvandenbossche commented Jun 13, 2018

jorisvandenbossche commented Jun 13, 2018

toobaz commented Jun 13, 2018

toobaz commented Jun 20, 2018

jorisvandenbossche commented Jun 29, 2018

Dict/dict keys in DataFrame.__getitem__ #21294

Dict/dict keys in DataFrame.__getitem__ #21294

Comments

toobaz commented Jun 2, 2018

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

jorisvandenbossche commented Jun 12, 2018

toobaz commented Jun 12, 2018

jorisvandenbossche commented Jun 12, 2018

toobaz commented Jun 12, 2018 • edited Loading

toobaz commented Jun 12, 2018

toobaz commented Jun 13, 2018 • edited Loading

jorisvandenbossche commented Jun 13, 2018

jorisvandenbossche commented Jun 13, 2018

toobaz commented Jun 13, 2018

toobaz commented Jun 20, 2018

jorisvandenbossche commented Jun 29, 2018

Dict/dict keys in DataFrame.getitem #21294

Dict/dict keys in DataFrame.getitem #21294

Output of `pd.show_versions()`

toobaz commented Jun 12, 2018 •

edited

Loading

toobaz commented Jun 13, 2018 •

edited

Loading