Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dict/dict keys in DataFrame.__getitem__ #21294

Closed
toobaz opened this issue Jun 2, 2018 · 11 comments · Fixed by #21313
Closed

Dict/dict keys in DataFrame.__getitem__ #21294

toobaz opened this issue Jun 2, 2018 · 11 comments · Fixed by #21313
Labels
Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@toobaz
Copy link
Member

toobaz commented Jun 2, 2018

Code Sample, a copy-pastable example if possible

In [2]: pd.Series(index=range(10))[{1:2,3:4}]
Out[2]: 
1   NaN
3   NaN
dtype: float64

In [3]: pd.Series(index=range(10))[{1:2,3:4}.keys()]
Out[3]: 
1   NaN
3   NaN
dtype: float64

In [4]: pd.DataFrame(index=range(10), columns=range(10))[{1:2,3:4}]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-5b29d5d42cf2> in <module>()
----> 1 pd.DataFrame(index=range(10), columns=range(10))[{1:2,3:4}]

~/nobackup/repo/pandas/pandas/core/frame.py in __getitem__(self, key)
   2683             return self._getitem_multilevel(key)
   2684         else:
-> 2685             return self._getitem_column(key)
   2686 
   2687     def _getitem_column(self, key):

~/nobackup/repo/pandas/pandas/core/frame.py in _getitem_column(self, key)
   2690         # get column
   2691         if self.columns.is_unique:
-> 2692             return self._get_item_cache(key)
   2693 
   2694         # duplicate columns & possible reduce dimensionality

~/nobackup/repo/pandas/pandas/core/generic.py in _get_item_cache(self, item)
   2482         """Return the cached item, item represents a label indexer."""
   2483         cache = self._item_cache
-> 2484         res = cache.get(item)
   2485         if res is None:
   2486             values = self._data.get(item)

TypeError: unhashable type: 'dict'

In [5]: pd.DataFrame(index=range(10), columns=range(10))[{1:2,3:4}.keys()]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-030f516d9637> in <module>()
----> 1 pd.DataFrame(index=range(10), columns=range(10))[{1:2,3:4}.keys()]

~/nobackup/repo/pandas/pandas/core/frame.py in __getitem__(self, key)
   2683             return self._getitem_multilevel(key)
   2684         else:
-> 2685             return self._getitem_column(key)
   2686 
   2687     def _getitem_column(self, key):

~/nobackup/repo/pandas/pandas/core/frame.py in _getitem_column(self, key)
   2690         # get column
   2691         if self.columns.is_unique:
-> 2692             return self._get_item_cache(key)
   2693 
   2694         # duplicate columns & possible reduce dimensionality

~/nobackup/repo/pandas/pandas/core/generic.py in _get_item_cache(self, item)
   2482         """Return the cached item, item represents a label indexer."""
   2483         cache = self._item_cache
-> 2484         res = cache.get(item)
   2485         if res is None:
   2486             values = self._data.get(item)

TypeError: unhashable type: 'dict_keys'

Problem description

I know that DataFrame.__getitem__ is a mess ( #9595 ), but I don't see why dicts and dict keys shouldn't be just considered list-likes as it happens with Series.

Expected Output

In [6]: pd.DataFrame(index=range(10), columns=range(10))[list({1:2,3:4})]
Out[6]: 
     1    3
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN
6  NaN  NaN
7  NaN  NaN
8  NaN  NaN
9  NaN  NaN

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-6-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.24.0.dev0+25.gcd0447102
pytest: 3.5.0
pip: 9.0.1
setuptools: 39.2.0
Cython: 0.25.2
numpy: 1.14.3
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.2.2.post1153+gff6786446
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

@toobaz toobaz added Indexing Related to indexing on series/frames, not to indexes themselves Effort Low labels Jun 2, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Jun 4, 2018
@jreback jreback added this to the 0.24.0 milestone Jun 12, 2018
@jorisvandenbossche
Copy link
Member

I don't see why dicts and dict keys shouldn't be just considered list-likes as it happens with Series.

Note that in many place we see Series and dicts as equivalents (in many method parameters that expect a mapping, eg fillna, astype, ..), but in those cases they are interpreted the same (I mean passing dict or pd.Series(dict) will do the same).
In this case they are not the same, as the Series' values are used, but the dict's keys.

Personally I don't really see a good reason to allow this (and for allowing this complexity). Do you have a practical use case?
People who want this can always explicitly do dict.keys() or list(dict) to be explicit about using the keys here.

(if we want to fix the inconsistency, I would rather disable using dicts in other places in indexing)

@toobaz
Copy link
Member Author

toobaz commented Jun 12, 2018

Personally I don't really see a good reason to allow this (and for allowing this complexity). Do you have a practical use case?

Well, yes, but...

People who want this can always explicitly do dict.keys() or list(dict) to be explicit about using the keys here.

... sure

(if we want to fix the inconsistency, I would rather disable using dicts in other places in indexing)

I wouldn't have major objections if we were doing it from scratch. But:

  • there is no real "complexity" issue. The diff is large just because I refactored code. Deciding whether to consider dicts and dict keys as list-like or not is a matter of two trivial lines of code inside is_list_like
  • we should try to be as coherent as possible on what is a "list-like" for pandas, and have is_list_like reflect that. So implementation wise, your suggestion basically reflects to "change this PR just by editing is_list_like".
  • This would disable dicts and dict keys as list-likes everywhere. It's my turn to ask: what is a good reason to do so? Python users know that list({1:2}) results in [1], so it's not so mysterious that a list can be replaced with a dict and that this means looking at its values. (What might be mysterious is that a list can't be replaced with a str, but they learn this soon)
  • My main issue is then that of backward compatibility.

I think it's way simpler to tell users that every iterable is a possible list-like, with the exception of strings (because they are scalar) and tuples (because they are MultiIndex keys), and have this rule apply to all the API (with the obvious exception of places where we expect a mapping, as in the initialization of a Series).

@jorisvandenbossche
Copy link
Member

there is no real "complexity" issue.

I understand that the actual fix might be two lines of code in is_list_like (and that it might actually gives simpler code to just coerce it to list than to explicitly disallow it), but with "added complexity" I also mean: 1) complexity for the user: the more possibilities we provide, the more complex it becomes (if you code written that indexes with a dict, then you need to know whether it uses the values or the keys to understand the code) and 2) maintenance overhead: regardless of the tiny code change, this gives rise to extra tests, extra questions if we should do it in other places as well to keep consistency, extra usage questions on the issue tracker, ...

so it's not so mysterious that a list can be replaced with a dict and that this means looking at its values.

you mean its "keys" .. :-)

I think it's way simpler to tell users that every iterable is a possible list-like ... and have this rule apply to all the API (with the obvious exception of places where we expect a mapping ..)

To get more concrete, what would be examples of usage cases that get changed by taking on this rule? (as I completely agree consistent rules how things are interpreted are good).

For example, to_numeric accepts list-likes (list, array, series (which is seen here as list-like and not mapping), ... but does it also need to accept dicts? (btw, it also does not accept a generator, while Series does)

@toobaz
Copy link
Member Author

toobaz commented Jun 12, 2018

you mean its "keys" .. :-)

:-)

To get more concrete, what would be examples of usage cases that get changed by taking on this rule? (as I completely agree consistent rules how things are interpreted are good).

Hard to answer, because they are not particularly related to pandas itself, but to the code surrounding pandas calls.

In my case, I was working with a df of which I wanted to select columns and change their appearance. So I had a dict translation = {old_name1 : new_name1, old_name2 : new_name2 ...} and wanted to do my_df[translation].rename(translation, axis=1).

Again, I could certainly use my_df[list(translation)].rename(translation, axis=1), it's just that I had got used to the Series behavior.

For example, to_numeric accepts list-likes (list, array, series (which is seen here as list-like and not mapping), ... but does it also need to accept dicts? (btw, it also does not accept a generator, while Series does)

... and it accepts tuples. I do see this as a discrepancy, but a minor one, compared to consistency in the same type of indexer across different objects.

@toobaz
Copy link
Member Author

toobaz commented Jun 12, 2018

I also think - but I will be happy to be proven wrong - that in Python libraries accepting any iterable wherever lists are expected is the norm.

@toobaz
Copy link
Member Author

toobaz commented Jun 13, 2018

complexity for the user: the more possibilities we provide, the more complex it becomes (if you code written that indexes with a dict, then you need to know whether it uses the values or the keys to understand the code)

Just thinking out loud: this does not apply to e.g. dict.keys() or dict.values() (which currently exhibit the same behavior). So another solution would be to exclude mappings in general, but accept these (implemented as "things that have __iter__ but do not have keys()").

(I don't necessarily mean this is a better solution - my main worry, of breaking existing code, stays mostly unchanged)

@jorisvandenbossche
Copy link
Member

I also think - but I will be happy to be proven wrong - that in Python libraries accepting any iterable wherever lists are expected is the norm.

Yep, that is true and a good point. Eg looking at the "built-in functions", they use consistent terminology of "iterable" in some of the functions, and they all accept a dict, eg:

In [50]: d = {1: 2, 2: 3}

In [51]: sum(d)
Out[51]: 3

@jorisvandenbossche
Copy link
Member

I see the point of "allowing dicts actually creates less special cases". Which is a good one (only of course that we already make an exception on the rule for tuples in certain cases, as you mentioned)

It's only that I would personally never want to see such usage in code I review or maintain. I think I would always ask to change to be more explicit on it :-)

@toobaz
Copy link
Member Author

toobaz commented Jun 13, 2018

we already make an exception on the rule for tuples in certain cases

By the way: I think over time it should become "in all cases" (in which we declare to accept "list-likes", which might not necessarily be all cases in which we accept lists)

Another thing: the obvious alternative to "accept anything which can be casted to a list and is not a tuple or string" is "accept only lists". But in principle there can be good reasons to pass generators to save RAM (e.g. indexing a RangeIndex with a range in Python 3 could already avoids casting to list).

I see the point of "allowing dicts actually creates less special cases".

So would you say #21313 is OK?

@toobaz
Copy link
Member Author

toobaz commented Jun 20, 2018

@jorisvandenbossche if you think we need more time to decide, I will change #21313 so that it temporarily keeps the actual behavior

@jorisvandenbossche
Copy link
Member

Given the discussion above, I am fine with going forward and handling dict here not as a special case but just as any other iterable.

jreback pushed a commit that referenced this issue Jul 7, 2018
)

* BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes

close #21294
close #21428
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this issue Oct 1, 2018
…das-dev#21313)

* BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes

close pandas-dev#21294
close pandas-dev#21428
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants