
.loc on DataFrame returning coerced dtype for single rows #11617

Closed
samueljohn opened this issue Nov 16, 2015 · 10 comments
Labels: Bug, Dtype Conversions, Indexing
Milestone: Next Major Release


samueljohn commented Nov 16, 2015

xref #14205

The .loc method of a DataFrame with mixed dtypes yields a coerced dtype even when the resulting slice only contains elements of a single dtype. This happens only when selecting a single row.
I can guess that this might be intended, because the implementation of loc seems to first look up the row as a single Series (doing the coercion) and then apply the second (column) indexer.

However, when the column indexer narrows down the selection such that the upcasting would not have been necessary in the first place, it can be very surprising and may even cause bugs on the user side if it goes unnoticed (like "I was sure that column was int64").

>>> import pandas as pd

>>> d = pd.DataFrame(dict(a=[1.23]))
>>> d["b"] = 666  # adding column with int

>>> d.info()  # info as expected (column b is int64 - fine)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 2 columns):
a    1 non-null float64
b    1 non-null int64
dtypes: float64(1), int64(1)
memory usage: 24.0 bytes

>>> d.loc[0,"b"]  # UNEXPECTED: returning a single float
666.0

>>> d.ix[0, "b"]  # OK: returns a single int
666

>>> d.loc[[0], "b"]  # OK
0    666
Name: b, dtype: int64
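
For what it's worth, the guessed intermediate step can be observed directly: pulling out the row on its own already shows the coercion (just an illustration of the suspected path, not the actual implementation):

>>> d.loc[0]  # the whole row as a Series is already coerced to float64
a      1.23
b    666.00
Name: 0, dtype: float64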

Feel free to close if the behavior is intended. Maybe this is a "bug" or maybe a suggested API change. I dunno.

Perhaps related to #10503, #9519, #9269, #11594 ?

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Darwin
OS-release: 15.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8

pandas: 0.17.0
[...]

jreback commented Nov 16, 2015

hmm, that is a bit inconsistent. I would expect all of these to give the same result (and not coerce); adding .iloc here as well, it should not change the dtype either.

In [23]: d
Out[23]: 
      a    b
0  1.23  666

In [24]: d.dtypes
Out[24]: 
a    float64
b      int64
dtype: object

In [25]: d.ix[0, "b"]
Out[25]: 666

In [26]: d.loc[0, "b"]
Out[26]: 666.0

In [28]: d.iloc[0,1]
Out[28]: 666.0

if you'd like to dig in, that would be great!

@jreback jreback added the Bug, Indexing, Dtype Conversions, and Difficulty Intermediate labels Nov 16, 2015
@jreback jreback added this to the Next Major Release milestone Nov 16, 2015
@samueljohn samueljohn changed the title .loc on DataFrame returning upcasted dtype for single rows .loc on DataFrame returning coerced dtype for single rows Nov 17, 2015
samueljohn (Author) commented

Wow ... I have been deep down in the 5k LOC internals.py... I don't think I wanna go there again :-)
I somehow assumed pandas was something "lightweight" on top of numpy.

So, indeed the creation of a Series seems to be involved.

In the following, I used the latest release for tracing, but I point into the master codebase. Perhaps if you have pandas master installed you could check whether this still applies (I think it does).

I have traced it this far: first a Series is created for the first key in the tuple (0, "b"). The call to d.loc.obj._xs(0, axis=0) calls d.loc.obj._data.fast_xs(0) here:
https://github.com/pydata/pandas/blob/master/pandas/core/generic.py#L1498-L1500

In the creation of the Series, the blocks are still correct:
(FloatBlock: slice(0, 1, 1), 1 x 1, dtype: float64, IntBlock: slice(1, 2, 1), 1 x 1, dtype: int64)

But then the dtype is determined by dtype = _interleaved_dtype(self.blocks) (https://github.com/pydata/pandas/blob/master/pandas/core/internals.py#L3170), which returns float64; that makes sense from a numerical point of view. That method is also in internals:
https://github.com/pydata/pandas/blob/master/pandas/core/internals.py#L4114
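
For intuition, the promotion it performs for this frame is essentially NumPy's common-type resolution (a rough analogy on my part, not the actual implementation):

>>> import numpy as np
>>> np.result_type(np.int64, np.float64)  # int64 and float64 interleave as float64
dtype('float64')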

I think this is how pandas Series are defined (they must contain just one dtype).

But the question is whether the creation of the Series should instead happen after the second key (in this example the column "b") has been evaluated, because then the dtype would not need to be float64 at all.
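
In the meantime, indexing the column first sidesteps the interleaving entirely, since a single column lives in one homogeneous block (just a user-side workaround, not a fix):

>>> d["b"].loc[0]  # column first, so no cross-block dtype promotion
666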

Not sure if this is still Effort Low.


jreback commented Nov 17, 2015

@samueljohn haha, indexing is pretty complex!

We don't distinguish between all-scalar keys up front, hence the serial conversions. The easiest thing to do is to try changing it and see whether your tests for this behavior (and the original tests) pass. That is the tricky part about indexing: preserving the API when making changes.
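
Roughly speaking (an illustration of the effect, not the actual code path), the serial application amounts to:

>>> d.loc[0].loc["b"]  # row first (coerces to float64), then column
666.0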


mao-liu commented May 4, 2016

Hi @jreback, @samueljohn.

I also encountered this problem today. After a little digging around, the following may help:

Firstly, the DataFrame behaves correctly if there is a non-numeric object in it:

>>> df = pd.DataFrame({'x': [1,2,3], 'y': [1.0, 2.0, 3.0]}, columns=['x', 'y'])
>>> df.loc[0]
x    1.0
y    1.0
Name: 0, dtype: float64
>>> [type(v) for v in df.loc[0]]
[numpy.float64, numpy.float64]

>>> df['z'] = 'foo'
>>> df.loc[0]
x    1  
y    1  
z    foo
Name: 0, dtype: object
>>> [type(v) for v in df.loc[0]]
[numpy.int64, numpy.float64, str]

Secondly, this may be fixed by simply changing how the Series constructor is called:

>>> import numpy as np
>>> s = pd.Series([np.int64(1), np.float64(1.0)])
>>> print(s)
0    1.0
1    1.0
dtype: float64
>>> [type(v) for v in s]
[numpy.float64, numpy.float64]

>>> s = pd.Series([np.int64(1), np.float64(1.0)], dtype='object')
>>> print(s)
0    1
1    1
dtype: object
>>> [type(v) for v in s]
[numpy.int64, numpy.float64]

Any thoughts on possible performance hits if df.loc[...] always returns a series with dtype='object'?
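
One quick way to gauge that cost would be something like this (a rough sketch, not a benchmark from this thread; numbers will vary):

>>> import numpy as np, pandas as pd
>>> from timeit import timeit
>>> s_num = pd.Series(np.arange(100000, dtype='float64'))
>>> s_obj = s_num.astype(object)
>>> timeit(s_num.sum, number=100)  # vectorized float64 reduction
>>> timeit(s_obj.sum, number=100)  # iterates over boxed Python objects, much slower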

Cheers,


jreback commented May 4, 2016

returning as object is only appropriate if it actually includes things that are not representable as base types. right now we coerce ints to floats if needed; this is pretty standard practice as it is much more efficient.
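
A small illustration of that rule with current behaviour:

>>> pd.Series([1, 2.5]).dtype   # a purely numeric mix stays numeric
dtype('float64')
>>> pd.Series([1, 'foo']).dtype  # only non-numeric content forces object
dtype('O')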


mao-liu commented May 4, 2016

I propose a fix in _interleaved_dtype(blocks).

I think there are use cases for both scenarios:

  • always coerce numeric dtypes into a dtype that can represent all dtypes in the block, for calculation-heavy applications that don't care too much about preserving numerical dtypes
  • always preserve numerical dtypes, using dtype('object') where different numerical types are present, for applications where preserving the dtypes of a DataFrame is important

Maybe it could be added as a pandas option, perhaps 'mode.coerce_numerical_dtypes'?
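
To make the idea concrete, here is a rough sketch of the switch I have in mind (the names and the option are purely illustrative; nothing like this exists in pandas):

import numpy as np

def _interleaved_dtype_sketch(block_dtypes, coerce_numerical=True):
    # Hypothetical helper mirroring _interleaved_dtype: choose a dtype for
    # interleaving blocks with mixed dtypes.
    if coerce_numerical:
        # current behaviour: promote to a common numeric dtype
        # (e.g. int64 + float64 -> float64)
        return np.result_type(*block_dtypes)
    # proposed alternative: fall back to object so each value keeps its type
    return block_dtypes[0] if len(set(block_dtypes)) == 1 else np.dtype(object)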


jreback commented May 4, 2016

Try to address the specific change of verifying that all of the cases above return the same dtype. Doing something more complicated like returning an object dtype is probably OK, but only in very specific circumstances.

Doing what you are suggesting above is not going to be backwards compatible and would likely break lots of things. Start small.

Adding an option is also a non-starter.
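
A minimal test of the "all cases return the same dtype" requirement might look like this (a sketch only; the names are illustrative and not from the pandas test suite):

import pandas as pd

def test_scalar_access_preserves_int_column():
    # Every scalar access path into the int64 column should give back an int,
    # not a value coerced to float.
    df = pd.DataFrame({"a": [1.23], "b": [666]})
    for result in (df.loc[0, "b"], df.iloc[0, 1], df.at[0, "b"], df.iat[0, 1]):
        assert result == 666
        assert not isinstance(result, float)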


mao-liu commented May 4, 2016

I think I will submit a separate issue. I currently need a way of retrieving a row of a DataFrame that preserves numerical dtypes, which is separate from this issue but closely related.


jreback commented May 4, 2016

I suppose a

df.loc(coerce=False)[0]

might be ok (with a default of True) for back-compat.
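
Until something like that exists, a blunt way to get non-coercing scalar access today is to view the frame as object dtype first (just an illustration, with the obvious performance caveat):

>>> d.astype(object).loc[0, "b"]  # no cross-column numeric coercion
666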


mao-liu commented May 4, 2016

That could work, but it would require more thought into how it plugs into other DataFrame methods (e.g. df.apply(..., axis=1, coerce=True)).
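
The same row-wise coercion already shows up there, since apply(axis=1) materializes each row as a Series (an illustration using the d from the original report):

>>> d.apply(lambda row: type(row["b"]).__name__, axis=1)
0    float64
dtype: object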

I will have a go at it, but it might take me some time.

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
handle iterator
handle NamedTuple
.loc returns scalar selection dtypes correctly, closes pandas-dev#11617

xref pandas-dev#15113

Author: Jeff Reback <jeff@reback.net>

Closes pandas-dev#15120 from jreback/indexing and squashes the following commits:

801c8d9 [Jeff Reback] BUG: indexing changes to .loc for compat to .ix for several situations