.loc on DataFrame returning coerced dtype for single rows #11617

samueljohn · 2015-11-16T16:36:29Z

The .loc method of a DataFrame with different dtypes yields a coerced type even if the resulting slice contains elements of only one type. This happens only when selecting a single row.
I guess this might be intended, because the implementation of loc seems to first look up the row as a single Series, doing the coercion, and then apply the second (column) indexer.

However, when the column indexer narrows down the selection such that the upcasting would not have been necessary in the first place, it can be very surprising and may even cause bugs (on the user side) if it goes unnoticed. (Like, "I was sure that column was int64").
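The suspected two-step lookup can be reproduced by hand with public indexers. This is a minimal sketch, assuming a frame shaped like the one in this report; it only mimics the order of operations described above and does not call any pandas internals.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the report: one float column, one int column.
d = pd.DataFrame({"a": [1.23], "b": [666]})

# Step 1: select the row on its own. A Series holds a single dtype,
# so the int and the float are combined into one common dtype.
row = d.loc[0]
print(row.dtype)

# Step 2: apply the column key to the already-coerced row.
value = row["b"]
print(type(value), value)
```

Doing the two steps separately like this shows where the int could get lost: the column key is applied only after the row has already been turned into a single-dtype Series.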

>>> import pandas as pd

>>> d = pd.DataFrame(dict(a=[1.23]))
>>> d["b"] = 666  # adding column with int

>>> d.info()  # info as expected (column b is int64 - fine)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 2 columns):
a    1 non-null float64
b    1 non-null int64
dtypes: float64(1), int64(1)
memory usage: 24.0 bytes

>>> d.loc[0,"b"]  # UNEXPECTED: returning a single float
666.0

>>> d.ix[0, "b"]  # OK: returns a single int
666

>>> d.loc[[0], "b"]  # OK
0    666
Name: b, dtype: int64

Feel free to close if the behavior is intended. Maybe this is a "bug" or a suggested API change. I dunno.

Perhaps related to #10503, #9519, #9269, #11594 ?

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Darwin
OS-release: 15.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8

pandas: 0.17.0
[...]

jreback · 2015-11-16T20:48:43Z

hmm, that is a bit inconsistent. I would expect all of these (adding .iloc here) to give the same result, without coercing or changing the dtype.

In [23]: d
Out[23]: 
      a    b
0  1.23  666

In [24]: d.dtypes
Out[24]: 
a    float64
b      int64
dtype: object

In [25]: d.ix[0, "b"]
Out[25]: 666

In [26]: d.loc[0, "b"]
Out[26]: 666.0

In [28]: d.iloc[0,1]
Out[28]: 666.0

if you'd like to dig in would be great!

samueljohn · 2015-11-17T11:30:51Z

Wow ... I have been deep down in the 5k LOC internals.py... I don't think I wanna go there again :-)
I somehow assumed pandas was something "lightweight" on top of numpy.

So, indeed a creation of a Series seems involved.

In the following, I used the latest release for tracing, but I point into the master codebase. Perhaps if you have pandas master installed, you could check whether this still applies (I think yes).

I have traced it as far as this: first a Series is created for the first key in the tuple (0, "b"). The call to d.loc.obj._xs(0, axis=0) calls d.loc.obj._data.fast_xs(0) here:
https://github.com/pydata/pandas/blob/master/pandas/core/generic.py#L1498-L1500

In the creation of the Series, the blocks are still correct:
(FloatBlock: slice(0, 1, 1), 1 x 1, dtype: float64, IntBlock: slice(1, 2, 1), 1 x 1, dtype: int64)

But then the dtype is determined by dtype = _interleaved_dtype(self.blocks) (https://github.com/pydata/pandas/blob/master/pandas/core/internals.py#L3170), which returns float64. That makes sense from a numeric point of view. That method is also in internals:
https://github.com/pydata/pandas/blob/master/pandas/core/internals.py#L4114

I think this is how pandas Series are defined (they must contain just one type).
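As a rough illustration of why float64 comes out, NumPy's own promotion rule gives the same answer for this pair of dtypes. The sketch uses `np.result_type` as a stand-in; that it matches what `_interleaved_dtype` does for this case is an assumption, and this is not a copy of the pandas code.

```python
import numpy as np

# The two block dtypes seen in the trace above.
block_dtypes = [np.dtype("float64"), np.dtype("int64")]

# NumPy's promotion rule picks one dtype that can hold both.
common = np.result_type(*block_dtypes)
print(common)

# Once an object-dtype block is involved, object is the only common choice.
mixed = np.result_type(np.dtype("float64"), np.dtype("int64"), np.dtype(object))
print(mixed)
```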

But the question is if the creation of the series should perhaps better be done after the second key (in this example the column "b") is evaluated. Because then the dtype would not need to be a float64 at all.
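The alternative order can be tried from the outside with public indexers. This is only a sketch of the idea (column key first, row key second), not a proposal for how the indexing code itself should be restructured.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({"a": [1.23], "b": [666]})

# Column first: the intermediate Series comes from a single column,
# so there is only one dtype and nothing to upcast.
col = d["b"]
value = col.loc[0]
print(col.dtype, type(value), value)
```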

Not sure if this is still Effort Low.

jreback · 2015-11-17T11:41:36Z

@samueljohn haha, indexing is pretty complex!

We don't distinguish all-scalar keys upfront, hence the serial conversions. The easiest thing to do is to try changing it and see if your tests for this behavior (and the original tests) pass. That is the thing about indexing: preserving the API when making changes.
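A test for this behavior might compare the same cell fetched through several indexers. The helper name `scalar_lookups` below is hypothetical, and `.ix` is left out because it is not available in every pandas release; on a release affected by this issue the check would be expected to report a float kind for some of the indexers.

```python
import numpy as np
import pandas as pd

def scalar_lookups(frame, row_label, col_label):
    """Hypothetical helper: fetch one cell through several scalar indexers."""
    i = frame.index.get_loc(row_label)    # positional row (unique index assumed)
    j = frame.columns.get_loc(col_label)  # positional column (unique columns assumed)
    return {
        "loc": frame.loc[row_label, col_label],
        "iloc": frame.iloc[i, j],
        "at": frame.at[row_label, col_label],
        "iat": frame.iat[i, j],
    }

d = pd.DataFrame({"a": [1.23], "b": [666]})
results = scalar_lookups(d, 0, "b")

# Reduce each result to its dtype kind ('i' for integer, 'f' for float).
kinds = {name: np.asarray(value).dtype.kind for name, value in results.items()}
print(kinds)
```

If all indexers agree, `set(kinds.values())` has exactly one element, which is an easy condition to assert in a test.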

mao-liu · 2016-05-04T07:41:31Z

Hi @jreback , @samueljohn.

I also encountered this problem today. After a little digging around, the following may help:

Firstly, the dataframe behaves correctly if there is a non-numeric object in the dataframe:

>>> df = pd.DataFrame({'x': [1,2,3], 'y': [1.0, 2.0, 3.0]}, columns=['x', 'y'])
>>> df.loc[0]
x    1.0
y    1.0
Name: 0, dtype: float64
>>> [type(v) for v in df.loc[0]]
[numpy.float64, numpy.float64]

>>> df['z'] = 'foo'
>>> df.loc[0]
x    1  
y    1  
z    foo
Name: 0, dtype: object
>>> [type(v) for v in df.loc[0]]
[numpy.int64, numpy.float64, str]

Secondly, this may be fixed by simply changing how the Series constructor is called:

>>> import numpy as np
>>> s = pd.Series([np.int64(1), np.float64(1.0)])
>>> print(s)
0    1.0
1    1.0
dtype: float64
>>> [type(v) for v in s]
[numpy.float64, numpy.float64]

>>> s = pd.Series([np.int64(1), np.float64(1.0)], dtype='object')
>>> print(s)
0    1
1    1
dtype: object
>>> [type(v) for v in s]
[numpy.int64, numpy.float64]

Any thoughts on possible performance hits if df.loc[...] always returns a series with dtype='object'?

Cheers,

jreback · 2016-05-04T12:13:55Z

Returning object is only appropriate if the result actually includes things that are not representable as base types. Right now we coerce ints to floats if needed; this is pretty standard practice, as it is much more efficient.
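The efficiency point can be checked with a small timing comparison. This is a sketch on plain NumPy arrays rather than on pandas objects, and the array size and repeat count are arbitrary choices; the timings depend on the machine, so no numbers are claimed here.

```python
import timeit
import numpy as np

n = 100_000
as_float = np.arange(n, dtype="float64")
as_object = as_float.astype(object)  # same values, each boxed as a Python object

# Time the same reduction on both representations.
t_float = timeit.timeit(as_float.sum, number=20)
t_object = timeit.timeit(as_object.sum, number=20)
print(f"float64: {t_float:.4f}s  object: {t_object:.4f}s")
```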

mao-liu · 2016-05-04T22:30:09Z

I propose a fix in _interleaved_dtype(blocks).

I think there are use cases for both scenarios:

- Always coerce numeric dtypes into a dtype that supports all dtypes in the block, for calculation-heavy applications which don't care too much about preserving numerical dtypes.
- Always preserve numerical dtypes, using dtype('object') where different numerical types are present, for applications where preserving the dtypes of a data frame is important.

Maybe it could be added as a pandas option, perhaps 'mode.coerce_numerical_dtypes'?

jreback · 2016-05-04T22:43:18Z

Try to address the specific change of verifying that all of the cases above return the same dtype. Doing something more complicated like returning an object dtype is prob ok, but only in very specific circumstances.

Doing what you are suggesting above is not going to be back-compat and will likely break lots of things. Start small.

Adding an option is also a non-starter.

mao-liu · 2016-05-04T22:50:18Z

I think I will submit a separate issue. I currently require a way of retrieving a row of a DataFrame that preserves numerical dtypes, which is separate to this issue, but very related.
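For reference, two existing ways of looking at a single row without going through a single-dtype Series are sketched below. Whether either of them fits the use case described here is an assumption; they return a one-row DataFrame and a namedtuple respectively, not a Series.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [1.0, 2.0, 3.0]})

# A list indexer returns a one-row DataFrame, so each column keeps its dtype.
one_row = df.loc[[0]]
print(one_row.dtypes)

# itertuples yields one namedtuple per row, with fields named after the columns.
first = next(df.itertuples(index=False))
print(first)
```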

jreback · 2016-05-04T22:52:03Z

I suppose a:

df.loc(coerce=False)[0] might be ok (with a default of True) for back compat.
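Note that `coerce=False` is only an idea at this point and not an existing pandas argument. As a sketch of what such a call might return, the hypothetical helper `row_uncoerced` below reads each cell from its own column and collects the results in an object Series; it assumes unique column names and a unique row label.

```python
import numpy as np
import pandas as pd

def row_uncoerced(frame, label):
    """Hypothetical helper: fetch one row, reading each cell from its own column."""
    cells = [frame[col].loc[label] for col in frame.columns]
    # dtype=object stops the Series constructor from picking a common numeric dtype.
    return pd.Series(cells, index=frame.columns, dtype=object, name=label)

df = pd.DataFrame({"x": [1, 2, 3], "y": [1.0, 2.0, 3.0]})
row = row_uncoerced(df, 0)
print([type(v).__name__ for v in row])
```

The price is the one discussed above: the result is an object Series, so numeric operations on it no longer use a native numeric dtype.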

mao-liu · 2016-05-04T22:58:39Z

That could work, but would require more thought into how it plugs in to other data frame methods (e.g. df.apply(..., axis=1, coerce=True))

I will have a go at it, but it might take me some time.

handle iterator
handle NamedTuple
.loc returns scalar selection dtypes correctly, closes pandas-dev#11617

xref pandas-dev#15113

Author: Jeff Reback <jeff@reback.net>

Closes pandas-dev#15120 from jreback/indexing and squashes the following commits:

801c8d9 [Jeff Reback] BUG: indexing changes to .loc for compat to .ix for several situations

jreback added the Bug, Indexing (related to indexing on series/frames, not to indexes themselves), Dtype Conversions (unexpected or buggy dtype conversions), and Difficulty Intermediate labels Nov 16, 2015

jreback added this to the Next Major Release milestone Nov 16, 2015

samueljohn changed the title ~~.loc on DataFrame returning upcasted dtype for single rows~~ .loc on DataFrame returning coerced dtype for single rows Nov 17, 2015

jreback added Effort Medium and removed Effort Low labels Nov 17, 2015

jreback mentioned this issue Jan 6, 2016

Inconsistant datatype with type()? #11969

Closed

jreback mentioned this issue Sep 12, 2016

Datatype of Integer changes depending on indexing method on a dataframe with an integer and a float column #14205

Closed

jreback mentioned this issue Oct 3, 2016

Wrong dtype when mixed dtype DataFrame is accessed with complete indexer #14337

Closed

sinhrks mentioned this issue Oct 6, 2016

Selecting an element or row of mixed int/float DataFrame returns all floats #14361

Closed

jreback added a commit to jreback/pandas that referenced this issue Jan 12, 2017

BUG: indexing changes to .loc for compat to .ix for several situations

f42e960

handle iterator handle NamedTuple .loc returns scalar selection dtypes correctly, closes pandas-dev#11617 xref pandas-dev#15113

jreback modified the milestones: 0.20.0, Next Major Release Jan 12, 2017

jreback mentioned this issue Jan 12, 2017

BUG: indexing changes to .loc for compat to .ix for several situations #15120

Closed

jreback added a commit to jreback/pandas that referenced this issue Jan 12, 2017

BUG: indexing changes to .loc for compat to .ix for several situations

801c8d9

handle iterator handle NamedTuple .loc returns scalar selection dtypes correctly, closes pandas-dev#11617 xref pandas-dev#15113

jreback closed this as completed in 82ab26a Jan 12, 2017

jorisvandenbossche mentioned this issue Jan 25, 2017

.loc indexing of heterogeneous dataframe returns different dtype #15220

Closed

jreback mentioned this issue Apr 12, 2017

Spurious casting to float when assigning both integers and floats #15231

Open

mcvicuna mentioned this issue Jun 18, 2018

dtypes being ignored with certain versions of pandas LAL/trackml-library#16

Open

designMoreWeb mentioned this issue Feb 26, 2019

BUG: indexing with loc and iloc with list-likes and new dtypes do not change from object dtype #20635

Open

anirudnits mentioned this issue Jun 6, 2020

Added test test_datetimeField_after_setitem for issue #6942 #28790

Closed
