
BUG: read_csv returns object dtype for dates in empty frame #15524

Open
adbull opened this issue Feb 27, 2017 · 24 comments
Labels
Bug · Datetime (Datetime data) · dtype · IO CSV (read_csv, to_csv)

Comments

@adbull
Contributor

adbull commented Feb 27, 2017

Code Sample, a copy-pastable example if possible

>>> import io
>>> import pandas as pd

>>> csv = lambda: io.StringIO('date')
>>> df = pd.read_csv(csv(), dtype=int)
>>> df.dtypes

date    int64
dtype: object

>>> df = pd.read_csv(csv(), parse_dates=['date'])
>>> df.dtypes

date      object
dtype: object

>>> df.date.dt.tz_localize('UTC')

Traceback (most recent call last):
  File "<ipython-input-49-067c7ce27232>", line 1, in <module>
    df.date = df.date.dt.tz_localize('UTC')
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/generic.py", line 2740, in __getattr__
    return object.__getattribute__(self, name)
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/base.py", line 241, in __get__
    return self.construct_accessor(instance)
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/series.py", line 2743, in _make_dt_accessor
    raise AttributeError("Can only use .dt accessor with datetimelike "
AttributeError: Can only use .dt accessor with datetimelike values

>>> parser = lambda x: pd.to_datetime(x, utc=True)
>>> df = pd.read_csv(csv(), parse_dates=['date'], date_parser=parser)
>>> df.dtypes

date      object
dtype: object

>>> df.date.dt.tz_convert('EST')

Traceback (most recent call last):
  File "<ipython-input-129-3d458af8513c>", line 1, in <module>
    df.date.dt.tz_convert('EST')
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/generic.py", line 2740, in __getattr__
    return object.__getattribute__(self, name)
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/base.py", line 241, in __get__
    return self.construct_accessor(instance)
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/series.py", line 2743, in _make_dt_accessor
    raise AttributeError("Can only use .dt accessor with datetimelike "
AttributeError: Can only use .dt accessor with datetimelike values

Problem description

When reading CSVs with no data rows, read_csv() returns object dtype for columns listed in parse_dates, which can raise errors during later manipulation. This is contrary to the general behaviour of read_csv(), which otherwise correctly sets dtypes for empty frames when those dtypes are explicitly passed.

I don't think it would be hard to return the correct dtype here. If date_parser is not set, we know the dtype is datetime64[ns]; otherwise, we can call the parser on empty data and use the returned dtype.

Note that e.g. read_csv(..., dtype='datetime64[ns]') is not a solution, as this throws an error when the csv is non-empty.
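As a stopgap, one workaround is to re-coerce the parsed columns with pd.to_datetime after reading. This is a sketch, not pandas' own fix; the helper name read_csv_with_dates is made up for illustration:

```python
import io

import pandas as pd

# Hedged workaround sketch: re-coercing with to_datetime is a no-op
# on a correctly parsed column, but fixes the object dtype that an
# empty frame would otherwise keep.
def read_csv_with_dates(buf, date_cols):
    df = pd.read_csv(buf, parse_dates=list(date_cols))
    for col in date_cols:
        df[col] = pd.to_datetime(df[col])
    return df

empty = read_csv_with_dates(io.StringIO("date"), ["date"])
print(empty.dtypes["date"])  # datetime64[ns]
```

The same helper works unchanged on non-empty input, since to_datetime leaves an already-datetime column alone.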

Expected Output

date    int64
dtype: object

date      datetime64[ns]
dtype: object

date      datetime64[ns, UTC]
dtype: object

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.9.1
IPython: 4.2.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8.1
boto: 2.45.0
pandas_datareader: None

@jreback
Contributor

jreback commented Feb 27, 2017

Datetime parsing is pretty soft, meaning that unless pandas can unequivocally convert something, it won't. You can certainly force the conversion after the fact via .to_datetime(...). The default dtype when something is not converted is simply object. This is the safest way to do things.

I am not sure that I would coerce these empty columns like this (even though we certainly can, at least for the non-tz-aware case; it won't work for your tz-aware example at all, since that dtype is not defined up front). This can easily lead to mistakes, not to mention that parse_dates=True will try to soft-coerce lots of things (though I suppose that could ignore a forced conversion).

Thoughts?

@jorisvandenbossche
cc @gfyoung

@jreback jreback added IO CSV read_csv, to_csv Datetime Datetime data dtype labels Feb 27, 2017
@jreback
Contributor

jreback commented Feb 27, 2017

@adbull can you show a situation where this actually matters / makes a difference? Why are you trying to do this? For example, concatting an empty frame with one that is correctly dtyped will work, so this doesn't practically make a difference.

@adbull
Contributor Author

adbull commented Feb 27, 2017

Well, if I call read_csv with a specific column in parse_dates, I definitely want the column to be a datetime, and since the column is empty, it will unequivocally succeed in that conversion -- it's not like it can contain a parse error.

I'm not sure I follow you on the second paragraph? What won't work in my example?

Concatting an empty frame of dtype object to a non-empty one of dtype datetime64[ns, UTC] throws an error; is that not intended?

Anyway, potential use-cases are (a) concatting together multiple csvs, some of which may be empty; (b) regularly loading a csv which is frequently updated, and may or may not be empty. In (a), I would have to assign timezones after concat to stop it throwing an error; in (b), I would have to specifically check whether the frame is empty before attempting any timezone manipulation.

The workarounds aren't too hard, it just seems a bit awkward when the simple version would work fine for any other dtype.
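For use case (b), the empty-frame check described above might look like the following sketch (the load helper and the UTC localization are illustrative assumptions, not part of the issue):

```python
import io

import pandas as pd

def load(buf):
    # Workaround sketch for use case (b): force the dtype when the
    # frame is empty so timezone manipulation works either way.
    df = pd.read_csv(buf, parse_dates=["date"])
    if df.empty:
        df["date"] = pd.to_datetime(df["date"])
    df["date"] = df["date"].dt.tz_localize("UTC")
    return df
```

Without the df.empty branch, the tz_localize call raises AttributeError on an empty file, because the column comes back as object dtype.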

@gfyoung
Member

gfyoung commented Feb 28, 2017

@jreback : I'm torn on this. On the one hand, I agree with you that there is not a lot of value (from a functionality perspective) in supporting this behavior BUT that being said, @adbull does have a point about consistency. I turn your attention to some other inconsistencies related to this:

>>> Series([]).astype(np.datetime64)
Series([], dtype: datetime64[ns])
>>>
>>> Series([], dtype=np.datetime64)
...
TypeError: cannot convert datetimelike to dtype [datetime64]

@gfyoung
Member

gfyoung commented Apr 3, 2017

@jreback : Thoughts on what I said above?

@jreback
Contributor

jreback commented Apr 3, 2017

Series([]).astype(np.datetime64)
Series([], dtype: datetime64[ns])

Series([], dtype=np.datetime64)
...
TypeError: cannot convert datetimelike to dtype [datetime64]

This is like what we are discussing in #15859.

I agree that should work. As an aside, I think we need to systematically test all .astype vs. direct construction on empties (we can create another issue if you want). I mainly want the tests, so if you can add them that would be great, just xfailing the cases that don't work.
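Outside pandas' own test suite (and without the xfail machinery), the systematic check suggested here might be sketched as:

```python
import numpy as np

import pandas as pd

# Sketch of the suggested check: on empty data, .astype and direct
# construction should agree about the resulting dtype.
for dtype in (np.int64, np.float64, "datetime64[ns]"):
    via_astype = pd.Series([], dtype=object).astype(dtype)
    direct = pd.Series([], dtype=dtype)
    assert via_astype.dtype == direct.dtype, dtype
```

In the real test suite this would be parametrized, with the known-broken combinations (such as the generic np.datetime64 case above) marked xfail.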

@gfyoung
Member

gfyoung commented Apr 3, 2017

Okay, sounds good. I suspect that what I presented here is related to #15859, and that fixing it will trickle down to patching this issue.

@gfyoung
Member

gfyoung commented Apr 10, 2017

@jreback : So the exception raised traces back to maybe_cast_to_datetime, which checks that the specified dtype exactly matches datetime64[ns]; that is NOT the same as np.datetime64 (the latter dtype has no frequency).

Thus, we would need to soften the restrictions on casting to datetime64 (similarly for timedelta64). This is an API change. Thoughts?
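The dtype mismatch described here is visible directly in numpy (a minimal illustration, separate from pandas internals):

```python
import numpy as np

# The generic datetime64 dtype carries no unit, so it does not
# compare equal to the nanosecond-resolution dtype pandas checks for.
generic = np.dtype(np.datetime64)   # unit-less datetime64
nanos = np.dtype("datetime64[ns]")  # nanosecond resolution
print(generic == nanos)  # False
```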

@gfyoung
Member

gfyoung commented Apr 10, 2017

Also tracked down the cause of the original bug: _get_empty_meta, which is called when there is no data to parse, has no handling for parse_dates, though it does for dtype.

Unfortunately, patching this is not so straightforward, because there is a lot of logic surrounding which columns get parsed depending on how you specify parse_dates, and replicating it would feel very duplicative, since we already have it for the C and Python engines in two different implementations (one in Cython and the other in pure Python).

So yes, we could easily patch this by replicating the parse_dates logic, but that does not seem optimal...

@jreback , @jorisvandenbossche : Thoughts?

@jreback
Contributor

jreback commented Apr 11, 2017

@gfyoung I think the date parsing needs to be factored out so it is generically applicable to both engines :> It's a project, but it will make this much more generic (without sacrificing any perf). Then this fix is easy.

@gfyoung
Member

gfyoung commented Apr 11, 2017

@jreback : Any thoughts about my questions above regarding Series ?

@jorisvandenbossche
Member

@gfyoung so the question is whether we generally interpret np.datetime64 (if passed by the user) as 'datetime64[ns]', since 'ns' is the default resolution?
And currently how this is handled differs between functions?

@gfyoung
Member

gfyoung commented Apr 11, 2017

@jorisvandenbossche : I think the inconsistency is pretty clearly illustrated in my above comment here. It should be one way or the other (raise or interpret as datetime64[ns]).

@jreback
Contributor

jreback commented Apr 11, 2017

So we transform np.datetime64 -> 'datetime64[ns]' (well, we actually interpret it according to whatever freq it actually has). In this case it just says: make it the default datetime dtype, so this would be totally fine to do.

IOW, I would be fine accepting Series([], dtype=np.datetime64). Note that the logic is in pandas.types.cast.maybe_cast_to_datetime.

Though you have to see what this breaks. If it's just specific validation tests, then those can be changed.

@jorisvandenbossche
Member

I would actually also be fine with following numpy here, which means raising on dtype=np.datetime64. I don't see a strong reason to allow this.

@gfyoung
Member

gfyoung commented Apr 12, 2017

@jorisvandenbossche : What do you mean?

>>> np.array([], dtype=np.datetime64)
array([], dtype=datetime64)

@jreback
Contributor

jreback commented Apr 12, 2017

@gfyoung the bug here is this, right:

In [2]: pd.read_csv(StringIO('date'), parse_dates=['date']).dtypes
Out[2]: 
date    object
dtype: object

IOW we have an empty date column, which should be coerced to datetime64[ns]

This is similar to the following behavior. Where does np.datetime64 enter this?

In [4]: pd.to_datetime([])
Out[4]: DatetimeIndex([], dtype='datetime64[ns]', freq=None)

@jreback jreback added this to the Next Major Release milestone Apr 12, 2017
@jorisvandenbossche
Member

Sorry, I assumed it was the same as for a non-empty array. And in pandas, the non-empty cases both already raise:

In [11]: pd.Series([1], dtype=np.datetime64)
...
TypeError: cannot convert datetimelike to dtype [datetime64]

In [12]: pd.Series([1]).astype(np.datetime64)
...
SystemError: <built-in function cast_to_nanoseconds> returned a result with an error set

(although the second error message is not correct)

So I was mixing the empty and non-empty cases.

@gfyoung
Member

gfyoung commented Apr 12, 2017

@jreback : The np.datetime64 issue came about because it seemed pertinent to the original issue.

@jreback
Contributor

jreback commented Apr 12, 2017

@gfyoung yeah not sure where that exactly came up....

@gfyoung
Member

gfyoung commented Apr 12, 2017

@jorisvandenbossche : Regarding those examples you bring up above: should we follow in numpy's footsteps? We only use one frequency, so I don't see why those two examples shouldn't work.

gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 12, 2017
We only use the nanosecond frequency, so generic
timestamp frequencies should be interpreted with
the nanosecond frequency.

xref pandas-devgh-15524 (comment).
@jreback
Contributor

jreback commented Apr 13, 2017

@gfyoung so I don't see where this is passing an explicit np.datetime64 for construction (it could be buried somewhere, of course); if that IS the case, then I think this is wrong. Can you point to this?

@gfyoung
Member

gfyoung commented Apr 13, 2017

@jreback : So the original issue does not involve passing an explicit np.datetime64 (as later analysis shows, the reason it occurs is something completely separate).

gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 13, 2017
We only use the nanosecond frequency, and numpy
doesn't even handle generic timestamp dtypes well.

xref pandas-devgh-15524 (comment).
jreback pushed a commit that referenced this issue Apr 14, 2017
* DEPR: Deprecate generic timestamp dtypes

We only use the nanosecond frequency, and numpy
doesn't even handle generic timestamp dtypes well.

xref gh-15524 (comment).

* TST: Use pytest idioms in series/test_dtypes.py
@RileyLazarou

Not sure if this is the right place to put this, but I'm having the same issue with read_sql. On occasion, my query will return no results and a dataframe with no records. In processing that data, I'll do something like

df = pd.read_sql(query, con=con, parse_dates=["datetime_column"])  # returns empty dataframe
df["seconds_passed"] = (now - df["datetime_column"]).dt.total_seconds()

which results in

AttributeError: Can only use .dt accessor with datetimelike values
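A guarded version of the read_sql pattern above can be sketched as follows. The table and column names are hypothetical, sqlite3 stands in for the real database, and the to_datetime re-coercion is a workaround, not pandas' documented behavior:

```python
import sqlite3

import pandas as pd

# Hypothetical schema; an empty query result reproduces the problem.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (datetime_column TEXT)")

df = pd.read_sql("SELECT * FROM t", con, parse_dates=["datetime_column"])

# Re-coerce so the .dt accessor works even with zero rows.
df["datetime_column"] = pd.to_datetime(df["datetime_column"])

now = pd.Timestamp.now()
df["seconds_passed"] = (now - df["datetime_column"]).dt.total_seconds()
```

With the re-coercion in place, the subtraction yields an empty timedelta column instead of raising AttributeError.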

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
7 participants