
BUG: read_csv returns object dtype for dates in empty frame #15524

Open
adbull opened this issue Feb 27, 2017 · 24 comments
Labels
Bug · Datetime (Datetime data) · dtype · IO CSV (read_csv, to_csv)

Comments

@adbull
Contributor

adbull commented Feb 27, 2017

Code Sample, a copy-pastable example if possible

>>> import io
>>> import pandas as pd

>>> csv = lambda: io.StringIO('date')
>>> df = pd.read_csv(csv(), dtype=int)
>>> df.dtypes

date    int64
dtype: object

>>> df = pd.read_csv(csv(), parse_dates=['date'])
>>> df.dtypes

date      object
dtype: object

>>> df.date.dt.tz_localize('UTC')

Traceback (most recent call last):
  File "<ipython-input-49-067c7ce27232>", line 1, in <module>
    df.date = df.date.dt.tz_localize('UTC')
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/generic.py", line 2740, in __getattr__
    return object.__getattribute__(self, name)
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/base.py", line 241, in __get__
    return self.construct_accessor(instance)
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/series.py", line 2743, in _make_dt_accessor
    raise AttributeError("Can only use .dt accessor with datetimelike "
AttributeError: Can only use .dt accessor with datetimelike values

>>> parser = lambda x: pd.to_datetime(x, utc=True)
>>> df = pd.read_csv(csv(), parse_dates=['date'], date_parser=parser)
>>> df.dtypes

date      object
dtype: object

>>> df.date.dt.tz_convert('EST')

Traceback (most recent call last):
  File "<ipython-input-129-3d458af8513c>", line 1, in <module>
    df.date.dt.tz_convert('EST')
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/generic.py", line 2740, in __getattr__
    return object.__getattribute__(self, name)
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/base.py", line 241, in __get__
    return self.construct_accessor(instance)
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/series.py", line 2743, in _make_dt_accessor
    raise AttributeError("Can only use .dt accessor with datetimelike "
AttributeError: Can only use .dt accessor with datetimelike values

Problem description

When reading CSVs with no data rows, read_csv() returns object dtype for columns listed in parse_dates, which can raise errors during later manipulation. This is contrary to the general behaviour of read_csv(), which otherwise correctly sets dtypes for empty frames when those dtypes are explicitly passed.

I don't think it would be hard to return the correct dtype here. If date_parser is not set, we know the dtype is datetime64[ns]; otherwise, we can call the parser on empty data and use the returned dtype.

Note that e.g. read_csv(..., dtype='datetime64[ns]') is not a solution, as this throws an error when the csv is non-empty.
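As a stopgap, one workaround is to re-coerce the parsed columns with pd.to_datetime after reading. This is a sketch, not pandas' own fix; the helper name read_csv_with_dates is made up for illustration:

```python
import io

import pandas as pd

# Hedged workaround sketch: re-coercing with to_datetime is a no-op
# on a correctly parsed column, but fixes the object dtype that an
# empty frame would otherwise keep.
def read_csv_with_dates(buf, date_cols):
    df = pd.read_csv(buf, parse_dates=list(date_cols))
    for col in date_cols:
        df[col] = pd.to_datetime(df[col])
    return df

empty = read_csv_with_dates(io.StringIO("date"), ["date"])
print(empty.dtypes["date"])  # datetime64[ns]
```

The same helper works unchanged on non-empty input, since to_datetime leaves an already-datetime column alone.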

Expected Output

date    int64
dtype: object

date      datetime64[ns]
dtype: object

date      datetime64[ns, UTC]
dtype: object

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.9.1
IPython: 4.2.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8.1
boto: 2.45.0
pandas_datareader: None

@jreback
Contributor

jreback commented Feb 27, 2017

Datetime parsing is pretty soft, meaning that unless pandas can unequivocally convert something, it won't. You can certainly force the conversion after the fact via .to_datetime(...). The default dtype when something is not converted is simply object. This is the safest way to do things.

I am not sure that I would coerce these empty columns like this (even though we certainly can, at least for the non-tz-aware case; it won't work for your tz-aware example at all, since that dtype is not defined up front). This can easily lead to mistakes, not to mention that parse_dates=True will try to soft-coerce lots of things (though I suppose that could ignore a forced conversion).

Thoughts?

@jorisvandenbossche
cc @gfyoung

@jreback jreback added IO CSV read_csv, to_csv Datetime Datetime data dtype labels Feb 27, 2017
@jreback
Contributor

jreback commented Feb 27, 2017

@adbull can you show a situation where this actually matters / makes a difference? Why are you trying to do this? For example, concatting an empty frame with one that is correctly dtyped will work, so this doesn't practically make a difference.

@adbull
Contributor Author

adbull commented Feb 27, 2017

Well, if I call read_csv with a specific column in parse_dates, I definitely want the column to be a datetime, and since the column is empty, it will unequivocally succeed in that conversion -- it's not like it can contain a parse error.

I'm not sure I follow you on the second paragraph? What won't work in my example?

Concatting an empty frame of dtype object to a non-empty one of dtype datetime64[ns, UTC] throws an error; is that not intended?

Anyway, potential use-cases are (a) concatting together multiple csvs, some of which may be empty; (b) regularly loading a csv which is frequently updated, and may or may not be empty. In (a), I would have to assign timezones after concat to stop it throwing an error; in (b), I would have to specifically check whether the frame is empty before attempting any timezone manipulation.

The workarounds aren't too hard, it just seems a bit awkward when the simple version would work fine for any other dtype.
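For use case (b), the empty-frame check described above might look like the following sketch (the load helper and the UTC localization are illustrative assumptions, not part of the issue):

```python
import io

import pandas as pd

def load(buf):
    # Workaround sketch for use case (b): force the dtype when the
    # frame is empty so timezone manipulation works either way.
    df = pd.read_csv(buf, parse_dates=["date"])
    if df.empty:
        df["date"] = pd.to_datetime(df["date"])
    df["date"] = df["date"].dt.tz_localize("UTC")
    return df
```

Without the df.empty branch, the tz_localize call raises AttributeError on an empty file, because the column comes back as object dtype.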

@gfyoung
Member

gfyoung commented Feb 28, 2017

@jreback : I'm torn on this. On the one hand, I agree with you that there is not a lot of value (from a functionality perspective) in supporting this behavior BUT that being said, @adbull does have a point about consistency. I turn your attention to some other inconsistencies related to this:

>>> Series([]).astype(np.datetime64)
Series([], dtype: datetime64[ns])
>>>
>>> Series([], dtype=np.datetime64)
...
TypeError: cannot convert datetimelike to dtype [datetime64]

@gfyoung
Member

gfyoung commented Apr 3, 2017

@jreback : Thoughts on what I said above?

@jreback
Contributor

jreback commented Apr 3, 2017

Series([]).astype(np.datetime64)
Series([], dtype: datetime64[ns])

Series([], dtype=np.datetime64)
...
TypeError: cannot convert datetimelike to dtype [datetime64]

This is like what we are discussing in #15859.

I agree that should work. As an aside, I think we need to systematically test all .astype vs. direct construction on empties (we can create another issue if you want). I mainly want the tests, so if you can add them that would be great, just xfailing the cases that don't work.
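Outside pandas' own test suite (and without the xfail machinery), the systematic check suggested here might be sketched as:

```python
import numpy as np

import pandas as pd

# Sketch of the suggested check: on empty data, .astype and direct
# construction should agree about the resulting dtype.
for dtype in (np.int64, np.float64, "datetime64[ns]"):
    via_astype = pd.Series([], dtype=object).astype(dtype)
    direct = pd.Series([], dtype=dtype)
    assert via_astype.dtype == direct.dtype, dtype
```

In the real test suite this would be parametrized, with the known-broken combinations (such as the generic np.datetime64 case above) marked xfail.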

@gfyoung
Member

gfyoung commented Apr 3, 2017

Okay, sounds good. I suspect that what I presented here is related to #15859, and that fixing it will trickle down to patching this issue.

@gfyoung
Member

gfyoung commented Apr 10, 2017

@jreback : So the exception raised traces back to maybe_cast_to_datetime, which checks that the specified dtype exactly matches datetime64[ns]; that is NOT the same as np.datetime64 (the latter dtype has no frequency).

Thus, we would need to soften the restrictions on casting to datetime64 (similarly for timedelta64). This is an API change. Thoughts?
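The dtype mismatch described here is visible directly in numpy (a minimal illustration, separate from pandas internals):

```python
import numpy as np

# The generic datetime64 dtype carries no unit, so it does not
# compare equal to the nanosecond-resolution dtype pandas checks for.
generic = np.dtype(np.datetime64)   # unit-less datetime64
nanos = np.dtype("datetime64[ns]")  # nanosecond resolution
print(generic == nanos)  # False
```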

@gfyoung
Member

gfyoung commented Apr 10, 2017

Also tracked down the cause of the original bug: _get_empty_meta, which is called when there is no data to parse, has no handling for parse_dates, though it does for dtype.

Unfortunately, patching this is not so straightforward, because there is a lot of logic surrounding which columns get parsed depending on how you specify parse_dates, and replicating it would feel very duplicative, since we already have it for the C and Python engines in two different implementations (one in Cython and the other in pure Python).

So yes, we could easily patch this by replicating the parse_dates logic, but that does not seem optimal...

@jreback , @jorisvandenbossche : Thoughts?

@jreback
Contributor

jreback commented Apr 11, 2017

@gfyoung I think the date parsing needs to be factored out so it is generically applicable to both engines :> It's a project, but it will make this much more generic (without sacrificing any perf). Then this fix is easy.

@gfyoung
Member

gfyoung commented Apr 11, 2017

@jreback : Any thoughts about my questions above regarding Series ?

@jorisvandenbossche
Member

@gfyoung so the question is whether we generally interpret np.datetime64 (if passed by the user) as 'datetime64[ns]', since 'ns' is the default resolution?
And currently how this is handled differs between functions?

@gfyoung
Member

gfyoung commented Apr 11, 2017

@jorisvandenbossche : I think the inconsistency is pretty clearly illustrated in my above comment here. It should be one way or the other (raise or interpret as datetime64[ns]).

@jreback
Contributor

jreback commented Apr 11, 2017

So we transform np.datetime64 -> 'datetime64[ns]' (well, we actually interpret it according to whatever freq it actually has). In this case it just says: make it the default datetime dtype, so this would be totally fine to do.

IOW, I would be fine accepting Series([], dtype=np.datetime64). Note that the logic is in pandas.types.cast.maybe_cast_to_datetime.

Though you have to see what this breaks. If it's just specific validation tests, then those can be changed.

@jorisvandenbossche
Member

I would actually also be fine with following numpy here, which means raising on dtype=np.datetime64. I don't see a strong reason to allow this.

@gfyoung
Member

gfyoung commented Apr 12, 2017

@jorisvandenbossche : What do you mean?

>>> np.array([], dtype=np.datetime64)
array([], dtype=datetime64)

@jreback
Contributor

jreback commented Apr 12, 2017

@gfyoung the bug here is this, right:

In [2]: pd.read_csv(StringIO('date'), parse_dates=['date']).dtypes
Out[2]: 
date    object
dtype: object

IOW we have an empty date column, which should be coerced to datetime64[ns]

This is similar to the following behavior. Where does np.datetime64 enter this?

In [4]: pd.to_datetime([])
Out[4]: DatetimeIndex([], dtype='datetime64[ns]', freq=None)

@jreback jreback added this to the Next Major Release milestone Apr 12, 2017
@jorisvandenbossche
Member

Sorry, I assumed it was the same as for a non-empty array. And in pandas, the non-empty cases both already raise:

In [11]: pd.Series([1], dtype=np.datetime64)
...
TypeError: cannot convert datetimelike to dtype [datetime64]

In [12]: pd.Series([1]).astype(np.datetime64)
...
SystemError: <built-in function cast_to_nanoseconds> returned a result with an error set

(although the second error message is not correct)

So I was mixing the empty and non-empty cases.

@gfyoung
Member

gfyoung commented Apr 12, 2017

@jreback : The np.datetime64 issue came about because it seemed pertinent to the original issue.

@jreback
Contributor

jreback commented Apr 12, 2017

@gfyoung yeah not sure where that exactly came up....

@gfyoung
Member

gfyoung commented Apr 12, 2017

@jorisvandenbossche : Regarding those examples you bring up above: should we follow in numpy's footsteps? We only use one frequency, so I don't see why those two examples shouldn't work.

gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 12, 2017
We only use the nanosecond frequency, so generic
timestamp frequencies should be interpreted with
the nanosecond frequency.

xref pandas-devgh-15524 (comment).
@jreback
Contributor

jreback commented Apr 13, 2017

@gfyoung so I don't see where this is passing an explicit np.datetime64 for construction (it could be buried somewhere, of course); if that IS the case, then I think this is wrong. Can you point to this?

@gfyoung
Member

gfyoung commented Apr 13, 2017

@jreback : So the original issue does not involve passing an explicit np.datetime64 (as later analysis shows, the reason it occurs is something completely separate).

gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 13, 2017
We only use the nanosecond frequency, and numpy
doesn't even handle generic timestamp dtypes well.

xref pandas-devgh-15524 (comment).
jreback pushed a commit that referenced this issue Apr 14, 2017
* DEPR: Deprecate generic timestamp dtypes

We only use the nanosecond frequency, and numpy
doesn't even handle generic timestamp dtypes well.

xref gh-15524 (comment).

* TST: Use pytest idioms in series/test_dtypes.py
@RileyLazarou

Not sure if this is the right place to put this, but I'm having the same issue with read_sql. On occasion, my query will return no results and a dataframe with no records. In processing that data, I'll do something like

df = pd.read_sql(query, con=con, parse_dates=["datetime_column"])  # returns empty dataframe
df["seconds_passed"] = (now - df["datetime_column"]).dt.total_seconds()

which results in

AttributeError: Can only use .dt accessor with datetimelike values
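A guarded version of the read_sql pattern above can be sketched as follows. The table and column names are hypothetical, sqlite3 stands in for the real database, and the to_datetime re-coercion is a workaround, not pandas' documented behavior:

```python
import sqlite3

import pandas as pd

# Hypothetical schema; an empty query result reproduces the problem.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (datetime_column TEXT)")

df = pd.read_sql("SELECT * FROM t", con, parse_dates=["datetime_column"])

# Re-coerce so the .dt accessor works even with zero rows.
df["datetime_column"] = pd.to_datetime(df["datetime_column"])

now = pd.Timestamp.now()
df["seconds_passed"] = (now - df["datetime_column"]).dt.total_seconds()
```

With the re-coercion in place, the subtraction yields an empty timedelta column instead of raising AttributeError.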

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
7 participants