BUG: read_csv returns object dtype for dates in empty frame #15524
Comments
Datetime parsing is pretty soft, meaning unless it can unequivocally convert something it won't do it. You can certainly force this after the fact via I am not sure that I would coerce these empty columns like this (even though we certainly can, at least for non-tz aware, which won't work in your example at all. The dtype is not-defined). This can easily lead to mistakes, not to mention that thoughts |
@adbull can you show a situation when this actually matters / makes a difference? Why are you trying to do this? For example, concatting an empty frame with one that is correctly dtyped will work, so this doesn't practically make a difference. |
Well, if I call I'm not sure I follow you on the second paragraph? What won't work in my example? Concatting an empty frame of dtype Anyway, potential use-cases are (a) concatting together multiple csvs, some of which may be empty; (b) regularly loading a csv which is frequently updated, and may or may not be empty. In (a), I would have to assign timezones after concat to stop it throwing an error; in (b), I would have to specifically check whether the frame is empty before attempting any timezone manipulation. The workarounds aren't too hard, it just seems a bit awkward when the simple version would work fine for any other dtype. |
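The workaround mentioned in this comment can be sketched roughly as follows. This is a minimal illustration rather than code from the thread: the helper name `read_csv_dates` and the column names are hypothetical, and it assumes an explicit `pd.to_datetime` call after reading is acceptable for the use case.

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def read_csv_dates(source, date_cols):
    """Read a CSV and ensure the date columns are datetime typed, even when empty.

    Hypothetical helper for illustration; not part of pandas.
    """
    df = pd.read_csv(source, parse_dates=date_cols)
    for col in date_cols:
        # Harmless if the column was already parsed as datetimes;
        # converts an empty object column to a datetime dtype.
        df[col] = pd.to_datetime(df[col])
    return df


# A CSV with a header row but no data rows.
empty = read_csv_dates(io.StringIO("when,value\n"), ["when"])

# Timezone handling no longer needs a special case for the empty frame.
localized = empty["when"].dt.tz_localize("UTC")
```

With a helper like this, use cases (a) and (b) would not need an emptiness check before timezone manipulation, at the cost of one extra conversion pass per date column.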
@jreback : I'm torn on this. On the one hand, I agree with you that there is not a lot of value (from a functionality perspective) in supporting this behavior BUT that being said, @adbull does have a point about consistency. I turn your attention to some other inconsistencies related to this:
>>> Series([]).astype(np.datetime64)
Series([], dtype: datetime64[ns])
>>>
>>> Series([], dtype=np.datetime64)
...
TypeError: cannot convert datetimelike to dtype [datetime64] |
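For comparison, a small sketch of the same two construction paths with the unit spelled out explicitly. This is an illustration under the assumption that naming `datetime64[ns]` (rather than the generic `np.datetime64`) is acceptable; it is not taken from the thread.

```python
import numpy as np
import pandas as pd

# Passing the unit-qualified dtype directly to the constructor.
explicit = pd.Series([], dtype="datetime64[ns]")

# Building an empty object Series first, then casting.
cast = pd.Series([], dtype=object).astype("datetime64[ns]")
```

Both paths should produce an empty Series with the same nanosecond dtype, which sidesteps the question of how a generic timestamp dtype ought to be interpreted.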
is like what we are discussing in #15859. I agree that should work. As an aside, I think we need to systematically test all |
Okay, sounds good. I suspect that what I presented here is probably related to #15859 initially and will trickle down to patching this issue here. |
@jreback : So the Exception raised traces to Thus, we would need to soften the restrictions on casting to |
Also tracked down the cause of the original bug. Unfortunately, patching this is not so straightforward because there is a lot of logic surrounding what columns get parsed depending on how you specify So yes, we can easily patch by replicating the @jreback , @jorisvandenbossche : Thoughts? |
@gfyoung I think the date parsing needs to be factored out so that it is generically applicable to both engines :> It's a project, but it will make this much more generic (and not sacrifice any perf). Then this is easy. |
@gfyoung so the question is if we generally interpret |
@jorisvandenbossche : I think the inconsistency is pretty clearly illustrated in my above comment here. It should be one way or the other (raise or interpret as |
so we transform
though you have to see what this does break. If it's just specific validation tests, then we can change those. |
I would actually also be fine with following numpy here, which means raising on |
@jorisvandenbossche : What do you mean?
>>> np.array([], dtype=np.datetime64)
array([], dtype=datetime64) |
@gfyoung the bug here is this, right:
IOW we have an empty date column, which should be coerced to This is similar to this behavior. where does
|
Sorry, I assumed it was the same as on non-empty array. And in pandas non-empty both already raise:
(although the second error message is not a correct one) So I was mixing the empty and non-empty cases. |
@jreback : The |
@gfyoung yeah not sure where that exactly came up.... |
@jorisvandenbossche : Those examples you bring up above, should we follow in |
We only use the nanosecond frequency, so generic timestamp frequencies should be interpreted with the nanosecond frequency. xref pandas-devgh-15524 (comment).
@gfyoung so I don't see where this is passing an explicit |
@jreback : So the original issue does not involve passing an explicit |
We only use the nanosecond frequency, and numpy doesn't even handle generic timestamp dtypes well. xref pandas-devgh-15524 (comment).
* DEPR: Deprecate generic timestamp dtypes
  We only use the nanosecond frequency, and numpy doesn't even handle generic timestamp dtypes well. xref gh-15524 (comment).
* TST: Use pytest idioms in series/test_dtypes.py
Not sure if this is the right place to put this, but I'm having the same issue with
which results in
|
Code Sample, a copy-pastable example if possible
Problem description
When reading CSVs with no data rows, read_csv() returns the dtype object for dates, which can raise errors on later manipulation. This is contrary to the general behaviour of read_csv(), which otherwise correctly sets dtypes for empty frames when those dtypes are explicitly passed.

I don't think it would be hard to return the correct dtype here? If date_parser is not set, we know the dtype is datetime64[ns]; otherwise, we can call the parser with empty data, and use the returned dtype.

Note that e.g. read_csv(..., dtype='datetime64[ns]') is not a solution, as this throws an error when the csv is non-empty.

Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.9.1
IPython: 4.2.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8.1
boto: 2.45.0
pandas_datareader: None