ENH: Support parsing <Month Name> <Day number> e.g. Jan 1 in date utilities #11430

Closed
TomAugspurger opened this issue Oct 25, 2015 · 21 comments
Labels
Closing Candidate May be closeable, needs more eyeballs Datetime Datetime data dtype Enhancement

Comments

@TomAugspurger
Contributor

I assume that this was officially supported before. Haven't narrowed it down any more than sometime between 0.16.2 and 0.17.0.

In [1]: pd.__version__
Out[1]: '0.16.2'

In [2]: pd.date_range("Jan 1", "March 31", name="date")
Out[2]:
DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
               '2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',
               '2015-01-09', '2015-01-10', '2015-01-11', '2015-01-12',
...
In [1]: pd.__version__
Out[1]: '0.17.0'

In [2]: pd.date_range("Jan 1", "March 31", name="date")
---------------------------------------------------------------------------
OutOfBoundsDatetime                       Traceback (most recent call last)
<ipython-input-2-8eaca08051ac> in <module>()
----> 1 pd.date_range("Jan 1", "March 31", name="date")

/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas/tseries/index.py in date_range(start, end, periods, freq, tz, normalize, name, closed)
   1912     return DatetimeIndex(start=start, end=end, periods=periods,
   1913                          freq=freq, tz=tz, normalize=normalize, name=name,
-> 1914                          closed=closed)
   1915
   1916

/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas/util/decorators.py in wrapper(*args, **kwargs)
     87                 else:
     88                     kwargs[new_arg_name] = new_arg_value
---> 89             return func(*args, **kwargs)
     90         return wrapper
     91     return _deprecate_kwarg

/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas/tseries/index.py in __new__(cls, data, freq, start, end, periods, copy, name, tz, verify_integrity, normalize, closed, ambiguous, dtype, **kwargs)
    234             return cls._generate(start, end, periods, name, freq,
    235                                  tz=tz, normalize=normalize, closed=closed,
--> 236                                  ambiguous=ambiguous)
    237
    238         if not isinstance(data, (np.ndarray, Index, ABCSeries)):

/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas/tseries/index.py in _generate(cls, start, end, periods, name, offset, tz, normalize, ambiguous, closed)
    383
    384         if start is not None:
--> 385             start = Timestamp(start)
    386
    387         if end is not None:

pandas/tslib.pyx in pandas.tslib.Timestamp.__new__ (pandas/tslib.c:8967)()

pandas/tslib.pyx in pandas.tslib.convert_to_tsobject (pandas/tslib.c:22303)()

pandas/tslib.pyx in pandas.tslib.convert_str_to_tsobject (pandas/tslib.c:24364)()

pandas/tslib.pyx in pandas.tslib.convert_to_tsobject (pandas/tslib.c:23344)()

pandas/tslib.pyx in pandas.tslib._check_dts_bounds (pandas/tslib.c:26590)()

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 00:00:00
@TomAugspurger
Contributor Author

Never mind. Never documented.

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 25, 2015

@jreback commented in mwaskom/seaborn#702 (comment) that it was never meant to be supported.

@jorisvandenbossche
Member

Further, @jreback, I thought it was, maybe not a written rule, but still generally assumed that pandas falls back to dateutil.parser.parse if it can't parse the datetime itself. So in that sense, it is somewhat surprising to me that pd.to_datetime('Jan 1') no longer works.

@jreback
Contributor

jreback commented Oct 25, 2015

This has to do with the change that, IIRC, @sinhrks made (e.g. #7599) to make all parsing consistent.

I think this did work (though not officially: undocumented and untested). So we can probably support it, but there should be a real effort here.

@jreback jreback added Datetime Datetime data dtype API Design labels Oct 25, 2015
@jorisvandenbossche jorisvandenbossche changed the title REGR: date_range doesn't accept "Month Day" any more REGR: possible regressions in date parsing Oct 25, 2015
@jorisvandenbossche jorisvandenbossche changed the title REGR: possible regressions in date parsing REGR: possible regressions in date parsing (v0.17.0) Oct 25, 2015
@sinhrks
Member

sinhrks commented Oct 26, 2015

I think the 0.17 behavior is consistent, but allowing users to pass a default datetime (today's date in most cases) for dateutil parsing may be useful?

@jorisvandenbossche jorisvandenbossche added the Needs Discussion Requires discussion from core team before further action label Apr 21, 2017
@jorisvandenbossche
Member

Another related case (also no (full) date part provided in the string) from #16074

In [20]: pd.to_datetime("4pm")
...
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 16:00:00

In [21]: pd.to_datetime("16:00")
Out[21]: Timestamp('2017-04-21 16:00:00')

So depending on the format of the time string, this does or does not work; there is at least some inconsistency.

But I think we mainly have to decide what to do when part of the date (e.g. the year) or the full date is missing: do we fill it with 0001-01-01 (with the result that it raises an error) or with the current date?
In the first case, it is probably better to raise a custom error message instead of the OutOfBounds error.

@mikedeltalima

@jreback thanks for linking the duplicate. Would it be reasonable to use the dateutil parser and allow the user to pass a default datetime?

@jreback
Contributor

jreback commented Apr 27, 2017

@mikedeltalima you can do that as a user
pandas doesn't and shouldn't have an api for that
giving back current day as a non explicit operation seems magical

@mikedeltalima

@jreback I'm not sure I understand. You can specify the default in dateutil, but not pandas. Why shouldn't the user have that option? Also, dateutil chose the current date for their default default, but pandas could choose something else (Jan 1 of current year?).

@jreback
Contributor

jreback commented Apr 27, 2017

how is having a default date useful?
sure for creating a single date i suppose but not sure of the general utility

@mikedeltalima

Let's say I have a Series (a column in a DataFrame) that consists of dates like April 5, May 10 etc. None of them specify the year. If I want to capture that information, I need to convert the column using to_datetime, but datetimes need the year specified. Why force the user to implement a (costly?) transform after the fact?
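
One way to handle such a column without relying on any parser default is to make the year explicit before parsing. A minimal sketch, assuming every row belongs to the same known year (2017 is an arbitrary choice here) and uses the full English month name:

```python
import pandas as pd

# Year-less dates like the ones described above.
dates = pd.Series(["April 5", "May 10"])

# Assumption: all rows belong to one known year; 2017 is only an example.
year = 2017

# Append the year and parse with an explicit format, so nothing is guessed.
parsed = pd.to_datetime(dates + f" {year}", format="%B %d %Y")
print(parsed)
```

Because the format is spelled out, this stays vectorized and does not depend on how missing date parts are filled in.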

@jreback
Contributor

jreback commented Apr 27, 2017

still not sure what you mean
this is a context problem and pandas shouldn't be guessing what you mean

@jorisvandenbossche
Member

Just to be sure it is clear for everybody: dateutil by default fills in missing parts with the current date, and also has a keyword to change this behaviour:

In [8]: import datetime

In [9]: import dateutil.parser

In [10]: dateutil.parser.parse('16:00')
Out[10]: datetime.datetime(2017, 4, 27, 16, 0)

In [11]: dateutil.parser.parse('16:00', default=datetime.datetime(2000, 1, 1))
Out[11]: datetime.datetime(2000, 1, 1, 16, 0)

In [13]: dateutil.parser.parse('April 3')
Out[13]: datetime.datetime(2017, 4, 3, 0, 0)

In [16]: dateutil.parser.parse('April 3', default=datetime.datetime(2000, 1, 1))
Out[16]: datetime.datetime(2000, 4, 3, 0, 0)

Where it becomes stranger is when the current day of the month is filled in:

In [12]: dateutil.parser.parse('Aug 2016')
Out[12]: datetime.datetime(2016, 8, 27, 0, 0)

(today is the 27th of April)

So we can discuss whether we should follow that behaviour in pandas or not. In #7599 we decided to not follow that for at least the filling of the current day of the month (the last more strange example). The consequence is that we also do not follow the rule for filling the current year, at least for certain formats (this issue), a consequence which was not fully on purpose I think (or at least this is not discussed / tested in the original PR).

Filling with the current year, as dateutil does, has at least some use cases I think, but if you want this, you can always use the dateutil parser directly instead of to_datetime. (@mikedeltalima, that is a workaround you can use as well, instead of adapting the strings.)

@jreback wrote: "giving back current day as a non explicit operation seems magical"

Jeff, note that we currently actually still do that in specific cases, depending on the format of the string (see my example above #11430 (comment))

In any case, the current situation is also not ideal, as the result still depends on the format of the string, and I think the error message should not be an OutOfBounds error, but an error message indicating that the string could not be parsed because (part of) the date was missing.
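
To illustrate what such a clearer error could look like at the user level, here is a minimal sketch that detects a missing date part by parsing the string twice with two different dateutil defaults. The helper name parse_full_date is hypothetical; it is not part of pandas or dateutil, and this is not how pandas parses strings internally.

```python
import datetime

from dateutil import parser

# Two sentinel defaults that differ in year, month and day but share the same time.
_DEFAULT_A = datetime.datetime(2000, 1, 1)
_DEFAULT_B = datetime.datetime(2001, 2, 2)


def parse_full_date(text):
    """Parse text with dateutil, refusing strings whose date is incomplete.

    Hypothetical helper for illustration only.
    """
    first = parser.parse(text, default=_DEFAULT_A)
    second = parser.parse(text, default=_DEFAULT_B)
    if first != second:
        # The result depended on the default, so year, month or day was missing.
        raise ValueError(f"could not parse {text!r}: (part of) the date is missing")
    return first
```

With this sketch, strings such as 'Jan 1', '16:00' and 'Aug 2016' raise the ValueError, while a string with a complete date is returned as a datetime (a missing time still falls back to midnight).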

@mikedeltalima

@jorisvandenbossche thanks for the explanation! You've really improved the readability of the conversation :)

Could you elaborate on the workaround? This is what I could do, but I wonder if there are good reasons to use to_datetime instead if I can fix it.

>>> import datetime
>>> import dateutil.parser
>>> import pandas as pd
>>> pd.Series('april 5').apply(dateutil.parser.parse)
0   2017-04-05
dtype: datetime64[ns]
>>> pd.Series('april 5').apply(lambda x: dateutil.parser.parse(x, default=datetime.datetime(2017, 1, 1)))
0   2017-04-05
dtype: datetime64[ns]
>>> pd.Series('april 5').apply(lambda x: dateutil.parser.parse(x, default=datetime.datetime(1, 1, 1)))
0    0001-04-05 00:00:00
dtype: object

That said, @jreback I don't think my suggestion involves pandas guessing anything. It would simply allow the user to override a default (which pandas is already using). As you can see from the last two lines in my example above, there seems to be a bug that does not allow pandas to recognize some datetimes as the correct dtype. Looks like the cutoff is September 21, 1677. :)

>>> pd.Series(datetime.datetime(1677, 1, 1))
0    1677-01-01 00:00:00
dtype: object
>>> pd.Series(datetime.datetime(1678, 1, 1))
0   1678-01-01
dtype: datetime64[ns]
>>> pd.Series(datetime.datetime(1677, 9, 22))
0   1677-09-22
dtype: datetime64[ns]
>>> pd.Series(datetime.datetime(1677, 9, 21))
0    1677-09-21 00:00:00
dtype: object

@jorisvandenbossche
Member

Regarding the 1677, the reason for that is very simple, once you know it: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-timestamp-limits

When you try to parse a string outside of the supported range, you get an OutOfBounds error, but if you pass a datetime.datetime object outside that range, it gets stored as object dtype.
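
The limits behind that cutoff follow from storing timestamps as signed 64-bit counts of nanoseconds relative to 1970-01-01. A rough sketch using only the standard library, truncated to microsecond precision because datetime.datetime cannot hold nanoseconds:

```python
import datetime

# Largest signed 64-bit integer, interpreted as nanoseconds since the epoch.
INT64_MAX = 2**63 - 1
EPOCH = datetime.datetime(1970, 1, 1)

# datetime only has microsecond resolution, so drop the last three digits.
span = datetime.timedelta(microseconds=INT64_MAX // 1000)

lower = EPOCH - span  # falls on 1677-09-21, shortly after midnight
upper = EPOCH + span  # falls on 2262-04-11, shortly before midnight

print(lower, upper)
```

Midnight on 1677-09-21 lies just before the lower limit, which matches the object dtype seen for that date above, while 1677-09-22 is inside the range.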

@mikedeltalima

@jorisvandenbossche so proposed fix: pick a default that is within the bounds of the timestamp limits (Jan 1 of current year up to pd.Timestamp.max) and allow users to pass a default (raise exception if out of bounds).

@jreback
Contributor

jreback commented Apr 27, 2017

@mikedeltalima

so you want to add an argument to the constructor of Timestamp and DatetimeIndex of default=None.

so

Timestamp('4 pm', default=Timestamp(2000,1,1)) or whatever ?

I suppose we could implement the logic ourselves, or simply pass through to dateutil (which makes it really slow, though, as it's pure Python), but then again this is a convenience feature.

@mikedeltalima

mikedeltalima commented Apr 27, 2017

@jreback is that necessary? I was only thinking of adding the argument to the to_datetime function. From what I can tell, the function is already implementing a default, it's just hardcoded as _DEFAULT_DATETIME (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/tslib.pyx#L2171), so I don't think this would slow things down.

@jreback
Contributor

jreback commented Apr 28, 2017

@mikedeltalima wrote: "so I don't think this would slow things down."

By definition, when things get down to using dateutil, they have already slowed down. This is a last-ditch effort (and rarely happens, actually).

so making this more explicit with a default value makes sense.
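
As a user-level illustration of the explicit default discussed here, the following is a minimal sketch of a wrapper that forwards a caller-supplied default to dateutil. The name to_datetime_with_default is hypothetical: pandas does not provide such a function or a default argument, and the element-wise dateutil call is slow for the reason given above.

```python
import datetime

import pandas as pd
from dateutil import parser


def to_datetime_with_default(values, default):
    """Parse strings one by one with dateutil, filling gaps from `default`.

    Hypothetical helper for illustration; not a pandas API.
    """
    # dateutil replaces only the fields that are missing from each string.
    parsed = [parser.parse(value, default=default) for value in values]
    return pd.to_datetime(pd.Series(parsed))


result = to_datetime_with_default(
    ["April 5", "4pm"], default=datetime.datetime(2000, 1, 1)
)
print(result)
```

The caller states the default explicitly, so no current date is filled in implicitly.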

@mroeschke mroeschke changed the title REGR: possible regressions in date parsing (v0.17.0) ENH: Support parsing <Month Name> <Day number> e.g. Jan 1 in date utilities Mar 31, 2020
@mroeschke mroeschke added Enhancement and removed Needs Discussion Requires discussion from core team before further action labels Mar 31, 2020
@MarcoGorelli
Member

strong -1 on adding extra arguments to datetime parsing, it's fine for this to error

@MarcoGorelli MarcoGorelli added the Closing Candidate May be closeable, needs more eyeballs label Mar 30, 2023
@mroeschke
Member

Agreed here. I think whatever dateutil can parse as a string is sufficient at this point so closing


7 participants