datetools.parse interface #5886

Closed
3 tasks
jseabold opened this issue Jan 8, 2014 · 22 comments
Labels
Datetime Datetime data dtype Docs Period Period data type

Comments

@jseabold
Contributor

jseabold commented Jan 8, 2014

  • document in docstring / cookbook / timeseries.rst usage of combining integer columns into YYYYMMDD and parsing to datetimes
  • clean up imports of _parse/parse from dateutil
  • to_period to create PeriodIndex

I started to make a PR #5885 to fix what I thought was a typo before realizing that this was intentional. It still doesn't make much sense to me though why I would want this return of datetime, _result, resolution. Maybe the whole approach could use a refactor. Otherwise what am I missing?

My typical use case for datetools.parse is something like

dates = map(lambda x : parse(' '.join(x)), zip(df.day, df.month, df.year))

A couple of questions.

  1. Is there a better way to do this vectorized to datetime operation? AFAICT pd.to_datetime doesn't actually use the 'advanced' parsing for quarterly and monthly dates.
  2. Should this all be unified? Assuming I haven't missed it, should there be, e.g., a function pd.parse_dates that is a general parser for strings and also works on array-like input, deprecating datetools.parse, datetools.parse_time_string, and datetools.to_datetime? This function could also have a flag to return Period or Timestamp objects with frequency information instead of the current return of the parsed object and resolution. I ask because I'm having to do things like

dates = [x[0] for x in map(lambda x : parse(' '.join(x)), zip(df.day, df.month, df.year))]

Thoughts?

@jreback
Contributor

jreback commented Jan 8, 2014

If you already have integer day, month, year

just do this, and make sure that you specify format '%Y%m%d'; it's specially optimized
to handle this (whether it's an integer or a string)

In [14]: i = pd.date_range('20000101',periods=10000)

In [15]: df = pd.DataFrame(dict(year = i.year, month = i.month, day = i.day))

In [17]: %timeit pd.to_datetime(df.year*10000+df.month*100+df.day,format='%Y%m%d')
100 loops, best of 3: 8.4 ms per loop

Parsing the actual string

In [18]: ds = df.apply(lambda x: "%04d%02d%02d" % (x.year,x.month,x.day),axis=1)

In [21]: %timeit pd.to_datetime(ds)
1 loops, best of 3: 519 ms per loop

pd.to_datetime IS the general-purpose parser (and will fall back to datetools for dates it doesn't have
optimized parsing code for); you can always pass a strptime format
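
For readers who land here from the docs, the integer-column approach above can be sketched as follows. This is a minimal illustration, not the canonical recipe: the small frame and its column names are made up, and the second variant relies on the DataFrame-assembly behaviour of pd.to_datetime, which was added to pandas after this comment was written.

```python
import pandas as pd

# Illustrative data; the column names are an assumption for this sketch.
df = pd.DataFrame(
    {"year": [2000, 2000, 2001], "month": [1, 12, 2], "day": [1, 31, 28]}
)

# Combine the integer columns into YYYYMMDD and parse with an explicit format.
as_int = df["year"] * 10000 + df["month"] * 100 + df["day"]
dates = pd.to_datetime(as_int, format="%Y%m%d")

# Later pandas versions can also assemble dates directly from
# year/month/day columns of a DataFrame.
assembled = pd.to_datetime(df[["year", "month", "day"]])

print(dates.tolist())
```

Both variants should yield the same timestamps; the explicit format keeps the parse from falling back to slower per-element inference.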

@jseabold
Contributor Author

jseabold commented Jan 8, 2014

A couple of comments. This is not completely obvious. I'll see about adding this and similar examples to the docstring. In general, more examples in docstrings with common patterns would be welcome.

to_datetime is not general enough because it doesn't subsume the abilities of pd.parse_time_string. It only handles things that can be expressed in a strftime format. For example, it is very common to have dates in this format in economics.

pd.to_datetime(["1980m1", "1980m2"])

It'd be nice to have a function that handles this and dates handled by dateutil.parse.

And this returns an array for some reason, which I find to be odd. I see the box keyword, but since your example returns a Series, not a DatetimeIndex as indicated, wouldn't it make sense for everything just to return a Series?

Shouldn't to_datetime have parse in the name? This is what I tab-complete on. I know I want to "parse" the dates, I guess I wanted to parse them "to_datetime", but the former seems the obvious name in the namespace (to me). pd.parse_to_datetime?

@jreback
Contributor

jreback commented Jan 8, 2014

not sure where the name came from.

I agree creating a date from columns is not so obvious (a little more obvious if you are reading it in with read_csv).

would appreciate the combining example as a docstring / doc example / cookbook entry - I thought about how to make it 'automatically' do it but don't want to change the API...if you think of some way, great! (could have a convenience function, but not big on that)

These will return a Series if you pass a Series/array. The boxing is internally used by the DatetimeIndex parser (which just calls this); you can use it if you want an Index instead. It also returns a scalar Timestamp if only 1 value is passed as a scalar.
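
A quick way to see the return types described above. This is a sketch against a recent pandas release, where (as far as I know) a plain list comes back as a DatetimeIndex rather than a Series; the input strings are illustrative.

```python
import pandas as pd

strings = ["2014-01-08", "2014-01-09"]

from_series = pd.to_datetime(pd.Series(strings))  # Series in, Series out
from_list = pd.to_datetime(strings)               # list in, DatetimeIndex out
from_scalar = pd.to_datetime("2014-01-08")        # scalar in, Timestamp out

# Wrap the Series explicitly when an Index is wanted instead.
as_index = pd.DatetimeIndex(from_series)

for result in (from_series, from_list, from_scalar, as_index):
    print(type(result).__name__)
```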

I could see the format argument taking a function for parsing. How would you have your example parsed? (prob as a PeriodIndex? maybe we should have a to_period, and in 0.13 we have to_timedelta)

I don't think we can change to_datetime, but could alias it to parse_to_datetime, no biggie on that

@jseabold
Contributor Author

jseabold commented Jan 8, 2014

Also maybe import parse from dateutil as _parse to discourage its use. I had no idea it wasn't really intended to be part of the public API.

@jreback
Contributor

jreback commented Jan 8, 2014

I think the import is a bit 'confused' as it's in several places (some as _parse, some as parse). I believe the point was to allow it as a convenience in parsing dates with read_csv, but that has evolved to not be that necessary.

@jreback
Contributor

jreback commented Jan 8, 2014

ok...i'll create a todo list at the top of this PR then

@jseabold
Contributor Author

jseabold commented Jan 8, 2014

My example would be parsed like

map(pd.datetools.parse_time_string, ["1980q1","1980q2"])

Though apparently 1980m12 is recognized as minute not month. I'd have m\d+ and q\d+ default to parsing as month and quarter, though I see that this has time_string in the title so probably not appropriate here. Odd that it tries to parse quarters here.

These are not unheard-of formats; they are commonly handled by econometric/statistical software.

[~/]
[7]: map(sm.tsa.datetools.date_parser, ["1998m1", "1998m2"])
[7]: [datetime.datetime(1998, 1, 31, 0, 0), datetime.datetime(1998, 2, 28, 0, 0)]

[~/]
[8]: map(sm.tsa.datetools.date_parser, ["1998QI", "1998QII"])
[8]: [datetime.datetime(1998, 3, 31, 0, 0), datetime.datetime(1998, 6, 30, 0, 0)]

[~/]
[9]: map(sm.tsa.datetools.date_parser, ["1998mX", "1998mXI"])
[9]: [datetime.datetime(1998, 10, 31, 0, 0), datetime.datetime(1998, 11, 30, 0, 0)]
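
To make the m\d+ / q\d+ idea concrete, here is a rough sketch of a parser that maps such strings to pandas Period objects. The function name parse_econ_period is hypothetical and not part of pandas or statsmodels, and Roman-numeral forms such as "1998QII" are deliberately left out to keep it short.

```python
import re

import pandas as pd

# Four-digit year, then 'm' or 'q', then a one- or two-digit number.
_ECON_PATTERN = re.compile(r"^(\d{4})([mq])(\d{1,2})$", re.IGNORECASE)


def parse_econ_period(text):
    """Hypothetical helper: '1998m1' -> monthly Period, '1998q2' -> quarterly Period."""
    match = _ECON_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized period string: {text!r}")
    year = int(match.group(1))
    kind = match.group(2).lower()
    number = int(match.group(3))
    if kind == "m":
        if not 1 <= number <= 12:
            raise ValueError(f"month out of range in {text!r}")
        return pd.Period(f"{year}-{number:02d}", freq="M")
    if not 1 <= number <= 4:
        raise ValueError(f"quarter out of range in {text!r}")
    return pd.Period(f"{year}Q{number}", freq="Q")


periods = [parse_econ_period(s) for s in ["1998m1", "1980m12", "1998q2"]]
print(periods)
```

Returning Period objects keeps the frequency attached to each value, so "1980m12" is read as December rather than as a minute.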

@jseabold
Contributor Author

jseabold commented Jan 8, 2014

I don't much care if I get Periods or Timestamps, etc. As far as I'm concerned (as a user) they're pretty much the same. As a developer, I've written most software to be agnostic about what it gets as long as it can infer a frequency.

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 6, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@jreback
Contributor

jreback commented May 29, 2015

mentioned in SO

@gfyoung
Member

gfyoung commented Aug 28, 2016

@jreback : Is this issue even relevant anymore? IINM, pd.datetools.parse doesn't exist.

@jreback
Contributor

jreback commented Aug 28, 2016

look at the imports they are * so yes

@gfyoung
Member

gfyoung commented Aug 28, 2016

@jreback : I don't understand what you mean by that.

@jreback
Contributor

jreback commented Aug 28, 2016

do a dir() on the namespace

@gfyoung
Member

gfyoung commented Aug 28, 2016

Again, do not follow. Just try the following:

>>> import pandas as pd
>>> pd.datetools.parse
...
AttributeError: module 'pandas.core.datetools' has no attribute 'parse'

@jreback
Contributor

jreback commented Aug 28, 2016

dir()

@gfyoung
Member

gfyoung commented Aug 28, 2016

Again, do not follow. Just try the following:

>>> import pandas as pd
>>> 'parse' in dir(pd)
False
>>> 'parse' in dir(pd.datetools)
False

I'm really not understanding what you're saying here.

@jreback
Contributor

jreback commented Aug 28, 2016

I guess it's gone

all of these issues are either fixed then or elsewhere (to_period needs a standalone issue)

@jreback jreback closed this as completed Aug 28, 2016
@jreback
Contributor

jreback commented Aug 28, 2016

@jreback
Contributor

jreback commented Aug 28, 2016

@gfyoung if you would create an issue for to_period, that would be great
cc @sinhrks

@gfyoung
Member

gfyoung commented Aug 28, 2016

to_period for DatetimeIndex I presume?

@jreback
Contributor

jreback commented Aug 28, 2016

no it's like to_datetime but creates PeriodIndexes
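
Roughly what such a function might look like, as a sketch only: the standalone to_period below is hypothetical (no such top-level pandas function is assumed to exist), while pd.PeriodIndex and DatetimeIndex.to_period are existing pandas APIs.

```python
import pandas as pd


def to_period(values, freq):
    """Hypothetical to_datetime-like helper that returns a PeriodIndex."""
    return pd.PeriodIndex(values, freq=freq)


quarters = to_period(["1980Q1", "1980Q2"], freq="Q")
months = to_period(["1980-01", "1980-02"], freq="M")

# For data that is already datetime-like, the existing method does the job.
from_dates = pd.to_datetime(["1980-01-15", "1980-02-15"]).to_period("M")

print(quarters)
print(months)
```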

@gfyoung
Member

gfyoung commented Aug 28, 2016

Got it, done: #14108

@jorisvandenbossche jorisvandenbossche modified the milestones: No action, Next Major Release Aug 29, 2016

4 participants