Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added ISO 8601 Duration string constructor for Timedelta #19065

Merged
merged 2 commits into from
Jan 5, 2018

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Jan 3, 2018

ASV results below

· Running 16 total benchmarks (2 commits * 1 environments * 8 benchmarks)
[  0.00%] · For pandas commit hash 6eda5188:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  6.25%] ··· Running timedelta.TimedeltaConstructor.time_from_components                                                               14.6±0.2μs
[ 12.50%] ··· Running timedelta.TimedeltaConstructor.time_from_datetime_timedelta                                                       7.17±0.1μs
[ 18.75%] ··· Running timedelta.TimedeltaConstructor.time_from_int                                                                     5.86±0.08μs
[ 25.00%] ··· Running timedelta.TimedeltaConstructor.time_from_iso_format                                                               20.7±0.4μs
[ 31.25%] ··· Running timedelta.TimedeltaConstructor.time_from_missing                                                                 2.11±0.03μs
[ 37.50%] ··· Running timedelta.TimedeltaConstructor.time_from_np_timedelta                                                             4.79±0.1μs
[ 43.75%] ··· Running timedelta.TimedeltaConstructor.time_from_string                                                                  6.32±0.08μs
[ 50.00%] ··· Running timedelta.TimedeltaConstructor.time_from_unit                                                                     6.22±0.1μs
[ 50.00%] · For pandas commit hash 6552718d:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 56.25%] ··· Running timedelta.TimedeltaConstructor.time_from_components                                                               14.8±0.2μs
[ 62.50%] ··· Running timedelta.TimedeltaConstructor.time_from_datetime_timedelta                                                      7.28±0.09μs
[ 68.75%] ··· Running timedelta.TimedeltaConstructor.time_from_int                                                                      5.67±0.1μs
[ 75.00%] ··· Running timedelta.TimedeltaConstructor.time_from_iso_format                                                                   failed
[ 81.25%] ··· Running timedelta.TimedeltaConstructor.time_from_missing                                                                 2.09±0.05μs
[ 87.50%] ··· Running timedelta.TimedeltaConstructor.time_from_np_timedelta                                                            4.74±0.04μs
[ 93.75%] ··· Running timedelta.TimedeltaConstructor.time_from_string                                                                   5.97±0.1μs
[100.00%] ··· Running timedelta.TimedeltaConstructor.time_from_unit                                                                     6.29±0.1μs
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Match a provided string against an ISO 8601 pattern, providing a group for
each ``Timedelta`` component.
"""
pater = re.compile(r"""P
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you prob want to compile this pattern in a module level variable to avoid repeatedly doing this.

@@ -506,6 +526,33 @@ def _binary_op_method_timedeltalike(op, name):
# ----------------------------------------------------------------------
# Timedelta Construction

def _value_from_iso_match(match):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make cdef and type this as int64_t

match_dict = {k: int(v) for k, v in match_dict.items()}
nano = match_dict.pop('nanoseconds', 0)

return nano + convert_to_timedelta64(timedelta(**match_dict), 'ns')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just return the nanoseconds here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow - don't we need to the timedelta64 to get the nanosecond precision of all the components?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use timedelta_from_spec and construct the number of nanoseconds, then simply return that

@@ -506,6 +526,33 @@ def _binary_op_method_timedeltalike(op, name):
# ----------------------------------------------------------------------
# Timedelta Construction

def _value_from_iso_match(match):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this parse_iso_format_string

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it shouldn't take the match, but the string itself.

match = match_iso_format(value)
value = _value_from_iso_match(match)
else:
value = np.timedelta64(parse_timedelta_string(value))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should look like

if len(value) > 0 and vaue[0] == 'P':
     value = parse_iso_format(value)
else:
     value = parse_timedelta_string(value)
value = np.timedelta64(value)

('P0DT0H0M0.000000123S', Timedelta(nanoseconds=123)),
('P0DT0H0M0.00001S', Timedelta(microseconds=10)),
('P0DT0H0M0.001S', Timedelta(milliseconds=1)),
('P0DT0H1M0S', Timedelta(minutes=1))])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some invalid matches that should raise (prob need to catch inside the parsing function and raise a nice message)

@codecov
Copy link

codecov bot commented Jan 4, 2018

Codecov Report

Merging #19065 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #19065   +/-   ##
=======================================
  Coverage   91.51%   91.51%           
=======================================
  Files         148      148           
  Lines       48680    48680           
=======================================
  Hits        44550    44550           
  Misses       4130     4130
Flag Coverage Δ
#multiple 89.88% <ø> (ø) ⬆️
#single 41.68% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 35b2aba...6bab9da. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if a hand written parser is better (though certainly more work), so maybe in the future

for k, v in match_dict.items():
ns += timedelta_from_spec(v, '0', k)

else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should just raise here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an expert on Cython but pytest was failing when raising the exception inside of this function. I believe it is attributable to the int64_t function declaration. I could remove that type and raising the exception directly inside the function would work. Otherwise, I was getting the below output during testing.

-------------------------------------------------------------- Captured stderr call --------------------------------------------------------------
AttributeError: 'NoneType' object has no attribute 'groupdict'
Exception ignored in: 'pandas._libs.tslibs.timedeltas.parse_iso_format_string'
AttributeError: 'NoneType' object has no attribute 'groupdict'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, you just need to add:
except? -1: to the declaration; this tells cython that you may raise in a cdef function so it should check. you don't need to explicity return -1 though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK thanks. Will make changes and re-push

value = np.timedelta64(parse_timedelta_string(value))
if len(value) > 0 and value[0] == 'P':
iso_val = parse_iso_format_string(value)
if iso_val == -1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic is much simpler if you raise above

Returns
-------
ns: int64_t
Precision in nanoseconds of matched ISO 8601 duration, or -1 if
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a Raises section instead

@jreback jreback merged commit c6166b0 into pandas-dev:master Jan 5, 2018
@jreback
Copy link
Contributor

jreback commented Jan 5, 2018

thanks!

@WillAyd WillAyd deleted the iso-cons branch January 5, 2018 14:22
@TomAugspurger
Copy link
Contributor

Just seeing this and #11375 now.

In #11375 (comment) @jorisvandenbossche mentioned these would be better represented as DateOffsets, since there's some debate about whether ISO 8601 durations are relative (DateOffset) or absolute (TimeDelta)

I think it's still useful to support this as a format for parsing timedeltas, but we may want to put some warnings / caveats in the documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Timedelta Timedelta data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Timedelta ISO Format Constructor
3 participants