
decode_cf_datetime() slow because pd.to_timedelta() is slow if floats are passed #1399

Closed
cchwala opened this issue May 5, 2017 · 6 comments

Comments

@cchwala
Contributor

cchwala commented May 5, 2017

Hi,
decode_cf_datetime is slowed down because it always passes floats to pd.to_timedelta, while pd.to_timedelta is much faster when working on integers.

Here is a notebook that shows the differences. Working with integers is approx. one order of magnitude faster.
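Something along these lines shows the effect (illustrative only; the array size and the unit='s' choice are my assumptions here, and the exact speedup depends on the pandas version):

```python
import numpy as np
import pandas as pd
from timeit import timeit

n = 1000000
int_seconds = np.arange(n, dtype='int64')
float_seconds = int_seconds.astype('float64')

# pd.to_timedelta has a fast path for integer input ...
t_int = timeit(lambda: pd.to_timedelta(int_seconds, unit='s'), number=10)
# ... while float input goes through a much slower conversion
t_float = timeit(lambda: pd.to_timedelta(float_seconds, unit='s'), number=10)

print('int64:   %.3f s' % t_int)
print('float64: %.3f s' % t_float)
```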

Hence, it would be great to automatically convert the raw time value floats to integer nanoseconds where possible (likely limited to resolutions below days or hours, to avoid coping with the differing numbers of nanoseconds in e.g. different months).

As an alternative, maybe avoid forcing the cast to floats and indicate in the docstring that the raw values should be integers to speed up the conversion.

This could possibly also be resolved in pd.to_timedelta but I assume it will be more complicated to deal with all the edge cases there.

@fmaussion
Member

Hi Christian!

As alternative, maybe avoid forcing the cast to floats and indicate in the docstring that the raw values should be integers to speed up the conversion.

This sounds much less error prone to me. In particular, I am getting a bit nervous when I hear "nanoseconds" ;-) (see #789)

@shoyer
Member

shoyer commented May 5, 2017

Good catch! We should definitely speed this up.

Hence, it would be great to automatically convert the raw time value floats to integer nanoseconds where possible (likely limited to resolutions below days or hours, to avoid coping with the differing numbers of nanoseconds in e.g. different months).

Yes, very much agreed.

For units such as months or years, we are already giving the wrong result when we use pandas:

In [18]: pd.to_timedelta(1, unit='M')
Out[18]: Timedelta('30 days 10:29:06')

In these cases, we should fall back to using netCDF4/netcdftime instead. We may also need to add more checks for numeric overflow.
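A rough sketch of what such a fallback could look like (a hypothetical helper, only meant to illustrate the idea; real code would need proper parsing of the units string):

```python
import netCDF4

def decode_with_fallback(raw_values, units, calendar='standard'):
    # hypothetical helper, not xarray's actual implementation
    delta = units.split()[0].lower().rstrip('s')  # e.g. 'months since ...' -> 'month'
    if delta in ('month', 'year'):
        # months/years have no fixed length, so pandas timedeltas give wrong answers
        return netCDF4.num2date(raw_values, units, calendar)
    raise NotImplementedError('fast pandas path for fixed-length units goes here')
```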

As an alternative, maybe avoid forcing the cast to floats and indicate in the docstring that the raw values should be integers to speed up the conversion.

Yes, this might also work. I no longer recall why we cast all inputs to floats (maybe just for consistency), but I suspect that one of our time conversion libraries (probably netCDF4/netcdftime) expects a float array. Certainly we will still need to support floating point times saved in netCDF files, which are pretty common in my experience.

@cchwala
Contributor Author

cchwala commented May 8, 2017

Hmm... The "nanosecond" issue seems to need a fix at the very foundation. As long as pandas and xarray rely on datetime64[ns] you cannot avoid nanoseconds, right? pd.to_datetime() forces the conversion to nanoseconds even if you pass integers with a time unit other than ns. This does not make me as nervous as Fabien since my data is always quite recent, but I see that this is far from ideal for a tool for climate scientists.
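For example (illustrative; this is how pandas behaves as far as I can tell):

```python
import pandas as pd

# integer input with a coarser unit still comes back at nanosecond precision
idx = pd.to_datetime([0, 1, 2], unit='D')
print(idx.dtype)  # datetime64[ns]
```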

An intermediate fix (@shoyer, do you actually want one?) that I could think of for the performance issue right now would be to do the conversion to datetime64[ns] depending on the time unit, e.g.

  • multiply the raw values (most likely floats) by the number of nanoseconds in the time unit for units smaller than days (or hours?) and use these values as integers in pd.to_datetime() (see the sketch after this list)
  • else, fall back to using netCDF4/netcdftime for months and years (as suggested by @shoyer), casting the raw values to floats
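Roughly like this for the first case (a hypothetical helper, only meant to illustrate the idea; unit-string handling and overflow checks are left out, and I add the reference date via a timedelta rather than calling pd.to_datetime directly):

```python
import numpy as np
import pandas as pd

NS_PER_UNIT = {  # fixed-length units only
    'seconds': 1000000000,
    'minutes': 60 * 1000000000,
    'hours': 3600 * 1000000000,
    'days': 86400 * 1000000000,
}

def decode_small_units(raw_values, unit, reference_date):
    # hypothetical helper, not the final implementation
    ns = np.asarray(raw_values) * NS_PER_UNIT[unit]
    # casting to int64 keeps pandas on its fast integer code path
    deltas = pd.to_timedelta(ns.astype('int64'), unit='ns')
    return pd.Timestamp(reference_date) + deltas
```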

The only thing that bothers me is that I am not sure whether the "number of nanoseconds" is always the same in every day or hour from the point of view of datetime64, due to leap seconds or other particularities.

@shoyer: Does this sound reasonable or did I forget to take into account any side effects?

@fmaussion
Member

As long as pandas and xarray rely on datetime64[ns] you cannot avoid nanoseconds, right?

yes, you can ignore my comment!

@shoyer
Member

shoyer commented May 8, 2017

This does not make me as nervous as Fabien since my data is always quite recent, but I see that this is far from ideal for a tool for climate scientists.

@spencerkclark has been working on a patch to natively support other datetime precisions in xarray (see #1252).

The only thing that bothers me is that I am not sure if the "number of nanoseconds" is always the same in every day or hour in the view of datetime64, due to leap seconds or other particularities.

For better or worse, NumPy's datetime64 ignores leap seconds.

Does this sound reasonable or did I forget to take into account any side effects?

This sounds pretty reasonable to me. The main challenge here will be guarding against integer overflow -- you might need to do the math twice, once with floats (to check for overflow) and then with integers.

You could also experiment with doing the conversion with NumPy instead of pandas, using .astype('timedelta64[{}]'.format(units)).
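An untested sketch of that NumPy route, combined with the float-based overflow guard mentioned above (the helper name and the exact checks are just placeholders):

```python
import numpy as np

def to_datetime64(raw_values, units, reference_date):
    # hypothetical helper; 'units' must be a fixed-length NumPy unit like 's' or 'h'
    raw = np.asarray(raw_values)
    ns_per_unit = np.timedelta64(1, units) / np.timedelta64(1, 'ns')
    # do the math once in floats to check that the result fits into int64 nanoseconds
    if np.any(np.abs(raw.astype('float64') * ns_per_unit) > np.iinfo('int64').max):
        raise OverflowError('values do not fit into datetime64[ns]')
    # then do it again with integers, which is the fast path
    deltas = raw.astype('int64').astype('timedelta64[{}]'.format(units))
    return np.datetime64(reference_date) + deltas
```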

@cchwala
Contributor Author

cchwala commented May 9, 2017

Okay. I will try to come up with a PR within the next few days.
