
Don't cast NaN to integer #7098

Closed
wants to merge 1 commit into from

Conversation

andreas-schwab

Do not try to convert NaN to integer type, as the operation is undefined
and results in random values. This fixes all testsuite failures.

@TomNicholas
Member

Hi @andreas-schwab, and welcome to xarray!

Can you tell us what specific failures this change fixed for you? If there was something failing, we want to capture it in our test suite, but I am not sure what failure you are referring to.

@andreas-schwab
Author

andreas-schwab commented Sep 29, 2022 via email

@TomNicholas
Member

> It's already perfectly covered by the testsuite.

Okay great, but then why are these tests failing for you (locally?) and not in our CI runs? (Our main branch passed all automated tests just now.) Do you have a different version of some package that we aren't testing against?

I'm asking because if our current CI test runs don't reproduce this error, then we (a) have no way to check that your change fixed the error, and (b) will not know if some regression causes this error to resurface in the future. Standard practice would be to raise an issue that demonstrates the problem, then link the fix PR to that issue.

@andreas-schwab
Author

andreas-schwab commented Sep 29, 2022 via email

@DocOtak
Contributor

DocOtak commented Sep 29, 2022

@TomNicholas Something different will need to happen with that cast eventually. See #6191 for something that fails on some users' systems but currently cannot be captured in the tests. NumPy has already added runtime warnings about this cast and is "thinking about" making NaN-to-int casts raise (numpy/numpy#14412). Xarray's own @shoyer has hit issues like this before as well (numpy/numpy#6109).

@TomNicholas
Member

Thank you very much for the context, @DocOtak!

@dcherian
Contributor

I think the real solution here is to explicitly handle NaNs during the decoding step. We do want these to be NaT in the output.

cc @spencerclark
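
(For illustration only, not xarray's actual code: a minimal sketch of handling NaNs during decoding by masking the missing entries, decoding the valid values, and writing NaT back in their place. The epoch and unit are assumptions for the example.)

import numpy as np

# Hypothetical raw values, e.g. "days since 2000-01-01", with NaN marking missing entries.
num_dates = np.array([0.0, np.nan, 2.0])

# Pre-fill the output with NaT, then decode only the valid entries.
decoded = np.full(num_dates.shape, np.datetime64("NaT"), dtype="datetime64[ns]")
valid = ~np.isnan(num_dates)
epoch = np.datetime64("2000-01-01", "ns")
ns_per_day = 86_400_000_000_000
decoded[valid] = epoch + (num_dates[valid] * ns_per_day).astype("timedelta64[ns]")
# decoded is now ['2000-01-01', 'NaT', '2000-01-03'], with NaT where the input was NaN.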

@kxxt

kxxt commented Apr 16, 2023

Hi, is there any reason this PR has not been merged yet? It fixes some failing tests on riscv64.

@spencerkclark
Member

Thanks @andreas-schwab -- sorry that we let this sit for so long. This is indeed an important issue to address, and the solution is good. I just think we might be able to preserve the optimization in the case that NaNs are not present in the input data, which would be nice.

It looks like we've been getting the warnings that @DocOtak mentioned when running the test_roundtrip_numpy_datetime_data tests:

RuntimeWarning: invalid value encountered in cast
  flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(
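
(The warning is straightforward to reproduce outside of xarray; a minimal sketch mirroring the cast in question, where the literal scale factor stands in for _NS_PER_TIME_DELTA[delta]:)

import numpy as np

flat_num_dates = np.array([0.0, 1.5, np.nan])
# On recent NumPy this emits "RuntimeWarning: invalid value encountered in cast";
# the NaN entry becomes an arbitrary, platform-dependent integer.
flat_num_dates_ns_int = (flat_num_dates * 1_000_000_000).astype(np.int64)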

Comment on lines -240 to -244
# Cast input ordinals to integers of nanoseconds because pd.to_timedelta
# works much faster when dealing with integers (GH 1399).
flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(
    np.int64
)
Member

Could we maybe preserve this optimization (#1399) in the case that NaNs are not present in flat_num_dates?
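
(One possible shape for that, sketched here with illustrative names rather than xarray's actual function signature: fall back to the float path only when NaNs are actually present.)

import numpy as np
import pandas as pd

def _flat_num_dates_to_timedelta(flat_num_dates, ns_per_unit):
    values = flat_num_dates * ns_per_unit
    if np.isnan(values).any():
        # Slow path: pandas converts NaN to NaT, avoiding the undefined NaN-to-int cast.
        return pd.to_timedelta(values, unit="ns")
    # Fast path: keep the int64 cast that makes pd.to_timedelta faster (GH 1399).
    return pd.to_timedelta(values.astype(np.int64), unit="ns")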

Contributor

Are there any ASV benchmarks for this so we can see whether pd.to_timedelta is still slow with non-integer input?

@spencerkclark
Member

spencerkclark commented May 2, 2023

True, the issue that motivated that PR is now quite old, so it is worth revisiting. If we use the test cases in the notebook linked in #1399, we see that pd.to_timedelta has improved a lot in this regard, though it still trails the current approach by about a factor of four:

In [1]: import numpy as np; import pandas as pd

In [2]: t_minutes = np.arange(1.0,100000.0, 0.13, dtype=np.float64)

In [3]: %%timeit
   ...: pd.to_timedelta(t_minutes, unit='m')
   ...:
   ...:
10 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %%timeit
   ...: pd.to_timedelta((t_minutes * 60 * 1e9).astype(np.int64), unit='ns')
   ...:
   ...:
2.26 ms ± 79.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: pd.__version__
Out[5]: '2.0.0'

In [6]: np.__version__
Out[6]: '1.24.3'

Previous timings were 5.2 seconds and 10.6 milliseconds, respectively.

Contributor

I'm wondering why there should be any floating point time values at all when decoding. Are there backends which save times as floating point?

Currently, if times are in numpy datetime64 representation and contain NaT, that NaT has a fixed int64 representation. We would only need to skip CFMaskCoder and do the masking in CFDatetimeCoder, at least for those cases. Does that make sense?
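
(A quick illustration of what "NaT has a fixed int64 representation" means; not part of the PR.)

import numpy as np

times = np.array(["2022-09-29", "NaT"], dtype="datetime64[ns]")
as_int = times.view("int64")
# NaT is stored as the minimum int64 value, so datetime64 data can round-trip
# through int64 without ever passing through NaN.
print(as_int[-1] == np.iinfo(np.int64).min)   # True
print(as_int.view("datetime64[ns]"))          # ['2022-09-29T00:00:00.000000000' 'NaT']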

Member

> I'm wondering why there should be any floating point time values at all when decoding. Are there backends which save times as floating point?

This is generally up to how xarray, the user, or whoever created the data we are reading in, configured the encoding. Some files indeed do have times encoded as floats with a fill value of NaN. Some of those files were created by xarray (maybe we have some control over this, but not over past decisions); some may have been created with other tools (unfortunately we have no control over this).

I think one could argue that xarray should not create these files automatically (as you've noted it currently does if NaT is present in the data, which I agree should be fixed), but I'm not sure how we would guard against the case where someone explicitly sets the units encoding of the times in a way that requires floating point values (raising feels a little extreme).
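
(A hypothetical example of the kind of file being discussed: times stored as floating point "days since" a reference date, with NaN as the fill value. Decoding through pandas turns the NaN into NaT without any integer cast.)

import numpy as np
import pandas as pd

num_dates = np.array([0.0, 1.5, np.nan])      # "days since 2000-01-01", NaN = missing
reference = pd.Timestamp("2000-01-01")
decoded = reference + pd.to_timedelta(num_dates, unit="D")
# DatetimeIndex(['2000-01-01', '2000-01-02 12:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)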

Contributor

@spencerkclark I very much agree here, especially on the point of raising an error; a warning would be sufficient.

My real concern here, and this should really be fixed, is that on decoding, any time array with an associated _FillValue (even one we get as int64) will be transformed to floating point in CFMaskCoder. I think I'd better open a new issue for that.

Member

Ah I see what you mean. I agree it would be good to open a new issue for that.

Contributor

In my oceanographic experience, units of "days since X" with floating point values are very common for indicating time; Argo comes immediately to mind. Fill values are encountered when the time data is not part of a dimension/coordinate variable, which is also considered valid, especially in observational data.

@kmuehlbauer
Contributor

@andreas-schwab Thanks again for this PR. It turned out to be a bit more involved to make this work, and we hope #7827 solves the underlying issue.
