
Differences on datetime values appears after writing reindexed variable on netCDF file #1064

Closed · Scheibs opened this issue Oct 27, 2016 · 12 comments · Fixed by #7827 or #8201
@Scheibs commented Oct 27, 2016

In my Dataset I've got a time series coordinate that begins like this:

<xarray.DataArray 'time' (time: 10)>
array(['2014-02-15T00:00:00.000000000+0100',
       '2014-02-15T18:10:00.000000000+0100',
       '2014-02-16T18:10:00.000000000+0100',
       '2014-02-17T18:10:00.000000000+0100',
       '2014-02-18T18:10:00.000000000+0100',
       '2014-02-19T18:10:00.000000000+0100',
       '2014-02-20T18:10:00.000000000+0100',
       '2014-02-21T18:10:00.000000000+0100',
       '2014-02-22T00:00:00.000000000+0100',
       '2014-02-23T00:00:00.000000000+0100'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2014-02-14T23:00:00 2014-02-15T17:10:00 ...

And everything is OK when I write and re-open the netCDF file.

Then I try to add a reindexed variable to this dataset, like this:

da["MeanRainfallHeigh"] = rain.reindex(time=da.time).fillna(0)

The write still succeeds, but when I reopen the netCDF file, the minutes part of the time values has changed:

<xarray.DataArray 'time' (time: 10)>
array(['2014-02-15T00:00:00.000000000+0100',
       '2014-02-15T18:00:00.000000000+0100',
       '2014-02-16T18:00:00.000000000+0100',
       '2014-02-17T18:00:00.000000000+0100',
       '2014-02-18T18:00:00.000000000+0100',
       '2014-02-19T18:00:00.000000000+0100',
       '2014-02-20T18:00:00.000000000+0100',
       '2014-02-21T18:00:00.000000000+0100',
       '2014-02-22T00:00:00.000000000+0100',
       '2014-02-23T00:00:00.000000000+0100'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2014-02-14T23:00:00 2014-02-15T17:00:00 ...

Thanks!

@jhamman (Member) commented Oct 28, 2016

@Scheibs - thanks for the report. Can you provide a simple, minimum working example (MWE)?

@Scheibs (Author) commented Nov 16, 2016

This is the warning I got when I wrote my file with to_netcdf():
xarray\conventions.py:1060: RuntimeWarning: saving variable time with floating point data as an integer dtype without any _FillValue to use for NaNs
for k, v in iteritems(variables))

@jhamman It seems that the error appears only with a variable "rain" that comes from a previously created netCDF file, but I will try to provide you with an example. Thanks!

@Scheibs (Author) commented Nov 16, 2016

@jhamman Here is my example file:
ftp://ftp.irsn.fr/argon/Example

@NotSqrt (Contributor) commented Jan 17, 2018

I faced this issue when switching from a concat to a merge.

The first merged dataset had a time dimension whose encoding was {'calendar': 'proleptic_gregorian', 'dtype': dtype('int64'), 'units': 'minutes since 2017-08-20 00:00:00'}, which meant that the data from the second merged dataset could not be stored with a finer resolution than minutes.

If I try to store values like '2017-08-20 00:00:30', I get the warning xarray\conventions.py:1092: RuntimeWarning: saving variable time with floating point data as an integer dtype without any _FillValue to use for NaNs.

Maybe it is similar in your case: the netCDF file stored the data as 'hours since XXXX', so you lose the minutes.
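
A minimal sketch of how to check for this (the file name "example.nc" is hypothetical): inspect the encoding that xarray attached to the time coordinate when it opened the file.

import xarray

ds = xarray.open_dataset("example.nc")
# the encoding is populated from the file on open, e.g.
# {'units': 'hours since 2014-02-14 23:00:00', 'dtype': dtype('int64'), ...}
print(ds.time.encoding)

# clearing it lets xarray choose fresh units and dtype on the next write
ds.time.encoding = {}
ds.to_netcdf("example_rewritten.nc")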

@shoyer (Member) commented Jan 17, 2018

@NotSqrt can you make a minimum working example for this? e.g., a netCDF file with problematic data, and associated code that writes a netCDF file with lost time resolution. That would really help us diagnose and solve this problem.

@NotSqrt (Contributor) commented Jan 18, 2018

There you go!

import numpy
import pandas
import tempfile
import warnings
import xarray


array1 = xarray.DataArray(
    numpy.random.rand(5),
    dims=['time'],
    coords={'time': pandas.to_datetime(['2018-01-01', '2018-01-01 00:01', '2018-01-01 00:02', '2018-01-01 00:03', '2018-01-01 00:04'])},
    name='foo'
)

array2 = xarray.DataArray(
    numpy.random.rand(5),
    dims=['time'],
    coords={'time': pandas.to_datetime(['2018-01-01 00:05', '2018-01-01 00:05:10', '2018-01-01 00:05:20', '2018-01-01 00:05:30', '2018-01-01 00:05:40'])},
    name='foo'
)

with tempfile.NamedTemporaryFile() as tmp:
    # save first array
    array1.to_netcdf(tmp.name)
    # reload it
    array1_reloaded = xarray.open_dataarray(tmp.name)

    # the time encoding stores minutes as int, so seconds won't be allowed at the next call of to_netcdf
    assert array1_reloaded.time.encoding['dtype'] == numpy.int64
    assert array1_reloaded.time.encoding['units'] == 'minutes since 2018-01-01 00:00:00'

    merged = xarray.merge([array1_reloaded, array2])
    array1_reloaded.close()

    with warnings.catch_warnings():
        warnings.filterwarnings('error', category=RuntimeWarning)
        merged.to_netcdf(tmp.name)

@NotSqrt (Contributor) commented Jan 23, 2018

FYI, setting merged.time.encoding = {} before calling to_netcdf seems to avoid the RuntimeWarning.
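
In the context of the example above, that looks like this (a sketch reusing the names from the previous snippet):

merged = xarray.merge([array1_reloaded, array2])
# drop the encoding inherited from array1's file so that to_netcdf
# picks units and a dtype that can represent the seconds
merged.time.encoding = {}
merged.to_netcdf(tmp.name)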

@kmuehlbauer (Contributor) commented

@NotSqrt If you are still working on this, I'd appreciate it if you could test against #7827.

That PR adds another warning with some more detail about what's going on. The issue remains that the requested encoding in minutes does not work with the actual data, hence the second warning. But maybe we can find a way to also check dtypes.

@NotSqrt (Contributor) commented Sep 18, 2023

I've run the example I gave above.

import numpy
import pandas
import tempfile
import warnings
import xarray


array1 = xarray.DataArray(
    numpy.random.rand(5),
    dims=['time'],
    coords={'time': pandas.to_datetime(['2018-01-01', '2018-01-01 00:01', '2018-01-01 00:02', '2018-01-01 00:03', '2018-01-01 00:04'], format='ISO8601')},
    name='foo'
)

array2 = xarray.DataArray(
    numpy.random.rand(5),
    dims=['time'],
    coords={'time': pandas.to_datetime(['2018-01-01 00:05', '2018-01-01 00:05:10', '2018-01-01 00:05:20', '2018-01-01 00:05:30', '2018-01-01 00:05:40'], format='ISO8601')},
    name='foo'
)

with tempfile.NamedTemporaryFile() as tmp:
    # save first array
    array1.to_netcdf(tmp.name)
    # reload it
    array1_reloaded = xarray.open_dataarray(tmp.name)

    # the time encoding stores minutes as int, so seconds won't be allowed at the next call of to_netcdf
    assert array1_reloaded.time.encoding['dtype'] == numpy.int64
    assert array1_reloaded.time.encoding['units'] == 'minutes since 2018-01-01 00:00:00'

    merged = xarray.merge([array1_reloaded, array2])
    array1_reloaded.close()

    # this line avoids losing precision and removes both warnings
    #merged.time.encoding = {}
    
    # this line removes the conversion to ints, which solves the resolution loss and removes the second warning
    #merged.time.encoding.pop('dtype')

    merged.to_netcdf(tmp.name)
    merged_reloaded = xarray.open_dataarray(tmp.name)
    numpy.testing.assert_array_equal(
        numpy.concatenate([array1.time, array2.time]), 
        merged_reloaded.time.values
    )

I see that now the warnings are:

  • UserWarning: Times can't be serialized faithfully with requested units 'minutes since 2018-01-01'. Resolution of 'seconds' needed. Serializing timeseries to floating point.
  • SerializationWarning: saving variable time with floating point data as an integer dtype without any _FillValue to use for NaNs

And since the last code statement still shows that the seconds are lost, we still have to use merged.time.encoding = {} or merged.time.encoding.pop('dtype') to be sure not to lose precision.
I guess the serialization to floating point is overwritten by the integer dtype determined after the first save, which means the floating-point values were not helpful without also changing the dtype encoding.

If the resolution loss can't be fixed automatically, it would be nice if the warning included a link to, or a summary of, what the user has to do to avoid the resolution loss!

Thanks !

@kmuehlbauer reopened this Sep 18, 2023

@kmuehlbauer (Contributor) commented Sep 18, 2023

Thanks @NotSqrt for the detailed test and reasoning.

The issue is, as you already wrote, with encoding: only the encoding of the first dataset survives the merge. If you switch the order of the objects, your code runs successfully.
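
A sketch of that order swap, reusing the names from the example above: the merged time coordinate then carries array2's (empty) encoding, so xarray picks defaults that can represent the seconds.

merged = xarray.merge([array2, array1_reloaded])
merged.to_netcdf(tmp.name)  # round-trips without losing the seconds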

As we do not update encoding and want to get rid of it soon (see the discussion in #6323), there is not much to be done.

I very much agree that the user should get as much information as possible out of any warnings/errors, so they can follow up easily.

There are at least the following three possible actions:

  1. Suggest using .reset_encoding on the merged dataset. As this might have unwanted side effects on other variables, it might be better to apply it only where necessary (e.g. the time variable); see the sketch at the end of this comment.
  2. Automatically change the encoding dtype to float64 in those cases.
  3. a. Special-case times/timedeltas in NonStringCoder to prevent the conversion to int.
    b. Remove dtype in CFDatetimeCoder / CFTimedeltaCoder.

From my perspective the least intrusive action would be 3b. For your example this would print just the first warning (which provides the needed information), and the seconds would be preserved.
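
A sketch of option 1 (assuming an xarray version that provides .reset_encoding; later releases rename it to .drop_encoding):

# variant (a): clear the encoding of every variable in the merged dataset
merged = merged.reset_encoding()

# variant (b): clear it only where necessary, i.e. on the time coordinate,
# to avoid side effects on the other variables' encodings
merged.time.encoding = {}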

@kmuehlbauer (Contributor) commented

@NotSqrt #8201 is not yet fully ready, but you might already give it a try. Thanks!

@kmuehlbauer (Contributor) commented

#8201 will take care of this issue as follows:

It issues this warning:

* `UserWarning: Times can't be serialized faithfully with requested units 'minutes since 2018-01-01'. Resolution of 'seconds' needed. Serializing timeseries to floating point.`

> If the resolution loss can't be fixed automatically, what would be nice in the warning is a link or a summary of what the user has to do to solve the resolution loss!

And it automatically drops dtype from the encoding when the times need to be encoded as float64. That prevents the recast to int64 and, with it, the precision loss.
