Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using xr.to_netcdf after xr.concat introduces max/min constraints based on first file #8135

Closed
4 tasks done
MRibberink opened this issue Sep 1, 2023 · 6 comments
Closed
4 tasks done

Comments

@MRibberink
Copy link

What happened?

I'm using xarray to concatenate several netcdf files in time (ERA5 mean sea level data that needs to be downloaded in separate months), but because I have a lot of them I save the concatenated files to netcdf after using them so I can later do my data analysis in a loop rather than each by hand. Doing so unfortunately seems to set a min/max on the whole timeseries based on the first file, though this doesn't exist when the data is plotted before being processed by to_netcdf.

Data can be found here:
https://www.dropbox.com/sh/xb1c2018ey83poc/AABDpQnjbWcGdtq_MwJeEEc-a?dl=0

The code below generates several diagnostic images, primary among them showing the problem is this one:
min_pres_error

Where the blue line is just the concatenated original datasets, and the orange line is that data saved by to_netcdf and then called up again. The first file ends at the end of August.

The same problem causes weird phenomena when looking at the sea level pressure fields, as shown in the other created plots. The first is how it should look (just concatted together), but the second is what it actually looks like (to_netcdf). Here's how they look on my end:
image
image

I also included a cross-section through the phenomenon off the coast of Canada to show what it's doing when it encounters values below the minimum - it brings them so the first value below is at the level of the maximum defined earlier:
image

I found a weird workaround, the problem is fixed by adding 0 to the dataset before sending it to to_netcdf, but this still shouldn't be happening (and I have NO idea why that fixes it).

What did you expect to happen?

I expected to_netcdf to accurately save the dataset the way it is, rather than setting a max/min based on the first file. In essence, to reproduce the minimum pressure line that was plotted before saving and reloading the data.

Minimal Complete Verifiable Example

import xarray as xr
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
crs = ccrs.PlateCarree()

#Import and concat data
Lor1=xr.open_dataset('data_62_2022.nc')
Lor2=xr.open_dataset('data_63_2022.nc')

Lor3=xr.concat([Lor1,Lor2],dim='time')

#adding this line is a workaround and makes it function for some reason
#Lor3['msl']=Lor3['msl']+0.0

#Save and reimport data
Lor3.to_netcdf("test/lor_4.nc")
Lor4=xr.open_dataset('test/lor_4.nc')

#Diagnostic plots

#minimum pressure
fig,ax=plt.subplots()
Lor3.msl.min("longitude").min('latitude').plot(label='concat')
Lor4.msl.min("longitude").min('latitude').plot(label='to_netcdf')
plt.legend()
plt.title("Minimum pressure")

#mslp field concat (what it should look like)
fig1=plt.figure(dpi=150)
ax1 = plt.axes(projection=crs, frameon=True)
ax1.coastlines(resolution='50m')
Lor3.msl.isel(time=250).plot(ax=ax1)
ax1.set_title("Concat")

#mslp field to_netcdf (what it actually looks like)
fig2=plt.figure(dpi=150)
ax2 = plt.axes(projection=crs, frameon=True)
ax2.coastlines(resolution='50m')
Lor4.msl.isel(time=250).plot(ax=ax2)
ax2.set_title("to_netcdf")

#cross section mslp field
fig3,ax3=plt.subplots()
Lor3.msl.isel(time=250).sel(latitude=42).plot(label='concat')
Lor4.msl.isel(time=250).sel(latitude=42).plot(ls='--',label='to_netcdf')
ax3.axhline(Lor1.msl.max(),color='k',ls='--',label='max file 1')
ax3.axhline(Lor1.msl.min(),color='k',ls=':',label='min file 1')
plt.legend()

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:14:58) [MSC v.1929 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 11, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('English_Canada', '1252')
libhdf5: 1.14.1
libnetcdf: 4.9.2

xarray: 2023.6.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.0
netCDF4: 1.6.4
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.13.3
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.6.0
distributed: 2023.6.0
matplotlib: 3.7.0
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2023.4.0
cupy: None
pint: 0.22
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.0.0
pip: 23.0
conda: None
pytest: 7.4.0
mypy: None
IPython: 8.12.2
sphinx: 4.4.0

@MRibberink MRibberink added bug needs triage Issue that has not been reviewed by xarray team member labels Sep 1, 2023
@welcome
Copy link

welcome bot commented Sep 1, 2023

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@kmuehlbauer
Copy link
Contributor

@MRibberink Thanks for the report.

Most likely issue is different cf-coding attributes on that variable (maybe packed-data, scale_factor/add_offset).

You can check this with:

print(Lor1.encoding, Lor1["msl"].encoding)
print(Lor2.encoding, Lor2["msl"].encoding)

During concatenating only the encoding-dict of the first object survives (as you already assessed), leading to a wrong/different encoding into netcdf.

If different cf-coding attributes is the issue, you have several options:

@kmuehlbauer kmuehlbauer removed the needs triage Issue that has not been reviewed by xarray team member label Sep 1, 2023
@kmuehlbauer
Copy link
Contributor

It looks like my first guess was correct, we have packed data for the msl variable. Thanks for providing the data @MRibberink.

short msl(time, latitude, longitude) ;
    msl:scale_factor = 0.0755888254772405 ;
    msl:add_offset = 100995.993455587 ;
    msl:_FillValue = -32767s ;
    msl:missing_value = -32767s ;
    msl:units = "Pa" ;
    msl:long_name = "Mean sea level pressure" ;
    msl:standard_name = "air_pressure_at_mean_sea_level" ;

@kmuehlbauer
Copy link
Contributor

To answer your question: Adding 0.0 to the variable turns it into float and the encoding-dict will be cleared. This will in turn write floating point data to netcdf, which might result in a somewhat larger file.

Please see also the discussion on future of encoding-dict here: #6323

@MRibberink
Copy link
Author

Wonderful. Thank you so much!

@kmuehlbauer
Copy link
Contributor

@MRibberink You are very welcome. So assuming you've got the answers you need, we can close this. Please do not hesitate to open follow-up issues whenever necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants