read ncml files to create multifile datasets #2697

Closed
rabernat opened this issue Jan 22, 2019 · 18 comments

@rabernat
Contributor

This issue was motivated by a recent conversation with @jdha regarding how they are preparing inputs for regional ocean models. They are currently using ncml with netcdf-java to consolidate and homogenize diverse data sources. But this approach doesn't play well with the xarray / dask stack.

NcML is a standard developed by Unidata for use with their netCDF-Java library:

NcML is an XML representation of netCDF metadata, (approximately) the header information one gets from a netCDF file with the "ncdump -h" command.

In addition to describing individual netCDF files, ncml can be used to annotate modifications to netCDF metadata (attributes, dimension names, etc.) and also to aggregate multiple files into a single logical dataset. This is what such an aggregation over an existing dimension looks like in ncml:

<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="time" type="joinExisting">
    <netcdf location="jan.nc" />
    <netcdf location="feb.nc" />
  </aggregation>
</netcdf>

Obviously this maps very well to xarray's concat operation. Similar aggregations can be defined that map to merge operations.
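
To make the mapping concrete, here is a minimal sketch that reads a joinExisting aggregation and hands the file list to xarray. The helper names `parse_join_existing` and `open_join_existing` are hypothetical (they are not part of xarray), only the simplest form of aggregation is handled, and the `location` values are assumed to be paths that xarray can open directly.

```python
import xml.etree.ElementTree as ET

# Namespace used by NcML 2.2 documents.
NCML_NS = {"nc": "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"}


def parse_join_existing(ncml_text):
    """Return (dim_name, [file locations]) for a joinExisting aggregation."""
    root = ET.fromstring(ncml_text)
    agg = root.find("nc:aggregation", NCML_NS)
    if agg is None or agg.get("type") != "joinExisting":
        raise ValueError("expected a joinExisting aggregation")
    locations = [el.get("location") for el in agg.findall("nc:netcdf", NCML_NS)]
    return agg.get("dimName"), locations


def open_join_existing(ncml_text):
    """Hypothetical helper: open the aggregation as a single xarray.Dataset."""
    import xarray as xr  # imported lazily so the parser works without xarray

    dim, paths = parse_join_existing(ncml_text)
    # Concatenate the files, in the listed order, along the existing dimension.
    return xr.open_mfdataset(paths, combine="nested", concat_dim=dim)


EXAMPLE = """
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="time" type="joinExisting">
    <netcdf location="jan.nc" />
    <netcdf location="feb.nc" />
  </aggregation>
</netcdf>
"""

print(parse_join_existing(EXAMPLE))
```

The parsing step needs only the standard library; everything specific to xarray is confined to the one `open_mfdataset` call.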

I think it would be great if we could support the ncml spec in xarray, allowing us to write code like

ds = xr.open_ncml('file.ncml')

This idea has been discussed before in #893. Perhaps its time has finally come.

@shoyer
Member

shoyer commented Jan 22, 2019

+1 for adding this to xarray. to_ncml would also be nice to have.

@andersy005
Member

Any updates regarding this?

A while ago @rabernat mentioned that @dopplershift was potentially interested in working on implementing this feature in xarray in pangeo-data/esgf2xarray#1 (comment)

I am interested in helping out with getting this feature in xarray. I tried finding Python tools that provide NcML functionality and the ones I found namely:

seem to be outdated and unmaintained.

In the meantime, I've been experimenting with some basics of NcML: https://nbviewer.jupyter.org/github/NCAR/xncml/blob/master/docs/source/tutorial.ipynb

With guidance, input, and feedback on what the API is expected to look like in xarray, I'd be more than happy to work on this moving forward.

@dopplershift
Contributor

I haven't had any time to start on this (and I'm a few more weeks out), so feel free to take a cut. I'm not sure what @shoyer or @rabernat have in mind for API.

@shoyer
Member

shoyer commented Apr 19, 2019

I have not thought much about APIs yet.

@huard
Contributor

huard commented Sep 3, 2020

I'd like to revive this issue.
We're increasingly using NcML aggregations within our THREDDS server to create "logical" datasets. This allows us to fix some non-CF-conforming metadata fields without changing files on disk (which would break syncing with ESGF nodes). More importantly, by aggregating multiple time periods, variables and realizations, we're able to create catalog entries for simulations instead of files, which we expect will greatly facilitate parsing catalog search results. We'd like to offer the same aggregation functionality outside of the THREDDS server.
Ideally, this would be supported right from the netcdf-c library (see Unidata/netcdf-c#1478), but an xarray NcML backend is the second-best option. I also imagine that NcML files could be used as a clean mechanism to create Zarr/NCZarr objects, i.e.:
*.nc -> open_ncml -> xr.Dataset -> to_zarr -> Zarr store
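
The metadata fixes mentioned above (correcting attributes without touching the files on disk) could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, only `<attribute>` elements at the top level and directly under `<variable>` are read, and every value is treated as a string, whereas NcML also allows typed values.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2}"


def parse_attribute_overrides(ncml_text):
    """Return (global_attrs, {variable_name: attrs}) declared in an NcML document."""
    root = ET.fromstring(ncml_text)
    global_attrs = {
        a.get("name"): a.get("value") for a in root.findall(NS + "attribute")
    }
    variable_attrs = {
        v.get("name"): {
            a.get("name"): a.get("value") for a in v.findall(NS + "attribute")
        }
        for v in root.findall(NS + "variable")
    }
    return global_attrs, variable_attrs


def apply_overrides(ds, global_attrs, variable_attrs):
    """Update attributes on an in-memory xarray.Dataset; source files stay unchanged."""
    ds.attrs.update(global_attrs)
    for name, attrs in variable_attrs.items():
        ds[name].attrs.update(attrs)
    return ds


# Illustrative document; the file and variable names are made up.
EXAMPLE = """
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2" location="sim.nc">
  <attribute name="Conventions" value="CF-1.8" />
  <variable name="tas">
    <attribute name="units" value="K" />
  </variable>
</netcdf>
"""

print(parse_attribute_overrides(EXAMPLE))
```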

@andersy005 In terms of API, I think the need is not so much to create or modify NcML files, but rather to return an xarray.Dataset from an NcML description. My understanding is that open_ncml would be a wrapper around open_mfdataset. My hope is that NcML-based xarray.Dataset objects would behave similarly whether they are created from files on disk through xarray.open_ncml('sim.ncml') or xarray.open_dataset('https://.../thredds/sim.ncml').

The THREDDS repo contains a number of unit tests that could be emulated to steer the Python implementation. My understanding is that getting this done could involve a fair amount of work, so I'd like to see who's interested in collaborating on this and maybe schedule a meeting to plan work for this year or the next.

@rabernat
Contributor Author

rabernat commented Sep 3, 2020

Thanks for reviving this @huard!

FWIW, I think it's best for this sort of utility to live in its own small standalone package, which I have referred to as "xarray-mergetool" in the past. NcML could be one special case of the things it could do. It would also be very useful for intake-esm.

We have also discussed this in NCAR/esm-collection-spec#12

We should have some bandwidth to work on this over the next year via the pangeo-forge project.

@jdha

jdha commented Sep 3, 2020 via email

@rsignell-usgs

rsignell-usgs commented May 5, 2021

It's worth pointing out that you can create FileReferenceSystem JSON to accomplish many of the tasks we used to use NcML for:

  • create a single virtual dataset that points to a collection of files
  • modify dataset and variable attributes

It also has the nice feature that it makes your dataset faster to work with on the cloud because the map to the data is loaded in one shot!
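
For readers unfamiliar with the format (implemented in fsspec as `ReferenceFileSystem`), here is a rough sketch of what such a reference set looks like and how a dataset attribute can be changed without touching the source files. The URLs, byte offsets, and lengths are invented for illustration, the per-array metadata keys are left out, and the two helper functions are hypothetical.

```python
import json

# Hand-written, illustrative reference set (spec version 1).
# URLs and byte ranges are made up.
reference = {
    "version": 1,
    "refs": {
        # Zarr metadata is stored inline as JSON text.
        ".zgroup": json.dumps({"zarr_format": 2}),
        ".zattrs": json.dumps({"title": "raw model output"}),
        # Each chunk key maps to [url, offset, length] inside an original file.
        "temp/0": ["s3://bucket/jan.nc", 4096, 1024],
        "temp/1": ["s3://bucket/feb.nc", 4096, 1024],
        # "temp/.zarray" and "temp/.zattrs" are omitted here for brevity.
    },
}


def update_global_attrs(reference, **attrs):
    """Return a copy of the reference set with updated group-level attributes."""
    refs = dict(reference["refs"])
    merged = json.loads(refs.get(".zattrs", "{}"))
    merged.update(attrs)
    refs[".zattrs"] = json.dumps(merged)
    return {**reference, "refs": refs}


def source_files(reference):
    """List the distinct files that the chunk references point to."""
    return sorted({v[0] for v in reference["refs"].values() if isinstance(v, list)})


fixed = update_global_attrs(reference, Conventions="CF-1.8")
print(fixed["refs"][".zattrs"])
print(source_files(fixed))
```

In practice the JSON is usually generated by a tool rather than written by hand, and is then opened through fsspec's `reference://` protocol with the Zarr engine.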

@huard
Contributor

huard commented Jul 6, 2022

I've got a first draft that parses an NcML document and spits out an xarray.Dataset. It does not cover all the NcML syntax, but the essential elements are there.

It uses xsdata to parse the XML, using a data model automatically generated from the NcML 2.2 schema. I've scraped test files from the netcdf-java repo to create a test suite.

What would be the best place to host the code, tests, and test data so others can give it a spin?

@shoyer
Member

shoyer commented Jul 6, 2022 via email

@huard
Contributor

huard commented Jul 6, 2022

OK, another option would be to add that to xncml.

@andersy005 What do you think?

@andersy005
Member

> OK, another option would be to add that to xncml.
>
> @andersy005 What do you think?

@huard, I haven't touched the codebase in that repo for three years 😃... So I'm happy to transfer the xncml repo to the xarray-contrib org and give access to you and anyone else who wants it.

@huard
Contributor

huard commented Jul 7, 2022

@andersy005 Sounds good!

@vietnguyengit

Hi everyone, I've hit a problem where I need to read NcML into xarray, which brought me here. Are there any updates on this?

P.S. xncml is broken at the moment.

Thank you.

@keewis
Collaborator

keewis commented Nov 24, 2022

I'd assume that xncml has never been released (there's an issue suggesting the release of version 0.1), so there's no package on PyPI. You can try installing from GitHub:

pip install git+https://github.com/xarray-contrib/xncml.git

to see if that gives you something to work with. Otherwise, I'd wait for one of the devs to get back to you (most likely in the issue you opened on the xncml repo).

@vietnguyengit

Thanks @keewis, that's right. It looks like they are still working on the docs, which was confusing.

@huard
Contributor

huard commented Nov 24, 2022

That's right. I just did a quick 0.1 release of xncml, most likely rough around the edges. Give it a spin. PRs most welcome.

@rabernat If you're happy with it, this issue can probably be closed.

@keewis
Collaborator

keewis commented May 29, 2023

closing, since anything still missing should be feature requests for xncml

@keewis keewis closed this as completed May 29, 2023

9 participants