Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use xarray's encode_cf / decode_cf functions to handle CF conventions #157

Open
TomNicholas opened this issue Jun 24, 2024 · 5 comments
Open
Labels
CF conventions encoding xarray Requires changes to xarray upstream

Comments

@TomNicholas
Copy link
Member

We might be able to replace some logic (especially that implemented in #156) with a call to xarray.decode_cf. This function can accept a Dataset and return a new dataset with different attributes and encoding etc. I think it's used internally in some of xarray's backends.

Also there is an idea to expose a corresponding xarray.encode_cf function in xarray, which we might also be able to use (see pydata/xarray#4412).

@TomNicholas TomNicholas added xarray Requires changes to xarray upstream encoding labels Jun 24, 2024
@TomNicholas
Copy link
Member Author

I've been looking at this more, in the context of fixing #189.

I think we could probably almost use xarray.decode_cf out of the box, with a few subtleties:

Aligning what we're doing in virtualizarr with this would be another reason to implement a virtualizarr xarray backend engine, see #35.

cc @ayushnag

@ayushnag
Copy link
Contributor

I think the idea of an virtualizarr backend is appealing. One idea for how it can be implemented is by loading the actual data file and then also creating and storing the byte references in the virtual dataset. This way the dataset structure creation and data loading is handled by xarray, then the bytes are just an add on. This all hinges on if data loading and reference creation both doesn't take much extra time compared to doing just one.

As a side note, this might make the reference creation process (#87) simpler as well. Instead of searching for attrs, encoding, dimension names, etc the "chunk reader" only needs to create a low level chunk manifest (bytes, offset, path). The rest of the information is retrieved from the netcdf by xarray. Not sure if that is an actually time consuming part of reference creation however.

It would also add the ability to easily inline data (#62)

Basically the idea is that xr.open_dataset("data.nc", engine="virtualizarr") loads the netcdf file normally but then also reads byte ranges and creates ManifestArrays. Since I don't think it's possible to have two data arrays within one variable, perhaps all the data arrays will be replaced with ManifestArrays unless the top level params ask for data such as loadable_variables and cftime_variables

@TomNicholas
Copy link
Member Author

Thanks @ayushnag . I think your suggestion is almost the opposite of what I'm suggesting doing in this issue, so I moved it to #221 to discuss there.

(Opposite in that I'm suggesting calling just part of xarray's open_dataset decoding logic from within open_virtual_dataset, whereas IIUC you are suggesting basically using all of xr.open_dataset and making it virtual as a final step.)

@ayushnag
Copy link
Contributor

I think your suggestion makes sense and yes my suggestion was more related to the backend issue.

My only concern was that if the solution is to add this to open_virtual_dataset it can get tricky to essentially reimplement many parts of xarray and then also maintain them with upstream xarray changes. That's why I was suggesting opening the file and then adding references.

Although if decode_cf is the main portion left then the reimplementations are limited and your approach definitely makes more sense. Is the plan to add a modified copy of this function to virtualizarr?

@TomNicholas
Copy link
Member Author

it can get tricky to essentially reimplement many parts of xarray and then also maintain them with upstream xarray changes.

I completely agree, which is why I've been looking into which parts of xarray we need. I think it's just decode_cf though (and actually only very limited parts of that).

Is the plan to add a modified copy of this function to virtualizarr?

Either that or make some small modifications to xarray upstream so that we can import and use xr.decode_cf directly. See also pydata/xarray#4490, which talks about making xarray's interface for decoding clearer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CF conventions encoding xarray Requires changes to xarray upstream
Projects
None yet
Development

No branches or pull requests

2 participants