
writing sparse to netCDF #4156

Open
dcherian opened this issue Jun 15, 2020 · 7 comments
Labels
topic-arrays related to flexible array support

Comments

@dcherian
Contributor

I haven't looked at this too closely, but it appears to be a way to save MultiIndexed datasets to netCDF. So we may be able to do sparse -> multiindex -> netCDF

http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#compression-by-gathering

cc @fujiisoup

@fujiisoup
Member

@dcherian
Though I have no experience with this gather compression, it looks like python-netcdf4 does not have this function implemented.

One thing we can do is
sparse -> multiindex -> reset_index -> netCDF
or maybe we can even add a function that skips constructing a multiindex and just makes flattened index arrays from a sparse array.

@dcherian
Contributor Author

Yes, I think we will have to "encode" to something like this example from the CF conventions:

dimensions:
  lat=73;
  lon=96;
  landpoint=2381;
  depth=4;
variables:
  int landpoint(landpoint);
    landpoint:compress="lat lon";
  float landsoilt(depth,landpoint);
    landsoilt:long_name="soil temperature";
    landsoilt:units="K";
  float depth(depth);
  float lat(lat);
  float lon(lon);
data:
  landpoint=363, 364, 365, ...;

and then write that "encoded" dataset to file.
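For concreteness, here is a rough sketch of what building that "encoded" dataset could look like in xarray. The variable and dimension names follow the CDL above; the data, sizes, and threshold are made up for illustration, and the `to_netcdf` call is commented out since it needs a netCDF backend.

```python
import numpy as np
import xarray as xr

lat, lon, depth = 3, 4, 2
rng = np.random.default_rng(0)
full = rng.standard_normal((depth, lat, lon))
mask = full[0] > 0.5              # pretend these (lat, lon) points are "land"

land_idx = np.flatnonzero(mask)   # flattened (lat, lon) indices of the kept points
landpoint = xr.DataArray(
    land_idx, dims="landpoint", attrs={"compress": "lat lon"}
)
landsoilt = xr.DataArray(
    full.reshape(depth, -1)[:, land_idx],
    dims=("depth", "landpoint"),
    attrs={"long_name": "soil temperature", "units": "K"},
)
encoded = xr.Dataset({"landsoilt": landsoilt}, coords={"landpoint": landpoint})
# encoded.to_netcdf("gathered.nc")  # writes the gathered layout to file
```

Decoding would invert this: unravel `landpoint` back into `(lat, lon)` pairs using the `compress` attribute and scatter the data into a dense or sparse array.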

@fujiisoup
Member

Do we already have a similar encoding (and decoding) scheme for writing (and reading) data?
(Does CFTime use a similar scheme?)
I think we don't have a scheme to save a MultiIndex yet and need to convert manually with reset_index.
#1077

Maybe we can decide this encoding-decoding API before #1603.

@dschwoerer
Contributor

I have hacked together something that supports reading and writing sparse arrays to a netCDF file; however, I didn't know how or where to fit this into xarray.

import numpy as np
import sparse
import xarray as xr


def ds_to_netcdf(ds, fn):
    """Write a Dataset that may contain sparse variables to netCDF."""
    dsorg = ds
    ds = dsorg.copy()
    for v in ds:
        # Detect sparse variables (COO, or GCXS via to_coo) by duck typing.
        if hasattr(ds[v].data, "nnz") and (
            hasattr(ds[v].data, "to_coo") or hasattr(ds[v].data, "linear_loc")
        ):
            coord = f"_{v}_xarray_index_"
            assert coord not in ds
            data = ds[v].data
            if hasattr(data, "to_coo"):
                data = data.to_coo()
            # Store the flattened indices of the nonzero entries, following
            # the CF "compression by gathering" convention.
            ds[coord] = coord, data.linear_loc()
            dims = ds[v].dims
            ds[coord].attrs["compress"] = " ".join(dims)
            at = ds[v].attrs
            ds[v] = coord, data.data
            ds[v].attrs = at
            ds[v].attrs["_fill_value"] = str(data.fill_value)
            # Record original dimension lengths so the decoder can
            # reconstruct the full shape.
            for d in dims:
                if d not in ds:
                    ds[f"_len_{d}"] = len(dsorg[d])

    ds.to_netcdf(fn)
def xr_open_dataset(fn):
    """Open a netCDF file, decoding variables written by ds_to_netcdf."""
    ds = xr.open_dataset(fn)

    def fromflat(shape, i):
        # Inverse of linear_loc: unravel flat indices into per-dimension indices.
        index = []
        for fac in shape[::-1]:
            index.append(i % fac)
            i //= fac
        return tuple(index[::-1])

    for c in ds.coords:
        if "compress" in ds[c].attrs:
            # Only decode coordinates named "_{v}_xarray_index_".
            vs = c.split("_")
            if len(vs) < 5:
                continue
            if vs[-1] != "" or vs[-2] != "index" or vs[-3] != "xarray":
                continue
            v = "_".join(vs[1:-3])
            dat = ds[v].data
            fill = ds[v].attrs.pop("_fill_value", None)
            if fill is not None:
                # Values that str()/dtype round-tripping cannot handle directly.
                knownfails = {"nan": np.nan, "False": False, "True": True}
                if fill in knownfails:
                    fill = knownfails[fill]
                else:
                    fill = dat.dtype.type(fill)
            dims = ds[c].attrs["compress"].split()
            shape = []
            for d in dims:
                try:
                    shape.append(len(ds[d]))
                except KeyError:
                    shape.append(int(ds[f"_len_{d}"].data))
                    ds = ds.drop_vars(f"_len_{d}")

            locs = fromflat(shape, ds[c].data)
            data = sparse.COO(locs, dat, shape=shape, fill_value=fill)
            ds[v] = dims, data, ds[v].attrs, ds[v].encoding
    return ds

Has there been any progress since last year?

@dcherian
Contributor Author

There is a more standards-compliant version here: #1077 (comment)

This is still blocked on choosing which CF representation to use for sparse vs which one to use for MultiIndex.

@dcherian
Contributor Author

dcherian commented Jan 8, 2024

xref this comment thread: #3213 (comment)

@renecotyfanboy

Coming from #8599

To answer @dcherian

Which sparse format are you using?

I am mostly using the COO or CSR/CSC formats, but mostly COO.

Do any of the CF-compliant representations work better for your use case?

I did a custom workaround by simply saving the coords and data in a (ndim x 1 x npoints) array (the COO layout). It seems similar to the first approach you mentioned from the CF conventions.
