Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opening zarr dataset doesn't return as expected #521

Closed
artttt opened this issue May 6, 2022 · 19 comments · Fixed by #522
Closed

opening zarr dataset doesn't return as expected #521

artttt opened this issue May 6, 2022 · 19 comments · Fixed by #522
Labels
bug Something isn't working

Comments

@artttt
Copy link

artttt commented May 6, 2022

Problem description

I've been experimenting with the gdal zarr driver and rioxarray.open_rasterio doesn't return as expected.

Rather then getting the expected DataArray the output is a list of 2 xarray.Datasets that represent the x and y dimensions.

Maybe im missing something but i cant see a way to get the actual data.

However if I remove this section of code it works fine.

rioxarray/rioxarray/_io.py

Lines 854 to 868 in dc8efe3

if riods.subdatasets:
return _load_subdatasets(
riods=riods,
group=group,
variable=variable,
parse_coordinates=parse_coordinates,
chunks=chunks,
cache=cache,
lock=lock,
masked=masked,
mask_and_scale=mask_and_scale,
decode_times=decode_times,
decode_timedelta=decode_timedelta,
**open_kwargs,
)

This should reproduce the issue.

from affine import Affine
import numpy as np
import rioxarray
import rasterio

o = np.ones((4501, 3039),np.float32)

w_settings = {'driver': 'Zarr',
              'dtype': 'float32',
              'nodata': -9999,
              'width': 3039,
              'height': 4501,
              'count': 1,
              'crs': 3857,
              'transform': Affine(25.0, 0.0, 16556975.0,0.0, -25.0, -4179000.0),
              'blockxsize': 5760,
              'blockysize': 5760,
              'tiled': True}
zarr_settings = dict(
                COMPRESS='BLOSC',
                BLOCKSIZE='5760,5760',
                BLOSC_CNAME="lz4",
                BLOSC_CLEVEL=5,
                BLOSC_SHUFFLE="BYTE",
                #BLOSC_BLOCKSIZE=0,
                BLOSC_NUM_THREADS="ALL_CPUS",
                ARRAY_NAME='data',
                #FORMAT='ZARR_V3',
                )

with rasterio.open(
    'test/test',
    'w',
    **w_settings,
    **zarr_settings
) as dst:
    dst.write(o, 1)
    
da = rioxarray.open_rasterio("test/test")

Before you ask, yes i know there are other ways of working with zarrs in xarray but doing it through rioxarray has some advantages such as being able to wrap it in a warpedvrt...which incidentally works ( i guess because gdal ignores the subdatasets when it wraps the zarr in a vrt)

import rasterio
import rioxarray
from rasterio.vrt import WarpedVRT
with rasterio.open("test/test") as src:
    with WarpedVRT(src) as vrt:
        with rasterio.io.MemoryFile() as memfile:
            rasterio.shutil.copy(vrt, memfile.name, driver="VRT")
            new_Vrt = memfile.read().decode("utf-8")

ds_vrt_out = rioxarray.open_rasterio(new_Vrt)

Environment Information

rioxarray (0.11.1) deps:
rasterio: 1.2.10
xarray: 2022.3.0
GDAL: 3.4.1
GEOS: None
PROJ: None
PROJ DATA: None
GDAL DATA: None

Other python deps:
scipy: 1.8.0
pyproj: 3.3.1

System:
python: 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
executable: /env/bin/python
machine: Linux-5.4.181-99.354.amzn2.x86_64-x86_64-with-glibc2.29

@artttt artttt added the bug Something isn't working label May 6, 2022
@snowman2
Copy link
Member

snowman2 commented May 6, 2022

Driver: Zarr/Zarr
Files: test/test
Size is 3039, 4501
Coordinate System is:
PROJCRS["WGS 84 / Pseudo-Mercator",
    BASEGEOGCRS["WGS 84",
        ENSEMBLE["World Geodetic System 1984 ensemble",
            MEMBER["World Geodetic System 1984 (Transit)"],
            MEMBER["World Geodetic System 1984 (G730)"],
            MEMBER["World Geodetic System 1984 (G873)"],
            MEMBER["World Geodetic System 1984 (G1150)"],
            MEMBER["World Geodetic System 1984 (G1674)"],
            MEMBER["World Geodetic System 1984 (G1762)"],
            MEMBER["World Geodetic System 1984 (G2139)"],
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ENSEMBLEACCURACY[2.0]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4326]],
    CONVERSION["Popular Visualisation Pseudo-Mercator",
        METHOD["Popular Visualisation Pseudo Mercator",
            ID["EPSG",1024]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["False easting",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["easting (X)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["northing (Y)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Web mapping and visualisation."],
        AREA["World between 85.06°S and 85.06°N."],
        BBOX[-85.06,-180,85.06,180]],
    ID["EPSG",3857]]
Data axis to CRS axis mapping: 2,1
Origin = (16556975.000000000000000,-4179000.000000000000000)
Pixel Size = (25.000000000000000,-25.000000000000000)
Subdatasets:
  SUBDATASET_1_NAME=ZARR:"test/test":/X
  SUBDATASET_1_DESC=Array /X
  SUBDATASET_2_NAME=ZARR:"test/test":/Y
  SUBDATASET_2_DESC=Array /Y
Corner Coordinates:
Upper Left  (16556975.000,-4179000.000) ( 37d32'26.14"W, 81d28'10.77"N)
Lower Left  (16556975.000,-4291525.000) ( 38d33' 5.13"W, 81d28'10.77"N)
Upper Right (16632950.000,-4179000.000) ( 37d32'26.14"W, 81d34'13.08"N)
Lower Right (16632950.000,-4291525.000) ( 38d33' 5.13"W, 81d34'13.08"N)
Center      (16594962.500,-4235262.500) ( 38d 2'45.64"W, 81d31'12.46"N)
Band 1 Block=5760x5760 Type=Float32, ColorInterp=Undefined
  NoData Value=-9999

@artttt
Copy link
Author

artttt commented May 6, 2022

Ahh sorry.
The writing from rasterio is working fine (that was just me setting up a test dataset)

The reading in rioxarray is where im seeing an issue

da = rioxarray.open_rasterio("test/test")
print(da)
[<xarray.Dataset>
Dimensions:      (band: 1, x: 3039, y: 1)
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 1.656e+07 1.656e+07 ... 1.663e+07 1.663e+07
  * y            (y) float64 1.656e+07
    spatial_ref  int64 0
Data variables:
    X            (band, y, x) float64 ..., <xarray.Dataset>
Dimensions:      (band: 1, x: 4501, y: 1)
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 -4.179e+06 -4.179e+06 ... -4.291e+06 -4.292e+06
  * y            (y) float64 -4.179e+06
    spatial_ref  int64 0
Data variables:
    Y            (band, y, x) float64 ...]

@snowman2
Copy link
Member

snowman2 commented May 6, 2022

That was just notes for the investigation. The subdatasets are why there are troubles:

Subdatasets:
  SUBDATASET_1_NAME=ZARR:"test/test":/X
  SUBDATASET_1_DESC=Array /X
  SUBDATASET_2_NAME=ZARR:"test/test":/Y
  SUBDATASET_2_DESC=Array /Y

It is strange that the X/Y coordinates have their own subdataset ...

@snowman2
Copy link
Member

snowman2 commented May 6, 2022

https://gdal.org/drivers/raster/zarr.html#particularities-of-the-classic-raster-api

If the Zarr dataset contains one single array with 2 dimensions, it will be exposed as a regular GDALDataset when using the classic raster API. If the dataset contains more than one such single array, or arrays with 3 or more dimensions, the driver will list subdatasets to access each array and/or 2D slices within arrays with 3 or more dimensions.

@snowman2
Copy link
Member

snowman2 commented May 6, 2022

The tests for the Zarr driver mainly use the multiimensional API:
https://github.com/OSGeo/gdal/blob/master/autotest/gdrivers/zarr_driver.py

This is currently not supported by rasterio unfortunately.

Related: #174

@snowman2
Copy link
Member

snowman2 commented May 6, 2022

However if I remove this section of code it works fine.

Usually there isn't data in the main raster when subdatasets exist ... I wonder if that is the intended behavior?

@artttt
Copy link
Author

artttt commented May 6, 2022

A simple fix might be to only switch to reading subsets if they exist and there is no data in the main raster.

@snowman2
Copy link
Member

snowman2 commented May 6, 2022

A simple fix might be to only switch to reading subsets if they exist and there is no data in the main raster.

This seems to be a driver specific issue. When opening test/test_data/input/tmmx_20190121.nc, it has data in the main raster and the subdatasets.

>>> import rasterio
>>> rds = rasterio.open("tmmx_20190121.nc")
>>> rds_sub = rasterio.open('NETCDF:"tmmx_20190121.nc":air_temperature')
>>> assert (rds.read() == rds_sub.read()).all()
>>> assert rds.crs == rds_sub.crs

Though, most of the time, there isn't data in the main raster when there are subdatasets.

@snowman2
Copy link
Member

snowman2 commented May 6, 2022

#522 should hopefully address this specific scenario without impacting other datasets.

@snowman2
Copy link
Member

snowman2 commented May 6, 2022

@rouault, do you had thoughts on this (since you implemented the Zarr driver in GDAL)? Is there something we're missing? Is the behavior expected?

@rouault
Copy link

rouault commented May 6, 2022

Is the behavior expected?

There's no well defined rule on whether a main dataset that has a band (that is not just a place holder for subdatasets) should expose itself in the subdatasets it exposes. This can vary between drivers. But here the driver should probably hide the X and Y variables as it recognizes them to build a geotransform, and they don't really bring any value as being exposed as subdatasets. You can open a ticket in OSGeo/gdal about that

@artttt
Copy link
Author

artttt commented May 6, 2022

That fix seems very specific

This would be problematic for test/test_data/input/tmmx_20190121.nc as well and not solved by that fix?

maybe open_rasterio needs to give the user an option to select if they want subdatasets or main datasets.

In contrast to my test setup above im using open_rasterio with zarrs that are produced by xarray.to_zarr. In those cases the subdatasets may not be called x and y and the crs wont be recognised by gdal unless the user does a little manipulation of the zarr
like this

z = zarr.open(store)
z.data.attrs['_CRS'] = {'wkt':z.spatial_ref.attrs['crs_wkt']}
#and probably rebuild consolidated metadata as well

This could be done at zarr creation time but xarray.to_zarr throws an error when writing attributes that are dictionaries as GDAL wants for the CRS.

@snowman2
Copy link
Member

snowman2 commented May 6, 2022

This would be problematic for test/test_data/input/tmmx_20190121.nc as well

It appears the same data in the base dataset is the one in the subdataset, so no issues there.

@snowman2
Copy link
Member

snowman2 commented May 6, 2022

Thanks @rouault 👍, I opened a ticket: OSGeo/gdal#5681

@snowman2
Copy link
Member

snowman2 commented May 6, 2022

There's no well defined rule on whether a main dataset that has a band (that is not just a place holder for subdatasets) should expose itself in the subdatasets it exposes.

Is there a scenario in the GDAL test suite where there is a band in the main dataset with data and has subdatasets?

@rouault
Copy link

rouault commented May 6, 2022

Is there a scenario in the GDAL test suite where there is a band in the main dataset with data and has subdatasets?

For example, the NITF and GTiff drivers can do that when there are multiple images in a file. GDALOpen() on such files will return a dataset with the first image, and a list of subdatasets listing all images (_SUBDATASET_1 will be the first image)

$ gdal_translate autotest/gcore/data/byte.tif test.tif
$ gdal_translate autotest/gcore/data/byte.tif test.tif -co APPEND_SUBDATASET=YES -outsize 200% 200%
$ gdalinfo test.tif
Driver: GTiff/GeoTIFF
Files: test.tif
Size is 20, 20
[...]
Subdatasets:
  SUBDATASET_1_NAME=GTIFF_DIR:1:test.tif
  SUBDATASET_1_DESC=Page 1 (20P x 20L x 1B)
  SUBDATASET_2_NAME=GTIFF_DIR:2:test.tif
  SUBDATASET_2_DESC=Page 2 (40P x 40L x 1B)

@snowman2
Copy link
Member

snowman2 commented May 6, 2022

and a list of subdatasets listing all images (_SUBDATASET_1 will be the first image)

If I understand correctly, defaulting to only reading subdatasets in this scenario is safe as as _SUBDATASET_1 will be the same as reading the main dataset.

@snowman2
Copy link
Member

Reverting the patch as it is best fixed in GDAL: #523

@artttt
Copy link
Author

artttt commented May 12, 2022

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants