
Data Cube Extension: Variables and more #713

Closed
m-mohr opened this issue Jan 9, 2020 · 33 comments

m-mohr commented Jan 9, 2020

Two things came up recently that could be integrated into the data cube extension:

  1. Add variables in addition to dimensions. Some data cubes expose variables, some don't. We don't need this for openEO (yet?), but Google Earth Engine (@simonff) would probably use them. I'm also looking at netCDF and other formats, which as far as I know support variables in addition to dimensions. Maybe there's also space for alignment with the ESM collection spec "fork" from @rabernat.

  2. For dimensions it might be useful to specify the number of cells (see openEO UDF discussions).
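To make both points concrete, here is a minimal sketch (hypothetical field names, not real netCDF or STAC syntax) of the netCDF-style data model, where variables are defined over named dimensions rather than replacing them, and each dimension carries its number of cells:

```python
# Hypothetical sketch of the netCDF data model: dimensions carry a cell
# count, and each variable declares which dimensions it is defined over.
dimensions = {"time": 1800, "lat": 192, "lon": 288, "bnds": 2}

variables = {
    "tas": {"dims": ("time", "lat", "lon"), "dtype": "float32", "units": "K"},
    "lat_bnds": {"dims": ("lat", "bnds"), "dtype": "float64"},
}

def cell_count(name):
    """Total number of cells a variable spans, from its dimensions' sizes."""
    n = 1
    for dim in variables[name]["dims"]:
        n *= dimensions[dim]
    return n
```

The point of the sketch is that the two proposals fit together: once dimensions record their cell counts, the size of any variable follows from the dimensions it references.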

@m-mohr m-mohr added this to the 1.0.0 milestone Jan 9, 2020
@m-mohr m-mohr self-assigned this Jan 9, 2020
m-mohr commented Jan 9, 2020

Re 1: GEE uses the following fields for variables:

  • name
  • description
  • data type (double, string, ...)
  • sometimes (i.e., should be optional) "precise or estimated min/max" are available; sometimes "we also have values defined on our current representation of properties, but as far as I've seen we don't actually use it"
  • sometimes units

rabernat commented Jan 9, 2020

Thanks for opening the discussion @m-mohr. I'll link to esm-collection-spec here: https://github.com/NCAR/esm-collection-spec/

This is something we hacked together to provide a STAC-inspired catalog of our cloud-based Zarr climate model data. This is how we are currently cataloging the Google Cloud CMIP6 data. (More technical blog post here.)

In the long term, we would love for this to actually become a valid STAC catalog. I'd welcome anyone's thoughts on the best roadmap to achieve this.

m-mohr commented Jan 9, 2020

@rabernat I think this will take us multiple steps and probably two extensions or so. First we would probably work together to align the data cube extension to be flexible enough for your use case and then add another extension for additional domain-specific and/or format-specific things. One primary question to answer is probably whether your data would better be a STAC Item or Collection.

For the data cube extension I'd need to understand your requirements and why you did what you did in the ESM spec. I guess the attributes are what would translate to what I call "variables" here, but I don't really understand that vocabulary thing in the ESM spec. Also, why is it "external"?

The other fields would probably be part of another extension, which would probably be mostly copy & paste with some restructuring. What is this CSV file about? Is there a reason for it being CSV instead of JSON?

cholmes commented Apr 14, 2020

I'd really love to get ESM and STAC aligned, and I think the time is now as we're going to go 1.0-beta soon. I agree with the path @m-mohr lays out: first get the data cube extension to be flexible enough, and then add another extension for the specifics.

@rabernat - could you help Matthias understand your requirements? Perhaps we could jump on a call sometime soon and try to sort it out? Take a crack at a STAC+datacube+extension version of ESM?

From our discussions earlier, it seems hard to really map this to an Item. But we don't have assets at the collection level, so I could see something like a collection that has one Item holding the asset links. Conceptually it feels more at the level of a collection, but one with a very expansive Item.

m-mohr commented Apr 14, 2020

In the next weeks I'll have a look at

  • openEO
  • netCDF / ncoJSON / cfJSON
  • CoverageJSON
  • OpenDataCube
  • ESM Collection spec, if @rabernat can support

and probably some more data cube things and will try to align them. Any help is highly appreciated. I'm not so much into many of these formats.

@rabernat

Thanks for keeping this discussion alive, and sorry for my slow responses. I'll tag a few more Pangeo folks to help move the conversation along: @jhamman and @andersy005 (who maintain the ESM collection spec and Intake ESM).

Let's use the CMIP6 Google Cloud data as a representative use case. This dataset consists of about 100,000 distinct Zarr Groups, each formatted following the NetCDF data model. Opening a single group in Xarray returns something like this. Here the main data variable is tas, surface air temperature:
(screenshot of the xarray repr)

This object, with dimensions time, lat, lon, is already a data cube. Other variables may have additional dimensions, e.g. height, or may not use lat-lon coordinates (instead using some sort of curvilinear mesh to cover the sphere).

A principal contrast between CMIP6 and most other datasets I've seen in STAC is that all of the CMIP6 data is completely global in extent. It's not at all interesting or important for us to know the bounding box of the data. Furthermore, the time range may use non-standard calendars common in climate modeling, such as 360-day, no-leap, etc., which are impossible to encode using STAC.

The challenge for users of CMIP6 data is not finding a particular spatial or temporal extent; rather, it is filtering the >100,000 datasets according to variable, scenario, modeling center, etc. There is no inherent hierarchy to these attributes, so they can't be nested. The ESGF uses a custom search API to solve this problem. For the cloud data, the simplest solution we could come up with is a flat table, stored as a .csv file, with all of the relevant attributes for each dataset, e.g.

(screenshot of the catalog table from https://catalog.pangeo.io/browse/master/climate/cmip6_gcs/)
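To make the faceted-filtering idea concrete, a flat catalog like this can be queried with nothing more than a CSV reader. The rows, paths, and `search` helper below are hypothetical, not the real catalog:

```python
import csv
import io

# Hypothetical miniature of the flat catalog CSV; the real CMIP6 catalogs
# use columns like activity_id, source_id, experiment_id, variable_id, path.
CATALOG = """\
activity_id,source_id,experiment_id,variable_id,path
CMIP,TaiESM1,1pctCO2,clt,gs://bucket/clt.zarr
CMIP,TaiESM1,historical,tas,gs://bucket/tas-hist.zarr
ScenarioMIP,TaiESM1,ssp585,tas,gs://bucket/tas-ssp585.zarr
"""

def search(catalog_text, **query):
    """Return the rows whose columns match every key/value pair in `query`."""
    rows = csv.DictReader(io.StringIO(catalog_text))
    return [row for row in rows if all(row[k] == v for k, v in query.items())]

hits = search(CATALOG, variable_id="tas", experiment_id="historical")
```

Because the table is flat and tidy, every attribute is an equal filtering facet; no hierarchy has to be baked into the layout.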

What is this CSV file about? Is there a reason for it being CSV instead of JSON?

Because it is not hierarchical and therefore cannot be nested, size is a major concern. These CSV files are already ~50 MB. JSON is much larger to store and slower to parse for this sort of flat, tidy data. But supporting JSON would certainly be possible.

We would be happy to set up a call to discuss the details. I have some availability Wednesday morning (EST) and Friday afternoon.

jhamman commented Apr 14, 2020

I'd also be happy to join a discussion to see how we can move these efforts along/together. Wednesday afternoon (except 1-2p PT) and Friday after 1p PT are free for me.

JSON is much larger to store and slower to parse for this sort of flat, tidy data. But supporting JSON would certainly be possible.

This is possible now following NCAR/esm-collection-spec#15. I find this single-file collection to be the easiest example to comprehend: https://github.com/NCAR/esm-collection-spec/blob/master/collection-spec/examples/sample-collection-with-catalog-dict.json

m-mohr commented Apr 14, 2020

Thanks for all the information. I'll go through it in the next days and try to figure out how to combine both specs without breaking too much on either side. I don't think we need to put the catalog into JSON though; I guess this could simply be a new link type.

I'm happy to join a call, but 1pm PDT is 10pm here in Europe, so that's too late unfortunately. Would be better to find something in the morning (PDT) next week or so.

cholmes commented Apr 14, 2020

I could do monday 8am-10am pacific, or tuesday 7:30-8:30am pacific.

m-mohr commented Apr 14, 2020

From those times, I could do Mondays 9-10 AM PDT or Tuesdays 7.30-8.30 AM PDT. For Tuesdays I need to know 5 to 6 days in advance, because I need to shift another meeting. @rabernat @jhamman

jhamman commented Apr 14, 2020

Both of these time slots work for me.

m-mohr commented Apr 15, 2020

I tried to come up with an example to base discussions upon, based on:

  • this ESM spec example
  • STAC Collections
  • STAC Data Cube Extension
  • STAC Asset Extension
  • an envisioned new STAC ESM extension
  • the info from @rabernat above

For more details see also the comments in the code.

{
  // STAC collection fields
  "stac_version": "0.9.0",
  "stac_extensions": [
    "asset",
    "datacube",
    "esm" // A new extension based on the ESM collection spec
  ],
  "id": "pangeo-cmip6",
  "title": "Google CMIP6",
  "description": "This is an ESM collection for CMIP6 Zarr data residing in Pangeo's Google Storage.",
  "extent": {
    "spatial": {
      "bbox": [[-180, -90, 180, 90]]
    },
    "temporal": {
      "interval": [["1850-01-15T12:00:00Z", "2014-12-15T12:00:00Z"]]
    }
  },
  "providers": [
    {
    "name": " World Climate Research Programme",
    "roles": ["producer","licensor"],
    "url": "https://www.wcrp-climate.org/wgcm-cmip/wgcm-cmip6"
    },
    {
    "name": "The Pangeo Project",
    "roles": ["processor"],
    "url": "https://console.cloud.google.com/pangeo.io"
    },
    {
    "name": "Google",
    "roles": ["host"],
    "url": "https://console.cloud.google.com/marketplace/details/noaa-public/cmip6"
    }
  ],
  "license": "proprietary",
  "links": [
    {
      "href": "https://pcmdi.llnl.gov/CMIP6/TermsOfUse/TermsOfUse6-1.html",
      "type": "text/html",
      "rel": "license",
      "title": "CMIP6: Terms of Use"
    }
  ],
  "summaries": {
    // Could hold additional metadata as defined for STAC Items, not sure what could be relevant.
  },
  // Data Cube extension, see https://github.com/radiantearth/stac-spec/tree/master/extensions/datacube
  "cube:dimensions": {
    "lon": {
      "type": "spatial",
      "axis": "x",
      "extent": [0,360],
      "reference_system": 0 // Placeholder, Which is it here?
    },
    "lat": {
      "type": "spatial",
      "axis": "y",
      "extent": [-90,90],
      "reference_system": 0 // Placeholder, Which is it here?
    },
    "time": { 
      "type": "temporal",
      "extent": ["1850-01-15T12:00:00Z", "2014-12-15T12:00:00Z"],
      "step": "P30D" // Random placeholder
    },
    // Could probably be moved to cube:variables
    "tas": {
      "type": "variable",
      "description": "Surface air temperature",
      "extent": [-70, 70], // Placeholder
      "unit": "°C"
    }
  },
  // This is not part of STAC yet and needs to be defined, probably similar to dimensions with objects like https://github.com/radiantearth/stac-spec/tree/master/extensions/datacube#additional-dimension-object
  "cube:variables": {
    "tas": {
      // to be defined...
      "type": "variable",
      "description": "Surface air temperature",
      "extent": [-70, 70], // Placeholder
      "unit": "°C"
    },
    "time_bnds": {
      // to be defined
    },
    "lat_bnds": {
      // to be defined
    },
    "lon_bnds": {
      // to be defined
    }
  },
  // Asset extension, extended by ESM extension to support asset-level metadata (adds the `href` property), ESM also defines "column_name" and specific roles ("catalog", "attribute").
  "assets": {
    "catalog": {
      // Optional, otherwise specify esm:catalog below
      "roles": ["catalog"],
      "type": "application/vnd.zarr", // Previously assets.format - is there a ZARR media type?
      "column_name": "path",
      "title": "Catalog",
      "description": "Path to the CSV file with the catalog contents.",
      "href": "sample-pangeo-cmip6-zarr-stores.csv"
    },
    // All attributes / vocabulary files, we may also move these out of the assets, depending on whether there's usually a "href" set or not. If not, it could simply be moved to a field "esm:attributes" with the same structure as in the ESM spec.
    "activity_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "activity_id",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json"
    },
    "source_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "source_id",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_source_id.json"
    },
    "institution_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "institution_id",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_institution_id.json"
    },
    "experiment_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "experiment_id",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_experiment_id.json"
    },
    "member_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "member_id"
    },
    "table_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "table_id",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_table_id.json"
    },
    "variable_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "variable_id"
    },
    "grid_label": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "grid_label",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_grid_label.json"
    }
  },
  // ESM extension fields
  "esm:catalog": {}, // Optional, previously the "catalog dict" if no "catalog" asset is available 
  "esm:aggregation_control": {
    // As defined by the ESM spec
  }
}

m-mohr commented Apr 15, 2020

Thanks for all the information, it helped to work on the example above.

This object, with dimensions time, lat, lon, is already a data cube. Other variables may have additional dimensions, e.g. height, or may not use lat-lon coordinates (instead using some sort of curvilinear mesh to cover the sphere).

That seems pretty complex to model. We need to check closely whether we can abstract it into the data cube extension or add this to the ESM extension if desired.

A principle contrast between CMIP6 and most other datasets I've seen in STAC is that all of the CMIP6 data is completely global in extent. It's not at all interesting or important for us to know the bounding box of the data.

Understood, although STAC collections require you to specify the extent. So you could always set it to worldwide.

Furthermore, the time range may include non-standard calendars used in climate modeling, such as 360, no-leap, etc, which are impossible to encode using STAC.

It still has a temporal extent it covers, right? So that can be specified in the extents. And then for the non-standard calendars we can probably use the data cube extension (it has the option to do so) or add something in the ESM extension. Although I couldn't find any information on dates etc. in the ESM spec, so I'm not sure whether you want to include it at all...

The challenge for users with CMIP6 data is not finding a particular spatial or temporal extent--rather, it is filtering the >100,000 datasets according to variable, scenario, modeling center, etc.

You don't need to search for spatial or temporal extents in a STAC API. We "just" need to figure out how you search for variable, scenario, modeling center, etc. and how we can map this to STAC. But the example above only looks at a JSON encoding for now; the API is a completely different beast.

The ESGF uses a custom search API to solve this problem.

As said above, I guess you still need a custom API or an extension for the STAC API. That's nothing we can support out of the box yet. Let's start with aligning the JSON encodings first.

Because it is not hierarchical and therefore cannot be nested, size is a major concern. These CSV files are already ~50 MB. JSON is much larger to store and slower to parse for this sort of flat, tidy data. But supporting JSON would certainly be possible.

We don't need to store it as JSON, I think. We can just reference the CSV file(s) as assets.

cholmes commented Apr 15, 2020

It's not at all interesting or important for us to know the bounding box of the data.

Understood, although STAC collections require you to specify the extent. So you could always set it to worldwide.

Agreed, and I think even though it's not interesting or important to you, it is important for a general person looking for geospatial information to know that this particular set of data is global. But setting it worldwide in all cases I think is totally valid. And it does seem like there is multidimensional cube / netCDF data that isn't global sometimes?

The challenge for users with CMIP6 data is not finding a particular spatial or temporal extent--rather, it is filtering the >100,000 datasets according to variable, scenario, modeling center, etc.

You don't need to search for spatial or temporal in STAC API. We "just" need to figure out how you search for variable, scenario, modeling center etc. and how we can map this to STAC. But the example above only looks at a JSON encoding for now. API is a completely different beast.

+1 - I see the win here as getting the 'overview' in the same 'world' as STAC, so we have interoperability at the 'collection' level. The filtering of datasets can be a totally different 'thing'. And I think the links to the csv files seems like a nice cloud native way to let diverse tools filter datasets.

But supporting JSON would certainly be possible.

We don't need to store it as JSON, I think. We can just reference the CSV file(s) as assets.

+1 - I too was curious to know why it is CSV, but the reasoning makes sense, and having STAC refer to assets that aren't json is totally normal.

@rabernat

We don't need to store it as JSON, I think. We can just reference the CSV file(s) as assets.

This is more-or-less what the ESM collection spec does now. At the top level, we still have a json file, which points to this csv file.

Although I couldn't find any information on dates etc in the ESM spec so not sure whether you want to include it at all...

This is all covered by CF conventions: http://cfconventions.org/cf-conventions/cf-conventions.html#calendar

An important thing to note here is that our field has a very comprehensive and widely adopted set of metadata conventions called CF conventions. But all the CF metadata live inside the netCDF / Zarr file (and in Zarr it is stored in json). A question we would need to resolve is how much metadata to pull out and store in a STAC collection. @m-mohr, in your example above, you are essentially duplicating much of that metadata, so it's starting to look very similar to the zarr metadata file itself. An example: https://storage.googleapis.com/cmip6/CMIP/AS-RCEC/TaiESM1/1pctCO2/r1i1p1f1/Amon/clt/gn/.zmetadata:

{
    "metadata": {
        ".zattrs": {
            "Conventions": "CF-1.7 CMIP-6.2",
            "activity_id": "CMIP",
            "branch_method": "Hybrid-restart from year 0701-01-01 of piControl",
            "branch_time": 0.0,
            "branch_time_in_child": 0.0,
            "branch_time_in_parent": 182500.0,
            "cmor_version": "3.5.0",
            "contact": "Dr. Wei-Liang Lee (leelupin@gate.sinica.edu.tw)",
            "coordinates": "lon_bnds time_bnds lat_bnds",
            "creation_date": "2020-01-17T02:36:31Z",
            "data_specs_version": "01.00.31",
            "experiment": "1 percent per year increase in CO2",
            "experiment_id": "1pctCO2",
            "external_variables": "areacella",
            "forcing_index": 1,
            "frequency": "mon",
            "further_info_url": "https://furtherinfo.es-doc.org/CMIP6.AS-RCEC.TaiESM1.1pctCO2.none.r1i1p1f1",
            "grid": "finite-volume grid with 0.9x1.25 degree lat/lon resolution",
            "grid_label": "gn",
            "history": "2020-01-17T02:36:31Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards.",
            "initialization_index": 1,
            "institution": "Research Center for Environmental Changes, Academia Sinica, Nankang, Taipei 11529, Taiwan",
            "institution_id": "AS-RCEC",
            "license": "CMIP6 model data produced by NCC is licensed under a Creative Commons Attribution ShareAlike 4.0 International License (https://creativecommons.org/licenses). Consult https://pcmdi.llnl.gov/CMIP6/TermsOfUse for terms of use governing CMIP6 output, including citation requirements and proper acknowledgment. Further information about this data, including some limitations, can be found via the further_info_url (recorded as a global attribute in this file) and at https:///pcmdi.llnl.gov/. The data producers and data providers make no warranty, either express or implied, including, but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law.",
            "mip_era": "CMIP6",
            "model_id": "TaiESM1",
            "nominal_resolution": "100 km",
            "parent_activity_id": "CMIP",
            "parent_experiment_id": "piControl",
            "parent_mip_era": "CMIP6",
            "parent_source_id": "TaiESM1",
            "parent_sub_experiment_id": "none",
            "parent_time_units": "days since 1850-01-01",
            "parent_variant_label": "r1i1p1f1",
            "physics_index": 1,
            "product": "model-output",
            "realization_index": 1,
            "realm": "atmos",
            "references": "http://cclics.rcec.sinica.edu.tw/index.php/taiesm/outline.html",
            "run_variant": "N/A",
            "source": "TaiESM 1.0 (2018): \naerosol: SNAP (same grid as atmos)\natmos: TaiAM1 (0.9x1.25 degree; 288 x 192 longitude/latitude; 30 levels; top level ~2 hPa)\natmosChem: SNAP (same grid as atmos)\nland: CLM4.0 (same grid as atmos)\nlandIce: none\nocean: POP2 (320x384 longitude/latitude; 60 levels; top grid cell 0-10 m)\nocnBgchem: none\nseaIce: CICE4",
            "source_id": "TaiESM1",
            "source_type": "AOGCM AER BGC",
            "status": "2020-04-03;created; by gcs.cmip6.ldeo@gmail.com",
            "sub_experiment": "none",
            "sub_experiment_id": "none",
            "table_id": "Amon",
            "table_info": "Creation Date:(24 July 2019) MD5:0bb394a356ef9d214d027f1aca45853e",
            "title": "TaiESM1 output prepared for CMIP6",
            "tracking_id": "hdl:21.14100/a8dd48da-2316-4c01-871c-9491971da509",
            "variable_id": "clt",
            "variant_label": "r1i1p1f1"
        },
        ".zgroup": {
            "zarr_format": 2
        },
        "clt/.zarray": {
            "chunks": [
                321,
                192,
                288
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f4",
            "fill_value": 1.0000000200408773e+20,
            "filters": null,
            "order": "C",
            "shape": [
                1800,
                192,
                288
            ],
            "zarr_format": 2
        },
        "clt/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "time",
                "lat",
                "lon"
            ],
            "cell_measures": "area: areacella",
            "cell_methods": "area: time: mean",
            "comment": "Total cloud area fraction (reported as a percentage) for the whole atmospheric column, as seen from the surface or the top of the atmosphere. Includes both large-scale and convective cloud.",
            "history": "2020-01-17T02:36:31Z altered by CMOR: Converted units from '1' to '%'. 2020-01-17T02:36:31Z altered by CMOR: Converted type from 'd' to 'f'.",
            "long_name": "Total Cloud Cover Percentage",
            "original_name": "CLDTOT",
            "original_units": "1",
            "standard_name": "cloud_area_fraction",
            "units": "%"
        },
        "lat/.zarray": {
            "chunks": [
                192
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                192
            ],
            "zarr_format": 2
        },
        "lat/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "lat"
            ],
            "axis": "Y",
            "bounds": "lat_bnds",
            "long_name": "Latitude",
            "standard_name": "latitude",
            "units": "degrees_north"
        },
        "lat_bnds/.zarray": {
            "chunks": [
                192,
                2
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                192,
                2
            ],
            "zarr_format": 2
        },
        "lat_bnds/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "lat",
                "bnds"
            ]
        },
        "lon/.zarray": {
            "chunks": [
                288
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                288
            ],
            "zarr_format": 2
        },
        "lon/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "lon"
            ],
            "axis": "X",
            "bounds": "lon_bnds",
            "long_name": "Longitude",
            "standard_name": "longitude",
            "units": "degrees_east"
        },
        "lon_bnds/.zarray": {
            "chunks": [
                288,
                2
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                288,
                2
            ],
            "zarr_format": 2
        },
        "lon_bnds/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "lon",
                "bnds"
            ]
        },
        "time/.zarray": {
            "chunks": [
                1800
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<i8",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                1800
            ],
            "zarr_format": 2
        },
        "time/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "time"
            ],
            "axis": "T",
            "bounds": "time_bnds",
            "calendar": "noleap",
            "long_name": "time",
            "standard_name": "time",
            "units": "hours since 0001-01-16 12:00:00.000000"
        },
        "time_bnds/.zarray": {
            "chunks": [
                1800,
                2
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                1800,
                2
            ],
            "zarr_format": 2
        },
        "time_bnds/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "time",
                "bnds"
            ],
            "calendar": "noleap",
            "units": "days since 0001-01-01"
        }
    },
    "zarr_consolidated_format": 1
}

This is essentially CF conventions encoded in the Zarr v2 Spec.

Perhaps one path forward is to make STAC more aware of Zarr objects. There is already considerable compatibility: Zarr uses JSON files to provide metadata about a collection of binary data objects which together comprise a full multidimensional array.
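That compatibility is easy to demonstrate: a consolidated Zarr metadata document is plain JSON that can be walked with the standard library. The excerpt and helper below are hypothetical, heavily trimmed from the real `.zmetadata` quoted above:

```python
import json

# Hypothetical, heavily trimmed consolidated Zarr metadata (".zmetadata"),
# mirroring the structure of the real document quoted above.
ZMETADATA = json.loads("""
{
  "metadata": {
    ".zgroup": {"zarr_format": 2},
    "clt/.zarray": {"shape": [1800, 192, 288], "dtype": "<f4"},
    "clt/.zattrs": {"_ARRAY_DIMENSIONS": ["time", "lat", "lon"], "units": "%"}
  },
  "zarr_consolidated_format": 1
}
""")

def list_arrays(meta):
    """Map array name -> (shape, dimension names) from consolidated metadata."""
    out = {}
    for key, value in meta["metadata"].items():
        if key.endswith("/.zarray"):
            name = key[: -len("/.zarray")]
            attrs = meta["metadata"].get(name + "/.zattrs", {})
            out[name] = (tuple(value["shape"]),
                         tuple(attrs.get("_ARRAY_DIMENSIONS", [])))
    return out
```

Everything a catalog would want to surface (array names, shapes, dimension names, units) is already sitting in that one JSON document.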

From those times, I could do Mondays 9-10 AM PDT

I can do Monday 9 AM PDT (12 PM EDT). Shall we settle on this?

m-mohr commented Apr 15, 2020

Sure, let's do it then. Here are the Zoom details:

Topic: ESM / STAC alignment
Time: Mon, Apr 20, 2020, 06:00 PM +02:00 (Amsterdam, Berlin, Rome, Stockholm, Vienna)

https://wwu.zoom.us/j/97130735633

@rabernat @cholmes @jhamman

cholmes commented Apr 15, 2020

On my calendar - thanks!

m-mohr commented Apr 16, 2020

This is more-or-less what the ESM collection spec does now.

Yes, I had looked at the spec when coming up with the example. It tries to borrow as much as possible from there.

This is all covered by CF conventions: http://cfconventions.org/cf-conventions/cf-conventions.html#calendar

Interesting. We need to figure out what needs to be exposed and how. Would someone search on this data? Maybe it's as simple as adding a field "calendar" with the options defined in the link?
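For illustration, the effect such a "calendar" field would have on date arithmetic can be sketched like this (a simplified toy covering only three of the CF calendar values, not a real calendar library):

```python
# Sketch (not a library API) of why the CF "calendar" attribute matters:
# the same nominal year has a different length under different model calendars.
def days_in_year(year, calendar):
    if calendar == "360_day":
        return 360
    if calendar == "noleap":
        return 365
    # "standard" / proleptic Gregorian leap-year rule
    is_leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
    return 366 if is_leap else 365

def days_in_span(start_year, end_year, calendar):
    """Days in the half-open range [start_year, end_year) for a CF calendar."""
    return sum(days_in_year(y, calendar) for y in range(start_year, end_year))
```

A client that ignores the calendar field and assumes Gregorian dates would miscount days in any 360-day or no-leap dataset, which is exactly why the attribute would need to be exposed.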

A question we would need to resolve is how much metadata to pull out and store in a STAC collection. @m-mohr, in your example above, you are essentially duplicating much of that metadata, so it's starting to look very similar to the zarr metadata file itself.

Yes, STAC doesn't replace the original metadata files, but exposes the search/discovery-related data in JSON, so there is indeed some intentional duplication. We need to figure out to which extent this is needed for Zarr. For example, all the data-cube-related things are optional; without them the example is shorter and you have a bare minimum, I think. I don't think we can or should go beyond that point of "simplification".

Perhaps one path forward is to make STAC more aware of Zarr objects. There is already considerably compatibility. Zarr uses json files to provide metadata about a collection of binary data objects which together comprise a full multidimensional array.

Not 100% sure how we could do that, but we can link to it for sure. Is there a media type for zarr (metadata) except for application/json?

@rabernat

Is there a media type for zarr (metadata) except for application/json?

A Zarr array or group is not a single file / object. It is a collection of json files and binary objects, stored with a standard layout. So it is not meaningful to talk about a media type for Zarr.

In other words, a Zarr array or group is analogous to a STAC collection, and the individual chunks analogous to STAC items. This overlap in scope is one reason I have a hard time mapping Zarr to STAC.

@m-mohr
Copy link
Collaborator Author

m-mohr commented Apr 16, 2020

Aha, I wasn't aware of that. That makes some of my considerations above obsolete. Let's see how we can go forward...

@rabernat
Copy link

Over the past year, I've spent quite a bit of time reading and trying to understand the STAC spec and its extension. In order to make Monday's meeting as productive as possible, I encourage you to do the same with the Zarr spec.

For context, Zarr recently received a CZI EOSS Grant. It is growing rapidly, not only in geoscience but also in bioinformatics / bioimaging. Thinking carefully about how to best integrate STAC and Zarr is an important task.

@matthewhanson
Copy link
Collaborator

Just catching up on this - I'll join the call on Monday as well.

To @rabernat's point about Zarr being multiple files: while that's true, you don't reference the individual pieces....that is, they aren't individually addressable as files. You reference the entire Zarr dataset, so in that sense I think a media type might make sense for Zarr.

One idea that had been brought up before was collection-level assets, for which there's now an issue (#779), but it does go against the central concept of STAC....Items are the things that are searched.

Some questions to consider and to talk about on Monday:

  • Is there value in indexing metadata from Zarr chunks?
  • The Zarr format is just a set of multidimensional arrays - is there value in spatially indexing geographic coordinates from Zarr arrays?

If the answer to the above questions is 'yes' or 'possibly', then that implies it is advantageous to be able to specifically address a portion of a Zarr dataset, and to determine that ahead of time by some separate query. This goes against my point above that pieces of a Zarr dataset aren't individually addressable....they aren't in the traditional sense, but Zarr chunks are read in pieces...so maybe the "URLs" to get to a piece of a Zarr dataset are really just arguments to some function. In this case maybe multiple STAC Items representing a complete Zarr dataset make sense, with the Collection representing the entire dataset.

If the answer to these questions is no, then Items do not really make sense. We could still align STAC and ESM - either by adding assets to a Collection so that the Collection represents the entire dataset, or by having a single STAC Item represent a Zarr dataset.
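To illustrate the point above that getting to a piece of a Zarr dataset is "really just arguments to some function": a small sketch mapping an index range to the Zarr v2 chunk keys that cover it. The helper `chunk_keys` is hypothetical, not a real Zarr API:

```python
def chunk_keys(start, stop, shape, chunks):
    """Return Zarr v2 chunk keys (e.g. "1.0") covering the half-open
    index range [start, stop) along each axis of the array."""
    ranges = []
    for lo, hi, n, c in zip(start, stop, shape, chunks):
        first, last = lo // c, (min(hi, n) - 1) // c
        ranges.append(range(first, last + 1))
    # Cartesian product of per-axis chunk indices, joined as "i.j" keys.
    keys = [""]
    for r in ranges:
        keys = [f"{k}.{i}" if k else str(i) for k in keys for i in r]
    return keys

# Chunks needed to read rows 100-400, cols 0-720 of a (720, 1440) array
# stored in (360, 360) chunks:
print(chunk_keys((100, 0), (400, 720), (720, 1440), (360, 360)))
# → ['0.0', '0.1', '1.0', '1.1']
```

In this framing, a "URL" to a piece of a Zarr dataset is the store root plus a computed chunk key, which is why the pieces don't need to be individually cataloged.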

@m-mohr
Copy link
Collaborator Author

m-mohr commented Apr 17, 2020

The meeting details above were not correct: they said Sun the 19th, but of course we meet on Mon the 20th.

@m-mohr m-mohr modified the milestones: 1.0.0-beta1, 1.0.0-beta2 Apr 28, 2020
@m-mohr
Copy link
Collaborator Author

m-mohr commented Apr 28, 2020

As we are not going to use it in the ESM spec, I'm delaying the work on this until beta 2 and will focus on other more pressing issues first.

@m-mohr m-mohr modified the milestones: 1.0.0-beta.2, 1.0.0-beta.3 Jun 15, 2020
@m-mohr m-mohr modified the milestones: 1.0.0-beta.3, future Jan 4, 2021
@matprov
Copy link

matprov commented Feb 5, 2021

Related to the ESM collection spec alignment with STAC (NCAR/esm-collection-spec#21)

I too was curious to know why it is CSV, but the reasoning makes sense, and having STAC refer to assets that aren't json is totally normal.

@cholmes Isn't it important that each item is searchable through the STAC API's search endpoint itself, rather than relying on a client-side CSV lookup? I mean, it's clear that storing as JSON would be a heavier data representation than what we find in the CSV file, but the server-side search capability would benefit from it.

And I think the links to the csv files seems like a nice cloud native way to let diverse tools filter datasets.

Based on the nature of the climate data, which is accessed by variables instead of spatial or temporal extent, it makes sense to use a CSV to easily find datasets. However, I wonder if the actual goal of STAC search is reached if it only provides the CSV and all operations are client-side. What about having the actual data attributes and values self-contained in the JSON, using the data cube extension or custom metadata fields with custom schemas, instead of relying on CSV? I personally can't describe CSV files as a cloud native way to store this data. I imagine consuming the STAC API through a light client (i.e. a web browser) would be painful if a 10 MB CSV needs to be downloaded beforehand.

@cholmes
Copy link
Contributor

cholmes commented Feb 5, 2021

@cholmes Isn't it important that each item is searchable through the STAC API's search endpoint itself, rather than relying on a client side CSV lookup? I mean, it's clear that storing as JSON would be a heavier data representation than what we find in the CSV file, but the server side search capability would benefit from it

I agree that's the ideal. But my main thought with ESM / Zarr in STAC right now is to not let great be the enemy of good. Originally STAC was very focused on 'Items', but OpenEO and Google Earth Engine both find a lot of value in just using it for 'collections' (as they don't have items - each layer is a full composite, abstracting out the 'scenes'). It's just a more modern metadata format for those who don't want to use one of the older XML standards, and it works in STAC tools. I still don't have my head fully around Zarr, so it feels like STAC can at least 'help' with collection-level search. We get that 'win', and then we can further investigate the 'item' level. What you say makes sense to me, but I still don't have a real feel for what exactly should be an 'item'. But I do think once we get that first win we should try to dive deep. I think it'll also be easier once the ecosystem for STAC is a bit more mature, as we'll be able to see what putting things in items or collections actually results in. Right now it all just feels too abstract.

However, I wonder if the actual goal of STAC search is reached if it only provides the CSV and all operations are client side. What about having the actual data attributes and values self-contained in the JSON using the data cube extension or custom metadata fields using custom schemas, instead of relying on CSV?

This does make sense to me, and I fully agree the full goal of STAC search will be better reached if we map the CSV into structures that result in fuller 'search' in STAC. But I think we can start with 'STAC collection search' (which, to be fair, we don't really do yet in APIs, as we are waiting to align with OGC API - Records, but we know the 'static' structure that will power the APIs). I'm also up to spend some time figuring out the 'full' solution, though I'd also want to understand what that would mean for existing clients of the CSV.

@m-mohr
Copy link
Collaborator Author

m-mohr commented Feb 5, 2021

The ESM Collection spec work seems to have stopped at some point; it was never finished as far as I know, but I think it was replaced with a more generic Zarr mapping for STAC. Maybe @rabernat can chime in here?

@rabernat
Copy link

rabernat commented Feb 5, 2021

Thanks for re-opening the discussion.

Looking back at these issues in retrospect, I see that we muddied the water a lot by conflating two separate issues. These separate issues are

  1. How do we refer to NetCDF or Zarr asset from a STAC catalog? This is also discussed in Zarr Extension? #781.
  2. How do we represent the typical deeply nested hierarchy of earth-system model fields in a STAC catalog? Our solution with the ESM STAC extension idea--of referring to a CSV file that describes the actual assets--seems like a misstep.

Going forward, I think the most important issue to resolve is how best to point to and describe non-imagery file formats in STAC. Zarr is a particularly hard case, because it is not even a single file but rather has its own internal nested hierarchy of objects.
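As a starting point for that discussion, here is a sketch of what a collection-level asset pointing at the root of a Zarr store could look like. The `href` is a placeholder, and the `application/vnd+zarr` media type is an assumption, not something the spec has settled on:

```json
{
  "assets": {
    "data": {
      "href": "https://example.com/datasets/tas.zarr",
      "type": "application/vnd+zarr",
      "roles": ["data"],
      "title": "Root of the Zarr object hierarchy"
    }
  }
}
```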

Rather than trying to do this ESM collection / csv idea from inside STAC, our current view is that we should simply generate a static, deeply nested STAC catalog for our data and then index it with various search tools. In this context, the csv file is simply one of those indexes. Our community is quite into open search and elastic search as well, so those could be other options. This is discussed quite a bit in pangeo-forge/cmip6-pipeline#7.

Coincidentally, a group of folks from Pangeo and ESGF has just been getting ready to reach out to the STAC community again to discuss how we can start to align better with STAC. We are still stuck on some high level conceptual questions that would benefit from real-time discussion. We would love to set up a meeting with @m-mohr, @cholmes, @HamedAlemo, and anyone else interested. This meeting would include some folks from the ESGF leadership and so would be a chance to really advance STAC adoption in the climate world.

Would you be interested in such a meeting, and, if so, what's the best way to schedule?

@matprov
Copy link

matprov commented Feb 8, 2021

  1. How do we refer to NetCDF or Zarr asset from a STAC catalog? This is also discussed in Zarr Extension? #781.

@rabernat Also, the points you mention in #366 (comment) are pertinent and should be part of the discussion. The data pipeline around STAC, which is responsible for bridging with existing TDS data, is one important piece of the STAC ecosystem.

When is the next meeting?

@HamedAlemo
Copy link
Collaborator

@rabernat thanks for the clarification on the two issues. I think a meeting would be helpful. If you can also share a small sample Zarr store (something much smaller than a CMIP6 output, e.g.) before the meeting, that would be great. It can help us better understand the Zarr hierarchy and how it can work with STAC.
I'll be happy to coordinate the meeting with a Doodle.

@rabernat
Copy link

I just discovered a bunch of weather data (grib) cataloged in STAC! https://api.weather.gc.ca/ I think @tomkralidis is responsible for this!

@tomkralidis
Copy link
Contributor

Thanks @rabernat. Yes, powered by pygeoapi, this is an experimental capability atop some of our MSC Datamart real-time data. Note that we are working on providing our climate data via STAC in a future release, so it would be great to move climate data in STAC forward.

@m-mohr
Copy link
Collaborator Author

m-mohr commented Mar 4, 2021

The data cube extension has been moved to another repository: https://github.com/stac-extensions/datacube

This issue has been moved to: stac-extensions/datacube#1

Closing here.
