This repository has been archived by the owner on Feb 3, 2023. It is now read-only.

Roadmap for merging with STAC #21

Open
rabernat opened this issue Apr 24, 2020 · 16 comments

Comments

@rabernat
Collaborator

rabernat commented Apr 24, 2020

This is a follow up to the discussion in radiantearth/stac-spec#713 (comment).

On 2020-04-20, @jhamman, @cholmes, @m-mohr, @matthewhanson, and I had a call. The aim was to make progress on something everyone wants: merging the ESM collection spec with STAC. That was our intention from the beginning, but we chose to fork temporarily to get something working fast.

The goal for now is to make as few changes as possible to get this working. My recollection of the meeting is that the proposed plan has two steps:

  • Define an esm extension as a new valid STAC extension. That extension will probably need to live in a new repo (I propose NCAR/stac-esm), or alternatively this repo could morph into that project. The esm extension would include most of the custom fields we have already defined for esm-collection-spec.
  • Redefine all esm-collection.json files as valid STAC Collections. This means adding some additional required metadata fields per the collection spec. These collections will use the esm extension.
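As a rough illustration of the second step, the check below lists which fields an existing esm-collection.json would still be missing before it could count as a STAC Collection. This is a minimal sketch: the required-field list is an assumption based on the STAC 0.9.0 Collection spec and should be verified against the spec itself, and the function name and trimmed sample document are hypothetical.

```python
# Fields assumed to be required by the STAC 0.9.0 Collection spec.
# This list is an assumption for illustration; verify it against the spec.
REQUIRED_COLLECTION_FIELDS = (
    "stac_version",
    "id",
    "description",
    "license",
    "extent",
    "links",
)


def missing_collection_fields(doc: dict) -> list:
    """Return the assumed-required STAC Collection fields absent from doc."""
    return [name for name in REQUIRED_COLLECTION_FIELDS if name not in doc]


# Hypothetical, heavily trimmed ESM collection used only as sample input.
esm_collection = {
    "id": "pangeo-cmip6",
    "description": "CMIP6 Zarr data residing in Pangeo's Google Storage.",
}

print(missing_collection_fields(esm_collection))
```

A check like this only covers field presence; full validation would still need the JSON schemas for the collection spec and the esm extension.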

In radiantearth/stac-spec#713 (comment), @m-mohr worked up a really nice example of how it might look. During the meeting, we agreed that we won't try to also use the datacube extension. That is an eventual goal as well, but we noted several challenges in terms of reconciling datacube with Zarr and CF metadata.

So here I repeat @m-mohr's example, minus the datacube part:

{
  // STAC collection fields
  "stac_version": "0.9.0",
  "stac_extensions": [
    "asset",
    "esm" // A new extension based on the ESM collection spec
  ],
  "id": "pangeo-cmip6",
  "title": "Google CMIP6",
  "description": "This is an ESM collection for CMIP6 Zarr data residing in Pangeo's Google Storage.",
  "extent": {
    "spatial": {
      "bbox": [[-180, -90, 180, 90]]
    },
    "temporal": {
      "interval": [["1850-01-15T12:00:00Z", "2014-12-15T12:00:00Z"]]
    }
  },
  "providers": [
    {
    "name": "World Climate Research Programme",
    "roles": ["producer","licensor"],
    "url": "https://www.wcrp-climate.org/wgcm-cmip/wgcm-cmip6"
    },
    {
    "name": "The Pangeo Project",
    "roles": ["processor"],
    "url": "https://console.cloud.google.com/pangeo.io"
    },
    {
    "name": "Google",
    "roles": ["host"],
    "url": "https://console.cloud.google.com/marketplace/details/noaa-public/cmip6"
    }
  ],
  "license": "CC BY-SA 4.0",
  "links": [
    {
      "href": "https://pcmdi.llnl.gov/CMIP6/TermsOfUse/TermsOfUse6-1.html",
      "type": "text/html",
      "rel": "license",
      "title": "CMIP6: Terms of Use"
    }
  ],
  "summaries": {
    // Could hold additional metadata as defined for STAC Items, not sure what could be relevant.
  },
  // Asset extension, extended by ESM extension to support asset-level metadata (adds the `href` property), ESM also defines "column_name" and specific roles ("catalog", "attribute").
  "assets": {
    "catalog": {
      // Optional, otherwise specify esm:catalog below
      "roles": ["esm-catalog"],
      "type": "text/csv", // Previously assets.format
      "column_name": "path",
      "title": "Catalog",
      "description": "Path to the CSV file with the catalog contents.",
      "href": "sample-pangeo-cmip6-zarr-stores.csv"
    },
    // All attributes / vocabulary files, we may also move these out of the assets, depending on whether there's usually a "href" set or not. If not, it could simply be moved to a field "esm:attributes" with the same structure as in the ESM spec.
    "activity_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "activity_id",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json"
    },
    "source_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "source_id",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_source_id.json"
    },
    "institution_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "institution_id",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_institution_id.json"
    },
    "experiment_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "experiment_id",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_experiment_id.json"
    },
    "member_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "member_id"
    },
    "table_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "table_id",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_table_id.json"
    },
    "variable_id": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "variable_id"
    },
    "grid_label": {
      "roles": ["attribute"],
      "type": "application/json",
      "column_name": "grid_label",
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_grid_label.json"
    }
  },
  // ESM extension fields
  "esm:catalog": {}, // Optional, previously the "catalog dict" if no "catalog" asset is available 
  "esm:aggregation_control": {
    // As defined by the ESM spec
  }
}

One thing I changed was to define the role for the asset as esm-catalog rather than catalog. This can hopefully let a processor (like intake-esm) know that this asset has a special role within the esm extension.
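To make the role-based lookup concrete, here is a minimal sketch of how a processor might locate the catalog in a collection laid out like the example above. It is not how intake-esm works today; the function name `find_catalog` is hypothetical, and the fallback to an inline "esm:catalog" follows the "Optional, otherwise specify esm:catalog below" comment in the example.

```python
def find_catalog(collection: dict):
    """Locate the catalog of a collection laid out like the example above.

    Returns ("asset", asset_dict) for the first asset whose roles include
    "esm-catalog", ("inline", catalog) if only a non-empty "esm:catalog"
    is present, and (None, None) otherwise.
    """
    for asset in collection.get("assets", {}).values():
        if "esm-catalog" in asset.get("roles", []):
            return "asset", asset
    inline = collection.get("esm:catalog")
    if inline:
        return "inline", inline
    return None, None


# Trimmed sample input taken from the example above.
collection = {
    "assets": {
        "catalog": {
            "roles": ["esm-catalog"],
            "type": "text/csv",
            "href": "sample-pangeo-cmip6-zarr-stores.csv",
        },
        "activity_id": {
            "roles": ["attribute"],
            "type": "application/json",
            "column_name": "activity_id",
        },
    },
    "esm:catalog": {},
}

kind, catalog = find_catalog(collection)
print(kind, catalog["href"])
```

Matching on the role rather than on the asset key means the key ("catalog" here) stays free-form, which is the point of using a dedicated esm-catalog role.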


I'd love some feedback on whether I remembered the meeting accurately (it was a few days ago and our notes were sparse) and whether this sounds like a good plan. The STAC folks proposed organizing a 2-hour sprint to bang this out, and I think that's a great idea. I would not be free until the first week of May. If others agree (particularly need help from @andersy005 and @charlesbluca), I'll send out a Doodle.

@m-mohr

m-mohr commented Apr 24, 2020

  • Define an esm extension as a new valid STAC extension. That extension will probably need to live in a new repo (I propose NCAR/stac-esm), or alternatively this repo could morph into that project.

I guess you could just start with a branch here?

"type": "application/json", // Previously assets.format

That's meant to be text/csv (or whatever media type applies for CSV files), of course.

One thing I changed was to define the role for the asset as esm-catalog rather than catalog. This can hopefully let a processor (like intake-esm) know that this asset has a special role within the esm extension.

  • That makes sense!
  • I think we also agreed on leaving the attributes how they are (or with slight adjustments?) and not putting them in assets. We could optionally add the actual references in assets or links though.
  • I think another idea came up: removing the "catalog_dict" / "esm:catalog" and moving it to a separate file, linked from the assets exactly like the CSV catalog, just with a different media type. I like that idea.
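A minimal sketch of what that last idea could mean for a reader of the collection: the catalog asset's media type decides how the linked file is parsed. The function name `parse_catalog`, the use of "application/json" for the separate catalog file, and the assumption that such a file holds a list of row objects are all hypothetical; none of this is settled in the thread.

```python
import csv
import io
import json


def parse_catalog(text: str, media_type: str) -> list:
    """Parse catalog text into a list of row dicts based on its media type."""
    if media_type == "text/csv":
        return list(csv.DictReader(io.StringIO(text)))
    if media_type == "application/json":
        # Assumption: the JSON file holds a list of row objects.
        return json.loads(text)
    raise ValueError(f"unsupported catalog media type: {media_type}")


# Hypothetical one-row catalogs with the same content in both formats.
csv_text = "variable_id,path\ntas,gs://example-bucket/tas\n"
json_text = '[{"variable_id": "tas", "path": "gs://example-bucket/tas"}]'

rows_from_csv = parse_catalog(csv_text, "text/csv")
rows_from_json = parse_catalog(json_text, "application/json")
print(rows_from_csv == rows_from_json)
```

With this layout the collection itself never embeds rows; it only points at a file and states its type.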

I would not be free until the first week of May. If others agree (particularly need help from @andersy005 and @charlesbluca), I'll send out a Doodle.

First week of May sounds good to me. A Doodle is good, too.

@andersy005
Contributor

Thank you @rabernat @jhamman & @m-mohr for putting this together! Looking forward to seeing this brought to completion.

If others agree (particularly need help from @andersy005 and @charlesbluca), I'll send out a Doodle.

The first week of May works for me as well.

@rabernat
Collaborator Author

That's meant to be text/csv (or whatever media type applies for CSV files), of course.

Fixed

I have created a Doodle here:
https://doodle.com/poll/756diubsfb3x5nb2
The goal of this meeting is to have a 2-hour block where we all work on this simultaneously. I am hoping that at minimum, @m-mohr, @andersy005, @charlesbluca, and myself can attend. Would also be great to have @naomi-henderson, @jhamman, @cholmes, and @matthewhanson. If you can't make the full two hours, that's okay--just click "if need be" in Doodle.

The goals of the hack session are:

  • Define the esm extension and submit a PR to STAC
  • Refactor some existing ESM collections to conform to the new spec
  • Update the documentation in this repo to reflect the new standard

If time permits, we can start updating processing tools (e.g. pangeo catalog, intake-esm) to adapt to the new conventions. However, this is not the main goal.

Anything I missed?

@rabernat
Collaborator Author

rabernat commented Apr 29, 2020

The winning time is

May 7 THU 1:00 PM - 3:00 PM EDT

We can use https://whereby.com/pangeo to chat / coordinate.

@m-mohr

m-mohr commented May 5, 2020

A few updates before the telco: based on the last telco, I tried to come up with a new example that I think better aligns both specs. The biggest change, and probably the biggest point of discussion, is splitting the vocabulary links into assets and a separate array of attribute names.

{
  "stac_version": "0.9.0",
  "stac_extensions": [
    "collection-assets",
    "https://github.com/NCAR/esm-collection-spec/tree/master/schema.json"
  ],
  "id": "pangeo-cmip6",
  "title": "Google CMIP6",
  "description": "This is an ESM collection for CMIP6 Zarr data residing in Pangeo's Google Storage.",
  "extent": {
    "spatial": {
      "bbox": [[-180, -90, 180, 90]]
    },
    "temporal": {
      "interval": [["1850-01-15T12:00:00Z", "2014-12-15T12:00:00Z"]]
    }
  },
  "providers": [
    {
      "name": "World Climate Research Programme",
      "roles": ["producer","licensor"],
      "url": "https://www.wcrp-climate.org/wgcm-cmip/wgcm-cmip6"
    },
    {
      "name": "The Pangeo Project",
      "roles": ["processor"],
      "url": "https://console.cloud.google.com/pangeo.io"
    },
    {
      "name": "Google",
      "roles": ["host"],
      "url": "https://console.cloud.google.com/marketplace/details/noaa-public/cmip6"
    }
  ],
  "license": "proprietary",
  "links": [
    {
      "href": "https://pcmdi.llnl.gov/CMIP6/TermsOfUse/TermsOfUse6-1.html",
      "type": "text/html",
      "rel": "license",
      "title": "CMIP6: Terms of Use"
    }
  ],
  "assets": {
    "thumbnail": {
      "href": "logo.png",
      "title": "A preview image for visualization.",
      "type": "image/png",
      "roles": ["thumbnail"]
    },
    "catalog": {
      "href": "sample-pangeo-cmip6-zarr-stores.csv",
      "title": "Catalog",
      "description": "Path to the CSV file with the catalog contents.",
      "type": "text/csv",
      "roles": ["esm-catalog"],
      "esm:column_name": "path"
    },
    "activity_id": {
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json",
      "type": "application/json",
      "roles": ["esm-vocabulary"],
      "esm:column_name": "activity_id"
    },
    "source_id": {
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_source_id.json",
      "type": "application/json",
      "roles": ["esm-vocabulary"],
      "esm:column_name": "source_id"
    },
    "institution_id": {
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_institution_id.json",
      "type": "application/json",
      "roles": ["esm-vocabulary"],
      "esm:column_name": "institution_id"
    },
    "experiment_id": {
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_experiment_id.json",
      "type": "application/json",
      "roles": ["esm-vocabulary"],
      "esm:column_name": "experiment_id"
    },
    "table_id": {
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_table_id.json",
      "type": "application/json",
      "roles": ["esm-vocabulary"],
      "esm:column_name": "table_id"
    },
    "grid_label": {
      "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_grid_label.json",
      "type": "application/json",
      "roles": ["esm-vocabulary"],
      "esm:column_name": "grid_label"
    }
  },
  "esm:catalog": {},
  "esm:attributes": ["activity_id", "source_id", "institution_id", "experiment_id", "member_id", "table_id", "variable_id", "grid_label"],
  "esm:aggregation_control": {
    "variable_column_name": "variable_id",
    "groupby_attrs": [
      "activity_id",
      "institution_id",
      "source_id",
      "experiment_id",
      "table_id",
      "grid_label"
    ],
    "aggregations": [
      {
        "type": "join_new",
        "attribute_name": "member_id",
        "options": { "coords": "minimal", "compat": "override" }
      },
      {
        "type": "join_existing",
        "attribute_name": "time_range",
        "options": { "dim": "time" }
      },
      {
        "type": "union",
        "attribute_name": "variable_id"
      }
    ]
  }
}
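To show how the split between vocabulary assets and the "esm:attributes" array could be consumed, here is a minimal sketch that maps each attribute name to its vocabulary link, or to None when no vocabulary asset exists (as for member_id and variable_id in the example above). The function name `vocabulary_hrefs` is hypothetical; the role "esm-vocabulary" and the "esm:column_name" field are taken from the example.

```python
def vocabulary_hrefs(collection: dict) -> dict:
    """Map each name in "esm:attributes" to its vocabulary href, or None."""
    hrefs = {}
    for asset in collection.get("assets", {}).values():
        if "esm-vocabulary" in asset.get("roles", []):
            hrefs[asset["esm:column_name"]] = asset["href"]
    return {name: hrefs.get(name) for name in collection.get("esm:attributes", [])}


# Trimmed sample input taken from the example above.
collection = {
    "assets": {
        "catalog": {
            "href": "sample-pangeo-cmip6-zarr-stores.csv",
            "type": "text/csv",
            "roles": ["esm-catalog"],
            "esm:column_name": "path",
        },
        "activity_id": {
            "href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json",
            "type": "application/json",
            "roles": ["esm-vocabulary"],
            "esm:column_name": "activity_id",
        },
    },
    "esm:attributes": ["activity_id", "member_id"],
}

for name, href in vocabulary_hrefs(collection).items():
    print(name, href)
```

Keeping the attribute names in their own array means an attribute without a controlled vocabulary still appears in the collection, which a pure assets-based layout could not express without an href.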

There have recently also been some discussions in STAC on how best to integrate things like Zarr. Based on radiantearth/stac-spec#779 I'm working on collection-level assets (a PR is coming in the next few hours), which we'll probably use for the ESM collection extension. There have also been discussions on how we could allow Items to represent "parts" of a Zarr archive, which led to nullable timestamps (see radiantearth/stac-spec#798).

@cholmes

cholmes commented May 5, 2020

I've had some family stuff come up, so I may miss Thursday's meeting completely, and at the very least will likely be in and out. But I don't think I'm core to it - psyched to see what the group comes up with!

@rabernat
Collaborator Author

rabernat commented May 7, 2020

Hi All! I'm looking forward to our little sprint today at 1pm EDT. I suggest we convene briefly at https://whereby.com/pangeo at 1pm to discuss our work plan.

@andersy005
Contributor

Sounds good 👌! I will be there at 1pm.

@m-mohr

m-mohr commented May 7, 2020

Great work today. I went through the example PRs with the new JSON schema in #27 and left comments on how they could validate.

@rabernat
Collaborator Author

Hi Folks--sorry for letting this hang for so long. I'd like to get the PRs merged asap. It seems like the only PR missing is @jhamman's narrative description of the new spec. Am I remembering things correctly?

I have assigned reviewers to all the PRs. Let's get them reviewed, approved, and merged.

@m-mohr

m-mohr commented May 25, 2020

It seems there are some points left for discussion, especially self-contained catalogs (i.e. esm:catalog).

@jhamman

jhamman commented Aug 6, 2020

Just wanted to drop a quick note here to highlight the upcoming STAC sprint (https://medium.com/radiant-earth-insights/join-us-for-stac-sprint-6-our-first-fully-remote-event-28e118a5279c). Might be a good opportunity to push things forward on the ESM spec front.

@cholmes

cholmes commented Aug 12, 2020

Would definitely be great if people could join. I'd really love to get at least a small sample zarr+stac catalog up. May even be able to structure some sort of 'prize' to make that happen, as there are sponsors interested in seeing this happen, and I think it'd be a great test to ensure STAC is ready for 1.0

@m-mohr

m-mohr commented Aug 12, 2020

I'll be available the first and last day of the data sprint until around 11pm CEST, if you need me for anything.

@andersy005
Contributor

andersy005 commented Aug 12, 2020

@m-mohr, I plan to be at the sprint (except when I have meetings at work). Happy to help with getting what we started in #27 done at the sprint.

@rabernat
Collaborator Author

rabernat commented Sep 2, 2020

I'm curious how this issue has progressed. Are we any closer to being able to catalog our cloud-based data in STAC? Is there a way I can help?
