Refactor datastore to use STAC #107

rabernat · 2020-05-22T14:52:09Z

When we first started this project, it was not feasible to use STAC for our master catalog, so we went with intake. However, things have changed:

esm-collection-spec is on track to be represented in STAC collections (see "Roadmap for merging with STAC", NCAR/esm-collection-spec#21).
Zarr can now be represented in STAC collections (see "Zarr Extension?", radiantearth/stac-spec#781), although many question marks remain.

So we should now be able to refactor our catalog around STAC.

The steps would be as follows:

Read and understand the STAC Spec. This means understanding the core concepts of catalogs, collections, and items (we likely won't have any items), plus relevant extensions.
Write a markdown document which describes the proposed structure of the Pangeo STAC catalog.
Translate the existing intake catalog to STAC. I would recommend writing a script for doing this, rather than doing it manually. This should probably live in a new repo.
Manually add any missing fields that the script could not fill in.
Validate the catalog using the STAC validator.
Ensure that intake-stac can crawl the catalog and load all the relevant assets. (Likely will involve some PRs to intake-stac.)
Change existing example notebooks (e.g. from gallery.pangeo.io) to use intake-stac.
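The translation step above could be sketched roughly as follows. This is only an illustration using plain dictionaries and the standard library: the intake entry, bucket path, license value, extent bounds, and the `intake_entry_to_collection` helper are all hypothetical, and the field layout assumes a STAC 1.0-style Collection with collection-level assets rather than whatever spec version the catalog ends up targeting.

```python
import json

# Hypothetical intake entry, shaped like one parsed item from a catalog's
# `sources` mapping. The name and bucket path are made up for illustration.
INTAKE_ENTRY = {
    "name": "example_dataset",
    "description": "Example Zarr dataset",
    "driver": "zarr",
    "args": {"urlpath": "gs://example-bucket/example.zarr", "consolidated": True},
}


def intake_entry_to_collection(entry, license="proprietary"):
    """Build a STAC-Collection-like dict from one intake entry (sketch)."""
    return {
        "type": "Collection",
        "stac_version": "1.0.0",  # assumption: target spec version
        "id": entry["name"],
        "description": entry["description"],
        "license": license,  # placeholder; real values are added manually
        "extent": {
            # Arbitrary placeholder bounds: whole globe, open-ended in time.
            "spatial": {"bbox": [[-180.0, -90.0, 180.0, 90.0]]},
            "temporal": {"interval": [["1900-01-01T00:00:00Z", None]]},
        },
        "links": [],
        "assets": {
            "zarr": {"href": entry["args"]["urlpath"], "roles": ["data"]},
        },
    }


collection = intake_entry_to_collection(INTAKE_ENTRY)
print(json.dumps(collection, indent=2))
```

Fields that intake does not carry (license, providers, real extents) are left as placeholders here, which is where the manual step would come in.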

This is a lot of work, but I think it is a clear path.

Once we have our catalog in STAC, we can then think about how to re-design the catalog website.

charlesbluca · 2020-05-27T18:55:41Z

Looking into the documentation, I'm thinking that a STAC Collection would work well for any catalog directly containing a Zarr store or ESM collection, since these would likely have some licensing/provider information that could be placed under license and providers. However, to my knowledge, there isn't a way to specify which data a given license/provider applies to, so catalogs like camels.yaml may need to be split up by cloud provider.

extent is also an issue: I'd imagine some of the data we want to group together has radically varying spatial/temporal bounds. There seems to be some push to redefine this field, though, so we could go with your suggestion and just define arbitrary bounds for now.
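As an alternative to arbitrary bounds, one option would be to compute an overall extent that covers every grouped dataset. The sketch below is illustrative only: the bounding boxes and date ranges are made up, the helper names are hypothetical, and it ignores boxes that cross the antimeridian. It also assumes the STAC 1.0 convention in which the first bbox/interval describes the overall extent and later entries give more precise sub-extents.

```python
def union_bbox(bboxes):
    """Smallest [west, south, east, north] box covering all input boxes."""
    wests, souths, easts, norths = zip(*bboxes)
    return [min(wests), min(souths), max(easts), max(norths)]


def union_interval(intervals):
    """Earliest start and latest end.

    Assumes closed intervals given as ISO 8601 UTC strings in one format,
    which sort correctly as plain text.
    """
    starts, ends = zip(*intervals)
    return [min(starts), max(ends)]


# Made-up bounds for two datasets with very different coverage.
bboxes = [[-125.0, 24.0, -66.0, 50.0], [5.0, 45.0, 15.0, 55.0]]
intervals = [
    ["1980-01-01T00:00:00Z", "2014-12-31T00:00:00Z"],
    ["2000-01-01T00:00:00Z", "2020-01-01T00:00:00Z"],
]

extent = {
    "spatial": {"bbox": [union_bbox(bboxes)] + bboxes},
    "temporal": {"interval": [union_interval(intervals)] + intervals},
}
print(extent["spatial"]["bbox"][0])       # [-125.0, 24.0, 15.0, 55.0]
print(extent["temporal"]["interval"][0])  # ['1980-01-01T00:00:00Z', '2020-01-01T00:00:00Z']
```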

I now understand that since we are planning to render our Zarr/ESM data as collections, there's no real pressure to group any datasets by license/provider, unless we feel so inclined due to issues like finding a way to easily subset by cloud provider.
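If subsetting by cloud provider does turn out to be useful, one rough way to do it would be to group entries by the URL scheme of their storage path. This is only a sketch: the entry names, bucket paths, the scheme-to-provider mapping, and the `group_by_cloud` helper are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Assumed mapping from URL scheme to a human-readable provider name.
SCHEME_TO_PROVIDER = {"gs": "Google Cloud Storage", "s3": "Amazon S3"}


def group_by_cloud(entries):
    """Group entry names by the cloud provider implied by their URL scheme."""
    groups = defaultdict(list)
    for name, urlpath in entries.items():
        scheme = urlparse(urlpath).scheme
        groups[SCHEME_TO_PROVIDER.get(scheme, "unknown")].append(name)
    return dict(groups)


# Made-up entries standing in for one dataset mirrored on two clouds.
entries = {
    "camels_gcs": "gs://example-bucket/camels.zarr",
    "camels_aws": "s3://example-bucket/camels.zarr",
}
by_cloud = group_by_cloud(entries)
print(by_cloud)
```

Each group could then become its own collection with a matching providers entry, or the provider name could simply be stored as a keyword for filtering.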

Beyond those issues, I think an absolute published catalog/collection seems like the best equivalent to our current catalog setup, since our data is in a different location from the catalogs themselves.
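To illustrate what publishing with absolute links would involve, the sketch below resolves every link href in a catalog against the URL where the catalog file is served. The base URL, the link targets, and the `absolutize_links` helper are hypothetical; only `urllib.parse.urljoin` from the standard library is used.

```python
from urllib.parse import urljoin


def absolutize_links(stac_object, self_href):
    """Return a copy of a STAC-like dict with every link href made absolute."""
    resolved = dict(stac_object)
    resolved["links"] = [
        {**link, "href": urljoin(self_href, link["href"])}
        for link in stac_object.get("links", [])
    ]
    return resolved


# Made-up catalog with one relative and one already-absolute link.
catalog = {
    "id": "example-root",
    "links": [
        {"rel": "child", "href": "./hydro/collection.json"},
        {"rel": "root", "href": "https://example.com/stac/catalog.json"},
    ],
}
published = absolutize_links(catalog, "https://example.com/stac/catalog.json")
for link in published["links"]:
    print(link["rel"], link["href"])
```

Links that are already absolute pass through unchanged, so the helper can be run over a catalog that mixes both styles.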

charlesbluca · 2020-05-28T18:38:32Z

Made a repo containing the WIP catalogs here:

https://github.com/charlesbluca/pangeo-datastore-stac

I'm drafting up a README for the repo now that can double as a proposal of the catalog's structure, and I included the script used to generate the catalogs.

github-actions · 2020-12-09T12:18:15Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions · 2020-12-16T12:19:08Z

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

github-actions bot added the Stale label Dec 9, 2020

github-actions bot closed this as completed Dec 16, 2020
