Refactor datastore to use STAC #107

Closed
3 of 7 tasks
rabernat opened this issue May 22, 2020 · 4 comments

@rabernat
Member

rabernat commented May 22, 2020

When we first started this project, it was not feasible to use STAC for our master catalog, so we went with intake. However, things have changed:

So we should now be able to refactor our catalog around STAC.

The steps would be as follows:

  • Read and understand the STAC Spec. This means understanding the core concepts of catalogs, collections, and items (we likely won't have any items), plus relevant extensions.
  • Write a markdown document which describes the proposed structure of the Pangeo STAC catalog.
  • Translate the existing intake catalog to STAC. I would recommend writing a script for doing this, rather than doing it manually. This should probably live in a new repo.
  • Manually add any extra missing fields.
  • Validate the catalog using the STAC validator
  • Ensure that intake-stac can crawl the catalog and load all the relevant assets. (Likely will involve some PRs to intake-stac.)
  • Change existing example notebooks (e.g. from gallery.pangeo.io) to use intake-stac.
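
The translation step in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the script that was eventually written: it uses the standard library alone, the `intake_entry_to_collection` helper and the sample entry (name, URL, description) are hypothetical, and the placeholder extent and the `proprietary` license fallback are assumptions to be replaced with real metadata. The field layout follows the STAC 1.0.0 Collection spec; other versions differ slightly.

```python
import json

STAC_VERSION = "1.0.0"  # assumption: the spec version being targeted

# Arbitrary placeholder bounds: whole globe, open-ended end time.
PLACEHOLDER_BBOX = [-180.0, -90.0, 180.0, 90.0]
PLACEHOLDER_INTERVAL = ["1900-01-01T00:00:00Z", None]


def intake_entry_to_collection(name, entry):
    """Convert one parsed intake source entry (a dict) to a STAC Collection dict."""
    metadata = entry.get("metadata", {})
    return {
        "type": "Collection",
        "stac_version": STAC_VERSION,
        "id": name,
        "description": entry.get("description") or name,
        # Fallback used when the intake metadata carries no license field.
        "license": metadata.get("license", "proprietary"),
        "extent": {
            "spatial": {"bbox": [PLACEHOLDER_BBOX]},
            "temporal": {"interval": [PLACEHOLDER_INTERVAL]},
        },
        "links": [],
        # The data itself is attached as a collection-level asset.
        "assets": {
            "zarr": {
                "href": entry["args"]["urlpath"],
                "title": f"{name} ({entry['driver']})",
                "roles": ["data"],
            }
        },
    }


# Hypothetical entry, shaped like one item under `sources:` in an intake YAML catalog.
example_entry = {
    "driver": "zarr",
    "description": "Example dataset stored as Zarr",
    "args": {"urlpath": "gs://example-bucket/example.zarr"},
    "metadata": {},
}

collection = intake_entry_to_collection("example_dataset", example_entry)
print(json.dumps(collection, indent=2))
```

Keeping the output as plain dictionaries makes it easy to dump each collection to JSON and run it through the STAC validator afterwards.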

This is a lot of work, but I think it is a clear path.

Once we have our catalog in STAC, we can then think about how to re-design the catalog website.

@charlesbluca
Member

charlesbluca commented May 27, 2020

Looking into the documentation, I'm thinking that a STAC Collection would work well for any catalog directly containing a Zarr store or ESM collection, since these would likely have some licensing/provider information that could be placed under license and providers. However, to my knowledge, there isn't a way to specify which data corresponds to which license/provider, so catalogs like camels.yaml may need to be split up by cloud provider.

extent is also an issue: I'd imagine some of the data we want to group together has radically varying spatial/temporal bounds. I'm aware that there seems to be some push to redefine this, though, so we could go with your suggestion and just define arbitrary bounds for now.
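
As an alternative to arbitrary bounds, the extent of a grouped collection could be taken as the union of its members' extents. The sketch below assumes each bounding box is `[west, south, east, north]` and does not cross the antimeridian, and that timestamps are UTC ISO 8601 strings in one consistent format (so string comparison matches chronological order). The helper names and sample values are hypothetical.

```python
def union_bbox(bboxes):
    """Smallest [west, south, east, north] box containing every input box."""
    wests, souths, easts, norths = zip(*bboxes)
    return [min(wests), min(souths), max(easts), max(norths)]


def union_interval(intervals):
    """Earliest start and latest end; None marks an open-ended side."""
    starts = [start for start, _ in intervals]
    ends = [end for _, end in intervals]
    start = None if None in starts else min(starts)
    end = None if None in ends else max(ends)
    return [start, end]


# Hypothetical member datasets with very different bounds.
member_bboxes = [
    [-125.0, 24.0, -66.0, 50.0],
    [-10.0, 35.0, 40.0, 70.0],
]
member_intervals = [
    ["1980-01-01T00:00:00Z", "2014-12-31T00:00:00Z"],
    ["2000-01-01T00:00:00Z", None],
]

extent = {
    "spatial": {"bbox": [union_bbox(member_bboxes)]},
    "temporal": {"interval": [union_interval(member_intervals)]},
}
print(extent)
```

The union can be much larger than any single dataset, so it says little about where data actually exists; it only guarantees that nothing in the group falls outside the declared bounds.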

I now understand that since we are planning to render our Zarr/ESM data as collections, there's no real pressure to group any datasets by license/provider, unless we feel so inclined due to issues like finding a way to easily subset by cloud provider.

Beyond those issues, I think an absolute published catalog/collection seems like the best equivalent to our current catalog setup, since our data is in a different location from the catalogs themselves.
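
For illustration, an absolute published catalog is one where every link `href` is a full URL rather than a path relative to the catalog file. A minimal sketch of resolving relative links against the catalog's own location, using only `urllib.parse.urljoin` from the standard library; the `make_links_absolute` helper, the `example.com` URLs, and the file layout are all hypothetical.

```python
from urllib.parse import urljoin


def make_links_absolute(stac_object, self_href):
    """Return a copy of a STAC dict whose link hrefs are resolved against self_href."""
    resolved = dict(stac_object)
    resolved["links"] = [
        {**link, "href": urljoin(self_href, link["href"])}
        for link in stac_object.get("links", [])
    ]
    return resolved


# Hypothetical root catalog with one relative and one already-absolute child link.
catalog = {
    "type": "Catalog",
    "stac_version": "1.0.0",
    "id": "example-root",
    "description": "Example root catalog",
    "links": [
        {"rel": "child", "href": "./ocean/collection.json"},
        {"rel": "child", "href": "https://other.example.com/atmosphere/collection.json"},
    ],
}

absolute = make_links_absolute(catalog, "https://example.com/stac/catalog.json")
for link in absolute["links"]:
    print(link["rel"], link["href"])
```

Links that are already absolute pass through `urljoin` unchanged, so the helper can be applied uniformly to every catalog and collection file.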

@charlesbluca
Member

Made a repo containing the WIP catalogs here:

https://github.com/charlesbluca/pangeo-datastore-stac

I'm drafting up a README for the repo now that can double as a proposal of the catalog's structure, and I included the script used to generate the catalogs.

@github-actions

github-actions bot commented Dec 9, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the Stale label Dec 9, 2020
@github-actions

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
