Kerchunk! Bless you #8
Replies: 7 comments 8 replies
-
Hopefully this is of interest to other participants. Appreciate any feedback to help shape this into something that is useful for the community!
-
As someone who has done their own dabbling in kerchunk indexing, I'd suggest starting to explore from the Pangeo-Forge side of things. While Kerchunk's capabilities are expanding, it's not always the right choice for every dataset (not that Pangeo-Forge is either). Pangeo-Forge also provides the closest thing there currently is to a common place for Kerchunk indexes. I think it's also an easier way to get to a good mental model of what kerchunk does, rather than diving directly into it.
-
Kerchunk allows any collection of scientific-format files to be as performant as they can be on the Cloud, and also provides an easy way to make non-CF-compliant datasets compliant. But it doesn't reformat or rechunk data, so if your files have tiny chunks (e.g. 100k chunks or something), or the chunk shapes are such that you need to read 100,000 GRIB files to extract a time series at a point, it will be slow and you need to rechunk to improve performance. @abkfenris - Is that what you meant when you said "it's not always the right choice for every dataset"?
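For that rechunking route, a minimal sketch (the reference file path, dimension names and chunk sizes below are illustrative, not from this thread) is to open the kerchunked dataset lazily and write a copy with chunks that suit the access pattern:

```python
import fsspec
import xarray as xr

# Open an existing kerchunk index lazily (hypothetical reference file and bucket)
fs = fsspec.filesystem(
    "reference",
    fo="s3://some-bucket/combined.json",      # where the references live
    target_options={"anon": True},            # options for reading the reference file
    remote_protocol="s3",
    remote_options={"anon": True},            # options for reading the data chunks
)
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False})

# Rewrite with chunks that are long in time so point time-series reads touch few objects
ds = ds.chunk({"time": -1, "lat": 128, "lon": 128})    # illustrative chunk sizes
for var in ds.variables:
    ds[var].encoding.pop("chunks", None)                # drop chunking inherited from the sources
ds.to_zarr("rechunked.zarr", mode="w")                  # local target; could equally be object storage
```

For datasets too large to stream through one machine comfortably, the rechunker library does the same job with bounded memory, but the idea is the same: a one-off copy buys the chunk layout that kerchunk alone can't give you.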
-
OK, so we have a repository: https://github.com/oceanhackweek/ohw22-proj-kerchunk After some digging around on AWS, it turns out there is an L3 NOAA gridded SST dataset from the Himawari geostationary satellite available that is an excellent candidate for kerchunk. @martindurant I noticed your comment specifically in relation to subsetting and previous discussions about using parquet as a container for references. The Himawari dataset is hourly, near-hemispheric data at 2 km resolution, so many, many chunks. At this point I'm going to reuse the RefZarrStackSource intake driver from intake-aodn, but I wonder what you had in mind for subsetting. Maybe adding additional columns to the parquet that could be filtered before instantiating the ReferenceFilesystem?
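To make that last idea concrete, here is a rough sketch of filtering a parquet reference table before building the filesystem. The parquet file and the column names (key, url, offset, size, time) are hypothetical, and a real version would also need to trim the time coordinate/.zarray metadata to match the subset and handle inlined references:

```python
import fsspec
import pandas as pd
import xarray as xr

# Hypothetical layout: one row per reference, with a "time" column parsed from the chunk key;
# metadata keys (.zgroup/.zarray/.zattrs) are assumed to carry a null time so they are kept.
df = pd.read_parquet("himawari_refs.parquet")
keep = df[df.time.isna() | ((df.time >= "2022-01-01") & (df.time < "2022-02-01"))]

# Rebuild the in-memory reference dict the reference filesystem expects
refs = {
    "version": 1,
    "refs": {row.key: [row.url, row.offset, row.size] for row in keep.itertuples()},
}

fs = fsspec.filesystem("reference", fo=refs, remote_protocol="s3", remote_options={"anon": True})
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False})
```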
-
I'm not certain which point you meant by "subsetting"; there are a few things I want to get done:
Note that preffs already has a parquet implementation for referencefilesystem, but without lazy loading or filtering. I intend to build off this, but I can't promise when. It might work well for the >100k references that would be required here, but I have found that Zstd compression is pretty good for file size and load speed with JSON too. How you access the data chunks is another matter, and workflow-dependent.
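For reference, a minimal sketch of the Zstd-compressed-JSON route (file names are placeholders, and `combined` stands in for a real reference set such as the output of `MultiZarrToZarr(...).translate()`):

```python
import json
import fsspec
import zstandard as zstd

combined = {"version": 1, "refs": {}}   # placeholder for a real kerchunk reference dict

# Write the references as Zstandard-compressed JSON
with open("combined.json.zst", "wb") as f:
    f.write(zstd.ZstdCompressor(level=9).compress(json.dumps(combined).encode()))

# Read them back and hand the decoded dict straight to the reference filesystem
with open("combined.json.zst", "rb") as f:
    refs = json.loads(zstd.ZstdDecompressor().decompress(f.read()))

fs = fsspec.filesystem("reference", fo=refs, remote_protocol="s3", remote_options={"anon": True})
```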
-
@martindurant I'm working on getting templated access to lots of small netcdf files using kerchunk. Making the references is quick, and it allows me to do things like find all the files that were manually QC'd without fetching each file, because I can check the reference file attrs! Also I can see what variable names are in them, etc. So I have zipped them into a zipfile, but I'm struggling with a way to template the name of the reference file that I want out of the zip using an intake catalogue.
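One possible (untested) approach, sketched below with hypothetical bucket, zip and file names, is to use fsspec URL chaining so the reference filesystem reads a single JSON member straight out of the zip; the same chained string could then be templated in an intake catalogue entry, e.g. `fo: "zip://{{ name }}.json::s3://my-bucket/refs.zip"` with `name` as a user parameter:

```python
import fsspec
import xarray as xr

name = "IMOS_mooring_NRSMAI"   # hypothetical identifier for one reference file in the zip
fo = f"zip://{name}.json::s3://my-bucket/refs.zip"   # chained URL: JSON member inside a zip on S3

fs = fsspec.filesystem(
    "reference",
    fo=fo,
    target_options={"s3": {"anon": True}},   # options for fetching refs.zip itself
    remote_protocol="s3",
    remote_options={"anon": True},           # options for the data chunks the references point at
)
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False})
```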
-
Hey Paul,
I just pointed the reference maker at the moorings .nc files and it works well and is quick, but I guess the issue is handling the multiple files that do not concatenate? If you've got time for a quick chat I can explain
Nick
> (quoting Paul Branson) This dataset also seems an excellent (simple) candidate for pangeo-forge too
-
Title
Kerchunk! Bless you - indexing your way to enhanced analysis of someone else's big data
Summary
So you found an open data bucket with a heap of data you want to analyse. Maybe it is dense grids of numerical model data at global scale in NetCDF format, or a stack of daily satellite earth observation TIFF files. If only you could just open them all as if they were one dataset... Well, maybe you can! The Kerchunk library builds on top of the fsspec library and provides an interface via Zarr to create an xarray dataset overlay that can span many files stored in object storage.
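A minimal sketch of that workflow for NetCDF4/HDF5 files (bucket, file names and chunking options below are placeholders):

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# Hypothetical NetCDF4/HDF5 files in a public bucket
urls = ["s3://some-bucket/sst_2022010100.nc", "s3://some-bucket/sst_2022010101.nc"]
so = dict(anon=True, default_fill_cache=False, default_cache_type="first")

# Scan each file once, recording the byte ranges of its internal chunks
refs = []
for u in urls:
    with fsspec.open(u, "rb", **so) as f:
        refs.append(SingleHdf5ToZarr(f, u, inline_threshold=300).translate())

# Combine the per-file references along the time dimension
combined = MultiZarrToZarr(
    refs, concat_dims=["time"], remote_protocol="s3", remote_options={"anon": True}
).translate()

# Open the whole collection lazily as a single dataset via the Zarr engine
fs = fsspec.filesystem("reference", fo=combined, remote_protocol="s3", remote_options={"anon": True})
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False})
```

The `combined` dict is the shareable index: written out as JSON it can be reused by anyone without rescanning the source files.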
Personnel
Paul Branson @pbranson
+++
Specific tasks
Data sets and infrastructure support
A number of awesome NOAA datasets are freely available in the us-east-1 AWS region (see the listing snippet below the links):
https://registry.opendata.aws/noaa-goes/
https://registry.opendata.aws/noaa-gefs-reforecast/
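For example, these buckets can be browsed anonymously with fsspec/s3fs; the GOES-16 prefix below is illustrative, and other products follow similar year/day-of-year/hour layouts:

```python
import fsspec

fs = fsspec.filesystem("s3", anon=True)          # anonymous access to the open-data buckets
print(fs.ls("noaa-goes16/")[:10])                # top-level product prefixes, e.g. ABI-L2-SSTF
print(fs.glob("noaa-goes16/ABI-L2-SSTF/2022/001/00/*.nc")[:5])   # one hour of L2 SST granules
```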
Libraries (pip installable):
kerchunk
intake
fsspec
xarray
zarr
h5netcdf
rasterio
cfgrib
gribscan
git+https://github.com/TomAugspurger/cogrib
The problem
Analysing a large stack of dense array data stored in open data buckets with traditional libraries can be hard, slow and sometimes not even possible. Whilst in an ideal world datasets would be published in an Analysis Ready format, this is frequently not the case, and it is not universally possible because the variety of access patterns may preclude any notionally ideal chunking layout.
Given that these datasets are large and owned by someone else, and that many researchers have limited organisational capacity to rechunk and mirror such large datasets, solutions that enhance the ability to analyse published datasets 'as-is' are valuable. Recent examples (here and here) of such an approach are spurring a flurry of activity, with people 'kerchunk'-ing available datasets as evidenced by the burgeoning kerchunk issues list!
But once a dataset has been kerchunk-ed, others can reuse that index - the indexes typically compress considerably into small (<100 MB) files that can be shared for others to reuse or fed into a Pangeo-Forge recipe.
This project will dive into using kerchunk, making some indexes, and brainstorming about platforms for sharing them, perhaps on the InterPlanetary File System (cc @d70-t).
Application example
Excellent examples that analyse geostationary SST featured at OHW2020:
https://nbviewer.org/github/oceanhackweek/ohw-tutorials/blob/OHW20/10-satellite-data-access/goes-cmp-netcdf-zarr.ipynb
And more recently using kerchunk:
https://medium.com/pangeo/fake-it-until-you-make-it-reading-goes-netcdf4-data-on-aws-s3-as-zarr-for-rapid-data-access-61e33f8fe685
However, the Zarr dataset and kerchunk indexes (which take some effort to build) are not readily available.
Australian dataset examples:
https://github.com/IOMRC/intake-aodn
Existing methods
How would you or others traditionally try to address this problem?
Whilst not exactly 'traditional' and more cutting-edge, Pangeo-Forge allows for the generation of recipes to re-publish datasets as analysis-ready Zarr stores.
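For comparison, a rough sketch of what such a recipe looks like with the pangeo-forge-recipes API current at the time of writing (URLs, file cadence and chunking are placeholders; actually running it also needs a storage/executor configuration):

```python
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Hypothetical hourly files to be concatenated along time
urls = [f"s3://some-bucket/sst_2022-01-01T{hour:02d}.nc" for hour in range(24)]
pattern = pattern_from_file_sequence(urls, concat_dim="time", nitems_per_file=1)

# Unlike kerchunk, this copies and rewrites the data into an analysis-ready Zarr store
recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 24})
```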
Proposed methods/tools
Building from what you learn at Oceanhackweek, what new approaches would you like to try to implement?
Contribute examples to the kerchunk documentation of index creation and dataset analysis with a published index.
GRIB-related issues (a minimal scanning sketch follows the links):
fsspec/kerchunk#150
fsspec/kerchunk#127
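As a starting point for the GRIB side, a hedged sketch using kerchunk's GRIB2 scanner (the file path is a placeholder; recent kerchunk versions return one reference set per GRIB message group, which is part of what the issues above discuss):

```python
import fsspec
import xarray as xr
from kerchunk.grib2 import scan_grib

# Hypothetical GRIB2 file, e.g. from the GEFS reforecast bucket
url = "s3://some-bucket/forecast.grib2"
groups = scan_grib(url, storage_options={"anon": True})

# Inspect a single message group; combining many files/messages is where
# MultiZarrToZarr (and the linked issues) come in
fs = fsspec.filesystem("reference", fo=groups[0], remote_protocol="s3", remote_options={"anon": True})
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False})
```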