Add rechunking example #47

Merged
merged 3 commits into from
May 23, 2024

Conversation

jrbourbeau
Member

This example reads in 1 TB worth of NVM data, rechunks it to be optimized for time selections, and then writes the rechunked dataset to S3 (in oss-scratch-space in us-east-1).

cc @mrocklin. Happy to keep iterating

@mrocklin
Member

Playing now. Neat. Some thoughts!

  1. It might make sense to arrange the data to make spatial access cheap.

    I think that the most common situation I've heard from people is "My satellite pumps out one file every day/hour, so it's organized by time, but I want it organized spatially, so that I can pick out a timeseries for a lat/lon pair really easily."

  2. Maybe at the end we can open up the data with just zarr/xarray without Dask, and show that it's really cheap to get these timeseries, for example from a web application (what they all seem to want to do). I'm actually a little curious about sub-chunk access times. It may be that we want to store the zarr array with far finer chunking than Dask would want, so that we're not reading a bunch of neighboring lat/lon pairs at once. Maybe Xarray does this by default, but maybe not. My hope is that we could show ~100ms access times for tiny timeseries.

  3. Thoughts on combining this into the geospatial notebook? I can imagine that in many cases it'll be nice to go from one example to the next, and I wouldn't mind consolidating example notebooks a little.
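
To make the sub-chunk access concern in point 2 concrete, here is a rough back-of-the-envelope sketch. It assumes a reader has to fetch every chunk that overlaps the requested point in full, and it ignores compression. The array shape, the three chunk layouts, and the `point_timeseries_cost` helper are all hypothetical and are not taken from the notebook.

```python
from math import ceil, prod


def point_timeseries_cost(shape, chunks, itemsize=4):
    """Estimate the cost of reading all time steps at a single (y, x) point.

    shape and chunks are (time, y, x) tuples. The result is
    (chunks_read, bytes_read), assuming every overlapping chunk is fetched
    whole and uncompressed (an upper bound, not a measurement).
    """
    n_time = shape[0]
    # A single point falls inside exactly one chunk along y and along x,
    # so only the number of chunks along the time axis matters.
    chunks_read = ceil(n_time / chunks[0])
    bytes_read = chunks_read * prod(chunks) * itemsize
    return chunks_read, bytes_read


# Hypothetical dataset: one year of hourly float32 data on a 1000 x 1000 grid.
shape = (8760, 1000, 1000)

# Hypothetical layouts, for illustration only.
layouts = {
    "one chunk per time step": (1, 1000, 1000),
    "full time axis, 100 x 100 tiles": (8760, 100, 100),
    "full time axis, 10 x 10 tiles": (8760, 10, 10),
}

for name, chunks in layouts.items():
    n_chunks, n_bytes = point_timeseries_cost(shape, chunks)
    print(f"{name}: {n_chunks} chunks, {n_bytes / 1e6:,.1f} MB")
```

Under these assumptions, the time-organized layout touches one chunk per time step, while the spatially tiled layouts touch a single chunk whose size shrinks with the tile. Actual latency also depends on compression, request overhead, and the object store, so the ~100ms target would still need to be measured.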

@mrocklin
Member

Oh, I guess the rechunking isn't very impressive though, because it's mostly chunked in this way already ...

Maybe we stick with time-optimized then, but maybe some of the other feedback still holds?

@jrbourbeau
Member Author

@mrocklin you made some changes offline to this notebook -- want to push up those changes here, or to a different PR (whichever is easiest)?

@mrocklin
Member

I've merged your rechunk example into the xarray example.

@mrocklin mrocklin merged commit 1094cf7 into main May 23, 2024
3 checks passed
@mrocklin mrocklin deleted the rechunking branch May 23, 2024 15:26
@jrbourbeau
Member Author

Thanks @mrocklin -- I pushed up one minor update in #49
