New chunking tutorial #442

adele-morrison · 2024-08-15T02:03:37Z

In @Thomas-Moore-Creative's COSIMA talk, there was a lot of interest in having a tutorial in the Recipes showing best practice for chunking for different types of problems.

For example: which dimensions to chunk along for a given problem, how big the chunks should be, etc.

This might be a good starting point to base this on:
https://access-nri-intake-catalog.readthedocs.io/en/latest/usage/chunking.html

Thomas-Moore-Creative · 2024-08-15T03:54:18Z

@adele-morrison - this is a really good opportunity to talk about community and others' contributions. It's a rather circular path of influences. In 2017 it was, as I understand it, COSIMA that brought @jmunroe to Australia, and he then helped @dougiesquire and me get Pangeo-style workflows deployed on CSIRO HPC. We became very interested in the importance of chunking and chunking strategies as we had a very large (and large-ensemble) dataset to deal with. @dougiesquire took a proactive lead on testing strategies for rechunking and the Zarr format, and I learned from that effort. It's very appropriate that you link @dougiesquire's basic notes on chunking.

What would be great is if someone at COSIMA had a real problem that all those interested could work through. The solutions are general but the details really matter (IMO) and going through the process together with a real problem might be a good way to start?

Thomas-Moore-Creative · 2024-08-15T05:20:39Z

@ongqingyee, @jemmajeffree, et al

Do you have any current problems that could be tackled jointly to build up a first example? I'm attempting to source a small chunk of /scratch we could use collectively to build up some rechunking / ARD workflows together.

jemmajeffree · 2024-08-15T22:39:00Z

I agree that focussing on a specific problem is important. I don't think I have any current problems that would be useful for this. I'm currently working with 2D fields in large ensembles, and have thoroughly optimised importing these with xarray, but they were already chunked in a useful dimension anyway. I think, given the COSIMA output is currently chunked {time:1}, then the best example is probably something through time.
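To make the `{time:1}` point concrete, here is a rough back-of-the-envelope sketch in plain Python (no xarray or dask needed). The grid dimensions and the alternative "through time" chunking are hypothetical and chosen only for illustration; the only thing taken from this thread is that the output is chunked one time step per chunk. The helper `chunks_touched` is likewise an illustrative name, not part of any library. It counts how many chunks have to be opened for two access patterns: a time series at a single grid point, and a full map at a single time.

```python
import math

def chunks_touched(selection, chunks):
    """Count chunks overlapping a contiguous selection that starts at index 0.

    selection: dict of dim -> number of elements read along that dim
    chunks:    dict of dim -> chunk length along that dim
    """
    n = 1
    for dim, length in selection.items():
        n *= math.ceil(length / chunks[dim])
    return n

# Hypothetical dataset: 10 years of daily 2D fields (all sizes are assumptions).
shape = {"time": 3650, "y": 1080, "x": 1440}

# Two access patterns.
point_series = {"time": shape["time"], "y": 1, "x": 1}      # all times, one cell
snapshot = {"time": 1, "y": shape["y"], "x": shape["x"]}    # one time, full map

# Two chunk layouts.
native = {"time": 1, "y": 1080, "x": 1440}          # one time step per chunk
through_time = {"time": 3650, "y": 108, "x": 144}   # hypothetical rechunked layout

n_native = chunks_touched(point_series, native)
n_rechunked = chunks_touched(point_series, through_time)
n_snapshot_native = chunks_touched(snapshot, native)
n_snapshot_rechunked = chunks_touched(snapshot, through_time)

print("point time series:", n_native, "vs", n_rechunked, "chunks")
print("single-time map:  ", n_snapshot_native, "vs", n_snapshot_rechunked, "chunks")
```

Under these assumptions the time series touches every chunk of the native layout but only one chunk of the rechunked layout, while the single-time map shows the opposite trade-off. That asymmetry is why an operation "through time" makes a natural tutorial example.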

With 2D fields and <250 yrs of monthly data, my approach is usually just to haul the whole thing into memory and then rechunk for analysis, so to see a difference we'd probably need to either use daily data or deliberately do the rechunking as a separate step on only a few cores.
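As a rough check on when "haul it into memory" stops being an option, the arithmetic can be sketched as below. The 1080 x 1440 grid, the float32 storage, and the 365-day year are assumptions for illustration only; none of them come from this thread, and `field_size_gib` is a hypothetical helper name.

```python
def field_size_gib(n_time, ny, nx, bytes_per_value=4):
    """Uncompressed in-memory size of a (time, y, x) array in GiB."""
    return n_time * ny * nx * bytes_per_value / 2**30

# 250 years of monthly 2D fields on a hypothetical 1080 x 1440 grid, float32.
monthly = field_size_gib(250 * 12, 1080, 1440)

# The same period at daily frequency (365-day years assumed).
daily = field_size_gib(250 * 365, 1080, 1440)

print(f"monthly: {monthly:.1f} GiB, daily: {daily:.1f} GiB")
```

With these assumed sizes the monthly record is in the tens of GiB, which can fit in memory on a large node, while the daily record is in the hundreds of GiB, which is where a deliberate rechunking step starts to matter.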

I'd be interested in helping develop this tutorial, but I'm going to be a bit slow and unreliable while I'm still building up the courage to engage with the COSIMA GitHub.

ongqingyee · 2024-08-16T01:28:57Z

> @ongqingyee, @jemmajeffree, et al
>
> Do you have any current problems that could be tackled jointly to build up a first example? I'm attempting to source a small chunk of /scratch we could use collectively to build up some rechunking / ARD workflows together.

I have a simple example that can be good. It involves masking out a region around the Antarctic margin and calculating a circumpolar averaged surface speed. In my experience I found that masking and integrating on an xgcm grid makes applying chunking trickier. I'm happy to put together a draft notebook to start and changes/additions can be made? @Thomas-Moore-Creative @jemmajeffree
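The actual notebook is not shown in this thread, so the following is only a minimal sketch of the kind of reduction described: an area-weighted mean of surface speed over a masked region. Plain Python lists stand in for the gridded arrays to keep it dependency-free; it does not use xgcm, and the function name, the grid values, and the mask are all made up for illustration.

```python
import math

def masked_mean_speed(u, v, area, mask):
    """Area-weighted mean of sqrt(u**2 + v**2) over cells where mask is True."""
    weighted_sum = 0.0
    total_area = 0.0
    for j, row in enumerate(mask):
        for i, inside in enumerate(row):
            if inside:
                weighted_sum += math.hypot(u[j][i], v[j][i]) * area[j][i]
                total_area += area[j][i]
    return weighted_sum / total_area

# Tiny made-up 2 x 3 grid; all values are illustrative only.
u = [[3.0, 0.0, 1.0], [0.0, 6.0, 2.0]]
v = [[4.0, 1.0, 1.0], [2.0, 8.0, 2.0]]
area = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
mask = [[True, False, False], [False, True, False]]  # stand-in for the margin region

mean_speed = masked_mean_speed(u, v, area, mask)
print(mean_speed)
```

In this simplified form the reduction collapses both horizontal dimensions, so each time step can be processed independently of the others. The real calculation on an xgcm grid may not decompose this cleanly, which is presumably where the chunking difficulty mentioned above comes in.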

Working on the same notebook through github is new to me though, so help on the logistics of that would be great.

Thomas-Moore-Creative · 2024-08-16T03:03:44Z

> > @ongqingyee, @jemmajeffree, et al
> > Do you have any current problems that could be tackled jointly to build up a first example? I'm attempting to source a small chunk of /scratch we could use collectively to build up some rechunking / ARD workflows together.
>
> I have a simple example that can be good. It involves masking out a region around the Antarctic margin and calculating a circumpolar averaged surface speed. In my experience I found that masking and integrating on an xgcm grid makes applying chunking trickier. I'm happy to put together a draft notebook to start and changes/additions can be made? @Thomas-Moore-Creative @jemmajeffree
>
> Working on the same notebook through github is new to me though, so help on the logistics of that would be great.

I'm a bit of a COSIMA outsider, so others (@navidcy? @anton-seaice? others) might have something to say about where best to put your new example notebook in the repo and what the practice is for branching. FWIW I'd suggest you start a new branch for this issue, and others can then contribute via their own branches off your branch. Again, COSIMA regulars might have other views.

Your problem does seem to have a lot of detail so it would be good to see the code, what the source data is, and the goal for the final output. Thanks.

navidcy · 2024-08-16T03:09:46Z

What do you mean "branching" and "branches of your branch"?
You are referring to repository branches?

The best place for an example is in the recipes directory. Open a PR and submit it. Does this clarify the question above?

Thomas-Moore-Creative · 2024-08-16T03:16:44Z

> What do you mean "branching" and "branches of your branch"? You are referring to repository branches?
>
> The best place for an example is in the recipes directory. Open a PR and submit it. Does this clarify the question above?

Hopefully I haven't already added confusion for @ongqingyee who was looking for more clarity around githubby things. =)

edoddridge · 2024-08-16T03:24:22Z

@Thomas-Moore-Creative is describing some more advanced GitHub techniques than we generally use in this repo.

When someone opens a PR, their suggested changes live on a branch in their repo. If I want to make changes to their PR (and I don't have write access to their PR branch), I can open a pull request against that PR branch. The original PR owner can then merge my PR into their PR, and then we can merge their PR into the main repo.

As an example, you can look at this PR in MITgcm: MITgcm/MITgcm#47
Gael Forget and Erik van Sebille both made pull requests onto my PR branch. Those changes were then incorporated into the PR.

navidcy · 2024-08-16T03:26:27Z

Oh I see what you mean @Thomas-Moore-Creative

Thomas-Moore-Creative · 2024-08-16T03:57:38Z

> Oh I see what you mean @Thomas-Moore-Creative

I think the most important point is that, whatever the GitHub practice is, it's simple enough and/or well enough supported that newbies can engage and make it to the next level of their GitHub life.

ongqingyee · 2024-08-16T04:35:30Z

> What do you mean "branching" and "branches of your branch"? You are referring to repository branches?
>
> The best place for an example is in the recipes directory. Open a PR and submit it. Does this clarify the question above?

This I can do. Thanks all!

adele-morrison added the 🐣 new recipe label Aug 15, 2024

ongqingyee self-assigned this Aug 15, 2024

aekiss mentioned this issue Aug 15, 2024

Analysis-ready chunking of diagnostic output files COSIMA/access-om3#203

Open

Thomas-Moore-Creative self-assigned this Aug 15, 2024
