New chunking tutorial #442
Comments
@adele-morrison - this is a really good opportunity to talk about community and others' contributions. It's a rather circular path of influences. In 2017 it was, as I understand it, COSIMA that brought @jmunroe to Australia, and he then helped @dougiesquire and me get Pangeo-style workflows deployed on CSIRO HPC. We became very interested in chunking and chunking strategies because we had a very large (and large-ensemble) dataset to deal with. @dougiesquire took a proactive lead on testing strategies for rechunking and What would be great is if someone at COSIMA had a real problem that all those interested could work through. The solutions are general but the details really matter (IMO), and going through the process together with a real problem might be a good way to start?
@ongqingyee, @jemmajeffree, et al. Do you have any current problems that could be tackled jointly to build up a first example? I'm attempting to source a small chunk of
I agree that focussing on a specific problem is important. I don't think I have any current problems that would be useful for this. I'm currently working with 2D fields in large ensembles and have thoroughly optimised importing these with xarray, but they were already chunked along a useful dimension anyway. Given that the COSIMA output is currently chunked {time:1}, I think the best example is probably some operation through time. With 2D fields and <250 yrs of monthly data, my approach is usually just to haul the whole thing into memory and then rechunk for analysis, so to make a difference we'd probably need to either use daily data or deliberately restrict ourselves to a few cores and do the rechunking as a separate step. I'd be interested in helping develop this tutorial, but I'm going to be a bit slow and unreliable while I'm still building up the courage to engage with the COSIMA GitHub.
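To make the point about {time:1} chunking concrete, here is a minimal sketch that counts how many chunks a given read has to touch under two layouts. The grid size, the dimension names (`yt_ocean`, `xt_ocean`) and the alternative layout are assumptions chosen for illustration, not values taken from any COSIMA dataset, and `chunks_touched` is a hypothetical helper rather than part of xarray or dask.

```python
def chunks_touched(shape, chunks, selection):
    """Count how many chunks a rectangular selection overlaps.

    shape, chunks: dicts mapping dimension name -> total size / chunk length.
    selection: dict mapping dimension name -> (start, stop) index range;
    dimensions that are not listed are read in full.
    """
    total = 1
    for dim, size in shape.items():
        start, stop = selection.get(dim, (0, size))
        first = start // chunks[dim]       # index of first chunk overlapped
        last = (stop - 1) // chunks[dim]   # index of last chunk overlapped
        total *= last - first + 1
    return total


# Hypothetical daily 2D field: 10 years on a 2700 x 3600 grid.
shape = {"time": 3650, "yt_ocean": 2700, "xt_ocean": 3600}

# One chunk per time step (the layout described above) versus an
# assumed alternative that keeps all of time together in spatial tiles.
as_written = {"time": 1, "yt_ocean": 2700, "xt_ocean": 3600}
for_time_ops = {"time": 3650, "yt_ocean": 270, "xt_ocean": 360}

point_series = {"yt_ocean": (100, 101), "xt_ocean": (200, 201)}  # one grid cell, all times
snapshot = {"time": (0, 1)}                                      # one time step, all cells

for name, layout in [("time:1", as_written), ("time:-1", for_time_ops)]:
    print(name,
          "point series:", chunks_touched(shape, layout, point_series),
          "snapshot:", chunks_touched(shape, layout, snapshot))
```

Under these assumptions a single-point time series touches every chunk of the {time:1} layout but only one chunk of the time-contiguous layout, while a single snapshot shows the opposite trade-off.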
I have a simple example that could work well. It involves masking out a region around the Antarctic margin and calculating a circumpolar-averaged surface speed. In my experience, masking and integrating on an xgcm grid makes applying chunking trickier. I'm happy to put together a draft notebook to start with, and changes/additions can be made from there. @Thomas-Moore-Creative @jemmajeffree Working on the same notebook through GitHub is new to me though, so help on the logistics of that would be great.
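As a rough picture of the kind of calculation described here, the following sketch computes an area-weighted mean surface speed over a masked region using plain NumPy. Everything in it is an assumption for illustration: the synthetic velocities, the regular latitude-longitude grid, the cos(latitude) area weights, and the "south of 65°S" mask all stand in for the real model output, grid metrics and Antarctic-margin mask, and no xgcm grid is involved.

```python
import numpy as np


def masked_area_mean(field, mask, area):
    """Area-weighted mean of `field` over cells where `mask` is True.

    field: array with trailing dimensions (lat, lon).
    mask, area: 2D arrays of shape (lat, lon).
    """
    weights = np.where(mask, area, 0.0)
    return (field * weights).sum(axis=(-2, -1)) / weights.sum()


rng = np.random.default_rng(0)

# Hypothetical regular grid covering the Southern Ocean.
lat = np.linspace(-80.0, -50.0, 31)
lon = np.linspace(0.0, 359.0, 360)
n_time = 12

# Synthetic surface velocities with dimensions (time, lat, lon).
u = rng.normal(size=(n_time, lat.size, lon.size))
v = rng.normal(size=(n_time, lat.size, lon.size))
speed = np.hypot(u, v)

# Placeholder mask: every cell south of 65S, all longitudes.
mask = np.broadcast_to((lat <= -65.0)[:, None], (lat.size, lon.size))

# On a regular lat-lon grid, cell area is proportional to cos(latitude).
area = np.broadcast_to(np.cos(np.deg2rad(lat))[:, None], (lat.size, lon.size))

mean_speed = masked_area_mean(speed, mask, area)
print(mean_speed.shape)  # one value per time step
```

The reduction in this sketch is purely spatial, so each time step is handled independently of the others; how that interacts with chunking on a real xgcm grid is exactly what a draft notebook would need to show.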
I'm a bit of a COSIMA outsider so others (@navidcy? @anton-seaice? others) might have something to say about where best to put your new example notebook in the repo and what the practice is for branching. FWIW I'd suggest you start a new branch for this issue, and others can then contribute via their own branches off your branch? Again, COSIMA regulars might have other views. Your problem does seem to have a lot of detail, so it would be good to see the code, what the source data is, and the goal for the final output. Thanks.
What do you mean by "branching" and "branches off your branch"? The best place for an example is in the
Hopefully I haven't already added confusion for @ongqingyee who was looking for more clarity around githubby things. =)
@Thomas-Moore-Creative is describing some more advanced GitHub techniques than we generally use in this repo. When someone opens a PR, their suggested changes live on a branch in their repo. If I want to make changes to their PR (and I don't have write access to their PR branch), I can open a pull request against the branch behind that PR. The original PR owner can then merge my PR into their PR, and then we can merge their PR into the main repo. As an example, you can look at this PR in MITgcm: MITgcm/MITgcm#47
Oh I see what you mean @Thomas-Moore-Creative
I think the most important point is that, whatever the GitHub practice is, it's simple enough and/or supported enough that newbies can engage and make it to the next level of their GitHub life.
This I can do. Thanks all!
In @Thomas-Moore-Creative's COSIMA talk, there was a lot of interest in having a tutorial in the Recipes showing best practice for chunking for different types of problems.
e.g. which dimensions to chunk along depending on the problem, how big chunks should be, etc.
This might be a good starting point to base this on:
https://access-nri-intake-catalog.readthedocs.io/en/latest/usage/chunking.html
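For the "how big should chunks be" question, a small amount of arithmetic already helps when comparing candidate layouts. The sketch below estimates the in-memory size of one chunk and the total number of chunks for a given layout. The grid shape, the dimension names and both candidate layouts are assumptions for illustration, `describe_chunks` is a hypothetical helper, and the code deliberately does not endorse any particular target chunk size.

```python
import math


def describe_chunks(shape, chunks, itemsize=4):
    """Return (MiB per chunk, number of chunks) for a chunk layout.

    shape: dict mapping dimension name -> total size.
    chunks: dict mapping dimension name -> chunk length; dimensions
        that are not listed are kept whole.
    itemsize: bytes per value (4 for float32, 8 for float64).
    """
    chunk_elems = 1
    n_chunks = 1
    for dim, size in shape.items():
        length = min(chunks.get(dim, size), size)
        chunk_elems *= length
        n_chunks *= math.ceil(size / length)
    return chunk_elems * itemsize / 2**20, n_chunks


# Hypothetical one year of daily 3D output, stored as float32.
shape = {"time": 365, "st_ocean": 75, "yt_ocean": 2700, "xt_ocean": 3600}

candidates = {
    "one chunk per time step": {"time": 1},
    "smaller 3D tiles": {"time": 1, "st_ocean": 25, "yt_ocean": 540, "xt_ocean": 720},
}

for label, layout in candidates.items():
    mib, n = describe_chunks(shape, layout)
    print(f"{label}: {mib:.1f} MiB per chunk, {n} chunks")
```

Larger chunks mean fewer tasks but more memory per task, and smaller chunks the reverse, so the numbers from a helper like this are only a starting point for testing on the actual problem.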