Updating Along-slope-velocities.ipynb to use intake from Cosima Cookbook #458
base: main
Conversation
Updates the notebook to use intake instead of the Cosima Cookbook. Couple of issues:

1. Cosima Cookbook seems to allow searching for coordinates, e.g.
   `st_edges_ocean = cc.querying.getvar(experiment, 'st_edges_ocean', session, start_time=start_time, end_time=end_time, n=1)`
   `st_edges_array = st_edges_ocean.expand_dims({'yu_ocean': u.yu_ocean, 'xu_ocean': u.xu_ocean}, axis=[1, 2])`
   This doesn't seem to be easily achievable with intake (Cell 11 pre-update)? I can't currently figure out how to do this & will keep digging.

2. Updating from `cosima_cookbook` to `intake` seems to massively increase compute time for the `topographic_slope_magnitude` calculation: `UserWarning: Sending large graph of size 47.19 MiB.` becomes `Sending large graph of size 123.9 MiB.`. Not quite sure of the source of issue 2 yet - the discussion in #313 suggests there may have been a few other similar performance hits - will keep digging.

Currently I don't have the right permissions to create a new branch in cosima-recipes - if someone could add me I'll create a new branch for this.
@charles-turner-1 The pattern for selecting out ranges is:
lat_range = slice(-82, -59)
darray = some_intake_catalog_dataset.to_dask()
ht = darray['ht'].sel(yt_ocean=lat_range)
Line #2. cat_filtered = catalog.search(name=EXPERIMENT)
This line can go directly to the experiment by key instead:
cat_filtered = catalog[EXPERIMENT]
Line #18. hu = hu_vars.to_dataset_dict(progress_bar=False).get('ocean_grid.fx').drop(
I suggest splitting this into multiple steps - this makes it clearer what each step is doing, and also makes it easier for those interested to stop the process at a given step and look at what they have.
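A minimal sketch of that split, assuming the same `hu_vars` search object from the notebook; the arguments to the final `.drop()` are elided in the original line, so they stay as a placeholder here:

```python
# Step 1: realise the search results as a dict of xarray Datasets,
# keyed by catalog grouping (e.g. 'ocean_grid.fx').
hu_datasets = hu_vars.to_dataset_dict(progress_bar=False)

# Step 2: pull out the static grid dataset; inspecting hu_datasets.keys()
# at this point shows what else the search returned.
hu_grid = hu_datasets.get('ocean_grid.fx')

# Step 3: drop the unwanted variables (names elided in the original line).
# hu = hu_grid.drop(<variables to drop>)
```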
Line #2. u_vars, v_vars = (esm_datastore.search(variable='u',filename='ocean.nc'),
I have pulled a copy of the notebook and am playing with it now, but my initial thought is, why can't this be pulled into one array that contains both variables?
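If the downstream code can work from a single dataset, a combined search might look like the following - a sketch only, assuming intake-esm's support for list-valued queries and that both variables live in the same files:

```python
# One search for both velocity components instead of two separate ones.
uv_vars = esm_datastore.search(variable=['u', 'v'], filename='ocean.nc')
uv = uv_vars.to_dask()           # single dataset containing both 'u' and 'v'
u, v = uv['u'], uv['v']          # split back out if separate arrays are needed
```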
Line #5. alongslope_velocity = alongslope_velocity.load()
I'd want to check if the `load` here is actually necessary - I think the lazy loading magic happens automatically once it's required.
Can confirm loading here isn't necessary, but I'm wondering if not doing that is what's made the final plotting cell run so slowly. Although admittedly, I'm only running on 7 cores, and I'm thinking there are some clever Dask/xarray things we could do to speed it up (see below).
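One middle ground (not what the notebook currently does, just a hedged suggestion) is to persist rather than load, so the computed chunks stay on the Dask workers and the plotting cell doesn't redo the whole calculation:

```python
# .load() pulls everything into local memory; .persist() keeps the result
# lazy-looking but cached on the cluster, assuming a distributed client is up.
alongslope_velocity = alongslope_velocity.persist()
```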
The hack I found is that you need to search for and convert to Dask a variable that uses that coordinate scheme. I agree though, there really should be a way to load coordinates only.
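For reference, roughly what that hack looks like - the facets used in the search are placeholders; the point is just to open any variable that sits on the wanted grid and read the coordinates off it:

```python
# Hypothetical: open a lazily-loaded variable that carries the coordinates we
# want (here the u-grid), then pull the coordinate arrays from the dataset.
carrier = esm_datastore.search(variable='u', frequency='1mon').to_dask()
yu_ocean = carrier['yu_ocean']   # only the small coordinate arrays get loaded
xu_ocean = carrier['xu_ocean']
```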
Line #1. salt_ds = esm_datastore.search(path='/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/*',variable='salt',frequency='1mon').to_dataset_dict()
Again, avoid searching on filenames (or other 'magic things' we know because we're experiment experts)
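A hedged sketch of the metadata-only equivalent (the facet names are assumptions about this catalog):

```python
# Query on catalogued metadata rather than hard-coded paths/filenames.
salt_ds = esm_datastore.search(variable='salt', frequency='1mon').to_dataset_dict()
```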
Also, using a 7-CPU ARE instance, running this cell without pre-loading the alongslope_velocity data took about 90 mins
This speed was back down to <1 min once I put the time select back in.
Couple of general comments: …
Thanks @marc-white - just digging through your comments now. I've noticed that accessing …
Check the data chunking, perhaps? The default chunking respects the file structure, so if there's one file per time step, it's going to make every time step its own chunk... I'm not familiar with …
- Where possible, all direct file reads have been removed - still requires a bit of a hack to get the grid file (Cell [10]) & `st_edges` (Cell [13]).
- Varnames lowercased.
- Typos fixed.
- Direct indexing of catalog.
- Redundant xarray calls fixed.

Notes:
- It appears the catalog has several data frequencies available for this experiment, unlike `cosima_cookbook`. I made the choice to disambiguate to 3 month freq. to minimise computation size (lowest frequency available).
- Migrating from cosima_cookbook => intake substantially increased dask graph layers - it was necessary to call dask.optimize() to flatten these out. All chunking, etc. is the same as it was - currently not clear to me where the discrepancy in compute graph size originates.
- Currently it doesn't appear that intake supports any good way to search for coordinate variables? This leads to the aforementioned hacks.
I've updated the notebook to address all comments & remove all direct file opens / uses of known paths, where possible - there are still a couple of dodgy workarounds to get coordinates & grid data. @navidcy: My general approach would be to treat things like lon/lat slices, time slices etc. that we are doing the analysis over as constants - hence uppercasing them. Does this go against the general convention in Cosima? Happy to follow that if so.
@charles-turner-1 can you explain your method for comparing the dask_graph layers?
The cookbook did attempt to do some stuff with chunk sizes, e.g. https://github.com/COSIMA/cosima-cookbook/blob/a9f5fc9b2ff3daf1cca0b9dcd7173636c2e274e4/cosima_cookbook/querying.py#L353, but not without its issues (e.g. COSIMA/cosima-cookbook#333). The other change that may impact timing is that xarray changed the default chunks it applies: it now uses the chunk sizes defined in the source netcdf files (which for a lot of ACCESS-OM data are very small). I'd go to town on specifying some better sizes.
View / edit / reply to this conversation on ReviewNB

anton-seaice commented on 2024-09-19T07:22:43Z
Line #3. .to_dask()
I think you can add xarray_open_kwargs={'chunks': {'time': ...}} to this to_dask. I would try to aim for the chunks before the sel and mean operations to be ~300 MB?

charles-turner-1 commented on 2024-09-20T02:23:44Z
Cheers, fiddled with this a little bit more and managed to remove a good chunk of the explicit optimisations - interestingly, the cosima_cookbook version was using the same chunk sizes I was getting here using the defaults. Do you know if the order of operations can affect the graph size? Struggling to see any other likely causes of the number of graph layers blowing up.

anton-seaice commented on 2024-09-20T03:45:53Z
The number of graph layers seems to be coming from the number of files - there are two per file opened. I didn't realise that's how graph layers worked! I don't know of optimisations for that, but maybe there are? Setting … seemed like an ok starting point.

marc-white commented on 2024-09-20T06:58:00Z
I've been playing with chunking as well this afternoon, and I think that's the 'solution' (or as far as you might get) - all of the …
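For concreteness, the kind of chunk specification being discussed might look like the following - the dimension names come from earlier in the notebook, but the sizes are purely illustrative (the target, per the comment above, is ~300 MB chunks going into the sel/mean operations):

```python
# Pass chunking hints through to xarray when opening from the datastore.
# With one file per time step, asking for time chunks here has little effect;
# the spatial chunk sizes are the useful knob.
darray = u_vars.to_dask(
    xarray_open_kwargs={'chunks': {'st_ocean': 25, 'yu_ocean': 540, 'xu_ocean': 720}}
)
```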
View / edit / reply to this conversation on ReviewNB

anton-seaice commented on 2024-09-19T07:22:44Z
Line #3. I would search the catalog for the data which has st_edges_ocean as a coordinate, and load that instead? And then grab st_edges_ocean from it? Saying that, if it's fixed frequency, then you might still end up with this ['path'][0]

charles-turner-1 commented on 2024-09-20T02:18:47Z
AFAIK there is currently no way of searching for coordinates? See intake/intake-esm#660 - I'm working on seeing if we can patch that. I agree the current approach is messy.

anton-seaice commented on 2024-09-20T03:52:05Z
Nevermind, my suggestion won't work. st_edges_ocean doesn't appear to be the coordinate for any variable (at least not obviously so)
@marc-white To be honest, I've largely just eyeballed this - I tried to take a look at the dask_graph with … I'm still not sure why manual optimisation is necessary, nor where the intermediate layers (not found in the cookbook version) are coming from here. The Dask API reference reads to me as if it just prunes redundant graph layers & should leave the result unchanged.
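For anyone wanting to do this less by eyeball, here is a small self-contained sketch of counting graph layers before and after `dask.optimize()` - the dummy array just stands in for a lazily-loaded field with one small chunk per file:

```python
import dask
import dask.array as da
import xarray as xr

# Stand-in for a catalog-loaded field: many tiny time chunks, as if each
# time step came from its own file.
velocity = xr.DataArray(
    da.zeros((240, 300, 360), chunks=(1, 300, 360)),
    dims=("time", "yu_ocean", "xu_ocean"),
)
result = velocity.mean("time")   # some lazy computation

def describe(arr, label):
    graph = arr.__dask_graph__()             # the HighLevelGraph backing the array
    print(f"{label}: {len(graph.layers)} layers, {len(graph)} tasks")

describe(result, "before optimize")
(result_opt,) = dask.optimize(result)        # culls/fuses layers; values unchanged
describe(result_opt, "after optimize")
```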
- Extracted depth_slice & data freq to constants & changed default data frequency to 1 month (`1mon`).
- Added a bunch of comments & explanation surrounding chunking.
- Added some comments surrounding operations & what they're doing.
I realized that the time selection from the cell creating …

Interestingly, if I omit the …
View / edit / reply to this conversation on ReviewNB

marc-white commented on 2024-09-23T06:31:14Z
Line #6. Add in a time slice - will be needed later!
View / edit / reply to this conversation on ReviewNB

marc-white commented on 2024-09-23T06:31:15Z
Line #15. xarray_open_kwargs = {'chunks': chunk_spec}
Chunking on time here will have no effect, given the data are split into files by time step. You'd need to separately chunk on time after creating the array.
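i.e., something along these lines - a sketch, with an illustrative chunk size:

```python
# Open first (effectively one time step per chunk, following the files),
# then re-chunk on time so downstream operations see fewer, larger chunks.
darray = u_vars.to_dask(xarray_open_kwargs={'chunks': chunk_spec})
darray = darray.chunk({'time': 12})   # 12 steps per chunk - illustrative only
```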
View / edit / reply to this conversation on ReviewNB

marc-white commented on 2024-09-23T06:31:15Z
Line #17. ).sel(st_ocean=depth_slice
Need to select on the time slice created above, otherwise data volume is huge!
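For example (the dates in `time_slice` are hypothetical, `u_ds` stands in for the dataset opened above, and `depth_slice` is the constant already defined in the notebook):

```python
time_slice = slice('2086-01', '2090-12')   # hypothetical analysis window

u = u_ds['u'].sel(time=time_slice, st_ocean=depth_slice)
```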
Also, ignore the Dask graph layer reports - optimizing seems to have no real effect on the speed of the rest of the notebook.