Improve performance of volume_statistics
#1545
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##             main    #1545      +/-   ##
==========================================
+ Coverage   91.39%   91.46%   +0.07%
==========================================
  Files         204      204
  Lines       11176    11128      -48
==========================================
- Hits        10214    10178      -36
+ Misses        962      950      -12
==========================================
```
Continue to review full report at Codecov.
|
Looking good so far. Will have a proper look when "Ready for review". |
The CI seems confused. I'll close and reopen to trigger a restart. |
Just merged with main to see if it helps. In any case, I wanted to add an extra test to check that the weighting works, but I am not sure how to do so. |
Is there anything pending to be addressed otherwise? Maybe it would be good if someone could double-check that the changes do not affect the diagnostics. |
@ESMValGroup/esmvaltool-coreteam Anyone with time to review this? I think if the changes are reasonable it would be a nice improvement for 2.6, and if this needs further work, at least there is still time to correct whatever is needed. Many thanks in advance. |
I tried implementing the following test to take the weights into account:

```python
def test_volume_statistics_weights(self):
    data = np.ma.arange(1, 25).reshape(2, 3, 2, 2)
    self.grid_4d.data = data
    measure = iris.coords.CellMeasure(
        data,
        standard_name='ocean_volume',
        units='m3',
        measure='volume'
    )
    self.grid_4d.add_cell_measure(measure, range(0, measure.ndim))
    result = volume_statistics(self.grid_4d, 'mean')
    expected = np.ma.array(
        [8.333333333333334, 19.144144144144143],
        mask=[False, False])
    self.assert_array_equal(result.data, expected)
```

This passes for both implementations. However, there is a difference when the input data is defined as a regular array instead of a masked array, even if the mask of the previous example is set to False at every point:

```python
def test_volume_statistics_weights(self):
    data = np.arange(1, 25).reshape(2, 3, 2, 2)
    self.grid_4d.data = data
    measure = iris.coords.CellMeasure(
        data,
        standard_name='ocean_volume',
        units='m3',
        measure='volume'
    )
    self.grid_4d.add_cell_measure(measure, range(0, measure.ndim))
    result = volume_statistics(self.grid_4d, 'mean')
    expected = np.ma.array(
        [8.333333333333334, 19.144144144144143],
        mask=[False, False])
    self.assert_array_equal(result.data, expected)
```

This passes for the new implementation, but not for the current one. Even though the content of the arrays is the same, the only difference is that in the first example the array is defined as a masked array object with the mask set to False at every point. The difference comes from this part of the code in main:

```python
try:
    layer_vol = np.ma.masked_where(cube[time_itr, z_itr].data.mask,
                                   grid_volume[time_itr, z_itr]).sum()
except AttributeError:
    # ####
    # No mask in the cube data.
    layer_vol = grid_volume.sum()
```

When the data in the cube is defined as a masked array object, the try branch sums the grid volume of a single (time, depth) layer; when it is a plain array, accessing .mask raises AttributeError and the except branch sums the volume of the whole grid instead. I think that since this preprocessor is mostly used with ocean variables, which tend to have a mask, the change in implementations should not affect the results though. |
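To make the divergence concrete, here is a minimal sketch of the two code paths above (hypothetical shapes and values, not taken from the PR):

```python
import numpy as np

# Hypothetical (time, z, y, x) grid volume; values are made up.
grid_volume = np.arange(1.0, 25.0).reshape(2, 3, 2, 2)

# Masked data with an all-False mask: the try-branch sums one layer.
masked_data = np.ma.array(grid_volume, mask=False)
layer_vol = np.ma.masked_where(masked_data[0, 0].mask,
                               grid_volume[0, 0]).sum()
print(layer_vol)  # 10.0 -> the four cells of a single (time, z) layer

# Plain data: .mask raises AttributeError, so the fallback sums everything.
plain_data = np.asarray(grid_volume)
try:
    layer_vol = np.ma.masked_where(plain_data[0, 0].mask,
                                   grid_volume[0, 0]).sum()
except AttributeError:
    layer_vol = grid_volume.sum()
print(layer_vol)  # 300.0 -> all 24 cells, i.e. the whole grid
```

The same input values thus give 10.0 or 300.0 for the layer volume depending only on the array type, matching the test behaviour described above.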
Good catch, @sloosvel. Indeed, I think at least with recent Iris versions all cubes loaded from netCDF files have a mask. In any case, I would suggest replacing the code in main that you posted, i.e.
with
To reiterate what we all know:
|
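The replacement snippets the comment above refers to are not reproduced here. Purely as a hypothetical illustration (my sketch, not the original suggestion), a mask-agnostic version of the try/except quoted earlier could rely on np.ma.getmaskarray, which returns the mask of a masked array and an all-False mask for a plain array, so both input types take the same code path:

```python
import numpy as np

def layer_volume(layer_data, layer_grid_volume):
    # Hypothetical helper: getmaskarray yields an all-False mask for plain
    # ndarrays, so no try/except on .mask is needed.
    return np.ma.masked_where(np.ma.getmaskarray(layer_data),
                              layer_grid_volume).sum()
```

Called as layer_volume(cube[time_itr, z_itr].data, grid_volume[time_itr, z_itr]), this gives one code path for masked and plain data alike.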
Yes, I already implemented that in a lazy way. My question is rather that I don't understand why the grid volume gets computed differently depending on the type of array, and whether this behaviour is intentional or a bug. Because if it's intentional, I would need to change the code. |
Gotcha. Yes, I think that was a bug. |
Is there anything missing in this PR for it to get approved and merged? |
It was not clear to me that this is ready for review now, since new commits were added. Will have a look later. |
Unfortunately, I can't do "later" because things are starting to pile up for me, and with the holidays and everything I would rather not die trying to get this release out. Please, @ESMValGroup/esmvaltool-coreteam, reviews are needed here! |
Ran out of ideas as to how I can get this reviewed. As you can see, @kserradell, @pabretonniere, I tried. Guess we will have to tell users running in HR that ESMValTool is not ready to handle their data, as the current implementation dies before finishing the computations. |
I will have a look at this on Friday or early next week! |
This is probably a very stupid question, but why does a simple

```python
result = cube.collapsed(
    [cube.coord(axis='Z'), cube.coord(axis='Y'), cube.coord(axis='X')],
    iris.analysis.MEAN,
    weights=grid_volume)
```

instead of first collapsing X, Y and then Z not work?
I think I got confused by the double loop. You can collapse all coordinates at once, but you need to mask the volume first: if you don't mask the volume and collapse all coordinates at once, you don't get the same results. Maybe doing it in two steps helps with the memory use, though? After all, this implementation is much faster, but the previous one uses less memory. |
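A sketch of the approach settled on here, masking the volume with the data's mask and then collapsing all three coordinates in one call. This is illustrative, not the PR's exact code: grid_volume is assumed to be an array matching the cube's shape, and volume_weighted_mean is a made-up name.

```python
import iris.analysis
import numpy as np

def volume_weighted_mean(cube, grid_volume):
    # Mask the volume wherever the data is masked, so masked cells carry
    # no weight in the mean.
    masked_volume = np.ma.masked_where(np.ma.getmaskarray(cube.data),
                                       grid_volume)
    # Collapse Z, Y and X in a single call instead of looping over layers.
    return cube.collapsed(
        [cube.coord(axis='Z'), cube.coord(axis='Y'), cube.coord(axis='X')],
        iris.analysis.MEAN,
        weights=masked_volume)
```

Besides being simpler, this hands the scheduler a single reduction instead of a per-layer loop, which is the point made in the next comment.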
I think if collapsing all coordinates at once works (with masking the volume first), you should go for it: it (1) makes the code much simpler and (2) gives the dask scheduler one step less to handle. |
The tests pass, but there is a slight difference in the results with real data:
I used the recipe in #1603 (review). However, the tool we have at the BSC collapses all coordinates at once, so I don't think it's a wrong approach; these must be precision differences. |
Yes, I think so, too. |
Awesome!! The two recipes that use this preprocessor (recipe_ocean_bgc.yml and recipe_ocean_example.yml) finish in about 40 s with this change; before, they didn't finish within 10 min on Levante (no idea how they finished in 3 min for our v2.5 testing...). Tested with 32 GB available.
Great work!! 🚀
Description
This PR tries to improve the performance of volume_statistics using iris and dask functions. Closes #1498.
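As a rough illustration of what "using dask functions" buys (a hedged sketch, not the PR's code): reductions over dask arrays stay lazy until explicitly computed, so large ocean grids need not be loaded into memory at once.

```python
import dask.array as da

# Hypothetical lazy data and volume weights, chunked along time.
data = da.ones((2, 3, 2, 2), chunks=(1, 3, 2, 2))
volume = da.full((2, 3, 2, 2), 2.0, chunks=(1, 3, 2, 2))

# Build the volume-weighted mean lazily over (z, y, x).
weighted_mean = ((data * volume).sum(axis=(1, 2, 3))
                 / volume.sum(axis=(1, 2, 3)))

print(weighted_mean.compute())  # [1. 1.] -- nothing runs until .compute()
```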
Link to documentation:
Before you get started
Checklist
It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.
- [ ] 🛠 Any changed dependencies have been added or removed correctly

To help with the number of pull requests: