
Improve performance of volume_statistics #1545

Merged: 14 commits merged into main from dev_vol_stats on Jun 17, 2022

Conversation

@sloosvel (Contributor) commented Mar 17, 2022

Description

This PR tries to improve the performance of volume_statistics using iris and dask functions.
Closes #1498

Link to documentation:


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number of pull requests:

@sloosvel sloosvel added the enhancement New feature or request label Mar 17, 2022
@sloosvel sloosvel added this to the v2.6.0 milestone Mar 17, 2022
@sloosvel sloosvel requested a review from zklaus March 17, 2022 13:43
codecov bot commented Mar 17, 2022

Codecov Report

Merging #1545 (8d89633) into main (d7ce1d2) will increase coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1545      +/-   ##
==========================================
+ Coverage   91.39%   91.46%   +0.07%     
==========================================
  Files         204      204              
  Lines       11176    11128      -48     
==========================================
- Hits        10214    10178      -36     
+ Misses        962      950      -12     
Impacted Files Coverage Δ
esmvalcore/preprocessor/_volume.py 87.20% <100.00%> (+4.37%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d7ce1d2...8d89633.

@zklaus commented Mar 21, 2022

Looking good so far. Will have a proper look when "Ready for review".

@sloosvel sloosvel marked this pull request as ready for review March 23, 2022 09:33
@zklaus commented Apr 5, 2022

The CI seems confused. I'll close and reopen to trigger a restart.

@zklaus zklaus closed this Apr 5, 2022
@zklaus zklaus reopened this Apr 5, 2022
@sloosvel (Contributor, Author) commented Apr 5, 2022

Just merged with main to see if it helps. In any case I wanted to add an extra test to see if the weighting works but I am not sure how to do so.

@sloosvel (Contributor, Author) commented

Is there anything pending to be addressed otherwise? Maybe it would be good if someone could double-check that the changes do not affect the diagnostics.

@sloosvel (Contributor, Author) commented

@ESMValGroup/esmvaltool-coreteam Anyone with time to review this?
@ledm do you have any example for a small test as you mentioned in #1498 (comment)?

I think if the changes are reasonable it would be a nice improvement for 2.6, and if this needs further work, at least there is still time to correct whatever is needed.

Many thanks in advance

@sloosvel (Contributor, Author) commented

I tried implementing the following test to take into account the weights:

    def test_volume_statistics_weights(self):
        # Cube data defined as a masked array (mask is False everywhere).
        data = np.ma.arange(1, 25).reshape(2, 3, 2, 2)
        self.grid_4d.data = data
        measure = iris.coords.CellMeasure(
            data,
            standard_name='ocean_volume',
            units='m3',
            measure='volume'
        )
        self.grid_4d.add_cell_measure(measure, range(0, measure.ndim))
        result = volume_statistics(self.grid_4d, 'mean')
        expected = np.ma.array(
            [8.333333333333334, 19.144144144144143],
            mask=[False, False])
        self.assert_array_equal(result.data, expected)

This passes for both implementations.
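For reference, the expected values above can be checked by hand: the cube data is also used as the volume weights, so the volume-weighted mean of each time step reduces to sum(data²) / sum(data) over that step. A quick standalone check in plain NumPy (independent of the test class):

    import numpy as np

    # The cube data doubles as the volume weights, so the weighted mean of
    # each time step is sum(data**2) / sum(data) over that step.
    data = np.arange(1, 25).reshape(2, 3, 2, 2)
    for step in data:
        print(np.sum(step ** 2) / np.sum(step))
    # 8.333333333333334
    # 19.144144144144143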

However, there is a difference when the input data is defined as a regular array instead of a masked array, even if the mask in the previous example is set to False:

    def test_volume_statistics_weights(self):
        # Cube data defined as a plain ndarray (no mask at all).
        data = np.arange(1, 25).reshape(2, 3, 2, 2)
        self.grid_4d.data = data
        measure = iris.coords.CellMeasure(
            data,
            standard_name='ocean_volume',
            units='m3',
            measure='volume'
        )
        self.grid_4d.add_cell_measure(measure, range(0, measure.ndim))
        result = volume_statistics(self.grid_4d, 'mean')
        expected = np.ma.array(
            [8.333333333333334, 19.144144144144143],
            mask=[False, False])
        self.assert_array_equal(result.data, expected)

This passes for the new implementation, but not for the current one. The content of the arrays is the same; the only difference is that in the first example the array is defined as a masked array object with the mask set to False at every point. The difference comes from this part of the code in main:

            try:
                layer_vol = np.ma.masked_where(cube[time_itr, z_itr].data.mask,
                                               grid_volume[time_itr,
                                                           z_itr]).sum()

            except AttributeError:
                # ####
                # No mask in the cube data.
                layer_vol = grid_volume.sum()

When the data in the cube is defined as a masked array object, layer_vol is computed by summing a slice of grid_volume. When the data is a plain array with no mask, the AttributeError branch is taken and layer_vol sums over the whole grid_volume array instead of a slice. I guess this is a bug?
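
A minimal standalone illustration of the two code paths in plain NumPy (grid_volume and the data arrays here are made-up stand-ins for the cube payload and the volume weights, mirroring the snippet from main quoted above):

    import numpy as np

    grid_volume = np.arange(1, 25).reshape(2, 3, 2, 2)

    # Masked-array data: .mask exists, so only the selected slice of
    # grid_volume is summed.
    masked_data = np.ma.arange(1, 25).reshape(2, 3, 2, 2)
    layer_vol = np.ma.masked_where(masked_data[0, 0].mask,
                                   grid_volume[0, 0]).sum()
    print(layer_vol)  # 10, the sum of the 2x2 slice only

    # Plain-ndarray data: .mask raises AttributeError, so the fallback
    # branch sums the whole grid_volume array instead of a slice.
    plain_data = np.arange(1, 25).reshape(2, 3, 2, 2)
    try:
        layer_vol = np.ma.masked_where(plain_data[0, 0].mask,
                                       grid_volume[0, 0]).sum()
    except AttributeError:
        layer_vol = grid_volume.sum()
    print(layer_vol)  # 300, the sum of all 24 cells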

I think that since this preprocessor is mostly used for ocean variables, which tend to have a mask, the change in implementation should not affect the results, though.

@zklaus commented May 31, 2022

Good catch, @sloosvel. Indeed, I think that at least with recent Iris versions all cubes loaded from NetCDF files have a mask. In any case, I would suggest replacing the code in main that you posted, i.e.

            try:
                layer_vol = np.ma.masked_where(cube[time_itr, z_itr].data.mask,
                                               grid_volume[time_itr,
                                                           z_itr]).sum()

            except AttributeError:
                # ####
                # No mask in the cube data.
                layer_vol = grid_volume.sum()

with

            layer_vol = np.ma.masked_where(np.ma.getmask(cube[time_itr, z_itr].core_data()),
                                           grid_volume[time_itr, z_itr]).sum()

To reiterate what we all know:

  • Avoid cube.data in favor of cube.core_data() to keep things lazy where possible
  • Avoid masked_array.data and masked_array.mask in favor of np.ma.getdata and (np.ma.getmask or np.ma.getmaskarray).
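
A tiny standalone illustration of the second point, using plain NumPy only (the same reasoning applies to preferring cube.core_data() over cube.data to keep cubes lazy):

    import numpy as np

    plain = np.arange(4.0)
    masked = np.ma.masked_less(np.arange(4.0), 2)

    # plain.mask would raise AttributeError (the branch that misbehaves in
    # main); the np.ma helpers work on both kinds of array:
    print(np.ma.getmask(plain))        # False (np.ma.nomask)
    print(np.ma.getmaskarray(plain))   # [False False False False]
    print(np.ma.getmaskarray(masked))  # [ True  True False False]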

@sloosvel (Contributor, Author) commented

Yes, I already implemented that in a lazy way. My question is that I don't understand why the grid volume gets computed differently depending on the type of array, and whether this behaviour is intentional or a bug. If it's intentional, I would need to change the code.

@zklaus commented May 31, 2022

Gotcha. Yes, I think that was a bug.

@sloosvel (Contributor, Author) commented Jun 2, 2022

Is there anything missing in this PR for it to get approved and merged?

@zklaus commented Jun 2, 2022

It was not clear to me that this is ready for review now, since new commits were added. Will have a look later.

@sloosvel (Contributor, Author) commented Jun 3, 2022

> Will have a look later.

Unfortunately, I can't do later because things are starting to pile up for me, and with the holidays and everything I would rather not die trying to get this release out.

Please, @ESMValGroup/esmvaltool-coreteam, reviews are needed here!

@sloosvel (Contributor, Author) commented Jun 7, 2022

I have run out of ideas for how to get this reviewed. As you can see, @kserradell, @pabretonniere, I tried. I guess we will have to tell users running high-resolution (HR) simulations that ESMValTool is not ready to handle their data, as the current implementation dies before finishing the computations.

@sloosvel sloosvel modified the milestones: v2.6.0, v2.7.0 Jun 8, 2022
@schlunma schlunma self-requested a review June 15, 2022 18:35
@schlunma (Contributor) commented

I will have a look at this on Friday or early next week!

@schlunma (Contributor) left a comment

This is probably a very stupid question, but why does a simple

result = cube.collapsed(
        [cube.coord(axis='Z'), cube.coord(axis='Y'), cube.coord(axis='X')],
        iris.analysis.MEAN,
        weights=grid_volume)

instead of first collapsing X,Y and then Z not work?

@sloosvel (Contributor, Author) commented

I think I got confused with the double loop. You can collapse all coordinates at once, but you need to mask the volume first. If you don't mask the volume and collapse all coordinates at once, you don't get the same results.

Maybe doing it in two steps helps with the memory use though? After all this implementation is much faster, but the previous one uses less memory.
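
For reference, a minimal sketch of the "mask the volume first, then collapse everything in one step" approach on a toy cube (the cube, coordinates and per-cell volumes below are made up for illustration; this is not the PR's exact code):

    import iris
    import iris.analysis
    import iris.coords
    import iris.cube
    import numpy as np

    # Toy 3D cube (depth, latitude, longitude) with one masked cell.
    data = np.ma.arange(8.0).reshape(2, 2, 2)
    data[0, 0, 0] = np.ma.masked
    depth = iris.coords.DimCoord([5.0, 15.0], standard_name='depth', units='m')
    lat = iris.coords.DimCoord([-10.0, 10.0], standard_name='latitude',
                               units='degrees')
    lon = iris.coords.DimCoord([10.0, 30.0], standard_name='longitude',
                               units='degrees')
    cube = iris.cube.Cube(
        data,
        standard_name='sea_water_potential_temperature',
        units='K',
        dim_coords_and_dims=[(depth, 0), (lat, 1), (lon, 2)],
    )

    # Stand-in per-cell volumes; in the preprocessor these would come from
    # the 'ocean_volume' cell measure or be derived from the cell bounds.
    grid_volume = np.ones(cube.shape)

    # Mask the volume wherever the data is masked, then collapse Z, Y and X
    # in a single volume-weighted mean.
    masked_volume = np.ma.masked_where(
        np.ma.getmaskarray(cube.core_data()), grid_volume)
    result = cube.collapsed(
        [cube.coord(axis='Z'), cube.coord(axis='Y'), cube.coord(axis='X')],
        iris.analysis.MEAN,
        weights=masked_volume,
    )
    print(result.data)  # mean of the seven unmasked values: 4.0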

@schlunma (Contributor) commented

I think if collapsing all coordinates at once works (with masking the volume first), you should go for it. It (1) makes the code much simpler and (2) means there is one step less for the dask scheduler.

@sloosvel (Contributor, Author) commented

The tests pass, but there is a slight difference in the results with real data:

current:  thetao = 4.274955 ;

new:  thetao = 4.27496 ;

I used the recipe in #1603 (review). However, the tool we have at the BSC collapses all coordinates at once, so I don't think it's a wrong approach; these must be precision differences.

@schlunma (Contributor) commented

> However, the tool we have at the BSC collapses all coordinates at once, so I don't think it's a wrong approach; these must be precision differences.

Yes, I think so, too.

@schlunma (Contributor) left a comment

Awesome!! The two recipes that use it (recipe_ocean_bgc.yml and recipe_ocean_example.yml) finish in about 40 s with this change; before, they didn't finish within 10 min on Levante (no idea how they finished in 3 min for our v2.5 testing...)

(Tested this with 32GB available).

Great work!! 🚀

@sloosvel sloosvel merged commit acc5377 into main Jun 17, 2022
@sloosvel sloosvel deleted the dev_vol_stats branch June 17, 2022 16:16
@sloosvel sloosvel modified the milestones: v2.7.0, v2.6.0 Jun 17, 2022
Labels: enhancement (New feature or request)
Projects: None yet
Development: Successfully merging this pull request may close these issues: Performance of volume_statistics
3 participants