
Improve performance of volume_statistics #1545

Merged: 14 commits merged into main from dev_vol_stats on Jun 17, 2022

Conversation

@sloosvel (Contributor) commented Mar 17, 2022

Description

This PR tries to improve the performance of volume_statistics using iris and dask functions.
Closes #1498

Link to documentation:


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number of pull requests:

@sloosvel sloosvel added the enhancement New feature or request label Mar 17, 2022
@sloosvel sloosvel added this to the v2.6.0 milestone Mar 17, 2022
@sloosvel sloosvel requested a review from zklaus March 17, 2022 13:43
codecov bot commented Mar 17, 2022

Codecov Report

Merging #1545 (8d89633) into main (d7ce1d2) will increase coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1545      +/-   ##
==========================================
+ Coverage   91.39%   91.46%   +0.07%     
==========================================
  Files         204      204              
  Lines       11176    11128      -48     
==========================================
- Hits        10214    10178      -36     
+ Misses        962      950      -12     
Impacted Files Coverage Δ
esmvalcore/preprocessor/_volume.py 87.20% <100.00%> (+4.37%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d7ce1d2...8d89633.

@zklaus commented Mar 21, 2022

Looking good so far. Will have a proper look when "Ready for review".

@sloosvel sloosvel marked this pull request as ready for review March 23, 2022 09:33
@zklaus commented Apr 5, 2022

The CI seems confused. I'll close and reopen to trigger a restart.

@zklaus zklaus closed this Apr 5, 2022
@zklaus zklaus reopened this Apr 5, 2022
@sloosvel (Contributor, Author) commented Apr 5, 2022

Just merged with main to see if it helps. In any case I wanted to add an extra test to see if the weighting works but I am not sure how to do so.

@sloosvel (Contributor, Author) commented

Is there anything pending to be addressed otherwise? Maybe it would be good if someone could double-check that the changes do not affect the diagnostics.

@sloosvel (Contributor, Author) commented

@ESMValGroup/esmvaltool-coreteam Anyone with time to review this?
@ledm do you have any example for a small test as you mentioned in #1498 (comment)?

I think if the changes are reasonable it would be a nice improvement for 2.6, and if this needs further work, at least there is still time to correct whatever is needed.

Many thanks in advance

@sloosvel (Contributor, Author) commented

I tried implementing the following test to take into account the weights:

    def test_volume_statistics_weights(self):
        # Cube data defined as a masked array (mask is False everywhere).
        data = np.ma.arange(1, 25).reshape(2, 3, 2, 2)
        self.grid_4d.data = data
        measure = iris.coords.CellMeasure(
            data,
            standard_name='ocean_volume',
            units='m3',
            measure='volume'
        )
        self.grid_4d.add_cell_measure(measure, range(0, measure.ndim))
        result = volume_statistics(self.grid_4d, 'mean')
        expected = np.ma.array(
            [8.333333333333334, 19.144144144144143],
            mask=[False, False])
        self.assert_array_equal(result.data, expected)

This passes for both implementations.
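For reference, the expected values above can be checked by hand: the cube data is also used as the volume weights, so the volume-weighted mean of each time step reduces to sum(data²) / sum(data) over that step. A quick standalone check in plain NumPy (independent of the test class):

    import numpy as np

    # The cube data doubles as the volume weights, so the weighted mean of
    # each time step is sum(data**2) / sum(data) over that step.
    data = np.arange(1, 25).reshape(2, 3, 2, 2)
    for step in data:
        print(np.sum(step ** 2) / np.sum(step))
    # 8.333333333333334
    # 19.144144144144143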

However, there is a difference when the input data is defined as a regular array instead of a masked array, even if the mask in the previous example is set to False:

    def test_volume_statistics_weights(self):
        # Cube data defined as a plain ndarray (no mask at all).
        data = np.arange(1, 25).reshape(2, 3, 2, 2)
        self.grid_4d.data = data
        measure = iris.coords.CellMeasure(
            data,
            standard_name='ocean_volume',
            units='m3',
            measure='volume'
        )
        self.grid_4d.add_cell_measure(measure, range(0, measure.ndim))
        result = volume_statistics(self.grid_4d, 'mean')
        expected = np.ma.array(
            [8.333333333333334, 19.144144144144143],
            mask=[False, False])
        self.assert_array_equal(result.data, expected)

This passes for the new implementation, but not for the current one. The content of the arrays is the same; the only difference is that in the first example the array is defined as a masked array object with the mask set to False at every point. The difference comes from this part of the code in main:

            try:
                layer_vol = np.ma.masked_where(cube[time_itr, z_itr].data.mask,
                                               grid_volume[time_itr,
                                                           z_itr]).sum()

            except AttributeError:
                # ####
                # No mask in the cube data.
                layer_vol = grid_volume.sum()

When the data in the cube is defined as a masked array object, layer_vol is computed by summing a slice of grid_volume. When the data is a plain array with no mask, the AttributeError branch is taken and layer_vol sums over the whole grid_volume array instead of a slice. I guess this is a bug?
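
A minimal standalone illustration of the two code paths in plain NumPy (grid_volume and the data arrays here are made-up stand-ins for the cube payload and the volume weights, mirroring the snippet from main quoted above):

    import numpy as np

    grid_volume = np.arange(1, 25).reshape(2, 3, 2, 2)

    # Masked-array data: .mask exists, so only the selected slice of
    # grid_volume is summed.
    masked_data = np.ma.arange(1, 25).reshape(2, 3, 2, 2)
    layer_vol = np.ma.masked_where(masked_data[0, 0].mask,
                                   grid_volume[0, 0]).sum()
    print(layer_vol)  # 10, the sum of the 2x2 slice only

    # Plain-ndarray data: .mask raises AttributeError, so the fallback
    # branch sums the whole grid_volume array instead of a slice.
    plain_data = np.arange(1, 25).reshape(2, 3, 2, 2)
    try:
        layer_vol = np.ma.masked_where(plain_data[0, 0].mask,
                                       grid_volume[0, 0]).sum()
    except AttributeError:
        layer_vol = grid_volume.sum()
    print(layer_vol)  # 300, the sum of all 24 cells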

I think that since this preprocessor is mostly used for ocean variables, which tend to have a mask, the change in implementation should not affect the results, though.

@zklaus commented May 31, 2022

Good catch, @sloosvel. Indeed, I think that at least with recent Iris versions all cubes loaded from NetCDF files have a mask. In any case, I would suggest replacing the code in main that you posted, i.e.

            try:
                layer_vol = np.ma.masked_where(cube[time_itr, z_itr].data.mask,
                                               grid_volume[time_itr,
                                                           z_itr]).sum()

            except AttributeError:
                # ####
                # No mask in the cube data.
                layer_vol = grid_volume.sum()

with

            layer_vol = np.ma.masked_where(np.ma.getmask(cube[time_itr, z_itr].core_data()),
                                           grid_volume[time_itr, z_itr]).sum()

To reiterate what we all know:

  • Avoid cube.data in favor of cube.core_data() to keep things lazy where possible
  • Avoid masked_array.data and masked_array.mask in favor of np.ma.getdata and (np.ma.getmask or np.ma.getmaskarray).
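
A tiny standalone illustration of the second point, using plain NumPy only (the same reasoning applies to preferring cube.core_data() over cube.data to keep cubes lazy):

    import numpy as np

    plain = np.arange(4.0)
    masked = np.ma.masked_less(np.arange(4.0), 2)

    # plain.mask would raise AttributeError (the branch that misbehaves in
    # main); the np.ma helpers work on both kinds of array:
    print(np.ma.getmask(plain))        # False (np.ma.nomask)
    print(np.ma.getmaskarray(plain))   # [False False False False]
    print(np.ma.getmaskarray(masked))  # [ True  True False False]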

@sloosvel (Contributor, Author) commented

Yes, I already implemented that in a lazy way. My question is that I don't understand why the grid volume gets computed differently depending on the type of array, and whether this behaviour is intentional or a bug. If it's intentional, I would need to change the code.

@zklaus commented May 31, 2022

Gotcha. Yes, I think that was a bug.

@sloosvel (Contributor, Author) commented Jun 2, 2022

Is there anything missing in this PR for it to get approved and merged?

@zklaus commented Jun 2, 2022

It was not clear to me that this is ready for review now, since new commits were added. Will have a look later.

@sloosvel (Contributor, Author) commented Jun 3, 2022

> Will have a look later.

Unfortunately, I can't do later because things are starting to pile up for me, and with the holidays and everything I would rather not die trying to get this release out.

Please, @ESMValGroup/esmvaltool-coreteam, reviews are needed here!

@sloosvel (Contributor, Author) commented Jun 7, 2022

I have run out of ideas for how to get this reviewed. As you can see, @kserradell, @pabretonniere, I tried. I guess we will have to tell users running high-resolution (HR) simulations that ESMValTool is not ready to handle their data, as the current implementation dies before finishing the computations.

@sloosvel sloosvel modified the milestones: v2.6.0, v2.7.0 Jun 8, 2022
@schlunma schlunma self-requested a review June 15, 2022 18:35
@schlunma (Contributor) commented

I will have a look at this on Friday or early next week!

@schlunma (Contributor) left a comment

This is probably a very stupid question, but why does a simple

result = cube.collapsed(
        [cube.coord(axis='Z'), cube.coord(axis='Y'), cube.coord(axis='X')],
        iris.analysis.MEAN,
        weights=grid_volume)

instead of first collapsing X,Y and then Z not work?

@sloosvel (Contributor, Author) commented

I think I got confused with the double loop. You can collapse all coordinates at once, but you need to mask the volume first. If you don't mask the volume and collapse all coordinates at once, you don't get the same results.

Maybe doing it in two steps helps with the memory use though? After all this implementation is much faster, but the previous one uses less memory.
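
For reference, a minimal sketch of the "mask the volume first, then collapse everything in one step" approach on a toy cube (the cube, coordinates and per-cell volumes below are made up for illustration; this is not the PR's exact code):

    import iris
    import iris.analysis
    import iris.coords
    import iris.cube
    import numpy as np

    # Toy 3D cube (depth, latitude, longitude) with one masked cell.
    data = np.ma.arange(8.0).reshape(2, 2, 2)
    data[0, 0, 0] = np.ma.masked
    depth = iris.coords.DimCoord([5.0, 15.0], standard_name='depth', units='m')
    lat = iris.coords.DimCoord([-10.0, 10.0], standard_name='latitude',
                               units='degrees')
    lon = iris.coords.DimCoord([10.0, 30.0], standard_name='longitude',
                               units='degrees')
    cube = iris.cube.Cube(
        data,
        standard_name='sea_water_potential_temperature',
        units='K',
        dim_coords_and_dims=[(depth, 0), (lat, 1), (lon, 2)],
    )

    # Stand-in per-cell volumes; in the preprocessor these would come from
    # the 'ocean_volume' cell measure or be derived from the cell bounds.
    grid_volume = np.ones(cube.shape)

    # Mask the volume wherever the data is masked, then collapse Z, Y and X
    # in a single volume-weighted mean.
    masked_volume = np.ma.masked_where(
        np.ma.getmaskarray(cube.core_data()), grid_volume)
    result = cube.collapsed(
        [cube.coord(axis='Z'), cube.coord(axis='Y'), cube.coord(axis='X')],
        iris.analysis.MEAN,
        weights=masked_volume,
    )
    print(result.data)  # mean of the seven unmasked values: 4.0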

@schlunma (Contributor) commented

I think if collapsing all coordinates at once works (with masking the volume first), you should go for it. It (1) makes the code much simpler and (2) means there is one step less for the dask scheduler.

@sloosvel (Contributor, Author) commented

The tests pass, but there is a slight difference in the results with real data:

current:  thetao = 4.274955 ;

new:  thetao = 4.27496 ;

I used the recipe in #1603 (review). However, the tool we have at the BSC collapses all coordinates at once, so I don't think it's a wrong approach; these must be precision differences.

@schlunma (Contributor) commented

> However, the tool we have at the BSC collapses all coordinates at once, so I don't think it's a wrong approach; these must be precision differences.

Yes, I think so, too.

@schlunma (Contributor) left a comment

Awesome!! The two recipes that use it (recipe_ocean_bgc.yml and recipe_ocean_example.yml) finish in about 40 s with this change; before, they didn't finish within 10 min on Levante (no idea how they finished in 3 min for our v2.5 testing...)

(Tested this with 32GB available).

Great work!! 🚀

@sloosvel sloosvel merged commit acc5377 into main Jun 17, 2022
@sloosvel sloosvel deleted the dev_vol_stats branch June 17, 2022 16:16
@sloosvel sloosvel modified the milestones: v2.7.0, v2.6.0 Jun 17, 2022
Labels: enhancement (New feature or request)
Projects: None yet
Development: Successfully merging this pull request may close these issues: Performance of volume_statistics
3 participants