
CMIP6 climate patterns #2785

Merged 84 commits into main from climate_patterns_only on Jun 20, 2024

Conversation

@mo-gregmunday (Contributor) commented Sep 1, 2022

Description

This diagnostic generates climate patterns for CMIP6 models.


Checklist

It is the responsibility of the author to make sure the pull request is ready for review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.

New or updated recipe/diagnostic

@valeriupredoi (Contributor)

hi @mo-gregmunday many thanks for opening this PR! Could I please ask you to set a descriptive title (it's not very clear what the PR does from the current one), and also to look for two reviewers? Pull requests to ESMValTool usually need both a scientific and a technical reviewer: the scientific reviewer should be someone with experience of the science behind the implementation, and the technical reviewer is usually someone who goes through the code and reviews its technical/programming/deployment aspects. Cheers 🍺

@Jon-Lillis (Contributor) left a comment

Thanks for this, @mo-gregmunday. This diagnostic is looking good, documentation builds and reads well, and the metric runs as expected.

I think there are still a few Codacy warnings that we can easily address, so I've made a few comments on this throughout the review above (edit: below 😄). Reducing the number of local variables in some of your calculation functions could be a bit more challenging, but if we address enough of the low-hanging fruit elsewhere then maybe people will look kindly on their complexity.

One general note is that I think it would be useful to add a few more debug logs throughout the diagnostic, e.g. `logger.debug('Processing model: {}'.format(model))` at the start of the patterns function, to make the logs a bit more useful.
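A minimal sketch of that kind of logging (the `patterns` function and its arguments here are illustrative stand-ins, not the diagnostic's actual code):

```python
import logging

logger = logging.getLogger(__name__)

def patterns(model, cfg):
    """Hypothetical per-model entry point, shown only to illustrate the logging."""
    # One debug line per dataset makes long multi-model runs easier to trace.
    logger.debug("Processing model: %s", model)
```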

@ESMValGroup/esmvaltool-coreteam, I've got a few questions about parallelisation within a diagnostic. Given the number of datasets being processed by this metric, @mo-gregmunday has successfully used a multiprocessing pool to help bring the execution time down significantly by processing each dataset concurrently. Do you think this is an appropriate way to do this within ESMValTool, or is there a more efficient way that we haven’t considered?

If it is appropriate, then I have another question. The number of cores to be divvied up by the pool is currently defined in the recipe file. My instinct was to suggest using either all available cores or `max_parallel_tasks` from the config-user.yml file instead, but both could be problematic when the second diagnostic (coming in a future PR) is introduced and ESMValTool itself runs them concurrently, with each attempting to use the same number of cores. Is there a way to get the number of diagnostics in a recipe from within a diagnostic, so that the pool can be given `max_parallel_tasks / number of diagnostics`?
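For context, a minimal sketch of the multiprocessing-pool pattern being discussed (the worker function and dataset names are hypothetical; in the PR the worker count comes from the recipe):

```python
from multiprocessing import Pool

def process_dataset(dataset):
    """Placeholder for the per-dataset climate-pattern calculation."""
    return f"patterns computed for {dataset}"

if __name__ == "__main__":
    datasets = ["ACCESS-CM2", "CanESM5", "UKESM1-0-LL"]
    # Four workers chosen purely for illustration.
    with Pool(processes=4) as pool:
        results = pool.map(process_dataset, datasets)
    print(results)
```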

(Inline review threads on esmvaltool/recipes/recipe_climate_patterns.yml and esmvaltool/diag_scripts/climate_patterns/sub_functions.py were marked outdated and resolved.)
@mo-gregmunday (Contributor, Author)

> hi @mo-gregmunday many thanks for opening this PR! Could I please ask you to set a descriptive title (it's not very clear what the PR does from the current one), and also to look for two reviewers? Pull requests to ESMValTool usually need both a scientific and a technical reviewer: the scientific reviewer should be someone with experience of the science behind the implementation, and the technical reviewer is usually someone who goes through the code and reviews its technical/programming/deployment aspects. Cheers 🍺

Hi @valeriupredoi, sure - I'll add one now. @Jon-Lillis is leading the technical review on this one, and I'm still on the hunt for an 'ESMValTool-certified' scientific reviewer, although scientifically speaking this code has been verified by experts internally.

@valeriupredoi (Contributor)

cheers @mo-gregmunday 🍺 Maybe you can add one of those internal reviewers as sci reviewer here? If they're on GH, that is

@mo-gregmunday (Contributor, Author)

> cheers @mo-gregmunday 🍺 Maybe you can add one of those internal reviewers as sci reviewer here? If they're on GH, that is

Hi @valeriupredoi, apologies for the slow response, I've been on annual leave for the last few weeks!

I've got their GH username: eleanorgb; however, I think they may need to be added to the ESMValTool repo team on here before I can add them?

@bouweandela (Member)

> I've got their GH username: @eleanorgb; however, I think they may need to be added to the ESMValTool repo team on here before I can add them?

Anyone with a GitHub account can review any pull request, but if you would like to add her to the organization you can send an email to @axel-lauer.

@bouweandela (Member) commented Oct 7, 2022

> https://github.com/orgs/ESMValGroup/teams/esmvaltool-coreteam, I've got a few questions about parallelisation within a diagnostic. Given the number of datasets being processed by this metric, @mo-gregmunday has successfully used a multiprocessing pool to help bring the execution time down significantly by processing each dataset concurrently. Do you think this is an appropriate way to do this within ESMValTool, or is there a more efficient way that we haven't considered?

It works, but it does have its problems. For example:

  • each process will start its own dask scheduler which will try to run things in parallel. This may lead to needless context switching, slowing down the application. Also, dask will need to be configured to spill to disk as soon as the memory reaches something like (80 / the number of schedulers you're running) percent of the memory, or you may run out of memory.
  • dask is not designed to be used like this, so it will hang if you open a file from the parent process and then try to open it again in the child process (see ESMValCore#403, "New preprocessor to clip values to a certain range").

A better solution would be to make use of the features provided by dask. However, this may first need better support in ESMValCore. We're currently experimenting with this in ESMValGroup/ESMValCore#1714. In the future, I think we may automatically add a bit of code to every Python diagnostic so it uses the dask scheduler that is configured for ESMValCore. Make sure that you do not needlessly realize the data in the diagnostic script (e.g. use `cube.core_data()` instead of `cube.data`, where `cube` is an `iris.cube.Cube`, wherever appropriate).

My advice would be to use multiprocessing for now if you have to, but make sure you add it in such a way that it can easily be removed in the future (perhaps add a switch to disable it from the recipe?).
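To illustrate the `cube.core_data()` advice, a minimal sketch (the file name is hypothetical): `core_data()` returns the lazy array when the data are unrealised, while `cube.data` would load everything into memory.

```python
import iris

cube = iris.load_cube("tas_Amon_example_historical.nc")  # hypothetical file

# Stays lazy (a dask array) if the data have not been realised yet.
lazy = cube.core_data()
anomaly = lazy - lazy.mean(axis=0)

# Realise only the final, reduced result.
result = anomaly.compute() if hasattr(anomaly, "compute") else anomaly
print(result.shape)
```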

@mo-gregmunday (Contributor, Author)

> > I would like to propose that suggestions that would take a long time are postponed to a follow-up issue
>
> My experience from past pull requests is that this kind of thing ends up never happening, so I'm not too keen. What kind of changes are these that are so time-consuming?

@bouweandela @ehogan I've gone ahead and made the changes, so it should all be ready!

@ehogan (Contributor) left a comment

@mo-gregmunday I did a final sanity check on the latest version of the changes and have a few minor comments 👍

@mo-gregmunday (Contributor, Author)

> [...] My advice would be to use multiprocessing for now if you have to, but make sure you add it in such a way that it can easily be removed in the future (perhaps add a switch to disable it from the recipe?).

There is a switch in the recipe file which allows the script to be parallelised or not (using multiprocessing). I've not found any need for Dask to optimise the script itself: I've vectorised the linear regression operations on the cube data, which is very fast, and elsewhere in the script I've used `cube.core_data()` where I can to optimise memory usage.
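As a rough illustration of what vectorising the regression means here (a sketch under simple assumptions, not the PR's actual implementation): climate patterns are per-grid-cell least-squares slopes of a local variable against global mean temperature, and a single array operation over the time axis computes every cell's slope at once.

```python
import numpy as np

def regress_patterns(local_data, global_tas):
    """Slope of each grid cell's values against global mean tas.

    local_data has shape (time, lat, lon); global_tas has shape (time,).
    Returns one slope (the "pattern") per grid cell, shape (lat, lon).
    """
    x = global_tas - global_tas.mean()
    y = local_data - local_data.mean(axis=0)
    # A single tensordot over the time axis replaces a per-cell loop
    # of individual np.polyfit calls.
    return np.tensordot(x, y, axes=(0, 0)) / (x ** 2).sum()

# Example with random data:
rng = np.random.default_rng(0)
print(regress_patterns(rng.normal(size=(120, 4, 5)),
                       rng.normal(size=120)).shape)  # (4, 5)
```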

@Jon-Lillis (Contributor) left a comment

Changes look good, Greg! I think there may still be some dask optimisation to be done in future but for now I'm happy that my comments have been discussed and addressed, and I think this looks ready to be included in the next release.

@mo-gregmunday (Contributor, Author)

> Changes look good, Greg! I think there may still be some dask optimisation to be done in future but for now I'm happy that my comments have been discussed and addressed, and I think this looks ready to be included in the next release.

Thanks so much @Jon-Lillis!

@ehogan (Contributor) left a comment

Great work @mo-gregmunday, many thanks for addressing all my review comments! @bouweandela and / or @valeriupredoi, would it be possible for one of you to merge this asap, please? We are planning to start the process of finalising the release tomorrow! 😁

@mo-gregmunday (Contributor, Author)

> Great work @mo-gregmunday, many thanks for addressing all my review comments! @bouweandela and / or @valeriupredoi, would it be possible for one of you to merge this asap, please? We are planning to start the process of finalising the release tomorrow! 😁

Thanks so much @ehogan, @Jon-Lillis and @eleanorgb for your time and hard work reviewing this!! :)

@ehogan (Contributor) commented Jun 19, 2024

One last comment to add that I did run the recipe at the MO:

- with `parallelise: true`:

  ```
  INFO    [27875] Time for running the recipe was: 0:16:28.707423
  INFO    [27875] Maximum memory used (estimate): 97.7 GB
  [...]
  INFO    [27875] Run was successful
  ```

- with `parallelise: false`:

  ```
  INFO    [44881] Time for running the recipe was: 0:26:49.341768
  INFO    [44881] Maximum memory used (estimate): 43.3 GB
  [...]
  INFO    [44881] Run was successful
  ```

@bouweandela merged commit e7c5cd5 into main on Jun 20, 2024
8 checks passed
@bouweandela deleted the climate_patterns_only branch on June 20, 2024 at 08:43
@valeriupredoi (Contributor)

good work, folks! Afraid this merge broke the GA tests (https://github.com/ESMValGroup/ESMValTool/actions/runs/9606425046/job/26495922820), but @ehogan is fixing it by introducing package-level importing in #3672

Successfully merging this pull request may close these issues: Generating Climate Patterns from CMIP6 Models