Support wildcards in the recipe and improve support for ancillary variables and dataset versioning #1609

Merged (152 commits) on Feb 24, 2023


bouweandela
Member

@bouweandela bouweandela commented May 31, 2022

Description

Add a dataset class that simplifies dealing with datasets as defined in the recipe.

Goals of this pull request are to:

  • Support wildcards in the dataset definitions in the recipe
  • Add a new syntax to the recipe for specifying supplementary datasets for attaching ancillary variables and cell measures to the main variable
  • Write out the versions of datasets to a 'filled' copy of the recipe. This makes it much easier to reproduce results from previous runs because the input data can no longer change under our feet, which removes at least one of the changing factors that make testing from one version to the next difficult. The filled recipe is saved in the run directory, e.g. run/recipe_example_filled.yml when running a recipe called recipe_example.yml.
  • Refactor esmvalcore/recipe/_recipe.py so it uses the esmvalcore.dataset.Dataset class added in #1877 (Add esmvalcore.dataset module)
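
The naming of the filled recipe mentioned in the list above can be sketched as follows. This is only an illustration: the helper `filled_recipe_path` is hypothetical and its naming rule is inferred from the single example given (recipe_example.yml becomes run/recipe_example_filled.yml), not taken from the tool's code.

```python
from pathlib import Path


def filled_recipe_path(recipe: Path, run_dir: Path) -> Path:
    """Return where the 'filled' copy of a recipe would be saved.

    Assumption: '_filled' is inserted before the file extension and the
    file is placed directly in the run directory.
    """
    return run_dir / f"{recipe.stem}_filled{recipe.suffix}"


print(filled_recipe_path(Path("recipe_example.yml"), Path("run")).as_posix())
```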

Example recipe demonstrating the new wildcard features

preprocessors:
  global_mean:
    area_statistics:
      operator: mean

datasets:
  - dataset: '*'
    institute: '*'
    project: CMIP6
    exp: historical
    ensemble: '*'
    grid: '*'

diagnostics:
  example_diagnostic:
    description: Global mean temperature.
    variables:
      tas:
        mip: Amon
        preprocessor: global_mean
        timerange: '1990/2000'
    scripts: null

This will expand to all available CMIP6 datasets for the historical experiment and attach the areacella variable that is most similar to the main variable.
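
To make the idea of wildcard expansion concrete, here is a minimal sketch using only the Python standard library. It is not the implementation in this pull request: the function `expand_wildcards` and the inventory `AVAILABLE` (including the dataset and institute names) are hypothetical, and the real tool finds candidates from files on disk or on ESGF rather than from a hard-coded list.

```python
from fnmatch import fnmatchcase

# Facets of the dataset entry in the example recipe.
TEMPLATE = {
    "dataset": "*",
    "institute": "*",
    "project": "CMIP6",
    "exp": "historical",
    "ensemble": "*",
    "grid": "*",
}

# Hypothetical inventory of available data; all names are made up.
AVAILABLE = [
    {"dataset": "MODEL-A", "institute": "INST-1", "project": "CMIP6",
     "exp": "historical", "ensemble": "r1i1p1f1", "grid": "gn"},
    {"dataset": "MODEL-B", "institute": "INST-2", "project": "CMIP6",
     "exp": "ssp585", "ensemble": "r1i1p1f1", "grid": "gr"},
    {"dataset": "MODEL-C", "institute": "INST-3", "project": "CMIP5",
     "exp": "historical", "ensemble": "r1i1p1"},
]


def expand_wildcards(template, available):
    """Return one concrete facet dict per candidate matching the template."""
    expanded = []
    for candidate in available:
        # Every facet in the template must exist in the candidate and match
        # the (possibly wildcard) pattern.
        if all(
            key in candidate and fnmatchcase(str(candidate[key]), str(pattern))
            for key, pattern in template.items()
        ):
            expanded.append({key: candidate[key] for key in template})
    return expanded


for facets in expand_wildcards(TEMPLATE, AVAILABLE):
    print(facets)
```

Only the first candidate matches, because the second has a different experiment and the third belongs to a different project and has no grid facet.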

Documentation

Temporary use_legacy_supplementaries command-line/config-user option

When the fx_variables keyword is used in a preprocessor function, the tool will automatically run with --use_legacy_supplementaries=True and use these definitions to define the supplementary datasets in the recipe. To try out the new behaviour, run the tool with --use_legacy_supplementaries=False, which is the default when no fx_variables are defined in the recipe. This option will be removed in v2.10.0. To upgrade your recipes, remove all fx_variables entries and, if necessary, define the ancillaries as described in

Backward incompatible changes

There is a new and much smarter system for automatically defining the supplementary variables (ancillary variables and cell measures) needed by preprocessor functions in the recipe. In v2.8 it will be used by default when there are no fx_variables defined as preprocessor function arguments in the recipe, in v2.9 it will be used by default in all cases, and in v2.10 it will be the only way. This new system automatically defines supplementary variables using the new wildcards feature, so it will find many more supplementary variables. It will also add only a single supplementary variable, e.g. when a preprocessor function requires cell area, it will add areacello if the main variable is from modeling_realm ocean, seaIce, or ocnBgchem, and areacella in other cases. Previously, both variables were added and downloaded, but then only the one with the matching shape was used. This had the downside that unnecessary data was downloaded and errors in the supplementary variable were silently ignored.
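
The rule for picking a single cell area variable described above can be written down as a small sketch. The function name `cell_area_short_name` is hypothetical and the input is assumed to be a list of modeling realm names; the actual logic in the tool may be organised differently.

```python
# Realms for which the description says areacello is added.
OCEAN_REALMS = {"ocean", "seaIce", "ocnBgchem"}


def cell_area_short_name(modeling_realm):
    """Pick the cell area variable for a main variable.

    `modeling_realm` is assumed to be an iterable of realm names,
    e.g. ["ocnBgchem"] or ["atmos"].
    """
    if OCEAN_REALMS.intersection(modeling_realm):
        return "areacello"
    return "areacella"


print(cell_area_short_name(["ocnBgchem"]))
print(cell_area_short_name(["atmos"]))
```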

In some cases, the new way of adding supplementary variables may not work (yet), specifically when the facet values cannot be read from the path to the file or from ESGF. In that case, the supplementary variables can be specified in the recipe. Here are some examples of where things may not work and how to solve them. If you prefer to postpone upgrading to v2.10, there is also the option to run the tool with --use_legacy_supplementaries=True, which will use the previous behaviour.

Example with OBS data

When using the preprocessor function weighting_landsea_fraction with variable cSoil from Lmon and a dataset from the OBS or OBS6 projects with the default drs setting (this is expected to start working only in v2.9, see #1609 (comment) for details), for a dataset defined as

        additional_datasets:
          - dataset: HWSD
            project: OBS
            type: reanaly
            version: 1.2
            tier: 3

the tool will automatically try to use

        additional_datasets:
          - dataset: HWSD
            project: OBS
            type: reanaly
            version: 1.2
            tier: 3
            supplementary_variables:
              - short_name: sftlf
                mip: '*'
                version: '*'

but that does not work yet because the tool can only read the version from the directory name, not from the filename, and with the default drs the version is not part of the directory name (see #1943). Therefore, the user needs to specify the supplementary variable like this:

        additional_datasets:
          - dataset: HWSD
            project: OBS
            type: reanaly
            version: 1.2
            tier: 3
            supplementary_variables:
              - short_name: sftlf
                mip: fx

Example with a variable that is on an unusual grid

When using the preprocessor function area_statistics with variable fgco2 from Omon and no supplementary variables defined in the recipe, the tool will automatically add the areacello variable because fgco2 is from an ocean realm, so it will replace

        additional_datasets:
          - dataset: MIROC-ESM
            project: CMIP5
            exp: historical
            ensemble: r1i1p1

with

        additional_datasets:
          - dataset: MIROC-ESM
            project: CMIP5
            exp: historical
            ensemble: r1i1p1
            supplementary_variables:
              - short_name: areacello
                mip: '*'
                exp: '*'
                ensemble: '*'
                institute: '*'
                product: '*'

but that does not work because unusually, the fgco2 variable of MIROC-ESM is not on an ocean grid. To make the recipe work, the user needs to specify the supplementary variable in the recipe for this dataset and tell the tool to not add areacello automatically using the skip flag:

        additional_datasets:
          - dataset: MIROC-ESM
            project: CMIP5
            exp: historical
            ensemble: r1i1p1
            supplementary_variables:
              - short_name: areacella
                mip: fx
                ensemble: r0i0p0
              - short_name: areacello
                skip: true

See the documentation sections "Specifying supplementary variables in the recipe" and "Preprocessor functions that use ancillary variables and cell measures" for more information.
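
How user-defined entries and the skip flag could interact with the automatically added supplementary variables is sketched below for the MIROC-ESM case. This is an assumption-laden illustration, not the code from this pull request: the function `resolve_supplementaries` is hypothetical, and it assumes that a user-defined entry fully replaces an automatic entry with the same short_name.

```python
def resolve_supplementaries(automatic, user_defined):
    """Combine automatic and user-defined supplementary variables.

    Assumptions: entries are matched on 'short_name', a user-defined entry
    replaces the automatic one, and entries with 'skip: true' are dropped.
    """
    by_name = {entry["short_name"]: dict(entry) for entry in automatic}
    for entry in user_defined:
        by_name[entry["short_name"]] = dict(entry)
    return [
        {key: value for key, value in entry.items() if key != "skip"}
        for entry in by_name.values()
        if not entry.get("skip", False)
    ]


# What the tool would add automatically for fgco2 from Omon.
automatic = [{"short_name": "areacello", "mip": "*", "exp": "*",
              "ensemble": "*", "institute": "*", "product": "*"}]
# What the user specifies in the recipe for MIROC-ESM.
user_defined = [
    {"short_name": "areacella", "mip": "fx", "ensemble": "r0i0p0"},
    {"short_name": "areacello", "skip": True},
]

print(resolve_supplementaries(automatic, user_defined))
```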

Deprecations

  • The preprocessor argument fx_variables for various preprocessor functions that use ancillary variables or cell measures is deprecated and will be removed in v2.10.0. The recommended upgrade procedure is to remove the fx_variables section. If automatically defining the required supplementary variables does not work, define them in the variable or (additional_)datasets section as described in the documentation.
  • The new option use_legacy_supplementaries on the command line/config-user.yml keeps support for the old ways of specifying ancillary variables and cell measures in the recipe, i.e. automatically adding some and customizing those with the preprocessor argument fx_variables. It is currently enabled only if fx_variables is used in the recipe, but will be disabled by default in v2.9 and is scheduled for removal in v2.10.
  • The preprocessor functions add_fx_variables and remove_fx_variables are deprecated and will be removed in v2.10. Use the new preprocessor functions add_supplementary_variables and remove_supplementary_variables instead.
  • The callback argument of the function esmvalcore.preprocessor.load is deprecated and will be removed in v2.10 (see #1800, "Proposal to deprecate the callback argument from esmvalcore.preprocessor.load in v2.8 and remove it in v2.10").

Please comment if you think this schedule is too fast.
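
The recommended upgrade step of removing the fx_variables arguments can be sketched as a small script operating on a recipe that has already been loaded into a dictionary. The helper `strip_fx_variables_arguments` is hypothetical and not part of the tool; it assumes that fx_variables only appears as an argument of preprocessor functions under the top-level preprocessors section.

```python
import copy


def strip_fx_variables_arguments(recipe):
    """Return a copy of the recipe without fx_variables arguments."""
    upgraded = copy.deepcopy(recipe)
    for preprocessor in (upgraded.get("preprocessors") or {}).values():
        for settings in (preprocessor or {}).values():
            # Preprocessor steps without arguments may be None or a bool.
            if isinstance(settings, dict):
                settings.pop("fx_variables", None)
    return upgraded


old_recipe = {
    "preprocessors": {
        "global_mean": {
            "area_statistics": {
                "operator": "mean",
                "fx_variables": {"areacella": None},
            },
        },
    },
}

print(strip_fx_variables_arguments(old_recipe))
```

If the automatic definition of supplementary variables does not work for a dataset afterwards, the entries still need to be added by hand as described in the list above.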

Fixed issues

Closes #56
Closes #377
Closes #589
Closes #1138
Closes #1185
Closes #1297
Closes #1454
Closes #1896

Potential follow ups: #1760, #1891



- Use configuration from experimental

Output file is now a Path

fx file facets are now correctly set when parsing the recipe, no need to do it in data finder anymore
Contributor

@schlunma schlunma left a comment


Thanks Bouwe again for this impressive PR, all these new features and the refactoring of ancient code is really awesome 🚀

I ran all my tests again, and everything works as expected. 🎉 All remaining problems are well documented, so this PR is ready to be merged from my side 🍻

Contributor

@sloosvel sloosvel left a comment


Thanks for addressing the comments @bouweandela, this PR should definitely be on the highlights for this release!!

Contributor

@remi-kazeroni remi-kazeroni left a comment


Thanks a lot for all the work and addressing all the comments Bouwe! That's really great to have this in time for v2.8 👍

Apart from the many new features that this PR offers, there are also nice examples of how to document backward incompatible changes and deprecated features in the description 👍 I ran some final tests based on these explanations and all looks good to me.

Can I ask for 2 final quick things?

  • removing the branch from .github/workflows/run-tests.yml
  • using the option name use_legacy_supplementaries everywhere in the description. The use-legacy-ancillaries is not used and the options with - are not working if specified in the config file (only _ works).

If there are no final comments, I would propose to merge this PR today at 15:00 CET (14:00 GMT) so that we can have this in main before the weekend and see how the testing machinery reacts. This would also allow those working on the final v2.8 PRs depending on this one to merge the main branch and push their PRs to the finish line 👍

@remi-kazeroni
Contributor

It would be interesting to discuss in a wider circle including the @ESMValGroup/scientific-lead-development-team (maybe at a workshop?) where we want to go with these 2 new recipe formats to have a common understanding of which formats should be used for recipes in the main branch. Once this PR is merged (and v2.8 released), two new recipe formats will be allowed:

  • concise recipes with possibly plenty of wildcards;
  • very lengthy recipes containing one key-value pair per line.

I understand these two new formats would be extremely useful to compose recipes and to ease reproducibility by recording dataset versions as stated in the docs with this PR. But I wonder about the readability of the new format, particularly the second one. For example, recipe_impact.yml has 227 lines and recipe_impact_filled.yml 7137 lines. I wonder if we should clarify in our docs which formats could be accepted in the main branch of the Tool and have a policy about that.

Contributor

@valeriupredoi valeriupredoi left a comment


many a test was done by me too, and I find this monster to be a good monster, like Godzilla - mega well done @bouweandela and all the other peeps who tested and suggested 🍻 x 10

@bouweandela
Member Author

Thanks to everyone who helped with testing and reviewing this! 🎉
