Support wildcards in the recipe and improve support for ancillary variables and dataset versioning #1609

bouweandela · 2022-05-31T10:43:19Z

Description

Add a dataset class that simplifies dealing with datasets as defined in the recipe.

Goals of this pull request are to:

Support wildcards in the dataset definitions in the recipe
Add a new syntax to the recipe for specifying supplementary datasets for attaching ancillary variables and cell measures to the main variable
Write out the versions of datasets to a 'filled' copy of the recipe. This makes it much easier to reproduce results from previous runs, because it stops the data from changing under our feet and so removes at least one of the changing factors that make testing from one version to the next difficult. The filled recipe is saved in the run directory, e.g. run/recipe_example_filled.yml when running a recipe called recipe_example.yml.
Refactor esmvalcore/recipe/_recipe.py so it uses the esmvalcore.dataset.Dataset class added in "Add esmvalcore.dataset module" (#1877)
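The naming of the filled recipe mentioned in the goals above can be illustrated with a minimal sketch. The helper name `filled_recipe_path` and the `run_dir` argument are hypothetical and only mirror the example given in the description; this is not the code used by the tool.

```python
from pathlib import Path


def filled_recipe_path(recipe: Path, run_dir: Path) -> Path:
    """Return where the 'filled' copy of a recipe would be written.

    Hypothetical helper: it only reproduces the naming pattern from the
    description (recipe_example.yml -> run/recipe_example_filled.yml).
    """
    return run_dir / f"{recipe.stem}_filled.yml"


path = filled_recipe_path(Path("recipe_example.yml"), Path("run"))
```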

Example recipe demonstrating the new wildcard features

preprocessors:
  global_mean:
    area_statistics:
      operator: mean

datasets:
  - dataset: '*'
    institute: '*'
    project: CMIP6
    exp: historical
    ensemble: '*'
    grid: '*'

diagnostics:
  example_diagnostic:
    description: Global mean temperature.
    variables:
      tas:
        mip: Amon
        preprocessor: global_mean
        timerange: '1990/2000'
    scripts: null

this will expand to all CMIP6 datasets and attach the areacella variable that is most similar to the main variable.
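As a rough illustration of what expanding the wildcards means, the sketch below matches a facet template against a small, made-up catalogue of available data. The catalogue and the function `expand_wildcards` are assumptions for illustration; the tool itself reads the facet values from file paths or from ESGF.

```python
from fnmatch import fnmatchcase

# Hypothetical catalogue of available data (facets per dataset).
AVAILABLE = [
    {"dataset": "CanESM5", "institute": "CCCma", "project": "CMIP6",
     "exp": "historical", "ensemble": "r1i1p1f1", "grid": "gn"},
    {"dataset": "EC-Earth3", "institute": "EC-Earth-Consortium", "project": "CMIP6",
     "exp": "historical", "ensemble": "r1i1p1f1", "grid": "gr"},
    {"dataset": "CanESM5", "institute": "CCCma", "project": "CMIP6",
     "exp": "ssp585", "ensemble": "r1i1p1f1", "grid": "gn"},
    {"dataset": "MIROC-ESM", "institute": "MIROC", "project": "CMIP5",
     "exp": "historical", "ensemble": "r1i1p1", "grid": "gn"},
]


def expand_wildcards(template, available):
    """Return one fully specified facet dict per catalogue entry that matches."""
    expanded = []
    for facets in available:
        matches = all(
            fnmatchcase(str(facets.get(key, "")), str(pattern))
            for key, pattern in template.items()
        )
        if matches:
            # Concrete facet values replace the wildcards in the template.
            expanded.append({**template, **facets})
    return expanded


template = {
    "dataset": "*", "institute": "*", "project": "CMIP6",
    "exp": "historical", "ensemble": "*", "grid": "*",
}
datasets = expand_wildcards(template, AVAILABLE)
```

Here `datasets` contains one entry per matching CMIP6 historical dataset, with every wildcard replaced by a concrete value, which is the information that ends up in the filled recipe.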

Documentation

Temporary `use_legacy_supplementaries` command-line/config-user option

When the fx_variables keyword is used in a preprocessor function, the tool will automatically run with --use_legacy_supplementaries=True and use these definitions to define the supplementary datasets in the recipe. To try out the new behaviour, run the tool with --use_legacy_supplementaries=False, which is also the default when no fx_variables are defined in the recipe. This option will be removed in v2.10.0. To upgrade your recipes, remove all fx_variables entries and, if necessary, define the ancillaries as described in the documentation.

Backward incompatible changes

There is a new and much smarter system for automatically defining the supplementary variables (ancillary variables and cell measures) needed by preprocessor functions in the recipe. In v2.8 it will be used by default when there are no fx_variables defined as preprocessor function arguments in the recipe, in v2.9 it will be used by default in all cases, and in v2.10 it will be the only way. This new system automatically defines supplementary variables using the new wildcards feature, so it will find many more supplementary variables. It will also add only a single supplementary variable, e.g. when a preprocessor function requires cell area, it will add areacello if the main variable is from modeling_realm ocean, seaIce, or ocnBgchem, and areacella in other cases. Previously, both variables were added and downloaded, but then only the one with the matching shape was used. This had the downside that unnecessary data was downloaded and errors in the supplementary variable were silently ignored.
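The rule for choosing a single cell area variable can be written down as a short sketch. The function name `select_cell_area` and the assumption that the modeling realm is available as a list of strings are illustrative only; the actual implementation may differ.

```python
# Realms for which the ocean cell area is used, as listed in the description.
OCEAN_REALMS = {"ocean", "seaIce", "ocnBgchem"}


def select_cell_area(modeling_realm):
    """Pick the cell area variable for a main variable.

    `modeling_realm` is assumed to be an iterable of realm names.
    """
    if OCEAN_REALMS.intersection(modeling_realm):
        return "areacello"
    return "areacella"
```

Only the selected variable is added as a supplementary variable, so no data is downloaded for the one that would not be used.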

In some cases, the new way of adding supplementary variables may not work (yet), specifically when the facet values cannot be read from the path to the file or from ESGF. In that case, the supplementary variables can be specified in the recipe. Here are some examples of where things may not work and how to solve them. If you prefer to postpone upgrading to v2.10, there is also the option to run the tool with --use_legacy_supplementaries=True, which will use the previous behaviour.

Example with OBS data

This example is expected to start working only in v2.9 (see #1609 (comment) for details). When using the preprocessor function weighting_landsea_fraction with variable cSoil from Lmon and a dataset from the OBS or OBS6 projects with the default drs setting, for example

        additional_datasets:
          - dataset: HWSD
            project: OBS
            type: reanaly
            version: 1.2
            tier: 3

the tool will automatically try to use

        additional_datasets:
          - dataset: HWSD
            project: OBS
            type: reanaly
            version: 1.2
            tier: 3
            supplementary_variables:
              - short_name: sftlf
                mip: '*'
                version: '*'

but that does not work yet, because the tool cannot yet read the version from the filename, only from the directory name, and the version is not part of the directory name with the default drs (#1943). Therefore, the user needs to specify the supplementary variable like this:

        additional_datasets:
          - dataset: HWSD
            project: OBS
            type: reanaly
            version: 1.2
            tier: 3
            supplementary_variables:
              - short_name: sftlf
                mip: fx

Example with a variable that is on an unusual grid

When using the preprocessor function area_statistics with variable fgco2 from Omon and no supplementary variables defined in the recipe, the tool will automatically add the areacello variable because fgco2 is from an ocean realm, so it will replace

        additional_datasets:
          - dataset: MIROC-ESM
            project: CMIP5
            exp: historical
            ensemble: r1i1p1

with

        additional_datasets:
          - dataset: MIROC-ESM
            project: CMIP5
            exp: historical
            ensemble: r1i1p1
            supplementary_variables:
              - short_name: areacello
                mip: '*'
                exp: '*'
                ensemble: '*'
                institute: '*'
                product: '*'

but that does not work because unusually, the fgco2 variable of MIROC-ESM is not on an ocean grid. To make the recipe work, the user needs to specify the supplementary variable in the recipe for this dataset and tell the tool to not add areacello automatically using the skip flag:

        additional_datasets:
          - dataset: MIROC-ESM
            project: CMIP5
            exp: historical
            ensemble: r1i1p1
            supplementary_variables:
              - short_name: sftlf
                mip: fx
                ensemble: r0i0p0
              - short_name: sftof
                skip: true

See Specifying supplementary variables in the recipe and Preprocessor functions that use ancillary variables and cell measures for more information.
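How user-defined supplementary variables and the skip flag could interact with the automatically added ones is sketched below. The function `merge_supplementaries` and the assumption that `sftof` is the automatically added entry are hypothetical and chosen to match the last recipe snippet above; this is not the tool's actual code.

```python
def merge_supplementaries(automatic, user_defined):
    """Combine automatically added and user-defined supplementary variables.

    Entries are keyed by short_name: user-defined facets override automatic
    ones, and an entry with `skip: true` removes the variable altogether.
    """
    merged = {entry["short_name"]: dict(entry) for entry in automatic}
    for entry in user_defined:
        name = entry["short_name"]
        if entry.get("skip", False):
            merged.pop(name, None)
        else:
            merged[name] = {**merged.get(name, {}), **entry}
    return list(merged.values())


# Assumed automatic definition, followed by the entries from the recipe.
automatic = [{"short_name": "sftof", "mip": "*"}]
user_defined = [
    {"short_name": "sftlf", "mip": "fx", "ensemble": "r0i0p0"},
    {"short_name": "sftof", "skip": True},
]
result = merge_supplementaries(automatic, user_defined)
```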

Deprecations

The preprocessor argument fx_variables for various preprocessor functions that use ancillary variables or cell measures is deprecated and will be removed in v2.10.0. The recommended upgrade procedure is to remove the fx_variables section. If automatically defining the required supplementary variables does not work, define them in the variable or (additional_)datasets section as described in the documentation.
The new option use_legacy_supplementaries on the command line/config-user.yml keeps support for the old ways of specifying ancillary variables and cell measures in the recipe, i.e. automatically adding some and customizing those with the preprocessor argument fx_variables. It is currently enabled only if fx_variables is used in the recipe, but will be disabled by default in v2.9 and is scheduled for removal in v2.10.
The preprocessor functions add_fx_variables and remove_fx_variables are deprecated and will be removed in v2.10. Use the new preprocessor functions add_supplementary_variables and remove_supplementary_variables instead.
The callback argument of the function esmvalcore.preprocessor.load is deprecated and will be removed in v2.10 (see "Proposal to deprecate the callback argument from esmvalcore.preprocessor.load in v2.8 and remove it in v2.10", #1800).

Please comment if you think this schedule is too fast.
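The v2.8 default for use_legacy_supplementaries described above can be sketched as follows. Both function names are hypothetical, and the recipe is assumed to be available as a plain dictionary loaded from YAML; this is a reading of the description, not the tool's implementation.

```python
def uses_fx_variables(recipe):
    """Return True if any preprocessor function in the recipe uses fx_variables."""
    for settings in (recipe.get("preprocessors") or {}).values():
        for arguments in (settings or {}).values():
            if isinstance(arguments, dict) and "fx_variables" in arguments:
                return True
    return False


def resolve_use_legacy_supplementaries(recipe, user_setting=None):
    """Return the effective value of use_legacy_supplementaries.

    An explicit user setting wins; otherwise the legacy behaviour is
    enabled only if fx_variables appears in the recipe.
    """
    if user_setting is not None:
        return user_setting
    return uses_fx_variables(recipe)


new_style = {"preprocessors": {"global_mean": {"area_statistics": {"operator": "mean"}}}}
old_style = {
    "preprocessors": {
        "global_mean": {
            "area_statistics": {"operator": "mean", "fx_variables": {"areacella": None}},
        }
    }
}
```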

Fixed issues

Closes #56
Closes #377
Closes #589
Closes #1138
Closes #1185
Closes #1297
Closes #1454
Closes #1896

Potential follow ups: #1760, #1891

Before you get started

☝ Create an issue to discuss what you are going to do

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.

🧪 The new functionality is relevant and scientifically sound
🛠 This pull request has a descriptive title and labels
🛠 Code is written according to the code quality guidelines
🧪 and 🛠 Documentation is available
🛠 Unit tests have been added
🛠 Changes are backward compatible
🛠 The list of authors is up to date
🛠 All checks below this pull request were successful

To help with the number of pull requests:

🙏 We kindly ask you to review two other open pull requests in this repository

schlunma

Thanks Bouwe again for this impressive PR, all these new features and the refactoring of ancient code is really awesome 🚀

I ran all my tests again, and everything works as expected. 🎉 All remaining problems are well documented, so this PR is ready to be merged from my side 🍻

sloosvel

Thanks for addressing the comments @bouweandela, this PR should definitely be on the highlights for this release!!

remi-kazeroni

Thanks a lot for all the work and addressing all the comments Bouwe! That's really great to have this in time for v2.8 👍

Apart from the many new features that this PR offers, there are also nice examples of how to document backward incompatible changes and deprecated features in the description 👍 I have run some final tests based on these explanations and all looks good to me.

Can I ask for 2 final quick things?

removing the branch from .github/workflows/run-tests.yml
using the option name use_legacy_supplementaries everywhere in the description. The use-legacy-ancillaries is not used and the options with - are not working if specified in the config file (only _ works).

If there are no final comments, I would propose to merge this PR today at 15:00 CET (14:00 GMT) so that we can have this in main before the weekend and see how the testing machinery reacts. This would also allow those working on the final v2.8 PRs depending on this one to merge the main branch and push their PRs to the finish line 👍

remi-kazeroni · 2023-02-24T11:16:46Z

It would be interesting to discuss in a wider circle including the @ESMValGroup/scientific-lead-development-team (maybe at a workshop?) where we want to go with these 2 new recipe formats to have a common understanding of which formats should be used for recipes in the main branch. Once this PR is merged (and v2.8 released), two new recipe formats will be allowed:

concise recipes with possibly plenty of wildcards;
very lengthy recipes containing one key-value pair per line.

I understand these two new formats would be extremely useful to compose recipes and to ease reproducibility by recording dataset versions as stated in the docs with this PR. But I wonder about the readability of the new format, particularly the second one. For example, recipe_impact.yml has 227 lines and recipe_impact_filled.yml 7137 lines. I wonder if we should clarify in our docs which formats could be accepted in the main branch of the Tool and have a policy about that.

valeriupredoi

many a test was done by me too, and I find this monster to be a good monster, like Godzilla - mega well done @bouweandela and all the other peeps who tested and suggested 🍻 x 10

bouweandela · 2023-02-27T09:13:23Z

Thanks to everyone who helped with testing and reviewing this! 🎉

bouweandela added 6 commits September 10, 2021 16:47

First draft of dataset class

12242e6

Merge branch 'main' of github.com:esmvalgroup/esmvalcore into split-recipe

2e3b339

Work in progress

f4c2d72

Merge branch 'main' of github.com:esmvalgroup/esmvalcore into split-recipe

696c9f2

Fix error in _dataset.py

a116f88

Add ancillary variables

9b28044

This was referenced May 31, 2022

Allow wildcard searches when specifying fx variables in preprocessor #1082

Closed

Apply clip_timerange to time dependent fx variables #1603

Merged

bouweandela added 3 commits May 31, 2022 16:55

Make loading a cube work from the Python API

ba2ba53

Add an integration test for API load

dca7634

Move stuff from recipe to dataset

e5dd013

bouweandela mentioned this pull request Jun 7, 2022

Make _find_input_files public #1618

Closed

10 tasks

bouweandela added 6 commits June 12, 2022 23:51

Some progess

ccf44f1

Restore previous code

1f3ab2f

Make new recipe format for ancillaries work

6496f78

Work in progress

679f674

Progress

55ad481

... is slow

bff4918

bouweandela mentioned this pull request Jun 20, 2022

Moved extraction of reference datasets for (horizontal and vertical) regridding into preprocessor functions #1455

Closed

10 tasks

bouweandela added this to the v2.7.0 milestone Jun 20, 2022

bouweandela added 4 commits July 4, 2022 17:54

Improved addition of ancillary variables in recipe

c5e2430

Merge branch 'main' of github.com:esmvalgroup/esmvalcore into split-recipe

e5a1382

Fix things that got broken due to lack of tests

f1e54a4

Remove old config-user.yml code

9e95091

bouweandela force-pushed the split-recipe branch from eddd511 to 9e95091 Compare July 8, 2022 20:33

bouweandela added 5 commits July 8, 2022 23:21

Update data finder tests

3c36ecb

- Use configuration from experimental
- Output file is now a Path
- fx file facets are now correctly set when parsing the recipe, no need to do it in data finder anymore

Update CMOR table load tests to use file instead of dict

5d54394

Remove unnessary code

2d19798

Update PreprocessorFile creation and remove test for no longer needed function

8becb37

Work in progress

7ad9c4e

bouweandela added the backwards incompatible change label Feb 24, 2023

schlunma approved these changes Feb 24, 2023

sloosvel approved these changes Feb 24, 2023

remi-kazeroni approved these changes Feb 24, 2023

valeriupredoi approved these changes Feb 24, 2023

turn off GA tests

7004a9d

remi-kazeroni merged commit 638a399 into main Feb 24, 2023

remi-kazeroni deleted the split-recipe branch February 24, 2023 14:45

bouweandela mentioned this pull request Feb 27, 2023

Add a directory and filename template with facets stored in directory for OBS and OBS6? #1944

Closed

bouweandela mentioned this pull request Feb 27, 2023

More reproducible recipes in ESMValTool ESMValGroup/ESMValTool#3054

Open

Peter9192 mentioned this pull request Feb 27, 2023

dataset.from_files always returns empty #1896

Closed

valeriupredoi mentioned this pull request Mar 2, 2023

Changelog for v2.8.0rc1 #1952

Merged

8 tasks

bouweandela mentioned this pull request Mar 9, 2023

Fix issue where data was not loaded and saved #1963

Merged

8 tasks

remi-kazeroni mentioned this pull request Mar 10, 2023

empty ancestor list in recipe_esacci_lst #1967

Closed

axel-lauer mentioned this pull request Mar 20, 2023

Removed fx_variables in recipe_mpqb_xch4 and recipe_lauer22jclim_fig8 ESMValGroup/ESMValTool#3117

Merged

5 tasks

This was referenced May 4, 2023

area_statistics unable to load areacello FX files when diagnostic MIP doesn't match fx MIP #440

Closed

Implement multiple ensemble style syntax used for datasets in fx variable descriptions for pre processor #1081

Closed

schlunma mentioned this pull request Jul 14, 2023

iris.analysis.cartography.area_weights() not working with 2D ocean coordinates #380

Closed

bouweandela mentioned this pull request Sep 28, 2023

Remove the deprecated option use_legacy_supplementaries #2202

Merged

9 tasks

bouweandela mentioned this pull request Apr 3, 2024

Shrink _recipe.py #639

Closed

schlunma mentioned this pull request Oct 15, 2024

Remove recipe filler utility ESMValGroup/ESMValTool#3777

Open

7 tasks
