Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow wildcard searches when specifying fx variables in preprocessor #1082

Closed
wants to merge 45 commits into from

Conversation

thomascrocker
Copy link
Contributor

@thomascrocker thomascrocker commented Apr 26, 2021

Description

Closes #1081
Finding data for fx variables is a bit of a pain depending on MIP project.
For CMIP5, fx variables are usually stored under a new ensemble (r0i0p0) although there are a small number of exceptions, see for example,
https://esgf-index1.ceda.ac.uk/esg-search/search/?offset=0&limit=10&type=Dataset&replica=false&latest=true&project=CMIP5&cmor_table=fx&ensemble=r0i0p1%2Cr0i0p141%2Cr0i0p142%2Cr0i0p150%2Cr0i0p151%2Cr0i0p2%2Cr0i0p3%2Cr0i0p4%2Cr0i0p5%2Cr0i0p6%2Cr1i1p1&facets=project%2Cproduct%2Cinstitute%2Cmodel%2Cexperiment%2Cexperiment_family%2Ctime_frequency%2Crealm%2Ccmor_table%2Censemble%2Cvariable%2Cvariable_long_name%2Ccf_standard_name%2Cdata_node&format=application%2Fsolr%2Bjson
For CORDEX: It appears to vary arbitrarily depending on modelling centre. "Invariant fields (frequency=fx) may have the value r0i0p0 or that of the corresponding GCMEnsembleMember attribute" http://is-enes-data.github.io/cordex_archive_specifications.pdf
For CMIP6: I can't find any specifications, but it seems fx files are mostly under individual ensemble memebers, but not sure if this is the case for all models and fx fields.

This fix essentially detects if finding fx files has failed. If it has, it will automatically search again under r0i0p0. This way, the user can still specify explicitly the fx ensemble to use (and ESMValTool will prefer the files stored under the same ensemble as the variable if the user doesn't specify anything), and CMIP6 will work correctly. For CORDEX and CMIP5 the software will happily find the correct files under r0i0p0 and writes a line to the debug logger to record as such.
This fix solves the issue by allowing the user to specify FX variables in the recipe using glob style wildcards, this can then be used to allow the pre-processor to find FX data that is stored in different locations.


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number pull requests:

@schlunma
Copy link
Contributor

Hi @thomascrocker, thanks for your contribution!

I had a brief look at this and found that our tests for data_finder.py are not covering fx files at the moment. I will add thorough tests for that tomorrow. I think it makes sense to include the additional tests first before reviewing/merging this PR. I hope that is okay for you!

Sorry for any inconvenience!

@thomascrocker
Copy link
Contributor Author

Hi @thomascrocker, thanks for your contribution!

I had a brief look at this and found that our tests for data_finder.py are not covering fx files at the moment. I will add thorough tests for that tomorrow. I think it makes sense to include the additional tests first before reviewing/merging this PR. I hope that is okay for you!

Sorry for any inconvenience!

Thanks, I saw that some of the tests had failed under the commit test, but didn't understand enough about them to be honest. Happy for you to add whatever you feel is necessary.

esmvalcore/_data_finder.py Outdated Show resolved Hide resolved
@valeriupredoi
Copy link
Contributor

nice work @thomascrocker 👍

@thomascrocker
Copy link
Contributor Author

Ok, I've added a short explanation to the documentation. I realised this functionality affects more than just land / sea masking, (i.e. anywhere that FX files are used by the pre processor) so I've added the same paragraph in all the places I think it is relevant. It could be that I may have missed some though.

@valeriupredoi
Copy link
Contributor

cheers @thomascrocker 👍 Indeed it does but it looks OK to me. Still I'd be happy if @schlunma could have one last look at the tests, just to be 100% sure things work the way they should, one extra pair of technical eyes is always good 🍺

@valeriupredoi
Copy link
Contributor

frenly ping @schlunma for a final review (tests changed) and @thomascrocker fix the conflict pls 🍺

@bouweandela
Copy link
Member

bouweandela commented May 14, 2021

Is this still needed, given that an improved syntax for specifying fx variables was introduced in #999?

@zklaus As our expert on conventions, do you have an opinion on these changes?

@valeriupredoi
Copy link
Contributor

Is this still needed, given that an improved syntax for specifying fx variables was introduced in #999?

@zklaus As our expert on conventions, do you have an opinion on these changes?

yes it's needed to find the actual files, this doesn't impact the loading and use of the fx variables

@ledm
Copy link
Contributor

ledm commented Feb 2, 2022

In a perfect world, we wouldn't need this as every model would have it's fx data in the correct place and we wouldn't need wildcards. More pragmatically, I've been using this code for at least 6 months and it has been massively labour saving. I have no doubt that others would find it just as helpful as I have in the short term.

In the longer term, while I don't fully understand what @zklaus and @bouweandela are asking for instead of this, it sounds like it may require a significant re-write of good parts of this code. (And may be superseded by up-coming changes in iris?) I think that so long as it's passing tests, and works as described, merging this now might be helpful for a lot of people.

It's a fantastic feature that is very powerful and very user friendly.

@bouweandela
Copy link
Member

The problem with this is that recipes that use this feature are not reproducible and you will never notice. Let me try to get the fx file specification in the recipe right for version v2.6 and if I don't manage by then you can merge this. Does that sound like a compromise?

@ledm
Copy link
Contributor

ledm commented Feb 8, 2022

As I'm still in the process of using this code, (it's really the only tool for the job), can you please help with the conflicts in _data_finder at some point? There's something weird going on there that's preventing me from running.

@schlunma
Copy link
Contributor

schlunma commented Feb 8, 2022

I can have a look at the conflicts, but probably won't have the time to test everything.

@schlunma
Copy link
Contributor

schlunma commented Feb 8, 2022

I solved the conflicts now, let me know if the branch works again @ledm

@ledm
Copy link
Contributor

ledm commented Feb 8, 2022

You're a hero. Thanks @schlunma

@ledm
Copy link
Contributor

ledm commented Feb 14, 2022

I used this branch without the most recent patch (5c455ed) and with several additional fixes but no significant changes, to produce an analysis over the weekend on jasmin. The preprocessor ran to completion, and everything appears fine in the output text, however all netcdfs in preproc are fully masked and contain no data. It's really weird and I don't get it. Any ideas?

Edit: This looks like a different issue. nevermind.

@ledm
Copy link
Contributor

ledm commented May 25, 2022

Hi, was this functionality added to 2.6? I'm still very happy with @schlunma's code and using it a lot.

@zklaus
Copy link

zklaus commented May 30, 2022

@bouweandela, I think @ledm's comment refers to your earlier comment

The problem with this is that recipes that use this feature are not reproducible and you will never notice. Let me try to get the fx file specification in the recipe right for version v2.6 and if I don't manage by then you can merge this. Does that sound like a compromise?

So... what do you think? Did we make progress? Is there an issue or perhaps even already a PR to track here?

@bouweandela
Copy link
Member

Only a branch so far, the deadline for the release seemed so far away ;-) I'll try to make some progress today.

@bouweandela
Copy link
Member

Just fixed the merge conflict and turned the branch into a draft pull request: #1609

@bouweandela
Copy link
Member

It seems unlikely that I'll get #1609 finished in time for v2.6, it turns out to be quite a bit of work, but I'm off to a good start so can I ask that you please wait with this for one more release?

@ledm
Copy link
Contributor

ledm commented Oct 6, 2022

So we haven't merged this yet?

I still need to use wildcards for FX files every time that I run ESMValTool, and probably will need to keep doing so for the foreseeable future. What is the suggested best alternative practice instead of using this method? #1609 is still a draft and this #1082 conflicts with main now.

@bouweandela
Copy link
Member

Hi @ledm, #1609 is almost ready to be used, maybe I can show you how to use it and you can try it out at the upcoming workshop?

@ledm
Copy link
Contributor

ledm commented Oct 7, 2022

Yes please, I'm incredibly frustrated with this. FX files are crucial, and yet they have never worked properly or had a flexible enough interface. It's been four years that I've been raising this as an issue and it's still broken and undocumented.

The solutions that I wrote for this work in 2021 and earlier no longer work, so now I'm about to submit two different papers on the basis of this PR, and neither of them can be reproduced right now (or re-run!) This is the exact reason why you initially declined this PR - reproducibility of recipes in publications.

However, for obvious reasons, I'm extremely keen to not have to re-write everything at the last minute right before submission.

@ledm
Copy link
Contributor

ledm commented Oct 7, 2022

Also, there's no changes to documentation in #1609 yet, so if you want me to use it, you can expect a lot of emails/whatsapp messages! (Just ask @valeriupredoi about that, lol!)

@bouweandela
Copy link
Member

I added an example recipe to the description of the pull request. I'm still working on it so expect things to break, especially commits called WIP (Work In Progress), but the example runs if you check out commit 98f052f and set offline: false in config-user.yml.

@bouweandela
Copy link
Member

It's been four years that I've been raising this as an issue and it's still broken and undocumented.

Raising an issue is a good start, but in the end, someone needs to do the actual work.

Note that the current functionality is actually documented, it just doesn't work very well because it assumes a certain organization of files that just isn't the reality on ESGF.

@schlunma
Copy link
Contributor

schlunma commented May 4, 2023

Closing this but keeping the branch alive; this has been properly implemented in #1609.

@schlunma schlunma closed this May 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working data issue preprocessor Related to the preprocessor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement multiple ensemble style syntax used for datasets in fx variable descriptions for pre processor
6 participants