Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make DRS path templates configurable per rootpath #1894

Merged
merged 12 commits into from
May 29, 2024

Conversation

bouweandela
Copy link
Member

@bouweandela bouweandela commented Jan 13, 2023

Description

Add the possibility to specify the names of the directory and filename templates in config-user.yml per path instead of per project.

Example

download_dir: /work/bd0854/DATA/ESMValTool2/download
rootpath:
  CMIP6:
    /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ: DKRZ
    /work/bd0854/DATA/ESMValTool2/download: ESGF

Closes #129

Link to documentation: https://esmvaltool--1894.org.readthedocs.build/projects/ESMValCore/en/1894/quickstart/find_data.html#explaining-config-user-rootpath


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number pull requests:

@bouweandela bouweandela added the enhancement New feature or request label Jan 13, 2023
@codecov
Copy link

codecov bot commented Jan 13, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.52%. Comparing base (4053008) to head (bfdd2e4).
Report is 38 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1894   +/-   ##
=======================================
  Coverage   94.51%   94.52%           
=======================================
  Files         246      246           
  Lines       14024    14036   +12     
=======================================
+ Hits        13255    13267   +12     
  Misses        769      769           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bouweandela bouweandela added this to the v2.8.0 milestone Feb 6, 2023
Copy link
Contributor

@valeriupredoi valeriupredoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to say awesome, this is very cool! and it is, but I am starting to quiver at every addition to the configuration. This one in particular is cumbersome, and as awesome its functionality, as bad as the way it's done in the configuration; here "bad" in the sense of user friendliness, nothing to do with how it's implemented. A few points:

  • I would add a check if the user actually inputs drs: ... and that differs from the one in /path/: drs
  • I know you are mapping path to drs but the name ESGF to which the path is mapped to is confusing especially since the examples on DKRZ are not related to ESGF but, rather, to some bundled data in one rando directory
  • overall, this is testing the limit of how more complicated can we make things for users, I am not at all at ease with this TBF, and I know I'm not helping by moaning but - there's gotta be an easier route. Just imagine you are a first time user of ESMValTool - how would you feel seeing such a configuration file? And not you, as a talented software engineer, but you as a PhD student just starting studying climate modelling

~/climate_data: ESGF

Note that this notation combines the ``rootpath`` and ``drs`` settings, so it
is not necessary to specify the directory structure in under ``drs`` in this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
is not necessary to specify the directory structure in under ``drs`` in this
is not necessary to specify the directory structure under ``drs`` in this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if they do, and either that (drs: ...) or this one is wrong, or, if they differ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well OK, but one needs to toss a warning at least, say eg "You specified drs: DRS but the other one is used... so that one gets ignored etc"

@bouweandela
Copy link
Member Author

bouweandela commented Feb 9, 2023

The alternative would be to completely redesign how we configure finding data. There are some good ideas out there already, see e.g. #795 (comment), the issues linked therein, and #1165. However, none of these options would be backward compatible. That's why I opted for keeping the existing functionality and extending it with yet another option.

@valeriupredoi
Copy link
Contributor

The alternative would be to completely redesign how we configure finding data. There are some good ideas out there already, see e.g. #795 (comment), the issues linked therein, and #1165. However, none of these options would be backward compatible. That's why I opted for keeping the existing functionality and extending it with yet another option.

yes! Redesign is the key here - at the end of the day, the user will still get their data, but going through much less a pain how to configure their environment. If we go on like this, even us, the devs, will start getting lost in configuration - I, for one, am starting to already, and only have about 6 years experience with the Tool 😁

# CORDEX: /work/ik1017/C3SCORDEX/data/c3s-cordex/output
# CMIP6:
# /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ: DKRZ
# /work/bd0854/DATA/ESMValTool2/download: ESGF
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK let's try work with this - leaving aside my rather negative comments above for now, can we at least make this a reverse key-val instance, so maybe it's a bit more user-firndly? eg:

  CMIP6:
    - DKRZ: /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ
    - ESGF: /work/bd0854/DATA/ESMValTool2/download

Copy link
Member Author

@bouweandela bouweandela Feb 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it more user-friendly to reverse the key and values? Do you mean to use a list or a dict? This example doesn't look like valid yaml.

@remi-kazeroni
Copy link
Contributor

Hi @bouweandela, sorry for being late in reviewing this. I will take a look once the comment's from @valeriupredoi are addressed, would that be ok? Please let us know what is needed to bring this PR to the finish line 🍺

@bouweandela
Copy link
Member Author

bouweandela commented Feb 13, 2023

Redesign is the key here

How about adding a new item to config-user.yml that can be used to directly construct esmvalcore.local.DataSource objects from? We could then easily make that extensible by allowing the user to provide custom DataSource classes, so they could instead use e.g. intake-esm #1218 or whatever they want.

data_sources:
  - type: esmvalcore.local.DataSource
    project: CMIP6
    rootpath: /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ
    dirname_template: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
    filename_template: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'

the type key would only be needed if it's something else than esmvalcore.local.DataSource.

This has some advantages:

  • It does not hide where the path templates are defined, making it easier to understand
  • No nested dicts with many possible entries, just a single list of DataSource objects
  • Extensible if we provide an abstract base class for DataSource. It should have a find_files method and the kind of things that it can return, current examples: esmvalcore.local.LocalFile and esmvalcore.esgf.ESGFFile

There's also a few disadvantages:

  • It may look a bit more intimidating at first glance
  • For user-friendliness, it we would still to finally do something with the proposal of defining sites (e.g. Jasmin, Levante, etc) and having some simple way of getting a default configuration for those
  • No solution for where the CMOR tables or output filenames are defined, those would still be in config-developer.yml
  • I'm not sure if I have time to implement this, having this in v2.8 seems unlikely

@ESMValGroup/technical-lead-development-team Please comment if you have an opinion.

@schlunma
Copy link
Contributor

schlunma commented Feb 14, 2023

I personally like this new feature very much (the way it is implemented right now)! Specifying DRS per rootpath will make my daily work much easier (I don't know how much time I wasted in the past with creating nested directories when I wanted to use data from two different sources). So I am definitely in favor of having this in v2.8.

About the key-value order: since rootpaths are unique and DRS are not (each rootpath has a single DRS, but each DRS can be used for multiple rootpaths) the order that @bouweandela proposed (e.g., /path/to/cmip6: DKRZ) makes much more sense to me.

About the customized data sources: I also like this idea very much! Being able to define custom file and dir templates quickly in the config-user file without messing with a custom config-developer file is great! However, I agree that this might be more intimidating for new users. Maybe we could add some defaults in the config-developer file and the user has the possibility to add custom entries in the config-user file? This could mean that the config-developer and config-user files contain identical options with the config-developer as machine/default configuration, and the config-user as the user-defined configuration. The current distinction of these two files feels rather arbitrary to me. I think this has been proposed in the past, but I cannot find the corresponding issue. Maybe something for a v3.0?

TLDR; I am in favor of merging this as is, and we should think about allowing custom data sources in a future version.

@valeriupredoi
Copy link
Contributor

valeriupredoi commented Feb 14, 2023

So here's the thing - just like Eisenhower once said, I like all this, as long as it's not in the config-user file 😁 Are you guys not realizing that we're slowly beginning to become a tool with such a complicated configuration that one needs less time to learn to fly a Boeing 737 than to learn to correctly configure ESMValTool? I'll bark more on the call today 🍺

@schlunma
Copy link
Contributor

To be honest, I think the current version of specifying DRS+rootpath in two different places (drs and rootpath) is more complicated than what Bouwe proposes here, which is "for rootpath XYZ, use DRS XZY".

@valeriupredoi
Copy link
Contributor

To be honest, I think the current version of specifying DRS+rootpath in two different places (drs and rootpath) is more complicated than what Bouwe proposes here, which is "for rootpath XYZ, use DRS XZY".

this is not decommissioning the other config option, it adds to it it as an extra option. At any rate, let's talk on the call, we can add here and in #1609 an executive summary of our discussion on the call, later on

@bouweandela
Copy link
Member Author

The @ESMValGroup/technical-lead-development-team decided to postpone this feature to the next release.

@bouweandela bouweandela modified the milestones: v2.8.0, v2.10.0, v2.9.0 Feb 14, 2023
@remi-kazeroni
Copy link
Contributor

Thanks @bouweandela for all your work on this already and agreeing to shift this to the next release. Let's make sure that we have a release planning meeting soon after v2.8 is released to discuss this PR. Or even a specific meeting to discuss how we could revamp the configuration of ESMValTool in order to increase the user-friendliness of the Tool

@schlunma
Copy link
Contributor

I'd really like to revive this discussion here. The changes proposed in this PR, i.e., being able to do something like this

download_dir: /work/bd0854/DATA/ESMValTool2/download
rootpath:
  CMIP6:
    /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ: DKRZ
    /work/bd0854/DATA/ESMValTool2/download: ESGF

would be really helpful, especially when dealing with native model output. I really hope we can include this in v2.9.

@bouweandela
Copy link
Member Author

Since there has been no movement on this and we'll have the discussion about how to configure the tool in the future at the SMHI workshop ESMValGroup/Community#98 planned right after the v2.9 release, I'm bumping this to the next release.

@bouweandela bouweandela modified the milestones: v2.9.0, v2.10.0 May 31, 2023
@bouweandela
Copy link
Member Author

At the workshop it was agreed that a new configuration file format is acceptable and I'll write a proposal. Dropping this from the v2.10 milestone because it seems unlikely that we can agree on the new format before that time.

@bouweandela bouweandela removed this from the v2.10.0 milestone Sep 21, 2023
@schlunma
Copy link
Contributor

At the workshop it was agreed that a new configuration file format is acceptable and I'll write a proposal. Dropping this from the v2.10 milestone because it seems unlikely that we can agree on the new format before that time.

It's been 8 months since the workshop, and there was no progress regarding the new configuration. Is there any way we can merge this feature as it is proposed in this PR? I know that it would be much cleaner to refactor our configuration, but this just doesn't seem realistic at this point.

On the other hand, this issue has been open since 2018 (!), and many people would benefit a lot from this.

@bouweandela
Copy link
Member Author

Sorry for being slow with the writeup for the new configuration, I'll try to get it done next week. About merging this as-is, maybe that would be something to discuss at the next @ESMValGroup/technical-lead-development-team meeting?

@schlunma
Copy link
Contributor

Good idea, I just added it to the agenda.

Copy link
Contributor

@valeriupredoi valeriupredoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Judas has re-converted himself 😁

@valeriupredoi
Copy link
Contributor

don't blame me for holding this up for too long and now we have a bazillion conflicts 😁

@schlunma
Copy link
Contributor

@bouweandela I would really like to have this in v2.12. Would it be fine for you if I solve the merge conflicts and finalize this PR?

@bouweandela
Copy link
Member Author

Yes, that would be great!

@schlunma schlunma removed the request for review from remi-kazeroni May 27, 2024 14:39
Copy link
Contributor

@axel-lauer axel-lauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this for a number of different cases and it works fine but only as long as you do not remove the DRS entry completely. If you do so while not explicitly specifying a DRS format (e.g. when reusing lines from an old config file) the tool shows some unexpected behavior. For example obs4mips is then expected to follow the ESGF DRS instead of 'default' as it was the case for old configuration files.

Maybe this behavior could be changed so the DRS section can be removed safely?

@schlunma
Copy link
Contributor

Thanks for testing @axel-lauer! 🚀

The reason for this behavior is the default entry for drs, which says ESGF for the obs4mips data. If you don't specify drs in your own configuration file, this default is used. If you specify drs in your configuration file but omit obs4mips, then default instead of ESGF is used.

I agree that this behavior is confusing, but this was not introduced with this PR. With the current main branch you get the same result.

Copy link
Contributor

@schlunma schlunma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks everyone, looks great! 🚀

@schlunma schlunma merged commit d8b4d4d into main May 29, 2024
6 checks passed
@schlunma schlunma deleted the configure-drs-per-rootpath branch May 29, 2024 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Development

Successfully merging this pull request may close these issues.

Specify directory stucture per path instead of per project in config-user.yml
5 participants