Allow wildcard searches when specifying fx variables in preprocessor #1082

Closed
wants to merge 45 commits into from
Changes from 36 commits
Commits
45 commits
d644526
Search r0i0p0 if fx data not found under original ensemble
thomascrocker Apr 26, 2021
da8cf7e
typo. Moved second search under if statement
thomascrocker Apr 26, 2021
574f9f8
fixed line length complaint
thomascrocker Apr 26, 2021
4378116
fix by using globbing
thomascrocker May 5, 2021
54a5105
Merge branch 'master' into fix_fx_ensembles
valeriupredoi May 10, 2021
0c539c3
fix line too long
valeriupredoi May 10, 2021
5f5bb10
modded tests
valeriupredoi May 10, 2021
a5a933b
Added docs to additional places where FX variables are used
thomascrocker May 11, 2021
d471de2
Merge branch 'master' into fix_fx_ensembles
thomascrocker May 14, 2021
c17f85b
downgrade logging message to debug
thomascrocker May 18, 2021
7df08ee
addressing review comments and maintaining backwards compatibility
thomascrocker May 18, 2021
87b5f48
reverting tests to original
thomascrocker May 18, 2021
b53a991
refactored to resolve wildcards in separate function
thomascrocker May 19, 2021
3611451
CMIP6 wildcard FX test
thomascrocker May 19, 2021
a87d290
added CORDEX wildcard fx test
thomascrocker May 19, 2021
3f1b881
minor tweak to docs
thomascrocker May 19, 2021
8d3bf2d
flake8 fixes
thomascrocker May 19, 2021
443208f
Update _data_finder.py
thomascrocker May 19, 2021
62d091a
codacy fix
thomascrocker May 19, 2021
e2c76b1
minor changes to address review comments
thomascrocker May 20, 2021
ad9b702
additional tests
thomascrocker May 20, 2021
6909061
docs update
thomascrocker May 20, 2021
c88898b
Merge branch 'fix_fx_ensembles' of https://github.com/ESMValGroup/ESM…
thomascrocker May 20, 2021
a7bccf8
pylint fix
thomascrocker May 20, 2021
439ce43
Merge remote-tracking branch 'origin/master' into fix_fx_ensembles
schlunma May 20, 2021
3961cfc
Added further check on variable's ensemble during fx file retrieval i…
schlunma May 20, 2021
893aaaf
refactor to deal with more file path cases
thomascrocker May 21, 2021
c9cdbe4
further dir finder test
thomascrocker May 21, 2021
f0637f2
tweaks for codacy
thomascrocker May 21, 2021
fc61e06
Fixed globbing for fx files when latestversion tag is present
schlunma May 21, 2021
09543a0
Fixed output path for fx files when wildcards are used
schlunma May 21, 2021
92a6207
Fixed output path for fx files when wildcards are used (again)
schlunma May 21, 2021
3676f66
Fixed output path for fx files when wildcards are used (this time for…
schlunma May 21, 2021
3b1c0a2
Removed print() statement
schlunma May 21, 2021
845cc8d
Merge remote-tracking branch 'origin/master' into fix_fx_ensembles
schlunma May 21, 2021
b562249
Fixed tests
schlunma May 21, 2021
b50b094
Merge remote-tracking branch 'origin/main' into fix_fx_ensembles
schlunma Sep 9, 2021
c9b0f21
Merge branch 'main' into fix_fx_ensembles
valeriupredoi Nov 17, 2021
33cd03a
add jump over None dirs
valeriupredoi Nov 17, 2021
f1a3dcb
Merge remote-tracking branch 'origin/main' into fix_fx_ensembles
schlunma Feb 8, 2022
354f36b
Ran isor
schlunma Feb 8, 2022
5c455ed
Undo changes that are not relevant for this PR
schlunma Feb 8, 2022
dc171c6
add fixes
ledm Feb 10, 2022
3d544a1
added several fixes
ledm Feb 14, 2022
97f62eb
Merge remote-tracking branch 'origin/fix_fx_ensembles' into fix_fx_en…
ledm Feb 14, 2022
32 changes: 32 additions & 0 deletions doc/recipe/preprocessor.rst
@@ -374,6 +374,17 @@ or alternatively:
{'short_name': 'sftof', 'exp': 'piControl'}
]

Additionally, it is possible to search across all ensembles and experiments (or any other keys)
when specifying the fx variable, by using the ``*`` character, which is useful for some projects
where the location of the fx files is not consistent.
This makes it possible to search for fx files under multiple ensemble members or experiments.
For example: ``ensemble: '*'``. Note that the ``*`` character must be quoted since ``*`` is a
special character in YAML. This functionality is only supported for time invariant fx variables
(i.e. frequency ``fx``). Note also that if multiple folders of matching fx files are found,
Contributor

Suggested change
(i.e. frequency ``fx``). Note also that if multiple folders of matching fx files are found,
(i.e. frequency ``fx`` or ``Ofx`` or ``Efx``). Note also that if multiple folders of matching fx files are found,

ESMValTool will default to ensemble r0i0p0 if it exists and then first folder found only
if it does not.


See also :func:`esmvalcore.preprocessor.weighting_landsea_fraction`.


@@ -455,6 +466,17 @@ or alternatively:
{'short_name': 'sftof', 'exp': 'piControl', 'ensemble': 'r2i1p1f1'}
]

Additionally, it is possible to search across all ensembles and experiments (or any other keys)
when specifying the fx variable, by using the ``*`` character, which is useful for some projects
where the location of the fx files is not consistent.
This makes it possible to search for fx files under multiple ensemble members or experiments.
For example: ``ensemble: '*'``. Note that the ``*`` character must be quoted since ``*`` is a
special character in YAML. This functionality is only supported for time invariant fx variables
(i.e. frequency ``fx``). Note also that if multiple folders of matching fx files are found,
ESMValTool will default to ensemble r0i0p0 if it exists and then first folder found only
if it does not.


If the corresponding fx file is not found (which is
the case for some models and almost all observational datasets), the
preprocessor attempts to mask the data using Natural Earth mask files (that are
@@ -507,6 +529,16 @@ or alternatively:
mask_out: sea
fx_variables: [{'short_name': 'sftgif', 'exp': 'piControl'}]

Additionally, it is possible to search across all ensembles and experiments (or any other keys)
when specifying the fx variable, by using the ``*`` character, which is useful for some projects
where the location of the fx files is not consistent.
This makes it possible to search for fx files under multiple ensemble members or experiments.
For example: ``ensemble: '*'``. Note that the ``*`` character must be quoted since ``*`` is a
special character in YAML. This functionality is only supported for time invariant fx variables
(i.e. frequency ``fx``). Note also that if multiple folders of matching fx files are found,
ESMValTool will default to ensemble r0i0p0 if it exists and then first folder found only
if it does not.

Contributor

I think you can just write this paragraph once (at the top, or wherever you think it's best suited) and then just reference it via a Markdown reference, rather than writing it three times. Up to you.

See also :func:`esmvalcore.preprocessor.mask_landseaice`.

Glaciated masking
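
To illustrate the wildcard behaviour that the documentation change above describes, here is a minimal sketch using only the standard library. It is not ESMValCore's implementation: the `{dataset}/{exp}/{ensemble}/fx` layout and the helper names `find_fx_dirs` and `demo` are assumptions made for the example.

```python
import glob
import os
import tempfile


def find_fx_dirs(root, dataset, exp="*", ensemble="*"):
    """Return sorted directories matching a hypothetical DRS layout.

    A '*' value is passed straight through to glob, so it matches any
    single directory level (any experiment or any ensemble member).
    """
    pattern = os.path.join(root, dataset, exp, ensemble, "fx")
    return sorted(d for d in glob.glob(pattern) if os.path.isdir(d))


def demo():
    """Build a throwaway directory tree and search it with wildcards."""
    with tempfile.TemporaryDirectory() as root:
        for exp, ens in [("historical", "r1i1p1"), ("piControl", "r0i0p0")]:
            os.makedirs(os.path.join(root, "MODEL", exp, ens, "fx"))
        matches = find_fx_dirs(root, "MODEL")
        # Report paths relative to the temporary root for readability
        return [os.path.relpath(match, root) for match in matches]
```

Calling `demo()` returns both fx directories, which is the situation in which the documented r0i0p0 preference would apply.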
184 changes: 138 additions & 46 deletions esmvalcore/_data_finder.py
@@ -38,19 +38,19 @@ def get_start_end_year(filename):
start_year = end_year = None

# First check for a block of two potential dates separated by _ or -
daterange = re.findall(r'([0-9]{4,12}[-_][0-9]{4,12})', stem)
daterange = re.findall(r"([0-9]{4,12}[-_][0-9]{4,12})", stem)
if daterange:
start_date, end_date = re.findall(r'([0-9]{4,12})', daterange[0])
start_date, end_date = re.findall(r"([0-9]{4,12})", daterange[0])
start_year = start_date[:4]
end_year = end_date[:4]
else:
# Check for single dates in the filename
dates = re.findall(r'([0-9]{4,12})', stem)
dates = re.findall(r"([0-9]{4,12})", stem)
if len(dates) == 1:
start_year = end_year = dates[0][:4]
elif len(dates) > 1:
# Check for dates at start or end of filename
outerdates = re.findall(r'^[0-9]{4,12}|[0-9]{4,12}$', stem)
outerdates = re.findall(r"^[0-9]{4,12}|[0-9]{4,12}$", stem)
if len(outerdates) == 1:
start_year = end_year = outerdates[0][:4]

@@ -61,16 +61,16 @@ def get_start_end_year(filename):
for cube in cubes:
logger.debug(cube)
try:
time = cube.coord('time')
time = cube.coord("time")
except iris.exceptions.CoordinateNotFoundError:
continue
start_year = time.cell(0).point.year
end_year = time.cell(-1).point.year
break

if start_year is None or end_year is None:
raise ValueError(f'File {filename} dates do not match a recognized'
'pattern and time can not be read from the file')
raise ValueError(f"File {filename} dates do not match a recognized"
"pattern and time can not be read from the file")

logger.debug("Found start_year %s and end_year %s", start_year, end_year)
return int(start_year), int(end_year)
@@ -92,7 +92,7 @@ def select_files(filenames, start_year, end_year):
def _replace_tags(paths, variable):
"""Replace tags in the config-developer's file with actual values."""
if isinstance(paths, str):
paths = set((paths.strip('/'),))
paths = set((paths.strip('/'), ))
else:
paths = set(path.strip('/') for path in paths)
tlist = set()
@@ -101,10 +101,9 @@ def _replace_tags(paths, variable):
if 'sub_experiment' in variable:
new_paths = []
for path in paths:
new_paths.extend((
re.sub(r'(\b{ensemble}\b)', r'{sub_experiment}-\1', path),
re.sub(r'({ensemble})', r'{sub_experiment}-\1', path)
))
new_paths.extend(
(re.sub(r'(\b{ensemble}\b)', r'{sub_experiment}-\1', path),
re.sub(r'({ensemble})', r'{sub_experiment}-\1', path)))
tlist.add('sub_experiment')
paths = new_paths
logger.debug(tlist)
@@ -113,7 +112,7 @@ def _replace_tags(paths, variable):
original_tag = tag
tag, _, _ = _get_caps_options(tag)

if tag == 'latestversion': # handled separately later
if tag == "latestversion": # handled separately later
continue
if tag in variable:
replacewith = variable[tag]
@@ -140,10 +139,10 @@ def _replace_tag(paths, tag, replacewith):
def _get_caps_options(tag):
lower = False
upper = False
if tag.endswith('.lower'):
if tag.endswith(".lower"):
lower = True
tag = tag[0:-6]
elif tag.endswith('.upper'):
elif tag.endswith(".upper"):
upper = True
tag = tag[0:-6]
return tag, lower, upper
@@ -163,60 +162,114 @@ def _resolve_latestversion(dirname_template):
This implementation avoid globbing on centralized clusters with very
large data root dirs (i.e. ESGF nodes like Jasmin/DKRZ).
"""
if '{latestversion}' not in dirname_template:
if "{latestversion}" not in dirname_template:
return dirname_template

# Find latest version
part1, part2 = dirname_template.split('{latestversion}')
part1, part2 = dirname_template.split("{latestversion}")
part2 = part2.lstrip(os.sep)
if os.path.exists(part1):
versions = os.listdir(part1)
versions.sort(reverse=True)
for version in ['latest'] + versions:
for version in ["latest"] + versions:
dirname = os.path.join(part1, version, part2)
if os.path.isdir(dirname):
return dirname

return dirname_template


def _resolve_wildcards_and_version(dirname, basepath, project, drs):
"""Resolve wildcards and latestversion tag."""
if "{latestversion}" in dirname:
dirname_version_wildcard = dirname.replace("{latestversion}", "*")

# Find all directories that match the template
all_dirs = sorted(glob.glob(dirname_version_wildcard))
Contributor

I think you want reverse=True here since latest should be first in line


# Sort directories by version
all_dirs_dict = {}
for directory in all_dirs:
version = dir_to_var(
directory, basepath, project, drs)['latestversion']
Contributor

If you find latest, exit the loop and return quickly to save time (hence latest should be first in the sorted list of dirs).

all_dirs_dict.setdefault(version, [])
all_dirs_dict[version].append(directory)

# Select latest version
if not all_dirs_dict:
dirnames = []
elif 'latest' in all_dirs_dict:
dirnames = all_dirs_dict['latest']
else:
all_versions = sorted(list(all_dirs_dict))
dirnames = all_dirs_dict[all_versions[-1]]

# No {latestversion} tag
else:
dirnames = sorted(glob.glob(dirname))

# No directories found
if not dirnames:
logger.debug("Unable to resolve %s", dirname)
return dirname

# Exactly one directory found
if len(dirnames) == 1:
return dirnames[0]

# Warn if multiple directories have been found and prioritize r0i0p0
logger.warning("Multiple directories for fx variables found: %s", dirnames)
r0i0p0_matches = [d for d in dirnames if "r0i0p0" in d]
if r0i0p0_matches:
return r0i0p0_matches[0]
return dirnames[0]
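
The selection rules in `_resolve_wildcards_and_version` above can be restated in isolation. This is a simplified sketch, not the PR's code: `select_latest` and `prefer_r0i0p0` are hypothetical helper names, and version labels are compared as plain strings, which works for date-style labels such as `v20200101` but would rank `v9` above `v10`.

```python
def select_latest(dirs_by_version):
    """Pick the directories of the newest version.

    `dirs_by_version` maps a version label to a list of directories.
    A literal 'latest' label wins; otherwise the label that sorts
    highest as a string is used.
    """
    if not dirs_by_version:
        return []
    if "latest" in dirs_by_version:
        return dirs_by_version["latest"]
    return dirs_by_version[max(dirs_by_version)]


def prefer_r0i0p0(dirnames):
    """Return one directory, preferring a path that contains 'r0i0p0'."""
    if not dirnames:
        return None
    for dirname in dirnames:
        if "r0i0p0" in dirname:
            return dirname
    # Fall back to the first directory in the given order
    return dirnames[0]
```

Because the version is looked up by key in a dictionary, the order in which the globbed directories were sorted does not change which version is selected; it only affects the order of directories within a version, and therefore the fallback when no r0i0p0 match exists.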


def _select_drs(input_type, drs, project):
"""Select the directory structure of input path."""
cfg = get_project_config(project)
input_path = cfg[input_type]
if isinstance(input_path, str):
return input_path

structure = drs.get(project, 'default')
structure = drs.get(project, "default")
if structure in input_path:
return input_path[structure]

raise KeyError(
'drs {} for {} project not specified in config-developer file'.format(
"drs {} for {} project not specified in config-developer file".format(
structure, project))


def get_rootpath(rootpath, project):
"""Select the rootpath."""
if project in rootpath:
return rootpath[project]
if 'default' in rootpath:
return rootpath['default']
raise KeyError('default rootpath must be specified in config-user file')
if "default" in rootpath:
return rootpath["default"]
raise KeyError("default rootpath must be specified in config-user file")


def _find_input_dirs(variable, rootpath, drs):
"""Return a the full paths to input directories."""
project = variable['project']
project = variable["project"]

root = get_rootpath(rootpath, project)
path_template = _select_drs('input_dir', drs, project)
path_template = _select_drs("input_dir", drs, project)

dirnames = []
for dirname_template in _replace_tags(path_template, variable):
for base_path in root:
dirname = os.path.join(base_path, dirname_template)
dirname = _resolve_latestversion(dirname)
if variable["frequency"] == "fx" and "*" in dirname:
Contributor

do you also want Ofx, Efx? I think those are all time-independent too

Contributor Author

I think Ofx and Efx are the MIP table names; the frequency property should still just be fx.
For data from, for example, the Amon MIP table for CMIP, the frequency property of the data is still just mon, not Amon. So I think frequency==fx covers everything.

dirname = _resolve_wildcards_and_version(dirname, base_path,
project, drs)
var_from_dir = dir_to_var(dirname, base_path, project, drs)
for (key, val) in variable.items():
if val == '*':
variable[key] = var_from_dir.get(key, '*')
else:
dirname = _resolve_latestversion(dirname)
matches = glob.glob(dirname)
valeriupredoi marked this conversation as resolved.
matches = [match for match in matches if os.path.isdir(match)]
if matches:
@@ -231,65 +284,104 @@ def _find_input_dirs(variable, rootpath, drs):

def _get_filenames_glob(variable, drs):
"""Return patterns that can be used to look for input files."""
path_template = _select_drs('input_file', drs, variable['project'])
path_template = _select_drs("input_file", drs, variable["project"])
filenames_glob = _replace_tags(path_template, variable)
return filenames_glob


def _find_input_files(variable, rootpath, drs):
short_name = variable['short_name']
variable['short_name'] = variable['original_short_name']
short_name = variable["short_name"]
variable["short_name"] = variable["original_short_name"]
input_dirs = _find_input_dirs(variable, rootpath, drs)
filenames_glob = _get_filenames_glob(variable, drs)
files = find_files(input_dirs, filenames_glob)
variable['short_name'] = short_name
variable["short_name"] = short_name
return (files, input_dirs, filenames_glob)


def get_input_filelist(variable, rootpath, drs):
"""Return the full path to input files."""
# change ensemble to fixed r0i0p0 for fx variables
# this is needed and is not a duplicate effort
if variable['project'] == 'CMIP5' and variable['frequency'] == 'fx':
if all([
variable['project'] == 'CMIP5', variable['frequency'] == 'fx',
variable.get('ensemble') != '*'
]):
Contributor

same here - how about Ofx, Efx?

Contributor Author

see above

variable['ensemble'] = 'r0i0p0'
(files, dirnames, filenames) = _find_input_files(variable, rootpath, drs)

# do time gating only for non-fx variables
if variable['frequency'] != 'fx':
files = select_files(files, variable['start_year'],
variable['end_year'])
if variable["frequency"] != "fx":
files = select_files(files, variable["start_year"],
variable["end_year"])
return (files, dirnames, filenames)
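
The ensemble override at the top of `get_input_filelist` above can be read as a small decision rule. The sketch below restates it on its own; `effective_ensemble` is a hypothetical helper introduced for illustration, not a function in ESMValCore.

```python
def effective_ensemble(variable):
    """Return the ensemble that would be used to search for input files.

    CMIP5 fx variables are forced to 'r0i0p0' unless the user asked for
    a wildcard search with ensemble '*'. Everything else is unchanged.
    """
    is_cmip5_fx = (variable["project"] == "CMIP5"
                   and variable["frequency"] == "fx")
    if is_cmip5_fx and variable.get("ensemble") != "*":
        return "r0i0p0"
    return variable.get("ensemble")
```

With this rule, an explicit `ensemble: '*'` is the only way for a CMIP5 fx variable to be searched outside r0i0p0, which matches the backwards-compatibility goal mentioned in the commit messages.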


def get_output_file(variable, preproc_dir):
"""Return the full path to the output (preprocessed) file."""
cfg = get_project_config(variable['project'])
cfg = get_project_config(variable["project"])

# Join different experiment names
if isinstance(variable.get('exp'), (list, tuple)):
if isinstance(variable.get("exp"), (list, tuple)):
variable = dict(variable)
variable['exp'] = '-'.join(variable['exp'])
variable["exp"] = "-".join(variable["exp"])

outfile = os.path.join(
preproc_dir,
variable['diagnostic'],
variable['variable_group'],
_replace_tags(cfg['output_file'], variable)[0],
variable["diagnostic"],
variable["variable_group"],
_replace_tags(cfg["output_file"], variable)[0],
)
if variable['frequency'] != 'fx':
outfile += '_{start_year}-{end_year}'.format(**variable)
outfile += '.nc'
if variable["frequency"] != "fx":
outfile += "_{start_year}-{end_year}".format(**variable)
outfile += ".nc"
return outfile


def get_statistic_output_file(variable, preproc_dir):
"""Get multi model statistic filename depending on settings."""
template = os.path.join(
preproc_dir,
'{diagnostic}',
'{variable_group}',
'{dataset}_{mip}_{short_name}_{start_year}-{end_year}.nc',
"{diagnostic}",
"{variable_group}",
"{dataset}_{mip}_{short_name}_{start_year}-{end_year}.nc",
)

outfile = template.format(**variable)

return outfile


def dir_to_var(dirname, basepath, project, drs):
Contributor

this is a public function, so you either add numpy-style API documentation for it, or you make it private (even if private I'd reckon it needs a bit more description - lots of operations it performs and not well documented)

"""Convert directory path to variable :obj:`dict`."""
if dirname != os.sep:
dirname = dirname.rstrip(os.sep)
if basepath != os.sep:
basepath = basepath.rstrip(os.sep)
path_template = _select_drs("input_dir", drs, project).rstrip(os.sep)
rel_dir = os.path.relpath(dirname, basepath)
keys = path_template.split(os.sep)
vals = rel_dir.split(os.sep)
if len(keys) != len(vals):
raise ValueError(
f"Cannot extract tags '{path_template}' from directory "
f"'{rel_dir}' (root: '{basepath}') with different numbers of "
f"elements")
variable = {}
for (idx, full_key) in enumerate(keys):
matches = re.findall(r'.*\{(.*)\}.*', full_key)
if len(matches) != 1:
continue
key = matches[0]
regex = rf"{full_key.replace(key, '(.*)')}"
regex = regex.replace('{', '').replace('}', '')
matches = re.findall(regex, vals[idx])
while '' in matches:
matches.remove('')
if len(matches) != 1:
raise ValueError(
f"Regex pattern '{regex}' for '{full_key}' cannot be "
f"(uniquely) matched to element '{vals[idx]}' in directory "
f"'{dirname}'")
Contributor

this is a brilliantly opaque error message, can you pls make it more user-friendly? Imagine you're a first time user and this pops up 😆

variable[key] = matches[0]
return variable
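
As a rough illustration of what `dir_to_var` does, and of the kind of error message the review comments ask for, here is a simplified standalone sketch. It is not the PR's implementation: `part_to_regex`, `path_to_facets` and `fill_wildcards` are hypothetical names, the template is passed in directly instead of being read from the developer configuration, and tags with a case suffix such as `.upper` are not handled.

```python
import re


def part_to_regex(part):
    """Turn one template element, e.g. 'v{version}', into a regex."""
    # re.split with a capturing group alternates literal text and tag names
    tokens = re.split(r"\{(\w+)\}", part)
    pieces = []
    for index, token in enumerate(tokens):
        if index % 2:
            pieces.append(f"(?P<{token}>.+)")  # tag name -> named group
        else:
            pieces.append(re.escape(token))    # literal text
    return "".join(pieces)


def path_to_facets(template, rel_dir, sep="/"):
    """Extract tag values from a directory path relative to the root."""
    keys = template.strip(sep).split(sep)
    vals = rel_dir.strip(sep).split(sep)
    if len(keys) != len(vals):
        raise ValueError(
            f"Directory '{rel_dir}' has {len(vals)} levels, but the "
            f"input_dir template '{template}' expects {len(keys)}.")
    facets = {}
    for key, val in zip(keys, vals):
        match = re.fullmatch(part_to_regex(key), val)
        if match is None:
            raise ValueError(
                f"Directory level '{val}' in '{rel_dir}' does not fit "
                f"the template element '{key}'.")
        facets.update(match.groupdict())
    return facets


def fill_wildcards(variable, facets):
    """Replace '*' values with what was found on disk, where known."""
    return {key: facets.get(key, val) if val == "*" else val
            for key, val in variable.items()}
```

A short usage example: `path_to_facets("{dataset}/{exp}/{ensemble}", "MODEL/piControl/r0i0p0")` yields the three tag values, and `fill_wildcards` then substitutes them for any `'*'` entries in the variable dictionary, leaving unresolved wildcards untouched.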