Feature/coarse grain c384 diagnostic data #122

Merged Feb 4, 2020

Changes from all commits (76 commits)
fcd233b
use rundirs to construct train datasets instead of zarrs
AnnaKwa Jan 16, 2020
60701ca
remove temp download dirs
AnnaKwa Jan 16, 2020
3a2d4d3
fix temp dir naming
AnnaKwa Jan 16, 2020
91e35b6
fix temp dir naming
AnnaKwa Jan 16, 2020
fa345a1
fix bugs
AnnaKwa Jan 16, 2020
fda5d49
fix forecast time coords
AnnaKwa Jan 17, 2020
1d49b19
Merge branch 'master' into feature/rundir-to-train-pipeline
AnnaKwa Jan 21, 2020
f56e103
remove deprecated to_zarr files
AnnaKwa Jan 21, 2020
0f2ac64
move sample stacking to the dataflow job before saving to zarr
AnnaKwa Jan 21, 2020
8e8d08a
move sample stacking to the dataflow job before saving to zarr
AnnaKwa Jan 21, 2020
3934060
Start xy grid coords at 1
AnnaKwa Jan 21, 2020
7e47aa6
Drop null samples before writing
AnnaKwa Jan 21, 2020
890bd0a
Allow open_cubed_sphere to load remote data
AnnaKwa Jan 21, 2020
26a1712
merge remote
AnnaKwa Jan 21, 2020
0128a24
roll back changes to cubesphere.io
AnnaKwa Jan 21, 2020
646cd5a
fix drop na
AnnaKwa Jan 21, 2020
9adaa6d
More fixes to drop na
AnnaKwa Jan 21, 2020
1484fca
Fix forecast time dim assignment
AnnaKwa Jan 22, 2020
38bae7a
rechunk to uniform sizes before write
AnnaKwa Jan 22, 2020
44306bd
sort to train/test subdirs at write step
AnnaKwa Jan 22, 2020
1ca83e6
linting
AnnaKwa Jan 22, 2020
dc853dc
Fix failing test (keep dim that allows test fixture to drop forecast_…
AnnaKwa Jan 22, 2020
8c587f8
Remove unused docstring arg
AnnaKwa Jan 22, 2020
754cb12
address PR comments
AnnaKwa Jan 23, 2020
a9aac18
remove accidental changes to diagnostics files
AnnaKwa Jan 23, 2020
c873d3d
fix missing import of _set_forecast_time_coord
AnnaKwa Jan 23, 2020
ef0cd69
linting
AnnaKwa Jan 23, 2020
6646442
import time dim constants in vcm.calc.calc
AnnaKwa Jan 24, 2020
b66ab90
removed time coordinates and arguments from open_restarts
brianhenn Jan 24, 2020
1ed31c5
fixed tests
brianhenn Jan 24, 2020
bccca45
add forecast time using parsed filename
AnnaKwa Jan 24, 2020
92ec01b
replaced time with file prefix dim/coords in open_restarts
brianhenn Jan 24, 2020
77bdcde
added func and open_restarts argument to add time coordinates
brianhenn Jan 25, 2020
b32cd58
Add intake catalog entries for the remote data
nbren12 Jan 27, 2020
c11c0d5
Add script to coarse-grain the data
nbren12 Jan 27, 2020
df62c81
added tests for open_restarts date inference
brianhenn Jan 27, 2020
300c48a
fixed conflict and updated branch to master
brianhenn Jan 27, 2020
8a62171
Merge branch 'master' into feature/rundir-to-train-pipeline
AnnaKwa Jan 27, 2020
1efd400
Merge branch 'master' into feature/coarse-grain-c384
AnnaKwa Jan 27, 2020
5d4a421
fixed tests and linting
brianhenn Jan 27, 2020
6278476
fixing tests
brianhenn Jan 27, 2020
0b1eab5
fixing tests
brianhenn Jan 28, 2020
5fccf62
add step to merge variables from coarsened C384
AnnaKwa Jan 28, 2020
b74e34a
add step to merge variables from coarsened C384
AnnaKwa Jan 28, 2020
afd9e1b
Merge branch 'feature/rundir-to-train-pipeline' into feature/coarse-g…
AnnaKwa Jan 28, 2020
9ed9f22
Move some stuff to helpers
AnnaKwa Jan 28, 2020
e6776a9
add arg for diag path to scripts
AnnaKwa Jan 28, 2020
0aa1804
Merge branch 'feature/open_restarts_no_time_coords' into feature/coar…
AnnaKwa Jan 28, 2020
0e13f13
remove old open_restarts
AnnaKwa Jan 28, 2020
de2fa73
add retry and exception handling
AnnaKwa Jan 29, 2020
8dbb17e
log when step fails
AnnaKwa Jan 29, 2020
9a14922
linting
AnnaKwa Jan 29, 2020
2540467
remove retries; all steps are try and except
AnnaKwa Jan 29, 2020
6a53ef0
Fix error handling
AnnaKwa Jan 29, 2020
d359ea1
Shorten error messages
AnnaKwa Jan 29, 2020
0fe06fc
Move the filling of init/forecast time dims back to the data pipeline
AnnaKwa Jan 29, 2020
06562bc
Merge branch 'master' into feature/coarse-grain-c384
AnnaKwa Jan 30, 2020
51de990
linting
AnnaKwa Jan 30, 2020
d55cb20
Revert to master version of lint diffed files unrelated to this PR
AnnaKwa Jan 30, 2020
b732c72
Use already coarsened diag input
AnnaKwa Jan 31, 2020
e97dce6
Add write step to diag coarsening workflow
AnnaKwa Jan 31, 2020
ef29194
Fix coarse graining input diag file and chunking
AnnaKwa Jan 31, 2020
b66c7e0
Move the time rounding to coarsen pipeline
AnnaKwa Jan 31, 2020
7ec9cfd
linting
AnnaKwa Jan 31, 2020
5ad6dd6
linting
AnnaKwa Jan 31, 2020
a433d80
Remove references to gridding in pipeline (taken care of in coarsenin…
AnnaKwa Jan 31, 2020
e923b60
Address PR comments
AnnaKwa Feb 3, 2020
6823ed3
Fix not found dim error
AnnaKwa Feb 3, 2020
de41139
Fix key/list type error
AnnaKwa Feb 4, 2020
af95b38
linting
AnnaKwa Feb 4, 2020
6964155
Remove the reset_index from stacking to sample dim.
AnnaKwa Feb 4, 2020
dbf9ccd
Put reset_index back in- multindex cannot be serialized to netcdf whe…
AnnaKwa Feb 4, 2020
efb0651
Merge branch 'master' into feature/coarse-grain-c384
AnnaKwa Feb 4, 2020
9a9f0b0
Address PR comments
AnnaKwa Feb 4, 2020
b1d156d
change fv3config hash to match master
AnnaKwa Feb 4, 2020
1c21024
Merge branch 'master' into feature/coarse-grain-c384
AnnaKwa Feb 4, 2020
19 changes: 14 additions & 5 deletions catalog.yml
@@ -29,7 +29,7 @@ sources:
project: 'vcm-ml'
access: read_only
urlpath: 'gcs://vcm-ml-data/2019-07-17-GFDL_FV3_DYAMOND_0.25deg_15minute/3d.zarr'

marat_sam_tend:
description: Initial data sample provided by Lucas Harris. 1 day of 3D fields sampled every 3 hours.
driver: netcdf
@@ -43,7 +43,7 @@
description: 2D and 3D data sampled every 15 minutes from a 2-day 3-km simulation, regridded to a roughly 0.25 degree x 0.25 degree regular lat-lon grid.
driver: zarr
metadata:
data_transforms:
- _rename_SHiELD_varnames_to_orig
args:
storage_options:
@@ -61,18 +61,27 @@
urlpath: 'gcs://vcm-ml-data/2019-09-24_GFDL-SHiELD-15-minute-2-days_regrid_1degree.zarr'

2019-10-09-SAM-SOCRATES_tend_9216x4608x74_7.5s_4km_nudge24h:
description: 5 day forecast for SOCRATES with 24 hour nudging to ERA5, with bug fixes for tendency calculations, from Marat Khairoutdinov.
driver: zarr
args:
storage_options:
project: 'vcm-ml'
access: read_only
urlpath: 'gcs://vcm-ml-data/2019-10-09-SAM-SOCRATES_tend_9216x4608x74_7.5s_4km_nudge24h.zarr'


40day_c384_diags_time_avg:
description: Time-averaged diagnostics for 40-day nudged simulation at C384 resolution
driver: zarr
args:
storage_options:
project: 'vcm-ml'
access: read_only
urlpath: 'gs://vcm-ml-data/2019-12-05-40-day-X-SHiELD-simulation-C384-diagnostics/gfsphysics_15min_coarse.zarr/'

## Local Data Intake ##
# TODO: Could this be replicated with intake caching? Or switch to an ignored file?
local_2019-09-24-GFDL-SHiELD-15-minute-2-days_regrid_1degree:
description: Local version of 2D and 3D SHieLD data at 1 degree resolution
driver: zarr
args:
urlpath: '{{ CATALOG_DIR }}/data/interim/2019-09-24_GFDL-SHiELD-15-minute-2-days_regrid_1degree.zarr'
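
As a quick sanity check of the new 40day_c384_diags_time_avg entry, the coarsened diagnostics can be opened through the catalog with intake. A minimal sketch, assuming the intake, intake-xarray, and gcsfs packages are installed and the vcm-ml-data bucket is readable:

    import intake

    # Open the project catalog and load the new entry lazily as an
    # xarray Dataset backed by the remote zarr store.
    cat = intake.open_catalog("catalog.yml")
    ds = cat["40day_c384_diags_time_avg"].to_dask()
    print(ds)
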
1 change: 1 addition & 0 deletions external/vcm/vcm/cubedsphere/constants.py
@@ -32,3 +32,4 @@
GRID_VARS = [VAR_LAT_CENTER, VAR_LAT_OUTER, VAR_LON_CENTER, VAR_LON_OUTER, "area"]
INIT_TIME_DIM = "initialization_time"
FORECAST_TIME_DIM = "forecast_time"
TILE_COORDS = range(6)
8 changes: 8 additions & 0 deletions fv3net/pipelines/create_training_data/__main__.py
@@ -10,6 +10,14 @@
help="Location of input data in Google Cloud Storage bucket. "
"Don't include bucket in path.",
)
parser.add_argument(
"--diag-c48-path",
type=str,
required=False,
help="Location of C48 (coarsened from C384) high res diagnostic zarr for "
"features (SHF, LHF, etc.) that are not saved in restarts. If not provided, "
"features from diagnostics will not be in the final training data set.",
)
parser.add_argument(
"--gcs-output-data-dir",
type=str,
...
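
For reference, a hypothetical invocation of the pipeline with the new flag (the bucket paths are placeholders, the module path is assumed from the file layout, and the other required arguments are elided):

    import subprocess

    # Run the training-data pipeline, pointing --diag-c48-path at the
    # coarsened high res diagnostics zarr so features like SHF and LHF
    # get merged into the training data.
    subprocess.run(
        [
            "python", "-m", "fv3net.pipelines.create_training_data",
            "--diag-c48-path", "gs://bucket/path/to/diag_c48.zarr",
            "--gcs-output-data-dir", "gs://bucket/path/to/output",
        ],
        check=True,
    )
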
100 changes: 100 additions & 0 deletions fv3net/pipelines/create_training_data/helpers.py
@@ -0,0 +1,100 @@
from datetime import timedelta
import fsspec
import logging
import os
import xarray as xr

from vcm.fv3_restarts import _split_url
from vcm.cubedsphere.constants import (
INIT_TIME_DIM,
FORECAST_TIME_DIM,
TIME_FMT,
TILE_COORDS,
)


logger = logging.getLogger()
logger.setLevel(logging.INFO)


def _round_time(t):
""" The high res data timestamps are often +/- a few 1e-2 seconds off the
initialization times of the restarts, which makes it difficult to merge on
time. This rounds time to the nearest second, assuming the init time is at most
1 sec away from a round minute.
Args:
t: datetime or cftime object
Returns:
datetime or cftime object rounded to nearest minute
"""
if t.second == 0:
return t.replace(microsecond=0)
elif t.second == 59:
return t.replace(microsecond=0) + timedelta(seconds=1)
else:
        raise ValueError(
            "Time value > 1 second from 1 minute timesteps for "
            f"C48 initialization time {t}. Are you sure you're joining "
            "the correct high res data?"
        )
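

# Example (sketch, not part of the original module): _round_time snaps a
# timestamp that is within one second of a round minute onto that minute:
#     >>> from datetime import datetime
#     >>> _round_time(datetime(2016, 8, 1, 0, 15, 0, 26000))
#     datetime.datetime(2016, 8, 1, 0, 15)
#     >>> _round_time(datetime(2016, 8, 1, 0, 14, 59, 974000))
#     datetime.datetime(2016, 8, 1, 0, 15)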


def _path_from_first_timestep(ds, train_test_labels=None):
""" Uses first init time as zarr filename, and appends a 'train'/'test' subdir
if a dict of labels is provided
Args:
ds: input dataset
train_test_labels: optional dict with keys ["test", "train"] and values lists of
timestep strings that go to each set
Returns:
        relative path under gcs_output_dir to write the zarr to
"""
timestep = min(ds[INIT_TIME_DIM].values).strftime(TIME_FMT)
    if isinstance(train_test_labels, dict):
        try:
            if timestep in train_test_labels["train"]:
                train_test_subdir = "train"
            elif timestep in train_test_labels["test"]:
                train_test_subdir = "test"
            else:
                # timestep is in neither list; write directly to the output dir
                train_test_subdir = ""
        except KeyError:
            logger.warning(
                "train_test_labels dict does not have keys ['train', 'test']. "
                "Will write zarrs directly to gcs_output_dir."
            )
            train_test_subdir = ""
    else:
        logger.info(
            "No train_test_labels dict provided. "
            "Will write zarrs directly to gcs_output_dir."
        )
        train_test_subdir = ""
return os.path.join(train_test_subdir, timestep + ".zarr")
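

# Example (sketch, not part of the original module): assuming TIME_FMT renders
# timestamps as YYYYMMDD.HHMMSS (hypothetical here), a dataset whose first init
# time is 2016-08-01 00:15:00 and whose timestep string appears in
# train_test_labels["train"] would map to "train/20160801.001500.zarr".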


def _set_relative_forecast_time_coord(ds):
delta_t_forecast = (
ds[FORECAST_TIME_DIM].values[-1] - ds[FORECAST_TIME_DIM].values[-2]
)
    ds = ds.reset_index([FORECAST_TIME_DIM], drop=True)
return ds.assign_coords(
{FORECAST_TIME_DIM: [timedelta(seconds=0), delta_t_forecast]}
)
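

# Example (sketch, not part of the original module): the function assumes the
# forecast_time dimension has length 2; for values [t0, t0 + 60s] the
# coordinate becomes [timedelta(0), timedelta(seconds=60)], i.e. forecast time
# relative to initialization.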


[Contributor comment] Note: since pipeline.py was getting long as additional functions were added to merge in the coarsened high res features, I moved the helper functions that are not called directly by run() into helpers.py. The functions above this line were already reviewed in previous PRs, so the ones that need review in this PR are those below this line, related to the high res data.

def load_diag(diag_data_path, init_times):
protocol, path = _split_url(diag_data_path)
fs = fsspec.filesystem(protocol)
ds_diag = xr.open_zarr(fs.get_mapper(diag_data_path), consolidated=True).rename(
{"time": INIT_TIME_DIM}
)
ds_diag = ds_diag.assign_coords(
{
INIT_TIME_DIM: [_round_time(t) for t in ds_diag[INIT_TIME_DIM].values],
"tile": TILE_COORDS,
}
)
return ds_diag.sel({INIT_TIME_DIM: init_times})
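
A minimal sketch of how load_diag is meant to be called, using the catalog path added above (it assumes read access to the vcm-ml-data bucket; the init time is a placeholder):

    import cftime

    init_times = [cftime.DatetimeJulian(2016, 8, 1, 0, 15, 0)]
    ds_diag = load_diag(
        "gs://vcm-ml-data/2019-12-05-40-day-X-SHiELD-simulation-C384-diagnostics"
        "/gfsphysics_15min_coarse.zarr",
        init_times,
    )
    # The returned dataset has its time dim renamed to initialization_time and
    # rounded to the nearest minute, gets a tile coordinate of 0-5, and is
    # subset to the requested init times.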