+Task descriptions
+Though derived from the standard Regional Nested Suite (see e.g. Regional Practical 2023), the AUS2200 suite contains a number of changes designed to optimise the suite workflow. The model runs, whether continuous or restarted, are divided into 6-hour cycles, with the task graph for each cycle represented in the following image.
+
+
+Cold start tasks
+The tasks install_cold_idl
and install_cold_hpc
symlink the appropriate STASH master, ancil configurations and prebuilt software into the suite share directory. In this version of the suite, the install_ec_startdata
task does nothing as the initial conditions are derived from data already present on disk.
+
+
+fcm_make_um
+This task builds the Unified Model in the rose suite share directory. It is recommended to run this task only once, and to have every other variant of the AUS2200 suite use the existing initial build. This can be configured in the ‘General run options’ section of the rose config editor.
+
+
+
+get_lbc_from_archive
+This task attempts to retrieve the spatial boundary conditions for the upcoming model run from the directory specified by the following set of jinja2 variables:
+ARCHIVE_DIR = {{BC_ARCHIVE_DIR}}/{{mod["lbc_freq"][0]}}/{{NLBC_TIMES}}
+
+
+Where
+
+{{BC_ARCHIVE_DIR}}
is a top-level directory specified in rose-suite.conf
+{{mod["lbc_freq"][0]}}
is the interval, in seconds, at which the boundary is synchronised with the top-level domain run
+{{NLBC_TIMES}}
is the number of times the boundary conditions will be synchronised within a model run.
+
+This ensures that, should the run length or boundary synchronisation period be modified at any time, incorrect boundary conditions will not be re-used.
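The directory layout above can be sketched in a few lines of Python. This is only an illustration of how the three jinja2 variables combine. The function name, the example top-level path and the numeric values are hypothetical and are not taken from the suite.

```python
from pathlib import PurePosixPath


def lbc_archive_dir(bc_archive_dir: str, lbc_freq_seconds: int, nlbc_times: int) -> PurePosixPath:
    """Mirror ARCHIVE_DIR = BC_ARCHIVE_DIR / lbc_freq / NLBC_TIMES."""
    return PurePosixPath(bc_archive_dir) / str(lbc_freq_seconds) / str(nlbc_times)


# Hypothetical values: hourly synchronisation with seven boundary times.
hourly = lbc_archive_dir("/path/to/bc_archive", 3600, 7)
# Changing the synchronisation period (and hence the number of boundary
# times) selects a different directory, so stale files are never picked up.
half_hourly = lbc_archive_dir("/path/to/bc_archive", 1800, 13)

print(hourly)
print(half_hourly)
```

Because both the synchronisation interval and the number of boundary times are part of the path, two runs only share archived boundary conditions when both settings match.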
+This task is used to determine whether to generate boundary conditions at all. If the boundary conditions are successfully retrieved from the archive directory, the tasks associated with the driving model required to create the boundary conditions will not run. The exception to this is in the initial cycle point, where a new input dump also needs to be created from the driving model.
+
+
Note
+
This task will fail if the boundary conditions do not exist. This is intentional. cylc7 suites do not support multiple paths through task graphs except in response to task failure.
+
+Should the boundary conditions not be present, this task will be marked as failed and the driving model tasks will run. Once the driving model tasks are completed, the failed get_lbc_from_archive
task is deleted and the workflow can proceed. This task is not run if Archive LBCS
is set to false in the ‘Driving model setup’ section of the rose suite configuration ({{ARCHIVE_LBCS}}
in rose-suite.conf
)
+
+
+nci_era5grib
+This task converts ERA5 and ERA5Land data to GRIB format in order to be used as input into the driving model reconfiguration tasks. This task will run on the initial cycle point of a suite, and when archived boundary conditions have not been found in any other cycle point. By default, this task saves the converted grib files to a central working directory outside of the cylc-run directory. This acts as an archiving task, and the converted grib files will not be re-created if they already exist. This directory is given by {{dm_ec_ic_lbc_dir}}
in both rose-suite.conf
and the rose suite configuration.
+The runtime for this task is variable. If all GRIB files are already present, it completes in a few seconds. If none of the GRIB files are present, it can take anywhere from 10 minutes to almost an hour. The cause of this variability is not known.
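The archiving behaviour described above, where files that already exist are not converted again, follows a common "convert only what is missing" pattern. The sketch below illustrates that pattern only. The file names, `convert_missing` and `fake_convert` are hypothetical stand-ins and do not reflect how nci_era5grib is actually implemented.

```python
import tempfile
from pathlib import Path


def convert_missing(targets, convert):
    """Call convert(path) only for targets that do not exist yet."""
    created = []
    for path in targets:
        if path.exists():
            continue  # already archived, nothing to do
        convert(path)
        created.append(path)
    return created


def fake_convert(path):
    # Stand-in for the real ERA5-to-GRIB conversion (hypothetical).
    path.write_text("grib placeholder")


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp)
    # Hypothetical file names, one per boundary time.
    wanted = [archive / f"era5_{hour:02d}.grib" for hour in range(7)]
    n_first = len(convert_missing(wanted, fake_convert))   # all files created
    n_second = len(convert_missing(wanted, fake_convert))  # nothing left to do

print(n_first, n_second)
```

The second pass does no work, which corresponds to the case where the task completes in a few seconds.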
+The configuration settings for this task and for get_lbc_from_archive
can be found in the ‘Driving model setup’ section of the rose config editor
+
+
+
+ec_recon_nnn
+This is the primary driving model reconfiguration task. It creates output files on two domains, the initial AUS2200 domain, and a slightly larger domain of the same resolution that acts as the spatial boundary conditions as the model progresses. One of these tasks is run for each point in the model at which the boundary conditions are synchronised, including the initial and final timesteps. In the default configuration, this means seven tasks per 6 hour cycle. These tasks are not run if the get_lbc_from_archive
task in the same cycle completed successfully. The exception to this is in the initial cycle, where ec_recon_000
is run in order to create the initial dump file for the model run.
+By default these tasks are configured to run on 2 Sapphire Rapids nodes (normalsr
queue, 208 cpus) each and complete in approximately 38 minutes. The tasks are independent, and are configured to run in parallel. This block of tasks is the second most expensive task set of the workflow, at a cost of about 1.8kSU/cycle. The output is saved to {{CYLC_SUITE_SHARE_DIR}}/cycle/{{CYLC_TASK_CYCLE_POINT}}/ec/um
, and can be cleared after the forecast task has been completed.
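The quoted cost can be roughly cross-checked from the figures in this section. The sketch below assumes a charge rate of 2 SU per core-hour on the normalsr queue. That rate is not stated in this document and is an assumption, chosen because it reproduces the quoted figure. The function name is hypothetical.

```python
def service_units(cores, minutes, tasks=1, su_per_core_hour=2.0):
    """Estimate SU cost as cores x hours x tasks x charge rate (rate is assumed)."""
    return cores * (minutes / 60.0) * tasks * su_per_core_hour


# Seven ec_recon tasks, each on 208 cores for about 38 minutes.
ec_recon_cost = service_units(cores=208, minutes=38, tasks=7)
print(f"{ec_recon_cost:.0f} SU per cycle")
```

Under that assumed charge rate, the estimate comes to roughly 1.8 kSU per cycle.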
+
+
+aus2200_d0198_RA3_um_createbc_000
+This task takes the driving model reconfiguration output and creates the boundary ‘frames’ that are read by the model every hour to synchronise along the spatial boundaries. These frames are fairly small and data-derived, and are therefore ideal for re-use in the ‘restart’ run mode where spin-up times and model runtimes overlap. This task is not run if the get_lbc_from_archive
task in the same cycle was successful. The task takes approximately 5 minutes to complete on a single core.
+
+
+put_lbc_to_archive
+This task archives the boundary conditions created by aus2200_d0198_RA3_um_createbc_000
to the directory determined above in the get_lbc_from_archive
task. Note that there is a race condition between the test for archived boundary conditions in get_lbc_from_archive and the point at which this task writes them. When running overlapping restart suites, avoid running the same cycle point in two different suites at the same time so that this race cannot occur. This task is not run if Archive LBCS
is set to false in the ‘Driving model setup’ section of the rose suite configuration ({{ARCHIVE_LBCS}}
in rose-suite.conf
)
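The rules in the preceding sections for which driving model tasks run in a given cycle can be summarised as a small decision function. This is a sketch of the logic as described in the text, not the suite's actual cylc graph. The function name is hypothetical. Numbering the ec_recon tasks 000 to 006 and running put_lbc_to_archive only when new boundary conditions were created are both assumptions.

```python
def driving_model_tasks(initial_cycle, lbc_in_archive, archive_lbcs=True, n_lbc_times=7):
    """List the driving-model related tasks expected to run in one cycle."""
    tasks = []
    if archive_lbcs:
        tasks.append("get_lbc_from_archive")
    # New boundary conditions are needed unless they were found in the archive.
    need_lbcs = not (archive_lbcs and lbc_in_archive)
    if initial_cycle or need_lbcs:
        tasks.append("nci_era5grib")
    if need_lbcs:
        # Assumed numbering: one ec_recon task per boundary time, 000 upwards.
        tasks += [f"ec_recon_{i:03d}" for i in range(n_lbc_times)]
        tasks.append("aus2200_d0198_RA3_um_createbc_000")
        if archive_lbcs:
            # Assumption: only freshly created boundary conditions are archived.
            tasks.append("put_lbc_to_archive")
    elif initial_cycle:
        # The initial dump is still needed even when the LBCs are archived.
        tasks.append("ec_recon_000")
    return tasks


print(driving_model_tasks(initial_cycle=False, lbc_in_archive=True))
print(driving_model_tasks(initial_cycle=True, lbc_in_archive=True))
print(driving_model_tasks(initial_cycle=False, lbc_in_archive=False))
```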
+
+
+recover_sst
+This task retrieves the sea-surface temperature from the input dump file created by the ec_recon_000
task. This is necessary as there is an error in the sea ice concentration in the dump file caused by the interpolation from the low-resolution ERA5 dataset, combined with the era5grib
library’s handling of missing values. This can be summarised visually in the following images (Right image created by Martin Dix)
+
+Any of the coarse-resolution pixels that the reconfiguration task has determined to contain more than some fraction of ocean is treated as an ocean point with a sea ice fraction of about 14%. This error is rectified by restoring the sea ice fraction from the ancillary files. However, the UM cannot do this without also loading the sea surface temperature from an ancillary file. This task ensures the sea surface temperature is not modified. The addition of these ancillary files into the input dump is controlled by the 2
and 3
entries under the ‘Configure ancils and initialise dump fields’ section of the rose suite configuration.
+
+
+
+create_soil_moisture_ancil
+This task is used to carry over soil moisture from a previous run of the suite. When performing restart runs, it is advisable to carry the soil moisture through the entire run, as soil moisture has a much longer spin-up time than atmospheric processes. The restart run procedure is explained in more detail in the Restart and Spinup section below. This task is also used to partially serialise restart runs: it can be submitted at any time and will wait until the correct dump file has been created by a previous run of the suite before creating the soil moisture ancillary file. The addition of this ancillary file into the input dump is controlled by the 1
entry under ‘Configure ancils and initialise dump fields’.
+
+The creation of this ancillary file is controlled by the SOIL_MOISTURE_CARRYOVER
setting under ‘Cycling options’ under the ‘Nesting Suite’ area in the rose suite configuration.
+Note that if this is set to false, the addition of the soil moisture field must be removed from the input dump. This is done by right-clicking on the row in the ‘Configure ancils and initialise dump fields’ area and selecting ‘Ignore this section’.
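The "submit at any time and wait for the dump file" behaviour can be pictured as a simple polling loop. The sketch below is illustrative only. How the real task waits is not described in this document, and the function name, polling interval and timeout are hypothetical.

```python
import tempfile
import time
from pathlib import Path


def wait_for_file(path, poll_seconds=1.0, timeout_seconds=10.0):
    """Return True once path exists, False if the timeout expires first."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if path.exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


with tempfile.TemporaryDirectory() as tmp:
    dump = Path(tmp) / "umnsaa_d006"
    # The dump from the previous suite is not there yet: the wait times out.
    missing = wait_for_file(dump, poll_seconds=0.01, timeout_seconds=0.05)
    # Once the previous suite has written the dump, the wait returns at once.
    dump.write_text("placeholder")
    present = wait_for_file(dump, poll_seconds=0.01, timeout_seconds=0.05)

print(missing, present)
```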
+
+
+aus2200_d0198_RA3_um_recon
+This is the main reconfiguration task for creating the model input dump. It is run once per suite, and for the remainder of the cycle points, the output dump is fed directly into the model as the input dump for the next cycle. This job is configured to run on 2 Sapphire Rapids nodes (normalsr
queue, 208 cores) and runs for approximately 38 minutes.
+
+
+aus2200_d0198_RA3_um_fcst_000
+This is the main UM model run task. It is configured to complete 6 model hours, and is run on 98 sapphire rapids nodes (normalsr
queue, 10,192 cores) for approximately 8 minutes. This is the most expensive task of the suite, and uses approximately 2.7kSU.
+The model runs in a 68x68 processor decomposition, with 2 threads per MPI process and 48 IO server tasks arranged in a 6x8 configuration (6 IO servers, 8 tasks per server). This gives a total of 4672 MPI processes and 9344 threads. In order to align with the 8 NUMA domains on the Sapphire Rapids nodes, the model is run with 96 threads per node, with 12 threads running on each NUMA domain. This gives a total of 98 nodes, with the final node utilised at one third capacity.
+A significant amount of optimisation work was performed in order to reach this level of performance. We advise against changing any of the UM processor decomposition or IO System Settings unless the current settings prevent the model from running at all.
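The process, thread and node counts quoted above follow from straightforward arithmetic, reproduced here as a worked check. The figure of 104 cores per node comes from the node sizes quoted elsewhere in this document. The variable names are illustrative only.

```python
import math

compute_ranks = 68 * 68                # 4624 ranks in the 68x68 decomposition
io_ranks = 6 * 8                       # 48 IO server tasks
mpi_ranks = compute_ranks + io_ranks   # 4672 MPI processes
threads = mpi_ranks * 2                # 9344 threads (2 per MPI process)

threads_per_node = 8 * 12              # 8 NUMA domains x 12 threads = 96
nodes = math.ceil(threads / threads_per_node)
threads_on_last_node = threads - (nodes - 1) * threads_per_node
last_node_fraction = threads_on_last_node / threads_per_node

cores_requested = nodes * 104          # 104 cores per Sapphire Rapids node

print(mpi_ranks, threads, nodes, threads_on_last_node, cores_requested)
```

The last node carries 32 of its 96 thread slots, which is the one third utilisation mentioned above.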
+
+
+housekeep_cycle
+This is a standard task in all UM suites that archives logs and cleans working directories. In this suite it is configured to tar the log files and remove the work directory from the previous cycle.
+
+
+convert_to_nc
+This task converts the raw UM output from the previous cycle to netCDF format. It is advised to retain this task, as the converted netCDF output takes up around 1/6th the disk space of the raw UM output. This task removes the need to run the ACCESS-Archiver and can be used to move the output to its final location. This task is only run when ‘Convert output to netCDF’ is set to true under ‘General Run Options’ ({{CONVERT_TO_NETCDF}}
in rose-suite.conf
). The final location of the netCDF files is set in the ‘Path to converted netCDF files’ configuration option ({{NC_OUTPUT_PATH}}
in rose-suite.conf
)
+
+The netCDF files are created by the um2netcdf4
python package, which uses iris
to create the netCDF files. This task runs on 26 sapphire rapids cores (normalsr
queue) and completes in about 20 minutes.
+
+
+verify_nc
+This task uses iris
and xarray
to verify that the netCDF files created in the convert_to_nc
task are correct. This task runs on one Sapphire Rapids node (normalsr
queue, 104 cpus) and completes in about 10 minutes.
+
+
+
+Restart and Spinup
+There are two different run styles for the AUS2200 suite used in experiments: the ‘continuous’ and ‘restart’ style runs. The continuous run style is fairly straightforward. All that is necessary to configure is the INITIAL_CYCLE_POINT
and FINAL_CYCLE_POINT
, and the model will run in 6-hour cycle points between those two time stamps.
+Restart style runs are more complex, as each restart may require spin-up time, and output fields may need to be carried over between overlapping suite runs. The AUS2200 suite has the ability to manage this, specifically for soil moisture fields. The suite can be configured to carry over soil moisture data between 24-hour restart runs. This is controlled by a combination of the create_soil_moisture_ancil
task, and the input into the aus2200_d0198_RA3_um_recon
task. To enable soil moisture carry-over, set SOIL_MOISTURE_CARRYOVER
to true, and set SPINUP_HOURS
to an integer value.
+SPINUP_HOURS is only used in restart runs, and determines how many hours of the model run are estimated to be required to ‘spin-up’ the atmosphere model. Typically, the model is restarted every 24 hours, and run for 24 + SPINUP_HOURS
model hours, with the first SPINUP_HOURS
discarded. In order for the soil moisture carryover to work, the suites must adhere to a strict naming convention that allows the create_soil_moisture_ancil
task to derive the name of the previous restart suite. The convention is as follows:
+
+
Suite naming convention
+
For a suite named u-ab123 with an initial cycle point of 20010101T0000 with SPINUP_HOURS
set to 24, and set to run for 48 hours, the soil moisture will be carried over from the model dump created at cycle point 20001231T1800 of the cylc suite named u-ab123-20001231T0000
. The rose date
function is used to calculate the date stamps for the previous suite, but this does rely on SPINUP_HOURS
being set correctly.
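The worked example above can be reproduced with a short date calculation. This sketch assumes a 24-hour restart interval and 6-hour cycles, and only mirrors the example given. The real task uses rose date together with SPINUP_HOURS in a way that is not spelled out here, so the function name and its parameters are hypothetical.

```python
from datetime import datetime, timedelta

STAMP = "%Y%m%dT%H%M"


def previous_restart(base_suite, initial_cycle_point, restart_hours=24, cycle_hours=6):
    """Return (previous suite name, cycle point holding the dump to carry over)."""
    start = datetime.strptime(initial_cycle_point, STAMP)
    # Assumption: the previous suite started one restart interval earlier.
    prev_start = start - timedelta(hours=restart_hours)
    # The dump valid at this suite's start is written by the cycle before it.
    dump_cycle = start - timedelta(hours=cycle_hours)
    return f"{base_suite}-{prev_start.strftime(STAMP)}", dump_cycle.strftime(STAMP)


suite, cycle = previous_restart("u-ab123", "20010101T0000")
print(suite, cycle)
```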
+
+The create_soil_moisture_ancil
task is configured to wait for the completion of the aus2200_d0198_RA3_um_fcst_000
task from the previous suite. Therefore, it is possible to run multiple versions of the AUS2200 suite that rely on soil moisture carryover simultaneously. This allows initial and boundary conditions to be created and cached ahead of time, whilst the reconfiguration and model run tasks will only commence when the required data is present. This saves a significant amount of resources, as it is typical for model runs to overlap due to the required spin-up time. Later model runs can therefore use cached boundary conditions created for earlier cases.
+
+Typical restart workflow
+A typical daily restart run requires multiple copies of the AUS2200 suite to run simultaneously. The following encapsulates the workflow for the AUS2200 model run described here. The simulation covers 2-9 June 2016, with a 2-day soil moisture spin-up preceding the full simulation. The simulation is restarted every day, and run with a 24 hour spin-up time. This means that a total of 9 variants of the AUS2200 suite are run.
+
+Therefore, the full model run needs to start on the 30th of May 2016. For this initial simulation, SOIL_MOISTURE_CARRYOVER
must be false, item 1
in the ‘Configure ancils and initialise dump fields’ section of the um
task must be set to ignored, INITIAL_CYCLE_POINT
must be set to 20160530T0000Z, and FINAL_CYCLE_POINT
must be set to 20160531T1800Z. Submit this suite with rose suite-run
.
+In your ~/roses
directory, make a copy of the AUS2200 suite, following the naming convention outlined above. This example uses u-cs142
for the original suite id.
+
+u-cs142$ cd ..
+roses$ cp -r u-cs142 u-cs142-20160601T0000
+roses$ cd u-cs142-20160601T0000
+u-cs142-20160601T0000$ rose edit
+
+With the rose config editor window open, change INITIAL_CYCLE_POINT
to match the date stamp of the directory, set FINAL_CYCLE_POINT
to 20160602T1800. Re-enable SOIL_MOISTURE_CARRYOVER
and ensure SPINUP_HOURS
is set to 24. Ensure that item 1
in the ‘Configure ancils and initialise dump fields’ section of the um
task has been re-enabled.
+
+Once this is done, the suite can now be started with rose suite-run
. At this point, the driving model reconfiguration tasks and and create_soil_moisture_ancil
will run. As a 24 hour spinup time has been requested, create_soil_moisture_ancil
will expect the file $HOME/cylc-run/u-cs142-20160531T0000/share/cycle/20160531T1800Z/aus2200/d0198/RA3/ics/umnsaa_d006
to exist in order to extract the soil moisture fields. Create a symlink with the following name in your ~/cylc-run
directory
+
+cylc-run$ ln -s u-cs142 u-cs142-20160531T0000
+
+From this point, create a copy of u-cs142-20160601T0000
and move INITIAL_CYCLE_POINT
and FINAL_CYCLE_POINT
forward by one day.
+
+u-cs142-20160601T0000$ cd ..
+roses$ cp -r u-cs142-20160601T0000 u-cs142-20160602T0000
+roses$ cd u-cs142-20160602T0000
+u-cs142-20160602T0000$ rose edit
+
+Then run the newly modified suite with rose suite-run
. Repeat this for the six remaining starting cycle points. There is no limit to how many cylc suites can be run simultaneously. The order of the tasks between suites is controlled by the create_soil_moisture_ancil
tasks. The AUS2200 reconfiguration task will not be permitted to start until the corresponding cycle point from the previous suite has completed. All suites can then be left to run to completion.
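The dated restart suites in this workflow follow a regular pattern, so their names and cycle points can be tabulated ahead of time. The sketch below lists the eight dated copies that follow the initial u-cs142 suite, assuming daily restarts, a 24 hour spin-up and 6-hour cycles as in this example. The function name is hypothetical, and the sketch does not copy or edit any suites.

```python
from datetime import datetime, timedelta

STAMP = "%Y%m%dT%H%M"


def restart_suites(base_suite, first_start, n_suites,
                   spinup_hours=24, restart_hours=24, cycle_hours=6):
    """Return (suite name, INITIAL_CYCLE_POINT, FINAL_CYCLE_POINT) per restart."""
    start = datetime.strptime(first_start, STAMP)
    run_hours = restart_hours + spinup_hours
    plan = []
    for day in range(n_suites):
        initial = start + timedelta(hours=day * restart_hours)
        # The final cycle point is the start of the last 6-hour cycle.
        final = initial + timedelta(hours=run_hours - cycle_hours)
        plan.append((f"{base_suite}-{initial.strftime(STAMP)}",
                     initial.strftime(STAMP), final.strftime(STAMP)))
    return plan


plan = restart_suites("u-cs142", "20160601T0000", 8)
for name, initial, final in plan:
    print(name, initial, final)
```

Together with the initial u-cs142 suite, this gives the nine variants mentioned above.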
+
+
+
+netCDF conversion post-processing
+
+Acknowledgements
+Thanks to Davide Marchegiani for animated-terminal.js. Thanks to Martin Dix and Scott Wales for resolving issues with the model.
+
+
+