Skip to content

ECP-CANDLE/Supervisor

Repository files navigation

Overview of CANDLE/Supervisor

CANDLE/Supervisor is a collection of workflows and supporting scripts to run ensembles of deep learning models on exascale computers. These include flat parameter sweeps, hyperparameter optimization, and other analysis workflows.

Architecture

The CANDLE/Supervisor architecture is depicted in the figure below:

Supervisor

In the image, blue components are those to be heavily edited by users, where the white components will require little or no changes.

Invocation

Supervisor is invoked via the supervisor tool or via a test script. Both approaches internally call the workflow.sh script for the appropriate workflow.

The supervisor tool

The supervisor tool allows the user to pass in Bash-formatted configuration files, which may be flat or structured. These are loaded and then the workflow is run. Configuration scripts compatible with this tool are conventionally named cfg-*.sh

Test scripts

Supervisor test script starts with a Bash test script (shown as "test L"); these are intended to be rapidly copied, changed, and executed. Supervisor workflow examples include cases named test-1 or test-nightly that are the simplest cases to run. These test scripts can override settings from the cfg files for quick tests. The command-line arguments to the test script include the site name and the experiment directory name to use for the run, these can be extended. The test script also specifies MODEL_RETURN, which is the value to be returned by the model run, typically val_loss, etc.

Settings files

The cfg files are Bash scripts. Conventionally, they are split into settings that configure the run with respect to the computing system ("sys") or the numerical parameters ("prm") of the run. This allows settings to be reused across tests. These can also be easily duplicated and modified. cfg-sys settings include QUEUE, WALLTIME, PROCS, and other basic compute job settings. cfg-prm settings are typically related to the optimizer for HPO or other numerical settings. These settings can also be merged into a single file.

The site settings are Bash scripts provided by Candle but can be modified or duplicated for new machines. These refer to env-SITE, sched-SITE, and langs-app-SITE files in Supervisor/workflows/common/sh . env-SITE configures the login node environment, sched-SITE configures the use of the system scheduler, and langs-app configures the environment to be used on the compute node (if Swift/T app functions are used). env-SITE typically sets up PATH, PYTHONPATH, etc., for Python, Swift/T, etc. sched-SITE sets the Swift/T MACHINE variable so that Swift/T uses the right scheduler, and any other site-specific scheduler settings. langs-app-SITES typically sets PATH, PYTHONPATH, etc., for the compute node, which is commonly different from the settings for the login node.

The workflow layer

The workflow shell script (workflow.sh) is a Bash script that loads the previously mentioned settings passes them to Swift/T. Some settings are passed as environment variables, some as command-line arguments, and so on. Other workflow-specific configuration may also be done here (setting up restarts, etc.).

The Swift/T launcher is part of Swift/T. This includes the Swift/T scheduler templates that are filtered and launched by Swift/T. See the Swift/T Sites Guide for details.

Finally, qsub or equivalent is run by the Swift/T launcher scripts.

Execution on the compute nodes

Swift/T runs as single multi-node MPI job, shown as workflow.swift . Swift/T logic encoded in workflow.swift proceeds until a Swift/T app function or call to python() launches execution. In Supervisor, this conventionally happens inside the function called candle_model_train(), which launches a CANDLE-compliant model run. candle_model_train() is the CANDLE objective function, in the sense of optimization or whatever else the workflow is doing. candle_model_train() has multiple implementations for external forked execution (model_app.swift), in-memory Python execution (model_py.swift), container execution (model_container.swift) and a debugging non-execution mode that reports the hyperparameters (model_echo.swift). This is controlled by environment variable CANDLE_MODEL_IMPL and is actuated by the swift-t -i argument, which imports the selected Swift/T module into the workflow.

Execution via the Model Shell

External forked execution in Supervisor starts with a call to the Model Shell (model.sh) which redirects output to model.log, loads the langs-app-SITE settings described above, and runs the Model Runner in Python. The Model Shell sets up PYTHONPATH and prepends APP_PYTHONPATH - this is because the Python used by Swift/T may be different from the Python the user desires to use on the compute nodes. PYTHONPATH will affect Swift/T but APP_PYTHONPATH will not. APP_PYTHONPATH can contain any entries needed to run the user model, including entries that will break Swift/T, such as libraries from an entirely different Python environment.

Execution via the Model Runner

In-memory execution skips straight to the Model Runner without fork or the ability to perform shell configuration. APP_PYTHONPATH is not used here, so the PYTHONPATH must be correct for both Swift/T and the user model, since they are sharing the same Python instance.

The Model Runner (model_runner.py) sets up and runs the model. It imports the required Python modules, processes the hyperparameters (formatted as JSON), and performs other optional configuration, and then runs the model via CANDLE-compliant interfaces. The return value is specified by the MODEL_RETURN value.

Other Python models

To run an external model, that is, not one of the CANDLE Benchmarks, the user should set environment variable MODEL_NAME and PYTHONPATH and/or APP_PYTHONPATH. The Model Runner will attempt to load the model package via the Python command import MODEL_NAME_baseline_FRAMEWORK where MODEL_NAME and FRAMEWORK are set in the environment. It will then invoke the CANDLE-compliant API on that package.

Containerized models via IMPROVE

Containerized models are specified with CANDLE_MODEL_TYPE=SINGULARITY. They are executed by model.sh, which simply runs the command

singularity exec SIF train.sh

instead of invoking python directly. Output is gathered from stdout as in the normal Python case.

Examples of IMPROVE Singularity container definitions may be found here.

How to call your model

These environment variable settings may be placed in any Supervisor configuration file.

Models in plain Python

Assuming your model is named MODEL:

  1. Wrap your model in the CANDLE-compliant API

  2. Name your module MODEL_baseline_keras2.py

  3. Set PYTHONPATH to the location of your module

  4. Set export MODEL_NAME=MODEL

Models in Singularity containers

  1. Package your container according to IMPROVE guidelines

  2. export MODEL_NAME=/path/to/model.sif

  3. export CANDLE_MODEL_TYPE=SINGULARITY

Output directories

For each workflow run, an experiment directory is created named:

EXPDIR=$CANDLE_DATA_DIR/WORKFLOW_NAME/EXP00X

by referencing CANDLE_DATA_DIR, the workflow name, and a unique number 00X.

Each rank R (worker process) of the workflow directs its stdout to EXPDIR/out/out-R.txt.

When models are run (including under Singularity), their output is redirected to

$CANDLE_DATA_DIR/MODEL_NAME/Output/EXP00X/RUN00N

for run directories numbered 00R.

Installation

Local Linux or Mac/ARM system:

  1. Install Python 3.9

    1. Specific workflows such as GA may require certain Python libraries

  2. Install candle_lib

  3. Install Swift/T from Anaconda:

    conda install -c conda-forge -c swift-t swift-t
  4. Check out Supervisor

    1. Switch to branch develop

    2. Add Supervisor/bin to PATH

  5. Run supervisor

Compute cluster or supercomputer

Install as for local system but build Swift/T from source.