Modules

This directory contains a set of core modules built for the Cluster Toolkit. Modules describe the building blocks of an AI/ML and HPC deployment. The expected fields in a module are listed in more detail below. Blueprints can be extended in functionality by incorporating modules from GitHub repositories.

All Modules

Modules from various sources are all listed here for visibility. Badges are used to indicate the source and status of many of these resources.

Modules listed below with the core badge are located in this folder and are tested and maintained by the Cluster Toolkit team.

Modules labeled with the community badge are contributed by the community (including the Cluster Toolkit team, partners, etc.). Community modules are located in the community folder.

Modules labeled with the deprecated badge are now deprecated and may be removed in the future. Customers are advised to transition to alternatives.

Modules that are still in development and less stable are labeled with the experimental badge.

Compute

vm-instance : Creates one or more VM instances.
schedmd-slurm-gcp-v5-partition : Creates a partition to be used by a slurm-controller.
schedmd-slurm-gcp-v5-node-group : Creates a node group to be used by the schedmd-slurm-gcp-v5-partition module.
schedmd-slurm-gcp-v6-partition : Creates a partition to be used by a slurm-controller.
schedmd-slurm-gcp-v6-nodeset : Creates a nodeset to be used by the schedmd-slurm-gcp-v6-partition module.
schedmd-slurm-gcp-v6-nodeset-tpu : Creates a TPU nodeset to be used by the schedmd-slurm-gcp-v6-partition module.
schedmd-slurm-gcp-v6-nodeset-dynamic : Creates a dynamic nodeset to be used by the schedmd-slurm-gcp-v6-partition module and instance template.
gke-node-pool : Creates a Kubernetes node pool using GKE.
gke-job-template : Creates a Kubernetes job file to be used with a gke-node-pool.
htcondor-execute-point : Manages a group of execute points for use in an HTCondor pool.
pbspro-execution : Creates execution hosts for use in a PBS Professional cluster.
mig : Creates a Managed Instance Group.
notebook : Creates a Vertex AI Notebook. Primarily used for FSI - MonteCarlo Tutorial.

Database

slurm-cloudsql-federation : Creates a Google SQL Instance meant to be integrated with a slurm-controller.
bigquery-dataset : Creates a BQ dataset. Primarily used for FSI - MonteCarlo Tutorial.
bigquery-table : Creates a BQ table. Primarily used for FSI - MonteCarlo Tutorial.

File System

filestore : Creates a filestore file system.
parallelstore : Creates a parallelstore file system.
pre-existing-network-storage : Specifies a pre-existing file system that can be mounted on a VM.
DDN-EXAScaler : Creates a DDN EXAScaler Lustre file system. This module has license costs.
Intel-DAOS : Creates a DAOS file system.
cloud-storage-bucket : Creates a Google Cloud Storage (GCS) bucket.
gke-persistent-volume : Creates persistent volumes and persistent volume claims for shared storage.
nfs-server : Creates a VM and configures an NFS server that can be mounted by other VMs.

Monitoring

dashboard : Creates a monitoring dashboard for visually tracking a Cluster Toolkit deployment.

Network

vpc : Creates a Virtual Private Cloud (VPC) network with regional subnetworks and firewall rules.
multivpc : Creates a variable number of VPC networks using the vpc module.
pre-existing-vpc : Used to connect newly built components to a pre-existing VPC network.
firewall-rules : Add custom firewall rules to existing networks (commonly used with pre-existing-vpc).
private-service-access : Configures Private Services Access for a VPC network (commonly used with filestore and slurm-cloudsql-federation).

Packer

custom-image : Creates a custom VM Image based on the GCP HPC VM image.

Project

new-project : Creates a Google Cloud Project.
service-account : Creates service accounts for a GCP project.
service-enablement : Allows enabling various APIs for a Google Cloud Project.

Pub/Sub

topic : Creates a Pub/Sub topic. Primarily used for FSI - MonteCarlo Tutorial.
bigquery-sub : Creates a Pub/Sub subscription. Primarily used for FSI - MonteCarlo Tutorial.

Remote Desktop

chrome-remote-desktop : Creates a GPU accelerated Chrome Remote Desktop.

Scheduler

batch-job-template : Creates a Google Cloud Batch job template that works with other Toolkit modules.
batch-login-node : Creates a VM that can be used for submission of Google Cloud Batch jobs.
gke-cluster : Creates a Kubernetes cluster using GKE.
pre-existing-gke-cluster : Retrieves an existing GKE cluster. Substitute for the gke-cluster module.
schedmd-slurm-gcp-v5-controller : Creates a Slurm controller node using slurm-gcp-version-5.
schedmd-slurm-gcp-v5-login : Creates a Slurm login node using slurm-gcp-version-5.
schedmd-slurm-gcp-v5-hybrid : Creates hybrid Slurm partition configuration files using slurm-gcp-version-5.
schedmd-slurm-gcp-v6-controller : Creates a Slurm controller node using slurm-gcp-version-6.
schedmd-slurm-gcp-v6-login : Creates a Slurm login node using slurm-gcp-version-6.
htcondor-setup : Creates the base infrastructure for an HTCondor pool (service accounts and Cloud Storage bucket).
htcondor-pool-secrets : Creates and manages access to the secrets necessary for secure operation of an HTCondor pool.
htcondor-access-point : Creates a regional instance group managing a highly available HTCondor access point (login node).
pbspro-client : Creates a client host for submitting jobs to a PBS Professional cluster.
pbspro-server : Creates a server host for operating a PBS Professional cluster.

Scripts

startup-script : Creates a customizable startup script that can be fed into compute VMs.
windows-startup-script : Creates Windows PowerShell (PS1) scripts that can be used to customize Windows VMs and VM images.
htcondor-install : Creates a startup script to install HTCondor and exports a list of required APIs.
omnia-install : Installs Slurm via Dell Omnia onto a cluster of VM instances. This module has been deprecated and will be removed on August 1, 2024.
pbspro-preinstall : Creates a Cloud Storage bucket with PBS Pro RPM packages for use by PBS clusters.
pbspro-install : Creates a Toolkit runner to install PBS Professional from RPM packages.
pbspro-qmgr : Creates a Toolkit runner to run common qmgr commands when configuring a PBS Pro cluster.
ramble-execute : Creates a startup script to execute Ramble commands on a target VM.
ramble-setup : Creates a startup script to install Ramble on an instance or a Slurm login or controller.
spack-setup : Creates a startup script to install Spack on an instance or a Slurm login or controller.
spack-execute : Defines a software build using Spack.
wait-for-startup : Waits for successful completion of a startup script on a compute VM.

NOTE: Slurm V4 is deprecated. If you want to use V4 modules, please use the ghpc-v1.27.0 source code and build the ghpc binary from it. That source code also contains deprecated examples using V4 modules for your reference.

Module Fields

ID (Required)

The id field is used to uniquely identify and reference a defined module. IDs are used in variables and become the name of each module when writing the terraform main.tf file. They are also used in the use and outputs lists described below.

For terraform modules, the ID will be rendered into the terraform module label at the top level main.tf file.

Source (Required)

The source is a path or URL that points to the source files for Packer or Terraform modules. A source can either be a filesystem path or a URL to a git repository:

Filesystem paths
- modules embedded in the gcluster executable
- modules in the local filesystem
Remote modules using Terraform URL syntax
- Hosted on GitHub
- Google Cloud Storage Buckets
- Generic git repositories
When modules are in a subdirectory of the git repository, a special double-slash // notation can be required, as described below.

An important distinction is that Terraform natively supports these URLs, so remote Terraform modules are not copied to your deployment directory. Packer does not have native support for git-hosted modules, so the Toolkit will copy these modules into the deployment folder on your behalf.

Embedded Modules

Embedded modules are added to the gcluster binary during compilation and cannot be edited. To refer to embedded modules, set the source path to modules/<<MODULE_PATH>> or community/modules/<<MODULE_PATH>>.

The paths match the modules in the repository structure for core modules and community modules. Because the modules are embedded during compilation, your local copies may differ unless you recompile gcluster.

For example, this snippet uses the embedded pre-existing-vpc module:

  - id: network1
    source: modules/network/pre-existing-vpc

Local Modules

Local modules point to a module in the file system and can easily be edited. They are very useful during module development. To use a local module, set the source to a path starting with /, ./, or ../. For instance, the following module definition refers to the local pre-existing-vpc module.

  - id: network1
    source: ./modules/network/pre-existing-vpc

NOTE: Relative paths (beginning with . or ..) must be relative to the working directory from which gcluster is executed. This example would have to be run from a local copy of the Cluster Toolkit repository. An alternative is to use absolute paths to modules.
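As a minimal sketch of the absolute-path alternative, the same module could be referenced as follows. The directory /home/user/cluster-toolkit is a hypothetical location for a local clone of the repository, not a path the Toolkit requires:

```yaml
  - id: network1
    # Hypothetical absolute path to a local clone of the repository.
    # Unlike a relative path, it resolves the same way from any working directory.
    source: /home/user/cluster-toolkit/modules/network/pre-existing-vpc
```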

GitHub-hosted Modules and Packages

The Intel DAOS blueprint makes extensive use of GitHub-hosted Terraform and Packer modules. You may wish to use it as an example reference for this documentation.

To use a Terraform module available on GitHub, set the source to a path starting with github.com (HTTPS) or git@github.com (SSH). For instance, the following module definition sources the Toolkit vpc module:

  - id: network1
    source: github.com/GoogleCloudPlatform/hpc-toolkit//modules/network/vpc

This example uses the double-slash notation (//) to indicate that the Toolkit is a "package" of multiple modules whose root directory is the root of the git repository. The remainder of the path indicates the sub-directory of the vpc module.

The example above uses the default main branch of the Toolkit. Specific revisions can be selected with any valid git reference (branch, commit hash, or tag). If the git reference is a tag or branch, we recommend setting &depth=1 to reduce the data transferred over the network. This option cannot be set when the reference is a commit hash. The following examples select the vpc module on the active develop branch and an older release of the filestore module:

  - id: network1
    source: github.com/GoogleCloudPlatform/hpc-toolkit//modules/network/vpc?ref=develop
  ...
  - id: homefs
    source: github.com/GoogleCloudPlatform/hpc-toolkit//modules/file-system/filestore?ref=v1.22.1&depth=1

Because Terraform modules natively support this syntax, gcluster will not copy GitHub-hosted modules into your deployment folder. Terraform will download them into a hidden folder when you run terraform init.

GitHub-hosted Packer modules

Packer does not natively support GitHub-hosted modules so gcluster create will copy modules into your deployment folder.

If the module uses // package notation, gcluster create will copy the entire repository to the module path: deployment_name/group_name/module_id. However, when gcluster deploy is invoked, it will run Packer from the subdirectory deployment_name/group_name/module_id/subdirectory/after/double_slash.

Referring back to the Intel DAOS blueprint, we see that it will create 2 deployment groups at pfs-daos/daos-client-image and pfs-daos/daos-server-image. However, Packer will actually be invoked from subdirectories ending in daos-client-image/images and daos-server-image/images.

If the module does not use // package notation, gcluster create will copy only the final directory in the path to deployment_name/group_name/module_id.

In all cases, gcluster create will remove the .git directory from the packer module to ensure that you can manage the entire deployment directory with its own git versioning.
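The copy behavior described above can be sketched with an annotated module definition. The repository github.com/example-org/example-repo, its images subdirectory, the tag v1.0.0, and the deployment, group, and module names are all hypothetical and chosen only to illustrate the path rules:

```yaml
  # Assumed context: deployment_name is "my-deployment" and this module
  # belongs to a deployment group named "image-group".
  - id: my-image
    source: github.com/example-org/example-repo//images?ref=v1.0.0&depth=1
    kind: packer
  # Because the source uses // package notation:
  #   gcluster create copies the entire repository to
  #     my-deployment/image-group/my-image
  #   gcluster deploy runs Packer from
  #     my-deployment/image-group/my-image/images
```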

GitHub over SSH

To get a module from GitHub over SSH:

  - id: network1
    source: git@github.com:GoogleCloudPlatform/hpc-toolkit.git//modules/network/vpc

Specific versions can be selected as for HTTPS:

  - id: network1
    source: git@github.com:GoogleCloudPlatform/hpc-toolkit.git//modules/network/vpc?ref=v1.22.1&depth=1

Generic Git Modules

To use a Terraform module available in a non-GitHub git repository such as GitLab, set the source to a path starting with git::. Two standard git protocols are supported: git::https:// for HTTPS and git::git@github.com for SSH.

Additional formatting and features after git:: are identical to that of the GitHub Modules described above.
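For illustration, a generic git source over HTTPS might look like the sketch below. The host gitlab.example.com, the repository path, and the tag v1.0.0 are hypothetical; only the git:: prefix, the // notation, and the ref and depth options come from the text above:

```yaml
  - id: network1
    # Hypothetical repository; the module lives in a subdirectory, hence the //.
    source: git::https://gitlab.example.com/example-group/example-repo.git//modules/network/vpc?ref=v1.0.0&depth=1
```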

Google Cloud Storage Modules

To use a Terraform module available in a Google Cloud Storage bucket, set the source to a URL with the special gcs:: prefix, followed by a GCS bucket object URL.

For example: gcs::https://www.googleapis.com/storage/v1/BUCKET_NAME/PATH_TO_MODULE
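In a blueprint, that URL goes in the source field of the module. The sketch below keeps BUCKET_NAME and PATH_TO_MODULE as placeholders, and the module ID is only an example:

```yaml
  - id: network1
    # Replace BUCKET_NAME and PATH_TO_MODULE with the bucket and object path of the module.
    source: gcs::https://www.googleapis.com/storage/v1/BUCKET_NAME/PATH_TO_MODULE
```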

Kind (May be Required)

kind refers to the way in which a module is deployed. Currently, kind can be either terraform or packer. It must be specified for modules of type packer. If omitted, it will default to terraform.
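A minimal sketch of a Packer module definition follows. The module ID is hypothetical, and the source path assumes the custom-image module sits under the Packer category listed earlier:

```yaml
  - id: image
    source: modules/packer/custom-image
    # Required here: omitting kind would make the module default to terraform.
    kind: packer
```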

Settings (May Be Required)

The settings field is a map that supplies any user-defined variables for each module. Settings values can be simple strings, numbers or booleans, but can also support complex data types like maps and lists of variable depth. These settings will become the values for the variables defined in either the variables.tf file for Terraform or variable.pkr.hcl file for Packer.

For some modules, there are mandatory variables that must be set, so settings is a required field in that case. In many situations, a combination of sensible defaults, deployment variables and used modules can populate all required settings, and therefore the settings field can be omitted.
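The sketch below shows settings of several data types on a single module. machine_type and labels appear elsewhere in this document; instance_count and tags are assumed to be variables of the vm-instance module and should be checked against that module's documentation before use:

```yaml
  - id: workstation
    source: modules/compute/vm-instance
    use: [network1]
    settings:
      machine_type: e2-medium   # string
      instance_count: 2         # number (assumed variable name)
      labels:                   # map
        team: research
      tags:                     # list (assumed variable name)
      - allow-ssh
```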

Use (Optional)

The use field is a powerful way of linking a module to one or more other modules. When a module "uses" another module, the outputs of the used module are compared to the settings of the current module. If they have matching names and the setting has no explicit value, then it will be set to the used module's output. For example, see the following blueprint snippet:

modules:
- id: network1
  source: modules/network/vpc

- id: workstation
  source: modules/compute/vm-instance
  use: [network1]
  settings:
  ...

In this snippet, the VM instance workstation uses the outputs of vpc network1.

In this case both network_self_link and subnetwork_self_link in the workstation settings will be set to $(network1.network_self_link) and $(network1.subnetwork_self_link) which refer to the network1 outputs of the same names.
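Written out explicitly, the inference described above is roughly equivalent to the following sketch, in which the use list is replaced by direct references to the network1 outputs:

```yaml
- id: workstation
  source: modules/compute/vm-instance
  settings:
    # The same values that "use: [network1]" would infer automatically.
    network_self_link: $(network1.network_self_link)
    subnetwork_self_link: $(network1.subnetwork_self_link)
```

The use form is shorter and keeps working if the used module later gains additional outputs that match settings of the current module.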

gcluster resolves the value of each setting from the following sources, in decreasing order of priority:

1. Explicitly set in the blueprint using the settings field
2. Output from a used module, taken in the order provided in the use list
3. Deployment variable (vars) of the same name
4. Default value for the setting
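An abridged blueprint can illustrate how these sources interact. The zone names and the group name are example values, and the sketch assumes that vm-instance exposes a zone setting as described under Common Settings:

```yaml
vars:
  region: us-central1
  zone: us-central1-a

deployment_groups:
- group: primary
  modules:
  - id: network1
    source: modules/network/vpc
    # No region setting here, so it is taken from the deployment variable.

  - id: workstation
    source: modules/compute/vm-instance
    use: [network1]
    settings:
      # Explicit setting (highest priority): overrides vars.zone for this module only.
      zone: us-central1-b
      # network_self_link and subnetwork_self_link are not set explicitly,
      # so they come from the outputs of the used module network1.
      # Any setting with no value from these sources keeps the module default.
```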

NOTE: See the network storage documentation for more information about mounting network storage file systems via the use field.

Outputs (Optional)

The outputs field adds the output of individual Terraform modules to the output of its deployment group. This enables the value to be available via terraform output. This can be useful for displaying the IP of a login node or printing instructions on how to use a module, as we have in the monitoring dashboard module.

The outputs field is a list in which each item can take either of two formats: a string equal to the name of the module output, or a map specifying the name, description, and whether the value is sensitive and should be suppressed from the standard output of Terraform commands. An example is shown below that displays the internal and public IP addresses of a VM created by the vm-instance module:

  - id: simplevm
    source: modules/compute/vm-instance
    use:
    - network1
    settings:
      machine_type: e2-medium
    outputs:
    - internal_ip
    - name: external_ip
      description: "External IP of VM"
      sensitive: true

The outputs shown after running terraform apply will resemble:

Apply complete! Resources: 7 added, 0 changed, 0 destroyed.

Outputs:

external_ip_simplevm = <sensitive>
internal_ip_simplevm = [
  "10.128.0.19",
]

Required Services (APIs) (optional)

Each Toolkit module depends upon Google Cloud services ("APIs") being enabled in the project used by the AI/ML and HPC environment. For example, the creation of VMs requires the Compute Engine API (compute.googleapis.com). The startup-script module requires the Cloud Storage API (storage.googleapis.com) for storage of the scripts themselves. Each module included in the Toolkit source code describes its required APIs internally. The Toolkit will merge the requirements from all modules and automatically validate that all APIs are enabled in the project specified by $(vars.project_id).

Common Settings

The following common naming conventions should be used to decrease the verbosity needed to define a blueprint. This is intentional to allow multiple modules to share inferred settings from deployment variables or from other modules listed under the use field.

For example, if all modules are to be created in a single region, that region can be defined as a deployment variable named region, which is shared between all modules without an explicit setting. Similarly, if many modules need to be connected to the same VPC network, they all can add the vpc module ID to their use list so that network_self_link would be inferred from that vpc module rather than having to set it manually.

project_id: The GCP project ID in which to create the GCP resources.
deployment_name: The name of the current deployment of a blueprint. This can help to avoid naming conflicts of modules when multiple deployments are created from the same blueprint.
region: The GCP region the module will be created in.
zone: The GCP zone the module will be created in.
labels: Labels added to the module. In order to include any module in advanced monitoring, labels must be exposed. We strongly recommend that all modules expose this variable.
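As a sketch of how these conventions reduce verbosity, the common settings can be declared once as deployment variables and left out of the individual modules. All values below are placeholders, and the label key is hypothetical:

```yaml
vars:
  project_id: my-project-id     # placeholder GCP project ID
  deployment_name: demo-cluster
  region: us-central1
  zone: us-central1-a
  labels:
    owner: hpc-team             # hypothetical label

deployment_groups:
- group: primary
  modules:
  - id: network1
    source: modules/network/vpc
    # project_id, deployment_name and region are inferred from vars.

  - id: workstation
    source: modules/compute/vm-instance
    use: [network1]
    # Settings named after the deployment variables above are inferred from vars;
    # network_self_link and subnetwork_self_link are inferred from network1.
```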

Writing Custom Cluster Toolkit Modules

Modules are flexible by design; however, we do define some best practices for creating a new module meant to be used with the Cluster Toolkit.
