This directory contains a set of core modules built for the Cluster Toolkit. Modules describe the building blocks of an AI/ML and HPC deployment. The expected fields in a module are listed in more detail below. Blueprints can be extended in functionality by incorporating modules from GitHub repositories.
Modules from various sources are all listed here for visibility. Badges are used to indicate the source and status of many of these resources.
Modules listed below with the core badge are located in this folder and are tested and maintained by the Cluster Toolkit team.
Modules labeled with the community badge are contributed by the community (including the Cluster Toolkit team, partners, etc.). Community modules are located in the community folder.
Modules labeled with the deprecated badge are now deprecated and may be removed in the future. Customers are advised to transition to alternatives.
Modules that are still in development and less stable are labeled with the experimental badge.
- vm-instance : Creates one or more VM instances.
- schedmd-slurm-gcp-v5-partition : Creates a partition to be used by a slurm-controller.
- schedmd-slurm-gcp-v5-node-group : Creates a node group to be used by the schedmd-slurm-gcp-v5-partition module.
- schedmd-slurm-gcp-v6-partition : Creates a partition to be used by a slurm-controller.
- schedmd-slurm-gcp-v6-nodeset : Creates a nodeset to be used by the schedmd-slurm-gcp-v6-partition module.
- schedmd-slurm-gcp-v6-nodeset-tpu : Creates a TPU nodeset to be used by the schedmd-slurm-gcp-v6-partition module.
- schedmd-slurm-gcp-v6-nodeset-dynamic : Creates a dynamic nodeset to be used by the schedmd-slurm-gcp-v6-partition module and instance template.
- gke-node-pool : Creates a Kubernetes node pool using GKE.
- gke-job-template : Creates a Kubernetes job file to be used with a gke-node-pool.
- htcondor-execute-point : Manages a group of execute points for use in an HTCondor pool.
- pbspro-execution : Creates execution hosts for use in a PBS Professional cluster.
- mig : Creates a Managed Instance Group.
- notebook : Creates a Vertex AI Notebook. Primarily used for FSI - MonteCarlo Tutorial.
- slurm-cloudsql-federation : Creates a Google SQL Instance meant to be integrated with a slurm-controller.
- bigquery-dataset : Creates a BQ dataset. Primarily used for FSI - MonteCarlo Tutorial.
- bigquery-table : Creates a BQ table. Primarily used for FSI - MonteCarlo Tutorial.
- filestore : Creates a filestore file system.
- parallelstore : Creates a parallelstore file system.
- pre-existing-network-storage : Specifies a pre-existing file system that can be mounted on a VM.
- DDN-EXAScaler : Creates a DDN EXAScaler Lustre file system. This module has license costs.
- Intel-DAOS : Creates a DAOS file system.
- cloud-storage-bucket : Creates a Google Cloud Storage (GCS) bucket.
- gke-persistent-volume : Creates persistent volumes and persistent volume claims for shared storage.
- nfs-server : Creates a VM and configures an NFS server that can be mounted by other VMs.
- dashboard : Creates a monitoring dashboard for visually tracking a Cluster Toolkit deployment.
- vpc : Creates a Virtual Private Cloud (VPC) network with regional subnetworks and firewall rules.
- multivpc : Creates a variable number of VPC networks using the vpc module.
- pre-existing-vpc : Used to connect newly built components to a pre-existing VPC network.
- firewall-rules : Add custom firewall rules to existing networks (commonly used with pre-existing-vpc).
- private-service-access : Configures Private Services Access for a VPC network (commonly used with filestore and slurm-cloudsql-federation).
- custom-image : Creates a custom VM Image based on the GCP HPC VM image.
- new-project : Creates a Google Cloud Project.
- service-account : Creates service accounts for a GCP project.
- service-enablement : Allows enabling various APIs for a Google Cloud Project.
- topic : Creates a Pub/Sub topic. Primarily used for FSI - MonteCarlo Tutorial.
- bigquery-sub : Creates a Pub/Sub subscription. Primarily used for FSI - MonteCarlo Tutorial.
- chrome-remote-desktop : Creates a GPU accelerated Chrome Remote Desktop.
- batch-job-template : Creates a Google Cloud Batch job template that works with other Toolkit modules.
- batch-login-node : Creates a VM that can be used for submission of Google Cloud Batch jobs.
- gke-cluster : Creates a Kubernetes cluster using GKE.
- pre-existing-gke-cluster : Retrieves an existing GKE cluster. A substitute for the gke-cluster module.
- schedmd-slurm-gcp-v5-controller : Creates a Slurm controller node using slurm-gcp-version-5.
- schedmd-slurm-gcp-v5-login : Creates a Slurm login node using slurm-gcp-version-5.
- schedmd-slurm-gcp-v5-hybrid : Creates hybrid Slurm partition configuration files using slurm-gcp-version-5.
- schedmd-slurm-gcp-v6-controller : Creates a Slurm controller node using slurm-gcp-version-6.
- schedmd-slurm-gcp-v6-login : Creates a Slurm login node using slurm-gcp-version-6.
- htcondor-setup : Creates the base infrastructure for an HTCondor pool (service accounts and Cloud Storage bucket).
- htcondor-pool-secrets : Creates and manages access to the secrets necessary for secure operation of an HTCondor pool.
- htcondor-access-point : Creates a regional instance group managing a highly available HTCondor access point (login node).
- pbspro-client : Creates a client host for submitting jobs to a PBS Professional cluster.
- pbspro-server : Creates a server host for operating a PBS Professional cluster.
- startup-script : Creates a customizable startup script that can be fed into compute VMs.
- windows-startup-script : Creates Windows PowerShell (PS1) scripts that can be used to customize Windows VMs and VM images.
- htcondor-install : Creates a startup script to install HTCondor and exports a list of required APIs.
- omnia-install : Installs Slurm via Dell Omnia onto a cluster of VM instances. This module has been deprecated and will be removed on August 1, 2024.
- pbspro-preinstall : Creates a Cloud Storage bucket with PBS Pro RPM packages for use by PBS clusters.
- pbspro-install : Creates a Toolkit runner to install PBS Professional from RPM packages.
- pbspro-qmgr : Creates a Toolkit runner to run common qmgr commands when configuring a PBS Pro cluster.
- ramble-execute : Creates a startup script to execute Ramble commands on a target VM.
- ramble-setup : Creates a startup script to install Ramble on an instance or a slurm login or controller.
- spack-setup : Creates a startup script to install Spack on an instance or a slurm login or controller.
- spack-execute : Defines a software build using Spack.
- wait-for-startup : Waits for successful completion of a startup script on a compute VM.
NOTE: Slurm V4 is deprecated. If you want to use V4 modules, use the ghpc-v1.27.0 source code and build the ghpc binary from it. That source code also contains deprecated examples using V4 modules for your reference.
The id field is used to uniquely identify and reference a defined module. IDs are used in variables and become the name of each module when writing the Terraform main.tf file. They are also used in the use and outputs lists described below.
For Terraform modules, the ID will be rendered into the Terraform module label in the top-level main.tf file.
The source is a path or URL that points to the source files for Packer or Terraform modules. A source can either be a filesystem path or a URL to a git repository:
- Filesystem paths
  - modules embedded in the gcluster executable
  - modules in the local filesystem
- Remote modules using Terraform URL syntax
  - Hosted on GitHub
  - Google Cloud Storage Buckets
  - Generic git repositories
When modules are in a subdirectory of the git repository, a special double-slash // notation can be required as described below.
An important distinction is that those URLs are natively supported by Terraform so they are not copied to your deployment directory. Packer does not have native support for git-hosted modules so the Toolkit will copy these modules into the deployment folder on your behalf.
Embedded modules are added to the gcluster binary during compilation and cannot be edited. To refer to embedded modules, set the source path to modules/<<MODULE_PATH>> or community/modules/<<MODULE_PATH>>.
The paths match the modules in the repository structure for core modules and community modules. Because the modules are embedded during compilation, your local copies may differ unless you recompile gcluster.
For example, this snippet uses the embedded pre-existing-vpc module:
- id: network1
  source: modules/network/pre-existing-vpc
Local modules point to a module in the file system and can easily be edited. They are very useful during module development. To use a local module, set the source to a path starting with /, ./, or ../. For instance, the following module definition refers to the local pre-existing-vpc module.
- id: network1
  source: ./modules/network/pre-existing-vpc
NOTE: Relative paths (beginning with . or ..) must be relative to the working directory from which gcluster is executed. This example would have to be run from a local copy of the Cluster Toolkit repository. An alternative is to use absolute paths to modules.
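A minimal sketch of the absolute-path alternative mentioned in the note above. The checkout location /home/alice/cluster-toolkit is a hypothetical path chosen for illustration; substitute the location of your own copy of the repository.

```yaml
# Hypothetical absolute path to a local checkout of the Cluster Toolkit repository.
# Unlike a relative path, it does not depend on the directory gcluster is run from.
- id: network1
  source: /home/alice/cluster-toolkit/modules/network/pre-existing-vpc
```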
The Intel DAOS blueprint makes extensive use of GitHub-hosted Terraform and Packer modules. You may wish to use it as an example reference for this documentation.
To use a Terraform module available on GitHub, set the source to a path starting with github.com (HTTPS) or git@github.com (SSH). For instance, the following module definition sources the Toolkit vpc module:
- id: network1
  source: github.com/GoogleCloudPlatform/hpc-toolkit//modules/network/vpc
This example uses the double-slash notation (//) to indicate that the Toolkit is a "package" of multiple modules whose root directory is the root of the git repository. The remainder of the path indicates the sub-directory of the vpc module.
The example above uses the default main branch of the Toolkit. Specific revisions can be selected with any valid git reference (git branch, commit hash or tag). If the git reference is a tag or branch, we recommend setting &depth=1 to reduce the data transferred over the network. This option cannot be set when the reference is a commit hash. The following examples select the vpc module on the active develop branch and also an older release of the filestore module:
- id: network1
  source: github.com/GoogleCloudPlatform/hpc-toolkit//modules/network/vpc?ref=develop
...
- id: homefs
  source: github.com/GoogleCloudPlatform/hpc-toolkit//modules/file-system/filestore?ref=v1.22.1&depth=1
Because Terraform modules natively support this syntax, gcluster will not copy GitHub-hosted modules into your deployment folder. Terraform will download them into a hidden folder when you run terraform init.
Packer does not natively support GitHub-hosted modules so gcluster create will copy modules into your deployment folder.
If the module uses // package notation, gcluster create will copy the entire repository to the module path: deployment_name/group_name/module_id. However, when gcluster deploy is invoked, it will run Packer from the subdirectory deployment_name/group_name/module_id/subdirectory/after/double_slash.
Referring back to the Intel DAOS blueprint, we see that it will create 2 deployment groups at pfs-daos/daos-client-image and pfs-daos/daos-server-image. However, Packer will actually be invoked from subdirectories ending in daos-client-image/images and daos-server-image/images.
If the module does not use // package notation, gcluster create will copy only the final directory in the path to deployment_name/group_name/module_id.
In all cases, gcluster create will remove the .git directory from the Packer module to ensure that you can manage the entire deployment directory with its own git versioning.
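The copy behavior described above can be illustrated with a hypothetical Packer module. The repository example-org/example-images, the tag v1.0.0 and the module ID image are invented for this sketch, and deployment_name and group_name stand in for your own values.

```yaml
# Hypothetical GitHub-hosted Packer module using // package notation.
- id: image
  source: github.com/example-org/example-images//images?ref=v1.0.0&depth=1
  kind: packer
# Following the rules described above:
#   gcluster create copies the whole repository (without .git) to
#     deployment_name/group_name/image
#   gcluster deploy runs Packer from
#     deployment_name/group_name/image/images
```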
Get module from GitHub over SSH:
- id: network1
  source: git@github.com:GoogleCloudPlatform/hpc-toolkit.git//modules/network/vpc
Specific versions can be selected as for HTTPS:
- id: network1
  source: git@github.com:GoogleCloudPlatform/hpc-toolkit.git//modules/network/vpc?ref=v1.22.1&depth=1
To use a Terraform module available in a non-GitHub git repository such as GitLab, set the source to a path starting with git::. Two standard git protocols are supported: git::https:// for HTTPS or git::git@github.com for SSH. Additional formatting and features after git:: are identical to those of the GitHub modules described above.
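As a rough sketch of the generic git form, the module below is fetched over HTTPS from a self-hosted GitLab server. The host gitlab.example.com, the repository path, the subdirectory network/vpc and the tag v1.0.0 are all hypothetical; only the git:: prefix, the // notation and the ref and depth parameters come from the text above.

```yaml
# Hypothetical module in a non-GitHub git repository, fetched over HTTPS.
- id: network1
  source: git::https://gitlab.example.com/example-group/example-modules.git//network/vpc?ref=v1.0.0&depth=1
```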
To use a Terraform module available in a Google Cloud Storage bucket, set the source to a URL with the special gcs:: prefix, followed by a GCS bucket object URL. For example: gcs::https://www.googleapis.com/storage/v1/BUCKET_NAME/PATH_TO_MODULE
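Placed in a module definition, the same URL form would look roughly like the sketch below. BUCKET_NAME and PATH_TO_MODULE remain placeholders, and the module ID network1 is only illustrative.

```yaml
# BUCKET_NAME and PATH_TO_MODULE are placeholders to be replaced with real values.
- id: network1
  source: gcs::https://www.googleapis.com/storage/v1/BUCKET_NAME/PATH_TO_MODULE
```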
kind refers to the way in which a module is deployed. Currently, kind can be either terraform or packer. It must be specified for modules of type packer. If omitted, it will default to terraform.
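The two cases can be sketched side by side. The source path modules/packer/custom-image is an assumption based on the repository layout, and the module ID image is illustrative. The two entries are shown together for comparison only; this is not a complete blueprint.

```yaml
# kind omitted: the module is treated as a Terraform module.
- id: network1
  source: modules/network/vpc

# kind must be set explicitly for a Packer module.
- id: image
  source: modules/packer/custom-image  # assumed path of the custom-image module
  kind: packer
```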
The settings field is a map that supplies any user-defined variables for each module. Settings values can be simple strings, numbers or booleans, but can also be complex data types like maps and lists of variable depth. These settings will become the values for the variables defined in either the variables.tf file for Terraform or the variable.pkr.hcl file for Packer.
For some modules, there are mandatory variables that must be set, so settings is a required field in that case. In many situations, a combination of sensible defaults, deployment variables and used modules can populate all required settings, and therefore the settings field can be omitted.
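A minimal sketch of a settings map mixing simple and complex value types. machine_type appears elsewhere in this document; the other setting names (instance_count, disable_public_ips, tags, labels) are assumed to exist on the vm-instance module, so check its variables.tf before relying on them. The values themselves are invented.

```yaml
- id: workstation
  source: modules/compute/vm-instance
  settings:
    machine_type: e2-medium      # string
    instance_count: 2            # number (assumed variable name)
    disable_public_ips: true     # boolean (assumed variable name)
    tags: [workstation, ssh]     # list (assumed variable name)
    labels:                      # map (assumed variable name)
      team: research
      env: dev
```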
The use field is a powerful way of linking a module to one or more other modules. When a module "uses" another module, the outputs of the used module are compared to the settings of the current module. If they have matching names and the setting has no explicit value, then it will be set to the used module's output. For example, see the following blueprint snippet:
modules:
- id: network1
  source: modules/network/vpc
- id: workstation
  source: modules/compute/vm-instance
  use: [network1]
  settings:
  ...
In this snippet, the VM instance workstation uses the outputs of vpc network1.
In this case both network_self_link and subnetwork_self_link in the workstation settings will be set to $(network1.network_self_link) and $(network1.subnetwork_self_link), which refer to the network1 outputs of the same names.
When inferring a setting value, gcluster uses the following sources, listed from highest to lowest priority:
- Explicitly set in the blueprint using the settings field
- Output from a used module, taken in the order provided in the use list
- Deployment variable (vars) of the same name
- Default value for the setting
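The precedence rules above can be traced through a small blueprint fragment. This is a sketch, not a complete blueprint: the deployment_groups and group keys follow the Toolkit blueprint layout as assumed here, the zone and region values are invented, and it is assumed that network1 does not output a value named region or zone.

```yaml
vars:
  region: us-central1
  zone: us-central1-a

deployment_groups:
- group: primary
  modules:
  - id: network1
    source: modules/network/vpc

  - id: workstation
    source: modules/compute/vm-instance
    use: [network1]
    settings:
      # 1. Explicit setting: wins over the deployment variable vars.zone.
      zone: us-central1-b
      # 2. network_self_link and subnetwork_self_link have no explicit value,
      #    so they are taken from the outputs of the used module network1.
      # 3. region has no explicit value and (by assumption) no matching output,
      #    so it falls back to the deployment variable vars.region.
      # 4. Any remaining setting keeps the default defined by the module.
```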
NOTE: See the network storage documentation for more information about mounting network storage file systems via the use field.
The outputs field adds the output of individual Terraform modules to the output of its deployment group. This enables the value to be available via terraform output. This can be useful for displaying the IP of a login node or printing instructions on how to use a module, as we have in the monitoring dashboard module.
The outputs field is a list whose entries can be in either of two formats: a string equal to the name of the module output, or a map specifying the name, description, and whether the value is sensitive and should be suppressed from the standard output of Terraform commands. An example is shown below that displays the internal and public IP addresses of a VM created by the vm-instance module:
- id: vm
  source: modules/compute/vm-instance
  use:
  - network1
  settings:
    machine_type: e2-medium
  outputs:
  - internal_ip
  - name: external_ip
    description: "External IP of VM"
    sensitive: true
The outputs shown after running terraform apply will resemble:
Apply complete! Resources: 7 added, 0 changed, 0 destroyed.
Outputs:
external_ip_simplevm = <sensitive>
internal_ip_simplevm = [
  "10.128.0.19",
]
Each Toolkit module depends upon Google Cloud services ("APIs") being enabled in the project used by the AI/ML and HPC environment. For example, the creation of VMs requires the Compute Engine API (compute.googleapis.com). The startup-script module requires the Cloud Storage API (storage.googleapis.com) for storage of the scripts themselves. Each module included in the Toolkit source code describes its required APIs internally. The Toolkit will merge the requirements from all modules and automatically validate that all APIs are enabled in the project specified by $(vars.project_id).
The following common naming conventions should be used to decrease the verbosity needed to define a blueprint. They intentionally allow multiple modules to share settings inferred from deployment variables or from other modules listed under the use field.
For example, if all modules are to be created in a single region, that region can be defined as a deployment variable named region, which is shared between all modules without an explicit setting. Similarly, if many modules need to be connected to the same VPC network, they can all add the vpc module ID to their use list so that network_self_link is inferred from that vpc module rather than having to be set manually.
- project_id: The GCP project ID in which to create the GCP resources.
- deployment_name: The name of the current deployment of a blueprint. This can help to avoid naming conflicts of modules when multiple deployments are created from the same blueprint.
- region: The GCP region the module will be created in.
- zone: The GCP zone the module will be created in.
- labels: Labels added to the module. In order to include any module in advanced monitoring, labels must be exposed. We strongly recommend that all modules expose this variable.
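As a rough illustration, the conventional names listed above could be supplied once as deployment variables so that modules defining variables with the same names pick them up without explicit settings. All values below are hypothetical.

```yaml
vars:
  project_id: my-project-id       # hypothetical GCP project ID
  deployment_name: demo-cluster   # hypothetical deployment name
  region: us-central1
  zone: us-central1-a
  labels:
    team: research                # hypothetical label
```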
Modules are flexible by design; however, we do define some best practices for creating a new module meant to be used with the Cluster Toolkit.