cluster-toolkit/community/modules/scheduler/schedmd-slurm-gcp-v5-controller at main · NinaCai/cluster-toolkit

Description

This module creates a Slurm controller node via the SchedMD/slurm-gcp slurm_controller_instance and slurm_instance_template modules.

More information about Slurm On GCP can be found at the project's GitHub page and in the Slurm on Google Cloud User Guide.

The user guide provides detailed instructions on customizing and enhancing the Slurm on GCP cluster as well as recommendations on configuring the controller for optimal performance at different scales.

Warning: The variables enable_reconfigure, enable_cleanup_compute, and enable_cleanup_subscriptions, if set to true, require additional dependencies to be installed on the system deploying the infrastructure.
# Install Python3 and run
pip3 install -r https://raw.githubusercontent.com/GoogleCloudPlatform/slurm-gcp/5.12.0/scripts/requirements.txt

Example

- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
  use:
  - network1
  - homefs
  - compute_partition
  settings:
    machine_type: c2-standard-8

This creates a controller node with the following attributes:

- Connected to the primary subnetwork of network1
- Mounts the filesystem with the ID homefs (defined elsewhere in the blueprint)
- Has one partition with the ID compute_partition (defined elsewhere in the blueprint)
- Uses the machine type c2-standard-8 instead of the default c2-standard-4

For a complete example using this module, see slurm-gcp-v5-cluster.yaml.

Live Cluster Reconfiguration (`enable_reconfigure`)

The schedmd-slurm-gcp-v5-controller module supports reconfiguring partitions and the Slurm configuration in a running cluster. This option is activated through the enable_reconfigure setting:

- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
  settings:
    enable_reconfigure: true

To reconfigure a running cluster:

1. Edit the blueprint with the desired configuration changes.
2. Call gcluster create <blueprint> -w to overwrite the deployment directory.
3. Follow the instructions in the terminal to deploy.

The following are examples of updates that can be made to a running cluster:

- Add a partition to the cluster or remove one from it
- Resize an existing partition
- Attach new network storage to an existing partition

NOTE: Changing the VM machine_type of a partition may not work with enable_reconfigure. It is better to create a new partition and delete the old one.

This option has some additional requirements:

- The Pub/Sub API must be activated in the target project: gcloud services enable pubsub.googleapis.com --project "<<PROJECT_ID>>"
- The authenticated user in the local development environment (or wherever terraform apply is called) must have the Pub/Sub Admin (roles/pubsub.admin) IAM role.
- Python and the required Python packages must be installed with pip in the local development environment that deploys the cluster, for example with the following command:
```
pip3 install -r https://raw.githubusercontent.com/GoogleCloudPlatform/slurm-gcp/5.12.0/scripts/requirements.txt
```
For more information, see the description of this module.
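
The first two requirements can be scripted. The following is a minimal Python sketch, not part of this module, that only builds the two gcloud invocations: the `gcloud services enable` command quoted above and a `gcloud projects add-iam-policy-binding` call that grants `roles/pubsub.admin`. The function name, the `user:` member prefix, and the placeholder project ID and email are illustrative assumptions. Granting the role also assumes the caller is allowed to change the project's IAM policy.

```python
import shlex
from typing import List


def reconfigure_prereq_commands(project_id: str, user_email: str) -> List[List[str]]:
    """Build the gcloud commands for the Pub/Sub requirements listed above.

    The function name and arguments are hypothetical; nothing is executed here.
    """
    if not project_id or not user_email:
        raise ValueError("project_id and user_email must be non-empty")
    return [
        # Activate the Pub/Sub API in the target project.
        ["gcloud", "services", "enable", "pubsub.googleapis.com",
         "--project", project_id],
        # Grant the Pub/Sub Admin role to the deploying user. This assumes a
        # user account; a service account would use the "serviceAccount:" prefix.
        ["gcloud", "projects", "add-iam-policy-binding", project_id,
         "--member", f"user:{user_email}",
         "--role", "roles/pubsub.admin"],
    ]


if __name__ == "__main__":
    # Placeholder values; replace them with your own project and account.
    for argv in reconfigure_prereq_commands("my-project", "me@example.com"):
        print(shlex.join(argv))
```

Printing the commands rather than running them keeps the sketch safe to try; each printed line can be reviewed and pasted into a shell.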

Custom Images

For more information on creating valid custom images for the controller VM instance or for custom instance templates, see our vm-images.md documentation page.

GPU Support

More information on GPU support in Slurm on GCP and other Cluster Toolkit modules can be found at docs/gpu-support.md.

Placement Max Distance

When using enable_placement with Slurm, Google Compute Engine will attempt to place VMs as physically close together as possible. Capacity constraints at the time of VM creation may still force VMs to be spread across multiple racks. Google provides the max-distance flag, which can be used to control the maximum spreading allowed. Read more about max-distance in the official docs.

After deploying a Slurm cluster, you can use the following steps to manually configure the max-distance parameter.

1. Make sure your blueprint has the enable_placement: true setting for Slurm partitions.
2. Deploy the Slurm cluster and wait for the deployment to complete.
3. SSH to the deployed Slurm controller.
4. Apply the following edit to /slurm/scripts/config.yaml:

# Replace
enable_slurm_gcp_plugins: false

# With
enable_slurm_gcp_plugins:
  max_hops:
    max_hops: 1

The max_hops parameter is used as the max-distance argument. In the example above, a value of 1 restricts VMs to placement on the same rack.
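
If you prefer to script the edit above rather than make it by hand, the following Python sketch shows one way to do it as a plain text substitution. It assumes the file contains the exact line `enable_slurm_gcp_plugins: false` at the top level, as in the "Replace" snippet. The function name and the sample text in the demo are illustrative, not part of slurm-gcp.

```python
import re


def set_max_hops(config_text: str, max_hops: int = 1) -> str:
    """Return config_text with the plugin flag replaced by a max_hops block."""
    block = (
        "enable_slurm_gcp_plugins:\n"
        "  max_hops:\n"
        f"    max_hops: {max_hops}"
    )
    # Match only the top-level line that disables the plugins.
    pattern = re.compile(
        r"^enable_slurm_gcp_plugins:[ \t]*false[ \t]*$", re.MULTILINE
    )
    # A callable replacement avoids any backslash processing of the block.
    new_text, count = pattern.subn(lambda _match: block, config_text, count=1)
    if count == 0:
        raise ValueError("line 'enable_slurm_gcp_plugins: false' not found")
    return new_text


if __name__ == "__main__":
    # Hypothetical excerpt; the real config.yaml contains many more keys.
    sample = "some_key: some_value\nenable_slurm_gcp_plugins: false\n"
    print(set_max_hops(sample, max_hops=1))
```

On the controller, the same function could be applied to the contents of /slurm/scripts/config.yaml after making a backup copy. The function raises an error instead of silently doing nothing when the expected line is missing, for example because the plugins were already configured.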

You can confirm that `max-distance` was applied by running the following command while jobs are running:

gcloud beta compute resource-policies list \
  --format='yaml(name,groupPlacementPolicy.maxDistance)'
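
To check the result from a script rather than by eye, one option is to request JSON instead of YAML and inspect the `maxDistance` field. The sketch below assumes the same `gcloud beta compute resource-policies list` command with the format changed to `json(...)`, and it assumes that a policy without a max-distance value omits the field. The function names are illustrative.

```python
import json
import subprocess
from typing import Dict, Optional


def parse_max_distance(policies_json: str) -> Dict[str, Optional[int]]:
    """Map each resource policy name to its maxDistance (None if absent)."""
    result: Dict[str, Optional[int]] = {}
    for policy in json.loads(policies_json):
        placement = policy.get("groupPlacementPolicy") or {}
        distance = placement.get("maxDistance")
        result[policy["name"]] = None if distance is None else int(distance)
    return result


def list_max_distance() -> Dict[str, Optional[int]]:
    """Run gcloud (must be installed and authenticated) and parse its output."""
    completed = subprocess.run(
        ["gcloud", "beta", "compute", "resource-policies", "list",
         "--format=json(name,groupPlacementPolicy.maxDistance)"],
        check=True, capture_output=True, text=True,
    )
    return parse_max_distance(completed.stdout)
```

Calling `list_max_distance()` while jobs are running returns a dictionary keyed by policy name. Policies created with the configuration above would be expected to report the value set for max_hops.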

Warning: If a zone lacks capacity, using a lower max-distance value (such as 1) is more likely to cause VM creation to fail.

Warning: /slurm/scripts/config.yaml will be overwritten if the blueprint is re-deployed using the enable_reconfigure flag.

Hybrid Slurm Clusters

For more information on how to configure an on-premises Slurm cluster with hybrid cloud partitions, see the schedmd-slurm-gcp-v5-hybrid module and the extended instructions in our docs.

Support

The Cluster Toolkit team maintains the wrapper around the slurm-on-gcp terraform modules. For support with the underlying modules, see the instructions in the slurm-gcp README.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Requirements

Name	Version
terraform	>= 1.1
google	>= 3.83

Providers

Name	Version
google	>= 3.83

Modules

Name	Source	Version
slurm_controller_instance	github.com/GoogleCloudPlatform/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_controller_instance	5.12.0
slurm_controller_template	github.com/GoogleCloudPlatform/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_instance_template	5.12.0

Resources

Name	Type
google_compute_default_service_account.default	data source
google_compute_image.slurm	data source

Inputs

Name	Description	Type	Default	Required
access_config	Access configurations, i.e. IPs via which the VM instance can be accessed via the Internet.	list(object({ nat_ip = string network_tier = string }))	`[]`	no
additional_disks	List of maps of disks.	list(object({ disk_name = string device_name = string disk_type = string disk_size_gb = number disk_labels = map(string) auto_delete = bool boot = bool }))	`[]`	no
allow_automatic_updates	If false, disables automatic system package updates on the created instances. This feature is only available on supported images (or images derived from them). For more details, see https://cloud.google.com/compute/docs/instances/create-hpc-vm#disable_automatic_updates	`bool`	`true`	no
can_ip_forward	Enable IP forwarding, for NAT instances for example.	`bool`	`false`	no
cgroup_conf_tpl	Slurm cgroup.conf template file path.	`string`	`null`	no
cloud_parameters	cloud.conf options.	object({ no_comma_params = bool resume_rate = number resume_timeout = number suspend_rate = number suspend_timeout = number })	{ "no_comma_params": false, "resume_rate": 0, "resume_timeout": 300, "suspend_rate": 0, "suspend_timeout": 300 }	no
cloudsql	Use this database instead of the one on the controller. server_ip : Address of the database server. user : The user to access the database as. password : The password, given the user, to access the given database. (sensitive) db_name : The database to access.	object({ server_ip = string user = string password = string # sensitive db_name = string })	`null`	no
compute_startup_script	Startup script used by the compute VMs.	`string`	`""`	no
compute_startup_scripts_timeout	The timeout (seconds) applied to the compute_startup_script. If any script exceeds this timeout, then the instance setup process is considered failed and handled accordingly. NOTE: When set to 0, the timeout is considered infinite and thus disabled.	`number`	`300`	no
controller_startup_script	Startup script used by the controller VM.	`string`	`""`	no
controller_startup_scripts_timeout	The timeout (seconds) applied to the controller_startup_script. If any script exceeds this timeout, then the instance setup process is considered failed and handled accordingly. NOTE: When set to 0, the timeout is considered infinite and thus disabled.	`number`	`300`	no
deployment_name	Name of the deployment.	`string`	n/a	yes
disable_controller_public_ips	If set to false, the controller will have a random public IP assigned to it. Ignored if access_config is set.	`bool`	`true`	no
disable_default_mounts	Disable default global network storage from the controller: /usr/local/etc/slurm, /etc/munge, /home, and /apps. Warning: If these are disabled, the slurm etc and munge dirs must be added manually, or some other mechanism must be used to synchronize the slurm conf files and the munge key across the cluster.	`bool`	`false`	no
disable_smt	Disables Simultaneous Multi-Threading (SMT) on instance.	`bool`	`true`	no
disk_auto_delete	Whether or not the boot disk should be auto-deleted.	`bool`	`true`	no
disk_labels	Labels specific to the boot disk. These will be merged with var.labels.	`map(string)`	`{}`	no
disk_size_gb	Boot disk size in GB.	`number`	`50`	no
disk_type	Boot disk type.	`string`	`"pd-ssd"`	no
enable_bigquery_load	Enable loading of cluster job usage into BigQuery.	`bool`	`false`	no
enable_cleanup_compute	Enables automatic cleanup of compute nodes and resource policies (e.g. placement groups) managed by this module, when cluster is destroyed. NOTE: Requires Python and pip packages listed at the following link: https://github.com/GoogleCloudPlatform/slurm-gcp/blob/3979e81fc5e4f021b5533a23baa474490f4f3614/scripts/requirements.txt WARNING: Toggling this may impact the running workload. Deployed compute nodes may be destroyed and their jobs will be requeued.	`bool`	`false`	no
enable_cleanup_subscriptions	Enables automatic cleanup of pub/sub subscriptions managed by this module, when cluster is destroyed. NOTE: Requires Python and pip packages listed at the following link: https://github.com/GoogleCloudPlatform/slurm-gcp/blob/3979e81fc5e4f021b5533a23baa474490f4f3614/scripts/requirements.txt WARNING: Toggling this may temporarily impact var.enable_reconfigure behavior.	`bool`	`false`	no
enable_confidential_vm	Enable the Confidential VM configuration. Note: the instance image must support this option.	`bool`	`false`	no
enable_devel	Enables development mode. Not for production use.	`bool`	`false`	no
enable_external_prolog_epilog	Automatically enable a script that will execute prolog and epilog scripts shared under /opt/apps from the controller to compute nodes.	`bool`	`false`	no
enable_oslogin	Enables Google Cloud os-login for user login and authentication for VMs. See https://cloud.google.com/compute/docs/oslogin	`bool`	`true`	no
enable_reconfigure	Enables automatic Slurm reconfiguration when Slurm configuration changes (e.g. slurm.conf.tpl, partition details). Compute instances and resource policies (e.g. placement groups) will be destroyed to align with new configuration. NOTE: Requires Python and Google Pub/Sub API. WARNING: Toggling this will impact the running workload. Deployed compute nodes will be destroyed and their jobs will be requeued.	`bool`	`false`	no
enable_shielded_vm	Enable the Shielded VM configuration. Note: the instance image must support this option.	`bool`	`false`	no
enable_slurm_gcp_plugins	Enables calling hooks in scripts/slurm_gcp_plugins during cluster resume and suspend.	`any`	`false`	no
epilog_scripts	List of scripts to be used for Epilog. Programs for the slurmd to execute on every node when a user's job completes. See https://slurm.schedmd.com/slurm.conf.html#OPT_Epilog.	list(object({ filename = string content = string }))	`[]`	no
gpu	DEPRECATED: use var.guest_accelerator	object({ type = string count = number })	`null`	no
guest_accelerator	List of the type and count of accelerator cards attached to the instance.	list(object({ type = string, count = number }))	`[]`	no
instance_image	Defines the image that will be used in the Slurm controller VM instance. Expected Fields: name: The name of the image. Mutually exclusive with family. family: The image family to use. Mutually exclusive with name. project: The project where the image is hosted. For more information on creating custom images that comply with Slurm on GCP see the "Slurm on GCP Custom Images" section in docs/vm-images.md.	`map(string)`	{ "family": "slurm-gcp-5-12-hpc-centos-7", "project": "schedmd-slurm-public" }	no
instance_image_custom	A flag that designates that the user is aware that they are requesting to use a custom and potentially incompatible image for this Slurm on GCP module. If the field is set to false, only the compatible families and project names will be accepted. The deployment will fail with any other image family or name. If set to true, no checks will be done. See: https://goo.gle/hpc-slurm-images	`bool`	`false`	no
instance_template	Self link to a custom instance template. If set, other VM definition variables such as machine_type and instance_image will be ignored in favor of the provided instance template. For more information on creating custom images for the instance template that comply with Slurm on GCP see the "Slurm on GCP Custom Images" section in docs/vm-images.md.	`string`	`null`	no
labels	Labels, provided as a map.	`map(string)`	`{}`	no
login_startup_scripts_timeout	The timeout (seconds) applied to the login startup script. If any script exceeds this timeout, then the instance setup process is considered failed and handled accordingly. NOTE: When set to 0, the timeout is considered infinite and thus disabled.	`number`	`300`	no
machine_type	Machine type to create.	`string`	`"c2-standard-4"`	no
metadata	Metadata, provided as a map.	`map(string)`	`{}`	no
min_cpu_platform	Specifies a minimum CPU platform. Applicable values are the friendly names of CPU platforms, such as Intel Haswell or Intel Skylake. See the complete list: https://cloud.google.com/compute/docs/instances/specify-min-cpu-platform	`string`	`null`	no
network_ip	DEPRECATED: Use `static_ips` variable to assign an internal static ip address.	`string`	`null`	no
network_self_link	Network to deploy to. Either network_self_link or subnetwork_self_link must be specified.	`string`	`null`	no
network_storage	An array of network attached storage mounts to be configured on all instances.	list(object({ server_ip = string, remote_mount = string, local_mount = string, fs_type = string, mount_options = string, client_install_runner = map(string) mount_runner = map(string) }))	`[]`	no
on_host_maintenance	Instance availability Policy.	`string`	`"MIGRATE"`	no
partition	Cluster partitions as a list.	list(object({ compute_list = list(string) partition = object({ enable_job_exclusive = bool enable_placement_groups = bool network_storage = list(object({ server_ip = string remote_mount = string local_mount = string fs_type = string mount_options = string })) partition_conf = map(string) partition_feature = string partition_name = string partition_nodes = map(object({ access_config = list(object({ network_tier = string })) bandwidth_tier = string node_count_dynamic_max = number node_count_static = number enable_spot_vm = bool group_name = string instance_template = string maintenance_interval = string node_conf = map(string) reservation_name = string spot_instance_config = object({ termination_action = string }) })) partition_startup_scripts_timeout = number subnetwork = string zone_policy_allow = list(string) zone_policy_deny = list(string) zone_target_shape = string }) }))	`[]`	no
preemptible	Allow the instance to be preempted.	`bool`	`false`	no
project_id	Project ID to create resources in.	`string`	n/a	yes
prolog_scripts	List of scripts to be used for Prolog. Programs for the slurmd to execute whenever it is asked to run a job step from a new job allocation. See https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog.	list(object({ filename = string content = string }))	`[]`	no
region	Region where the instances should be created.	`string`	`null`	no
service_account	Service account to attach to the controller instance. If not set, the default compute service account for the given project will be used with the "https://www.googleapis.com/auth/cloud-platform" scope.	object({ email = string scopes = set(string) })	`null`	no
shielded_instance_config	Shielded VM configuration for the instance. Note: not used unless enable_shielded_vm is 'true'. enable_integrity_monitoring : Compare the most recent boot measurements to the integrity policy baseline and return a pair of pass/fail results depending on whether they match or not. enable_secure_boot : Verify the digital signature of all boot components, and halt the boot process if signature verification fails. enable_vtpm : Use a virtualized trusted platform module, which is a specialized computer chip you can use to encrypt objects like keys and certificates.	object({ enable_integrity_monitoring = bool enable_secure_boot = bool enable_vtpm = bool })	{ "enable_integrity_monitoring": true, "enable_secure_boot": true, "enable_vtpm": true }	no
slurm_cluster_name	Cluster name, used for resource naming and slurm accounting. If not provided it will default to the first 8 characters of the deployment name (removing any invalid characters).	`string`	`null`	no
slurm_conf_tpl	Slurm slurm.conf template file path.	`string`	`null`	no
slurmdbd_conf_tpl	Slurm slurmdbd.conf template file path.	`string`	`null`	no
source_image	DEPRECATED: Use `instance_image` instead.	`string`	`null`	no
source_image_family	DEPRECATED: Use `instance_image` instead.	`string`	`null`	no
source_image_project	DEPRECATED: Use `instance_image` instead.	`string`	`null`	no
static_ips	List of static IPs for VM instances.	`list(string)`	`[]`	no
subnetwork_project	The project that subnetwork belongs to.	`string`	`null`	no
subnetwork_self_link	Subnet to deploy to. Either network_self_link or subnetwork_self_link must be specified.	`string`	`null`	no
tags	Network tag list.	`list(string)`	`[]`	no
zone	Zone where the instances should be created. If not specified, instances will be spread across available zones in the region.	`string`	`null`	no

Outputs

Name	Description
cloud_logging_filter	Cloud Logging filter for cluster errors.
controller_instance_id	The server-assigned unique identifier of the controller compute instance.
pubsub_topic	Cluster Pub/Sub topic.
