Merge pull request #2132 from harshthakkar01/update-spack-wrfv3
Update spack wrf example and references to use Slurm V6
harshthakkar01 authored Jan 12, 2024
2 parents 1d11682 + 7a33133 commit 360f03a
Showing 2 changed files with 57 additions and 60 deletions.
65 changes: 29 additions & 36 deletions docs/tutorials/wrfv3/spack-wrfv3.md
@@ -5,18 +5,18 @@ easy for customers to deploy HPC environments on Google Cloud.

In this tutorial you will use the HPC Toolkit to:

* Deploy a [Slurm](https://github.com/SchedMD/slurm-gcp#readme) HPC cluster on
* Deploy a [Slurm](https://github.com/GoogleCloudPlatform/slurm-gcp#readme) HPC cluster on
Google Cloud
* Use [Spack](https://spack.io/) to install the Weather Research and Forecasting (WRF) Model application and all of
its dependencies
* Run a [Weather Research and Forecasting (WRF) Model](https://www.mmm.ucar.edu/weather-research-and-forecasting-model) job on your newly provisioned
cluster
* Tear down the cluster

Estimated time to complete:
The tutorial takes 2 hr. to complete,
of which 1.5 hr is for installing software
(without cache).

> **_NOTE:_** With a complete Spack cache, the tutorial takes 30 min.
@@ -75,7 +75,7 @@ which should be open in the Cloud Shell Editor (on the left).

This file describes the cluster you will deploy. It defines:

* the existing default network from your project
* a vpc network
* a monitoring dashboard with metrics on your cluster
* a definition of a custom Spack installation
* a startup script that
@@ -84,7 +84,6 @@ This file describes the cluster you will deploy. It defines:
* sets up a Spack environment including downloading an example input deck
* places a submission script on a shared drive
* a Slurm cluster
* a Slurm login node
* a Slurm controller
* An auto-scaling Slurm partition

@@ -106,24 +105,18 @@ contains the terraform needed to deploy your cluster.

## Deploy the Cluster

Use the following commands to run terraform and deploy your cluster.
Use the following command to deploy your cluster.

```bash
terraform -chdir=spack-wrfv3/primary init
terraform -chdir=spack-wrfv3/primary apply
./ghpc deploy spack-wrfv3
```
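
For reference, the `spack-wrfv3` deployment folder used above is generated from the blueprint earlier in this tutorial. A minimal sketch of the full create-then-deploy sequence (the blueprint path shown here is illustrative):

```bash
# Generate the spack-wrfv3 deployment folder from the blueprint (path is illustrative)
./ghpc create spack-wrfv3.yaml
# Deploy all resource groups in the generated deployment folder
./ghpc deploy spack-wrfv3
```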

The `terraform apply` command will generate a _plan_ that describes the Google
Cloud resources that will be deployed.

You can review the plan and then start the deployment by typing
**`yes [enter]`**.

The deployment will take about 30 seconds. There should be regular status updates
in the terminal.
You can also use the following commands to generate a plan that describes the Google Cloud resources that will be deployed.

If the `apply` is successful, a message similar to the following will be
displayed:
```bash
terraform -chdir=spack-wrfv3/primary init
terraform -chdir=spack-wrfv3/primary apply
```
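
To review the planned changes without applying them, the standard `terraform plan` subcommand can also be run against the same directory (optional; `apply` prints the same plan and waits for confirmation):

```bash
terraform -chdir=spack-wrfv3/primary plan
```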

<!-- Note: Bash blocks give "copy to cloud shell" option. -->
<!-- "shell" or "text" is used in places where command should not be run in cloud shell. -->
@@ -144,30 +137,30 @@ controller. This command can be used to view progress and check for completion
of the startup script:

```bash
gcloud compute instances get-serial-port-output --port 1 --zone us-central1-c --project <walkthrough-project-id/> slurm-spack-wrfv3-controller | grep google_metadata_script_runner
gcloud compute instances get-serial-port-output --port 1 --zone us-central1-c --project <walkthrough-project-id/> spackwrfv3-controller | grep google_metadata_script_runner
```
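
If you would rather not re-run that command by hand, one option is a small polling loop (a sketch that reuses the same gcloud invocation; the grep pattern matches the completion line shown below):

```bash
# Poll the controller's serial output every 60 seconds until the startup scripts finish
until gcloud compute instances get-serial-port-output --port 1 --zone us-central1-c \
    --project <walkthrough-project-id/> spackwrfv3-controller 2>/dev/null \
    | grep -q "Finished running startup scripts"; do
  echo "Startup scripts still running; checking again in 60 seconds..."
  sleep 60
done
```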

When the startup script has finished running you will see the following line as
the final output from the above command:
> _`slurm-spack-wrfv3-controller google_metadata_script_runner: Finished running startup scripts.`_
> _`spackwrfv3-controller google_metadata_script_runner: Finished running startup scripts.`_
Optionally while you wait, you can see your deployed VMs on Google Cloud
Console. Open the link below in a new window. Look for
`slurm-spack-wrfv3-controller` and `slurm-spack-wrfv3-login0`. If you don't
`spackwrfv3-controller`. If you don't
see your VMs make sure you have the correct project selected (top left).

```text
https://console.cloud.google.com/compute?project=<walkthrough-project-id/>
```
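
If you prefer the CLI, the same check can be done from Cloud Shell (a sketch; the name filter is illustrative and matches the deployment name used in this tutorial):

```bash
# List VMs in the project whose names contain the deployment prefix
gcloud compute instances list --project <walkthrough-project-id/> --filter="name~spackwrfv3"
```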

## Connecting to the login node
## Connecting to the controller node

Once the startup script has completed, connect to the login node.
Once the startup script has completed, connect to the controller node.

Use the following command to ssh into the login node from cloud shell:
Use the following command to ssh into the controller node from cloud shell:

```bash
gcloud compute ssh slurm-spack-wrfv3-login0 --zone us-central1-c --project <walkthrough-project-id/>
gcloud compute ssh spackwrfv3-controller --zone us-central1-c --project <walkthrough-project-id/>
```

You may be prompted to set up SSH. If so follow the prompts and if asked for a
@@ -191,15 +184,15 @@ following instructions:
https://console.cloud.google.com/compute?project=<walkthrough-project-id/>
```

1. Click on the `SSH` button associated with the `slurm-spack-wrfv3-login0`
1. Click on the `SSH` button associated with the `spackwrfv3-controller`
instance.

This will open a separate pop-up window with a terminal into our newly
created Slurm login VM.
created Slurm controller VM.

## Run a Job on the Cluster

**The commands below should be run on the Slurm login node.**
**The commands below should be run on the Slurm controller node.**

We will use the submission script (see line 122 of the blueprint) to submit a
Weather Research and Forecasting (WRF) Model job.
@@ -213,7 +206,7 @@ Weather Research and Forecasting (WRF) Model job.
2. Submit the job to Slurm to be scheduled:

```bash
sbatch /apps/wrfv3/submit_wrfv3.sh
sbatch /opt/apps/wrfv3/submit_wrfv3.sh
```

3. Once submitted, you can watch the job progress by repeatedly calling the
@@ -227,7 +220,7 @@ The `sbatch` command triggers Slurm to auto-scale up several nodes to run the job

You can refresh the `Compute Engine` > `VM instances` page and see that
additional VMs are being/have been created. These will be named something like
`slurm-spack-wrfv3-compute-0-0`.
`spackwrfv3-compute-0`.
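
While waiting, the queue can also be polled from the controller node; a minimal sketch using standard Slurm client commands:

```bash
# Print the queue every 10 seconds until it empties (Ctrl-C to stop early)
while [[ -n "$(squeue --noheader)" ]]; do
  squeue
  sleep 10
done
```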

When running `squeue`, observe the job status start as `CF` (configuring),
change to `R` (running) once the compute VMs have been created, and finally `CG`
@@ -247,7 +240,7 @@ about 5 minutes to run.
Several files will have been generated in the `test_run/` folder you created.

The `rsl.out.0000` file has information on the run. You can view this file by
running the following command on the login node:
running the following command on the controller node:

```bash
cat rsl.out.0000
@@ -268,9 +261,9 @@ https://console.cloud.google.com/monitoring/dashboards?project=<walkthrough-proj
To avoid incurring ongoing charges we will want to destroy our cluster.

For this we need to return to our cloud shell terminal. Run `exit` in the
terminal to close the SSH connection to the login node:
terminal to close the SSH connection to the controller node:

> **_NOTE:_** If you are accessing the login node terminal via a separate pop-up
> **_NOTE:_** If you are accessing the controller node terminal via a separate pop-up
> then make sure to call `exit` in the pop-up window.

```bash
@@ -280,7 +273,7 @@ exit
Run the following command in the cloud shell terminal to destroy the cluster:

```bash
terraform -chdir=spack-wrfv3/primary destroy -auto-approve
./ghpc destroy spack-wrfv3
```

When complete you should see something like:
52 changes: 28 additions & 24 deletions docs/tutorials/wrfv3/spack-wrfv3.yaml
@@ -26,7 +26,7 @@ deployment_groups:
- group: primary
modules:
- id: network1
source: modules/network/pre-existing-vpc
source: modules/network/vpc

- id: hpc_dash
source: modules/monitoring/dashboard
@@ -35,8 +35,8 @@
- id: spack-setup
source: community/modules/scripts/spack-setup
settings:
install_dir: /apps/spack
spack_ref: v0.19.0
install_dir: /opt/apps/spack
spack_ref: v0.20.0

- id: spack-execute
source: community/modules/scripts/spack-execute
@@ -88,7 +88,7 @@ deployment_groups:
# fi
# spack buildcache keys --install --trust
spack config --scope defaults add config:build_stage:/apps/spack/spack-stage
spack config --scope defaults add config:build_stage:/opt/apps/spack/spack-stage
spack config --scope defaults add -f /tmp/projections-config.yaml
spack config --scope site add -f /tmp/slurm-external-config.yaml
@@ -107,58 +107,62 @@
source: modules/scripts/startup-script
settings:
runners:
- type: shell
destination: remove_lustre_client.sh
content: |
#!/bin/bash
rm /etc/yum.repos.d/lustre-client.repo
- $(spack-execute.spack_runner)
- type: shell
destination: wrfv3_setup.sh
content: |
#!/bin/bash
source /apps/spack/share/spack/setup-env.sh
source /opt/apps/spack/share/spack/setup-env.sh
spack env activate wrfv3
chmod -R a+rwX /apps/spack/var/spack/environments/wrfv3
mkdir -p /apps/wrfv3
chmod a+rwx /apps/wrfv3
cd /apps/wrfv3
chmod -R a+rwX /opt/apps/spack/var/spack/environments/wrfv3
mkdir -p /opt/apps/wrfv3
chmod a+rwx /opt/apps/wrfv3
cd /opt/apps/wrfv3
wget --no-verbose https://www2.mmm.ucar.edu/wrf/bench/conus12km_v3911/bench_12km.tar.bz2
tar xjf bench_12km.tar.bz2
- type: data
destination: /apps/wrfv3/submit_wrfv3.sh
destination: /opt/apps/wrfv3/submit_wrfv3.sh
content: |
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node 30
source /apps/spack/share/spack/setup-env.sh
source /opt/apps/spack/share/spack/setup-env.sh
spack env activate wrfv3
# Check that wrf.exe exists
which wrf.exe
cd $SLURM_SUBMIT_DIR
cp /apps/wrfv3/bench_12km/* .
cp /opt/apps/wrfv3/bench_12km/* .
WRF=`spack location -i wrf`
ln -s $WRF/run/* .
scontrol show hostnames ${SLURM_JOB_NODELIST} > hostfile
mpirun -n 60 -hostfile hostfile -ppn ${SLURM_NTASKS_PER_NODE} wrf.exe
- id: compute_nodeset
source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
use: [network1]
settings:
node_count_dynamic_max: 20

- id: compute_partition
source: community/modules/compute/SchedMD-slurm-on-gcp-partition
use:
- network1
source: community/modules/compute/schedmd-slurm-gcp-v6-partition
use: [compute_nodeset]
settings:
partition_name: compute
max_node_count: 20

- id: slurm_controller
source: community/modules/scheduler/SchedMD-slurm-on-gcp-controller
source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
use:
- network1
- compute_partition
settings:
disable_controller_public_ips: false
controller_startup_scripts_timeout: 21600
controller_startup_script: $(controller-setup.startup_script)
login_node_count: 1

- id: slurm_login
source: community/modules/scheduler/SchedMD-slurm-on-gcp-login-node
use:
- network1
- slurm_controller
