
Simplified Script Warning #1809

Merged

Conversation

cdunbar13
Contributor

Stage 1 of the script warning. Adds simple warnings when startup scripts start and end, and a warning to the terminal when users log in, letting them know that the system uses startup scripts for configuration and how to check whether they're still running.
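
As a rough illustration of the start/end warnings (not the module's actual text), a wrapper around the generated startup script could emit banners along these lines; run_configured_runners below is a hypothetical placeholder for the module's runner execution, and the wording is illustrative:

#!/bin/bash
# Illustrative sketch only; wording and the runner-execution step are hypothetical.
echo "** Startup scripts are starting; this machine is still being configured. **"
run_configured_runners  # hypothetical placeholder for executing the configured runners
echo "** Startup scripts have finished. **"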

@cdunbar13 cdunbar13 added the release-improvements Added to release notes under the "Improvements" heading. label Oct 3, 2023
@cdunbar13 cdunbar13 added release-module-improvements Added to release notes under the "Module Improvements" heading. and removed release-improvements Added to release notes under the "Improvements" heading. labels Oct 3, 2023
Member

@tpdownes tpdownes left a comment


Looks good, although I think we need some changes, primarily in the text: away from "it's unsafe" toward the more strictly factual "services are (not yet) running". We anticipate replacing the prompt message soon in a 2nd PR, but it's safer to assume that will land in the next release or two, so we should get the message accurate across our range of operating systems.
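
For instance, a more strictly factual login message could simply report whether the startup-script service is still active. The sketch below assumes a systemd-based image where startup scripts run under google-startup-scripts.service (other images would need a different check); the file path and wording are illustrative, not the PR's final text:

# /etc/profile.d/startup-script-status.sh -- illustrative path and wording
if systemctl is-active --quiet google-startup-scripts.service 2>/dev/null; then
  echo "NOTICE: startup scripts are still running; some services may not be available yet."
  echo "Follow progress with: sudo journalctl -u google-startup-scripts.service -f"
fi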

@tpdownes tpdownes assigned cdunbar13 and unassigned tpdownes Oct 6, 2023
@cdunbar13 cdunbar13 assigned tpdownes and unassigned cdunbar13 Oct 6, 2023
@tpdownes
Member

tpdownes commented Oct 6, 2023

For the record, I am testing with this blueprint to check vm-instance behavior and behavior when wrapped with Slurm startup scripts:

# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---

blueprint_name: hpc-slurm

vars:
  project_id:  ## Set GCP Project ID Here ##
  deployment_name: hpc-small
  region: us-central1
  zone: us-central1-a

# Documentation for each of the modules used below can be found at
# https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/modules/README.md

deployment_groups:
- group: primary
  modules:
  # Source is an embedded module, denoted by "modules/*" without ./, ../, /
  # as a prefix. To refer to a local module, prefix with ./, ../ or /
  # Example - ./modules/network/vpc
  - id: network1
    source: modules/network/vpc

  - id: script
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: shell
        destination: install_nvidia_drivers.sh
        content: |
          #!/bin/bash
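          # Sleep for 10 minutes to emulate a long-running startup script (e.g., a driver install)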
          sleep 600

  - id: vm
    source: modules/compute/vm-instance
    use:
    - network1
    - script
    settings:
      machine_type: c2-standard-4

  - id: debug_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 4
      machine_type: n2-standard-2

  - id: debug_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use:
    - network1
    - debug_node_group
    settings:
      partition_name: debug
      exclusive: false # allows nodes to stay up after jobs are done
      enable_placement: false # the default is: true
      is_default: true

  - id: compute_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 20

  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use:
    - network1
    - compute_node_group
    settings:
      partition_name: compute

  - id: h3_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 20
      machine_type: h3-standard-88
      # H3 does not support pd-ssd and pd-standard
      # https://cloud.google.com/compute/docs/compute-optimized-machines#h3_disks
      disk_type: pd-balanced

  - id: h3_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use:
    - network1
    - h3_node_group
    settings:
      partition_name: h3

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
    use:
    - network1
    - debug_partition
    - compute_partition
    - h3_partition
    - script
    settings:
      disable_controller_public_ips: false

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
    use:
    - network1
    - slurm_controller
    - script
    settings:
      machine_type: n2-standard-4
      disable_login_public_ips: false

Member

@tpdownes tpdownes left a comment


Please squash the commits and allow "PR-test-slurm-gcp-v5-startup-scripts" to succeed before merging.

@tpdownes tpdownes assigned cdunbar13 and unassigned tpdownes Oct 6, 2023
@cdunbar13 cdunbar13 merged commit ed632b3 into GoogleCloudPlatform:develop Oct 6, 2023
6 of 30 checks passed
@cdunbar13 cdunbar13 deleted the script-run-warning-stage-1 branch October 6, 2023 17:31
@cboneti cboneti mentioned this pull request Oct 18, 2023