Add local ssd RAID0 startup script #2720
Conversation
Force-pushed from f7a69d6 to 2d09539
Force-pushed from 2d09539 to 34d81a7
Force-pushed from 34d81a7 to d7e7ed6
I want to review the implementation further for refinement. I have tested the following blueprint:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
blueprint_name: startup-vm-instance
vars:
  project_id: ## Set project id here
  deployment_name: testfix
  region: asia-southeast1
  zone: asia-southeast1-b
  bandwidth_tier: gvnic_enabled

deployment_groups:
- group: first
  modules:
  - id: network1
    source: modules/network/vpc

  - id: script
    source: modules/scripts/startup-script
    settings:
      setup_raid: true

  - id: vm1
    source: modules/compute/vm-instance
    use:
    - network1
    - script
    settings:
      machine_type: g2-standard-24
      local_ssd_count: 2
The VM in this blueprint is running CentOS 7. I still need to test other OSes that are not EOL, and I understand there may be an issue with Debian's behavior upon reboot. I have successfully tested this blueprint with:
- rebooting: data is preserved and the array is remounted ("WAI")
- powering off: data is lost, but the new disks are formatted and mounted ("WAI")
- powering off with the preview flag to preserve data: data is preserved and the array is remounted ("WAI")
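As a rough sketch (mine, not part of the review), a blueprint like the one above can be deployed with the Cluster Toolkit CLI; the exact flags depend on the Toolkit version, and the project ID below is a placeholder:

./ghpc create blueprint.yaml --vars project_id=<your-project-id>
./ghpc deploy testfix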
Force-pushed from d7e7ed6 to 3d6e0ca
I did some research into the behavior of mdadm on modern Linux systems. Arrays will be auto-assembled with a predictable name at /dev/md/raid_name if you supply a name during creation. Additionally, the hostname of the machine must match the hostname the array was created on. In practice, I believe there are boot-time race conditions which prevent hostname matching from being 100% reliable even on the same host. We can work around this by supplying --homehost=any during RAID creation.

If you follow these "rules", then you can rely on the system to assemble the RAID array at /dev/md/raid_name and ignore the numeric device names. If that file does not exist, then the RAID array must be created. This occurs when the VM is booted for the first time, or when it is powered off and powered on without disabling local SSD discard.
This won't handle a Slurm-GCP power off/on cycle but it's a good first step.
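To make the convention concrete, here is a minimal sketch (mine, not code from this PR) of the create-vs-assemble decision; the array name and device paths are illustrative:

#!/bin/bash
RAID_NAME=localssd  # illustrative array name

if [ -e "/dev/md/${RAID_NAME}" ]; then
  # mdadm auto-assembled the array at boot under its predictable name,
  # so the data on the local SSDs survived (e.g. a plain reboot).
  echo "RAID array already assembled at /dev/md/${RAID_NAME}"
else
  # First boot, or the local SSD contents were discarded on power off:
  # create the array with a name and --homehost=any so future boots
  # assemble it at the predictable path regardless of hostname.
  mdadm --create "/dev/md/${RAID_NAME}" --name="${RAID_NAME}" --homehost=any \
    --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme0n2
fi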
Force-pushed from fa3467d to 31bf852
4 very small changes that I'd like to see. I'm going to begin testing with those changes in place.
Force-pushed from 7b122c0 to a7148b0
LGTM. I want @nick-stroud to weigh in on the input variable format, and then we're done.
Force-pushed from a7148b0 to da5fa89
Force-pushed from da5fa89 to cd85816
We missed a change that causes the accidental inclusion of the runner in all invocations of startup-script.
Co-authored-by: Tom Downes <tpdownes@users.noreply.github.com>
Merged commit caf9f82 into GoogleCloudPlatform:develop
This PR adds a startup script that checks for existing local SSDs and assembles them into a RAID0 array if there are 2 or more.
This script has been tested manually with slurm-gcp v6 and vm-instance on CentOS, Debian, Rocky, and Ubuntu.
If there are at least 2 local SSDs attached, a RAID device for them will be created; running
lsblk -o NAME,FSTYPE,LABEL,UUID,MOUNTPOINT
shows the resulting layout. When the number of local SSDs present is 0 or 1, the script exits early.
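As a rough illustration of that counting and early-exit behavior (my sketch; the merged script's device glob and messages may differ, and the google-local-nvme-ssd-* pattern assumes NVMe local SSDs):

#!/bin/bash
# Count attached local SSDs and exit early when striping is not useful.
shopt -s nullglob
DEVICES=(/dev/disk/by-id/google-local-nvme-ssd-*)

if [ "${#DEVICES[@]}" -lt 2 ]; then
  echo "Found ${#DEVICES[@]} local SSD(s); skipping RAID0 setup."
  exit 0
fi
echo "Creating a RAID0 array from ${#DEVICES[@]} local SSDs."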
Manual Testing for rebooting:
- Ran lsblk -o NAME,FSTYPE,LABEL,UUID,MOUNTPOINT and saved the output.
- Ran sudo reboot to reboot the instance.
- Ran lsblk -o NAME,FSTYPE,LABEL,UUID,MOUNTPOINT again and compared the UUIDs of the RAID0 local SSDs before and after the reboot.

A reboot is successful if the RAID device is recreated with the local SSDs that were up prior to the reboot. Rebooting was successful for all the instances.
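A minimal sketch of that before/after comparison (file paths are illustrative):

lsblk -o NAME,FSTYPE,LABEL,UUID,MOUNTPOINT > "$HOME/lsblk-before.txt"
sudo reboot
# after the instance is back up and you have reconnected:
lsblk -o NAME,FSTYPE,LABEL,UUID,MOUNTPOINT > "$HOME/lsblk-after.txt"
diff "$HOME/lsblk-before.txt" "$HOME/lsblk-after.txt"  # RAID member UUIDs should match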
Manual Testing for restarting:
- Ran lsblk -o NAME,FSTYPE,LABEL,UUID,MOUNTPOINT.
- Stopped and started the instance.
- Ran lsblk -o NAME,FSTYPE,LABEL,UUID,MOUNTPOINT again.

A restart is successful if the RAID device is set up with the local SSDs. Restarting was successful for all the instances except the Slurm-GCP instances. The current problem is that the Slurm setup detects that Slurm is already installed and exits early, which means the startup script does not get executed.
Cases covered by this PR:
Cases that require more thought: