Releases: GoogleCloudPlatform/cluster-toolkit
v1.41.0 Adoption of Slurm 24.05 and Improvements to GKE Support
What's Changed
Key New Features 🎉
New Modules 🧱
- resource-policy module implemented by @sharabiani in #3066
- gke-topology-scheduler module implemented by @sharabiani in #3080
- add GKE support for parallelstore through gke-storage module by @chengcongdu in #3120
Module Improvements 🔨
- Added compatibility check for GPUDirect and GKE version by @sharabiani in #3079
- Support template file for kueue configuration in kubectl-apply module by @sharabiani in #3111
- Implement xpk-gke-a3-megagpu blueprint by @sharabiani in #3108
- Use sackd for the login nodes by @mr0re1 in #3126
- gke-node-pool default name conflict fixed by @sharabiani in #3127
- improve dws_flex ux by @abbas1902 in #3122
- Include deployment name in Spack and Ramble bucket names (like startup-script) by @rohitramu in #3136
Improvements 🛠
- Create and use non-default service accounts in GKE by @annuay-google in #3123
- Added documentation on cloud-ops-agent installation and stackdriver removal by @jrossthomson in #3029
- Ensure local SSD filesystem is assembled into a RAID even upon power off/on cycles by @tpdownes in #3129
Deprecations 💤
- Freeze slurm-gcp v5 hybrid blueprints with the latest cluster toolkit version support by @harshthakkar01 in #3117
- Update Slurm-gcp v5 deprecation details by @harshthakkar01 in #3118
- Update badge for slurm-gcp v5 and slurm-gcp v6 by @harshthakkar01 in #3116
Version Updates ⏫
- Update A3-High NeMo to 24.07 and NCCL solution to latest recommended values by @akiki-liang0 in #3130
- Update Slurm-GCP to 6.8.2 by @tpdownes in #3132
Bug fixes 🐞
- Fixed the exact number constraint problem for additional vpcs in gpu_direct checks by @sharabiani in #3078
- Provide explicit project information by @wiktorn in #3060
- Chrome Remote Desktop: increase resilience of apt operations by @tpdownes in #3093
- Add mount parallelstore service to mount parallelstore for every reboot by @harshthakkar01 in #3125
New Contributors
- @akiki-liang0 made their first contribution in #3130
- @ighosh98 made their first contribution in #3124
Full Changelog: v1.40.1...v1.41.0
v1.40.1 Fix issue that affected GKE blueprints due to dynamic provisioning
What's Changed
Other changes
- Revert PR#3046 and add more line breaks for readability by @ankitkinra in #3115
Full Changelog: v1.40.0...v1.40.1
v1.40.0: A3 Mega and A3 High families supported in GKE
What's Changed
Important
All HPC VM images based upon CentOS 7 have been deprecated. This means that
referring to the "hpc-centos-7" family in the "cloud-hpc-image-public"
project will fail. We recommend migrating to the "hpc-rocky-linux-8" family,
which is the new default throughout the Toolkit. If CentOS 7 is truly needed,
the final HPC CentOS 7 image can be used by its name: "hpc-centos-7-v20240712".
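The migration described above can be sketched as a small helper that rewrites an image reference. This is only an illustration and not part of the Toolkit: the function name `migrate_image` and the dictionary shape (`family`, `name`, and `project` keys) are assumptions made for this example. Only the family, image, and project names are taken from the note above.

```python
# Hypothetical helper, not part of the Cluster Toolkit.
# The dict shape (family / name / project) is an assumption for illustration.
DEPRECATED_FAMILY = "hpc-centos-7"
DEFAULT_FAMILY = "hpc-rocky-linux-8"
FINAL_CENTOS7_IMAGE = "hpc-centos-7-v20240712"
IMAGE_PROJECT = "cloud-hpc-image-public"


def migrate_image(image: dict, keep_centos7: bool = False) -> dict:
    """Return a copy of `image` that no longer refers to the deprecated family."""
    uses_deprecated_family = (
        image.get("project") == IMAGE_PROJECT
        and image.get("family") == DEPRECATED_FAMILY
    )
    if not uses_deprecated_family:
        return dict(image)  # nothing to migrate

    # Drop the deprecated family, keep every other key untouched.
    migrated = {key: value for key, value in image.items() if key != "family"}
    if keep_centos7:
        # Pin the final CentOS 7 image by name instead of by family.
        migrated["name"] = FINAL_CENTOS7_IMAGE
    else:
        migrated["family"] = DEFAULT_FAMILY
    return migrated


if __name__ == "__main__":
    old = {"family": "hpc-centos-7", "project": "cloud-hpc-image-public"}
    print(migrate_image(old))
    print(migrate_image(old, keep_centos7=True))
```

Passing `keep_centos7=True` corresponds to the fallback of pinning the final image by name. The default path switches to the recommended family.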
Key New Features 🎉
- GKE A3 High blueprint and GKE A3 Mega blueprint with automated GPU networking performance enhancements
- Add enable-maintenance-reservation flag in slurm to control reservation for scheduled maintenance by @harshthakkar01 in #2987
- adding documentation for versioned blueprint feature by @RachaelSTamakloe in #3055
- adding unit test for version blueprint caching mechanism by @RachaelSTamakloe in #3052
New Modules 🧱
- implement kubectl-apply module by @sharabiani in #2980
Module Improvements 🔨
- Default to zonal bulkInsert by @mr0re1 in #3005
- Add machine type availability checks by @annuay-google in #3003
- add support for enabling tcpx/o in a3 and a3mega vm, provide script for injecting rxdm sidecar and other required components into user workload by @chengcongdu in #3012
- support ghpc_stage function in kubectl-apply module by @sharabiani in #3036
- Validate Reservations in GKE Blueprints by @arajmane-g in #3024
- Fix multivpc missing region by @wiktorn in #3046
- Add initial_node_count support to gke-node-pool by @sharabiani in #3068
Improvements 🛠
- Update gVNIC driver in A3 Mega solution by @tpdownes in #2957
- Implement udev-based approach to mounting aperture devices by @tpdownes in #2955
- Update Debian 12 image in A3 Mega solution by @tpdownes in #2958
- adding module cache to prevent repeated module downloads during modul… by @RachaelSTamakloe in #3010
- add additional vpc validation for a3/a3mega machine by @chengcongdu in #3049
- Adds option to allow Kueue/Jobset to be installed on a GKE cluster via blueprints by @ankitkinra in #3017
- update readme for gpudirect by @chengcongdu in #3059
Deprecations 💤
- SlurmGCP V6. Remove CentOS7 image support. by @mr0re1 in #3038
- removing deprecated spack setup variables by @RachaelSTamakloe in #3040
- removing deprecated ramble setup variables by @RachaelSTamakloe in #3041
Version Updates ⏫
- Update NeMo 23.11 to 24.07 by @akiki-liang0 in #3090
Bug fixes 🐞
- Retry mounting daos container by @harshthakkar01 in #3045
- add argparse dependency to cloud build by @chengcongdu in #3057
- Allow users to provide a commit hash instead of git tag for Spack and Ramble installations by @rohitramu in #3073
- resolving error when var.initial_node_count is null by @RachaelSTamakloe in #3081
- A3 High blueprint prolog solution updates by @tpdownes in #3088
Other changes
- NeMo readme instructions for preloading gpt2 tokenizer by @koallison in #3075
New Contributors
- @koallison made their first contribution in #3075
- @akiki-liang0 made their first contribution in #3090
Full Changelog: v1.39.0...v1.40.0
v1.39.0: Slurm reservations during maintenance windows, Improved GKE Support, removed CentOS 7 references
What's Changed
Key New Features 🎉
- Add reservation support in slurm sync for scheduled maintenance by @harshthakkar01 in #2880
- Support multivpc with GKE by @sharabiani in #2797
- adding optional fields to redirect use of embedded modules to pull fr… by @RachaelSTamakloe in #2945
Module Improvements 🔨
- Make CloudSQL secret replication configurable by @dgouju in #2828
- GKE Blueprints to support reservations by @arajmane-g in #2891
- Expose maintenance interval as a blueprint setting for node pools in GKE by @annuay-google in #2971
- Support named placements in GKE node pools by @arajmane-g in #2969
- Add machine type availability checks to slurm-gcp-v6-nodeset by @annuay-google in #2962
- Revisit the Reservation Interface for GKE Blueprints by @arajmane-g in #2997
Improvements 🛠
- Add sort_nodes.py by @mr0re1 in #2853
- replacing centos7 with rocky8 in vm-instance modules by @RachaelSTamakloe in #2900
- replacing centos7 with rocky8 in nfs-server modules by @RachaelSTamakloe in #2901
- replacing centos7 with rocky8 in packer modules by @RachaelSTamakloe in #2899
- Update batch image to hpc-rocky-linux-8 by @ankitkinra in #2884
- OFE - various updates and fixes by @scott-nag in #2921
- Don't set automaticRestart: false by @mr0re1 in #2981
Bug fixes 🐞
- Add slurmgcp-managed infix to resource policy name by @mr0re1 in #2892
- Move pytest and other package installation to make by @annuay-google in #2890
- Prevent use of google provider 6.0 where breaking changes are in use by @tpdownes in #2978
- Fix local_ssd_config issue that forces node-pool recreation by @sharabiani in #2968
- kubernetes provider added to gke-cluster module by @sharabiani in #2985
- Fix for cleanup script. The last input is optional by @cdunbar13 in #2993
- Catch "None" fields in slurm job datetime data for BigQuery by @fdmalone in #2992
Other changes
- Use local-ssd for enroot temp space. by @samskillman in #3011
New Contributors
- @scott-nag made their first contribution in #2921
- @abbas1902 made their first contribution in #2956
- @fdmalone made their first contribution in #2992
Full Changelog: v1.38.0...v1.39.0
v1.38.0: Slurm GCP v6 for a3-highgpu-8g and added ability to disable automatic updates
What's Changed
Key New Features 🎉
- Add Slurm-GCP v6 based solution for provisioning a3-highgpu-8g compute nodes by @tpdownes in #2859
- Add allow_automatic_updates flag by @rohitramu in #2778
- Update slurm-gcp module to use custom endpoints. by @cdunbar13 in #2653
- Add local ssd RAID0 startup script by @alyssa-sm in #2720
New Modules 🧱
- Move GKE Modules to Core by @chengcongdu in #2758
Module Improvements 🔨
- Move slurm_files to the repo. by @mr0re1 in #2739
- Fix cleanup compute for different versions of gcloud by @cdunbar13 in #2794
- change default disk_type for GKE nodepool to null by @chengcongdu in #2818
- Add instance_properties var to nodeset by @mr0re1 in #2843
- Enable local SSD formatting solution to set POSIX permissions by @tpdownes in #2863
- support for min_cpu_platform usage on vm-instance by @RachaelSTamakloe in #2873
Improvements 🛠
- Gke optional accelerator by @ankitkinra in #2736
- add test for gke n2 pool with default driver by @chengcongdu in #2811
- Update local ssd examples to use local ssd startup solution by @alyssa-sm in #2870
- Update a3-megagpu-8 example to use local ssd solution by @alyssa-sm in #2871
Deprecations 💤
Version Updates ⏫
Bug fixes 🐞
- Fix construction of cloud.conf by @mr0re1 in #2810
- SlurmGCP. Fix broken --trace-api flag. by @mr0re1 in #2817
- SlurmGCP6. Fix nodes stuck in down* state. by @mr0re1 in #2856
- SlurmGCP. Fix bugs around nodeset zones by @mr0re1 in #2864
- Roll back changes in go.mod to release v1.37.2 by @nick-stroud in #2934
New Contributors
- @chengcongdu made their first contribution in #2758
- @ctk21 made their first contribution in #2761
- @arajmane-g made their first contribution in #2854
Full Changelog: v1.37.2...v1.38.0
v1.37.2 Fix SlurmGCP cleanup of resource policies
v1.37.1: Documentation update
Fix minor typographical errors in documentation
Full Changelog: v1.37.0...v1.37.1
v1.37.0
The HPC Toolkit has been rebranded to Cluster Toolkit. More details will follow shortly. The GitHub repository has been renamed to match. This should not break existing workflows. References to the old name should seamlessly redirect to the updated repo. The binary has been renamed to gcluster (formerly ghpc), but ghpc has been symlinked and will continue to work. If any unexpected behavior is noticed as part of this transition, please report it.
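Scripts that need to run against both older and newer installations could prefer the new binary name and fall back to the old one. The following is a minimal sketch, assuming both names are resolved through `PATH`. The function `find_toolkit_binary` is hypothetical and is not provided by the Toolkit.

```python
import shutil
from typing import Callable, Optional


def find_toolkit_binary(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the path of gcluster if present, otherwise of the legacy ghpc."""
    # Try the new name first, then the legacy name.
    for name in ("gcluster", "ghpc"):
        path = which(name)
        if path:
            return path
    raise FileNotFoundError("neither gcluster nor ghpc was found on PATH")


if __name__ == "__main__":
    try:
        print(find_toolkit_binary())
    except FileNotFoundError as err:
        print(err)
```

The `which` parameter exists only so the lookup can be replaced in tests. By default it uses `shutil.which` from the standard library.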
What's Changed
Key New Features 🎉
- Rename binary ghpc -> gcluster by @mr0re1 in #2813
- Update references to HPC Toolkit to Cluster Toolkit by @alyssa-sm in #2829
Other changes
- Roll version number to v1.37.0 by @nick-stroud in #2839
Full Changelog: v1.36.1...v1.37.0
v1.36.1: Fix Slurm GCP Cloud Parameter Defaults
What's Changed
Bug fixes 🐞
- Hot fix to add defaults to cloud params by @nick-stroud in #2812
Full Changelog: v1.36.0...v1.36.1
v1.36.0 - Parallelstore support
What's Changed
Key New Features 🎉
- Add support for parallelstore in pre-existing-network-storage by @harshthakkar01 in #2701
- Develop and adopt boot-time fix for EOL CentOS 7 repositories by @tpdownes in #2738
New Modules 🧱
- Create 'pre-existing-gke-cluster' module by @sharabiani in #2704
- Add parallelstore module and support for rocky 8, ubuntu 22.04 and debian 12 by @harshthakkar01 in #2695
- Add schedmd-slurm-gcp-v6-nodeset-dynamic module by @mr0re1 in #2696
Module Improvements 🔨
- Add 'source' argument for path to prolog or epilog scripts by @andybubu in #2670
- Allow users to turn on access to cluster via GCP public IP address space by @ankitkinra in #2687
- Add known gpu types and their accelerators to gke module by @ankitkinra in #2680
- Add disk_type for HTCondor's EP template by @aneo-ssam in #2705
Improvements 🛠
Bug fixes 🐞
- Revert "Remove installation of enroot and pyxis from a3-highgpu-8g blueprint" by @samskillman in #2722
- Only enable gpu taints if guest_accelerator list is not empty by @ankitkinra in #2727
- Move GCESysPrep to provisioner in Windows scripts by @tpdownes in #2728
- Modify a3-highgpu-8g image-building blueprint network by @tpdownes in #2744
- Update image to new centos image for both login and builder nodes by @ankitkinra in #2780
Other changes
New Contributors
- @sharabiani made their first contribution in #2704
- @aneo-ssam made their first contribution in #2705
Full Changelog: v1.35.1...v1.36.0