Release candidate #3403

Merged 164 commits on Dec 19, 2024
Changes from 162 commits
Commits
f9c5974
Standardizing the network naming for the integration tests
cdunbar13 Oct 30, 2024
ee84ffe
Merge branch 'develop' of https://github.com/pawloch00/cluster-toolki…
pawloch00 Nov 4, 2024
02c7962
Update mount parallelstore script to support multiple parallelstore
harshthakkar01 Nov 13, 2024
728c0c3
Update custom TAS scripts to support A3U
ighosh98 Nov 18, 2024
5bcc159
Updating Slurm-GCP to 6.8.6
cdunbar13 Nov 26, 2024
834fe90
Merge pull request #3312 from cdunbar13/update-slurm-gcp-version
cdunbar13 Nov 27, 2024
c1bc523
add kueue non tas integration tests
ighosh98 Nov 21, 2024
d303813
SlurmGCP. Harden topology updates
mr0re1 Nov 27, 2024
1e58ed3
Fix the default disk which was updated as it seemed local ssd did not…
ankitkinra Nov 27, 2024
f25af1c
Merge pull request #3316 from mr0re1/topo_upd
mr0re1 Nov 27, 2024
2741f58
tas tests added
ighosh98 Nov 21, 2024
bfb5d70
support jobset 0.7.1
annuay-google Nov 28, 2024
e255640
support jobset 0.7.1
annuay-google Nov 28, 2024
eb8dc53
Merge pull request #3315 from ighosh98/develop-kueue-tests
ighosh98 Nov 28, 2024
cf0acd2
Add dynamic setup of gpu_limit in gke-job-template module
mohitchaurasia91 Nov 28, 2024
9bb4d7a
fixed typo
mohitchaurasia91 Nov 28, 2024
343e897
fixed typo
mohitchaurasia91 Nov 28, 2024
e5ace41
deprecated comments
mohitchaurasia91 Nov 28, 2024
ca16871
fixed terraform issue
mohitchaurasia91 Nov 28, 2024
9d36504
Merge branch 'develop' of https://github.com/pawloch00/cluster-toolki…
pawloch00 Nov 28, 2024
6bd24f0
add python-pip3
pawloch00 Nov 28, 2024
9e9a03f
install packages globally
pawloch00 Nov 29, 2024
209de06
add support v0.9.1
ighosh98 Nov 29, 2024
a80864f
Merge pull request #3321 from ighosh98/develop-v0.9.1
ighosh98 Nov 29, 2024
6ee4ffa
merge develop
annuay-google Nov 29, 2024
1af5c6d
Merge pull request #3318 from annuay-google/annuay/bump-jobset-version
annuay-google Nov 29, 2024
97d8217
fix issue related to gpu shared config
mohitchaurasia91 Nov 29, 2024
c3463b8
use list instead of map for secondary ranges
annuay-google Nov 29, 2024
8aa5e5c
code refactored
mohitchaurasia91 Nov 29, 2024
d09823d
Bump golang.org/x/sys from 0.26.0 to 0.27.0
dependabot[bot] Dec 1, 2024
0967097
Bump github.com/zclconf/go-cty from 1.15.0 to 1.15.1
dependabot[bot] Dec 1, 2024
f9d4b9f
Bump github.com/hashicorp/hcl/v2 from 2.22.0 to 2.23.0
dependabot[bot] Dec 1, 2024
5910ab1
Add reservations to vm-instance
cdunbar13 Dec 1, 2024
11742fd
set GOOGLE_APPLICATION_CREDENTIALS
pawloch00 Dec 2, 2024
cc48407
remove setting creds path
pawloch00 Dec 2, 2024
b7c50ad
Merge pull request #3317 from ankitkinra/fix-default-ssd
ankitkinra Dec 2, 2024
2924327
Merge pull request #3325 from GoogleCloudPlatform/dependabot/go_modul…
alyssa-sm Dec 2, 2024
ebfd142
Merge pull request #3324 from GoogleCloudPlatform/dependabot/go_modul…
alyssa-sm Dec 2, 2024
dbd6f47
Merge pull request #3323 from GoogleCloudPlatform/dependabot/go_modul…
alyssa-sm Dec 2, 2024
00d1223
add migration example to description
annuay-google Dec 2, 2024
46d2c3d
Make most of the variables optional on additional networks to reduce …
wiktorn Nov 24, 2024
71da978
Refactor Python Integration Test
alyssa-sm Dec 2, 2024
549e252
Merge pull request #3319 from mohitchaurasia91/fix_gpu_limit
mohitchaurasia91 Dec 3, 2024
e0c690b
Merge pull request #3295 from ighosh98/develop-tas-plugin
annuay-google Dec 3, 2024
c04180d
Addressing feedback for PR
cdunbar13 Dec 3, 2024
f032613
Merge pull request #3303 from wiktorn/simplify_additional_networks
mr0re1 Dec 3, 2024
6eb555a
Merge pull request #3331 from alyssa-sm/refactor-python-test
alyssa-sm Dec 3, 2024
8298714
Refactor obtaining "resume_data" and bulk-grouping nodes
mr0re1 Dec 3, 2024
f7161e8
Merge pull request #3332 from mr0re1/resume_data
mr0re1 Dec 4, 2024
df854d8
integrate tas plugin bug fixes
ighosh98 Dec 4, 2024
c14e964
Merge pull request #3339 from ighosh98/tas-plugin
ighosh98 Dec 4, 2024
54d2602
Add change to conditionally perform pip install based on gke_node_poo…
mohitchaurasia91 Dec 4, 2024
f326cb0
Merge pull request #3327 from cdunbar13/vm-instance-reservation
cdunbar13 Dec 4, 2024
d93cac0
Add Python Integration Test Build Files
alyssa-sm Dec 3, 2024
08c7007
Merge pull request #3335 from alyssa-sm/add-python-integration-test-b…
alyssa-sm Dec 4, 2024
e0e6533
Redirect OpenFOAM tutorial to scientific computing examples demostration
wardharold Dec 4, 2024
b61abfe
Merge branch 'GoogleCloudPlatform:develop' into develop
wkharold Dec 4, 2024
bfbcd6f
Revert "integrate tas plugin bug fixes"
ighosh98 Dec 4, 2024
f6d86d0
Merge pull request #3344 from GoogleCloudPlatform/revert-3339-tas-plugin
ankitkinra Dec 4, 2024
cfe7e6f
Update A3 High integration test with new base image
tpdownes Dec 4, 2024
6a650d5
SlurmGCP. Resume. Don't group nodes by non-exclusive jobs
mr0re1 Dec 4, 2024
6888f3f
Address feedback during review of #3256
tpdownes Dec 3, 2024
c38e59a
Add python test build files to daily-tests
alyssa-sm Dec 5, 2024
97e662d
Merge pull request #3346 from alyssa-sm/add-paramiko-to-docker
alyssa-sm Dec 5, 2024
5f1092e
Merge pull request #3340 from mr0re1/job_array
mr0re1 Dec 5, 2024
994c6e9
SlurmGCP. Use generated placement policies with dense reservations.
mr0re1 Dec 5, 2024
6684796
Add NodeAction classes to subsitute NodeStatus Enum
abbas1902 Nov 27, 2024
8f8eb63
Merge pull request #3338 from abbas1902/cut_action
abbas1902 Dec 5, 2024
41fa782
Merge pull request #3345 from tpdownes/update_a3high_test
tpdownes Dec 5, 2024
2cfd2f9
Merge pull request #3347 from mr0re1/reservation_pp
mr0re1 Dec 5, 2024
c416381
Merge pull request #3256 from harshthakkar01/ps-fix-2
tpdownes Dec 5, 2024
6f3a643
Merge pull request #3341 from mohitchaurasia91/gke-pip3-fix
mohitchaurasia91 Dec 5, 2024
ca73935
Merge pull request #3350 from GoogleCloudPlatform/main
ighosh98 Dec 5, 2024
31365dc
Rename openfoam.md to README.md
wkharold Dec 5, 2024
e6abf74
Merge pull request #3320 from pawloch00/ppawl-add-pip3
rohitramu Dec 5, 2024
0e7ba6b
update terraform to 6.12.0
ighosh98 Dec 6, 2024
8d3c524
Enable PS CSI through google-container-cluster module
mohitchaurasia91 Dec 6, 2024
767a2c7
Sync Toolkit with local development during review of #3256
tpdownes Dec 6, 2024
b823c12
Ensure parallelstore mounts are restarted after DAOS agent restarts
tpdownes Dec 6, 2024
bcb2004
Avoid spurious warnings upon reboot by creating apt source.list file …
tpdownes Dec 6, 2024
94b5392
Fail early in Parallelstore client installation
tpdownes Dec 6, 2024
2e21505
make upgrade settings configurable
ighosh98 Dec 6, 2024
283ced3
Merge pull request #3356 from ighosh98/update-terraform-provider
ighosh98 Dec 6, 2024
04de06e
Use relative path for remaining use of startup-script module
tpdownes Dec 6, 2024
6c5233c
Merge pull request #3360 from tpdownes/fix_startup_usage
tpdownes Dec 6, 2024
f84c4f7
SlurmGCP. Improve non-exclusive placement alloaction
mr0re1 Dec 6, 2024
b18d453
Add future reservation support
abbas1902 Oct 31, 2024
8b4d994
Merge pull request #3227 from abbas1902/to_be_fulfilled
abbas1902 Dec 7, 2024
b7a3602
Merge pull request #2 from mohitchaurasia91/develop
mohitchaurasia91 Dec 7, 2024
c519141
Merge pull request #3354 from mr0re1/chunk_pp
mr0re1 Dec 7, 2024
182641e
Add TTL to SSH key
mr0re1 Dec 7, 2024
3548189
Merge pull request #3363 from mr0re1/ssh_key_expiration
mr0re1 Dec 7, 2024
6ef1838
Merge pull request #3357 from mohitchaurasia91/enable-gcsfuse-ps-config
mohitchaurasia91 Dec 9, 2024
f969c00
refactor to use preconditions and throw specific errors
ighosh98 Dec 6, 2024
4ef9dac
Merge branch 'develop' of https://github.com/pawloch00/cluster-toolki…
pawloch00 Dec 9, 2024
5981b96
set enable_private_endpoints to false
pawloch00 Dec 9, 2024
780ad26
Update terraform provider to 6.13.0
alyssa-sm Dec 9, 2024
a897961
Merge pull request #3367 from alyssa-sm/update-terraform-provider
alyssa-sm Dec 9, 2024
f774aa9
Fix node_is_fr to handle empty string instead
abbas1902 Dec 9, 2024
ef29fd7
Merge pull request #3369 from abbas1902/fr_fix
abbas1902 Dec 9, 2024
75a60c3
manage edge cases for upgrade settings gracefully
ighosh98 Dec 9, 2024
6714b52
Update image-builder.yaml link in README
ighosh98 Dec 10, 2024
c7a0347
Merge pull request #3373 from ighosh98/develop
ighosh98 Dec 10, 2024
e1b48a6
Merge pull request #3359 from ighosh98/upgrade-settings
ighosh98 Dec 10, 2024
fc9b339
Merge pull request #3188 from cdunbar13/integration-test-network-stan…
cdunbar13 Dec 10, 2024
f53c66c
Promote the new nic-types in vm-instance
cdunbar13 Nov 19, 2024
2360811
Merge pull request #3288 from cdunbar13/promote-vm-nic-types
cdunbar13 Dec 10, 2024
1adc1bd
Merge pull request #3348 from tpdownes/address_3256
tpdownes Dec 10, 2024
8d1ae34
Migrate `instance_template` modules from `slurm-gcp` repo
mr0re1 Dec 10, 2024
ea0704d
Update to ensure that gpu-test is compatible with H200 and H100s for …
cdunbar13 Dec 10, 2024
d3f3ceb
Merge pull request #3380 from mr0re1/move_tmpl2
mr0re1 Dec 10, 2024
27ac4ab
skip setting disk_labels if disk_type is local-ssd
abbas1902 Dec 10, 2024
5f4bb52
Merge pull request #3383 from abbas1902/labels_ssd
abbas1902 Dec 11, 2024
cff3d11
Add Future Reservation guide
abbas1902 Dec 9, 2024
03c653b
Merge pull request #3365 from abbas1902/how_to
abbas1902 Dec 11, 2024
7fdd6d3
Update community/modules/scheduler/schedmd-slurm-gcp-v6-controller/mo…
cdunbar13 Dec 11, 2024
063530c
Merge pull request #3381 from cdunbar13/a3u-update-gpu-test
cdunbar13 Dec 11, 2024
e0ff3b9
make changes backward compatible and port to older blueprints
annuay-google Dec 11, 2024
bf08e16
add validation
annuay-google Dec 11, 2024
5ba7bf7
use secondary ranges list where available, if not fall back to second…
annuay-google Dec 11, 2024
b11a70c
testing complete
annuay-google Dec 11, 2024
e9ec97c
fix failing tests
annuay-google Dec 11, 2024
dabb494
gke v1.31 added to acceptable list for a3-mega
sharabiani Dec 11, 2024
09f4b0c
Merge pull request #3364 from pawloch00/ppawl-fix-a3-xpk
pawloch00 Dec 11, 2024
f09b6f0
update deprecation warning
annuay-google Dec 11, 2024
146a9f8
An option added to disable/enable workload script execution for A3-Hi…
sharabiani Dec 11, 2024
14d473b
Merge pull request #3322 from annuay-google/annuay/prefix-based-resou…
annuay-google Dec 11, 2024
354f10a
Formatting markdown
nick-stroud Dec 11, 2024
4351862
Adding network_profile to the VPC modules
cdunbar13 Dec 11, 2024
a4b03cf
Merge pull request #3342 from wkharold/develop
nick-stroud Dec 11, 2024
ebda34f
Merge pull request #3388 from sharabiani/gke-ver-check
sharabiani Dec 11, 2024
125b5ec
Merge pull request #3389 from sharabiani/optional-workload-script
sharabiani Dec 11, 2024
6a57e7f
Update python integration tests to throw errors
alyssa-sm Dec 7, 2024
796a00b
Merge pull request #3361 from alyssa-sm/update-python-tests-to-throw
alyssa-sm Dec 11, 2024
e7dae4e
Merge pull request #3387 from cdunbar13/vpc-network-profile
cdunbar13 Dec 12, 2024
8202f72
TopologyAwareScheduling enabled by default
sharabiani Dec 12, 2024
b6bb0c5
Merge pull request #3394 from GoogleCloudPlatform/main
harshthakkar01 Dec 12, 2024
53f9646
tolerations added back
sharabiani Dec 12, 2024
7f7dc6b
Migrate `_slurm_instance` and `slurm_tpu_nodeset` modules from `slurm…
abbas1902 Dec 11, 2024
37912df
Update to ensure that gpu-test is compatible with H200 and H100s for …
cdunbar13 Dec 10, 2024
755acfa
Update community/modules/scheduler/schedmd-slurm-gcp-v6-controller/mo…
cdunbar13 Dec 11, 2024
bca3956
Initial commit of gpu-vpc module
cdunbar13 Dec 11, 2024
7a4d256
PR review updates
cdunbar13 Dec 12, 2024
64f812e
Merge pull request #3396 from sharabiani/support-kueue-v0-9-1
sharabiani Dec 12, 2024
76d2e38
Merge pull request #3391 from cdunbar13/gpu-vpc-promo
cdunbar13 Dec 12, 2024
0283c4d
Merge pull request #3390 from abbas1902/the_modules_are_coming
abbas1902 Dec 12, 2024
ed822a5
Increase version to 1.44.0
harshthakkar01 Dec 12, 2024
ed25c4d
Merge pull request #3399 from GoogleCloudPlatform/version/v1.44.0
harshthakkar01 Dec 12, 2024
4dda16b
Add terraform setup to github workflow config
abbas1902 Dec 19, 2024
5ada631
Update README for parallelstore related example blueprint
mohitchaurasia91 Dec 16, 2024
e74ca30
Update README with GKE parallelstore related example blueprint details
mohitchaurasia91 Dec 16, 2024
2e497b1
Update README with GKE parallelstore related example blueprint details
mohitchaurasia91 Dec 16, 2024
d7723f4
Updated blueprint name from gke-storage-parallelstore to gke-storage-…
mohitchaurasia91 Dec 17, 2024
2bc964e
Update ops to operation
mohitchaurasia91 Dec 18, 2024
ee06379
Fix gke parallelstore blueprint name going beyond network char limit
mohitchaurasia91 Dec 19, 2024
f613347
Updated ansible playbook test file name
mohitchaurasia91 Dec 19, 2024
0b570bc
Merge pull request #3436 from ighosh98/release-branch
nick-stroud Dec 19, 2024
77be2dd
Merge branch 'release-candidate' into release-branch
mohitchaurasia91 Dec 19, 2024
9e39b30
add reservations for kueue integration tests
ighosh98 Dec 18, 2024
8bb384f
Merge pull request #3431 from ighosh98/release-candidate
nick-stroud Dec 19, 2024
dbc0740
Merge branch 'release-candidate' into release-branch
nick-stroud Dec 19, 2024
4d362f6
Merge pull request #3437 from mohitchaurasia91/release-branch
nick-stroud Dec 19, 2024
fd6ca50
Fix typo if `scheduiling.provisioningModel` field
mr0re1 Dec 19, 2024
9ec27bb
Merge pull request #3443 from mr0re1/rc_resbound
nick-stroud Dec 19, 2024
4 changes: 4 additions & 0 deletions .github/workflows/pr-precommit.yml
@@ -41,6 +41,10 @@ jobs:
         with:
           go-version: '1.22'
           check-latest: true
+      - uses: hashicorp/setup-terraform@v3
+        with:
+          terraform_version: "1.5.7"
+          terraform_wrapper: false
       - run: make install-dev-deps
       - uses: terraform-linters/setup-tflint@v4
         with:
3 changes: 2 additions & 1 deletion .pre-commit-config.yaml
@@ -123,8 +123,9 @@ repos:
     hooks:
       - id: script-must-have-extension
       - id: shellcheck
+        exclude: ".*unlinted"
       - id: shfmt
-        exclude: ".*tpl"
+        exclude: ".*tpl|.*unlinted"
   - repo: https://github.com/pre-commit/pre-commit-hooks
     rev: v4.5.0
     hooks:
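The `exclude` change above can be sanity-checked with a short sketch. It assumes that pre-commit treats `exclude` as a Python regular expression applied with `re.search` semantics; the file paths are hypothetical and are not taken from this repository.

```python
import re

# Patterns copied from the hooks in the diff above.
SHELLCHECK_EXCLUDE = re.compile(r".*unlinted")
SHFMT_EXCLUDE = re.compile(r".*tpl|.*unlinted")


def is_excluded(pattern, path):
    """Return True if `path` would be skipped, assuming re.search semantics."""
    return pattern.search(path) is not None


# Hypothetical paths, for illustration only.
paths = [
    "scripts/install.sh",
    "templates/startup.sh.tpl",
    "files/mount.sh.unlinted",
]

for path in paths:
    print(
        path,
        "shellcheck skips:", is_excluded(SHELLCHECK_EXCLUDE, path),
        "shfmt skips:", is_excluded(SHFMT_EXCLUDE, path),
    )
```

Because the patterns are unanchored, a path that merely contains `tpl` (for example a hypothetical `tplink/setup.sh`) would also be skipped by shfmt under this assumption; anchoring with `\.tpl$` would be stricter, at the cost of diverging from the pattern in the diff.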
2 changes: 1 addition & 1 deletion cmd/root.go
@@ -53,7 +53,7 @@ HPC deployments on the Google Cloud Platform.`,
 				logging.Fatal("cmd.Help function failed: %s", err)
 			}
 		},
-		Version: "v1.43.0",
+		Version: "v1.44.0",
 		Annotations: annotation,
 	}
 )
6 changes: 4 additions & 2 deletions community/examples/xpk-gke-a3-megagpu.yaml
@@ -34,8 +34,9 @@ deployment_groups:
     settings:
       subnetwork_name: xpk-gke-a3-megagpu-subnet
       mtu: 8244
-      secondary_ranges:
-        xpk-gke-a3-megagpu-subnet:
+      secondary_ranges_list:
+      - subnetwork_name: xpk-gke-a3-megagpu-subnet
+        ranges:
         - range_name: pods
           ip_cidr_range: 10.4.0.0/14
         - range_name: services
@@ -54,6 +55,7 @@ deployment_groups:
     source: modules/scheduler/gke-cluster
     use: [network1, gpunets]
     settings:
+      enable_private_endpoint: false
       master_authorized_networks:
       - cidr_block: $(vars.authorized_cidr) # Allows your machine run kubectl command. It's required for the multi-network setup.
         display_name: "kubectl-access-network"
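The blueprint change above replaces the map-style `secondary_ranges` setting with the list-style `secondary_ranges_list`. For anyone porting an older blueprint, the reshaping can be sketched as follows. This is only an illustration of the two shapes visible in the diff: the helper name `secondary_ranges_to_list` is hypothetical and is not part of the Toolkit, and only the `pods` range is used because the rest of the hunk is collapsed.

```python
def secondary_ranges_to_list(secondary_ranges):
    """Reshape {subnetwork_name: [ranges]} into a list of
    {"subnetwork_name": ..., "ranges": [...]} entries.

    Hypothetical helper; it mirrors the shapes shown in the diff only.
    """
    return [
        {"subnetwork_name": name, "ranges": list(ranges)}
        for name, ranges in secondary_ranges.items()
    ]


# Old map-style setting, as removed in the diff (pods range only).
old_setting = {
    "xpk-gke-a3-megagpu-subnet": [
        {"range_name": "pods", "ip_cidr_range": "10.4.0.0/14"},
    ],
}

new_setting = secondary_ranges_to_list(old_setting)
print(new_setting)
```

A list of entries keeps the subnetwork name as a value rather than a map key, which matches the commit title "use list instead of map for secondary ranges".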
@@ -32,7 +32,7 @@ spec:
       hostNetwork: true
       containers:
       - name: label-nodes-daemon
-        image: python:3.9
+        image: python:3.10
         command:
         - bash
         - -c
@@ -39,7 +39,7 @@ spec:
         effect: NoSchedule
       containers:
       - name: topology-scheduler-container
-        image: python:3.9
+        image: python:3.10
         command: ["/bin/sh", "-c", "pip install google-auth google-api-python-client kubernetes; python /scripts/schedule-daemon.py --ignored-namespace kube-system gmp-public gmp-system"]
         volumeMounts:
         - name: scripts-volume