Updated all CSPs to 1.28 and updated GPU operator to 23.9.x
MaggieXJZhang committed Jan 10, 2024
1 parent faea332 commit 41324be
Showing 15 changed files with 69 additions and 85 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -24,6 +24,7 @@ Each CSP has its own end of life date for the versions of Kubernetes they suppor

| Version | Release Date | Kubernetes Versions | NVIDIA GPU Operator | NVIDIA Data Center Driver* | End of Life |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.6.0 | January 2024 | EKS - 1.28 <br> GKE - 1.28 <br> AKS - 1.28 | 23.9.1 (Default); 23.9.0 (NV AI E) | 535.129.03 (EKS & GKE Default); 535.129.03 (NV AI E version for GKE & EKS) | EKS - Nov 2024 <br> GKE - Nov 2024 <br> AKS - Nov 2024 |
| 0.5.0 | November 2023 | EKS - 1.27 <br> GKE - 1.27 <br> AKS - 1.27 | 23.6.1 (Default); 23.3.2 (NV AI E) | 535.104.05 (EKS & GKE Default); 525.125.06 (NV AI E version for GKE & EKS) | EKS - July 2024 <br> GKE - August 2024 <br> AKS - July 2024 |
| 0.4.0 | October 2023 | EKS - 1.27 <br> GKE - 1.27 <br> AKS - 1.27 | 23.6.1 (Default); 23.3.2 (NV AI E) | 535.104.05 (EKS & GKE Default); 525.125.06 (NV AI E version for GKE & EKS) | EKS - July 2024 <br> GKE - August 2024 <br> AKS - July 2024 |
| 0.3.0 | September 2023 | EKS - 1.26 <br> GKE - 1.26 <br> AKS - 1.26 | 23.6.1 (Default); 23.3.2 (NV AI E) | 535.54.03 (EKS & GKE Default); 525.125.06 (NV AI E version for GKE & EKS) | EKS - June 2024 <br> GKE - June 2024 <br> AKS - March 2024 |
7 changes: 3 additions & 4 deletions aks/README.md
@@ -79,7 +79,6 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
- In Cloud Shell, run `az login` and re-run `terraform apply`



## Requirements

| Name | Version |
@@ -131,12 +130,12 @@ No modules.
| <a name="input_gpu_node_pool_max_count"></a> [gpu\_node\_pool\_max\_count](#input\_gpu\_node\_pool\_max\_count) | Max count of nodes in Default GPU pool | `number` | `5` | no |
| <a name="input_gpu_node_pool_min_count"></a> [gpu\_node\_pool\_min\_count](#input\_gpu\_node\_pool\_min\_count) | Min count of number of nodes in Default GPU pool | `number` | `2` | no |
| <a name="input_gpu_operator_namespace"></a> [gpu\_operator\_namespace](#input\_gpu\_operator\_namespace) | The namespace to deploy the NVIDIA GPU operator into | `string` | `"gpu-operator"` | no |
| <a name="input_gpu_operator_version"></a> [gpu\_operator\_version](#input\_gpu\_operator\_version) | Version of the GPU operator to be installed | `string` | `"v23.6.1"` | no |
| <a name="input_gpu_operator_version"></a> [gpu\_operator\_version](#input\_gpu\_operator\_version) | Version of the GPU operator to be installed | `string` | `"v23.9.1"` | no |
| <a name="input_gpu_os_sku"></a> [gpu\_os\_sku](#input\_gpu\_os\_sku) | Specifies the OS SKU used by the agent pool. Possible values include: Ubuntu, CBLMariner, Mariner, Windows2019, Windows2022 | `string` | `"Ubuntu"` | no |
| <a name="input_kubernetes_version"></a> [kubernetes\_version](#input\_kubernetes\_version) | Version of Kubernetes to turn on. Run 'az aks get-versions --location <location> --output table' to view all available versions | `string` | `"1.27"` | no |
| <a name="input_kubernetes_version"></a> [kubernetes\_version](#input\_kubernetes\_version) | Version of Kubernetes to turn on. Run 'az aks get-versions --location <location> --output table' to view all available versions | `string` | `"1.28"` | no |
| <a name="input_location"></a> [location](#input\_location) | The region to create resources in | `any` | n/a | yes |
| <a name="input_nvaie"></a> [nvaie](#input\_nvaie) | To use the versions of GPU operator and drivers specified as part of NVIDIA AI Enterprise, set this to true. More information at https://www.nvidia.com/en-us/data-center/products/ai-enterprise | `bool` | `false` | no |
| <a name="input_nvaie_gpu_operator_version"></a> [nvaie\_gpu\_operator\_version](#input\_nvaie\_gpu\_operator\_version) | The NVIDIA Driver version of GPU Operator. Overrides `gpu_operator_version` when `nvaie` is set to `true` | `string` | `"v23.3.2"` | no |
| <a name="input_nvaie_gpu_operator_version"></a> [nvaie\_gpu\_operator\_version](#input\_nvaie\_gpu\_operator\_version) | The NVIDIA Driver version of GPU Operator. Overrides `gpu_operator_version` when `nvaie` is set to `true` | `string` | `"v23.9.0"` | no |

## Outputs

8 changes: 4 additions & 4 deletions aks/terraform.tfvars
@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# Sample tfvars file. Uncomment out values to use
@@ -19,9 +19,9 @@
# gpu_node_pool_max_count = 5
# gpu_node_pool_min_count = 2
# gpu_operator_namespace = "gpu-operator"
# gpu_operator_version = "v23.6.1"
# gpu_operator_version = "v23.9.1"
# gpu_os_sku = "Ubuntu"
# kubernetes_version = "1.26.3"
# kubernetes_version = "1.28"
# location = ""
# nvaie = false
# nvaie_gpu_operator_version = "v23.3.2"
# nvaie_gpu_operator_version = "v23.9.0"
8 changes: 4 additions & 4 deletions aks/variables.tf
@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

/****************************
@@ -25,7 +25,7 @@ variable "cluster_name" {
}

variable "kubernetes_version" {
default = "1.27"
default = "1.28"
description = "Version of Kubernetes to turn on. Run 'az aks get-versions --location <location> --output table' to view all available versions "
}

@@ -87,7 +87,7 @@ variable "gpu_os_sku" {
GPU Operator Variables
****************************/
variable "gpu_operator_version" {
default = "v23.6.1"
default = "v23.9.1"
description = "Version of the GPU operator to be installed"
}

@@ -105,7 +105,7 @@ variable "nvaie" {

variable "nvaie_gpu_operator_version" {
type = string
default = "v23.3.2"
default = "v23.9.0"
description = "The NVIDIA Driver version of GPU Operator. Overrides `gpu_operator_version` when `nvaie` is set to `true`"
}

20 changes: 0 additions & 20 deletions eks/.terraform.lock.hcl

Some generated files are not rendered by default.

16 changes: 8 additions & 8 deletions eks/README.md
@@ -118,7 +118,7 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
| <a name="input_aws_profile"></a> [aws\_profile](#input\_aws\_profile) | n/a | `string` | `"development"` | no |
| <a name="input_cidr_block"></a> [cidr\_block](#input\_cidr\_block) | CIDR for VPC | `string` | `"10.0.0.0/16"` | no |
| <a name="input_cluster_name"></a> [cluster\_name](#input\_cluster\_name) | n/a | `string` | n/a | yes |
| <a name="input_cluster_version"></a> [cluster\_version](#input\_cluster\_version) | Version of EKS to install on the control plane (Major and Minor version only, do not include the patch) | `string` | `"1.27"` | no |
| <a name="input_cluster_version"></a> [cluster\_version](#input\_cluster\_version) | Version of EKS to install on the control plane (Major and Minor version only, do not include the patch) | `string` | `"1.28"` | no |
| <a name="input_cpu_instance_type"></a> [cpu\_instance\_type](#input\_cpu\_instance\_type) | CPU EC2 worker node instance type | `string` | `"t2.xlarge"` | no |
| <a name="input_cpu_node_pool_additional_user_data"></a> [cpu\_node\_pool\_additional\_user\_data](#input\_cpu\_node\_pool\_additional\_user\_data) | User data that is appended to the user data script after of the EKS bootstrap script on EKS-managed GPU node pool. | `string` | `""` | no |
| <a name="input_cpu_node_pool_delete_on_termination"></a> [cpu\_node\_pool\_delete\_on\_termination](#input\_cpu\_node\_pool\_delete\_on\_termination) | Delete the VM nodes root filesystem on each node of the instance type. This is set to true by default, but can be changed when desired when using the 'local-storage provisioner' and are keeping important application data on the nodes | `bool` | `true` | no |
@@ -136,18 +136,18 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
| <a name="input_gpu_node_pool_delete_on_termination"></a> [gpu\_node\_pool\_delete\_on\_termination](#input\_gpu\_node\_pool\_delete\_on\_termination) | Delete the VM nodes root filesystem on each node of the instance type. This is set to true by default, but can be changed when desired when using the 'local-storage provisioner' and are keeping important application data on the nodes | `bool` | `true` | no |
| <a name="input_gpu_node_pool_root_disk_size_gb"></a> [gpu\_node\_pool\_root\_disk\_size\_gb](#input\_gpu\_node\_pool\_root\_disk\_size\_gb) | The size of the root disk on all GPU nodes in the EKS-managed GPU-only Node Pool. This is primarily for container image storage on the node | `number` | `512` | no |
| <a name="input_gpu_node_pool_root_volume_type"></a> [gpu\_node\_pool\_root\_volume\_type](#input\_gpu\_node\_pool\_root\_volume\_type) | The type of disk to use for the GPU node pool root disk (eg. gp2, gp3). Note, this is different from the type of disk used by applications via EKS Storage classes/PVs & PVCs | `string` | `"gp2"` | no |
| <a name="input_gpu_operator_driver_version"></a> [gpu\_operator\_driver\_version](#input\_gpu\_operator\_driver\_version) | The NVIDIA Driver version deployed with GPU Operator. Defaults to latest available. Not set when `nvaie` is set to true | `string` | `"535.104.05"` | no |
| <a name="input_gpu_operator_driver_version"></a> [gpu\_operator\_driver\_version](#input\_gpu\_operator\_driver\_version) | The NVIDIA Driver version deployed with GPU Operator. Defaults to latest available. Not set when `nvaie` is set to true | `string` | `"535.129.03"` | no |
| <a name="input_gpu_operator_namespace"></a> [gpu\_operator\_namespace](#input\_gpu\_operator\_namespace) | The namespace for the GPU operator deployment | `string` | `"gpu-operator"` | no |
| <a name="input_gpu_operator_version"></a> [gpu\_operator\_version](#input\_gpu\_operator\_version) | Version of the GPU Operator to deploy. Defaults to latest available. Not set when `nvaie` is set to `true` | `string` | `"v23.6.1"` | no |
| <a name="input_gpu_operator_version"></a> [gpu\_operator\_version](#input\_gpu\_operator\_version) | Version of the GPU Operator to deploy. Defaults to latest available. Not set when `nvaie` is set to `true` | `string` | `"v23.9.1"` | no |
| <a name="input_max_cpu_nodes"></a> [max\_cpu\_nodes](#input\_max\_cpu\_nodes) | Maximum number of CPU nodes in the Autoscaling Group | `string` | `"2"` | no |
| <a name="input_max_gpu_nodes"></a> [max\_gpu\_nodes](#input\_max\_gpu\_nodes) | Maximum number of GPU nodes in the Autoscaling Group | `string` | `"5"` | no |
| <a name="input_min_cpu_nodes"></a> [min\_cpu\_nodes](#input\_min\_cpu\_nodes) | Minimum number of CPU nodes in the Autoscaling Group | `string` | `"0"` | no |
| <a name="input_min_gpu_nodes"></a> [min\_gpu\_nodes](#input\_min\_gpu\_nodes) | Minimum number of GPU nodes in the Autoscaling Group | `string` | `"2"` | no |
| <a name="input_nvaie"></a> [nvaie](#input\_nvaie) | To use the versions of GPU operator and drivers specified as part of NVIDIA AI Enterprise, set this to true. More information at https://www.nvidia.com/en-us/data-center/products/ai-enterprise | `bool` | `false` | no |
| <a name="input_nvaie_gpu_operator_driver_version"></a> [nvaie\_gpu\_operator\_driver\_version](#input\_nvaie\_gpu\_operator\_driver\_version) | The NVIDIA AI Enterprise version of the NVIDIA driver to be installed with the GPU operator. Overrides `gpu_operator_driver_version` when `nvaie` is set to `true` | `string` | `"525.125.06"` | no |
| <a name="input_nvaie_gpu_operator_version"></a> [nvaie\_gpu\_operator\_version](#input\_nvaie\_gpu\_operator\_version) | The NVIDIA Driver version of GPU Operator. Overrides `gpu_operator_version` when `nvaie` is set to `true` | `string` | `"v23.3.2"` | no |
| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | List of subnet ranges for the Holoscan VPC | `list(any)` | <pre>[<br> "10.0.1.0/24",<br> "10.0.2.0/24",<br> "10.0.3.0/24"<br>]</pre> | no |
| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | List of subnet ranges for the Holoscan VPC | `list(any)` | <pre>[<br> "10.0.4.0/24",<br> "10.0.5.0/24",<br> "10.0.6.0/24"<br>]</pre> | no |
| <a name="input_nvaie_gpu_operator_driver_version"></a> [nvaie\_gpu\_operator\_driver\_version](#input\_nvaie\_gpu\_operator\_driver\_version) | The NVIDIA AI Enterprise version of the NVIDIA driver to be installed with the GPU operator. Overrides `gpu_operator_driver_version` when `nvaie` is set to `true` | `string` | `"535.129.03"` | no |
| <a name="input_nvaie_gpu_operator_version"></a> [nvaie\_gpu\_operator\_version](#input\_nvaie\_gpu\_operator\_version) | The NVIDIA Driver version of GPU Operator. Overrides `gpu_operator_version` when `nvaie` is set to `true` | `string` | `"v23.9.0"` | no |
| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | List of subnet ranges for the Holoscan VPC | `list(any)` | <pre>[<br> "10.0.0.0/19",<br> "10.0.32.0/19",<br> "10.0.64.0/19"<br>]</pre> | no |
| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | List of subnet ranges for the Holoscan VPC | `list(any)` | <pre>[<br> "10.0.96.0/19",<br> "10.0.128.0/19",<br> "10.0.160.0/19"<br>]</pre> | no |
| <a name="input_region"></a> [region](#input\_region) | AWS region to provision the Holoscan Compliant Kubernetes Cluster | `string` | `"us-west-2"` | no |
| <a name="input_single_nat_gateway"></a> [single\_nat\_gateway](#input\_single\_nat\_gateway) | Should be true if you want to provision a single shared NAT Gateway across all of your private networks | `bool` | `false` | no |
| <a name="input_ssh_key"></a> [ssh\_key](#input\_ssh\_key) | n/a | `string` | `""` | no |
@@ -166,4 +166,4 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
| <a name="output_nodes"></a> [nodes](#output\_nodes) | n/a |
| <a name="output_oidc_endpoint"></a> [oidc\_endpoint](#output\_oidc\_endpoint) | n/a |
| <a name="output_private_subnet_ids"></a> [private\_subnet\_ids](#output\_private\_subnet\_ids) | n/a |
| <a name="output_public_subnet_ids"></a> [public\_subnet\_ids](#output\_public\_subnet\_ids) | n/a |
| <a name="output_public_subnet_ids"></a> [public\_subnet\_ids](#output\_public\_subnet\_ids) | n/a |
4 changes: 2 additions & 2 deletions eks/examples/cnpack/aws-pca.tf
@@ -7,8 +7,8 @@ AWS Private Cert Authority Config

// Create AWS Private Cert Authority
resource "aws_acmpca_certificate_authority" "cnpack-pca" {
count = var.pca_enabled ? 1 : 0
type = "ROOT"
count = var.pca_enabled ? 1 : 0
type = "ROOT"
usage_mode = var.pca_short_lived ? "SHORT_LIVED_CERTIFICATE" : "GENERAL_PURPOSE"
certificate_authority_configuration {
key_algorithm = "RSA_4096"
31 changes: 12 additions & 19 deletions eks/terraform.tfvars
@@ -1,11 +1,4 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# Sample tfvars file. Uncomment out values to use
# Do not commit this file to Git with sensitive values


# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# Sample tfvars file. Uncomment out values to use
@@ -17,7 +10,7 @@
# aws_profile = "development"
# cidr_block = "10.0.0.0/16"
# cluster_name = ""
# cluster_version = "1.26"
# cluster_version = "1.28"
# cpu_instance_type = "t2.xlarge"
# cpu_node_pool_additional_user_data = ""
# cpu_node_pool_delete_on_termination = true
@@ -35,25 +28,25 @@
# gpu_node_pool_delete_on_termination = true
# gpu_node_pool_root_disk_size_gb = 512
# gpu_node_pool_root_volume_type = "gp2"
# gpu_operator_driver_version = "535.104.05"
# gpu_operator_driver_version = "535.129.03"
# gpu_operator_namespace = "gpu-operator"
# gpu_operator_version = "v23.6.1"
# gpu_operator_version = "v23.9.1"
# max_cpu_nodes = "2"
# max_gpu_nodes = "5"
# min_cpu_nodes = "0"
# min_gpu_nodes = "2"
# nvaie = false
# nvaie_gpu_operator_driver_version = "525.125.06"
# nvaie_gpu_operator_version = "v23.3.2"
# nvaie_gpu_operator_driver_version = "535.129.03"
# nvaie_gpu_operator_version = "v23.9.0"
# private_subnets = [
# "10.0.1.0/24",
# "10.0.2.0/24",
# "10.0.3.0/24"
# "10.0.0.0/19",
# "10.0.32.0/19",
# "10.0.64.0/19"
# ]
# public_subnets = [
# "10.0.4.0/24",
# "10.0.5.0/24",
# "10.0.6.0/24"
# "10.0.96.0/19",
# "10.0.128.0/19",
# "10.0.160.0/19"
# ]
# region = "us-west-2"
# single_nat_gateway = false
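
The new subnet defaults are the first six `/19` slices of the default `cidr_block` (`10.0.0.0/16`). As a rough sketch only, the same ranges could be derived with Terraform's built-in `cidrsubnet` function rather than hard-coded. The local and output names below are hypothetical and do not appear in this commit, which keeps the ranges as literal lists.

```hcl
# Sketch under stated assumptions: not part of the commit.
variable "cidr_block" {
  type    = string
  default = "10.0.0.0/16"
}

locals {
  # Adding 3 bits to a /16 prefix yields /19 networks.
  # Hypothetical locals: indexes 0-2 are private, 3-5 are public.
  derived_private_subnets = [for i in range(0, 3) : cidrsubnet(var.cidr_block, 3, i)]
  derived_public_subnets  = [for i in range(3, 6) : cidrsubnet(var.cidr_block, 3, i)]
}

# Private: 10.0.0.0/19, 10.0.32.0/19, 10.0.64.0/19
output "derived_private_subnets" {
  value = local.derived_private_subnets
}

# Public: 10.0.96.0/19, 10.0.128.0/19, 10.0.160.0/19
output "derived_public_subnets" {
  value = local.derived_public_subnets
}
```

Deriving the ranges this way would keep them inside the VPC CIDR if `cidr_block` is overridden, at the cost of no longer accepting arbitrary subnet lists.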
12 changes: 6 additions & 6 deletions eks/variables.tf
@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

/************************
@@ -28,20 +28,20 @@ variable "cluster_name" {

variable "cluster_version" {
type = string
default = "1.27"
default = "1.28"
description = "Version of EKS to install on the control plane (Major and Minor version only, do not include the patch)"
}
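
The description asks for major and minor only, but nothing in the declaration above enforces it. One possible way to catch a value such as `"1.28.3"` at plan time is a `validation` block. The variant below is illustrative, is not part of this commit, and would replace the declaration above rather than sit alongside it.

```hcl
# Sketch under stated assumptions: the validation block is not in the commit.
variable "cluster_version" {
  type        = string
  default     = "1.28"
  description = "Version of EKS to install on the control plane (Major and Minor version only, do not include the patch)"

  validation {
    # Accepts "1.28"; rejects "1.28.3" and "v1.28".
    condition     = can(regex("^[0-9]+\\.[0-9]+$", var.cluster_version))
    error_message = "Use major.minor only, for example \"1.28\"; do not include a patch version."
  }
}
```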
/************************
GPU Operator Variables
*************************/
variable "gpu_operator_version" {
default = "v23.6.1"
default = "v23.9.1"
description = "Version of the GPU Operator to deploy. Defaults to latest available. Not set when `nvaie` is set to `true`"
}

variable "gpu_operator_driver_version" {
type = string
default = "535.104.05"
default = "535.129.03"
description = "The NVIDIA Driver version deployed with GPU Operator. Defaults to latest available. Not set when `nvaie` is set to true"
}

@@ -59,13 +59,13 @@ variable "nvaie" {

variable "nvaie_gpu_operator_version" {
type = string
default = "v23.3.2"
default = "v23.9.0"
description = "The NVIDIA Driver version of GPU Operator. Overrides `gpu_operator_version` when `nvaie` is set to `true`"
}

variable "nvaie_gpu_operator_driver_version" {
type = string
default = "525.125.06"
default = "535.129.03"
description = "The NVIDIA AI Enterprise version of the NVIDIA driver to be installed with the GPU operator. Overrides `gpu_operator_driver_version` when `nvaie` is set to `true`"
}
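
The descriptions state that the `nvaie_*` values override the default ones when `nvaie` is `true`. The diff does not show how the module wires this up, so the following is only a sketch of one way such an override could be expressed. The `effective_*` local names are hypothetical.

```hcl
# Sketch under stated assumptions: the module's real selection logic is not shown in this diff.
variable "nvaie" {
  type    = bool
  default = false
}

variable "gpu_operator_version" {
  type    = string
  default = "v23.9.1"
}

variable "nvaie_gpu_operator_version" {
  type    = string
  default = "v23.9.0"
}

variable "gpu_operator_driver_version" {
  type    = string
  default = "535.129.03"
}

variable "nvaie_gpu_operator_driver_version" {
  type    = string
  default = "535.129.03"
}

locals {
  # Hypothetical names: pick the NVIDIA AI Enterprise values only when nvaie is true.
  effective_gpu_operator_version = var.nvaie ? var.nvaie_gpu_operator_version : var.gpu_operator_version
  effective_driver_version       = var.nvaie ? var.nvaie_gpu_operator_driver_version : var.gpu_operator_driver_version
}
```

With the defaults from this commit, setting `nvaie = true` would change the operator version from `v23.9.1` to `v23.9.0`, while the driver version stays at `535.129.03` either way.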
/*****************************