
Commit

Updated EKS to 1.28 and updated GPU operator to 23.9.1
MaggieXJZhang committed Jan 9, 2024
1 parent 14dc04b commit 66dbe72
Showing 4 changed files with 16 additions and 36 deletions.
20 changes: 0 additions & 20 deletions eks/.terraform.lock.hcl


12 changes: 6 additions & 6 deletions eks/README.md
@@ -118,7 +118,7 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
| <a name="input_aws_profile"></a> [aws\_profile](#input\_aws\_profile) | n/a | `string` | `"development"` | no |
| <a name="input_cidr_block"></a> [cidr\_block](#input\_cidr\_block) | CIDR for VPC | `string` | `"10.0.0.0/16"` | no |
| <a name="input_cluster_name"></a> [cluster\_name](#input\_cluster\_name) | n/a | `string` | n/a | yes |
-| <a name="input_cluster_version"></a> [cluster\_version](#input\_cluster\_version) | Version of EKS to install on the control plane (Major and Minor version only, do not include the patch) | `string` | `"1.27"` | no |
+| <a name="input_cluster_version"></a> [cluster\_version](#input\_cluster\_version) | Version of EKS to install on the control plane (Major and Minor version only, do not include the patch) | `string` | `"1.28"` | no |
| <a name="input_cpu_instance_type"></a> [cpu\_instance\_type](#input\_cpu\_instance\_type) | CPU EC2 worker node instance type | `string` | `"t2.xlarge"` | no |
| <a name="input_cpu_node_pool_additional_user_data"></a> [cpu\_node\_pool\_additional\_user\_data](#input\_cpu\_node\_pool\_additional\_user\_data) | User data that is appended to the user data script after of the EKS bootstrap script on EKS-managed GPU node pool. | `string` | `""` | no |
| <a name="input_cpu_node_pool_delete_on_termination"></a> [cpu\_node\_pool\_delete\_on\_termination](#input\_cpu\_node\_pool\_delete\_on\_termination) | Delete the VM nodes root filesystem on each node of the instance type. This is set to true by default, but can be changed when desired when using the 'local-storage provisioner' and are keeping important application data on the nodes | `bool` | `true` | no |
@@ -136,18 +136,18 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
| <a name="input_gpu_node_pool_delete_on_termination"></a> [gpu\_node\_pool\_delete\_on\_termination](#input\_gpu\_node\_pool\_delete\_on\_termination) | Delete the VM nodes root filesystem on each node of the instance type. This is set to true by default, but can be changed when desired when using the 'local-storage provisioner' and are keeping important application data on the nodes | `bool` | `true` | no |
| <a name="input_gpu_node_pool_root_disk_size_gb"></a> [gpu\_node\_pool\_root\_disk\_size\_gb](#input\_gpu\_node\_pool\_root\_disk\_size\_gb) | The size of the root disk on all GPU nodes in the EKS-managed GPU-only Node Pool. This is primarily for container image storage on the node | `number` | `512` | no |
| <a name="input_gpu_node_pool_root_volume_type"></a> [gpu\_node\_pool\_root\_volume\_type](#input\_gpu\_node\_pool\_root\_volume\_type) | The type of disk to use for the GPU node pool root disk (eg. gp2, gp3). Note, this is different from the type of disk used by applications via EKS Storage classes/PVs & PVCs | `string` | `"gp2"` | no |
-| <a name="input_gpu_operator_driver_version"></a> [gpu\_operator\_driver\_version](#input\_gpu\_operator\_driver\_version) | The NVIDIA Driver version deployed with GPU Operator. Defaults to latest available. Not set when `nvaie` is set to true | `string` | `"535.104.05"` | no |
+| <a name="input_gpu_operator_driver_version"></a> [gpu\_operator\_driver\_version](#input\_gpu\_operator\_driver\_version) | The NVIDIA Driver version deployed with GPU Operator. Defaults to latest available. Not set when `nvaie` is set to true | `string` | `"535.129.03"` | no |
| <a name="input_gpu_operator_namespace"></a> [gpu\_operator\_namespace](#input\_gpu\_operator\_namespace) | The namespace for the GPU operator deployment | `string` | `"gpu-operator"` | no |
-| <a name="input_gpu_operator_version"></a> [gpu\_operator\_version](#input\_gpu\_operator\_version) | Version of the GPU Operator to deploy. Defaults to latest available. Not set when `nvaie` is set to `true` | `string` | `"v23.6.1"` | no |
+| <a name="input_gpu_operator_version"></a> [gpu\_operator\_version](#input\_gpu\_operator\_version) | Version of the GPU Operator to deploy. Defaults to latest available. Not set when `nvaie` is set to `true` | `string` | `"v23.9.1"` | no |
| <a name="input_max_cpu_nodes"></a> [max\_cpu\_nodes](#input\_max\_cpu\_nodes) | Maximum number of CPU nodes in the Autoscaling Group | `string` | `"2"` | no |
| <a name="input_max_gpu_nodes"></a> [max\_gpu\_nodes](#input\_max\_gpu\_nodes) | Maximum number of GPU nodes in the Autoscaling Group | `string` | `"5"` | no |
| <a name="input_min_cpu_nodes"></a> [min\_cpu\_nodes](#input\_min\_cpu\_nodes) | Minimum number of CPU nodes in the Autoscaling Group | `string` | `"0"` | no |
| <a name="input_min_gpu_nodes"></a> [min\_gpu\_nodes](#input\_min\_gpu\_nodes) | Minimum number of GPU nodes in the Autoscaling Group | `string` | `"2"` | no |
| <a name="input_nvaie"></a> [nvaie](#input\_nvaie) | To use the versions of GPU operator and drivers specified as part of NVIDIA AI Enterprise, set this to true. More information at https://www.nvidia.com/en-us/data-center/products/ai-enterprise | `bool` | `false` | no |
| <a name="input_nvaie_gpu_operator_driver_version"></a> [nvaie\_gpu\_operator\_driver\_version](#input\_nvaie\_gpu\_operator\_driver\_version) | The NVIDIA AI Enterprise version of the NVIDIA driver to be installed with the GPU operator. Overrides `gpu_operator_driver_version` when `nvaie` is set to `true` | `string` | `"525.125.06"` | no |
| <a name="input_nvaie_gpu_operator_version"></a> [nvaie\_gpu\_operator\_version](#input\_nvaie\_gpu\_operator\_version) | The NVIDIA Driver version of GPU Operator. Overrides `gpu_operator_version` when `nvaie` is set to `true` | `string` | `"v23.3.2"` | no |
-| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | List of subnet ranges for the Holoscan VPC | `list(any)` | <pre>[<br> "10.0.1.0/24",<br> "10.0.2.0/24",<br> "10.0.3.0/24"<br>]</pre> | no |
-| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | List of subnet ranges for the Holoscan VPC | `list(any)` | <pre>[<br> "10.0.4.0/24",<br> "10.0.5.0/24",<br> "10.0.6.0/24"<br>]</pre> | no |
+| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | List of subnet ranges for the Holoscan VPC | `list(any)` | <pre>[<br> "10.0.0.0/19",<br> "10.0.32.0/19",<br> "10.0.64.0/19"<br>]</pre> | no |
+| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | List of subnet ranges for the Holoscan VPC | `list(any)` | <pre>[<br> "10.0.96.0/19",<br> "10.0.128.0/19",<br> "10.0.160.0/19"<br>]</pre> | no |
| <a name="input_region"></a> [region](#input\_region) | AWS region to provision the Holoscan Compliant Kubernetes Cluster | `string` | `"us-west-2"` | no |
| <a name="input_single_nat_gateway"></a> [single\_nat\_gateway](#input\_single\_nat\_gateway) | Should be true if you want to provision a single shared NAT Gateway across all of your private networks | `bool` | `false` | no |
| <a name="input_ssh_key"></a> [ssh\_key](#input\_ssh\_key) | n/a | `string` | `""` | no |
@@ -166,4 +166,4 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
| <a name="output_nodes"></a> [nodes](#output\_nodes) | n/a |
| <a name="output_oidc_endpoint"></a> [oidc\_endpoint](#output\_oidc\_endpoint) | n/a |
| <a name="output_private_subnet_ids"></a> [private\_subnet\_ids](#output\_private\_subnet\_ids) | n/a |
-| <a name="output_public_subnet_ids"></a> [public\_subnet\_ids](#output\_public\_subnet\_ids) | n/a |
+| <a name="output_public_subnet_ids"></a> [public\_subnet\_ids](#output\_public\_subnet\_ids) | n/a |
10 changes: 5 additions & 5 deletions eks/terraform.tfvars
@@ -17,7 +17,7 @@
# aws_profile = "development"
# cidr_block = "10.0.0.0/16"
# cluster_name = ""
-# cluster_version = "1.26"
+# cluster_version = "1.28"
# cpu_instance_type = "t2.xlarge"
# cpu_node_pool_additional_user_data = ""
# cpu_node_pool_delete_on_termination = true
@@ -35,16 +35,16 @@
# gpu_node_pool_delete_on_termination = true
# gpu_node_pool_root_disk_size_gb = 512
# gpu_node_pool_root_volume_type = "gp2"
-# gpu_operator_driver_version = "535.104.05"
+# gpu_operator_driver_version = "535.129.03"
# gpu_operator_namespace = "gpu-operator"
-# gpu_operator_version = "v23.6.1"
+# gpu_operator_version = "v23.9.1"
# max_cpu_nodes = "2"
# max_gpu_nodes = "5"
# min_cpu_nodes = "0"
# min_gpu_nodes = "2"
# nvaie = false
-# nvaie_gpu_operator_driver_version = "525.125.06"
-# nvaie_gpu_operator_version = "v23.3.2"
+# nvaie_gpu_operator_driver_version = "535.129.03"
+# nvaie_gpu_operator_version = "v23.9.0"
# private_subnets = [
# "10.0.1.0/24",
# "10.0.2.0/24",
10 changes: 5 additions & 5 deletions eks/variables.tf
@@ -28,20 +28,20 @@ variable "cluster_name" {

variable "cluster_version" {
type = string
-default = "1.27"
+default = "1.28"
description = "Version of EKS to install on the control plane (Major and Minor version only, do not include the patch)"
}
/************************
GPU Operator Variables
*************************/
variable "gpu_operator_version" {
-default = "v23.6.1"
+default = "v23.9.1"
description = "Version of the GPU Operator to deploy. Defaults to latest available. Not set when `nvaie` is set to `true`"
}

variable "gpu_operator_driver_version" {
type = string
-default = "535.104.05"
+default = "535.129.03"
description = "The NVIDIA Driver version deployed with GPU Operator. Defaults to latest available. Not set when `nvaie` is set to true"
}

@@ -59,13 +59,13 @@ variable "nvaie" {

variable "nvaie_gpu_operator_version" {
type = string
-default = "v23.3.2"
+default = "v23.9.0"
description = "The NVIDIA Driver version of GPU Operator. Overrides `gpu_operator_version` when `nvaie` is set to `true`"
}

variable "nvaie_gpu_operator_driver_version" {
type = string
-default = "525.125.06"
+default = "535.129.03"
description = "The NVIDIA AI Enterprise version of the NVIDIA driver to be installed with the GPU operator. Overrides `gpu_operator_driver_version` when `nvaie` is set to `true`"
}
/*****************************
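
Review note: the README defaults for `private_subnets` and `public_subnets` move from /24 to /19 ranges inside the `10.0.0.0/16` VPC CIDR. A minimal sketch of a sanity check for those values follows, using only Python's standard `ipaddress` module. The helper name `check_subnets` and the constant names are hypothetical and not part of this repository; the CIDR strings are copied from the README hunk in this commit.

```python
import ipaddress
from itertools import combinations

# Values copied from the updated README defaults in this commit.
VPC_CIDR = "10.0.0.0/16"
PRIVATE_SUBNETS = ["10.0.0.0/19", "10.0.32.0/19", "10.0.64.0/19"]
PUBLIC_SUBNETS = ["10.0.96.0/19", "10.0.128.0/19", "10.0.160.0/19"]


def check_subnets(vpc_cidr, subnets):
    """Return a list of problems; an empty list means the layout is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    nets = [ipaddress.ip_network(s) for s in subnets]
    problems = []
    # Every subnet must sit inside the VPC range.
    for net in nets:
        if not net.subnet_of(vpc):
            problems.append(f"{net} is outside {vpc}")
    # No two subnets may share addresses.
    for a, b in combinations(nets, 2):
        if a.overlaps(b):
            problems.append(f"{a} overlaps {b}")
    return problems


if __name__ == "__main__":
    found = check_subnets(VPC_CIDR, PRIVATE_SUBNETS + PUBLIC_SUBNETS)
    print(found if found else "subnet layout is consistent")
```

A /16 splits into eight /19 blocks, so the six ranges above leave two blocks (`10.0.192.0/19` and `10.0.224.0/19`) unallocated.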
