Merge pull request #2 from angudadevops/main
Feat: Expose GPU Operator Driver Version on EKS, GKE
evberrypi committed Jul 19, 2023
2 parents 83299dd + ce16689 commit 0712333
Showing 6 changed files with 29 additions and 3 deletions.
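
The change exposes one new input per cloud, `gpu_operator_driver_version`, defaulting to `535.54.03`. It can be overridden per environment without editing the modules; a minimal sketch, assuming the standard Terraform variable workflow (the `535.104.05` value is illustrative only, not part of this change):

```hcl
# terraform.tfvars -- hypothetical override of the driver version introduced
# by this PR; use any driver tag published for the GPU Operator driver images.
gpu_operator_driver_version = "535.104.05"
```

The same override works inline: `terraform apply -var="gpu_operator_driver_version=535.104.05"`.
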
eks/README.md (5 changes: 3 additions & 2 deletions)
@@ -74,7 +74,6 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
7. Connect to the cluster with `kubectl` by running `aws eks update-kubeconfig --name tf-holoscan-cluster --region us-west-2` after the cluster is created
8. Run `terraform destroy` to delete cloud infrastructure provisioned by Terraform


## Requirements

| Name | Version |
@@ -139,6 +138,7 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
| <a name="input_gpu_node_pool_delete_on_termination"></a> [gpu\_node\_pool\_delete\_on\_termination](#input\_gpu\_node\_pool\_delete\_on\_termination) | Delete the VM nodes root filesystem on each node of the instance type. This is set to true by default, but can be changed when desired when using the 'local-storage provisioner' and are keeping important application data on the nodes | `bool` | `true` | no |
| <a name="input_gpu_node_pool_root_disk_size_gb"></a> [gpu\_node\_pool\_root\_disk\_size\_gb](#input\_gpu\_node\_pool\_root\_disk\_size\_gb) | The size of the root disk on all GPU nodes in the EKS-managed GPU-only Node Pool. This is primarily for container image storage on the node | `number` | `512` | no |
| <a name="input_gpu_node_pool_root_volume_type"></a> [gpu\_node\_pool\_root\_volume\_type](#input\_gpu\_node\_pool\_root\_volume\_type) | The type of disk to use for the GPU node pool root disk (eg. gp2, gp3). Note, this is different from the type of disk used by applications via EKS Storage classes/PVs & PVCs | `string` | `"gp2"` | no |
| <a name="input_gpu_operator_driver_version"></a> [gpu\_operator\_driver\_version](#input\_gpu\_operator\_driver\_version) | The NVIDIA Driver version of GPU Operator | `string` | `"535.54.03"` | no |
| <a name="input_gpu_operator_namespace"></a> [gpu\_operator\_namespace](#input\_gpu\_operator\_namespace) | The namespace for the GPU operator deployment | `string` | `"gpu-operator"` | no |
| <a name="input_gpu_operator_version"></a> [gpu\_operator\_version](#input\_gpu\_operator\_version) | The version of the GPU operator | `string` | `"v23.3.2"` | no |
| <a name="input_max_cpu_nodes"></a> [max\_cpu\_nodes](#input\_max\_cpu\_nodes) | Maximum number of CPU nodes in the Autoscaling Group | `string` | `"2"` | no |
@@ -166,4 +166,5 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
| <a name="output_nodes"></a> [nodes](#output\_nodes) | n/a |
| <a name="output_oidc_endpoint"></a> [oidc\_endpoint](#output\_oidc\_endpoint) | n/a |
| <a name="output_private_subnet_ids"></a> [private\_subnet\_ids](#output\_private\_subnet\_ids) | n/a |
| <a name="output_public_subnet_ids"></a> [public\_subnet\_ids](#output\_public\_subnet\_ids) | n/a |
| <a name="output_public_subnet_ids"></a> [public\_subnet\_ids](#output\_public\_subnet\_ids) | n/a |

eks/main.tf (6 changes: 6 additions & 0 deletions)
@@ -194,5 +194,11 @@ resource "helm_release" "gpu_operator" {
  cleanup_on_fail = true
  reset_values    = true
  replace         = true

  set {
    name  = "driver.version"
    value = var.gpu_operator_driver_version
  }
}
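
Design note: the `set` block targets the chart value `driver.version`, the same key a Helm user would pass with `--set driver.version=<version>`. A sketch of the equivalent override through a `values` document instead of a `set` block; this assumes NVIDIA's public chart repository URL and the `gpu-operator` chart name, and the resource label `gpu_operator_alt` is hypothetical:

```hcl
resource "helm_release" "gpu_operator_alt" {
  name             = "gpu-operator"
  repository       = "https://helm.ngc.nvidia.com/nvidia" # assumed NVIDIA chart repo
  chart            = "gpu-operator"
  version          = var.gpu_operator_version
  namespace        = var.gpu_operator_namespace
  create_namespace = true

  # yamlencode renders the same driver.version key the set block above targets.
  values = [yamlencode({
    driver = {
      version = var.gpu_operator_driver_version
    }
  })]
}
```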

eks/variables.tf (6 changes: 6 additions & 0 deletions)
@@ -40,6 +40,12 @@ variable "gpu_operator_version" {
description = "The version of the GPU operator"
}

variable "gpu_operator_driver_version" {
type = string
default = "535.54.03"
description = "The NVIDIA Driver version of GPU Operator"
}
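
A possible hardening of this variable, as a hypothetical sketch (not part of this PR): a `validation` block makes a malformed version string fail at `terraform plan` time rather than when the GPU Operator tries to pull a driver image.

```hcl
variable "gpu_operator_driver_version" {
  type        = string
  default     = "535.54.03"
  description = "The NVIDIA driver version deployed by the GPU Operator"

  validation {
    # NVIDIA datacenter driver versions look like 535.54.03 (major.minor[.patch]).
    condition     = can(regex("^\\d+\\.\\d+(\\.\\d+)?$", var.gpu_operator_driver_version))
    error_message = "Expected an NVIDIA driver version such as 535.54.03."
  }
}
```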

variable "gpu_operator_namespace" {
type = string
default = "gpu-operator"
gke/README.md (3 changes: 2 additions & 1 deletion)
@@ -128,6 +128,7 @@ No modules.
| <a name="input_gpu_instance_type"></a> [gpu\_instance\_type](#input\_gpu\_instance\_type) | Machine Type for GPU node pool | `string` | `"n1-standard-4"` | no |
| <a name="input_gpu_max_node_count"></a> [gpu\_max\_node\_count](#input\_gpu\_max\_node\_count) | Max Number of GPU nodes in GPU nodepool | `string` | `"5"` | no |
| <a name="input_gpu_min_node_count"></a> [gpu\_min\_node\_count](#input\_gpu\_min\_node\_count) | Min number of GPU nodes in GPU nodepool | `string` | `"2"` | no |
| <a name="input_gpu_operator_driver_version"></a> [gpu\_operator\_driver\_version](#input\_gpu\_operator\_driver\_version) | The NVIDIA Driver version of GPU Operator | `string` | `"535.54.03"` | no |
| <a name="input_gpu_operator_namespace"></a> [gpu\_operator\_namespace](#input\_gpu\_operator\_namespace) | The namespace to deploy the NVIDIA GPU operator intov | `string` | `"gpu-operator"` | no |
| <a name="input_gpu_operator_version"></a> [gpu\_operator\_version](#input\_gpu\_operator\_version) | Version of the GPU operator to be installed | `string` | `"v23.3.2"` | no |
| <a name="input_gpu_type"></a> [gpu\_type](#input\_gpu\_type) | GPU SKU To attach to Holoscan GPU Node (eg. nvidia-tesla-k80) | `string` | `"nvidia-tesla-v100"` | no |
@@ -153,4 +154,4 @@ No modules.
| <a name="output_stable_channel_latest_gke_version"></a> [stable\_channel\_latest\_gke\_version](#output\_stable\_channel\_latest\_gke\_version) | The latest available version of GKE when using the STABLE channel |
| <a name="output_subnet_cidr_range"></a> [subnet\_cidr\_range](#output\_subnet\_cidr\_range) | The IPs and CIDRs of the subnets |
| <a name="output_subnet_region"></a> [subnet\_region](#output\_subnet\_region) | The region of the VPC subnet used in this module |
| <a name="output_vpc_project"></a> [vpc\_project](#output\_vpc\_project) | Project of the VPC network (can be different from the project launching Kubernetes resources) |
| <a name="output_vpc_project"></a> [vpc\_project](#output\_vpc\_project) | Project of the VPC network (can be different from the project launching Kubernetes resources) |
gke/main.tf (6 changes: 6 additions & 0 deletions)
@@ -194,5 +194,11 @@ resource "helm_release" "gpu-operator" {
  cleanup_on_fail = true
  reset_values    = true
  replace         = true

  set {
    name  = "driver.version"
    value = var.gpu_operator_driver_version
  }
}
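
A follow-up that would pair naturally with this change (hypothetical, not in this PR): exposing the pinned version as a module output, so operators can confirm which driver the cluster was provisioned with.

```hcl
# Hypothetical addition to outputs.tf in either module.
output "gpu_operator_driver_version" {
  description = "NVIDIA driver version passed to the GPU Operator Helm chart"
  value       = var.gpu_operator_driver_version
}
```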

gke/variables.tf (6 changes: 6 additions & 0 deletions)
@@ -104,6 +104,12 @@ variable "gpu_operator_version" {
description = "Version of the GPU operator to be installed"
}

variable "gpu_operator_driver_version" {
type = string
default = "535.54.03"
description = "The NVIDIA Driver version of GPU Operator"
}

variable "gpu_operator_namespace" {
default = "gpu-operator"
description = "The namespace to deploy the NVIDIA GPU operator intov"
