Merge pull request #2 from angudadevops/main
Feat: Expose GPU Operator Driver Version on EKS, GKE
evberrypi committed Jul 19, 2023
2 parents 83299dd + ce16689 commit 0712333
Showing 6 changed files with 29 additions and 3 deletions.
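
The change exposes one new input per cloud, `gpu_operator_driver_version`, defaulting to `535.54.03`. It can be overridden per environment without editing the modules; a minimal sketch, assuming the standard Terraform variable workflow (the `535.104.05` value is illustrative only, not part of this change):

```hcl
# terraform.tfvars -- hypothetical override of the driver version introduced
# by this PR; use any driver tag published for the GPU Operator driver images.
gpu_operator_driver_version = "535.104.05"
```

The same override works inline: `terraform apply -var="gpu_operator_driver_version=535.104.05"`.
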
eks/README.md (5 changes: 3 additions & 2 deletions)
@@ -74,7 +74,6 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
7. Connect to the cluster with `kubectl` by running `aws eks update-kubeconfig --name tf-holoscan-cluster --region us-west-2` after the cluster is created
8. Run `terraform destroy` to delete cloud infrastructure provisioned by Terraform


## Requirements

| Name | Version |
@@ -139,6 +138,7 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
| <a name="input_gpu_node_pool_delete_on_termination"></a> [gpu\_node\_pool\_delete\_on\_termination](#input\_gpu\_node\_pool\_delete\_on\_termination) | Delete the VM nodes root filesystem on each node of the instance type. This is set to true by default, but can be changed when desired when using the 'local-storage provisioner' and are keeping important application data on the nodes | `bool` | `true` | no |
| <a name="input_gpu_node_pool_root_disk_size_gb"></a> [gpu\_node\_pool\_root\_disk\_size\_gb](#input\_gpu\_node\_pool\_root\_disk\_size\_gb) | The size of the root disk on all GPU nodes in the EKS-managed GPU-only Node Pool. This is primarily for container image storage on the node | `number` | `512` | no |
| <a name="input_gpu_node_pool_root_volume_type"></a> [gpu\_node\_pool\_root\_volume\_type](#input\_gpu\_node\_pool\_root\_volume\_type) | The type of disk to use for the GPU node pool root disk (eg. gp2, gp3). Note, this is different from the type of disk used by applications via EKS Storage classes/PVs & PVCs | `string` | `"gp2"` | no |
| <a name="input_gpu_operator_driver_version"></a> [gpu\_operator\_driver\_version](#input\_gpu\_operator\_driver\_version) | The NVIDIA Driver version of GPU Operator | `string` | `"535.54.03"` | no |
| <a name="input_gpu_operator_namespace"></a> [gpu\_operator\_namespace](#input\_gpu\_operator\_namespace) | The namespace for the GPU operator deployment | `string` | `"gpu-operator"` | no |
| <a name="input_gpu_operator_version"></a> [gpu\_operator\_version](#input\_gpu\_operator\_version) | The version of the GPU operator | `string` | `"v23.3.2"` | no |
| <a name="input_max_cpu_nodes"></a> [max\_cpu\_nodes](#input\_max\_cpu\_nodes) | Maximum number of CPU nodes in the Autoscaling Group | `string` | `"2"` | no |
@@ -166,4 +166,5 @@ To create a cluster with everything needed to run the Cloud Native Service Add-o
| <a name="output_nodes"></a> [nodes](#output\_nodes) | n/a |
| <a name="output_oidc_endpoint"></a> [oidc\_endpoint](#output\_oidc\_endpoint) | n/a |
| <a name="output_private_subnet_ids"></a> [private\_subnet\_ids](#output\_private\_subnet\_ids) | n/a |
| <a name="output_public_subnet_ids"></a> [public\_subnet\_ids](#output\_public\_subnet\_ids) | n/a |
| <a name="output_public_subnet_ids"></a> [public\_subnet\_ids](#output\_public\_subnet\_ids) | n/a |

eks/main.tf (6 changes: 6 additions & 0 deletions)
@@ -194,5 +194,11 @@ resource "helm_release" "gpu_operator" {
  cleanup_on_fail = true
  reset_values    = true
  replace         = true

  set {
    name  = "driver.version"
    value = var.gpu_operator_driver_version
  }
}
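
Design note: the `set` block targets the chart value `driver.version`, the same key a Helm user would pass with `--set driver.version=<version>`. A sketch of the equivalent override through a `values` document instead of a `set` block; this assumes NVIDIA's public chart repository URL and the `gpu-operator` chart name, and the resource label `gpu_operator_alt` is hypothetical:

```hcl
resource "helm_release" "gpu_operator_alt" {
  name             = "gpu-operator"
  repository       = "https://helm.ngc.nvidia.com/nvidia" # assumed NVIDIA chart repo
  chart            = "gpu-operator"
  version          = var.gpu_operator_version
  namespace        = var.gpu_operator_namespace
  create_namespace = true

  # yamlencode renders the same driver.version key the set block above targets.
  values = [yamlencode({
    driver = {
      version = var.gpu_operator_driver_version
    }
  })]
}
```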

eks/variables.tf (6 changes: 6 additions & 0 deletions)
@@ -40,6 +40,12 @@ variable "gpu_operator_version" {
description = "The version of the GPU operator"
}

variable "gpu_operator_driver_version" {
type = string
default = "535.54.03"
description = "The NVIDIA Driver version of GPU Operator"
}
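
A possible hardening of this variable, as a hypothetical sketch (not part of this PR): a `validation` block makes a malformed version string fail at `terraform plan` time rather than when the GPU Operator tries to pull a driver image.

```hcl
variable "gpu_operator_driver_version" {
  type        = string
  default     = "535.54.03"
  description = "The NVIDIA driver version deployed by the GPU Operator"

  validation {
    # NVIDIA datacenter driver versions look like 535.54.03 (major.minor[.patch]).
    condition     = can(regex("^\\d+\\.\\d+(\\.\\d+)?$", var.gpu_operator_driver_version))
    error_message = "Expected an NVIDIA driver version such as 535.54.03."
  }
}
```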

variable "gpu_operator_namespace" {
type = string
default = "gpu-operator"
gke/README.md (3 changes: 2 additions & 1 deletion)
@@ -128,6 +128,7 @@ No modules.
| <a name="input_gpu_instance_type"></a> [gpu\_instance\_type](#input\_gpu\_instance\_type) | Machine Type for GPU node pool | `string` | `"n1-standard-4"` | no |
| <a name="input_gpu_max_node_count"></a> [gpu\_max\_node\_count](#input\_gpu\_max\_node\_count) | Max Number of GPU nodes in GPU nodepool | `string` | `"5"` | no |
| <a name="input_gpu_min_node_count"></a> [gpu\_min\_node\_count](#input\_gpu\_min\_node\_count) | Min number of GPU nodes in GPU nodepool | `string` | `"2"` | no |
| <a name="input_gpu_operator_driver_version"></a> [gpu\_operator\_driver\_version](#input\_gpu\_operator\_driver\_version) | The NVIDIA Driver version of GPU Operator | `string` | `"535.54.03"` | no |
| <a name="input_gpu_operator_namespace"></a> [gpu\_operator\_namespace](#input\_gpu\_operator\_namespace) | The namespace to deploy the NVIDIA GPU operator intov | `string` | `"gpu-operator"` | no |
| <a name="input_gpu_operator_version"></a> [gpu\_operator\_version](#input\_gpu\_operator\_version) | Version of the GPU operator to be installed | `string` | `"v23.3.2"` | no |
| <a name="input_gpu_type"></a> [gpu\_type](#input\_gpu\_type) | GPU SKU To attach to Holoscan GPU Node (eg. nvidia-tesla-k80) | `string` | `"nvidia-tesla-v100"` | no |
@@ -153,4 +154,4 @@ No modules.
| <a name="output_stable_channel_latest_gke_version"></a> [stable\_channel\_latest\_gke\_version](#output\_stable\_channel\_latest\_gke\_version) | The latest available version of GKE when using the STABLE channel |
| <a name="output_subnet_cidr_range"></a> [subnet\_cidr\_range](#output\_subnet\_cidr\_range) | The IPs and CIDRs of the subnets |
| <a name="output_subnet_region"></a> [subnet\_region](#output\_subnet\_region) | The region of the VPC subnet used in this module |
| <a name="output_vpc_project"></a> [vpc\_project](#output\_vpc\_project) | Project of the VPC network (can be different from the project launching Kubernetes resources) |
| <a name="output_vpc_project"></a> [vpc\_project](#output\_vpc\_project) | Project of the VPC network (can be different from the project launching Kubernetes resources) |
gke/main.tf (6 changes: 6 additions & 0 deletions)
@@ -194,5 +194,11 @@ resource "helm_release" "gpu-operator" {
  cleanup_on_fail = true
  reset_values    = true
  replace         = true

  set {
    name  = "driver.version"
    value = var.gpu_operator_driver_version
  }
}
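
A follow-up that would pair naturally with this change (hypothetical, not in this PR): exposing the pinned version as a module output, so operators can confirm which driver the cluster was provisioned with.

```hcl
# Hypothetical addition to outputs.tf in either module.
output "gpu_operator_driver_version" {
  description = "NVIDIA driver version passed to the GPU Operator Helm chart"
  value       = var.gpu_operator_driver_version
}
```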

gke/variables.tf (6 changes: 6 additions & 0 deletions)
@@ -104,6 +104,12 @@ variable "gpu_operator_version" {
description = "Version of the GPU operator to be installed"
}

variable "gpu_operator_driver_version" {
type = string
default = "535.54.03"
description = "The NVIDIA Driver version of GPU Operator"
}

variable "gpu_operator_namespace" {
default = "gpu-operator"
description = "The namespace to deploy the NVIDIA GPU operator intov"
