Updated kubernetes cluster version to 1.28 and update GPU operator to newest #23

Merged
1 commit merged into main from magzhang/update-to-k8s-1.28 on Jan 11, 2024

Conversation

@MaggieXJZhang (Collaborator) commented Jan 9, 2024

This PR updates the following to match the CNS 11.0 release:

  1. Kubernetes version to 1.28. (Note: on GKE, we now set min_master_version via a new variable so that we don't rely on GKE's default for the regular channel, which is currently 1.27. This is only a minimum version and does not guarantee exact versioning; see the min_master_version documentation and the sketch after this list.)
  2. GPU operator version to v23.9.1 and driver version to 535.129.03 (also illustrated in the second sketch after this list).
  3. NVAIE GPU operator version to v23.9.0 and driver version to 535.129.03.
  4. Update the copyright header to 2024 for all revised files.
  5. Regenerate the markdown docs and tfvar defaults.
  6. Update contributing.md.

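For items 2 and 3, the GPU operator chart version and driver version are typically set together. A hedged sketch using the Terraform helm provider (the release name and namespace match the verification output below, but the exact resource in this repo may differ):

    resource "helm_release" "gpu_operator" {
      name             = "gpu-operator"
      repository       = "https://helm.ngc.nvidia.com/nvidia"
      chart            = "gpu-operator"
      version          = "v23.9.1"
      namespace        = "gpu-operator"
      create_namespace = true

      # driver.version selects the NVIDIA driver the operator installs
      set {
        name  = "driver.version"
        value = "535.129.03"
      }
    }
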
AKS

~ kubectl get nodes
NAME                                   STATUS   ROLES   AGE     VERSION
aks-holoscancpu-27424565-vmss000000    Ready    agent   15m     v1.28.3
aks-holoscangpu1-21964465-vmss000000   Ready    agent   9m54s   v1.28.3
aks-holoscangpu1-21964465-vmss000001   Ready    agent   10m     v1.28.3

~ helm ls -n gpu-operator
NAME        	NAMESPACE   	REVISION	UPDATED                                	STATUS  	CHART               	APP VERSION
gpu-operator	gpu-operator	1       	2024-01-10 12:05:23.428413728 -0500 EST	deployed	gpu-operator-v23.9.1	v23.9.1 

~ kubectl exec -it nvidia-device-plugin-daemonset-mjv7f -n gpu-operator -- nvidia-smi
Defaulted container "nvidia-device-plugin" out of: nvidia-device-plugin, toolkit-validation (init)
Wed Jan 10 17:12:01 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000001:00:00.0 Off |                  Off |
| N/A   30C    P0    24W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

EKS

☁  eks [magzhang/update-to-k8s-1.28] ⚡  kubectl get nodes                                                       
NAME                                        STATUS   ROLES    AGE     VERSION
ip-10-0-35-1.us-west-2.compute.internal     Ready    <none>   7m29s   v1.28.2
ip-10-0-40-151.us-west-2.compute.internal   Ready    <none>   8m46s   v1.28.3-eks-e71965b
ip-10-0-70-112.us-west-2.compute.internal   Ready    <none>   7m28s   v1.28.2
☁  eks [magzhang/update-to-k8s-1.28] ⚡  kubectl exec nvidia-driver-daemonset-8pxcm -n gpu-operator -- nvidia-smi
Tue Jan  9 19:29:40 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           On  | 00000000:00:1E.0 Off |                    0 |
| N/A   27C    P0              23W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
☁  eks [magzhang/update-to-k8s-1.28] ⚡  helm ls -n gpu-operator
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
gpu-operator    gpu-operator    1               2024-01-09 14:23:37.135651533 -0500 EST deployed        gpu-operator-v23.9.1    v23.9.1  

GKE

~  kubectl get nodes
NAME                                            STATUS   ROLES    AGE   VERSION
gke-mz-test-tf-mz-test-cpu-pool-0f46400e-qxgn   Ready    <none>   25m   v1.28.3-gke.1203001
gke-mz-test-tf-mz-test-gpu-pool-0f4188d2-50xr   Ready    <none>   23m   v1.28.3-gke.1203001
gke-mz-test-tf-mz-test-gpu-pool-0f4188d2-n962   Ready    <none>   23m   v1.28.3-gke.1203001

~ helm ls -n gpu-operator
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
gpu-operator    gpu-operator    1               2024-01-10 16:40:36.295609783 -0500 EST deployed        gpu-operator-v23.9.1    v23.9.1 

~  kubectl exec -it pod/nvidia-driver-daemonset-c25gr -n gpu-operator -- nvidia-smi
Wed Jan 10 22:04:53 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           On  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0              24W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

@MaggieXJZhang force-pushed the magzhang/update-to-k8s-1.28 branch 2 times, most recently from 66dbe72 to 23745e2, on January 9, 2024 20:14
@MaggieXJZhang changed the title to "WIP: Updated kubernetes cluster version to 1.28 and update GPU operator to newest" on Jan 10, 2024
@MaggieXJZhang force-pushed the magzhang/update-to-k8s-1.28 branch 3 times, most recently from c4ff880 to d2f5b05, on January 10, 2024 22:41
@MaggieXJZhang changed the title back to "Updated kubernetes cluster version to 1.28 and update GPU operator to newest" on Jan 10, 2024
@MaggieXJZhang merged commit b388bcd into main on Jan 11, 2024
@MaggieXJZhang deleted the magzhang/update-to-k8s-1.28 branch on January 16, 2024 16:06