arena top job lost resource information #1082

kangzemin · 2024-05-08T11:36:29Z

arena top job lost resource information

arena: v0.9.14
BuildDate: 2024-04-10T12:54:22Z
GitCommit: adb43b8
GitTreeState: clean
GitTag: v0.9.14
GoVersion: go1.20.12
Compiler: gc
Platform: linux/amd64

Syulin7 · 2024-05-08T12:30:37Z

@kangzemin The GPU resource information comes from metrics in Prometheus, so the cluster needs a Prometheus service that provides metrics such as "nvidia_gpu_duty_cycle". For reference, see: https://github.com/kubeflow/arena/blob/master/pkg/apis/types/gpu_metric.go

kangzemin · 2024-05-11T03:53:23Z

Thank you for your guidance!
When I deployed the exporter following https://github.com/kubeflow/arena/blob/master/docs/top/prometheus.md, I got an error, so I edited kubernetes-artifacts/prometheus/gpu-exporter.yaml and deleted lines 26 and 30 (type: FileOrCreate). Now the pod is running.

But the exporter pod log shows:

time="2024-05-11T03:28:50Z" level=info msg="runtime is docker"
{"level":"error","msg":"GetDriverVersion(): 535.161.07","time":"2024-05-11T03:28:50Z"}

Is there something wrong?

Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.25.4-1", GitCommit:"f23e643ebd790a62a54b376116d094a732f28263", GitTreeState:"archive", BuildDate:"2023-02-02T00:35:13Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.25.4-1", GitCommit:"f23e643ebd790a62a54b376116d094a732f28263", GitTreeState:"archive", BuildDate:"2023-02-01T01:11:45Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

Note: my Kubernetes runtime is containerd.
nvidia-smi:

 NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2

kangzemin · 2024-05-11T04:03:29Z

@Syulin7

Syulin7 · 2024-05-13T03:13:22Z

@kangzemin You need to mount the node's containerd.sock to /run/containerd/containerd.sock inside the container.
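
For readers checking their own manifest, the following is a minimal Python sketch of what such a mount looks like structurally: a hostPath volume pointing at the socket on the node, plus a volumeMount that places it at /run/containerd/containerd.sock in the container. The function name `mounts_containerd_socket`, the volume name `containerd-sock`, the container name, and the assumption that the node's socket also lives at /run/containerd/containerd.sock are illustrative and not taken from gpu-exporter.yaml.

```python
MOUNT_PATH = "/run/containerd/containerd.sock"


def mounts_containerd_socket(pod_spec, host_path=MOUNT_PATH, mount_path=MOUNT_PATH):
    """Return True if a container mounts a hostPath volume for host_path at mount_path."""
    # Names of volumes backed by the socket file on the node.
    socket_volumes = {
        v["name"]
        for v in pod_spec.get("volumes", [])
        if (v.get("hostPath") or {}).get("path") == host_path
    }
    # Some container must reference one of those volumes at the expected path.
    return any(
        m.get("name") in socket_volumes and m.get("mountPath") == mount_path
        for c in pod_spec.get("containers", [])
        for m in c.get("volumeMounts", [])
    )


# Hypothetical pod template spec, already parsed into a dict.
example_spec = {
    "containers": [{
        "name": "exporter",
        "volumeMounts": [{"name": "containerd-sock", "mountPath": MOUNT_PATH}],
    }],
    "volumes": [{"name": "containerd-sock", "hostPath": {"path": MOUNT_PATH}}],
}

print(mounts_containerd_socket(example_spec))
```

If the node keeps its socket somewhere else, pass that location as `host_path`; the path inside the container stays the one given above.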

kangzemin · 2024-05-14T09:07:21Z

@Syulin7 OK, I mounted the node's /run/containerd/containerd.sock to /run/containerd/containerd.sock inside the container, and the exporter pod is running.

But the exporter pod log still shows an error:

time="2024-05-14T06:36:07Z" level=info msg="runtime is containerd"
{"level":"error","msg":"GetDriverVersion(): 535.161.07","time":"2024-05-14T06:36:07Z"}

The query against Prometheus returns an empty result:

kubectl get --raw '/api/v1/namespaces/arena-system/services/prometheus-svc:prometheus/proxy/api/v1/query?query=nvidia_gpu_num_devices' 
{"status":"success","data":{"resultType":"vector","result":[]}}
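
A response with "status":"success" and an empty "result" list means the query itself ran, but Prometheus holds no series for that metric. The following Python sketch shows one way to check a saved response for this case; the function name `sample_count` is hypothetical, and the response layout assumed is that of the Prometheus HTTP query API.

```python
import json


def sample_count(response_text):
    """Return how many series a Prometheus instant-query response contains."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        # Failed queries carry an "error" message instead of data.
        raise ValueError("query failed: " + str(body.get("error", "unknown error")))
    return len(body["data"]["result"])


# The response shown in the comment above.
empty_response = '{"status":"success","data":{"resultType":"vector","result":[]}}'
print(sample_count(empty_response))  # prints 0
```

A count of 0 points at the scrape path (exporter, service, or Prometheus scrape configuration) rather than at the query syntax.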

Syulin7 · 2024-05-15T02:11:09Z

@kangzemin Execute the following command to check if node-gpu-exporter exposes metrics.

kubectl get --raw '/api/v1/namespaces/arena-system/services/node-gpu-exporter:http-metrics/proxy/'

kangzemin · 2024-05-15T03:32:53Z

This is the result:

kangzemin · 2024-05-15T03:35:08Z

@Syulin7

Syulin7 · 2024-05-15T07:11:44Z

#1087

@kangzemin I submitted a PR to fix this issue. Please refer to this PR to redeploy the service.

The Prometheus deployed here is for testing only. In a production environment, you should deploy your own Prometheus service and ensure data persistence.

kangzemin · 2024-05-15T09:49:37Z

@Syulin7 OK, thank you!

Syulin7 · 2024-05-17T03:41:48Z

@kangzemin Does it work after trying again? Are there any other issues?

kangzemin · 2024-05-25T05:06:05Z

The problem still exists.

But Prometheus looks normal:

kubectl get --raw '/api/v1/namespaces/arena-system/services/node-gpu-exporter:http-metrics/proxy/'

In Grafana, only the gpunode dashboard has data; the others are empty.

Can you give me some advice?

Syulin7 · 2024-05-27T02:41:11Z

@kangzemin It seems that the metrics collected by node-gpu-exporter do not include the pod_name. Have you updated the node-gpu-exporter image and modified the resource limit value according to #1087?
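
One way to check this on saved exporter output is to scan the metrics text for samples that lack a pod_name label. The Python sketch below assumes the exporter serves the standard Prometheus text format and that pod_name is a label on the samples; the function name `series_missing_label` and the sample lines (including the `gpu` label and the pod name) are made up for illustration.

```python
import re

# One sample line: metric name, optional {label block}, then whitespace and the value.
_SAMPLE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s')


def series_missing_label(metrics_text, label="pod_name"):
    """Return sorted metric names that have a sample without a non-empty `label`."""
    wanted = re.compile(r'(?:^|,)\s*' + re.escape(label) + r'="[^"]+"')
    missing = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue
        if not wanted.search(match.group("labels") or ""):
            missing.add(match.group("name"))
    return sorted(missing)


# Illustrative input only, not real exporter output.
sample_text = """\
# TYPE nvidia_gpu_duty_cycle gauge
nvidia_gpu_duty_cycle{gpu="0",pod_name="train-0"} 37
nvidia_gpu_duty_cycle{gpu="1"} 0
nvidia_gpu_num_devices 2
"""
print(series_missing_label(sample_text))
```

Metrics that are not per-pod will naturally appear in the returned list, so the names worth attention are the per-GPU usage metrics.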

kangzemin · 2024-05-29T10:57:36Z

@Syulin7 Yes, I updated the deployment to use image gpu-prometheus-exporter:v1.0.1-b2c2f9b, with limits of cpu 1 and memory 2000Mi.
But in the arena top job output, the GPU info is N/A.

Syulin7 · 2024-05-29T11:48:53Z

@kangzemin This should be related to your cluster configuration, please contact me via email.

Syulin7 mentioned this issue May 15, 2024

Fix gpu-exporter and prometheus demo #1087

Merged

google-oss-prow bot closed this as completed in #1087 May 29, 2024
