arena top job lost resource information #1082
Comments
@kangzemin The GPU resource information depends on metrics in Prometheus. The cluster needs a Prometheus service that provides metrics such as "nvidia_gpu_duty_cycle". For reference, see: https://github.com/kubeflow/arena/blob/master/pkg/apis/types/gpu_metric.go
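One way to confirm that Prometheus actually holds this metric is to query its HTTP API directly. The following is a minimal sketch, not part of arena: it assumes Prometheus is reachable at a base URL you supply, and the helper names `build_query_url`, `series_count`, and `fetch` are made up for this illustration.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, metric: str) -> str:
    """Build an instant-query URL for Prometheus' /api/v1/query endpoint."""
    query = urllib.parse.urlencode({"query": metric})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def series_count(response: dict) -> int:
    """Return how many series an instant-query response contains."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    return len(response["data"]["result"])


def fetch(base_url: str, metric: str, timeout: float = 5.0) -> dict:
    """Run the query against a live Prometheus (needs network access)."""
    url = build_query_url(base_url, metric)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `series_count(fetch("http://prometheus.example:9090", "nvidia_gpu_duty_cycle"))` (the address is a placeholder) returning 0 would mean Prometheus has no samples for that metric, in which case the problem lies in the exporter or the scrape configuration rather than in arena.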
Thank you for your guidance! But the exporter pod log is:
Is there something wrong? Kubernetes version:
Note: my Kubernetes container runtime is containerd.
@kangzemin You need to mount the node's containerd.sock to /run/containerd/containerd.sock inside the container.
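The mount described above corresponds to a hostPath volume plus a matching volumeMount in the exporter's pod spec. The sketch below only builds those two fragments as JSON-compatible dictionaries. The volume name `containerd-sock` and the helper `containerd_socket_mount` are hypothetical, the node-side path is assumed to be the same as the in-container path, and the thread does not show the exporter's actual manifest.

```python
import json

# In-container path named in the thread; assumed to be the node path as well.
SOCKET_PATH = "/run/containerd/containerd.sock"


def containerd_socket_mount(volume_name: str = "containerd-sock") -> dict:
    """Return the volume and volumeMount entries for a pod spec."""
    return {
        # Goes under spec.volumes of the pod template.
        "volume": {
            "name": volume_name,
            "hostPath": {"path": SOCKET_PATH, "type": "Socket"},
        },
        # Goes under volumeMounts of the exporter container.
        "volumeMount": {"name": volume_name, "mountPath": SOCKET_PATH},
    }


if __name__ == "__main__":
    print(json.dumps(containerd_socket_mount(), indent=2))
```

The `Socket` hostPath type makes the pod fail to start if no UNIX socket exists at that path on the node, which surfaces a wrong path early.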
@Syulin7 OK, I mounted the node's /run/containerd/containerd.sock to /run/containerd/containerd.sock inside the container, and the exporter pod is running. But the exporter pod log shows an error:
The query from Prometheus is empty:
@kangzemin Execute the following command to check if node-gpu-exporter exposes metrics.
@kangzemin I submitted a PR to fix this issue. Please refer to this PR to redeploy the service. The Prometheus deployed here is for testing only. In a production environment, you should deploy your own Prometheus service and ensure data persistence.
@kangzemin Does it work after trying again? Are there any other issues?
The problem still exists.
In Grafana, only the gpunode dashboard has data; the others are empty. Can you give me some advice?
@kangzemin It seems that the metrics collected by node-gpu-exporter do not include the pod_name. Have you updated the node-gpu-exporter image and modified the resource limit value according to #1087?
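To check whether exported samples carry a pod_name label, one option is to scan the exporter's Prometheus text-format output. This is a rough sketch under stated assumptions: the helper `samples_missing_label` is hypothetical, the label parsing is deliberately naive, and how to obtain the text (the exporter's address and port) is not given in this thread.

```python
import re

# Naive matcher for name="value" pairs inside the {...} label block.
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def samples_missing_label(text: str, metric: str, label: str = "pod_name") -> list[str]:
    """Return sample lines of `metric` whose `label` is absent or empty."""
    missing = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the label block or at the first whitespace.
        name = re.split(r"[{\s]", line, maxsplit=1)[0]
        if name != metric:
            continue
        rest = line[len(name):]
        labels = dict(_LABEL.findall(rest)) if rest.startswith("{") else {}
        # Prometheus treats an empty label value the same as a missing label.
        if not labels.get(label):
            missing.append(line)
    return missing
```

If every sample of a metric such as `nvidia_gpu_duty_cycle` comes back in the returned list, dashboards that group by pod have nothing to match on, which would be consistent with only a node-level dashboard showing data.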
@Syulin7 Yes, I fixed the deployment to use image gpu-prometheus-exporter:v1.0.1-b2c2f9b, with limits of cpu 1 and memory 2000Mi.
@kangzemin This should be related to your cluster configuration, please contact me via email.
arena top job lost resource information
arena: v0.9.14
BuildDate: 2024-04-10T12:54:22Z
GitCommit: adb43b8
GitTreeState: clean
GitTag: v0.9.14
GoVersion: go1.20.12
Compiler: gc
Platform: linux/amd64