Add alert for jobs that stay in the pending phase for a long time #3761
src/watchdog/src/watchdog.py
def generate_pod_metrics(pai_pod_gauge, pai_job_pod_gauge, service_name, job_name, pod_name,
                         host_ip, status, namespace):
Maybe add `assert service_name is not None and job_name is not None`
to make sure the caller passed the expected parameters.
Here we parse all pods running in k8s. Some pods, such as nvidia-device-plugin, carry neither an app label nor a job label.
No, the caller already returns in that case. Sorry, it should be `assert service_name is not None or job_name is not None`.
This function only generates metrics for jobs and services, and the caller must know this. The assert statement makes that contract explicit, so I suggest adding it: future readers and writers of the code can then learn the requirement just by looking at the assert.
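For illustration, the contract discussed in this thread could look roughly like the sketch below. It is not the code in this PR: plain lists stand in for the two gauges, `process_pod` is a hypothetical caller, and the label keys `app` and `job_name` are assumptions.

```python
def generate_pod_metrics(pai_pod_gauge, pai_job_pod_gauge, service_name, job_name,
                         pod_name, host_ip, status, namespace):
    # Contract from the review: at least one of the two names must be set.
    assert service_name is not None or job_name is not None, \
        "expected a service name or a job name"
    # Plain lists stand in for the real gauges (assumption for this sketch).
    if service_name is not None:
        pai_pod_gauge.append((service_name, pod_name, host_ip, status, namespace))
    if job_name is not None:
        pai_job_pod_gauge.append((job_name, pod_name, host_ip, status, namespace))


def process_pod(pai_pod_gauge, pai_job_pod_gauge, labels, pod_name, host_ip,
                status, namespace):
    # Hypothetical caller; the label keys "app" and "job_name" are assumptions.
    service_name = labels.get("app")
    job_name = labels.get("job_name")
    if service_name is None and job_name is None:
        # Pods such as nvidia-device-plugin carry neither label: skip them here,
        # so the assert in generate_pod_metrics never fires for them.
        return
    generate_pod_metrics(pai_pod_gauge, pai_job_pod_gauge, service_name, job_name,
                         pod_name, host_ip, status, namespace)
```

Note that Python drops assert statements when run with `-O`, so the assert documents the contract but does not replace the early return in the caller.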
Refer to issue: #3760
Sometimes a user job stays in pending status forever, which is caused by a kubelet bug.
Add an alert for this situation.
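As a rough illustration of the condition such an alert checks, here is a minimal sketch. It is not the rule added in this PR: the function name `is_stuck_pending` and the 30-minute threshold are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold for this sketch; the PR may use a different value.
PENDING_THRESHOLD = timedelta(minutes=30)


def is_stuck_pending(phase, created_at, now, threshold=PENDING_THRESHOLD):
    """Return True if a pod has been in the pending phase longer than threshold."""
    return phase.lower() == "pending" and (now - created_at) > threshold


# Usage: a pod created two hours ago that is still pending would be flagged.
now = datetime(2019, 10, 1, 12, 0, tzinfo=timezone.utc)
stuck = is_stuck_pending("Pending", now - timedelta(hours=2), now)
```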