This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

add alert for jobs which in pending phrase for a long time #3761

Merged
Binyang2014 merged 7 commits into master from binyli/monitor_job_pod
Oct 24, 2019

Contributor

@Binyang2014 Binyang2014 commented Oct 21, 2019

Refer to issue: #3760
Sometimes a user job stays in pending status forever, which is caused by a kubelet bug.
Add an alert for this situation.
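
The alert rule itself is not shown in this conversation, so the following is only a minimal sketch of the idea: flag a pod that has stayed in the Kubernetes `Pending` phase longer than some threshold. The function name `is_stuck_pending` and the 5-minute threshold are assumptions for illustration, not values taken from this PR.

```python
import datetime

# Hypothetical threshold; the value used by the real alert is not stated here.
PENDING_THRESHOLD = datetime.timedelta(minutes=5)


def is_stuck_pending(phase, created_at, now, threshold=PENDING_THRESHOLD):
    """Return True if a pod has been in the Pending phase longer than threshold.

    phase      -- pod phase string as reported by Kubernetes, e.g. "Pending"
    created_at -- datetime when the pod was created
    now        -- current datetime (passed in so the check is easy to test)
    """
    return phase == "Pending" and (now - created_at) > threshold
```

A pod that is `Running`, or that has only just been created, is not flagged; only a long-lived `Pending` pod would be a candidate for the alert.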

@Binyang2014 Binyang2014 force-pushed the binyli/monitor_job_pod branch 3 times, most recently from 3ae1c5d to 579c5fd Compare October 21, 2019 10:53
@Binyang2014 Binyang2014 force-pushed the binyli/monitor_job_pod branch from 579c5fd to 5280b3d Compare October 21, 2019 11:44
@Binyang2014 Binyang2014 marked this pull request as ready for review October 22, 2019 03:25
@Binyang2014 Binyang2014 changed the title from "add job pod monitor" to "add alert for jobs which in pending phrase for a long time" Oct 22, 2019
src/watchdog/src/watchdog.py (outdated)


def generate_pod_metrics(pai_pod_gauge, pai_job_pod_gauge, service_name, job_name, pod_name,
                         host_ip, status, namespace):
Member

Maybe add `assert service_name is not None and job_name is not None` to make sure the caller passed the expected parameters.

Contributor Author

Here we parse all pods running in k8s. Some pods, such as nvidia-device-plugin, have neither an app label nor a job label.


Member

No, the caller already returns early in that case. Sorry, it should be `assert service_name is not None or job_name is not None`.

Member

This function only generates metrics for jobs and services, and the caller must know this. The assert statement documents that contract, so I suggest adding it: future readers and writers of the code can then learn the requirement just by looking at the assert.
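
The suggestion in this thread can be sketched as follows. This is a simplified stand-in, not the merged code: the two gauges are modeled as plain lists, the caller `process_pod` is hypothetical, and the label key `"jobName"` is an assumed name for the job label (only the `app` label is named in the discussion).

```python
def generate_pod_metrics(pai_pod_gauge, pai_job_pod_gauge, service_name, job_name, pod_name,
                         host_ip, status, namespace):
    # Contract: this function only emits metrics for services and jobs,
    # so at least one of the two names must be present.
    assert service_name is not None or job_name is not None, \
        "expected a service name or a job name"

    if service_name is not None:
        # Gauges are plain lists here (an assumption for this sketch).
        pai_pod_gauge.append((service_name, pod_name, host_ip, status, namespace))
    if job_name is not None:
        pai_job_pod_gauge.append((job_name, pod_name, host_ip, status, namespace))


def process_pod(pai_pod_gauge, pai_job_pod_gauge, labels, pod_name, host_ip, status, namespace):
    """Hypothetical caller: skip pods that carry neither label."""
    service_name = labels.get("app")
    job_name = labels.get("jobName")  # assumed label key, for illustration only
    if service_name is None and job_name is None:
        # e.g. nvidia-device-plugin: nothing to report, return before the assert.
        return
    generate_pod_metrics(pai_pod_gauge, pai_job_pod_gauge, service_name, job_name,
                         pod_name, host_ip, status, namespace)
```

With `or` rather than `and`, a service pod without a job label (or a job pod without an app label) still passes the check, while a call with both names missing fails immediately. Note that Python skips assert statements when run with `-O`, so the assert serves as documentation and a development-time check rather than as input validation.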

@Binyang2014 Binyang2014 merged commit bfba50f into master Oct 24, 2019
@Binyang2014 Binyang2014 deleted the binyli/monitor_job_pod branch October 25, 2019 03:47
3 participants