[Supervisord] Deduplicate the alerting messages of critical processes from Supervisord. #6849

Merged
merged 3 commits into from
Feb 25, 2021

@yozhao101 yozhao101 commented Feb 23, 2021

Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Why I did it

rsyslog is configured to suppress duplicate messages and report them in the form "message repeated n times".
As a result, when a critical process in a container exits unexpectedly, the alerting message is written to syslog once
and is then suppressed until a second critical process exits. This PR differentiates the alerting messages so that rsyslogd does not suppress them and they appear in syslog periodically.

How I did it

This PR adds a counter to the alerting message that shows how many minutes a critical process has not been running.
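
To illustrate why the counter helps, here is a minimal sketch; it is not the code from this PR. `collapse_repeats` is a hypothetical, deliberately simplified model of repeated-message suppression (it is not how rsyslog implements it), and `alert` is a hypothetical helper whose output format mirrors the alerting lines in the verification log of this PR.

```python
def collapse_repeats(messages):
    """Simplified model (an assumption, not rsyslog's code): consecutive
    identical messages are folded into the first occurrence plus a
    'message repeated n times' note."""
    out = []
    i = 0
    while i < len(messages):
        j = i
        # Advance j over the run of messages identical to messages[i].
        while j + 1 < len(messages) and messages[j + 1] == messages[i]:
            j += 1
        out.append(messages[i])
        repeats = j - i
        if repeats > 0:
            out.append("message repeated {} times".format(repeats))
        i = j + 1
    return out


def alert(process, namespace, dead_minutes=None):
    """Hypothetical formatter; with dead_minutes set, the text matches the
    shape of the alerting lines shown in the verification log."""
    base = "Process '{}' is not running in namespace '{}'".format(process, namespace)
    if dead_minutes is None:
        return base + "."
    return "{}({} minutes).".format(base, dead_minutes)


# Three periodic alerts without a counter are identical, so they collapse.
without_counter = [alert("lldpmgrd", "host") for _ in range(3)]
# With a counter, every alert differs from the previous one.
with_counter = [alert("lldpmgrd", "host", m) for m in (1, 2, 3)]

print(collapse_repeats(without_counter))
print(collapse_repeats(with_counter))
```

In this model the counter-free alerts shrink to a single line plus a repeat note, while the alerts carrying a minute count all survive because no two consecutive messages are equal.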

How to verify it

I verified and tested this implementation on a physical DUT. The syslog output below shows the minute counter increasing with each alert:

Feb 23 01:24:36.541111 str-dx010-acs-1 INFO lldp#supervisord 2021-02-23 01:24:36,540 INFO exited: lldp-syncd (terminated by SIGKILL; not expected)
Feb 23 01:24:36.543880 str-dx010-acs-1 INFO lldp#supervisord 2021-02-23 01:24:36,543 INFO exited: lldpmgrd (terminated by SIGKILL; not expected)
Feb 23 01:25:36.616111 str-dx010-acs-1 ERR lldp#supervisor-proc-exit-listener: Process 'lldp-syncd' is not running in namespace 'host'(1 minutes).
Feb 23 01:25:36.616207 str-dx010-acs-1 ERR lldp#supervisor-proc-exit-listener: Process 'lldpmgrd' is not running in namespace 'host'(1 minutes).
Feb 23 01:26:36.673000 str-dx010-acs-1 ERR lldp#supervisor-proc-exit-listener: Process 'lldp-syncd' is not running in namespace 'host'(2 minutes).
Feb 23 01:26:36.673443 str-dx010-acs-1 ERR lldp#supervisor-proc-exit-listener: Process 'lldpmgrd' is not running in namespace 'host'(2 minutes).
Feb 23 01:27:36.730690 str-dx010-acs-1 ERR lldp#supervisor-proc-exit-listener: Process 'lldp-syncd' is not running in namespace 'host'(3 minutes).
Feb 23 01:27:36.730817 str-dx010-acs-1 ERR lldp#supervisor-proc-exit-listener: Process 'lldpmgrd' is not running in namespace 'host'(3 minutes).
Feb 23 01:28:36.782367 str-dx010-acs-1 ERR lldp#supervisor-proc-exit-listener: Process 'lldp-syncd' is not running in namespace 'host'(4 minutes).
Feb 23 01:28:36.782818 str-dx010-acs-1 ERR lldp#supervisor-proc-exit-listener: Process 'lldpmgrd' is not running in namespace 'host'(4 minutes).

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • [x] 202012


Signed-off-by: Yong Zhao <yozhao@microsoft.com>
1. Fix the format of the alerting message.
2. For each exited process, there are two fields: the time of the last alert
and the number of dead minutes. Use a dict to hold these two fields instead
of a list.
3. Use a formula to calculate how many minutes the process has been in the
dead state instead of a hard-coded value.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
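
The per-process dict and the elapsed-time formula described in the commit message above could look roughly like the following sketch. All names here (`record_exit`, `due_alerts`, `last_alerted`, `dead_minutes`) and the 60-second alerting interval are assumptions made for illustration; they are not taken from `supervisor-proc-exit-listener`. Timestamps are passed in explicitly so the logic can be exercised without waiting.

```python
import time

# Assumed interval between two alerts for the same process (hypothetical).
ALERTING_INTERVAL_SECS = 60


def record_exit(state, process_name, now):
    """Start tracking a process that exited unexpectedly.

    Each entry is a dict with two fields: the time of the last alert and
    the number of minutes the process has been dead so far."""
    state[process_name] = {"last_alerted": now, "dead_minutes": 0}


def due_alerts(state, now, interval_secs=ALERTING_INTERVAL_SECS):
    """Return (process_name, dead_minutes) for every process whose alert is due.

    The dead-minute count is derived from the elapsed time rather than
    incremented by a fixed amount."""
    alerts = []
    for name, entry in state.items():
        elapsed_secs = now - entry["last_alerted"]
        if elapsed_secs >= interval_secs:
            entry["dead_minutes"] += int(elapsed_secs // 60)
            entry["last_alerted"] = now
            alerts.append((name, entry["dead_minutes"]))
    return alerts


if __name__ == "__main__":
    state = {}
    start = time.time()
    record_exit(state, "lldpmgrd", start)
    # Simulate a check two minutes after the exit instead of sleeping.
    for name, minutes in due_alerts(state, start + 120):
        print("Process '{}' is not running in namespace 'host'({} minutes)."
              .format(name, minutes))
```

Keying the fields by name keeps the two values from being confused by position, which is the readability benefit of a dict over a two-element list.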

@jleveque jleveque left a comment


Looks good to me. Please wait for other reviewers.

@yozhao101

> Looks good to me. Please wait for other reviewers.

@lguohan Can you please help me review this change?

@yozhao101 yozhao101 merged commit 21f5e12 into sonic-net:master Feb 25, 2021
yxieca pushed a commit that referenced this pull request Mar 4, 2021
… from Supervisord. (#6849)

carl-nokia pushed a commit to carl-nokia/sonic-buildimage that referenced this pull request Aug 7, 2021
… from Supervisord. (sonic-net#6849)

lolyu pushed a commit to lolyu/sonic-buildimage that referenced this pull request Sep 13, 2021
… from Supervisord. (sonic-net#6849)
