Detect and log control plane drops #20481
Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
/azpw run
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
# Drop count is in the second column for each CPU
if len(stat) > 1:
    drop_count += int(stat[1], 16)
Close all the file descriptors?
I am using "with open", so explicit close is not needed for file descriptors.
last_drop_count = get_softnet_dropped_count()
with open(drop_count_stash, 'w') as f:
    f.write(str(last_drop_count))
return False
Is there any option to reset the last_drop_counter during config reload or warmboot?
During warmboot, the counter is reset automatically (after kexec into the new kernel). In the config reload case, we might have a stale value for potentially one cycle, after which it fixes itself: the subsequent read refreshes the drop counter stash.
I also verified that during "config reload" the drop count in /proc/net/softnet_stat does increase a bit, so we don't need to reset the last dropped counter after "config reload".
For example, before config reload the second column had a count of 0, and after config reload it had a count of 00000015 (hex, i.e. 21 decimal).
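The stash behaviour discussed in this thread can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the code from the PR: the helper name `check_drops`, its arguments, and its return convention are hypothetical. It overwrites the stash on every run, so a stale value can survive for at most one cycle, and a counter that restarted from zero (for example after kexec) is simply not reported as an increase.

```python
import os
import tempfile


def check_drops(current_drop_count, stash_path):
    """Return True if drops increased since the previous run (hypothetical helper)."""
    try:
        with open(stash_path) as f:
            last_drop_count = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        # First run or unreadable stash: there is no baseline to compare against.
        last_drop_count = None

    # Always refresh the stash so a stale value lasts at most one cycle.
    with open(stash_path, "w") as f:
        f.write(str(current_drop_count))

    if last_drop_count is None:
        return False
    # If the kernel counter restarted, current < last and nothing is reported.
    return current_drop_count > last_drop_count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stash = os.path.join(tmp, "drop_count_stash")
        print(check_drops(0, stash))     # False: first run, no baseline
        print(check_drops(0x15, stash))  # True: counter grew from 0 to 21
```

Because the `with` blocks close the files on exit, no explicit `close()` is needed here either.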
for line in f:
    if line.strip():  # Ensure the line is not empty
        stat = line.split()
        # Drop count is in the second column for each CPU
I think the second column in softnet_stat represents the drop counter on all CPU architectures. Please add a note.
Okay, I'll confirm for all cpu architectures.
Per https://github.com/torvalds/linux/blob/v5.10/net/core/net-procfs.c#L153, the dropped count seems to be always in column 2 independent of the cpu architecture.
Please include this reference in the comment.
Addressed.
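For readers following this thread, here is a minimal sketch of the parsing step: summing the second (dropped) column over all per-CPU rows, with each field read as hexadecimal. The function name `sum_softnet_dropped` and the sample text are made up for illustration (the rows are shortened to three columns; real /proc/net/softnet_stat rows have more), and this is not the PR's actual function.

```python
def sum_softnet_dropped(softnet_text):
    """Sum the second (dropped) column across all per-CPU rows."""
    drop_count = 0
    for line in softnet_text.splitlines():
        fields = line.split()
        # Column 2 is the dropped counter; every field is a hex string.
        # Blank or single-field lines are skipped by the length check.
        if len(fields) > 1:
            drop_count += int(fields[1], 16)
    return drop_count


# Invented sample: one row per CPU, truncated to three columns.
sample = (
    "00000010 00000015 00000000\n"
    "00000020 00000001 00000000\n"
)

if __name__ == "__main__":
    print(sum_softnet_dropped(sample))  # 0x15 + 0x1 = 22
```

On a real system the text would come from reading /proc/net/softnet_stat inside a `with open(...)` block.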
/azpw run
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
    f.write(str(current_drop_count))

if current_drop_count > last_drop_count:
    write_syslog("control_plane_drop_check: Kernel packet drops detected")
Can we add the drop count in the log?
Added
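One way the drop count could be included in the log line is sketched below. The exact wording the PR ended up with is not shown in this conversation, so the message format and the helper name `format_drop_message` are assumptions; the resulting string would be passed to the PR's own `write_syslog` helper.

```python
def format_drop_message(last_drop_count, current_drop_count):
    """Build a log line with the new and total drop counts (hypothetical format)."""
    new_drops = current_drop_count - last_drop_count
    return (
        "control_plane_drop_check: Kernel packet drops detected, "
        "{} new drop(s), total {}".format(new_drops, current_drop_count)
    )


if __name__ == "__main__":
    # Example values: the counter went from 0 to 0x15 (21 decimal).
    print(format_drop_message(0, 0x15))
```

Logging both the delta and the running total makes it easier to correlate a syslog entry with a later manual read of /proc/net/softnet_stat.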
Detect and log control plane drops Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
Why I did it
Packet drops in the kernel can go unnoticed for a long while until they cause protocols like BGP/LACP to time out, potentially causing a big impact. This change detects such drops proactively and raises an alert using monit.
Work item tracking
How I did it
By adding a monit script that runs every 5 minutes and checks whether the drop counters reported by /proc/net/softnet_stat are increasing.
How to verify it
By running "sudo monit status controlPlaneDropCheck" and verifying that the status aligns with the drops reported in /proc/net/softnet_stat.
Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
Master
Description for the changelog
Link to config_db schema for YANG module changes