Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect and log control plane drops #20481

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

prabhataravind
Copy link
Contributor

@prabhataravind prabhataravind commented Oct 12, 2024

Why I did it

Packet drops in kernel can go unnoticed for a long while until they cause protocols like BGP/LACP to timeout potentially causing a big impact. The change here is to detect that proactively and alert using monit

Work item tracking
  • Microsoft ADO 25736117:

How I did it

By adding a monit script that runs every 5 minutes and checks if drop counters reported by /proc/net/softnet_stat is increasing or not.

How to verify it

By running " sudo monit status controlPlaneDropCheck" and verifying that the status aligns with the drops reported in /proc/net/softnet_stat

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Master

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@prabhataravind prabhataravind force-pushed the paravind/control_plane_drops branch 3 times, most recently from df35e9e to 5a23584 Compare October 17, 2024 04:32
Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
@prabhataravind
Copy link
Contributor Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@prabhataravind prabhataravind marked this pull request as ready for review October 18, 2024 00:07
# Drop count is in the second column for each CPU
if len(stat) > 1:
drop_count += int(stat[1], 16)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close all the file descriptors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using "with open", so explicit close is not needed for file descriptors.

last_drop_count = get_softnet_dropped_count()
with open(drop_count_stash, 'w') as f:
f.write(str(last_drop_count))
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any option to reset the last_drop_counter during config reload or warmboot??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During warmboot, the counter will automatically be reset (after kexec to new kernel). For config reload case, we might have a stale value for potentially 1 cycle and then it will fix itself. The subsequent read will clear up the drop counter stash.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also verified that during "config reload", /proc/net/softnet_stat drop count does increase a bit, so we don't need to reset the last dropped counter after "config reload".

For eg: before config reload, the second column had a count of 0 and after config reload, the second column had a count of 00000015 (hex).

for line in f:
if line.strip(): # Ensure the line is not empty
stat = line.split()
# Drop count is in the second column for each CPU
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think second column in softnet_stat represents drop counter in all CPU architectures, please add a note.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll confirm for all cpu architectures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per https://github.com/torvalds/linux/blob/v5.10/net/core/net-procfs.c#L153, the dropped count seems to be always in column 2 independent of the cpu architecture.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include this reference in the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

@prabhataravind
Copy link
Contributor Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@prsunny prsunny requested a review from prgeor October 24, 2024 21:31
f.write(str(current_drop_count))

if current_drop_count > last_drop_count:
write_syslog("control_plane_drop_check: Kernel packet drops detected")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the drop count in the log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants