
Reset skip error counter on a fabric link if it was down. #3247

Conversation

jfeng-arista (Contributor)

What I did
Reset the skip-error counters for fabric link monitoring if a link went down and came back up.

Why I did it

The fabric link monitoring feature does not handle peer card restarts. A peer card reload can cause links to go into the isolation state, and that is a case where the errors can be ignored as init churn. So with this change, the skip-error-on-init counters are reset if a link was down and came back up.
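
A minimal sketch of the intended behavior, for illustration only: the actual change lives in orchagent/fabricportsorch.cpp, and the struct, function, and field names below are hypothetical, loosely mirroring the STATE_DB fields visible in the logs later in this thread.

```cpp
// Hypothetical sketch -- not the actual fabricportsorch.cpp code.
struct FabricLinkMonState
{
    bool wasDown = false;          // link went down since the last evaluation
    int  pollWithErrors = 0;
    int  pollWithNoErrors = 0;
    int  pollWithFecErrors = 0;
    int  pollWithNoFecErrors = 0;
    int  skipErrorsOnInit = 0;     // polls to ignore right after link init
};

// Called on every monitoring poll with the current oper status of the link.
void onFabricLinkPoll(FabricLinkMonState &st, bool linkIsUp)
{
    if (!linkIsUp)
    {
        st.wasDown = true;         // e.g. the peer card is reloading
        return;
    }

    if (st.wasDown)
    {
        // The link was down and came back up: reset the skip-error counters so
        // the init churn from the peer card restart is ignored instead of
        // pushing the link into the isolation state.
        st.pollWithErrors = 0;
        st.pollWithNoErrors = 0;
        st.pollWithFecErrors = 0;
        st.pollWithNoFecErrors = 0;
        st.skipErrorsOnInit = 0;
        st.wasDown = false;
    }

    // ... normal error-rate evaluation and isolation logic continues here ...
}
```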

How I verified it

Details if related

@kenneth-arista (Contributor)

Related to sonic-net/sonic-buildimage#19288

@jfeng-arista jfeng-arista requested a review from prsunny as a code owner July 31, 2024 03:15

@judyjoseph (Contributor)

@saksarav-nokia to review too

@saksarav-nokia (Contributor)

Tested the fix and it seems to be working. However, I see a lot of these log messages and am not sure we need them at NOTICE level:

2024 Aug 5 20:50:16.520538 ixre-cpm-chassis12 NOTICE swss0#orchagent: :- updateStateDbTable: PORT0 updates POLL_WITH_ERRORS to 0 0
2024 Aug 5 20:50:16.520538 ixre-cpm-chassis12 NOTICE swss0#orchagent: :- updateStateDbTable: PORT0 updates POLL_WITH_NO_ERRORS to 8 8
2024 Aug 5 20:50:16.520594 ixre-cpm-chassis12 NOTICE swss0#orchagent: :- updateStateDbTable: PORT0 updates POLL_WITH_FEC_ERRORS to 0 0
2024 Aug 5 20:50:16.520662 ixre-cpm-chassis12 NOTICE swss0#orchagent: :- updateStateDbTable: PORT0 updates POLL_WITH_NOFEC_ERRORS to 8 8
2024 Aug 5 20:50:16.520717 ixre-cpm-chassis12 NOTICE swss0#orchagent: :- updateStateDbTable: PORT0 updates CONFIG_ISOLATED to 0 0
2024 Aug 5 20:50:16.520782 ixre-cpm-chassis12 NOTICE swss0#orchagent: :- updateStateDbTable: PORT0 updates ISOLATED to 0 0

@jfeng-arista (Contributor, Author)

I can change the message level to INFO

@jfeng-arista (Contributor, Author)

> Tested the fix and it seems to be working. However, I see a lot of these log messages and am not sure we need them at NOTICE level (log lines quoted above).

Changed the log level to INFO, so these messages will not show up by default now.
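
For context, a minimal sketch of what the severity change amounts to, assuming the message comes from the SWSS_LOG_* macros used throughout orchagent; the helper signature below is a hypothetical stand-in for the real updateStateDbTable.

```cpp
#include <string>

#include "logger.h"   // SWSS_LOG_NOTICE / SWSS_LOG_INFO macros
#include "table.h"    // swss::Table for STATE_DB writes

// Hypothetical helper mirroring the "updateStateDbTable" messages in the logs.
static void updateStateDbTable(swss::Table &stateTable, const std::string &port,
                               const std::string &field, int oldVal, int newVal)
{
    stateTable.hset(port, field, std::to_string(newVal));

    // Before: SWSS_LOG_NOTICE(...) -- emitted to syslog at the default level.
    // After:  SWSS_LOG_INFO(...)   -- hidden unless the component's log level
    //                                 is raised (e.g. with swssloglevel).
    SWSS_LOG_INFO("%s updates %s to %d %d",
                  port.c_str(), field.c_str(), oldVal, newVal);
}
```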

@jfeng-arista (Contributor, Author)

Can anyone help review this again and see if we can merge it? Thank you.

@arlakshm (Contributor) left a comment


as comments

Review comments on orchagent/fabricportsorch.cpp (3 threads, resolved)
@jfeng-arista (Contributor, Author)

test_HashSwitchGlobalConfiguration[inner-frame-ecmp-hash] failed (1 runs remaining out of 2).
Not related to this change.

@jfeng-arista (Contributor, Author)

/azpw run Azure.sonic-swss

@mssonicbld (Collaborator)

/AzurePipelines run Azure.sonic-swss


Azure Pipelines successfully started running 1 pipeline(s).

@jfeng-arista (Contributor, Author)

I don't quite get the coverage results. I added a test to cover the change, and the coverage tests passed before restarting the pipeline. How can the coverage test fail after the rerun?

start again

/azpw run Azure.sonic-swss

@jfeng-arista (Contributor, Author)

/azpw run Azure.sonic-swss

@mssonicbld (Collaborator)

/AzurePipelines run Azure.sonic-swss


Azure Pipelines successfully started running 1 pipeline(s).

@jfeng-arista (Contributor, Author)

/azpw run coverage.Azure.sonic-swss.vstest

@arlakshm (Contributor)

@prsunny, can you please help merge this PR

@prsunny prsunny merged commit 42ea859 into sonic-net:master Sep 3, 2024
17 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-swss that referenced this pull request Sep 6, 2024
…3247)

@mssonicbld (Collaborator)

Cherry-pick PR to 202405: #3281

mssonicbld pushed a commit that referenced this pull request Sep 7, 2024