-
Notifications
You must be signed in to change notification settings - Fork 525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[synchronous mode] Add failure notification for SAI failures in synchronous mode #1596
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Block this PR, for discuss purpose only
orchagent/orch.h
Outdated
@@ -218,6 +218,11 @@ class Orch | |||
static void recordTuple(Consumer &consumer, const swss::KeyOpFieldsValuesTuple &tuple); | |||
|
|||
void dumpPendingTasks(std::vector<std::string> &ts); | |||
|
|||
/* Failure handling */ | |||
bool handleSaiCreateFailure(sai_status_t status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool [](start = 4, length = 4)
protected functions?
virtual functions (not sure if any will be used by a base pointer)? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the function as protected virtual.
orchagent/orch.cpp
Outdated
* 2. Develop fine-grain failure handling mechanisms specific and replace | ||
* this coarse handling in each orch. | ||
*/ | ||
SWSS_LOG_THROW("Encountered failure in Create operation, status: %d", status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SWSS_LOG_THROW [](start = 4, length = 14)
SWSS_LOG_THROW [](start = 4, length = 14)
In old async mode, https://github.com/Azure/sonic-swss/blob/master/orchagent/notifications.cpp#L20 on_switch_shutdown_request will be called. Do you want to mimic exact message and exit? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The message in on_switch_shutdown_request
is only "syncd stopped". Personally, I do not like this message very much for two reasons: 1. it does not carry much information; 2. It does not precisely describe the situation in the sync mode scenario (In sync mode, syncd hasn't really stopped at this point).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log message is good. Do you still want to mimic the exit behavior?
In reply to: 560594351 [](ancestors = 560594351)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using exit behavior instead.
orchagent/orch.cpp
Outdated
* this coarse handling in each orch. | ||
* 3. Take the type of sai api into consideration. | ||
*/ | ||
SWSS_LOG_THROW("Encountered failure in create operation, status: %d", status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
status [](start = 74, length = 6)
status [](start = 74, length = 6)
You may also log the api info. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make %s and use sai_serialize_status(status).c_str() for string status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added api info in the log and updated to use sai_serialize functions for api and status. It appears that cfgmgr also includes orch as a source, it is needed to include saimetadata library for cfgmgr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you tested any of this change ? since now it will throw exceptions on any failure
orchagent/orch.cpp
Outdated
* this coarse handling in each orch. | ||
* 3. Take the type of sai api into consideration. | ||
*/ | ||
SWSS_LOG_THROW("Encountered failure in set operation, status: %d", status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make %s and use sai_serialize_status(status).c_str() for string status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated as suggested.
orchagent/orch.cpp
Outdated
* this coarse handling in each orch. | ||
* 3. Take the type of sai api into consideration. | ||
*/ | ||
SWSS_LOG_THROW("Encountered failure in remove operation, status: %d", status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make %s and use sai_serialize_status(status).c_str() for string status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated as suggested.
a68fd47
to
7531ff3
Compare
I tested this change in a few artificially generated SAI failures. The goal of this change is to make sure we do no worse than async mode, which triggers orchagent exit in any SAI failures. Throwing exceptions on any failure is basically trying to follow that behavior (assuming failures coming from sairedis is rare and most likely inferring some bugs in implementations). |
5b9f638
to
61feaf7
Compare
lgtm and okay to merge In reply to: 571774867 [](ancestors = 571774867) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unblock
orchagent/orch.cpp
Outdated
@@ -685,6 +686,82 @@ Executor *Orch::getExecutor(string executorName) | |||
return NULL; | |||
} | |||
|
|||
bool Orch::handleSaiCreateStatus(sai_api_t api, sai_status_t status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
status [](start = 61, length = 6)
Considering this is an interface design, and future user may have extended requirement. Is it reasonable to add another optional parameter like void *context
, so user could pass any parameter they want and use it in their derived functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this comment. I added an optional parameter void *context
to the functions.
004d741
to
0ec052a
Compare
…ronous mode (sonic-net#1596) Add function to notify users in the presence of SAI failures by throwing exceptions and trigger swss restart in synchronous mode. And include this function in routeorch and neighorch. This is a part of the first step in the SAI failure handling in orchagent with synchronous mode: 1. Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures. 2. Add general failure handling logic by status. 3. Develop fine-grain failure handling mechanism for each orch to properly handle different SAI failures. This function aims to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures (on-par with asynchronous mode).
…ronous mode (sonic-net#1596) Add function to notify users in the presence of SAI failures by throwing exceptions and trigger swss restart in synchronous mode. And include this function in routeorch and neighorch. This is a part of the first step in the SAI failure handling in orchagent with synchronous mode: 1. Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures. 2. Add general failure handling logic by status. 3. Develop fine-grain failure handling mechanism for each orch to properly handle different SAI failures. This function aims to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures (on-par with asynchronous mode).
…ronous mode (sonic-net#1596) Add function to notify users in the presence of SAI failures by throwing exceptions and trigger swss restart in synchronous mode. And include this function in routeorch and neighorch. This is a part of the first step in the SAI failure handling in orchagent with synchronous mode: 1. Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures. 2. Add general failure handling logic by status. 3. Develop fine-grain failure handling mechanism for each orch to properly handle different SAI failures. This function aims to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures (on-par with asynchronous mode).
What I did Backport SAI failure handling related commits into the 202012 branch. The following is a list of backported commits: 941875a Deactivate mirror session only when session status is true in updateLagMember (#1666) be12482 Ignore ALREADY_EXIST error in FDB creation (#1815) c9c1aa2 Add failure handling for SAI get operations (#1768) 47b4276 [BufferOrch] Don't call SAI API for BUFFER_POOL/PROFILE handling in case the op is DEL and the SAI OID is NULL (#1786) db9238f Add failure notification for orchagent (#1665) fc8e43f [synchronous mode] Add failure notification for SAI failures in synchronous mode (#1596) Why I did it 202012 image needs to include failure handling mechanism for enough notification in the presence of SAI failures.
…ronous mode (sonic-net#1596) Add function to notify users in the presence of SAI failures by throwing exceptions and trigger swss restart in synchronous mode. And include this function in routeorch and neighorch. This is a part of the first step in the SAI failure handling in orchagent with synchronous mode: 1. Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures. 2. Add general failure handling logic by status. 3. Develop fine-grain failure handling mechanism for each orch to properly handle different SAI failures. This function aims to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures (on-par with asynchronous mode).
What I did
Add function to notify users in the presence of SAI failures by throwing exceptions and trigger swss restart in synchronous mode. And include this function in routeorch and neighorch.
This is a part of the first step in the SAI failure handling in orchagent with synchronous mode:
Why I did it
This function aims to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures (on-par with asynchronous mode).
How I verified it
Generate SAI failure in vs SAI. Verify that the function gets the error status and throws an exception.
Details if related
From the orchagent’s perspective, failures in sairedis may caused by three types of failure: a) SAI failure; b) communication failure; c) metadata failure. The function in this PR is on par with asynchronous mode in the first two failure scenarios. For metadata failures, triggering swss restart might be more extreme than asynchronous mode. However, probability of such a scenario is low and the presence of such failure is likely to indicate bugs in orchagent, we treat it the same as the first two scenarios.