Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[synchronous mode] Add failure notification for SAI failures in synchronous mode #1596

Merged
merged 8 commits into from
Feb 24, 2021

Conversation

shi-su
Copy link
Contributor

@shi-su shi-su commented Jan 12, 2021

What I did
Add function to notify users in the presence of SAI failures by throwing exceptions and trigger swss restart in synchronous mode. And include this function in routeorch and neighorch.

This is a part of the first step in the SAI failure handling in orchagent with synchronous mode:

  1. Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.
  2. Add general failure handling logic by status.
  3. Develop fine-grain failure handling mechanism for each orch to properly handle different SAI failures.

Why I did it
This function aims to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures (on-par with asynchronous mode).

How I verified it
Generate SAI failure in vs SAI. Verify that the function gets the error status and throws an exception.

vlab-01 ERR syncd#syncd: :- create: create status: SAI_STATUS_FAILURE
vlab-01 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_BULK_CREATE failed in syncd mode: SAI_STATUS_FAILURE
vlab-01 ERR swss#orchagent: :- addRoutePost: Failed to create route 201.0.0.128/25 with next hop(s) 10.0.0.57@PortChannel0001,10.0.0.59@PortChannel0002,10.0.0.61@PortChannel0003,10.0.0.63@PortChannel0004
Jan 29 02:09:48.865390 vlab-01 ERR swss#orchagent: :- handleSaiCreateStatus: Encountered failure in create operation, SAI API: SAI_API_ROUTE, status: SAI_STATUS_FAILURE
vlab-01 NOTICE swss#orchagent: :- uninitialize: begin
vlab-01 NOTICE swss#orchagent: :- uninitialize: begin

Details if related
From the orchagent’s perspective, failures in sairedis may caused by three types of failure: a) SAI failure; b) communication failure; c) metadata failure. The function in this PR is on par with asynchronous mode in the first two failure scenarios. For metadata failures, triggering swss restart might be more extreme than asynchronous mode. However, probability of such a scenario is low and the presence of such failure is likely to indicate bugs in orchagent, we treat it the same as the first two scenarios.

Copy link
Contributor

@qiluo-msft qiluo-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Block this PR, for discuss purpose only

orchagent/orch.h Outdated
@@ -218,6 +218,11 @@ class Orch
static void recordTuple(Consumer &consumer, const swss::KeyOpFieldsValuesTuple &tuple);

void dumpPendingTasks(std::vector<std::string> &ts);

/* Failure handling */
bool handleSaiCreateFailure(sai_status_t status);
Copy link
Contributor

@qiluo-msft qiluo-msft Jan 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool [](start = 4, length = 4)

protected functions?
virtual functions (not sure if any will be used by a base pointer)? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the function as protected virtual.

* 2. Develop fine-grain failure handling mechanisms specific and replace
* this coarse handling in each orch.
*/
SWSS_LOG_THROW("Encountered failure in Create operation, status: %d", status);
Copy link
Contributor

@qiluo-msft qiluo-msft Jan 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SWSS_LOG_THROW [](start = 4, length = 14)

SWSS_LOG_THROW [](start = 4, length = 14)

In old async mode, https://github.com/Azure/sonic-swss/blob/master/orchagent/notifications.cpp#L20 on_switch_shutdown_request will be called. Do you want to mimic exact message and exit? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message in on_switch_shutdown_request is only "syncd stopped". Personally, I do not like this message very much for two reasons: 1. it does not carry much information; 2. It does not precisely describe the situation in the sync mode scenario (In sync mode, syncd hasn't really stopped at this point).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log message is good. Do you still want to mimic the exit behavior?


In reply to: 560594351 [](ancestors = 560594351)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using exit behavior instead.

* this coarse handling in each orch.
* 3. Take the type of sai api into consideration.
*/
SWSS_LOG_THROW("Encountered failure in create operation, status: %d", status);
Copy link
Contributor

@qiluo-msft qiluo-msft Jan 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

status [](start = 74, length = 6)

status [](start = 74, length = 6)

You may also log the api info. #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make %s and use sai_serialize_status(status).c_str() for string status

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added api info in the log and updated to use sai_serialize functions for api and status. It appears that cfgmgr also includes orch as a source, it is needed to include saimetadata library for cfgmgr.

Copy link
Contributor

@kcudnik kcudnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you tested any of this change ? since now it will throw exceptions on any failure

* this coarse handling in each orch.
* 3. Take the type of sai api into consideration.
*/
SWSS_LOG_THROW("Encountered failure in set operation, status: %d", status);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make %s and use sai_serialize_status(status).c_str() for string status

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated as suggested.

* this coarse handling in each orch.
* 3. Take the type of sai api into consideration.
*/
SWSS_LOG_THROW("Encountered failure in remove operation, status: %d", status);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make %s and use sai_serialize_status(status).c_str() for string status

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated as suggested.

@shi-su
Copy link
Contributor Author

shi-su commented Jan 29, 2021

Did you tested any of this change ? since now it will throw exceptions on any failure

I tested this change in a few artificially generated SAI failures. The goal of this change is to make sure we do no worse than async mode, which triggers orchagent exit in any SAI failures. Throwing exceptions on any failure is basically trying to follow that behavior (assuming failures coming from sairedis is rare and most likely inferring some bugs in implementations).

@qiluo-msft
Copy link
Contributor

lgtm and okay to merge


In reply to: 571774867 [](ancestors = 571774867)

qiluo-msft
qiluo-msft previously approved these changes Feb 4, 2021
Copy link
Contributor

@qiluo-msft qiluo-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unblock

kcudnik
kcudnik previously approved these changes Feb 9, 2021
@qiluo-msft qiluo-msft marked this pull request as ready for review February 9, 2021 20:08
@@ -685,6 +686,82 @@ Executor *Orch::getExecutor(string executorName)
return NULL;
}

bool Orch::handleSaiCreateStatus(sai_api_t api, sai_status_t status)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

status [](start = 61, length = 6)

Considering this is an interface design, and future user may have extended requirement. Is it reasonable to add another optional parameter like void *context, so user could pass any parameter they want and use it in their derived functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this comment. I added an optional parameter void *context to the functions.

@shi-su shi-su dismissed stale reviews from kcudnik and qiluo-msft via 004d741 February 12, 2021 00:42
@shi-su shi-su merged commit 1d33881 into sonic-net:master Feb 24, 2021
DavidZagury pushed a commit to DavidZagury/sonic-swss that referenced this pull request Mar 4, 2021
…ronous mode (sonic-net#1596)

Add function to notify users in the presence of SAI failures by throwing exceptions and trigger swss restart in synchronous mode. And include this function in routeorch and neighorch.

This is a part of the first step in the SAI failure handling in orchagent with synchronous mode:
    1. Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.
    2. Add general failure handling logic by status.
    3. Develop fine-grain failure handling mechanism for each orch to properly handle different SAI failures.

This function aims to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures (on-par with asynchronous mode).
shi-su added a commit to shi-su/sonic-swss that referenced this pull request Aug 17, 2021
…ronous mode (sonic-net#1596)

Add function to notify users in the presence of SAI failures by throwing exceptions and trigger swss restart in synchronous mode. And include this function in routeorch and neighorch.

This is a part of the first step in the SAI failure handling in orchagent with synchronous mode:
    1. Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.
    2. Add general failure handling logic by status.
    3. Develop fine-grain failure handling mechanism for each orch to properly handle different SAI failures.

This function aims to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures (on-par with asynchronous mode).
shi-su added a commit to shi-su/sonic-swss that referenced this pull request Aug 17, 2021
…ronous mode (sonic-net#1596)

Add function to notify users in the presence of SAI failures by throwing exceptions and trigger swss restart in synchronous mode. And include this function in routeorch and neighorch.

This is a part of the first step in the SAI failure handling in orchagent with synchronous mode:
    1. Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.
    2. Add general failure handling logic by status.
    3. Develop fine-grain failure handling mechanism for each orch to properly handle different SAI failures.

This function aims to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures (on-par with asynchronous mode).
shi-su added a commit that referenced this pull request Aug 18, 2021
What I did
Backport SAI failure handling related commits into the 202012 branch. The following is a list of backported commits:

941875a Deactivate mirror session only when session status is true in updateLagMember (#1666)
be12482 Ignore ALREADY_EXIST error in FDB creation (#1815)
c9c1aa2 Add failure handling for SAI get operations (#1768)
47b4276 [BufferOrch] Don't call SAI API for BUFFER_POOL/PROFILE handling in case the op is DEL and the SAI OID is NULL (#1786)
db9238f Add failure notification for orchagent (#1665)
fc8e43f [synchronous mode] Add failure notification for SAI failures in synchronous mode (#1596)

Why I did it
202012 image needs to include failure handling mechanism for enough notification in the presence of SAI failures.
raphaelt-nvidia pushed a commit to raphaelt-nvidia/sonic-swss that referenced this pull request Oct 5, 2021
…ronous mode (sonic-net#1596)

Add function to notify users in the presence of SAI failures by throwing exceptions and trigger swss restart in synchronous mode. And include this function in routeorch and neighorch.

This is a part of the first step in the SAI failure handling in orchagent with synchronous mode:
    1. Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.
    2. Add general failure handling logic by status.
    3. Develop fine-grain failure handling mechanism for each orch to properly handle different SAI failures.

This function aims to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures (on-par with asynchronous mode).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants