Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update fabric link monitoring plan. #1013

Merged
merged 5 commits into from
Oct 4, 2023

Conversation

jfeng-arista
Copy link
Contributor

@jfeng-arista jfeng-arista commented Jun 6, 2022

Update "Fabric port support on Sonic" HLD

PRs related to updated sections are:

Repo PR title State Dependency
SONiC #5678 Update fabric test plan link Open No
sonic-utilities #2528 Add "show fabric reachability" command. Merged No
sonic-swss #2611 Revert "Revert "[voq][chassis]Add show fabric counters port/queue com Merged 2611
sonic-utilities #2499 Add asic id for linecards so "show fabric counters queue/port" can work Merged No
sonic-swss #2522 Add show fabric counters port/queue commands. Merged 2499
sonic-utilities #2717 Add fabric_ns to the ns_list when display_option is DISPLAY_ALL Merged No
sonic-mgmt #6620 Add tests to check fabric link status and reachability Merged 2717
sonic-buildimage #14170 Add support data for fabric monitoring in CONFIG_DB Merged No
sonic-utilities #2730 Add "config fabric port ..." commands and tests Merged 14170
sonic-buildimage #14282 Add Yang model for the fabric configs Merged 14170
sonic-buildimage #14390 Add Yang model for FABRIC_MONITOR|FABRIC_MONITOR_DATA Merged 14170
sonic-swss #2862 Remove fabric queue counters. Merged No
sonic-swss-common #808 Add fabric monitoring tables definitions. Merged 14170
sonic-sairedis #1301 Add fabric ports related calls for vs tests.. Open No
sonic-sairedis #16791 Add fabric port data for vs test, and start fabricmgrd in vs Open 1301
sonic-swss #2920 Add support for fabric monitor daemon. Open 808 16791
sonic-swss #TBD Start fabric manager daemon in swss containers TBD 808
sonic-mgmt #TBD Add tests for fabric manager cli and database TBD 808
sonic-utilities #2932 Add a flag of the fabric monitor feature in CONFIG_DB open 2730
sonic-swss #TBD Add fabric manager daemon handling for fabric monitor feature enable/disable flag TBD 2730
More PRs coming TBD TBD TBD

@jfeng-arista jfeng-arista marked this pull request as ready for review June 7, 2022 18:47
@kenneth-arista
Copy link

@arlakshm for awareness

@@ -171,6 +172,85 @@ PORT RxCells TxCells Crc Fec Corrected

In a later phase, a `show fabric status` command will be added to show the remote switch ID and link ID for each fabric link of an ASIC. This will be obtained from the SAI_PORT_ATTR_FABRIC_REACHABILITY port attribute of the fabric port. Note that for fabric links that do not have a link partner because of the configuration of the chassis, this will show the status as `down`. The status will also be `down` for fabric links that are down due to some other physical error. To identify links that are down due to error vs links that are not expected to be up because of the chassis connectivity, we need to build up a list of expected fabric connectivity for each ASIC. This can be computed ahead of time based on the vendor configuration and populated in the minigraph. This will be implemented in a later phase.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add if this is applicable to all line cards and sup, or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commands are for both linecards and fabric cards on sup. I updated the document with this information as well.

The following proposed CLI is used to show the traffic among fabric links on both fabric ASICs and forwarding ASICs.

```
> show fabric counters rate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add units. And add more rows to show the output is per link.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated document with this . thank you

#### 2.8.1.1 Cli commands

```
> config fabric port set [port_id] isolate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change the command config fabric port isolate <port_id>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated document . thank you

```

```
> config fabric port remove [port_id] isolate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change the command config fabric port unisolate <port_id>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated document . thank you

```
> config fabric port remove [port_id] isolate
```

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please elaborate what isolate operation does on the fabric port.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated document with what isolation means . thank you


#### 2.8.1.2 Monitoring algorithm

Instead of reacting to the counter changes, Orchagent adds a new poller and periodically polls status of all fabric links. By default, the total number of received cells, cells with crc errors, cells with uncorrectable errors are fetched from all serdes links periodically and the error rates are caculated using these numbers. If any one of the error rate is above the threshold for a number of consecutive polls, the link is identified as a unhealthy link. Then the link is automatically isolated to not distribute traffic.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the above section CLI command is defined to isolate a fabric link, in this section it is mentioned The link is automatically isolated to not distribute traffic. If the links isolated automatically, can elaborate the use case of the isolate CLIs ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question. The fabric inks are isolated automatically by the algorithm if they are unhealthy.

The two commands are kind of additional debug tool. The commands can be used to manually isolate and unisolate a fabric link and can help us debug on the system as well as force isolate a fabric link.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kenneth-arista , @jfeng-arista

According to this document, currently if fabric links are bad the links are isolated automatically by SDK. Can you confirm this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the monitoring algorithm detected the link, and considered it as unhealthy, the algorithm calls sdk api to isolate the link, users do not need issue any CLI.

#### 2.8.2.1 Cli command

```
> config fabric monitor capacity threshold <50-100>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add if the command is applicable to linecard or supervisor or both

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated document . thank you


#### 2.8.2.2 Monitoring algorithm

Orchagent will track the total number of fabric links that are isolated. Once the number of total operational fabric links is below a configured threshold, alert users with a system log.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will a log message be generated when the number of operational links on fabric asic goes below threshold? This message will be useful to know that the fabric asic is not working properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are focusing on Linecard side for now, and sys log when the fabric capacity is not enough to sustain linerate for forwarding ASIC.

We can extend some check in future for Fabric cards.

#### 2.8.1.1 Fabric link monitoring criteria

The fabric link monitoring algorithm checks two type of errors on a link: crc errors and uncorrectable errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these already defined in SAI port stats? Can we list the exact sai_port_stat_t type being used for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.
SAI_PORT_STAT_IF_IN_ERRORS is for crc erros
SAI_PORT_STAT_IF_IN_FEC_NOT_CORRECTABLE_FRAMES is for uncorrectable erros.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. Please add the above mapping in the doc?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also from SONiC point of view, it should monitor SAI counters. Can we confirm (add in doc) that it is SAI_PORT_STAT_IF_IN_ERRORS that are being monitored from counter db or not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the exact relationship between "#crcCells" and "SAI_PORT_STAT_IF_IN_ERRORS"? Is #crcCells a value that can be read from asic registers/SDK counters? E.g, we'll expect SAI_PORT_STAT_IF_IN_ERRORS increment by one if #crcCells increments by one?

@kenneth-arista
Copy link

@rlhui and @arlakshm, this PR has already been reviewed by the community. If there are no further comments, can we merge this?

@rlhui rlhui self-assigned this Feb 11, 2023
```

Besides the fabric link monitoring algorithm, the above two commands are added. The commands can be used to manually isolate and unisolate a fabric link ( i.e. take the link out of service and put the link back into service ). The two commands can help us debug on the system as well as force isolate a fabric link.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this command be done on supervisor and line cards? which SAI attribute will be used for this?

In this SAI PR opencomputeproject/SAI#1764, it is applicable for fabric switch only.

However, should we also need to isolate a link from line card , can this SAI attribute be used then?

Copy link
Contributor Author

@jfeng-arista jfeng-arista Mar 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commands can be used on both supervisor and Linecards for isolate a fabric port. SAI_PORT_ATTR_FABRIC_ISOLATE will be used for this.

opencomputeproject/SAI#1764 is different and that is NOT for link level isolation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which SAI attribute is needed to do this fabric port isolation then? Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SAI_PORT_ATTR_FABRIC_ISOLATE

@jfeng-arista jfeng-arista force-pushed the update_fabric_monitoring_plan branch from 70894b5 to 57b1dd0 Compare April 4, 2023 20:26
@kenneth-arista
Copy link

@rlhui can we get an approving review since all comments have been addressed?

@gechiang
Copy link
Contributor

gechiang commented Jul 8, 2023

@jfeng-arista There are many PRs to be picked up for this feature. Since each PR approval run at is own timeline, can you share if there are any dependency that we need to be aware of? Example of dependency concerns are something such as certain PR needs to go in first or else it may cause:

  1. build time issue
  2. run time dependency issue regardless of feature is being used

If there are dependency issue, please specify them clearly in this PR as well as in the corresponding PR for the dependency...
Thanks!


#### 2.8.1.2 Monitoring algorithm

Instead of reacting to the counter changes, Orchagent adds a new poller and periodically polls status of all fabric links. By default, the total number of received cells, cells with crc errors, cells with uncorrectable errors are fetched from all serdes links periodically and the error rates are calculated using these numbers. If any one of the error rates is above the threshold for a number of consecutive polls, the link is identified as an unhealthy link. Then the link is automatically isolated to not distribute traffic.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please confirm (add in doc) whether the automatic isolation of unhealthy link is in 202205 release

Copy link
Contributor Author

@jfeng-arista jfeng-arista Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Software based fabric link/capacity monitoring is not committed for 202205. The support will merge to master. The SAI support for the isolate operation is not present in 202205.

@abdosi abdosi merged commit 8b7cf5c into sonic-net:master Oct 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants