Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sonic-swss]: Added flap counter and last flap time functionality. #928

Open
wants to merge 49 commits into
base: master
Choose a base branch
from

Conversation

prprakas
Copy link

Number of times a link got reset(up/down) has been added under this change, also the last flap timestamp for the link has also been added, they will be updated once the oper status of any port changes. 2 additional fields have been added to appl_db i.e. "flap_counter" and "last_flap_time" which willbe updated when "oper_status" is updated.

Signed-off-by: Prem Prakash prprakash@linkedin.com

What I did
Added flap counter and last flap timestamp feature

Why I did it
This would help in solving operational issues faster. Instead of relying on the logs to figure out a bad flapper we can see this feature under status command.

How I verified it
Tested it by building up the docker-orchagent.gz after my change, changes were required on sonic-utilities repo as well to see the output, I will raise a parallel PR for that.

flap_output.txt

@msftclas
Copy link

msftclas commented Jun 12, 2019

CLA assistant check
All CLA requirements met.

@zhenggen-xu
Copy link
Collaborator

retest this please

5 similar comments
@prprakas
Copy link
Author

retest this please

@prprakas
Copy link
Author

retest this please

@prprakas
Copy link
Author

retest this please

@prprakas
Copy link
Author

retest this please

@prprakas
Copy link
Author

retest this please

@prprakas
Copy link
Author

Please review the changes.

Copy link
Contributor

@stcheng stcheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this enhancement. This piece of information is definitely an important one that is missing in the current SONiC system. I have the following suggestions that open for discussion:

  1. since the flap counter is initialized as 0, and there is no other components editing this variable, is it okay to store the number in orchagent without querying the value from the database to get the old value?
  2. could you add the flap_counter and last_flap_ts as variables for the struct Port?

While I do have some concerns on whether to put these pieces of information into which database, since the current operational status is already stored in the application database, I am fine with appending these two entries there for now. But we could also discuss on whether to migrate the information into state database.

Copy link
Contributor

@stcheng stcheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please check my comments

orchagent/portsorch.cpp Show resolved Hide resolved
{
SWSS_LOG_ENTER();

time_t now = time(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the getTimestamp() function instead

{
return;
}
updateDbPortLastFlapTime(new_tuples);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move these two functions into updatePortOperStatus function instead of here.
under the two lines:

updateDbPortOperStatus(port, status);
port.m_oper_status = status;

@@ -2500,6 +2559,8 @@ bool PortsOrch::initializePort(Port &port)
{
port.m_oper_status = SAI_PORT_OPER_STATUS_DOWN;
}
m_portTable->hset(port.m_alias, "flap_counter", "0");
m_portTable->hset(port.m_alias, "last_flap", "N/A");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should go to state db?

@@ -1239,14 +1241,71 @@ bool PortsOrch::setHostIntfsOperStatus(const Port& port, bool isUp) const
return true;
}

void PortsOrch::updateDbPortFlapCounter(const Port& port, vector<FieldValueTuple>& old_tuples, vector<FieldValueTuple>& new_tuples) const
{
SWSS_LOG_ENTER();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tab = 4

Copy link
Contributor

@lguohan lguohan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use redis incr and format your changes.

@stcheng
Copy link
Contributor

stcheng commented Jul 26, 2019

redis incr is nice but it requires much more changes. i suggest a design review first

@lguohan
Copy link
Contributor

lguohan commented Jul 26, 2019

how to move forward on this one? schedule a call to discuss this feature? @prprakas to schedule?

@prprakas
Copy link
Author

@lguohan @stcheng Yeah we can schedule a call to discuss on this. I am based out of Bangalore, India so let me know what is the good time to talk to you

@prprakas
Copy link
Author

prprakas commented Aug 8, 2019

@lguohan will add a vs-test for this, I think that should be good to move ahead with this as per the communication from @zhenggen-xu

@lguohan
Copy link
Contributor

lguohan commented Sep 6, 2019

@prprakas , can you fix simply comments we have requested, also can you share the ETA for adding vs test?

  - Added natmgr and natorch changes.
  - Added nat Zone related changes.
  - Added Warm reboot changes.

  Signed-off-by: Akhilesh Samineni <akhilesh.samineni@broadcom.com>
 Signed-off-by: Akhilesh Samineni <akhilesh.samineni@broadcom.com>
  Signed-off-by: Akhilesh Samineni akhilesh.samineni@broadcom.com
Signed-off-by: Akhilesh Samineni <akhilesh.samineni@broadcom.com>
Signed-off-by: Akhilesh Samineni <akhilesh.samineni@broadcom.com>
AkhileshSamineni and others added 25 commits December 10, 2019 07:28
Signed-off-by: Akhilesh Samineni <akhilesh.samineni@broadcom.com>
Signed-off-by: Akhilesh Samineni <akhilesh.samineni@broadcom.com>
Signed-off-by: Akhilesh Samineni <akhilesh.samineni@broadcom.com>
Signed-off-by: Akhilesh Samineni <akhilesh.samineni@broadcom.com>
Signed-off-by: Akhilesh Samineni <akhilesh.samineni@broadcom.com>
…n_sonic

Natmgrd changes in sonic-swss sub module to support NAT feature.
…sonic-net#1156)

* [aclorch] Validate that provided IN/OUT_PORTS are physical interfaces
- Add a check during the rule validation step to make sure provided IN/OUT_PORTS rules only include physical interfaces
- Add vs test cases for invalid IN/OUT_PORTS inputs

Signed-off-by: Danny Allen <daall@microsoft.com>

* Clarify test names

* Add extra delay for table teardown

* Check if new tests are causing tests to fail

* Check if new tests are causing tests to fail

* Check if ProducerStateTable is having problems

* Make delete ACL table test more strict

* Remove invalid interface test cases

* Undo change to delete test

* Undo change to delete test
…ges_in_sonic

Natsyncd changes in sonic-swss submodule to support NAT feature.
…es_in_sonic

Orchagent changes in sonic-swss submodule to support NAT feature.
…t and queue stats (sonic-net#1170)

- Updates portsorch to use FlexCounterManager for port and queue stats instead of direct redis operations
- Updates FlexCounterManager to allow clients to enable the group in the constructor, for convenience
- Updates the makefile for cfgmgr to include new flex_counter dependency from portsorch

Signed-off-by: Danny Allen <daall@microsoft.com>
Signed-off-by: Danny Allen <daall@microsoft.com>
…asons (sonic-net#1173)

- Refactor drop reason capability query to trim SAI prefixes
- Store device capabilities in orchagent to perform safety checks

Fixes sonic-net#1136 - Rather than depending on each ASIC vendor to follow the same error handling doctrine, this PR validates HW support in orchagent, which should be more reliable.

Related to sonic-net/sonic-utilities#785 - In order to validate user input, we need to remove the SAI prefixes before we store the results. This removes the need for the CLI to perform these checks.

Signed-off-by: Danny Allen <daall@microsoft.com>
…#1141)

Clear loopback interfaces in kernel at intfmgrd starting except warmstart.
Clear vrf interfaces in kernel at vrfmgrd starting except warmstart.
…warm reboot" issue (sonic-net#1171)

* during warm-reboot teamd need to use same system-id before warm-reboot

Signed-off-by: shine.chen <shine.chen@nephosinc.com>

* refine per review comment

Signed-off-by: shine.chen <shine.chen@mediatek.com>

* skip saved pdu on negotiation stage

Signed-off-by: shine.chen <shine.chen@mediatek.com>

* reformat teammgr file
Signed-off-by: Danny Allen <daall@microsoft.com>
…onic-net#1158)

Otherwise, we can see that DUT after warm reboot sends ARP
packets with management IP (10.112.71.4)

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
…onic-net#1166)

Originally portsorch was designed to remove LAG member from LAG when
member gets disabled by teamd. This could lead to potential issues
including flood to that port and loops, since after removal member
becomes a switch port.

Now, portsorch will make use of SAI_LAG_MEMBER_ATTR_INGRESS_DISABLE and SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE
to control collection/distribution through that LAG member port.
With this new flow, teammgrd and teamsyncd are candidates to be refactored, especially teamsyncd
warm reboot logic, since now we don't need to compare old, new lags and lag members.
Teamsyncd's job is simply to signal "status" field update without waiting for reconciliation timer to expire.

e.g. one port channel went down on peer:

```
admin@arc-switch1025:~$ show interfaces portchannel
Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available,
       S - selected, D - deselected, * - not synced
  No.  Team Dev         Protocol     Ports
-----  ---------------  -----------  --------------
 0001  PortChannel0001  LACP(A)(Up)  Ethernet112(S)
 0002  PortChannel0002  LACP(A)(Up)  Ethernet116(S)
 0003  PortChannel0003  LACP(A)(Up)  Ethernet120(S)
 0004  PortChannel0004  LACP(A)(Dw)  Ethernet124(D)
admin@arc-switch1025:~$ docker exec -it syncd sx_api_lag_dump.py
LAG Members Table
===============================================================================================================
| SWID| LAG Logical Port| LAG Oper State| PVID| Member Port ID| Port Label| Collector| Distributor| Oper State|
===============================================================================================================
|    0        0x10000000              UP     1|        0x11900|         29|    Enable|      Enable|         UP|
===============================================================================================================
|    0        0x10000100              UP     1|        0x11b00|         30|    Enable|      Enable|         UP|
===============================================================================================================
|    0        0x10000200              UP     1|        0x11d00|         31|    Enable|      Enable|         UP|
===============================================================================================================
|    0        0x10000300            DOWN     1|        0x11f00|         32|   Disable|     Disable|         UP|
===============================================================================================================
```

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
…ite (sonic-net#1185)

Signed-off-by: Danny Allen <daall@microsoft.com>
InitSystemAcl table is removed.
ECN ACL rules: DSCP 0/8 mapping to yellow are removed.

Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
@lguohan
Copy link
Contributor

lguohan commented Jan 29, 2020

retest this please

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.