Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[portmgr] Fixed the orchagent crash due to late arrival of notif #2431

Merged
merged 6 commits into from
Sep 12, 2022

Conversation

vivekrnv
Copy link
Contributor

@vivekrnv vivekrnv commented Aug 26, 2022

What I did

  1. Bulk write to APP_DB i.e. alias, lanes, speed must be read through one notification by orchagent during create_port
  2. Handled a race condition in portmgrd which tries to immediately apply a mtu/admin_status SET notif after a DEL causing it to crash

Why I did it

After the recent incremental config update changes, portmgrd will be writing to the APPL_DB PORT_TABLE (Except during startup where portyncd writes to the APP_DB).

This can lead to a problem during breakout/breakin and might cause orchagent to crash because portmgrd currently doesn't write to the APPL_DB at once but in steps.

In some cases (high probability, not always), orchagent decides to act on the notifcations immediately and since speed is not set, it tries to create_port with speed 0 and the orchagent crashes

2022-08-25.10:45:56.708466|PORT_TABLE:Ethernet232|DEL
2022-08-25.10:45:57.669534|PORT_TABLE:Ethernet232|SET|alias:etp30a|index:30
2022-08-25.10:45:57.670749|PORT_TABLE:Ethernet232|SET|lanes:232,233
Aug 25 13:45:57.838085 r-leopard-41 ERR syncd#SDK: [SAI_PORT.ERR] mlnx_sai_port.c[2809]- mlnx_port_speeds_merge: Port [Speed[0x00000000] - Interface_Type[0x0000ffff]] configuration is invalid.
Aug 25 13:45:57.838085 r-leopard-41 ERR syncd#SDK: [SAI_PORT.ERR] mlnx_sai_port.c[3086]- mlnx_port_update_speed: Failed to merge speeds, interface types and auto_neg.
Aug 25 13:45:57.838085 r-leopard-41 ERR syncd#SDK: [SAI_PORT.ERR] mlnx_sai_port.c[9666]- mlnx_create_port: Failed to update speed.
Aug 25 13:45:57.838674 r-leopard-41 ERR swss#orchagent: :- create: create status: SAI_STATUS_NOT_SUPPORTED
Aug 25 13:45:57.838674 r-leopard-41 ERR swss#orchagent: :- addPort: Failed to create port with the speed 0, rv:-2
Aug 25 13:45:57.838693 r-leopard-41 ERR swss#orchagent: :- handleSaiCreateStatus: Encountered failure in create operation, exiting orchagent, SAI API: SAI_API_PORT, status: SAI_STATUS_NOT_SUPPORTED
2022-08-25.10:45:57.671548|c|SAI_OBJECT_TYPE_PORT:oid:0x1000000000adb|SAI_PORT_ATTR_SPEED=0|SAI_PORT_ATTR_HW_LANE_LIST=2:232,233
2022-08-25.10:45:57.838523|E|SAI_STATUS_NOT_SUPPORTED

Also handled a race condition in portmgrd happens because it was trying to set mtu/admin_status immediately after the a DEL was sent. portmgrd raced to set the fields on a netdev which was by then deleted and thus portmgrd crashed

How I verified it

  1. Existing Unit Tests
  2. Tested on the switch:

Before the fix:

root@r-panther-23:/home/admin# config interface breakout Ethernet72 2x50G[25G,10G] -y -f

2022-08-26.03:57:26.451022|INTF_TABLE:Ethernet72|DEL
2022-08-26.03:57:27.208461|PORT_TABLE:Ethernet72|SET|alias:etp19a|index:19
2022-08-26.03:57:27.211031|PORT_TABLE:Ethernet72|SET|lanes:72,73|speed:50000
2022-08-26.03:57:27.852567|PORT_TABLE:Ethernet72|SET|mtu:9100|admin_status:down
2022-08-26.03:57:28.245470|PORT_TABLE:Ethernet72|SET|mtu:9100
2022-08-26.03:57:28.250334|PORT_TABLE:Ethernet72|SET|admin_status:down

After the fix:

2022-08-26.04:06:22.060385|PORT_TABLE:Ethernet72|SET|alias:etp19a|index:19|lanes:72,73|speed:50000|mtu:9100
2022-08-26.04:06:22.865474|PORT_TABLE:Ethernet72|SET|admin_status:down

Details if related

@vivekrnv vivekrnv requested a review from prsunny as a code owner August 26, 2022 20:52
@vivekrnv
Copy link
Contributor Author

Required for 202205

dgsudharsan
dgsudharsan previously approved these changes Aug 26, 2022
@vivekrnv
Copy link
Contributor Author

/azpw run Azure.sonic-swss

@mssonicbld
Copy link
Collaborator

/AzurePipelines run Azure.sonic-swss

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@liat-grozovik
Copy link
Collaborator

@vivekrnv could you please explain why we need sleep of 1 sec in test between each iteration? or why only 1 sec in enought?
@prsunny please help or assign someone to review.

@vivekrnv
Copy link
Contributor Author

@vivekrnv could you please explain why we need sleep of 1 sec in test between each iteration? or why only 1 sec in enought? @prsunny please help or assign someone to review.

ProducerStateTable doesn't actually SET/DEL to the APPL_DB table we intend to check. It's the ConsumerStateTable (i.e. orchagent in this case) that gets the notification and it reads the data from Temp hash and SET/DEL's the corresponding table in the APPL_DB. Thus, i thought of putting in some delay b/w CFG_DB checks and APPL_DB checks. But apparently that didn't help. The vs test still failed

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
@vivekrnv vivekrnv changed the title [portmgr] Fixed the Orchagent crash due to late arrival of notif [portmgr] Fixed the orchagent crash due to late arrival of notif Sep 6, 2022
dgsudharsan
dgsudharsan previously approved these changes Sep 6, 2022
@vivekrnv
Copy link
Contributor Author

vivekrnv commented Sep 6, 2022

@prsunny, @Junchao-Mellanox please review

cfgmgr/portmgr.cpp Show resolved Hide resolved
cfgmgr/portmgr.cpp Outdated Show resolved Hide resolved
cfgmgr/portmgr.cpp Show resolved Hide resolved
tests/port_dpb.py Outdated Show resolved Hide resolved
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
@liat-grozovik
Copy link
Collaborator

@prsunny as this is a degradation inteorudced with prev PRs i would appreciate if you can prioritize the review so we will have it on 202205

@liat-grozovik
Copy link
Collaborator

/azp run Azure.sonic-swss

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Collaborator

@prsunny prsunny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, @prgeor , can you please review?

@@ -2615,6 +2615,15 @@ bool PortsOrch::addPort(const set<int> &lane_set, uint32_t speed, int an, string
{
SWSS_LOG_ENTER();

if (!speed || lane_set.empty())
Copy link
Collaborator

@prsunny prsunny Sep 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this impact VS swss initialization? Seems these two are mandatory on create

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is why we are deferring until both of them are available. No, it doesn't. cfg used in VS has speed and lane_set defined, so shouldn't be a problem.

@dgsudharsan dgsudharsan merged commit 43cc486 into sonic-net:master Sep 12, 2022
vivekrnv added a commit to vivekrnv/sonic-swss that referenced this pull request Sep 14, 2022
…ic-net#2431)

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
Bulk write to APP_DB i.e. alias, lanes, speed must be read through one notification by orchagent during create_port
Handled a race condition in portmgrd which tries to immediately apply a mtu/admin_status SET notif after a DEL causing it to crash
dgsudharsan pushed a commit that referenced this pull request Sep 14, 2022
…) (#2451)

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
Bulk write to APP_DB i.e. alias, lanes, speed must be read through one notification by orchagent during create_port
Handled a race condition in portmgrd which tries to immediately apply a mtu/admin_status SET notif after a DEL causing it to crash
neethajohn pushed a commit that referenced this pull request Mar 20, 2023
…ication (#2704)

What I did
Handled a race condition in intfmgrd which tries to immediately apply an admin_status SET notif after a DEL causing it to crash

Why I did it
Ignores errors on the set admin_status command for subinterface when the subinterface state is not OK.

How I verified it
Unit tests

Details if related
This PR reference to older PR that fix the same issue in portmgr: #2431
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants