[Reclaiming buffer] Common code update #1996
Merged: liat-grozovik merged 9 commits into sonic-net:master from stephenxs:reclaim-buffer-base on Nov 22, 2021
Conversation
1. Load zero_profiles when the dynamic buffer manager starts. The buffer manager won't consume it for now; this is only to pass Azure CI.
2. Support removing a buffer pool.
3. Support exposing the maximum numbers of PGs and queues per port.
4. Support converting between a bitmap and a map string (see the sketch below).
Signed-off-by: Stephen Sun <stephens@nvidia.com>
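For item 4, a minimal sketch of converting between a bitmap of PG/queue indices and the range-style "map string" used in buffer keys (e.g. "3-4") is shown below. The helper names and the exact string format are assumptions for illustration, not the actual buffermgrd code.

```cpp
// Sketch only: hypothetical helpers illustrating a bitmap <-> "map string"
// conversion such as 0b11000 <-> "3-4". Not the actual buffermgrd code.
#include <cstdint>
#include <sstream>
#include <string>

// Convert a bitmap of indices (bit i set => index i present) into a
// compact range string such as "3-4" or "0,2-5".
std::string bitmapToMapString(uint32_t bitmap)
{
    std::ostringstream out;
    bool first = true;
    for (int i = 0; i < 32; )
    {
        if (!(bitmap & (1u << i))) { ++i; continue; }
        int start = i;
        while (i < 32 && (bitmap & (1u << i))) ++i;
        if (!first) out << ",";
        first = false;
        if (start == i - 1) out << start;
        else out << start << "-" << (i - 1);
    }
    return out.str();
}

// Convert a range string such as "3-4" back into a bitmap.
uint32_t mapStringToBitmap(const std::string &str)
{
    uint32_t bitmap = 0;
    std::istringstream in(str);
    std::string token;
    while (std::getline(in, token, ','))
    {
        auto dash = token.find('-');
        int lo = std::stoi(token.substr(0, dash));
        int hi = (dash == std::string::npos) ? lo : std::stoi(token.substr(dash + 1));
        for (int i = lo; i <= hi; ++i) bitmap |= (1u << i);
    }
    return bitmap;
}
```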
- Remove the corresponding flex counter when a buffer pool is removed.
- Check the maximum numbers of priority groups and queues of the port (a related sketch follows).
Signed-off-by: Stephen Sun <stephens@nvidia.com>
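As a rough illustration of how per-port maximums could be published to STATE_DB with the swss-common Table API, see the sketch below. The field names ("max_priority_groups", "max_queues") and the values are assumptions for illustration, and a reachable SONiC redis instance is assumed; this is not the actual orchagent code.

```cpp
// Sketch only: publishing per-port maximums to STATE_DB via swss-common.
// Field names and values are assumptions; requires a running SONiC redis.
#include <string>
#include <vector>

#include "dbconnector.h"
#include "table.h"

void exposePortMaximums(const std::string &port, int maxPgs, int maxQueues)
{
    // One connection per call keeps the sketch self-contained; a daemon
    // would normally hold a single long-lived connection.
    swss::DBConnector stateDb("STATE_DB", 0);
    swss::Table portTable(&stateDb, "PORT_TABLE");

    std::vector<swss::FieldValueTuple> fvs = {
        {"max_priority_groups", std::to_string(maxPgs)},
        {"max_queues", std::to_string(maxQueues)},
    };
    // The buffer manager can read these values back to know how many
    // PGs/queues exist when reclaiming buffer on admin-down ports.
    portTable.set(port, fvs);
}
```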
Don't call the SAI API to remove an item if it doesn't exist in the local cache, which means it hasn't been created in SAI yet. In most cases, such a request is just to notify orchagent "port ready".
Signed-off-by: Stephen Sun <stephens@nvidia.com>
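A minimal sketch of the guard described in the commit above, using a hypothetical local cache keyed by object name rather than the real orchagent data structures:

```cpp
// Sketch only: skip the SAI remove call for objects that were never created.
// The cache layout and names are hypothetical.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

using ObjectId = uint64_t;  // stand-in for sai_object_id_t

// Local cache of objects that were actually created in SAI.
std::unordered_map<std::string, ObjectId> g_createdObjects;

bool removeBufferObject(const std::string &key)
{
    auto it = g_createdObjects.find(key);
    if (it == g_createdObjects.end())
    {
        // Never created in SAI (e.g. the request only signalled "port ready"),
        // so there is nothing to remove; treat it as a successful no-op.
        std::cout << "Skip removing " << key << ": not in local cache\n";
        return true;
    }
    // ... call the SAI remove API with it->second here ...
    g_createdObjects.erase(it);
    return true;
}
```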
keboliu
previously approved these changes
Nov 15, 2021
neethajohn
reviewed
Nov 17, 2021
Signed-off-by: Stephen Sun <stephens@nvidia.com>
stephenxs
commented
Nov 17, 2021
Signed-off-by: Stephen Sun <stephens@nvidia.com>
/azpw run
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
neethajohn
previously approved these changes
Nov 17, 2021
neethajohn
reviewed
Nov 17, 2021
…n removing a pool Signed-off-by: Stephen Sun <stephens@nvidia.com>
/azpw run
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
neethajohn
approved these changes
Nov 18, 2021
keboliu
approved these changes
Nov 22, 2021
stephenxs
changed the title
[Reclaim buffer] Common code update
[Reclaiming buffer] Common code update
Nov 22, 2021
This was referenced Nov 22, 2021
neethajohn
pushed a commit
that referenced
this pull request
Nov 29, 2021
Signed-off-by: Stephen Sun stephens@nvidia.com

What I did
Reclaim the reserved buffer of unused ports for both the dynamic and traditional models. This is done by:
- Removing lossless priority groups on unused ports.
- Applying zero buffer profiles on the buffer objects of unused ports.
In the dynamic buffer model, the zero profiles are loaded from a JSON file and applied to APPL_DB if there are admin-down ports. The default buffer configuration will be configured on all ports, and the buffer manager will apply zero profiles on admin-down ports. In the static buffer model, the zero profiles are loaded by the buffer template.

Why I did it

How I verified it
Regression test and vs test.

Details if related
Static buffer model
- Remove the lossless buffer priority group if the port is admin-down and the buffer profile aligns with the speed and cable length of the port.
Dynamic buffer model
- Handle zero buffer pools and profiles:
  - buffermgrd: add a CLI option to load the JSON file for zero profiles (done in PR [Reclaiming buffer] Common code update #1996).
  - Load them from the JSON file into the buffer manager's internal data structure (done in PR [Reclaiming buffer] Common code update #1996).
  - Apply them to APPL_DB once there is at least one admin-down port. The zero profiles' names are recorded in the pool objects they reference; by doing so, the zero profile lists can be constructed according to the normal profile list, with one profile for each pool on the ingress/egress side. The zero profiles are then applied to the buffer objects of the port.
  - Unload them from APPL_DB once all ports are admin-up, since the zero pools and profiles are no longer referenced.
  - Remove the buffer pool counter id when a zero pool is removed. Now that a pool can be removed from the system, the watermark counter of the pool is removed ahead of the pool itself.
- Handle port admin status changes:
  - There is currently logic to remove the buffer priority groups of admin-down ports. This logic is reused and extended to all buffer objects, including BUFFER_QUEUE, BUFFER_PORT_INGRESS_PROFILE_LIST, and BUFFER_PORT_EGRESS_PROFILE_LIST.
  - When the port goes admin down, the normal profiles are removed from the port's buffer objects and the zero profiles, if provided, are applied to the port.
  - When the port comes admin up, the zero profiles, if applied, are removed from the port and the normal profiles are applied.
- Ports orchagent exposes the number of queues and priority groups to STATE_DB. The buffer manager can take advantage of these values to apply zero profiles on all the priority groups and queues of admin-down ports. If it is not necessary to apply zero profiles on all priority groups or queues on a certain platform, ids_to_reclaim can be customized in the JSON file.
- Handle all buffer tables, including BUFFER_PG, BUFFER_QUEUE, BUFFER_PORT_INGRESS_PROFILE_LIST, and BUFFER_PORT_EGRESS_PROFILE_LIST. Originally, only the BUFFER_PG table was cached in the dynamic buffer manager; now all tables are cached in order to apply zero profiles when a port is admin down and normal profiles when it comes back up. The key of such tables can contain a single port or a list of ports, like BUFFER_PG|Ethernet0|3-4 or BUFFER_PG|Ethernet0,Ethernet4,Ethernet8|3-4 (see the key-parsing sketch after this description). The existing logic for handling such keys in the BUFFER_PG table is reused and extended to all the tables.
- [Mellanox] Plugin to calculate buffer pool size: originally, buffers for the queue, buffer profile list, etc. were not reclaimed for admin-down ports, so they were reserved for all ports. Now they are reserved for admin-up ports only.
- Accelerate applying buffer tables to APPL_DB (an optimization on top of reclaiming buffer):
  - Don't apply buffer profiles or buffer objects to APPL_DB before the buffer pools are applied when the system is starting. This applies items in order from referenced items to referencing items and avoids buffer orchagent retries due to missing referenced items. It is still possible that referencing items are handled before referenced items; in that case there should not be any error message.
  - [Mellanox] Plugin to calculate buffer pool size: return the buffer pool sizes currently in APPL_DB if the pool sizes cannot be calculated because some information is missing. This typically happens at system start and accelerates pushing tables to APPL_DB.
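As referenced above, buffer table keys may carry a single port or a comma-separated list of ports plus an optional index range. A minimal key-parsing sketch is shown below; it assumes the table name has already been stripped from the key, and the struct and function names are illustrative, not the buffer manager's actual parser.

```cpp
// Sketch only: splitting a key like "Ethernet0,Ethernet4,Ethernet8|3-4"
// (table name already stripped) into its port list and index range.
#include <sstream>
#include <string>
#include <vector>

struct BufferKey
{
    std::vector<std::string> ports;  // e.g. {"Ethernet0", "Ethernet4", "Ethernet8"}
    std::string indexRange;          // e.g. "3-4"; empty for profile-list tables
};

BufferKey parseBufferKey(const std::string &key)
{
    BufferKey result;

    // Separate the port part from the optional PG/queue index range.
    auto sep = key.find('|');
    std::string portPart = key.substr(0, sep);
    if (sep != std::string::npos)
        result.indexRange = key.substr(sep + 1);

    // Split the comma-separated port list.
    std::istringstream in(portPart);
    std::string port;
    while (std::getline(in, port, ','))
        result.ports.push_back(port);

    return result;
}
```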
bocon13
pushed a commit
to pins/sonic-swss-public
that referenced
this pull request
Nov 30, 2021
This was referenced Dec 3, 2021
liat-grozovik
pushed a commit
that referenced
this pull request
Dec 7, 2021
This is to backport #1996 to 202012.
- What I did
  Common code update for reclaiming buffer.
  1. Load zero_profiles when the dynamic buffer manager starts. The buffer manager won't consume it for now; this is only to pass Azure CI. (A reading sketch follows below.)
  2. Support removing a buffer pool.
  3. Support exposing the maximum numbers of PGs and queues per port.
  4. Support converting between a bitmap and a map string.
  5. Change the log severity from ERROR to NOTICE when parsing a buffer profile from the buffer profile list fails. Typically this can be resolved by retrying; the severity of the similar log when parsing buffer PG and queue is already NOTICE.
- Why I did it
  To split the large PR into smaller ones and help pass CI.
- How I verified it
  vs test and regression test.
Signed-off-by: Stephen Sun stephens@nvidia.com
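For item 1 above, the sketch below shows one way a zero-profiles JSON file could be read at start-up. The file path, the schema (a list of {"TABLE|key": {field: value}} objects), and all names here are assumptions for illustration, not the actual buffermgrd option or file format.

```cpp
// Sketch only: reading a zero-profiles JSON file at start-up. Schema and
// names are assumptions, not the actual buffermgrd format.
#include <fstream>
#include <iostream>
#include <map>
#include <string>

#include <nlohmann/json.hpp>

// table|key -> {field -> value}
using ZeroProfiles = std::map<std::string, std::map<std::string, std::string>>;

ZeroProfiles loadZeroProfiles(const std::string &path)
{
    ZeroProfiles profiles;

    std::ifstream file(path);
    if (!file.is_open())
    {
        std::cerr << "Unable to open " << path << ", skipping zero profiles\n";
        return profiles;
    }

    // Assumed layout: a JSON array where each element is an object
    // mapping "TABLE|key" to its field/value pairs.
    nlohmann::json doc = nlohmann::json::parse(file);
    for (const auto &entry : doc)
    {
        for (auto it = entry.begin(); it != entry.end(); ++it)
        {
            for (auto field = it.value().begin(); field != it.value().end(); ++field)
                profiles[it.key()][field.key()] = field.value().get<std::string>();
        }
    }
    return profiles;
}
```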
qiluo-msft
pushed a commit
to sonic-net/sonic-buildimage
that referenced
this pull request
Dec 21, 2021
#### Why I did it
Update sonic-swss-common:
- 54879741 [202012][schema] Add vnet route tunnel and advertise network tables for state_db (sonic-net/sonic-swss-common#563)
- a5394f9d Update for BFD, default route table (sonic-net/sonic-swss-common#550)
Update sonic-swss:
- fbbe5bcc [202012][pfc_detect] fix RedisReply errors (sonic-net/sonic-swss#2078)
- 5762b0c2 [Reclaim buffer][202012] Reclaim unused buffer for dynamic buffer model (sonic-net/sonic-swss#1985)
- 33e9bd19 [Document][202012] Supply the missing ingress/egress port profile list in document (sonic-net/sonic-swss#2066)
- 1b6ffba1 [Reclaiming buffer][202012] Support reclaiming buffer in traditional buffer model (sonic-net/sonic-swss#2063)
- afb33f16 [202012] Update default route status to state DB (sonic-net/sonic-swss#2009) (sonic-net/sonic-swss#2067)
- b9c44f75 Common code update for reclaiming buffer (backport community PR sonic-net/sonic-swss#1996 to 202106/202012) (sonic-net/sonic-swss#2061)
- cf5182d8 [request parser] Allow request parser to parse multiple values
preetham-singh
pushed a commit
to preetham-singh/sonic-swss
that referenced
this pull request
Aug 6, 2022
What I did
Common code update for reclaiming buffer.
1. Load zero_profiles when the dynamic buffer manager starts. The buffer manager won't consume it for now; this is only to pass Azure CI.
2. Support removing a buffer pool.
3. Support exposing the maximum numbers of PGs and queues per port.
4. Support converting between a bitmap and a map string.
5. Change the log severity from ERROR to NOTICE when parsing a buffer profile from the buffer profile list fails. Typically this can be resolved by retrying; the severity of the similar log when parsing buffer PG and queue is already NOTICE.
Signed-off-by: Stephen Sun stephens@nvidia.com
Why I did it
To split large PR into smaller ones and help pass CI.
How I verified it
vs test and sonic-mgmt test.
Details if related