
Remove redundant calls to get child scheduler group during initialization #1965

Merged: 3 commits into sonic-net:master from fix-redundant-scheduler-get on Nov 17, 2021

Conversation

@stephenxs (Collaborator) commented Oct 17, 2021

What I did

The QoS orchagent calls SAI APIs to get the number of child scheduler groups and then initializes them.
After that, the child scheduler groups vector is non-empty, which indicates that the child scheduler groups have been initialized and prevents the QoS orchagent from calling the SAI get APIs again.
However, on some platforms some scheduler groups have no child groups, so the child scheduler groups vector stays empty. This causes the QoS orchagent to call the SAI get API every time the scheduler group is handled, which wastes a lot of time, especially during fast reboot.

An extra flag indicating whether the child groups have been initialized is introduced to avoid redundant calls.
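For illustration, a minimal sketch of the guard pattern described above. The struct and member names (`SchedulerGroupInfo`, `child_groups_initialized`, `ensureChildGroups`) are hypothetical stand-ins, not the actual orchagent code:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for SAI's 64-bit object id type.
using sai_object_id_t = std::uint64_t;

struct SchedulerGroupInfo
{
    std::vector<sai_object_id_t> child_groups; // may legitimately stay empty
    bool child_groups_initialized = false;     // the extra flag this PR introduces
};

// Before: an empty child_groups vector was treated as "not yet initialized",
// so groups with zero children were re-queried on every pass.
// After: the flag records that the SAI get was already performed, even when
// the returned child list was empty.
void ensureChildGroups(SchedulerGroupInfo &group)
{
    if (group.child_groups_initialized)
    {
        return; // no redundant SAI get, even for childless groups
    }

    // ... query SAI for SAI_SCHEDULER_GROUP_ATTR_CHILD_COUNT /
    //     SAI_SCHEDULER_GROUP_ATTR_CHILD_LIST and fill child_groups ...

    group.child_groups_initialized = true;
}
```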

Signed-off-by: Stephen Sun <stephens@nvidia.com>

Why I did it

To optimize QoS orchagent performance during initialization especially for fast reboot.

How I verified it

It can be covered by the existing regression test.

Details if related

I ran a pair of tests comparing the time to handle scheduler and child scheduler groups between the old and the new (optimized) versions. The dump and sairedis.record files are attached.

| | Start | Finish | Number of calls | Duration (s) | FR longest down time (s) |
|-------------|-----------------|-----------------|-----------------|--------------|---------------------------|
| Old version | 14:01:45.548761 | 14:01:48.337098 | 3950 | 2.788337 | 28.804206 |
| New version | 14:26:14.613581 | 14:26:16.016768 | 1598 | 1.403187 | 27.643941 |
| Difference | | | -2352 | -1.38515 | -1.160265 |

sairedis.rec.old.log
sairedis.rec.new.log
sonic_dump_mtbc-sonic-03-2700_20211104_145113.tar.gz

@stephenxs (Collaborator, Author)

/azpw run

@mssonicbld (Collaborator)

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

…tion

The QoS orchagent calls SAI APIs to get the number of child scheduler groups
and then initializes them.
After that, the child scheduler groups vector is non-empty, which indicates
that the child scheduler groups have been initialized and prevents the QoS
orchagent from calling the SAI APIs again.
However, on the Mellanox platform some scheduler groups have no child groups,
leaving the child scheduler groups vector empty.
This causes the QoS orchagent to call the SAI API each time the scheduler
group is handled, which wastes a lot of time, especially during fast reboot.

An extra flag indicating whether the child groups have been initialized
is introduced to avoid the redundant calls.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs (Collaborator, Author)

The ARM build failed due to the following error, which does not look related to the change in this PR:

2021-10-29T20:50:13.8907727Z g++ -DHAVE_CONFIG_H -I. -I.. -I ../lib -I .. -I ../warmrestart -I flex_counter -I debug_counter -I pbh -g -DNDEBUG  -std=c++14 -Wall -fPIC -Wno-write-strings -I/usr/include/swss -I/usr/include/libnl3 -Werror -Wno-reorder -Wcast-align -Wcast-qual -Wconversion -Wdisabled-optimization -Wextra -Wfloat-equal -Wformat=2 -Wformat-nonliteral -Wformat-security -Wformat-y2k -Wimport -Winit-self -Winvalid-pch -Wlong-long -Wmissing-field-initializers -Wmissing-format-attribute -Wno-aggregate-return -Wno-padded -Wno-switch-enum -Wno-unused-parameter -Wpacked -Wpointer-arith -Wredundant-decls -Wstack-protector -Wstrict-aliasing=3 -Wswitch -Wswitch-default -Wunreachable-code -Wunused -Wvariadic-macros -Wno-switch-default -Wno-long-long -Wno-redundant-decls -I /usr/include/sai -Wdate-time -D_FORTIFY_SOURCE=2  -g -O2 -fdebug-prefix-map=/__w/3/s=. -fstack-protector-strong -Wformat -Werror=format-security -c -o orchagent-orch.o `test -f 'orch.cpp' || echo './'`orch.cpp
2021-10-29T20:51:40.9023022Z In file included from /usr/include/c++/8/vector:69,
2021-10-29T20:51:40.9023846Z                  from /usr/include/swss/dbconnector.h:5,
2021-10-29T20:51:40.9024498Z                  from orchdaemon.h:4,
2021-10-29T20:51:40.9025107Z                  from main.cpp:25:
2021-10-29T20:51:40.9027854Z /usr/include/c++/8/bits/vector.tcc: In member function 'void std::vector<_Tp, _Alloc>::_M_realloc_insert(std::vector<_Tp, _Alloc>::iterator, _Args&& ...) [with _Args = {const _sai_attribute_t&}; _Tp = _sai_attribute_t; _Alloc = std::allocator<_sai_attribute_t>]':
2021-10-29T20:51:40.9030599Z /usr/include/c++/8/bits/vector.tcc:413:7: note: parameter passing for argument of type 'std::vector<_sai_attribute_t>::iterator' {aka '__gnu_cxx::__normal_iterator<_sai_attribute_t*, std::vector<_sai_attribute_t> >'} changed in GCC 7.1
2021-10-29T20:51:40.9031950Z        vector<_Tp, _Alloc>::
2021-10-29T20:51:40.9032497Z        ^~~~~~~~~~~~~~~~~~~

@liat-grozovik (Collaborator)

@prsunny can you please help with the swss failures?
@kcudnik this PR comes from debugging done on fast reboot; the improvement reduces qosorch processing time in any boot, but especially in fast reboot.

@stephenxs please add how much time it takes for qosorch to finish the same work with this fix and without it.

@stephenxs (Collaborator, Author)

/azpw run

@mssonicbld (Collaborator)

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@keboliu (Collaborator) commented Nov 15, 2021

/azpw run

1 similar comment
@stephenxs (Collaborator, Author)

/azpw run

@mssonicbld (Collaborator)

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

keboliu previously approved these changes Nov 15, 2021
@liat-grozovik requested review from kcudnik and removed the request for neethajohn, November 15, 2021 11:13
kcudnik previously approved these changes Nov 15, 2021
if (child_group_id == queue_id)
{
SWSS_LOG_INFO("Found group id %lx for port %s queue %lx", group_id, port.m_alias.c_str(), queue_id);
Review comment (Collaborator)

There are too many logs added to the code; could you please revisit them? Also, the compilation seems to be failing for the armhf build. Please print the object ids using PRIx64; refer to the existing logs.
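For context, a hedged, self-contained illustration of what the reviewer is asking for, using plain printf in place of SWSS_LOG_INFO (which is also printf-style): `%lx` expects an unsigned long, which is only 32 bits on armhf, while `sai_object_id_t` is 64 bits, so the format check fails the build under -Werror. `PRIx64` from `<cinttypes>` expands to the correct length modifier on every platform. The id values below are made up.

```cpp
#include <cinttypes>  // PRIx64
#include <cstdint>
#include <cstdio>

int main()
{
    // sai_object_id_t is a 64-bit type; these values are illustrative only.
    std::uint64_t group_id = 0x100000000ULL;
    std::uint64_t queue_id = 0x15000000000ULL;

    // "%lx" mismatches a 64-bit argument on 32-bit armhf and -Wformat (with
    // -Werror) turns that into a build failure; PRIx64 is portable.
    std::printf("Found group id 0x%" PRIx64 " queue 0x%" PRIx64 "\n",
                group_id, queue_id);
    return 0;
}
```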

@stephenxs (Collaborator, Author) replied

Removed all the logs that were added but are not related to a code change in this PR; I tend to keep the rest.
Thank you for the comment on the ARM build failure; that is fixed as well.

@stephenxs (Collaborator, Author) replied

@prsunny can you please review it? Comments fixed, thanks.

- Fix compilation errors on ARM
- Remove some redundant log messages

Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs dismissed stale reviews from kcudnik and keboliu via 8e7c34f, November 16, 2021 04:18
Signed-off-by: Stephen Sun <stephens@nvidia.com>
@liat-grozovik merged commit 103fdf0 into sonic-net:master on Nov 17, 2021
@stephenxs deleted the fix-redundant-scheduler-get branch, November 17, 2021 06:23
arlakshm pushed a commit that referenced this pull request Nov 18, 2021
…tion (#1965)

qiluo-msft pushed a commit that referenced this pull request Nov 20, 2021
…tion (#1965)

EdenGri pushed a commit to EdenGri/sonic-swss that referenced this pull request Feb 28, 2022
What I did
Add CLI for configuring buffer profiles for queues

How to verify it
Run unit test to verify the logic

New command output (if the output of a command-line utility has changed)
config interface buffer queue add
config interface buffer queue set
config interface buffer queue remove

Signed-off-by: Stephen Sun <stephens@nvidia.com>