sonic-cfggen is consuming a lot of CPU during switch startup #4553

Closed
stepanblyschak opened this issue May 7, 2020 · 8 comments

@stepanblyschak
Collaborator

Description

During switch bootup, sonic-cfggen is called over 100 times from different places in different SONiC containers. It consumes a lot of CPU, mainly because the jinja2 and natsort Python packages compile a lot of regular expressions at import time. This makes containers start very slowly and affects cold/fast/warm boot timings.
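
To put a rough number on the per-call cost described above, here is a minimal sketch that times a fresh interpreter importing a given module. The helper name `fresh_import_seconds` is hypothetical, and the stdlib module names are stand-ins so the script runs anywhere; on a switch you would pass jinja2 or natsort instead. The "over 100 calls" figure is just the single-call time multiplied out, not a measurement from this issue.

```python
import subprocess
import sys
import time


def fresh_import_seconds(module: str, repeats: int = 3) -> float:
    """Return the best wall-clock time to start Python and import `module`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # A new interpreter per run, so nothing is already cached in sys.modules.
        subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Illustrative module names (assumption); swap in jinja2 or natsort where installed.
    for name in ("re", "json"):
        cost = fresh_import_seconds(name)
        print(f"{name}: {cost:.3f}s per call, ~{cost * 100:.1f}s over 100 calls")
```

For a per-module breakdown inside a single process, `python3 -X importtime -c "import <module>"` (Python 3.7+) prints cumulative import times to stderr.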

Steps to reproduce the issue:
No specific steps; just perform any kind of reload/reboot and start a profiling tool (bootchart, perf and flamegraphs).

Describe the results you received:
sonic-cfggen is very CPU-intensive, yet it is used everywhere, which causes a slow start.

Fast boot suffers because the platform SDK may not be able to perform switch init and reconfiguration fast enough if other CPU-intensive tasks are running in parallel.
Fast/warm boot suffers because switch control plane downtime is increased.

Describe the results you expected:
sonic-cfggen should be optimized. As more templates are added, generating them will delay other tasks in the system even further.
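
One possible direction, shown only as a sketch and not necessarily what any fix for this issue does: defer expensive imports until a code path actually needs them, so invocations that never render a template do not pay for them. `LazyModule` is a hypothetical helper, and `json` stands in for a heavier package such as jinja2.

```python
import importlib


class LazyModule:
    """Proxy that imports the real module on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Called only for attributes not found on the proxy itself,
        # i.e. the attributes of the wrapped module.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


# Stand-in for a heavy dependency (assumption); nothing is imported yet.
lazy_json = LazyModule("json")

if __name__ == "__main__":
    # The import happens here, on first use.
    print(lazy_json.dumps({"hostname": "switch1"}))
```

A plain function-level `import` inside the code path that needs the package achieves the same effect with less machinery; the proxy is only useful when many call sites share the module.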

Additional information you deem important (e.g. issue happens only occasionally):

This is very platform-specific; results may differ depending on the platform CPU.

Output of show version:

The image is a debug build compiled with SONIC_PROFILING_ON=y and '-fno-omit-frame-pointer'.
Attached are a system perf recording and the flamegraph generated from it during bootup. Perf was started at the /etc/rc.local phase with the command:
perf_4.9 record -F 99 -a -g -o /home/admin/perf -- sleep 100 &
system-perf.svg.gz
We can see that more sonic-cfggen samples were collected than for any critical SONiC component, such as the SDK, syncd, orchagent or redis-server.

Bootchart plot (https://elinux.org/Bootchart)
bootchart
We can see sonic-cfggen executions during SDK start and configuration.

SONiC Software Version: SONiC.201911.0-8367dfeb
Distribution: Debian 9.12
Kernel: 4.9.0-11-2-amd64
Build commit: 8367dfeb
Build date: Wed May  6 16:14:29 UTC 2020
Built by: stepanb@r-build-sonic02

Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700-D48C8
ASIC: mellanox
Serial Number: MT1822K07823
Uptime: 10:36:48 up 8 min,  1 user,  load average: 1.61, 1.69, 0.99

Docker images:
REPOSITORY                        TAG                 IMAGE ID            SIZE
docker-platform-monitor-dbg       201911.0-8367dfeb   af5e7e595e0c        641MB
docker-platform-monitor-dbg       latest              af5e7e595e0c        641MB
docker-platform-monitor           201911.0-8367dfeb   af5e7e595e0c        641MB
docker-platform-monitor           latest              af5e7e595e0c        641MB
docker-sflow-dbg                  201911.0-8367dfeb   7a794dfac076        332MB
docker-sflow-dbg                  latest              7a794dfac076        332MB
docker-sflow                      201911.0-8367dfeb   7a794dfac076        332MB
docker-sflow                      latest              7a794dfac076        332MB
docker-fpm-frr-dbg                201911.0-8367dfeb   231d4d8e8fbd        385MB
docker-fpm-frr-dbg                latest              231d4d8e8fbd        385MB
docker-fpm-frr                    201911.0-8367dfeb   231d4d8e8fbd        385MB
docker-fpm-frr                    latest              231d4d8e8fbd        385MB
docker-lldp-sv2-dbg               201911.0-8367dfeb   63600bb51fad        326MB
docker-lldp-sv2-dbg               latest              63600bb51fad        326MB
docker-lldp-sv2                   201911.0-8367dfeb   63600bb51fad        326MB
docker-lldp-sv2                   latest              63600bb51fad        326MB
docker-orchagent-dbg              201911.0-8367dfeb   1f2d3cbf41c4        376MB
docker-orchagent-dbg              latest              1f2d3cbf41c4        376MB
docker-orchagent                  201911.0-8367dfeb   1f2d3cbf41c4        376MB
docker-orchagent                  latest              1f2d3cbf41c4        376MB
docker-snmp-sv2-dbg               201911.0-8367dfeb   36b4cb692870        357MB
docker-snmp-sv2-dbg               latest              36b4cb692870        357MB
docker-snmp-sv2                   201911.0-8367dfeb   36b4cb692870        357MB
docker-snmp-sv2                   latest              36b4cb692870        357MB
docker-nat-dbg                    201911.0-8367dfeb   8b9e98228c50        362MB
docker-nat-dbg                    latest              8b9e98228c50        362MB
docker-nat                        201911.0-8367dfeb   8b9e98228c50        362MB
docker-nat                        latest              8b9e98228c50        362MB
docker-sonic-mgmt-framework-dbg   201911.0-8367dfeb   f6f222fbb114        443MB
docker-sonic-mgmt-framework-dbg   latest              f6f222fbb114        443MB
docker-sonic-mgmt-framework       201911.0-8367dfeb   f6f222fbb114        443MB
docker-sonic-mgmt-framework       latest              f6f222fbb114        443MB
docker-teamd-dbg                  201911.0-8367dfeb   aea02a411903        360MB
docker-teamd-dbg                  latest              aea02a411903        360MB
docker-teamd                      201911.0-8367dfeb   aea02a411903        360MB
docker-teamd                      latest              aea02a411903        360MB
docker-syncd-mlnx-dbg             201911.0-8367dfeb   a6067b546dec        416MB
docker-syncd-mlnx-dbg             latest              a6067b546dec        416MB
docker-syncd-mlnx                 201911.0-8367dfeb   a6067b546dec        416MB
docker-syncd-mlnx                 latest              a6067b546dec        416MB
docker-sonic-telemetry-dbg        201911.0-8367dfeb   2653e5a6978d        369MB
docker-sonic-telemetry-dbg        latest              2653e5a6978d        369MB
docker-sonic-telemetry            201911.0-8367dfeb   2653e5a6978d        369MB
docker-sonic-telemetry            latest              2653e5a6978d        369MB
docker-database-dbg               201911.0-8367dfeb   e224f6c86f35        303MB
docker-database-dbg               latest              e224f6c86f35        303MB
docker-database                   201911.0-8367dfeb   e224f6c86f35        303MB
docker-database                   latest              e224f6c86f35        303MB
docker-router-advertiser-dbg      201911.0-8367dfeb   eddcecf2bd70        306MB
docker-router-advertiser-dbg      latest              eddcecf2bd70        306MB
docker-router-advertiser          201911.0-8367dfeb   eddcecf2bd70        306MB
docker-router-advertiser          latest              eddcecf2bd70        306MB
docker-dhcp-relay-dbg             201911.0-8367dfeb   e1cc48d4ce38        312MB
docker-dhcp-relay-dbg             latest              e1cc48d4ce38        312MB
docker-dhcp-relay                 201911.0-8367dfeb   e1cc48d4ce38        312MB
docker-dhcp-relay                 latest              e1cc48d4ce38        312MB

Attach debug file sudo generate_dump:
sonic_dump.tar.gz

@lguohan
Collaborator

lguohan commented May 13, 2020

image

Looks like supervisorctl is consuming a lot of CPU; it seems better to just leverage the supervisord autostart.

@jleveque
Contributor

@tahmed-dev has been working to decrease CPU load at boot time due to sonic-cfggen. Reassigning this issue.

@jleveque jleveque assigned tahmed-dev and unassigned jleveque Aug 20, 2020
@rlhui
Contributor

rlhui commented Oct 5, 2020

@tahmed-dev, are all fixes for this in the 201911 branch and ready for verification by @stepanblyschak? Thanks.

@tahmed-dev
Contributor

@rlhui Yes! All the low-hanging-fruit fixes went into master. Do we have plans to port those fixes to the 201911 branch?

@rlhui
Contributor

rlhui commented Oct 5, 2020

@tahmed-dev, I believe some PRs are in the 201911 branch already. Are you saying this is still an issue for 201911?
Could you please map the PRs to this issue? Which ones are not yet in the 201911 branch?

@tahmed-dev
Contributor

@rlhui, I'll defer to @stepanblyschak to answer whether he still sees this issue on 201911.

Here are the remaining PRs: /pull/5250, /pull/5203, /pull/5200, /pull/5178, /pull/5176, /pull/5175, /pull/5174, /pull/5166, /pull/4937, and this commit

@liat-grozovik
Collaborator

Following tests on 201911_T0 by @stepanblyschak, we definitely see major improvements compared to when the issue was raised. All the above PRs were cherry-picked to the 201911_T0 branch but are not yet in 201911.
@rlhui, when can we plan a merge window from 201911_T0 to 201911?

@rlhui
Contributor

rlhui commented Oct 13, 2020

@liat-grozovik, great, good to know. The 201911 branch is under somewhat tighter control at the moment and accepts critical bug fixes only. We can assess this late this week or early next week. Thanks.
