Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Features were not disabled after config load_minigraph #13293

Closed
ysmanman opened this issue Jan 6, 2023 · 9 comments
Closed

Features were not disabled after config load_minigraph #13293

ysmanman opened this issue Jan 6, 2023 · 9 comments
Assignees
Labels
Chassis 🤖 Modular chassis support MSFT Triaged this issue has been triaged

Comments

@ysmanman
Copy link
Contributor

ysmanman commented Jan 6, 2023

Description

We encountered the issue when config-load sonic-mgmt T2 minigraph in a VOQ chassis. Before config-load minigraph, feature bgp was enabled on the supervisor (so bgp containers of all fabric AISCs were running). After config-load sonic-mgmt T2 minigraph that disables feature bgp on the sup, container_check failed because some bgp containers were still running:

$ sonic-db-cli CONFIG_DB hgetall "FEATURE|bgp"
{'has_timer': 'False', 'check_up_status': 'false', 'state': 'disabled', 'auto_restart': 'enabled', 'has_global_scope': 'False', 'has_per_asic_scope': 'True', 'high_mem_alert': 'disabled'}

$ container_checker
Unexpected running containers: bgp7, bgp9, bgp6, bgp10, bgp11, bgp5, bgp8, bgp4

According to syslog, hostcfgd failed to stop bgp4 because systemctl stop was canceled.

Jan 6 05:35:11.848413 INFO hostcfgd: Running cmd: 'sudo systemctl stop bgp@4.service'
Jan 6 05:35:21.417044 INFO hostcfgd[705707]: Job for bgp@4.service canceled.
Jan 6 05:35:21.424134 ERR hostcfgd: sudo systemctl stop bgp@4.service - failed: return code - 1, output:#012None

The reason that systemctl stop was canceled is likely because there was another systemctl action being issued on bgp4, e.g., starting swss4 would systemctl start bgp4 as well. In another words, this may be a race condition.

Furthermore, with today's implementation of hostcfgd, it bails out immediately if it encountered error when stopping a feature. This explains why the rest bgp containers (bgp5-11) were not stopped.

This issue was seen with recent 202205 image.

Steps to reproduce the issue:

  1. Enable feature bgp on sup in VOQ chassis.
  2. Config-load a minigraph that disabled feature bgp on sup.
  3. Run container_checker on sup.

Describe the results you received:

Describe the results you expected:

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

@arlakshm arlakshm self-assigned this Jan 10, 2023
@arlakshm arlakshm added the Chassis 🤖 Modular chassis support label Jan 10, 2023
@rlhui
Copy link
Contributor

rlhui commented Jan 11, 2023

#13064 will help. Please retry with this fix.

@rlhui
Copy link
Contributor

rlhui commented Jan 11, 2023

sonic-net/sonic-host-services#8 could help , but not yet merged.

@rlhui rlhui added the Triaged this issue has been triaged label Jan 11, 2023
@ysmanman
Copy link
Contributor Author

#13064 will help. Please retry with this fix.

The PR may help to reduce the window of the trace, but it may not able to eliminate the race.
We will try the fix and report back.

@ysmanman
Copy link
Contributor Author

#13064 will help. Please retry with this fix.

The PR may help to reduce the window of the trace, but it may not able to eliminate the race. We will try the fix and report back.

Hi @rlhui, I was still observing the issue with recent 202205 image:

admin@cmp227:/$ sonic-db-cli CONFIG_DB hgetall "FEATURE|bgp"
{'has_global_scope': 'False', 'has_timer': 'False', 'auto_restart': 'enabled', 'state': 'disabled', 'has_per_asic_scope': 'True', 'high_mem_alert': 'disabled', 'check_up_status': 'false'}

admin@cmp227:/$ container_checker
Unexpected running containers: bgp8, bgp9, bgp10, bgp11, bgp7

@wenyiz2021
Copy link
Contributor

wenyiz2021 commented Feb 7, 2023

about the start of dependent service(swss) also starts bgp, I don't quite think this way
e.g. when bgp5 was stopped earlier by hostcfgd, it will be masked
later when swss.sh tries to start bgp5 it'll fail, because the dependent swss.sh file doesn't unmask bgp5, which prioritizes hostcfgd behavior.
current race condition in hostcfgd should only btn systemd and hostcfgd at the very beginning when device boot up and load minigraph.

@wenyiz2021
Copy link
Contributor

Feb  7 18:30:10.947161 str2-7804-sup-1 INFO hostcfgd: Running cmd: 'sudo systemctl disable bgp@10.service'
Feb  7 18:30:10.998839 str2-7804-sup-1 INFO hostcfgd[1485657]: Removed /etc/systemd/system/sonic.target.wants/bgp@10.service.
Feb  7 18:30:11.448645 str2-7804-sup-1 INFO hostcfgd: Running cmd: 'sudo systemctl mask bgp@10.service'
Feb  7 18:30:11.498486 str2-7804-sup-1 INFO hostcfgd[1485773]: Created symlink /etc/systemd/system/bgp@10.service → /dev/null.
Feb  7 18:30:31.466672 str2-7804-sup-1 INFO swss.sh[1491611]: Failed to start bgp@10.service: Unit bgp@10.service is masked.

@wenyiz2021
Copy link
Contributor

sonic-net/sonic-host-services#8 Hostcfgd change didn't help the issue from my experiment. Moreover, it causes dhcp_relay/mux/bgp/macsec/teamd feature status to be uncertain by directly taking logic from init_cfg.json.j2

@wenyiz2021
Copy link
Contributor

@arlakshm has local fix for this, to raise a PR

lguohan pushed a commit that referenced this issue Aug 7, 2023
…#15734)

Fixes #15667 and #13293

Work item tracking
Microsoft ADO 24472854:

How I did it
On chassis supervisor bgp feature is disabled in hostcfgd. The dependency between swss and bgp causes the bgp containers to start even though the feature is disabled.

How to verify it
Tests on chassis supervisor and LC
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this issue Aug 10, 2023
…sonic-net#15734)

Fixes sonic-net#15667 and sonic-net#13293

Work item tracking
Microsoft ADO 24472854:

How I did it
On chassis supervisor bgp feature is disabled in hostcfgd. The dependency between swss and bgp causes the bgp containers to start even though the feature is disabled.

How to verify it
Tests on chassis supervisor and LC
yxieca pushed a commit that referenced this issue Aug 11, 2023
…#15734) (#16099)

Fixes #15667 and #13293

Work item tracking
Microsoft ADO 24472854:

How I did it
On chassis supervisor bgp feature is disabled in hostcfgd. The dependency between swss and bgp causes the bgp containers to start even though the feature is disabled.

How to verify it
Tests on chassis supervisor and LC

Co-authored-by: Arvindsrinivasan Lakshmi Narasimhan <55814491+arlakshm@users.noreply.github.com>
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this issue Aug 14, 2023
…sonic-net#15734)

Fixes sonic-net#15667 and sonic-net#13293

Work item tracking
Microsoft ADO 24472854:

How I did it
On chassis supervisor bgp feature is disabled in hostcfgd. The dependency between swss and bgp causes the bgp containers to start even though the feature is disabled.

How to verify it
Tests on chassis supervisor and LC
StormLiangMS pushed a commit that referenced this issue Aug 14, 2023
…#15734) (#16135)

Fixes #15667 and #13293

Work item tracking
Microsoft ADO 24472854:

How I did it
On chassis supervisor bgp feature is disabled in hostcfgd. The dependency between swss and bgp causes the bgp containers to start even though the feature is disabled.

How to verify it
Tests on chassis supervisor and LC

Co-authored-by: Arvindsrinivasan Lakshmi Narasimhan <55814491+arlakshm@users.noreply.github.com>
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this issue Sep 20, 2023
…sonic-net#15734)

Fixes sonic-net#15667 and sonic-net#13293

Work item tracking
Microsoft ADO 24472854:

How I did it
On chassis supervisor bgp feature is disabled in hostcfgd. The dependency between swss and bgp causes the bgp containers to start even though the feature is disabled.

How to verify it
Tests on chassis supervisor and LC
@arlakshm
Copy link
Contributor

#15734 merged. Closing this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Chassis 🤖 Modular chassis support MSFT Triaged this issue has been triaged
Projects
Archived in project
Development

No branches or pull requests

5 participants