Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hostcfgd] hoscfgd doesn't honor CFG DB updates if they arrive in a specific time interval #8619

Closed
vivekrnv opened this issue Aug 27, 2021 · 1 comment · Fixed by #8861
Closed
Assignees
Labels
Issue for 202106 Triaged this issue has been triaged

Comments

@vivekrnv
Copy link
Contributor

Description

We have an internal test which invokes "config feature state sflow enabled" at around 92 sec after the boot and the update was never honored. The issue was narrowed down to the delayed start of hostcfgd (which starts at 90th sec). I'll explain with an example

Let say system boots up at 0th sec. hostcfg starts at 90th sec. The hostcfgd reads the FEATURE|* tables at 90th sec and acts on it accordingly for the next 10 seconds (i.e mask/unmask/enable/disable). And it then starts listening to notifications of feature table from 100th sec.

So, an update which is made before 90th sec and after 100th sec is honored. Updates made between 90th – 100th sec will not.

Steps to reproduce the issue:

  1. Make sure SFLOW feature state is disabled, save it.
  2. Write a one shot timer to start at 94 sec after boot up. Make it enable sflow using "config feature state sflow enabled"
  3. Reboot the switch with one-shot timer in place.

Describe the results you received:

The update will not be honored

# Command was run on

2021-08-23 12:33:23 [    INFO]  Running cmd: sudo config feature state sflow enabled 


# Pre-notification listen phase i.e. https://github.com/Azure/sonic-buildimage/blob/202012/src/sonic-host-services/scripts/hostcfgd#L656 was active from 

12:33:22.986854  <->  12:33:34.245694 and thus the update was missed

Aug 23 12:33:22.986854 sonic INFO /hostcfgd: Waiting for systemctl to finish initialization
Aug 23 12:33:23.000117 sonic INFO /hostcfgd: systemctl has finished initialization -- proceeding ....

...............

................

Aug 23 12:33:33.421544 sonic INFO /hostcfgd: Feature 'lldp.service' is enabled and started
Aug 23 12:33:33.437933 sonic INFO /hostcfgd: Feature 'sflow' is stopped and disabled
..........

.........

Aug 23 12:33:34.234307 sonic INFO /hostcfgd: Running cmd: 'sudo systemctl start dhcp_relay.service'
Aug 23 12:33:34.245694 sonic INFO /hostcfgd: Feature 'dhcp_relay.service' is enabled and started

Describe the results you expected:

The update should be honored

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

@zhangyanzhao
Copy link
Collaborator

Vivek will fix this issue.

@zhangyanzhao zhangyanzhao added the Triaged this issue has been triaged label Sep 1, 2021
qiluo-msft pushed a commit that referenced this issue Oct 28, 2021
…Table (#8861)

#### Why I did it

Fixes #8619

#### How I did it

1) Listening to CFG_DB notifications was migrated from ConfigDBConnector to SubscriberStateTable & Select
2) This change in design helped me to remove `update_all_features_config` which was roughly taking a 5-10 sec time to execute and thus the reason for blackout
3) Edited FeatureHandler, Feature & NtpCfgd classes to suit this design  
4) Added corresponding mocks and UT's

**Changes made to classes other than HostConfigDaemon:**
With the previous design, the initially read data from the config db was applied by using hardcoded methods even before the config_db.listen() was called. For Eg: `update_all_features_config` for FeatureHandler and load() named methods for NtpCfgd etc

But with this design, since the existing data is read and given out as a notification by SubscriberStateTable, i've pretty much removed these hardcoded methods. Thus changes made to these class will be around adapting them to the new design and no change in the actual functionality .


#### How to verify it

UT's:
```
tests/determine-reboot-cause_test.py .........                                                                                                                                                                                        [ 29%]
tests/procdockerstatsd_test.py .                                                                                                                                                                                                      [ 32%]
tests/caclmgrd/caclmgrd_dhcp_test.py ......                                                                                                                                                                                           [ 51%]
tests/hostcfgd/hostcfgd_radius_test.py ..                                                                                                                                                                                             [ 58%]
tests/hostcfgd/hostcfgd_test.py .............                                                                                                                                                                                         [100%]
```
Verified manually, 

```
Sep 10 22:53:25.662621 sonic INFO systemd[1]: hostcfgd.service: Succeeded.
Sep 10 22:55:04.127719 sonic INFO /hostcfgd: ConfigDB connect success
Sep 10 22:55:04.128108 sonic INFO /hostcfgd: KdumpCfg init ...
Sep 10 22:55:04.148819 sonic INFO /hostcfgd: Waiting for systemctl to finish initialization
Sep 10 22:55:04.163452 sonic INFO /hostcfgd: systemctl has finished initialization -- proceeding ...
Sep 10 22:55:04.163834 sonic INFO /hostcfgd: Kdump handler...
Sep 10 22:55:04.164019 sonic INFO /hostcfgd: Kdump global configuration update
Sep 10 22:55:04.758784 sonic INFO hostcfgd[184471]: kdump is already disabled
Sep 10 22:55:04.758876 sonic INFO hostcfgd[184471]: Kdump is already disabled
Sep 10 22:55:05.182021 sonic INFO hostcfgd[184511]: Kdump configuration has been updated in the startup configuration
Sep 10 22:55:05.596919 sonic INFO hostcfgd[184528]: Kdump configuration has been updated in the startup configuration
Sep 10 22:55:06.140627 sonic INFO /hostcfgd: Feature nat is stopped and disabled
Sep 10 22:55:06.642629 sonic INFO /hostcfgd: Feature telemetry is enabled and started
Sep 10 22:55:07.101297 sonic INFO /hostcfgd: Feature pmon is enabled and started
Sep 10 22:55:07.554366 sonic INFO /hostcfgd: Feature database is enabled and started
Sep 10 22:55:08.009329 sonic INFO /hostcfgd: Feature mgmt-framework is enabled and started
Sep 10 22:55:08.394952 sonic INFO /hostcfgd: Feature macsec is stopped and disabled
Sep 10 22:55:08.782853 sonic INFO /hostcfgd: Feature snmp is enabled and started
Sep 10 22:55:09.205381 sonic INFO /hostcfgd: Feature teamd is enabled and started
Sep 10 22:55:09.224877 sonic INFO /hostcfgd: Feature what-just-happened is enabled and started
Sep 10 22:55:09.627929 sonic INFO /hostcfgd: Feature lldp is enabled and started
Sep 10 22:55:10.086993 sonic INFO /hostcfgd: Feature swss is enabled and started
Sep 10 22:55:10.170312 sonic INFO /hostcfgd: cmd - service aaastatsd stop
Sep 10 22:55:11.012236 sonic INFO /hostcfgd: cmd - service aaastatsd stop
Sep 10 22:55:12.225946 sonic INFO /hostcfgd: Feature bgp is enabled and started
Sep 10 22:55:12.712792 sonic INFO /hostcfgd: Feature dhcp_relay is enabled and started
Sep 10 22:55:13.166656 sonic INFO /hostcfgd: Feature sflow is stopped and disabled
Sep 10 22:55:13.593639 sonic INFO /hostcfgd: Feature radv is enabled and started
Sep 10 22:55:14.034106 sonic INFO /hostcfgd: Feature syncd is enabled and started
Sep 10 22:55:14.113064 sonic INFO /hostcfgd: cmd - service aaastatsd stop
Sep 10 22:55:14.863601 sonic INFO /hostcfgd: RADIUS_SERVER update: key: 10.10.10.1, op: SET, data: {'auth_type': 'pap', 'passkey': 'p*****', 'retransmit': '1', 'timeout': '1'}
Sep 10 22:55:14.938605 sonic INFO /hostcfgd: cmd - service aaastatsd stop
Sep 10 22:55:15.667545 sonic INFO /hostcfgd: RADIUS_SERVER update: key: 10.10.10.3, op: SET, data: {'auth_type': 'chap', 'passkey': 'p*****', 'retransmit': '2', 'timeout': '2'}
Sep 10 22:55:15.667801 sonic INFO /hostcfgd: RADIUS (NAS) IP change - key:eth0, current global info {}
Sep 10 22:55:15.746531 sonic INFO /hostcfgd: cmd - service aaastatsd stop
Sep 10 23:04:47.435340 sonic INFO /hostcfgd: ntp server update key 0.debian.pool.ntp.org
Sep 10 23:04:47.435661 sonic INFO /hostcfgd: ntp server update, restarting ntp-config, ntp servers configured {'0.debian.pool.ntp.org'}
Sep 10 23:04:47.866394 sonic INFO /hostcfgd: NTP GLOBAL Update
Sep 10 23:04:47.866557 sonic INFO /hostcfgd: ntp global update for source intf old {''} new {'eth0', 'Loopback0'}, restarting ntp-config
Sep 10 23:16:25.157600 sonic INFO /hostcfgd: Running cmd: 'sudo systemctl unmask sflow.service'
Sep 10 23:16:25.178472 sonic INFO hostcfgd[192106]: Removed /etc/systemd/system/sflow.service.
Sep 10 23:16:25.582018 sonic INFO /hostcfgd: Running cmd: 'sudo systemctl enable sflow.service'
Sep 10 23:16:25.604534 sonic INFO hostcfgd[192123]: Created symlink /etc/systemd/system/sonic.target.wants/sflow.service → /lib/systemd/system/sflow.service.
Sep 10 23:16:26.029416 sonic INFO /hostcfgd: Running cmd: 'sudo systemctl start sflow.service'
Sep 10 23:16:26.691927 sonic INFO /hostcfgd: Feature sflow is enabled and started
```
ganglyu pushed a commit to sonic-net/sonic-host-services that referenced this issue Jul 12, 2022
…Table (#8861)

#### Why I did it

Fixes sonic-net/sonic-buildimage#8619

#### How I did it

1) Listening to CFG_DB notifications was migrated from ConfigDBConnector to SubscriberStateTable & Select
2) This change in design helped me to remove `update_all_features_config` which was roughly taking a 5-10 sec time to execute and thus the reason for blackout
3) Edited FeatureHandler, Feature & NtpCfgd classes to suit this design  
4) Added corresponding mocks and UT's

**Changes made to classes other than HostConfigDaemon:**
With the previous design, the initially read data from the config db was applied by using hardcoded methods even before the config_db.listen() was called. For Eg: `update_all_features_config` for FeatureHandler and load() named methods for NtpCfgd etc

But with this design, since the existing data is read and given out as a notification by SubscriberStateTable, i've pretty much removed these hardcoded methods. Thus changes made to these class will be around adapting them to the new design and no change in the actual functionality .


#### How to verify it

UT's:
```
tests/determine-reboot-cause_test.py .........                                                                                                                                                                                        [ 29%]
tests/procdockerstatsd_test.py .                                                                                                                                                                                                      [ 32%]
tests/caclmgrd/caclmgrd_dhcp_test.py ......                                                                                                                                                                                           [ 51%]
tests/hostcfgd/hostcfgd_radius_test.py ..                                                                                                                                                                                             [ 58%]
tests/hostcfgd/hostcfgd_test.py .............                                                                                                                                                                                         [100%]
```
Verified manually, 

```
Sep 10 22:53:25.662621 sonic INFO systemd[1]: hostcfgd.service: Succeeded.
Sep 10 22:55:04.127719 sonic INFO /hostcfgd: ConfigDB connect success
Sep 10 22:55:04.128108 sonic INFO /hostcfgd: KdumpCfg init ...
Sep 10 22:55:04.148819 sonic INFO /hostcfgd: Waiting for systemctl to finish initialization
Sep 10 22:55:04.163452 sonic INFO /hostcfgd: systemctl has finished initialization -- proceeding ...
Sep 10 22:55:04.163834 sonic INFO /hostcfgd: Kdump handler...
Sep 10 22:55:04.164019 sonic INFO /hostcfgd: Kdump global configuration update
Sep 10 22:55:04.758784 sonic INFO hostcfgd[184471]: kdump is already disabled
Sep 10 22:55:04.758876 sonic INFO hostcfgd[184471]: Kdump is already disabled
Sep 10 22:55:05.182021 sonic INFO hostcfgd[184511]: Kdump configuration has been updated in the startup configuration
Sep 10 22:55:05.596919 sonic INFO hostcfgd[184528]: Kdump configuration has been updated in the startup configuration
Sep 10 22:55:06.140627 sonic INFO /hostcfgd: Feature nat is stopped and disabled
Sep 10 22:55:06.642629 sonic INFO /hostcfgd: Feature telemetry is enabled and started
Sep 10 22:55:07.101297 sonic INFO /hostcfgd: Feature pmon is enabled and started
Sep 10 22:55:07.554366 sonic INFO /hostcfgd: Feature database is enabled and started
Sep 10 22:55:08.009329 sonic INFO /hostcfgd: Feature mgmt-framework is enabled and started
Sep 10 22:55:08.394952 sonic INFO /hostcfgd: Feature macsec is stopped and disabled
Sep 10 22:55:08.782853 sonic INFO /hostcfgd: Feature snmp is enabled and started
Sep 10 22:55:09.205381 sonic INFO /hostcfgd: Feature teamd is enabled and started
Sep 10 22:55:09.224877 sonic INFO /hostcfgd: Feature what-just-happened is enabled and started
Sep 10 22:55:09.627929 sonic INFO /hostcfgd: Feature lldp is enabled and started
Sep 10 22:55:10.086993 sonic INFO /hostcfgd: Feature swss is enabled and started
Sep 10 22:55:10.170312 sonic INFO /hostcfgd: cmd - service aaastatsd stop
Sep 10 22:55:11.012236 sonic INFO /hostcfgd: cmd - service aaastatsd stop
Sep 10 22:55:12.225946 sonic INFO /hostcfgd: Feature bgp is enabled and started
Sep 10 22:55:12.712792 sonic INFO /hostcfgd: Feature dhcp_relay is enabled and started
Sep 10 22:55:13.166656 sonic INFO /hostcfgd: Feature sflow is stopped and disabled
Sep 10 22:55:13.593639 sonic INFO /hostcfgd: Feature radv is enabled and started
Sep 10 22:55:14.034106 sonic INFO /hostcfgd: Feature syncd is enabled and started
Sep 10 22:55:14.113064 sonic INFO /hostcfgd: cmd - service aaastatsd stop
Sep 10 22:55:14.863601 sonic INFO /hostcfgd: RADIUS_SERVER update: key: 10.10.10.1, op: SET, data: {'auth_type': 'pap', 'passkey': 'p*****', 'retransmit': '1', 'timeout': '1'}
Sep 10 22:55:14.938605 sonic INFO /hostcfgd: cmd - service aaastatsd stop
Sep 10 22:55:15.667545 sonic INFO /hostcfgd: RADIUS_SERVER update: key: 10.10.10.3, op: SET, data: {'auth_type': 'chap', 'passkey': 'p*****', 'retransmit': '2', 'timeout': '2'}
Sep 10 22:55:15.667801 sonic INFO /hostcfgd: RADIUS (NAS) IP change - key:eth0, current global info {}
Sep 10 22:55:15.746531 sonic INFO /hostcfgd: cmd - service aaastatsd stop
Sep 10 23:04:47.435340 sonic INFO /hostcfgd: ntp server update key 0.debian.pool.ntp.org
Sep 10 23:04:47.435661 sonic INFO /hostcfgd: ntp server update, restarting ntp-config, ntp servers configured {'0.debian.pool.ntp.org'}
Sep 10 23:04:47.866394 sonic INFO /hostcfgd: NTP GLOBAL Update
Sep 10 23:04:47.866557 sonic INFO /hostcfgd: ntp global update for source intf old {''} new {'eth0', 'Loopback0'}, restarting ntp-config
Sep 10 23:16:25.157600 sonic INFO /hostcfgd: Running cmd: 'sudo systemctl unmask sflow.service'
Sep 10 23:16:25.178472 sonic INFO hostcfgd[192106]: Removed /etc/systemd/system/sflow.service.
Sep 10 23:16:25.582018 sonic INFO /hostcfgd: Running cmd: 'sudo systemctl enable sflow.service'
Sep 10 23:16:25.604534 sonic INFO hostcfgd[192123]: Created symlink /etc/systemd/system/sonic.target.wants/sflow.service → /lib/systemd/system/sflow.service.
Sep 10 23:16:26.029416 sonic INFO /hostcfgd: Running cmd: 'sudo systemctl start sflow.service'
Sep 10 23:16:26.691927 sonic INFO /hostcfgd: Feature sflow is enabled and started
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue for 202106 Triaged this issue has been triaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants