[Celestica-E1031] Enable CPU watchdog #16083

Merged (3 commits) on Aug 14, 2023

lizhijianrd
Contributor

@lizhijianrd lizhijianrd commented Aug 9, 2023

Why I did it

Enable CPU watchdog on Celestica-E1031.

Work item tracking
  • Microsoft ADO (number only): 24536684

How I did it

Add a system service, cpu_wdt, that enables the CPU watchdog and periodically sends a keep-alive signal to it.
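
For readers who want a feel for the mechanism, here is a minimal sketch of such a keep-alive loop. It is not the script added by this PR (that script is not shown in this conversation). It assumes the watchdog is armed and re-armed with `watchdogutil arm -s <seconds>` and disarmed with `watchdogutil disarm`, and it uses the 180-second timeout and 60-second interval reported in the verification logs. The names `WDT_UTIL`, `WDT_TIMEOUT`, `KEEPALIVE_INTERVAL`, `wdt_arm`, `wdt_disarm`, and `wdt_run` are hypothetical.

```shell
#!/bin/bash
# Illustrative sketch only; NOT the actual /usr/local/bin/cpu_wdt script from this PR.
# All variable and function names below are hypothetical.

WDT_UTIL=${WDT_UTIL:-watchdogutil}            # CLI used to arm/disarm (overridable for dry runs)
WDT_TIMEOUT=${WDT_TIMEOUT:-180}               # watchdog timeout in seconds
KEEPALIVE_INTERVAL=${KEEPALIVE_INTERVAL:-60}  # seconds between keep-alives

# Arm (or re-arm) the watchdog; re-arming before expiry serves as the keep-alive.
wdt_arm() {
    "$WDT_UTIL" arm -s "$WDT_TIMEOUT"
}

# Disarm the watchdog so that a stopped service cannot cause a reboot.
wdt_disarm() {
    "$WDT_UTIL" disarm
}

# Run the keep-alive loop. An optional tick count bounds the loop (useful for
# dry runs); with no argument it runs until the process is signalled.
wdt_run() {
    local max_ticks=${1:-0} ticks=0
    trap 'wdt_disarm; exit 0' TERM INT
    wdt_arm || return 1
    while [ "$max_ticks" -eq 0 ] || [ "$ticks" -lt "$max_ticks" ]; do
        sleep "$KEEPALIVE_INTERVAL" &
        wait $!    # wait is interruptible, so the trap can run without waiting for sleep
        wdt_arm || return 1
        ticks=$((ticks + 1))
    done
}

# Only start the loop when invoked as "<script> start".
if [ "${1:-}" = "start" ]; then
    wdt_run
fi
```

A dry run that prints the commands instead of touching the hardware could look like `WDT_UTIL=echo KEEPALIVE_INTERVAL=1 ./sketch.sh start`, where `sketch.sh` is a placeholder file name.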

How to verify it

Built a SONiC image and installed it on a physical device. The cpu_wdt service works as expected:

$ sudo systemctl status cpu_wdt.service
● cpu_wdt.service - CPU WDT
     Loaded: loaded (/lib/systemd/system/cpu_wdt.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2023-08-09 14:37:12 UTC; 1min 11s ago
   Main PID: 324656 (cpu_wdt)
      Tasks: 2 (limit: 2329)
     Memory: 2.1M
     CGroup: /system.slice/cpu_wdt.service
             ├─324656 /bin/bash /usr/local/bin/cpu_wdt start
             └─325540 sleep 60

Aug 09 14:37:12 e1031-1 systemd[1]: Started CPU WDT.
Aug 09 14:37:12 e1031-1 cpu_wdt[324657]: Enable CPU WDT..
Aug 09 14:37:13 e1031-1 cpu_wdt[324675]: CPU WDT has been enabled with 180 seconds timeout
Aug 09 14:37:13 e1031-1 cpu_wdt[324676]: Enable keep alive messaging every 60 seconds

When the cpu_wdt service is stopped, it disarms the watchdog before exiting:

$ sudo systemctl stop cpu_wdt.service
$ sudo systemctl status cpu_wdt.service
● cpu_wdt.service - CPU WDT
     Loaded: loaded (/lib/systemd/system/cpu_wdt.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Wed 2023-08-09 14:47:24 UTC; 18s ago
    Process: 324656 ExecStart=/usr/local/bin/cpu_wdt start (code=exited, status=0/SUCCESS)
   Main PID: 324656 (code=exited, status=0/SUCCESS)

Aug 09 14:37:12 e1031-1 cpu_wdt[324657]: Enable CPU WDT..
Aug 09 14:37:13 e1031-1 cpu_wdt[324675]: CPU WDT has been enabled with 180 seconds timeout
Aug 09 14:37:13 e1031-1 cpu_wdt[324676]: Enable keep alive messaging every 60 seconds
Aug 09 14:46:23 e1031-1 systemd[1]: Stopping CPU WDT...
Aug 09 14:46:23 e1031-1 cpu_wdt[324656]: Terminated
Aug 09 14:46:23 e1031-1 cpu_wdt[331991]: Caught SIGTERM - exiting...
Aug 09 14:47:24 e1031-1 cpu_wdt[332812]: Watchdog disarmed successfully
Aug 09 14:47:24 e1031-1 cpu_wdt[332826]: CPU WDT has been disabled!
Aug 09 14:47:24 e1031-1 systemd[1]: cpu_wdt.service: Succeeded.
Aug 09 14:47:24 e1031-1 systemd[1]: Stopped CPU WDT.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

@lizhijianrd lizhijianrd marked this pull request as ready for review August 9, 2023 14:49
@lizhijianrd lizhijianrd requested a review from Blueve August 9, 2023 14:49
@prgeor
Contributor

prgeor commented Aug 11, 2023

@lizhijianrd how did you test this? Did the system reboot when the watchdog was not ticked? Can you issue echo c > /proc/sysrq-trigger and check whether the system eventually reboots?

@prgeor
Contributor

prgeor commented Aug 11, 2023

@lizhijianrd after the system reboots due to a watchdog timeout, does "show reboot-cause" show "watchdog"?

@lizhijianrd
Contributor Author

Hi @prgeor, I set the watchdog timeout to 60 seconds and the keep-alive interval to 120 seconds. The watchdog was not ticked in time and triggered a reboot, as expected. After the reboot, the reboot-cause is Watchdog (None).

Per your suggestion, I also tried echo c > /proc/sysrq-trigger several times. It triggers a system reboot immediately, and the reboot-cause is Unknown. I'm not sure whether this is the expected behavior.
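
The timing relationship behind the first test can be written down as a small check. This is only a sketch: the function names `wdt_will_expire` and `wdt_check` are hypothetical, and the check ignores the time the arm command itself takes, so an interval equal to the timeout is treated as unsafe.

```shell
#!/bin/bash
# Hypothetical helpers (not part of this PR) that compare a watchdog timeout
# with a keep-alive interval, both in seconds.

# Succeeds (exit status 0) when the keep-alive arrives too late to prevent expiry.
wdt_will_expire() {
    local timeout=$1 interval=$2
    [ "$interval" -ge "$timeout" ]
}

# Print a one-line verdict for a timeout/interval pair.
wdt_check() {
    if wdt_will_expire "$1" "$2"; then
        echo "timeout=${1}s interval=${2}s: watchdog expires, reboot expected"
    else
        echo "timeout=${1}s interval=${2}s: watchdog stays fed"
    fi
}

wdt_check 180 60    # values from the service logs in the PR description
wdt_check 60 120    # values used in the reboot test described above
# Output:
# timeout=180s interval=60s: watchdog stays fed
# timeout=60s interval=120s: watchdog expires, reboot expected
```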

@yxieca yxieca merged commit ab7c4ee into sonic-net:master Aug 14, 2023
11 checks passed
@Blueve
Contributor

Blueve commented Aug 14, 2023

We request a backport to 202012 only, because:

  1. 201811 already has the cpu_wdt service
  2. Later releases are no longer supported on this platform

@lizhijianrd lizhijianrd deleted the hlx-cpu-wdt branch August 14, 2023 08:27
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Aug 17, 2023
Enable CPU watchdog on Celestica-E1031.
@mssonicbld
Collaborator

Cherry-pick PR to 202012: #16193

mssonicbld pushed a commit that referenced this pull request Aug 19, 2023
Enable CPU watchdog on Celestica-E1031.
lizhijianrd added a commit to sonic-net/sonic-mgmt that referenced this pull request Aug 30, 2023
What is the motivation for this PR?
PR sonic-net/sonic-buildimage#16083 introduced the cpu_wdt service on the Celestica E1031 platform. The cpu_wdt service periodically sends a keep-alive message to the watchdog via the "watchdogutil arm -s " command. This may affect the result of test_watchdog_reboot. This PR adds a step that stops the cpu_wdt service before doing a watchdog reboot on the DUT.

How did you do it?
Add one step in test_watchdog_reboot to stop the cpu_wdt service before doing watchdog reboot.

How did you verify/test it?
Verified on Celestica-E1031 testbed.

Signed-off-by: Zhijian Li <zhijianli@microsoft.com>
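
As a rough illustration of the step described in this commit message, the shell sketch below stops the service only when it is currently active. It is not the code from the sonic-mgmt change, which is not shown here. The function name `stop_cpu_wdt_if_active` and the `SYSTEMCTL` override are hypothetical, and on a real device the commands would typically need root privileges.

```shell
#!/bin/bash
# Hypothetical sketch (not the sonic-mgmt test code): stop the keep-alive
# service before a watchdog-triggered reboot test so that it cannot re-arm
# the watchdog while the test is running.

SYSTEMCTL=${SYSTEMCTL:-systemctl}    # overridable so the logic can be dry-run

stop_cpu_wdt_if_active() {
    local unit=${1:-cpu_wdt.service}
    # "is-active --quiet" exits 0 only when the unit is currently active.
    if "$SYSTEMCTL" is-active --quiet "$unit"; then
        "$SYSTEMCTL" stop "$unit"
    fi
}
```

Calling `stop_cpu_wdt_if_active` on a platform without the unit is harmless, because `is-active` reports a non-zero status there and no stop is attempted.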
wangxin pushed a commit to sonic-net/sonic-mgmt that referenced this pull request Aug 31, 2023
Backport #9745
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Sep 15, 2023
mssonicbld pushed a commit to sonic-net/sonic-mgmt that referenced this pull request Sep 15, 2023
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
Enable CPU watchdog on Celestica-E1031.
lizhijianrd added a commit to lizhijianrd/sonic-buildimage that referenced this pull request Dec 5, 2023
Enable CPU watchdog on Celestica-E1031.
StormLiangMS pushed a commit that referenced this pull request Dec 26, 2023
* [Celestica-E1031] Enable CPU watchdog (#16083)

Enable CPU watchdog on Celestica-E1031.

* Add info syslog for cpu_wdt.service (#16678)

Why I did it
Add an info-level syslog message to cpu_wdt.service when the watchdog arm action is triggered.

How I did it
Add an info-level syslog message to cpu_wdt.service when the watchdog arm action is triggered.
AharonMalkin pushed a commit to AharonMalkin/sonic-mgmt that referenced this pull request Jan 25, 2024
6 participants