[Nokia][sonic-platform] Update Nokia sonic-platform submodule for enhanced platform stability #411

Merged 2 commits into sonic-net:202205 on Jul 12, 2024

snider-nokia
Contributor

Why I did it

To ensure that a specific problematic error path is recognized and handled when a platform-specific device becomes unresponsive. Separately, to enhance certain platform reboot-path diagnostics to assist with system operational analysis.

Please pair these changes with NDK release version 22.9.28.

These changes relate to the issue described at https://github.com/Nokia-ION/ndk/issues/49.

Work item tracking
  • Microsoft ADO (number only):

How I did it

The changes involved here, in association with NDK version 22.9.28, are as follows:

  • Better NDK health monitoring by the Python script that periodically pings device_manager via the gRPC API:
    o Previously, this mechanism could get stuck if the gRPC API call stalled. Now the API ping times out (more precisely, is interrupted) after 5 seconds if stalled.
  • The watchdog shell script (which repeatedly invokes the Python NDK health monitoring script above) is improved as well:
    o More explicit logging when an API ping failure occurs in the Python health monitoring script.
    o Previously, this shell script slept for a large chunk of time (~30 secs) between Python health monitor invocations. Now it sleeps for only one second at a time while monitoring for a filesystem-based device hang signal (this failure signal is triggered by the new NDK device hang monitoring logic). When the failure signature file appears, the watchdog script reboots without waiting for the HW watchdog to fire. When the failure signature is detected, Nokia-specific systemd-controlled services are also prevented from restarting.
  • Implementation of NDK-integral, driver-based monitoring for the device hang signature. The new logic ensures that:
    o If one of these internal devices stops responding to transactions, the signature is natively detected so that the device can be isolated (no more transactions originated from the CPU), the signature is noted and reported by platform code via reboot-cause, the system is rebooted in a timely manner, etc.
    o The driver transaction history archive is captured by SW and logged at signature detection time, along with a call stack backtrace, to assist post-mortem analysis.
    o An NDK core file is also dumped to aid comprehensive post-mortem and root-cause analysis.
  • The LC watchdog kernel driver is additionally modified to ensure that BDB (a proprietary slot presence mechanism) is explicitly disabled at reboot or panic time, so that the SUP card is informed promptly of the LC's disappearance while the LC is on the way down, rather than only when the LC is on the way back up.
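
The "ping is interrupted after 5 seconds" behavior described above can be sketched as follows. This is only an illustration and not the code from the submodule: the names `ping_with_timeout` and `PingTimeout` are hypothetical, the `ping` argument stands in for whatever callable issues the gRPC request to device_manager, and the use of `SIGALRM` is an assumption about how the interruption might be done (it works only on Unix and only from the main thread).

```python
import signal


class PingTimeout(Exception):
    """Raised by the alarm handler when the ping takes too long (hypothetical name)."""


def _on_alarm(signum, frame):
    # Raising here interrupts whatever blocking call the main thread is in.
    raise PingTimeout()


def ping_with_timeout(ping, timeout_secs=5):
    """Run ping(); return True if it finishes, False if interrupted by the alarm.

    `ping` is a placeholder for the real health-check call, which is not
    shown in the PR description.
    """
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout_secs)  # deliver SIGALRM after timeout_secs seconds
    try:
        ping()
        return True
    except PingTimeout:
        return False
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```

A caller such as the watchdog script could treat a `False` result the same way as any other ping failure, which is what keeps a stalled API call from blocking the monitor indefinitely.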

How to verify it

Device hang can be reproduced using the procedure described at the related issue here:
https://github.com/Nokia-ION/ndk/issues/49#issuecomment-2201157983

Once the condition is detected, the LC will reboot and the reboot cause should reflect the condition appropriately. In addition, an NDK core file should have been written to the /var/core/ directory.
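
As a small aid for the last check, the sketch below lists files in a directory that were modified at or after a given time. It is a hypothetical helper, not part of the PR: the function name `recent_core_files` is made up, and because the naming pattern of NDK core files is not stated here, it matches any regular file rather than guessing at a pattern.

```python
import os


def recent_core_files(core_dir, since_epoch):
    """Return sorted names of regular files in core_dir modified at or after since_epoch."""
    found = []
    for entry in os.scandir(core_dir):
        # Compare modification time against the supplied epoch timestamp.
        if entry.is_file() and entry.stat().st_mtime >= since_epoch:
            found.append(entry.name)
    return sorted(found)
```

Recording `time.time()` just before triggering the hang and then calling `recent_core_files("/var/core", that_timestamp)` after the LC comes back up would show whether a new file appeared.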

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Tested branch (Please provide the tested image version)

  • 202205

Description for the changelog

[Nokia][sonic-platform] Update Nokia sonic-platform submodule for enhanced platform stability

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@snider-nokia
Contributor Author

snider-nokia commented Jul 11, 2024

@judyjoseph, please prep for merge as soon as NDK release 22.9.28 becomes available (that release process is also happening today).

@judyjoseph
Collaborator

General comment: should we stick to using /var/log for the log file locations rather than /tmp?

wd_log_file=/var/log/nokia-watchdog.log
wd_temp_file=/tmp/nokia-watchdog.tmp

Another instance was /tmp/fsde_dev_hung_sig

@snider-nokia
Contributor Author

/tmp/nokia-watchdog.tmp is placed there specifically to minimize SSD file system operations. It is much more efficient to use RAM (where /tmp/ lives) for this particular file.

/tmp/fsde_dev_hung_sig is also purposefully placed in /tmp/ for similar, and additional, reasons. This file needs to disappear at reboot time, in conjunction with the logic that prevents our platform services from restarting if the device hang condition occurs.
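
To illustrate why a RAM-backed signature file suits this design, here is a minimal sketch of a one-second polling loop of the kind described in the PR. The actual watchdog is a shell script; this Python version is only an approximation, and `wait_for_hang_signature`, `max_polls`, and the injectable `sleep` parameter are hypothetical names introduced for the example.

```python
import os
import time


def wait_for_hang_signature(sig_path, poll_secs=1.0, max_polls=30, sleep=time.sleep):
    """Poll for the signature file.

    Returns True as soon as sig_path exists, or False after max_polls
    checks without seeing it. The caller decides what to do next (for
    example, log the event and reboot).
    """
    for _ in range(max_polls):
        if os.path.exists(sig_path):
            return True
        sleep(poll_secs)
    return False
```

Because the path lives in /tmp/, the file is gone after the reboot, so a loop like this starts from a clean state on the next boot without any explicit cleanup step.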

@gechiang gechiang merged commit 710fbb7 into sonic-net:202205 Jul 12, 2024
3 checks passed