[Nokia][sonic-platform] Update Nokia sonic-platform submodule for enhanced platform stability #411
Why I did it
To ensure that a specific problematic error path is recognized and handled when a platform-specific device becomes unresponsive. Separately, to enhance certain platform reboot-path diagnostics to assist with system operational analysis.
Please pair these changes with NDK release version 22.9.28.
These changes relate to the issue described at https://github.com/Nokia-ION/ndk/issues/49.
Work item tracking
How I did it
The changes involved here, in association with NDK version 22.9.28, are as follows:
o Previously, this mechanism could get stuck if the gRPC API call stalled. Now the API ping times out (really, is interrupted) after 5 seconds if stalled.
o More explicit logging if an API ping failure occurs in the above-referenced Python-based health monitoring script.
o Also, this shell script previously slept for a long interval (~30 seconds) between Python health monitor invocations. It now sleeps for only one second at a time while monitoring for a filesystem-based device hang signal (this failure signal is triggered by the new NDK device hang monitoring logic). When the failure signature file appears, the watchdog script reboots without waiting for the HW watchdog to fire. Once the failure signature is detected, Nokia-specific systemd-controlled services are also prevented from restarting.
o If one of these internal devices stops responding to transactions, the signature is detected natively so that the device can be isolated (no more transactions originate from the CPU), the signature is noted and reported by platform code via reboot-cause, the system is rebooted in a timely manner, etc.
o Driver transaction history archive is captured by SW and logged at signature detection time, along with call stack backtrace, for assistance in post-mortem analysis
o NDK core file is also dumped to aid in comprehensive post-mortem and root-cause analysis
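
For illustration only, the ping timeout and the one-second signature polling described above could be sketched roughly as follows. This is not the code in the submodule: the names `ping_with_timeout` and `wait_for_hang_signature`, the use of `SIGALRM` to interrupt the call, and the signature path passed in by the caller are all assumptions made for this sketch. It also assumes a Unix host and that the call runs in the main thread.

```python
import logging
import signal
import time
from pathlib import Path

log = logging.getLogger("health_monitor_sketch")


class PingTimeout(Exception):
    """Raised by the alarm handler when the ping takes too long."""


def _on_alarm(signum, frame):
    raise PingTimeout()


def ping_with_timeout(ping, timeout_secs=5):
    """Run ping(); return True on success, False on timeout or error.

    `ping` is any zero-argument callable (hypothetical stand-in for the
    gRPC API ping). SIGALRM interrupts it if it stalls.
    """
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout_secs)
    try:
        ping()
        return True
    except PingTimeout:
        log.error("API ping interrupted after %d secs", timeout_secs)
        return False
    except Exception as exc:
        log.error("API ping failed: %r", exc)
        return False
    finally:
        # Always cancel the pending alarm and restore the old handler.
        signal.alarm(0)
        signal.signal(signal.SIGALRM, previous)


def wait_for_hang_signature(signature_path, poll_secs=1.0, max_polls=30):
    """Poll for the failure signature file instead of one long sleep.

    Returns True as soon as the file appears, False if it never does
    within max_polls polls. The path is supplied by the caller; the
    real file location is not stated in this description.
    """
    path = Path(signature_path)
    for _ in range(max_polls):
        if path.exists():
            return True
        time.sleep(poll_secs)
    return path.exists()
```

In this sketch a caller would invoke the reboot path when `wait_for_hang_signature` returns True, and otherwise run the next `ping_with_timeout` health check.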
How to verify it
The device hang can be reproduced using the procedure described in the related issue:
https://github.com/Nokia-ION/ndk/issues/49#issuecomment-2201157983
After the condition is detected, the LC will reboot and the reboot cause should reflect the condition. Further, an NDK core file should have been dropped in the /var/core/ directory.
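
As a rough aid for the core file check, a small helper like the one below could list files in the core directory that were written after the test started. The function name `recent_core_files` is hypothetical, and since the core file naming is not stated here, the sketch filters only on modification time.

```python
from pathlib import Path


def recent_core_files(core_dir="/var/core", since_epoch=0.0):
    """Return sorted names of regular files in core_dir modified at or
    after since_epoch (seconds since the Unix epoch)."""
    directory = Path(core_dir)
    if not directory.is_dir():
        return []
    return sorted(
        entry.name
        for entry in directory.iterdir()
        if entry.is_file() and entry.stat().st_mtime >= since_epoch
    )
```

Recording `time.time()` before reproducing the hang and passing it as `since_epoch` after the LC comes back up should narrow the list to files dropped during the test.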
Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
Description for the changelog
[Nokia][sonic-platform] Update Nokia sonic-platform submodule for enhanced platform stability
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)