[Nokia][armhf] SQUASHFS errors which can lead to box becoming inaccessible #8479
Comments
Repair (Work Around): Clean install from ONIE
Investigation from Nokia team:
Ruled out
Bad SQUASHFS block
Good data
Bad data
Analysis shows the corrupted data is an IPv6 BGP Keep Alive packet.
After much effort we have determined the root cause of the squashfs errors and ext4 filesystem corruption observed on the Nokia ixs-7215 M0 platform when running the sonic-mgmt open community test suite. We isolated this down to the auto-restart and critical-process-monitoring test cases. In both of these test cases, a SIGKILL signal is issued to the syncd process within the syncd docker.

There is a flaw in the current handling of this SIGKILL operation on the ixs-7215. Specifically, the DMA engine on the Alley-Cat3x (AC3x) is not gracefully shut down before the allocated DMA buffer memory is freed back to the Linux kernel. As a result, the AC3x DMA engine continues to operate on DMA buffer memory that may have been re-allocated by the kernel for another purpose, for example file system buffers used by the squashfs and/or ext4 filesystems. This is confirmed by our observation that squashfs file system buffers were being overwritten by BGP control packets. We also observed corruption of physical memory location 0, indicating that the AC3x DMA engine is in an inconsistent state as a result of the SIGKILL of syncd.

The Marvell team is currently testing a fix for this problem which performs a graceful shutdown of the AC3x DMA engine before DMA buffers are freed. So far the test results show that the squashfs errors and ext4 corruption are no longer seen. Both Nokia and Marvell continue to run and re-run the sonic-mgmt test suite on multiple test beds as we rigorously test this fix.

Note:
Observed the issue again. Reopening to track.
Description
We observed SQUASHFS errors occurring at random on our testbed. The issue can leave the switch inaccessible.
The SQUASHFS errors occur during the Open Community (OC) test run, usually after 35-45 hours.
Steps to reproduce the issue:
Describe the results you received:
Some tests failed because the switch became inaccessible.
Describe the results you expected:
Tests pass.
Output of show version: N/A
Output of show techsupport: N/A
Additional information you deem important (e.g. issue happens only occasionally):