[Nokia][armhf] SQUASHFS errors which can lead to box becoming inaccessible #8479

Closed
Blueve opened this issue Aug 16, 2021 · 4 comments · Fixed by #8666

Comments

@Blueve
Contributor

Blueve commented Aug 16, 2021

Description

We observed SQUASHFS errors raised at random on our testbed. The issue can leave the switch inaccessible.

The SQUASHFS errors occur during the Open Community (OC) test run, usually after 35-45 hours.

Steps to reproduce the issue:

  1. Install SONiC 202012 armhf on Nokia platform
  2. Run OC test suite continuously

Describe the results you received:

Some tests failed because the switch became inaccessible.

Describe the results you expected:

Tests pass.

Output of show version:

N/A

Output of show techsupport:

N/A

Additional information you deem important (e.g. issue happens only occasionally):

@Blueve
Contributor Author

Blueve commented Aug 19, 2021

Repair (Workaround)

Clean install from ONIE

Investigation from Nokia team:

Ruled out

  • Bad memory (DIMMS) - multiple memory stress tests run
  • Bug in kernel zlib code - believe zlib code is victim of prior failure
  • Kernel config changes to QSQUASHFS_DIRECT
  • SCSI subsystem errors (scsi_logging_level - nothing reported)
  • SSD device health, temperature, bad block count
  • Compile kernel with gcc-7 compiler
  • fstrim interference (commented out)
  • Linux VM file system buffer cache config
  • Arm Cortex-A9 r4 Errata #854369: under very rare timing circumstances, transitioning into streaming mode might create a data corruption (details can be found in https://developer.arm.com/documentation/uan0009/latest)

Bad SQUASHFS block

Good data

< 0007f0 f3 e1 6e f9 70 f7 7f e1 1b a2 bd ad ef 6c f1 e1
< 000800 cf 22 fd c2 5c 9f 7f 05 20 be c6 00 a2 91 50 df
< 000810 0f b2 fe 99 4d ef 28 a7 0d f2 f4 c4 fb d9 f8 d0
< 000820 e3 9e 05 41 bf 92 fb c3 b3 c0 9b ce af 63 2e 04
< 000830 3c 89 fb 9f 60 52 b9 27 f0 d3 8d e5 1d 51 79 61
< 000840 e0 15 9b c8 17 4c 95 7b 01 6f ca dc 49 a8 bc 08
< 000850 f0 56 41 dc 33 e8 b8 37 f0 94 2f 79 62 d5 ad 5f
< 000860 14 b8 b9 29 d1 0c dd eb d3 67 10 ff d2 34 d9 d3
< 000870 c3 a0 ea 8b 81 fe 11 eb 07 eb d6 2f 8e eb 37 e3

Bad data

> 0007f0 f3 e1 6e f9 70 f7 7f e1 1b a2 bd ad ef 6c 50 e0
> 000800 ef 7a f8 51 0e 75 1e 54 3b 41 01 9f 10 02 88 06
> 000810 d4 c1 80 00 00 18 08 00 00 00 86 dd 6c 0c fe 5e
> 000820 00 33 06 01 fc 00 00 00 00 00 00 00 00 00 00 00
> 000830 00 00 00 7a fc 00 00 00 00 00 00 00 00 00 00 00
> 000840 00 00 00 79 00 b3 dd 62 b3 62 7e d0 50 09 c9 3b
> 000850 80 18 00 07 72 6f 00 00 01 01 08 0a c2 50 6b 4a
> 000860 12 2d 9d ce ff ff ff ff ff ff ff ff ff ff ff ff
> 000870 ff ff ff ff 00 13 04 55 55 55 55 00 00 00 37 e3

Analysis shows that the data overwriting the block is an IPv6 BGP Keepalive packet.
Nokia has multiple corrupted samples, each showing a BGP Keepalive packet.
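
The reading of the bad block can be checked mechanically. Below is a minimal Python sketch, not part of Nokia's analysis tooling, that takes the "Bad data" bytes from the dump, searches for the IPv6 EtherType (0x86dd), and decodes the IPv6, TCP, and BGP headers that follow. The function name decode_overwrite is hypothetical, and the sketch assumes a plain IPv6/TCP frame with no IPv6 extension headers.

```python
import ipaddress
import struct

# "Bad data" bytes copied from the dump in this comment (offsets 0x7f0-0x87f).
BAD_HEX = """
f3 e1 6e f9 70 f7 7f e1 1b a2 bd ad ef 6c 50 e0
ef 7a f8 51 0e 75 1e 54 3b 41 01 9f 10 02 88 06
d4 c1 80 00 00 18 08 00 00 00 86 dd 6c 0c fe 5e
00 33 06 01 fc 00 00 00 00 00 00 00 00 00 00 00
00 00 00 7a fc 00 00 00 00 00 00 00 00 00 00 00
00 00 00 79 00 b3 dd 62 b3 62 7e d0 50 09 c9 3b
80 18 00 07 72 6f 00 00 01 01 08 0a c2 50 6b 4a
12 2d 9d ce ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff 00 13 04 55 55 55 55 00 00 00 37 e3
"""


def decode_overwrite(dump: bytes) -> dict:
    """Decode the IPv6/TCP/BGP headers found after the first 0x86dd EtherType."""
    idx = dump.find(b"\x86\xdd")  # EtherType value for IPv6
    if idx < 0:
        raise ValueError("no IPv6 EtherType found in dump")

    ip = dump[idx + 2:]
    # Fixed IPv6 header: version/class/flow (4), payload length (2),
    # next header (1), hop limit (1), then two 16-byte addresses.
    first_word, payload_len, next_header, hop_limit = struct.unpack("!IHBB", ip[:8])
    src = ipaddress.IPv6Address(ip[8:24])
    dst = ipaddress.IPv6Address(ip[24:40])

    tcp = ip[40:]  # assumes no IPv6 extension headers
    src_port, dst_port = struct.unpack("!HH", tcp[:4])
    tcp_header_len = (tcp[12] >> 4) * 4  # data offset is in 32-bit words

    bgp = tcp[tcp_header_len:]
    # BGP header: 16-byte marker, 2-byte length, 1-byte type (4 = KEEPALIVE).
    bgp_length, bgp_type = struct.unpack("!HB", bgp[16:19])

    return {
        "version": first_word >> 28,
        "payload_len": payload_len,
        "next_header": next_header,
        "hop_limit": hop_limit,
        "src": str(src),
        "dst": str(dst),
        "src_port": src_port,
        "dst_port": dst_port,
        "tcp_header_len": tcp_header_len,
        "marker_all_ones": bgp[:16] == b"\xff" * 16,
        "bgp_length": bgp_length,
        "bgp_type": bgp_type,
    }


result = decode_overwrite(bytes.fromhex("".join(BAD_HEX.split())))
for key, value in result.items():
    print(f"{key}: {value}")
```

On these bytes the sketch reports an IPv6 packet from fc00::7a to fc00::79 carrying TCP with source port 179 (BGP), followed by a 19-byte BGP message of type 4 (KEEPALIVE) with an all-ones marker.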

@dflynn-Nokia
Contributor

After much effort we have determined the root cause of the squashfs errors and ext4 filesystem corruption observed on the Nokia ixs-7215 M0 platform when running the sonic-mgmt open community test suite. We isolated this down to the auto-restart and critical-process-monitoring test cases. In both of these test cases, a SIGKILL signal is issued to the syncd process within the syncd docker. There is a flaw in the current handling of this SIGKILL operation on the ixs-7215. Specifically, the DMA engine on the Alley-Cat3x (AC3x) is not gracefully shut down before the allocated DMA buffer memory is freed back to the Linux kernel. As a result, the AC3x DMA engine continues to operate on DMA buffer memory that may have been re-allocated by the kernel for another purpose, for example file system buffers used by the squashfs and/or other ext4 filesystems. This is confirmed by our observation that squashfs file system buffers were being overwritten by BGP control packets. We also observed corruption of physical memory location 0, indicating that the AC3x DMA engine is in an inconsistent state as a result of the SIGKILL of syncd.
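
To make the failure mode concrete, here is a toy Python model of the sequence described above. It is a sketch under stated assumptions, not the Marvell driver or the actual fix: ToyMemory, ToyDmaEngine, and simulate_kill are hypothetical names, and "physical memory" is just a bytearray with a trivial allocator that hands out the most recently freed page first.

```python
class ToyMemory:
    """Flat bytearray standing in for physical memory (hypothetical model)."""

    def __init__(self, pages: int, page_size: int = 16) -> None:
        self.page_size = page_size
        self.data = bytearray(pages * page_size)
        self.free_pages = list(range(pages))

    def alloc(self) -> int:
        return self.free_pages.pop(0)

    def free(self, page: int) -> None:
        # Most recently freed page is reused first, which makes the bug visible.
        self.free_pages.insert(0, page)

    def write(self, page: int, payload: bytes) -> None:
        if len(payload) > self.page_size:
            raise ValueError("payload larger than a page")
        start = page * self.page_size
        self.data[start:start + len(payload)] = payload

    def read(self, page: int) -> bytes:
        start = page * self.page_size
        return bytes(self.data[start:start + self.page_size])


class ToyDmaEngine:
    """Writes received packets to whatever page it was last pointed at."""

    def __init__(self, mem: ToyMemory) -> None:
        self.mem = mem
        self.target = None

    def start(self, page: int) -> None:
        self.target = page

    def stop(self) -> None:
        self.target = None

    def receive(self, packet: bytes) -> None:
        if self.target is not None:
            self.mem.write(self.target, packet)


def simulate_kill(graceful_dma_shutdown: bool) -> bytes:
    """Return the filesystem buffer contents after the owning process dies."""
    mem = ToyMemory(pages=4)
    dma = ToyDmaEngine(mem)

    rx_page = mem.alloc()
    dma.start(rx_page)            # driver hands the page to the DMA engine

    # The process is killed and its DMA buffer goes back to the allocator.
    if graceful_dma_shutdown:
        dma.stop()                # the ordering the fix is described as enforcing
    mem.free(rx_page)

    fs_page = mem.alloc()         # allocator reuses the page for a file system buffer
    mem.write(fs_page, b"SQUASHFS-BLOCK!!")

    dma.receive(b"BGP-KEEPALIVE")  # a control packet arrives afterwards
    return mem.read(fs_page)


print("without shutdown:", simulate_kill(graceful_dma_shutdown=False))
print("with shutdown:   ", simulate_kill(graceful_dma_shutdown=True))
```

In the model, the file system buffer is overwritten by the packet only when the engine is still pointed at the freed page. Stopping the engine before the free leaves the buffer intact.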

The Marvell team is currently testing a fix for this problem which performs a graceful shutdown of the AC3x DMA engine before DMA buffers are freed. So far the test results show that the squashfs errors and ext4 corruption are no longer seen. Both Nokia and Marvell continue to run and re-run the sonic-mgmt test suite on multiple test beds as we rigorously test this fix.

Note:
The AC3x packet processing SW design performs control packet processing in user-mode drivers within the syncd process and thus is susceptible to the syncd SIGKILL operation. Other vendor designs that implement packet processing in kernel-mode drivers are not susceptible. This explains why this problem is not seen when running the sonic-mgmt test suite against other platforms.
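
One reason a user-mode driver is exposed here is that SIGKILL, unlike SIGTERM, cannot be caught, so the killed process gets no chance to run its own cleanup. The short Python sketch below illustrates that general POSIX behaviour. It is not syncd code, and can_install_handler is a hypothetical helper name.

```python
import signal


def can_install_handler(signum: int) -> bool:
    """Return True if a user-space handler can be registered for signum."""
    try:
        previous = signal.signal(signum, lambda received, frame: None)
    except (OSError, ValueError):
        # The OS rejects handlers for SIGKILL; ValueError covers calls made
        # outside the main thread, where Python refuses to set any handler.
        return False
    if previous is not None:
        signal.signal(signum, previous)  # restore the original disposition
    return True


print("SIGTERM catchable:", can_install_handler(signal.SIGTERM))
print("SIGKILL catchable:", can_install_handler(signal.SIGKILL))
```

Because no handler can run on SIGKILL, any hardware quiescing has to happen somewhere other than in the killed process itself.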

@Blueve Blueve linked a pull request Sep 2, 2021 that will close this issue
5 tasks
@Blueve Blueve closed this as completed Sep 7, 2021
lguohan pushed a commit that referenced this issue Sep 16, 2021
1) Enhancements for squashfs issue.
2) Fixed log levels.

Fix #8479
Fix #8698
Fix #8699

Signed-off-by: Rajkumar Pennadam Ramamoorthy <rpennadamram@marvell.com>
@Blueve
Contributor Author

Blueve commented Sep 24, 2021

Observed the issue again. Reopening to track.

@Blueve Blueve reopened this Sep 24, 2021
@radha-danda

@Blueve, the issue is fixed in both the 202012 and master branches.
Link to PR (on master branch): #10403
Link to PR (on 202012 branch): #8836

Kindly check and close the issue.

@Blueve Blueve closed this as completed May 5, 2022
3 participants