[chassis] Clearwater2 cards are going to read-only #90

Closed
arlakshm opened this issue May 21, 2023 · 6 comments

Comments

@arlakshm

The Clearwater2 linecards are going into read-only mode during the sonic-mgmt nightly test.

Messages from dmesg:

[  650.871471] EXT4-fs (loop1): I/O error while writing superblock
[  650.871473] EXT4-fs (loop1): previous I/O error to superblock detected
[  650.871474] EXT4-fs error (device loop1): ext4_journal_check_start:83: Detected aborted journal
[  650.871475] EXT4-fs (loop1): Remounting filesystem read-only
[  650.871504] Buffer I/O error on dev loop1, logical block 0, lost sync page write
[  650.871510] EXT4-fs (loop1): I/O error while writing superblock
[  650.871511] EXT4-fs error (device loop1): ext4_journal_check_start:83: Detected aborted journal
[  650.871514] EXT4-fs (loop1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 12, error -30)
[  650.871516] Buffer I/O error on device loop1, logical block 429626
[  650.871519] Buffer I/O error on device loop1, logical block 429627
[  650.871520] Buffer I/O error on device loop1, logical block 429628
[  650.871521] Buffer I/O error on device loop1, logical block 429629
[  650.871575] EXT4-fs (loop1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 13, error -30)
[  650.963391] EXT4-fs (loop1): This should not happen!! Data will be lost
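
To triage a capture like the one above, it can help to pull out the read-only remount events and the failing block numbers per device. The following is a minimal Python sketch, not part of the original report; the regular expressions are assumptions based only on the message formats visible in this log, and `summarize_dmesg` is a hypothetical helper name.

```python
import re

# Kernel log line: "[  650.871475] <message>"
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")
# Patterns below are assumed from the messages shown in this issue.
REMOUNT_RE = re.compile(
    r"EXT4-fs \((?P<dev>[^)]+)\): Remounting filesystem read-only"
)
IO_ERROR_RE = re.compile(
    r"Buffer I/O error on dev(?:ice)? (?P<dev>\w+), logical block (?P<block>\d+)"
)


def summarize_dmesg(text):
    """Return (remounts, io_errors) extracted from dmesg text.

    remounts:  list of (timestamp, device) for read-only remount events
    io_errors: dict mapping device -> list of failing logical block numbers
    """
    remounts = []
    io_errors = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a timestamped kernel line
        ts, msg = float(match.group("ts")), match.group("msg")
        remount = REMOUNT_RE.search(msg)
        if remount:
            remounts.append((ts, remount.group("dev")))
        io_error = IO_ERROR_RE.search(msg)
        if io_error:
            blocks = io_errors.setdefault(io_error.group("dev"), [])
            blocks.append(int(io_error.group("block")))
    return remounts, io_errors


# Three lines copied from the log in this issue.
SAMPLE = """\
[  650.871475] EXT4-fs (loop1): Remounting filesystem read-only
[  650.871504] Buffer I/O error on dev loop1, logical block 0, lost sync page write
[  650.871516] Buffer I/O error on device loop1, logical block 429626
"""

if __name__ == "__main__":
    print(summarize_dmesg(SAMPLE))
```

On a live card the input would be the text output of `dmesg` rather than `SAMPLE`.
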
@kenneth-arista

@arlakshm as discussed offline, mounting /tmp and /var as tmpfs will help minimize potential flash corruption due to power loss, etc.
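
One way to verify that suggestion on a running card is to check the filesystem type reported for each mount point. This is a minimal sketch that assumes the standard `/proc/mounts` field layout (device, mount point, filesystem type, options); the sample text is made up for illustration and is not taken from a Clearwater2 card, and the function names are hypothetical.

```python
def mount_fstypes(mounts_text):
    """Map each mount point to its filesystem type from /proc/mounts-style text."""
    fstypes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            # fields: device, mount point, filesystem type, options, ...
            # A later entry for the same mount point overrides an earlier one.
            fstypes[fields[1]] = fields[2]
    return fstypes


def not_on_tmpfs(mounts_text, paths=("/tmp", "/var")):
    """Return the entries of `paths` that are not mounted as tmpfs."""
    fstypes = mount_fstypes(mounts_text)
    return [path for path in paths if fstypes.get(path) != "tmpfs"]


# Hypothetical sample, for illustration only.
SAMPLE_MOUNTS = """\
/dev/loop1 / ext4 rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
"""

if __name__ == "__main__":
    print(not_on_tmpfs(SAMPLE_MOUNTS))  # ['/var']
```

On a real system the input would come from reading `/proc/mounts`. A path that is not a separate mount point at all is also reported, since it then lives on its parent filesystem.
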

@kenneth-arista

Logs/info to collect:

  • dmesg: logs are stored in a ring buffer and thus should be queried ASAP after a failure is seen
  • smartctl -a /dev/sda1
  • /var/log/syslog
  • journalctl
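
Because the dmesg ring buffer can wrap, it may be worth grabbing all of these in one step as soon as a failure is seen. The sketch below is one possible way to do that in Python, not an existing SONiC tool: `collect_diagnostics` and the output file names are hypothetical, and `--no-pager` is added to `journalctl` as a precaution against waiting on a pager. The commands typically need root.

```python
import shutil
import subprocess
from pathlib import Path

# Commands taken from the list above.
DEFAULT_COMMANDS = {
    "dmesg": ["dmesg"],
    "smartctl": ["smartctl", "-a", "/dev/sda1"],
    "journalctl": ["journalctl", "--no-pager"],
}
DEFAULT_FILES = ["/var/log/syslog"]


def collect_diagnostics(out_dir, commands=DEFAULT_COMMANDS, files=DEFAULT_FILES):
    """Save each command's output and copy each log file into out_dir.

    Returns the list of paths written. A command that cannot be run still
    gets an output file, containing the failure reason.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for name, argv in commands.items():
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=120
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            text = f"failed to run {argv}: {exc}\n"
        else:
            text = result.stdout + result.stderr
        target = out / f"{name}.txt"
        target.write_text(text)
        saved.append(target)
    for src in files:
        src_path = Path(src)
        if src_path.is_file():  # silently skip logs that do not exist
            target = out / src_path.name
            shutil.copy(src_path, target)
            saved.append(target)
    return saved
```

Once the filesystem has gone read-only, the output directory has to be on something still writable (a tmpfs or a remote mount, for instance), otherwise the writes will fail.
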

@rlhui

rlhui commented Jul 8, 2023

@kenneth-arista is the root cause confirmed/known?

@kenneth-arista

The trigger is not specific to CL2. It is a known behavior of EXT4: when it detects file system corruption, for example after an unclean unmount (sudden power loss, etc.), it remounts the file system read-only.
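
To check whether a card is currently in that state, as opposed to having logged the event at some point, one option is to look for the `ro` flag in the mount options. This is a minimal sketch assuming the standard `/proc/mounts` layout; the sample entries, including the `errors=remount-ro` option shown on one of them, are hypothetical and not taken from an affected card.

```python
def read_only_mounts(mounts_text, fstype="ext4"):
    """Return (device, mount point) pairs of the given type mounted read-only."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields: device, mount point, filesystem type, options, ...
        if len(fields) < 4 or fields[2] != fstype:
            continue
        options = fields[3].split(",")
        if "ro" in options:  # exact match, so "errors=remount-ro" does not count
            hits.append((fields[0], fields[1]))
    return hits


# Hypothetical sample, for illustration only.
SAMPLE_MOUNTS = """\
/dev/loop1 /host ext4 ro,relatime,errors=remount-ro 0 0
/dev/sda3 /data ext4 rw,relatime,errors=remount-ro 0 0
"""

if __name__ == "__main__":
    print(read_only_mounts(SAMPLE_MOUNTS))  # [('/dev/loop1', '/host')]
```
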

@kenneth-arista

Looks like other platforms are moving /var/log to tmpfs to minimize writes to flash. See sonic-net/sonic-buildimage#15077

@kenneth-arista

The problem is understood, so I am closing this issue. We'll be pushing some changes to the platform code that should help mitigate occurrences in sonic-mgmt testing.
