L2arc checksum errors #2879

Closed
ColdCanuck opened this issue Nov 7, 2014 · 8 comments
Labels
Status: Inactive Not being actively updated Status: Stale No recent activity for issue Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@ColdCanuck
Contributor

I have noticed that the L2ARC checksum error count, reported as l2_cksum_bad in /proc/spl/kstat/zfs/arcstats, starts to show non-zero values after the L2ARC has filled and wrapped around.

I have observed this on three different machines with three different cache devices (an SSD, /dev/zram, and a file-backed block device mounted on /dev/loop). In all cases there are zero errors until the L2ARC fills and, presumably, wraps around.
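For reference, a minimal way to watch this as the device fills (a sketch, not from the report itself; the interval is arbitrary):

while sleep 60; do
    date +%T
    grep -E '^l2_(size|cksum_bad) ' /proc/spl/kstat/zfs/arcstats
done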

ZFS version is 0.6.3, from Darik's PPA for Ubuntu 12.04:
dmesg | grep ZFS
[ 6.521205] ZFS: Loaded module v0.6.3-2~precise, ZFS pool version 5000, ZFS filesystem version 5

kernel is:
Linux version 3.5.0-54-generic (buildd@allspice) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #81~precise1-Ubuntu SMP Tue Jul 15 04:02:22 UTC 2014

The phenomenon is independent of cache size; I've tried 4, 8, and 64 GB sizes.

L2ARC compression is on:

cat /sys/module/zfs/parameters/l2arc_nocompress
0

Since it happens on three very different types of block devices, and on different machines, I don't think the underlying device is causing data corruption. My guess would be a bug in the wrap-around code, but that is based only on the fact that the errors start to accumulate once the underlying block device has filled.

My hope is that this just lowers the cache efficiency; the data with the bad checksum is simply ignored.

@behlendorf
Contributor

@ColdCanuck I just checked two of my machines which are also running 0.6.3 and have 100% full L2ARC devices. Neither is reporting non-zero values in l2_cksum_bad, for what it's worth. Here are a few things to check on one of your systems:

  1. Is the l2_io_error counter being incremented? That would indicate an issue with the disk, but it sounds like that's not the case here.
  2. Is there any non-default tuning you're using?
  3. Are you getting a cksum error for every L2ARC block, or just some? (A quick way to check 1 and 2 is sketched after this list.)
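For items 1 and 2, a sketch along these lines could be used (the commands are illustrative, using the standard ZoL kstat and module-parameter paths):

# current error counters (columns are name, type, value)
grep -E '^l2_(cksum_bad|io_error) ' /proc/spl/kstat/zfs/arcstats
# dump every zfs module parameter to spot non-default tuning
grep -r . /sys/module/zfs/parameters/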

My hope is that this just lowers the cache efficiency; the data with the bad checksum is simply ignored.

That's exactly right. If the checksum fails, the ARC just falls back to reading the data from the primary storage. Your application won't be given incorrect data even if the L2ARC device feeds you random garbage.

@ColdCanuck
Contributor Author

@behlendorf No special tuning in use other than the ARC size being limited to 12 GB; there are two pools, each with a 64 GB L2ARC, backed in this case by partitions on an SSD. But I have seen this with /dev/zram-backed L2ARCs as well, and I believe with L2ARC compression both on and off.
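For illustration, capping the ARC at 12 GB is normally done through the zfs_arc_max module parameter; the exact setting used here was not posted, so the value below (12 GiB in bytes) is an assumption:

echo 12884901888 > /sys/module/zfs/parameters/zfs_arc_max   # runtime; assumed value
# or persistently, in /etc/modprobe.d/zfs.conf:
# options zfs zfs_arc_max=12884901888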

I had to reboot the machine for an unrelated matter, so the cache is no longer full and no longer throwing errors. I will monitor this as they fill up again.

However, I did save some logs of /proc/.../arcstats from when it was having problems.

The io_error count is non-zero but much smaller than the cksum error count. There are no errors logged in the system logs.

Here are the deltas between two snapshots separated in time:

l2_cksum_bad 75600
l2_io_error 26871
l2_hits 176330
l2_read_bytes 12016334336

That works out to:
68146 bytes / hit
0.152 io_errors / hit
0.429 cksum_errors / hit

Since the deltas were taken while the errors were occurring, the failure does not appear to happen on every L2ARC hit.
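As a sketch, the per-hit ratios above can be derived from two saved copies of arcstats (the snapshot file names arcstats.t0 and arcstats.t1 are hypothetical):

awk 'NR == FNR { base[$1] = $3; next }   # first snapshot: remember each counter
     $1 ~ /^l2_(cksum_bad|io_error|hits|read_bytes)$/ { d[$1] = $3 - base[$1] }
     END {
         printf "bytes/hit        %.0f\n", d["l2_read_bytes"] / d["l2_hits"]
         printf "io_errors/hit    %.3f\n", d["l2_io_error"]  / d["l2_hits"]
         printf "cksum_errors/hit %.3f\n", d["l2_cksum_bad"] / d["l2_hits"]
     }' arcstats.t0 arcstats.t1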

I will watch and report back when it starts again. Given that it does not compromise data integrity, there are more important issues to fix.

@AndCycle

AndCycle commented Nov 8, 2014

I have had a similar issue a few times, but I haven't found out how to reliably reproduce it. It always happens after the L2ARC is full and starts wrapping around from the beginning, and only if the L2ARC was added/removed without a reboot.

I haven't hit this issue in a while, so I didn't report it.

@ColdCanuck
Contributor Author

In my case the L2ARC was always added after a reboot as well.

@maci0
Contributor

maci0 commented Nov 9, 2014

If it happens only after adding L2ARC devices without a reboot, it should be easy to reproduce: you can just create a 10 MB zram device or so and add that, along the lines of the sketch below. I'll give it a try.
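Roughly, something like this sketch (the pool name tank, sizes, and the traffic-generation step are placeholders):

modprobe zram num_devices=1
echo $((10 * 1024 * 1024)) > /sys/block/zram0/disksize   # ~10 MB cache device
zpool add tank cache /dev/zram0
# ...drive enough cached reads to fill and wrap the device, then check:
grep -E '^l2_(cksum_bad|io_error) ' /proc/spl/kstat/zfs/arcstats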

@hunleyd

hunleyd commented Jan 21, 2015

Just curious whether you were able to reproduce this, @maci0, as I'm seeing it on my zram-based L2ARC as well (uptime ~25 days):
l2_cksum_bad 4 1950160
l2_io_error 4 0

And yes, my L2ARC device is added after a reboot (by a systemd service script, if it matters).
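The actual service script wasn't posted; purely as a sketch, such a unit might look like the following (the unit name, pool name, size, and the zfs-import.target ordering are all assumptions and vary by ZoL version):

# /etc/systemd/system/zram-l2arc.service (hypothetical)
[Unit]
Description=Add a zram-backed L2ARC after pool import
After=zfs-import.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 4G > /sys/block/zram0/disksize && zpool add tank cache /dev/zram0'

[Install]
WantedBy=zfs.target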

@maci0
Contributor

maci0 commented Jan 27, 2015

In the past I have had problems with zram and other filesystems. Those boiled down to the need to use the system's PAGESIZE as the blocksize for the filesystem. Maybe that applies here as well.
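A quick way to check the alignment described above (device and pool names are placeholders; the vdev's ashift is set when the cache device is added):

getconf PAGESIZE                                 # typically 4096
cat /sys/block/zram0/queue/logical_block_size    # block size the zram device advertises
# align the cache vdev with the page size when adding it:
zpool add -o ashift=12 tank cache /dev/zram0     # 2^12 = 4096 bytes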

@stale

stale bot commented Aug 25, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label on Aug 25, 2020
stale bot closed this as completed on Nov 25, 2020