L2arc checksum errors #2879
Comments
@ColdCanuck I just checked two of my machines, which are also running 0.6.3 and have 100% full L2ARC devices. Neither is reporting non-zero values in
That's exactly right. If the checksum fails, the ARC just falls back to reading the data from the primary store. Your application won't be given incorrect data even if the L2ARC device feeds you random garbage.
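The fallback described here can be illustrated with a toy model. This is only a sketch of the idea and not ZFS code: the `read_block` function, the dictionary-based cache and store, and the use of CRC32 as the checksum are all assumptions made for illustration.

```python
import zlib


def read_block(key, l2_cache, primary_store):
    """Return (data, source); never return cached data whose checksum fails.

    Hypothetical model: l2_cache maps key -> (data, expected_checksum),
    primary_store maps key -> data. CRC32 is a stand-in checksum.
    """
    entry = l2_cache.get(key)
    if entry is not None:
        data, expected_checksum = entry
        if zlib.crc32(data) == expected_checksum:
            return data, "l2"
        # Checksum mismatch: discard the cached copy and fall through.
        l2_cache.pop(key)
    return primary_store[key], "primary"


# The cache device hands back garbage, but the caller still gets good data.
primary_store = {"blk0": b"good data"}
l2_cache = {"blk0": (b"random garbage", zlib.crc32(b"good data"))}
data, source = read_block("blk0", l2_cache, primary_store)
print(data, source)
```

In this model a bad checksum only costs a cache hit, which matches the point that the application is never given incorrect data.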
@behlendorf No special tuning is used other than limiting the ARC size to 12GB; there are two pools, each with a 64GB L2ARC backed in this case by partitions on an SSD. But I have seen this with /dev/zram-backed L2ARCs as well, and perhaps with L2ARC compression on and off. I had to reboot the machine for an unrelated matter, so the cache is no longer full and no longer throwing errors. I will monitor this as the caches fill up again. However, I did have some saved logs of /proc/.../arcstats from when it was having problems. The io_error count is non-zero but much smaller than the cksum error count. There are no errors logged in the system logs. Here are the deltas for two states separated in time: cksum_error 75600 or 68146 bytes / hit. Since the deltas were taken after the errors were already happening, it does not appear to happen on every l2_hit. I will watch and report back as it starts again. Given that it does not compromise data, there are more important issues to fix.
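A comparison like this one, between two saved arcstats snapshots, could be sketched as follows. The helper names and all counter values below are hypothetical and are not the figures from the saved logs; the counter names `l2_hits`, `l2_cksum_bad` and `l2_io_error` are assumed to be the relevant arcstats fields.

```python
L2_COUNTERS = ("l2_hits", "l2_cksum_bad", "l2_io_error")


def counter_deltas(before, after, names=L2_COUNTERS):
    """Difference of each named counter between two snapshots (dicts)."""
    return {name: after[name] - before[name] for name in names}


def checksum_errors_per_hit(deltas):
    """Fraction of L2ARC hits in the interval that failed the checksum."""
    hits = deltas["l2_hits"]
    if hits == 0:
        return None  # no hits in the interval, so the ratio is undefined
    return deltas["l2_cksum_bad"] / hits


# Hypothetical snapshots, for illustration only.
before = {"l2_hits": 1000, "l2_cksum_bad": 10, "l2_io_error": 1}
after = {"l2_hits": 1500, "l2_cksum_bad": 60, "l2_io_error": 2}

deltas = counter_deltas(before, after)
print(deltas, checksum_errors_per_hit(deltas))
```

A ratio well below 1 over an interval in which errors were already occurring would be consistent with the observation that not every l2_hit fails.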
I have hit a similar issue a few times but never found out how to reproduce it reliably. I haven't seen it for a while, so I didn't report it.
In my case the L2ARC was always added after reboot as well.
If it happens only after adding L2ARC devices without a reboot, it should be easy to reproduce. One can just create a 10MB zram device or so and add that. I'll give it a try.
Just curious if you were able to repro this @maci0 as I'm seeing it on my zram-based l2arc as well (uptime ~25 days): and yes, my l2arc dev is added after a reboot (by a systemd service script, if it matters).
In the past I have had problems with zram and other filesystems. Those boiled down to the need to use the system's PAGESIZE as the block size for the FS. Maybe that applies here as well.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
I have noticed that the L2 checksum error count as reported by /proc/spl/kstat/zfs/arcstats l2_cksum_bad starts to show non-zero after the L2cache has been filled and wraps around.
I have observed this on three different machines with three different cache devices (SSD, /dev/zram, and a file-backed block device mounted on /dev/loop). In all cases there are zero errors until the L2cache fills and presumably wraps around.
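For anyone who wants to watch these counters over time, here is a minimal sketch of reading them. It assumes the usual kstat layout of arcstats (a header line, a `name type data` line, then one `name type value` row per counter); the function names are hypothetical, and `l2_hits` and `l2_io_error` are assumed to be present alongside `l2_cksum_bad`.

```python
from pathlib import Path

ARCSTATS_PATH = Path("/proc/spl/kstat/zfs/arcstats")


def parse_arcstats(text):
    """Parse kstat-style text into {counter_name: int_value}."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Counter rows have exactly three columns and a numeric value;
        # this skips the kstat header and the "name type data" line.
        if len(fields) == 3 and fields[2].isdigit():
            counters[fields[0]] = int(fields[2])
    return counters


def l2_error_counters(counters):
    """Pick out the L2ARC counters discussed here (0 if one is absent)."""
    names = ("l2_hits", "l2_cksum_bad", "l2_io_error")
    return {name: counters.get(name, 0) for name in names}


if __name__ == "__main__":
    if ARCSTATS_PATH.exists():
        print(l2_error_counters(parse_arcstats(ARCSTATS_PATH.read_text())))
    else:
        print("arcstats not found; is the zfs module loaded?")
```

Running this periodically and logging the output should show whether `l2_cksum_bad` stays at zero until the cache device has filled.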
zfs version is 0.6.3 from Darik's PPA for Ubuntu 12.04
dmesg | grep ZFS
[ 6.521205] ZFS: Loaded module v0.6.3-2~precise, ZFS pool version 5000, ZFS filesystem version 5
kernel is:
Linux version 3.5.0-54-generic (buildd@allspice) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #81~precise1-Ubuntu SMP Tue Jul 15 04:02:22 UTC 2014
The phenomenon is independent of cache size; I've tried 4, 8 and 64GB sizes.
L2arc compression is on
cat /sys/module/zfs/parameters/l2arc_nocompress
0
Since it happens on three very different types of block devices, and on different machines, I don't think the underlying device is causing data corruption. My guess would be a bug in the "wrap code", but that's only from the fact that the errors start to accumulate once the underlying block device has been filled.
My hope is that this just lowers the cache efficiency; the data with the bad checksum is simply ignored.