L2arc checksum errors #2879

Closed
ColdCanuck opened this issue Nov 7, 2014 · 8 comments
Labels
Status: Inactive Not being actively updated Status: Stale No recent activity for issue Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@ColdCanuck
Contributor

I have noticed that the L2ARC checksum error count, reported as l2_cksum_bad in /proc/spl/kstat/zfs/arcstats, starts to show non-zero values after the L2ARC has filled and wrapped around.

I have observed this on three different machines with three different cache devices (an SSD, /dev/zram, and a file-backed block device mounted on /dev/loop). In all cases there are zero errors until the L2ARC fills and, presumably, wraps around.
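For reference, a minimal way to watch this as the device fills (a sketch, not from the report itself; the interval is arbitrary):

while sleep 60; do
    date +%T
    grep -E '^l2_(size|cksum_bad) ' /proc/spl/kstat/zfs/arcstats
done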

ZFS version is 0.6.3, from Darik's PPA for Ubuntu 12.04:
dmesg | grep ZFS
[ 6.521205] ZFS: Loaded module v0.6.3-2~precise, ZFS pool version 5000, ZFS filesystem version 5

kernel is:
Linux version 3.5.0-54-generic (buildd@allspice) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #81~precise1-Ubuntu SMP Tue Jul 15 04:02:22 UTC 2014

The phenomenon is independent of cache size; I've tried 4, 8, and 64 GB sizes.

L2ARC compression is on:

cat /sys/module/zfs/parameters/l2arc_nocompress
0

Since it happens on three very different types of block devices, and on different machines, I don't think the underlying device is causing data corruption. My guess would be a bug in the wrap-around code, but that is based only on the fact that the errors start to accumulate once the underlying block device has filled.

My hope is that this just lowers the cache efficiency; the data with the bad checksum is simply ignored.

@behlendorf
Contributor

@ColdCanuck I just checked two of my machines which are also running 0.6.3 and have 100% full L2ARC devices. Neither is reporting non-zero values in l2_cksum_bad, for what it's worth. Here are a few things to check on one of your systems:

  1. Is the l2_io_error counter being incremented? That would indicate an issue with the disk, but it sounds like that's not the case here.
  2. Is there any non-default tuning you're using?
  3. Are you getting a cksum error for every L2ARC block, or just some? (A quick way to check 1 and 2 is sketched after this list.)
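For items 1 and 2, a sketch along these lines could be used (the commands are illustrative, using the standard ZoL kstat and module-parameter paths):

# current error counters (columns are name, type, value)
grep -E '^l2_(cksum_bad|io_error) ' /proc/spl/kstat/zfs/arcstats
# dump every zfs module parameter to spot non-default tuning
grep -r . /sys/module/zfs/parameters/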

My hope is that this just lowers the cache efficiency; the data with the bad checksum is simply ignored.

That's exactly right. If the checksum fails, the ARC just falls back to reading the data from the primary storage. Your application won't be given incorrect data even if the L2ARC device feeds you random garbage.

@ColdCanuck
Contributor Author

@behlendorf No special tuning in use other than the ARC size being limited to 12 GB; there are two pools, each with a 64 GB L2ARC, backed in this case by partitions on an SSD. But I have seen this with /dev/zram-backed L2ARCs as well, and I believe with L2ARC compression both on and off.
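For illustration, capping the ARC at 12 GB is normally done through the zfs_arc_max module parameter; the exact setting used here was not posted, so the value below (12 GiB in bytes) is an assumption:

echo 12884901888 > /sys/module/zfs/parameters/zfs_arc_max   # runtime; assumed value
# or persistently, in /etc/modprobe.d/zfs.conf:
# options zfs zfs_arc_max=12884901888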

I had to reboot the machine for an unrelated matter, so the cache is no longer full and no longer throwing errors. I will monitor this as they fill up again.

However, I did save some logs of /proc/.../arcstats from when it was having problems.

The io_error count is non-zero but much smaller than the cksum error count. There are no errors logged in the system logs.

Here are the deltas between two snapshots separated in time:

l2_cksum_bad 75600
l2_io_error 26871
l2_hits 176330
l2_read_bytes 12016334336

That works out to:
68146 bytes / hit
0.152 io_errors / hit
0.429 cksum_errors / hit

Since the deltas were taken while the errors were occurring, the failure does not appear to happen on every L2ARC hit.
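As a sketch, the per-hit ratios above can be derived from two saved copies of arcstats (the snapshot file names arcstats.t0 and arcstats.t1 are hypothetical):

awk 'NR == FNR { base[$1] = $3; next }   # first snapshot: remember each counter
     $1 ~ /^l2_(cksum_bad|io_error|hits|read_bytes)$/ { d[$1] = $3 - base[$1] }
     END {
         printf "bytes/hit        %.0f\n", d["l2_read_bytes"] / d["l2_hits"]
         printf "io_errors/hit    %.3f\n", d["l2_io_error"]  / d["l2_hits"]
         printf "cksum_errors/hit %.3f\n", d["l2_cksum_bad"] / d["l2_hits"]
     }' arcstats.t0 arcstats.t1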

I will watch and report back when it starts again. Given that it does not compromise data integrity, there are more important issues to fix.

@AndCycle

AndCycle commented Nov 8, 2014

I have had a similar issue a few times, but I haven't found out how to reliably reproduce it. It always happens after the L2ARC is full and starts wrapping around from the beginning, and only if the L2ARC was added/removed without a reboot.

I haven't hit this issue in a while, so I didn't report it.

@ColdCanuck
Contributor Author

In my case the L2ARC was always added after a reboot as well.

@maci0
Contributor

maci0 commented Nov 9, 2014

If it happens only after adding L2ARC devices without a reboot, it should be easy to reproduce: you can just create a 10 MB zram device or so and add that, along the lines of the sketch below. I'll give it a try.
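Roughly, something like this sketch (the pool name tank, sizes, and the traffic-generation step are placeholders):

modprobe zram num_devices=1
echo $((10 * 1024 * 1024)) > /sys/block/zram0/disksize   # ~10 MB cache device
zpool add tank cache /dev/zram0
# ...drive enough cached reads to fill and wrap the device, then check:
grep -E '^l2_(cksum_bad|io_error) ' /proc/spl/kstat/zfs/arcstats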

@hunleyd

hunleyd commented Jan 21, 2015

Just curious whether you were able to reproduce this, @maci0, as I'm seeing it on my zram-based L2ARC as well (uptime ~25 days):
l2_cksum_bad 4 1950160
l2_io_error 4 0

And yes, my L2ARC device is added after a reboot (by a systemd service script, if it matters).
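The actual service script wasn't posted; purely as a sketch, such a unit might look like the following (the unit name, pool name, size, and the zfs-import.target ordering are all assumptions and vary by ZoL version):

# /etc/systemd/system/zram-l2arc.service (hypothetical)
[Unit]
Description=Add a zram-backed L2ARC after pool import
After=zfs-import.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 4G > /sys/block/zram0/disksize && zpool add tank cache /dev/zram0'

[Install]
WantedBy=zfs.target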

@maci0
Contributor

maci0 commented Jan 27, 2015

In the past I have had problems with zram and other filesystems. Those boiled down to the need to use the system's PAGESIZE as the blocksize for the filesystem. Maybe that applies here as well.
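A quick way to check the alignment described above (device and pool names are placeholders; the vdev's ashift is set when the cache device is added):

getconf PAGESIZE                                 # typically 4096
cat /sys/block/zram0/queue/logical_block_size    # block size the zram device advertises
# align the cache vdev with the page size when adding it:
zpool add -o ashift=12 tank cache /dev/zram0     # 2^12 = 4096 bytes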

@stale

stale bot commented Aug 25, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label on Aug 25, 2020
stale bot closed this as completed on Nov 25, 2020