VERIFY3(0 == remove_reference(hdr, ((void *)0), tag)) failed (0 == 1) #6881
Comments
I don't have a solution yet, but I also see the same bug. This is on a 36-bay system populated with 8T disks.

`# zpool status` reports:

errors: 2927 data errors, use '-v' for a list

In the dmesg:

Then followed by many more of the same messages. I started out with an Areca SAS RAID card and initially thought the problem was with the card, so I replaced it with an LSI, but I still see the same issue. I was initially using the latest CentOS 7 kernel with the CentOS 7 zfsonlinux packages, but yesterday I installed kernel 4.14.3 with spl-0.7.3 (patched to change vfs_read and vfs_write to kernel_read and kernel_write) and zfs-0.7.3, and I still see the same error. If anyone on the zfsonlinux project has an idea what is going on here, I would very much appreciate it!
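The checks referred to in that comment amount to roughly the following; `tank` is a placeholder pool name, and the actual output is omitted above:

```sh
# Pool-level error counts and the list of affected files/snapshots.
zpool status -v tank

# Pull the assertion failure and surrounding read-pipeline messages
# out of the kernel log.
dmesg | grep -E 'VERIFY3|remove_reference|z_rd_int'
```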
Any movement on this issue?
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
This issue was raised at the September 2021 OpenZFS Leadership meeting as a potentially serious bug that has fallen through the cracks. It may relate to https://www.illumos.org/issues/14003#note-1
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
This bug is still out there, but it's apparently hard to reproduce. We've seen it "in the wild".
I can confirm this issue is still relevant - I just got it. I might suggest a cause, though. First my details. And the error:

[1351503.757506] INFO: task z_rd_int_1:7824 blocked for more than 362 seconds.

and the console:

kernel:[1351131.586643] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)

This was a device hanging off USB-C that developed a fault on an mdraid I am running (I'm in the middle of upgrading this laptop). I had to reseat the cable (pull it out, plug it in), so it is highly likely that something about the device errors in the interim caused this issue. The mdraid, by the way, recovered with no issues (AFAIK), but the fault would appear to be a definite trigger event.
I also just ran into something very similar. This is with regular spinning rust drives connected via SATA (using a SAS expander) and there were no other IO-related errors in the kernel log. Since it was mentioned in some other issue, I use ZFS encryption. From SMART monitoring it seems that
EDIT: Weirdly, after a reboot the warning was still there in status, but there was no CKSUM error listed for the problem drive, and after running
Skip down a couple sections if you don't care about any of this backstory; it might be relevant to the problem, or it might not, I don't know.
Background information
I have a ZFS filesystem containing disk image files which I update every week and then take a snapshot, so that I have a weekly backup history stored up.
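That weekly cycle is just the ordinary snapshot workflow; a minimal sketch, with the dataset, image path, and source all as placeholders (none of them are given in the report):

```sh
# Refresh the image in place, then snapshot the dataset; only blocks that
# changed since last week's snapshot consume new space on the pool.
rsync --inplace /backups/source/disk.img /tank/images/disk.img
zfs snapshot tank/images@$(date +%Y%m%d)      # e.g. tank/images@20170605
zfs list -t snapshot -r tank/images           # the accumulated weekly history
```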
I had some disk corruption on this pool's 4-disk raidz1 vdev during a period in which I was running it on only 3 disks, so `zpool status -v` knows of a couple of snapshots that contain corrupted data. I'm fairly certain that there are also a few more snapshots affected than just what `zpool status -v` reports, due to shared blocks between the snapshots.

Basically, this pool is barely hanging on, with at least one very marginal drive. I've spent over 3 months and numerous hours now attempting to get all of my data off of this doomed pool via `zfs send` / `zfs recv`, something that is still not fully possible due to `EINVAL`-related problems with `zfs recv` that I've described in other issue reports here. I also cannot really "simply swap out the bad disks," because whenever I attempt a `zfs replace`, it gets stuck at a certain point, never completes, and never detaches the old device. (Oh, and this pool now also spends its time perpetually resilvering over and over, apparently due to the stuck-replacement problem. Or some other bug. I don't know.)

Here's what this wonderful pool looks like:
What I was doing when the panics happened
I've resigned myself to the inevitability that I'll never be able to `zfs send` / `zfs recv` my data off of this pool and onto a new one before my drives have fully disintegrated into scrap. Since this collection of historical backups relies extremely heavily on CoW space savings between snapshots (it's about 2.77 TiB compressed, but would be well over 7x larger otherwise), I decided to write up some scripts and manually port the data over to the new pool in a manner that maintains the CoW space savings by only writing the changed parts.

The details of that aren't relevant; what is relevant here is that, as a precondition to doing this manual transfer process, I ran a bunch of `md5sum` processes with GNU `parallel` to get hashes of each disk image at each snapshot. Most of this went fine.
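The hashing pass comes down to something like this, assuming the reads go through the dataset's hidden `.zfs/snapshot` directory; the dataset path and image filename are placeholders, and the snapshot names are the ones mentioned in this report:

```sh
# Hash the same disk image as it appeared in each weekly snapshot.
# Each ".zfs/snapshot/<name>" directory is a read-only view of that snapshot.
mkdir -p /tmp/hashes
cd /tank/images/.zfs/snapshot
parallel -j4 'md5sum {}/disk.img > /tmp/hashes/{}.md5' ::: \
    20170327 20170515 20170522 20170529 20170605
```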
For reference, `zpool status -v` knows of data corruption in snapshots `20170327` and `20170529`.
The `md5sum` process that attempted to read the disk image in the `20170327` snapshot errored out with `EIO`, as expected.

However, the `md5sum` processes that attempted to read the disk image in the `20170515`, `20170522`, `20170529`, and `20170605` snapshots got stuck in state `D` (uninterruptible sleep) forever, presumably due to the task thread panics that are the subject of this report.

Interestingly, the system is still entirely usable (i.e. other I/O initiated after the problem doesn't get stuck), but those `md5sum` processes are forever stuck in state `D`, and the couple of `z_rd_int_7` and `z_rd_int_0` kernel threads that had the panics are now idle, doing nothing.

The task thread panics themselves
Two panics, separated by approximately 2 seconds.
The stuck `md5sum` processes

All of the stuck `md5sum` processes have the following stack:

System information