I've been running main + some patches (see below) for two weeks now with major snapshot transfer traffic. No kernel panics or dataset busy errors! It appears this issue is fixed.
[root@dtc-san2 spl]# git status
On branch master
Your branch is ahead of 'origin/master' by 2 commits.
nothing to commit (working directory clean)
[root@dtc-san2 spl]# git log --oneline | head -n 4
75a9afb Merge branch 'kernel-slab' of https://github.com/DeHackEd/spl
4c99541 Remove incorrect use of EXTRA_DIST for man pages
1f3c06f Use Linux slab allocator for small allocations
d58a99a Define the needed ISA types for Sparc
[root@dtc-san2 spl]# cd ../zfs
[root@dtc-san2 zfs]# git status
On branch master
Your branch is ahead of 'origin/master' by 17 commits.
nothing to commit (working directory clean)
[root@dtc-san2 zfs]# git log --oneline | head -n 20
05c3ca1 Merge branch 'arc-mutexes' of https://github.com/DeHackEd/zfs
cad1342 Merge branch 'issue-2094' of https://github.com/behlendorf/zfs
dc47d0c Workaround for pools affected by #2094 erratum
02c0d35 Revert changes to zbookmark_t
bff1dc8 Add zimport.sh compatibility test script
92d5283 Set "arc_meta_limit" to 3/4 arc_c_max by default
520f372 Split "data_size" into "meta" and "data"
513198f Prioritize "metadata" in arc_get_data_buf
863f942 Remove "arc_meta_used" from arc_adjust calculation
d03b1ff Prune metadata from ghost lists in arc_adjust_meta
cd661cd Revert "Return -1 from arc_shrinker_func()"
f5ee929 Disable arc_p adapt dampener by default
fdd21ca Allow "arc_p" to drop to zero or grow to "arc_c"
00fc78f Disable aggressive arc_p growth by default
7364874 Adjust arc_p based on "bytes" in arc_shrink
8e82b07 Set zfs_arc_min to 4MB
98fad86 Propagate errors when registering "relatime" property callback.
99d3ece Add default FILEDIR path to zpool-config scripts
f12971e Add explicit Conflicts for zfs-fuse packages
0820ec6 Fix zconfig.sh test 9
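For completeness, the patched SPL/ZFS trees above get rebuilt and installed along these general lines (a rough sketch; the exact paths, configure options, and install method used on this box are assumptions):

# Rough rebuild of the patched stacks shown above; paths are assumptions.
cd /root/spl && ./autogen.sh && ./configure && make && make install
cd /root/zfs && ./autogen.sh && ./configure --with-spl=/root/spl && make && make install
# Reload the modules (or reboot) so the patched builds take effect.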
Same setup as outlined in issue #1761. Rebooted a server (hostname ctc-san2) affected by the "dataset busy" error at roughly 16:17 local time so that snapshot send/recv could continue. At 16:45 local time, a scripted zfs send cron task kicked off from another host, with ctc-san2 as the "zfs recv" side. It was a fairly large snapshot stream of around 20 incremental snapshots. Logs indicate that about eight snapshots were received successfully, up to 17:26 local time. Shortly after that (17:34 local time) the system dumped core. (It was a large core file, so the dump took a while; once it finished, the system rebooted and continued with normal operation.)
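For context, each cycle of that scripted transfer amounts to roughly the following on the sending host (an illustrative sketch only; the snapshot names, hostname, and receive flags are assumptions, not the actual cron script):

# Hypothetical incremental send/recv step; snapshot names follow the
# hour-YYYYMMDDHHMMSS pattern visible in the zpool history further down.
zfs send -i pool/chicago@hour-20131016160101 pool/chicago@hour-20131016170101 | \
    ssh ctc-san2 zfs recv -F pool/chicago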
Comments / suggestions most welcome. =)
Debug output from crash utility:
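(The session below is from the crash utility run against the resulting vmcore; the invocation was along these lines, with the debuginfo and dump paths being assumptions:)

# Illustrative only; the actual vmcore path will differ.
crash /usr/lib/debug/lib/modules/2.6.32-358.18.1.el6.x86_64/vmlinux /var/crash/<dump-dir>/vmcore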
crash> log
...
BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
IP: [<ffffffff8150f621>] mutex_lock+0x21/0x50
PGD 108383d067 PUD 1186785067 PMD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/devices/virtual/block/zd16/dev
CPU 1
Modules linked in: mptctl mptbase drbd(U) autofs4 sunrpc cpufreq_ondemand freq_table pcc_cpufreq bonding 8021q garp stp llc ipv6 uinput hpilo hpwdt qlcnic zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate microcode serio_raw iTCO_wdt iTCO_vendor_support ioatdma dca sg power_meter bnx2x libcrc32c mdio tg3 shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom hpsa pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 14962, comm: zvol_id Tainted: P --------------- 2.6.32-358.18.1.el6.x86_64 #1 HP ProLiant DL380p Gen8
RIP: 0010:[<ffffffff8150f621>]  [<ffffffff8150f621>] mutex_lock+0x21/0x50
RSP: 0018:ffff88107f9cdbe8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000058 RCX: ffff8816f611d9e8
RDX: 0000000000000000 RSI: ffffffffa038d628 RDI: 0000000000000058
RBP: ffff88107f9cdbf8 R08: ffff88107f9cdb20 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa038d628
R13: 000000000000001d R14: ffff8817c3f05be0 R15: ffff8810803ce000
FS: 00007f961929d740(0000) GS:ffff880094820000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000058 CR3: 000000108118f000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process zvol_id (pid: 14962, threadinfo ffff88107f9cc000, task ffff88181b29a040)
Stack:
0000000000000000 ffffffffa038d628 ffff88107f9cdc18 ffffffffa02dca75
ffff8816fffc6000 ffffffffa038d628 ffff88107f9cdc38 ffffffffa0380c6c
ffff8816fffc6000 0000000000000001 ffff88107f9cdc58 ffffffffa0380d58
Call Trace:
[] dmu_buf_rele+0x25/0x40 [zfs]
[] zvol_last_close+0x3c/0xb0 [zfs]
[] zvol_release+0x78/0x80 [zfs]
[] __blkdev_put+0x1a6/0x1b0
[] __blkdev_get+0x2e9/0x3b0
[] ? blkdev_open+0x0/0xc0
[] blkdev_get+0x10/0x20
[] blkdev_open+0x71/0xc0
[] __dentry_open+0x10a/0x360
[] ? security_inode_permission+0x1f/0x30
[] nameidata_to_filp+0x54/0x70
[] do_filp_open+0x6d0/0xdc0
[] ? __do_page_fault+0x1ec/0x480
[] ? alloc_fd+0x92/0x160
[] do_sys_open+0x69/0x140
[] sys_open+0x20/0x30
[] system_call_fastpath+0x16/0x1b
Code: ff ff 85 c0 75 ed eb d6 66 90 55 48 89 e5 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 0f 1f 44 00 00 48 89 fb e8 a2 f4 ff ff 48 89 df ff 0f 79 05 e8 25 00 00 00 65 48 8b 04 25 c8 cb 00 00 48 2d
RIP [<ffffffff8150f621>] mutex_lock+0x21/0x50
RSP <ffff88107f9cdbe8>
CR2: 0000000000000058
Backtrace:
crash> bt
PID: 14962 TASK: ffff88181b29a040 CPU: 1 COMMAND: "zvol_id"
#0 [ffff88107f9cd7b0] machine_kexec at ffffffff81035d6b
#1 [ffff88107f9cd810] crash_kexec at ffffffff810c0e22
#2 [ffff88107f9cd8e0] oops_end at ffffffff81511c20
#3 [ffff88107f9cd910] no_context at ffffffff81046c1b
#4 [ffff88107f9cd960] __bad_area_nosemaphore at ffffffff81046ea5
#5 [ffff88107f9cd9b0] bad_area at ffffffff81046fce
#6 [ffff88107f9cd9e0] __do_page_fault at ffffffff81047780
#7 [ffff88107f9cdb00] do_page_fault at ffffffff81513b6e
#8 [ffff88107f9cdb30] page_fault at ffffffff81510f25
[exception RIP: mutex_lock+33]
RIP: ffffffff8150f621 RSP: ffff88107f9cdbe8 RFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000058 RCX: ffff8816f611d9e8
RDX: 0000000000000000 RSI: ffffffffa038d628 RDI: 0000000000000058
RBP: ffff88107f9cdbf8 R8: ffff88107f9cdb20 R9: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa038d628
R13: 000000000000001d R14: ffff8817c3f05be0 R15: ffff8810803ce000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff88107f9cdc00] dmu_buf_rele at ffffffffa02dca75 [zfs]
#10 [ffff88107f9cdc20] zvol_last_close at ffffffffa0380c6c [zfs]
#11 [ffff88107f9cdc40] zvol_release at ffffffffa0380d58 [zfs]
#12 [ffff88107f9cdc60] __blkdev_put at ffffffff811bcdb6
#13 [ffff88107f9cdcb0] __blkdev_get at ffffffff811bd169
#14 [ffff88107f9cdd10] blkdev_get at ffffffff811bd240
#15 [ffff88107f9cdd20] blkdev_open at ffffffff811bd2c1
#16 [ffff88107f9cdd50] __dentry_open at ffffffff8117e49a
#17 [ffff88107f9cddb0] nameidata_to_filp at ffffffff8117e804
#18 [ffff88107f9cddd0] do_filp_open at ffffffff81193260
#19 [ffff88107f9cdf20] do_sys_open at ffffffff8117e249
#20 [ffff88107f9cdf70] sys_open at ffffffff8117e360
#21 [ffff88107f9cdf80] system_call_fastpath at ffffffff8100b072
Subset of zpool history -i:
2013-10-16.17:05:09 [internal destroy txg:4673114] dataset = 45990
2013-10-16.17:05:14 zfs destroy pool/chicago@hour-20131016130101
2013-10-16.17:07:23 [internal snapshot txg:4673142] dataset = 61585
2013-10-16.17:07:23 [internal destroy txg:4673143] dataset = 61535
2013-10-16.17:07:23 [internal property set txg:4673143] reservation=0 dataset = 61535
2013-10-16.17:07:23 [internal replay_inc_sync txg:4673144] dataset = 61591
2013-10-16.17:11:46 [internal snapshot txg:4673198] dataset = 61615
2013-10-16.17:11:46 [internal destroy txg:4673199] dataset = 61591
2013-10-16.17:11:46 [internal property set txg:4673199] reservation=0 dataset = 61591
2013-10-16.17:11:47 [internal replay_inc_sync txg:4673200] dataset = 61621
2013-10-16.17:21:05 [internal snapshot txg:4673313] dataset = 61645
2013-10-16.17:21:05 [internal destroy txg:4673314] dataset = 61621
2013-10-16.17:21:05 [internal property set txg:4673314] reservation=0 dataset = 61621
2013-10-16.17:21:06 [internal replay_inc_sync txg:4673315] dataset = 61651
2013-10-16.17:22:15 [internal snapshot txg:4673330] dataset = 61670
2013-10-16.17:22:15 [internal destroy txg:4673331] dataset = 61651
2013-10-16.17:22:15 [internal property set txg:4673331] reservation=0 dataset = 61651
2013-10-16.17:22:16 [internal replay_inc_sync txg:4673332] dataset = 61676
2013-10-16.17:25:35 [internal snapshot txg:4673373] dataset = 61702
2013-10-16.17:25:36 [internal destroy txg:4673374] dataset = 61676
2013-10-16.17:25:36 [internal property set txg:4673374] reservation=0 dataset = 61676
2013-10-16.17:25:36 [internal replay_inc_sync txg:4673375] dataset = 61708
2013-10-16.17:26:32 [internal snapshot txg:4673388] dataset = 61731
2013-10-16.17:26:32 [internal destroy txg:4673389] dataset = 61708
2013-10-16.17:26:32 [internal property set txg:4673389] reservation=0 dataset = 61708
2013-10-16.17:26:33 [internal replay_inc_sync txg:4673390] dataset = 61737
2013-10-16.17:34:19 [internal snapshot txg:4673485] dataset = 61764
2013-10-16.17:34:19 [internal destroy txg:4673486] dataset = 61737
2013-10-16.17:34:19 [internal property set txg:4673486] reservation=0 dataset = 61737
2013-10-16.19:01:01 [internal snapshot txg:4673546] dataset = 61443
2013-10-16.19:01:06 zfs snapshot pool/chicago@hour-20131016190101