zfsstress: send/recv deadlock #4169

Closed
nedbass opened this issue Jan 5, 2016 · 7 comments
Labels
Component: Test Suite (Indicates an issue with the test framework or a test case)
Status: Inactive (Not being actively updated)
Status: Stale (No recent activity for issue)
Status: Understood (The root cause of the issue is known)

Comments

nedbass (Contributor) commented Jan 5, 2016

My test system deadlocked while running the zfsstress test scripts. It was running the large dnode patch from #3542, but the deadlock doesn't appear to be related to that patch. I haven't done any analysis yet, but I've gathered the relevant stack traces below.

root      9988     2  0  2015 ?        00:00:28 [z_fr_iss_7]
[<ffffffff81194afb>] pipe_wait+0x5b/0x80
[<ffffffff811953da>] pipe_write+0x36a/0x6b0
[<ffffffff81189bea>] do_sync_write+0xfa/0x140
[<ffffffff81189ee8>] vfs_write+0xb8/0x1a0
[<ffffffffa03c8a9d>] vn_rdwr+0xbd/0x290 [spl]
[<ffffffffa0c10db2>] dump_bytes_cb+0x82/0x170 [zfs]
[<ffffffffa03c66f4>] taskq_thread+0x264/0x560 [spl]
[<ffffffff8109b4d6>] kthread+0x96/0xa0
[<ffffffff8100c24a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

root     10039     2  0  2015 ?        00:35:16 [txg_sync]
[<ffffffffa03cb946>] cv_wait_common+0xf6/0x2a0 [spl]
[<ffffffffa03cbb08>] __cv_wait_io+0x18/0x20 [spl]
[<ffffffffa0cd21ab>] zio_wait+0x1bb/0x350 [zfs]
[<ffffffffa0c578b0>] spa_sync+0x830/0xdc0 [zfs]
[<ffffffffa0c73673>] txg_sync_thread+0x3e3/0x700 [zfs]
[<ffffffffa03c4eb1>] thread_generic_wrapper+0x71/0xb0 [spl]
[<ffffffff8109b4d6>] kthread+0x96/0xa0
[<ffffffff8100c24a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

root     12870 12869  0 Jan03 pts/1    00:00:00 /home/bass6/zfs/cmd/zfs/.libs/lt-zfs recv tank/d/RECV_
[<ffffffffa03cb9c5>] cv_wait_common+0x175/0x2a0 [spl]
[<ffffffffa03cbb45>] __cv_wait+0x15/0x20 [spl]
[<ffffffffa0c71f0f>] txg_wait_synced+0x11f/0x1f0 [zfs]
[<ffffffffa0c4231a>] dsl_sync_task+0x16a/0x250 [zfs]
[<ffffffffa0c10026>] dmu_recv_begin+0x3f6/0x420 [zfs]
[<ffffffffa0ca847c>] zfs_ioc_recv+0x18c/0xc30 [zfs]
[<ffffffffa0ca7ac8>] zfsdev_ioctl+0x4a8/0x510 [zfs]
[<ffffffff8119f362>] vfs_ioctl+0x22/0xa0
[<ffffffff8119f984>] do_vfs_ioctl+0x84/0x5e0
[<ffffffff8119ff61>] sys_ioctl+0x81/0xa0
[<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

root     12873 12868  0 Jan03 pts/1    00:00:00 /home/bass6/zfs/cmd/zfs/.libs/lt-zfs send tank/d/RECV_DUm8VsVfZGig mlYlNH4Cv0EvVqvIwLaieutj
[<ffffffffa03c5c35>] taskq_wait_id+0x65/0xa0 [spl]
[<ffffffffa0c5e10c>] spa_taskq_dispatch_sync+0x9c/0x130 [zfs]
[<ffffffffa0c10ee2>] dump_bytes+0x42/0x50 [zfs]
[<ffffffffa0c10fc5>] dump_record+0xd5/0x150 [zfs]
[<ffffffffa0c11585>] dump_write+0x2a5/0x4f0 [zfs]
[<ffffffffa0c12dcd>] backup_cb+0x88d/0x9c0 [zfs]
[<ffffffffa0c13faa>] traverse_visitbp+0x50a/0x900 [zfs]
[<ffffffffa0c13e1f>] traverse_visitbp+0x37f/0x900 [zfs]
[<ffffffffa0c14c01>] traverse_dnode+0x71/0xe0 [zfs]
[<ffffffffa0c1414d>] traverse_visitbp+0x6ad/0x900 [zfs]
[<ffffffffa0c13e1f>] traverse_visitbp+0x37f/0x900 [zfs]
[<ffffffffa0c13e1f>] traverse_visitbp+0x37f/0x900 [zfs]
[<ffffffffa0c13e1f>] traverse_visitbp+0x37f/0x900 [zfs]
[<ffffffffa0c13e1f>] traverse_visitbp+0x37f/0x900 [zfs]
[<ffffffffa0c13e1f>] traverse_visitbp+0x37f/0x900 [zfs]
[<ffffffffa0c13e1f>] traverse_visitbp+0x37f/0x900 [zfs]
[<ffffffffa0c14c01>] traverse_dnode+0x71/0xe0 [zfs]
[<ffffffffa0c1423d>] traverse_visitbp+0x79d/0x900 [zfs]
[<ffffffffa0c14543>] traverse_impl+0x1a3/0x580 [zfs]
[<ffffffffa0c149b6>] traverse_dataset+0x56/0x60 [zfs]
[<ffffffffa0c11bdb>] dmu_send_impl+0x40b/0x5b0 [zfs]
[<ffffffffa0c12257>] dmu_send_obj+0x197/0x210 [zfs]
[<ffffffffa0ca3ef7>] zfs_ioc_send+0xa7/0x280 [zfs]
[<ffffffffa0ca7ac8>] zfsdev_ioctl+0x4a8/0x510 [zfs]
[<ffffffff8119f362>] vfs_ioctl+0x22/0xa0
[<ffffffff8119f984>] do_vfs_ioctl+0x84/0x5e0
[<ffffffff8119ff61>] sys_ioctl+0x81/0xa0
[<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
tuxoko (Contributor) commented Jan 5, 2016

There's no thread doing zio?
What's in /proc/spl/taskq?

nedbass (Contributor, Author) commented Jan 6, 2016

@tuxoko Unfortunately I rebooted the system and lost that state. It does look like [z_fr_iss_7] is stuck and holding things up, though; none of the other taskq threads had anything noteworthy in their stacks. I'll post more data if I manage to reproduce it.

tuxoko (Contributor) commented Jan 6, 2016

[z_fr_iss_7] is on the send side, writing to the pipe, and it itself looks fairly normal. It doesn't hold anything in ZFS other than the taskq thread itself. The question is whether anyone is doing I/O on the recv side.

I had a very similar issue in openzfs/spl#496, where the recv I/O taskq was the same as the send-side taskq, and dynamic spawning didn't spawn a thread when it should have, so nobody was left to do the real work. I'm not sure, though, whether you already have that patch or whether you're using dynamic taskqs at all.
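
For illustration, here is a minimal user-space model (pthreads, not SPL or ZFS code; the send_item/recv_item names are made up) of that kind of circular dependence: a single-threaded work queue runs an item that can only make progress once a later item on the same queue has run, so the queue deadlocks against itself.

/* cc -pthread circular_demo.c && ./a.out   (hangs, which is the point) */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t recv_progress;

/* "send side" work item: blocks until the "recv side" makes progress. */
static void send_item(void)
{
	printf("worker: running send item, waiting on recv side...\n");
	sem_wait(&recv_progress);		/* never posted: deadlock */
	printf("worker: send item done\n");	/* never reached */
}

/* "recv side" work item: queued behind send_item on the same worker, never runs. */
static void recv_item(void)
{
	sem_post(&recv_progress);
}

/* The lone worker runs queued items strictly in order. */
static void *worker(void *arg)
{
	void (**tasks)(void) = arg;

	for (int i = 0; tasks[i] != NULL; i++)
		tasks[i]();
	return NULL;
}

int main(void)
{
	/* Both items end up on the same one-thread "taskq". */
	void (*tasks[])(void) = { send_item, recv_item, NULL };
	pthread_t w;

	sem_init(&recv_progress, 0, 0);
	pthread_create(&w, NULL, worker, tasks);
	pthread_join(w, NULL);	/* hangs forever: circular dependence */
	return 0;
}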

nedbass (Contributor, Author) commented Jan 6, 2016

I was running a recent SPL master branch with the fix for #496.

tuxoko (Contributor) commented Jan 6, 2016

@nedbass @behlendorf
I think I understand why: there's a window where tq_nactive is not accurate.

If t1 and t2 both dispatch to the same dynamic taskq while it has only one thread, the taskq may not spawn a new thread when it should:

t1                    t2                   taskq_thread
taskq_dispatch()
 spin_lock()          taskq_dispatch()
 //tq_nactive=0        spin_lock() //spin
 //tq_nthread=1
 //nospawn                                 //woken
 spin_unlock()                             spin_lock() //spin
                       //tq_nactive=0
                       //tq_nthread=1
                       //nospawn
                       spin_unlock()
                                           tq_nactive=1
                                           spin_unlock()
                                           tq_func() //wait on t2 work item
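
A minimal user-space model of that window (pthreads, not SPL code; the tq_nactive/tq_nthreads names and the spawn check are simplified stand-ins for the real ones): both dispatchers take tq_lock before the lone worker has woken up and accounted for itself, so both see tq_nactive == 0 and neither spawns a thread, leaving two queued items for a single busy worker.

/* cc -pthread nactive_window.c && ./a.out */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t tq_lock = PTHREAD_MUTEX_INITIALIZER;
static int tq_nactive  = 0;	/* threads currently running an item */
static int tq_nthreads = 1;	/* threads in the taskq */

/* Simplified stand-in for the dynamic-spawn check done under tq_lock. */
static bool should_spawn(void)
{
	return tq_nactive >= tq_nthreads;	/* everyone looks idle: no spawn */
}

static void *dispatcher(void *name)
{
	pthread_mutex_lock(&tq_lock);
	printf("%s: nactive=%d nthreads=%d -> %s\n", (char *)name,
	    tq_nactive, tq_nthreads, should_spawn() ? "spawn" : "no spawn");
	pthread_mutex_unlock(&tq_lock);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	/* t1 and t2 both dispatch before the worker accounts for itself. */
	pthread_create(&t1, NULL, dispatcher, "t1");
	pthread_create(&t2, NULL, dispatcher, "t2");
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* Only now does the lone worker wake up and bump tq_nactive ... */
	pthread_mutex_lock(&tq_lock);
	tq_nactive = 1;
	pthread_mutex_unlock(&tq_lock);

	/* ... and it runs t1's item, which waits on t2's item forever. */
	printf("worker: nactive=%d, two items queued, one thread\n", tq_nactive);
	return 0;
}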

tuxoko (Contributor) commented Jan 6, 2016

The easy fix would be to give dump_bytes_cb its own taskq, which would eliminate this specific circular dependency.

Edit:
This also brings up an ancient circular-dependency deadlock where DMU traverse prefetch and vdev_file both used system_taskq. vdev_file was moved to a dedicated taskq (bc25c93) but was switched back when dynamic taskqs were added (aa9af22).

So it comes down to whether we want to eliminate all of the circular dependencies or fix the dynamic spawning behaviour.
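
For what it's worth, a rough sketch of the dedicated-taskq idea, assuming the standard SPL taskq API (taskq_create/taskq_dispatch/taskq_wait_id); the zfs_send taskq name, my_send_io_t, and my_dump_cb are illustrative placeholders, not the change that actually landed:

#include <sys/taskq.h>

static taskq_t *zfs_send_taskq;

typedef struct my_send_io {
	void	*io_buf;
	int	io_len;
	int	io_err;
} my_send_io_t;

static void
my_dump_cb(void *arg)
{
	my_send_io_t *io = arg;

	/* ... write io->io_buf to the output fd, as dump_bytes_cb does ... */
	io->io_err = 0;
}

void
zfs_send_taskq_init(void)
{
	/* One thread suffices: callers dispatch and wait synchronously. */
	zfs_send_taskq = taskq_create("zfs_send", 1, maxclsyspri,
	    1, INT_MAX, TASKQ_PREPOPULATE);
}

void
zfs_send_taskq_fini(void)
{
	taskq_destroy(zfs_send_taskq);
}

int
my_dump_bytes(void *buf, int len)
{
	my_send_io_t io = { .io_buf = buf, .io_len = len };

	/*
	 * Dispatch to the dedicated taskq and wait for that one task,
	 * mirroring the spa_taskq_dispatch_sync() pattern in the send
	 * stack above, but without sharing threads with the zio taskqs.
	 */
	taskq_wait_id(zfs_send_taskq,
	    taskq_dispatch(zfs_send_taskq, my_dump_cb, &io, TQ_SLEEP));
	return (io.io_err);
}

The point is just that a callback which can block on external progress (here, the pipe reader) gets its own threads, so it can no longer starve the taskqs it used to share.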

behlendorf added the Component: Test Suite label on Nov 8, 2016
behlendorf changed the title from "send/recv deadlock" to "zfsstress: send/recv deadlock" on Jan 24, 2017
stale bot commented Aug 25, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label on Aug 25, 2020
stale bot closed this as completed on Nov 25, 2020