Fixing race condition between O_DIRECT and ZIL
There existed a race condition that was discovered through the
dio_random test. When running fio with --fsync=32, fsync is called on
the file after every 32 writes. When this happens, blocks committed to
the ZIL will be synced out. However, the code for the O_DIRECT write
was updated in 31983d2 to always wait for all previous TXGs to sync
out if there was an ARC buf associated with the dbuf.

There was an oversight with this update. While waiting on previous
TXGs to sync out, the O_DIRECT write holds the rangelock as a writer
the entire time. This causes an issue when the ZIL commit writes out a
block through `zfs_get_data()`, because it will try to grab the
rangelock as a reader. This leads to a deadlock.
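
To make the lock ordering concrete, here is a minimal user-space
analogue of the hang (a sketch only: POSIX threads stand in for the
ZFS rangelock, db_mtx, and db_changed, and the whole TXG-sync
dependency chain is collapsed into a single `dirty` flag; this is not
ZFS code and hangs by design when compiled with cc -pthread):

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t range_lock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t db_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t db_changed = PTHREAD_COND_INITIALIZER;
static int dirty = 1;

/* Analogue of the O_DIRECT writer stalled waiting on previous TXGs. */
static void *odirect_writer(void *arg)
{
	(void) arg;
	pthread_rwlock_wrlock(&range_lock);	/* rangelock held as WRITER */
	pthread_mutex_lock(&db_mtx);
	while (dirty)				/* wait for dirty records */
		pthread_cond_wait(&db_changed, &db_mtx);
	pthread_mutex_unlock(&db_mtx);
	pthread_rwlock_unlock(&range_lock);
	return (NULL);
}

/* Analogue of the ZIL commit path entering zfs_get_data(). */
static void *zil_commit_path(void *arg)
{
	(void) arg;
	pthread_rwlock_rdlock(&range_lock);	/* blocks: writer never drops it */
	pthread_mutex_lock(&db_mtx);
	dirty = 0;				/* never reached */
	pthread_cond_signal(&db_changed);
	pthread_mutex_unlock(&db_mtx);
	pthread_rwlock_unlock(&range_lock);
	return (NULL);
}

int main(void)
{
	pthread_t a, b;
	pthread_create(&a, NULL, odirect_writer, NULL);
	pthread_create(&b, NULL, zil_commit_path, NULL);
	pthread_join(a, NULL);	/* hangs: wait-while-holding deadlock */
	pthread_join(b, NULL);
	printf("never reached\n");
	return (0);
}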

In order to fix this race condition, I updated the `dmu_buf_impl_t`
struct to contain a uint8_t variable that is used to signal whether
the dbuf attached to an O_DIRECT write is stalled in this wait because
of mixed direct and buffered data. Using this new
`db_mixed_io_dio_wait` variable in the `dmu_buf_impl_t`, the code in
`zfs_get_data()` can tell that the rangelock is already held across
the entire block and that there is no need to grab the rangelock at
all. Because the rangelock is already held as a writer across the
entire block, no modifications can take place against the block as
long as the O_DIRECT write is stalled waiting in
`dmu_buf_direct_mixed_io_wait()`.
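
Condensed from the `zfs_get_data()` hunk in the diff below, the check
looks roughly like this (error handling and the immediate/indirect
write split are omitted):

mutex_enter(&db->db_mtx);
boolean_t rangelock_held = db->db_mixed_io_dio_wait;
zgd->zgd_grabbed_rangelock = !(rangelock_held);
mutex_exit(&db->db_mtx);

if (zgd->zgd_grabbed_rangelock) {
	/* Normal path: take the rangelock as a reader. */
	zgd->zgd_lr = zfs_rangelock_enter(&zp->z_rangelock,
	    offset, size, RL_READER);
}
/* Otherwise the stalled O_DIRECT write already holds it as a writer. */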

Also as part of this update, I realized the `db_state` in
`dmu_buf_direct_mixed_io_wait()` needs to be changed temporarily to
`DB_CACHED`. This is necessary so the logic in `dbuf_read()` is correct
if `dmu_sync_late_arrival()` is called by `dmu_sync()`. It is
completely valid to switch the `db_state` back to `DB_CACHED`, as
there is still an associated ARC buf that will not be freed until our
O_DIRECT write is completed, which will only happen after it leaves
`dmu_buf_direct_mixed_io_wait()`.
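
The corresponding wait loop in `dmu_buf_direct_mixed_io_wait()`,
condensed from the diff below (db_mtx is held around the loop;
assertions and kstat bumps are omitted):

db->db_mixed_io_dio_wait = TRUE;
db->db_state = DB_CACHED;	/* ARC buf is still attached and valid */
while (dbuf_find_dirty_lte(db, txg) != NULL)
	cv_wait(&db->db_changed, &db->db_mtx);
db->db_mixed_io_dio_wait = FALSE;
/* dmu_write_direct_done() updates db_state after the wait. */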

Here is the stack trace of the deadlock that happened with
`dio_random.ksh` before this commit:
[ 5513.663402] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 7496.580415] INFO: task z_wr_int:1098000 blocked for more than 120
seconds.
[ 7496.585709]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7496.593349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7496.600839] task:z_wr_int        state:D stack:    0 pid:1098000
ppid:     2 flags:0x80004080
[ 7496.608622] Call Trace:
[ 7496.611770]  __schedule+0x2d1/0x870
[ 7496.615404]  schedule+0x55/0xf0
[ 7496.618866]  cv_wait_common+0x16d/0x280 [spl]
[ 7496.622910]  ? finish_wait+0x80/0x80
[ 7496.626601]  dmu_buf_direct_mixed_io_wait+0x73/0x190 [zfs]
[ 7496.631327]  dmu_write_direct_done+0x90/0x3a0 [zfs]
[ 7496.635798]  zio_done+0x373/0x1d40 [zfs]
[ 7496.639795]  zio_execute+0xee/0x210 [zfs]
[ 7496.643840]  taskq_thread+0x203/0x420 [spl]
[ 7496.647836]  ? wake_up_q+0x70/0x70
[ 7496.651411]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[ 7496.656489]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 7496.660604]  kthread+0x134/0x150
[ 7496.664092]  ? set_kthread_struct+0x50/0x50
[ 7496.668080]  ret_from_fork+0x35/0x40
[ 7496.671745] INFO: task txg_sync:1098025 blocked for more than 120
seconds.
[ 7496.676991]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7496.684666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7496.692060] task:txg_sync        state:D stack:    0 pid:1098025
ppid:     2 flags:0x80004080
[ 7496.699888] Call Trace:
[ 7496.703012]  __schedule+0x2d1/0x870
[ 7496.706658]  schedule+0x55/0xf0
[ 7496.710093]  schedule_timeout+0x197/0x300
[ 7496.713982]  ? __next_timer_interrupt+0xf0/0xf0
[ 7496.718135]  io_schedule_timeout+0x19/0x40
[ 7496.722049]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7496.726349]  ? finish_wait+0x80/0x80
[ 7496.730039]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7496.734100]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7496.738082]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 7496.742205]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7496.746534]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 7496.750842]  spa_sync_iterate_to_convergence+0xcf/0x310 [zfs]
[ 7496.755742]  spa_sync+0x362/0x8d0 [zfs]
[ 7496.759689]  txg_sync_thread+0x274/0x3b0 [zfs]
[ 7496.763928]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 7496.768439]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 7496.772799]  thread_generic_wrapper+0x63/0x90 [spl]
[ 7496.777097]  kthread+0x134/0x150
[ 7496.780616]  ? set_kthread_struct+0x50/0x50
[ 7496.784549]  ret_from_fork+0x35/0x40
[ 7496.788204] INFO: task fio:1101750 blocked for more than 120 seconds.
[ 7496.895852]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7496.903765] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7496.911170] task:fio             state:D stack:    0 pid:1101750
ppid:1101741 flags:0x00004080
[ 7496.919033] Call Trace:
[ 7496.922136]  __schedule+0x2d1/0x870
[ 7496.925769]  schedule+0x55/0xf0
[ 7496.929245]  schedule_timeout+0x197/0x300
[ 7496.933120]  ? __next_timer_interrupt+0xf0/0xf0
[ 7496.937213]  io_schedule_timeout+0x19/0x40
[ 7496.941126]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7496.945444]  ? finish_wait+0x80/0x80
[ 7496.949125]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7496.953191]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7496.957180]  dmu_write_abd+0x174/0x1c0 [zfs]
[ 7496.961319]  dmu_write_uio_direct+0x79/0xf0 [zfs]
[ 7496.965731]  dmu_write_uio_dnode+0xa6/0x2d0 [zfs]
[ 7496.970043]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 7496.974305]  zfs_write+0x55f/0xea0 [zfs]
[ 7496.978325]  ? iov_iter_get_pages+0xe9/0x390
[ 7496.982333]  ? trylock_page+0xd/0x20 [zfs]
[ 7496.986451]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7496.990713]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7496.995031]  zpl_iter_write_direct+0xda/0x170 [zfs]
[ 7496.999489]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.003476]  zpl_iter_write+0xd5/0x110 [zfs]
[ 7497.007610]  new_sync_write+0x112/0x160
[ 7497.011429]  vfs_write+0xa5/0x1b0
[ 7497.014916]  ksys_write+0x4f/0xb0
[ 7497.018443]  do_syscall_64+0x5b/0x1b0
[ 7497.022150]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.026532] RIP: 0033:0x7f8771d72a17
[ 7497.030195] Code: Unable to access opcode bytes at RIP
0x7f8771d729ed.
[ 7497.035263] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[ 7497.042547] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72a17
[ 7497.047933] RDX: 000000000009b000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7497.053269] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7497.058660] R10: 0000000000000000 R11: 0000000000000293 R12:
000000000009b000
[ 7497.063960] R13: 000055b390afcac0 R14: 000000000009b000 R15:
000055b390afcae8
[ 7497.069334] INFO: task fio:1101751 blocked for more than 120 seconds.
[ 7497.074308]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7497.081973] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7497.089371] task:fio             state:D stack:    0 pid:1101751
ppid:1101741 flags:0x00000080
[ 7497.097147] Call Trace:
[ 7497.100263]  __schedule+0x2d1/0x870
[ 7497.103897]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.107878]  schedule+0x55/0xf0
[ 7497.111386]  cv_wait_common+0x16d/0x280 [spl]
[ 7497.115391]  ? finish_wait+0x80/0x80
[ 7497.119028]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 7497.123667]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 7497.128240]  zfs_read+0xaf/0x3f0 [zfs]
[ 7497.132146]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.136091]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7497.140366]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7497.144679]  zpl_iter_read_direct+0xe0/0x180 [zfs]
[ 7497.149054]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.153040]  zpl_iter_read+0x94/0xb0 [zfs]
[ 7497.157103]  new_sync_read+0x10f/0x160
[ 7497.160855]  vfs_read+0x91/0x150
[ 7497.164336]  ksys_read+0x4f/0xb0
[ 7497.168004]  do_syscall_64+0x5b/0x1b0
[ 7497.171706]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.176105] RIP: 0033:0x7f8771d72ab4
[ 7497.179742] Code: Unable to access opcode bytes at RIP
0x7f8771d72a8a.
[ 7497.184807] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[ 7497.192129] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72ab4
[ 7497.197485] RDX: 0000000000002000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7497.202922] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7497.208309] R10: 00000001ffffffff R11: 0000000000000246 R12:
0000000000002000
[ 7497.213694] R13: 000055b390afcac0 R14: 0000000000002000 R15:
000055b390afcae8
[ 7497.219063] INFO: task fio:1101755 blocked for more than 120 seconds.
[ 7497.224098]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7497.231786] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7497.239165] task:fio             state:D stack:    0 pid:1101755
ppid:1101744 flags:0x00000080
[ 7497.246989] Call Trace:
[ 7497.250121]  __schedule+0x2d1/0x870
[ 7497.253779]  schedule+0x55/0xf0
[ 7497.257240]  schedule_preempt_disabled+0xa/0x10
[ 7497.261344]  __mutex_lock.isra.7+0x349/0x420
[ 7497.265326]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7497.269674]  zil_commit_writer+0x89/0x230 [zfs]
[ 7497.273938]  zil_commit_impl+0x5f/0xd0 [zfs]
[ 7497.278101]  zfs_fsync+0x81/0xa0 [zfs]
[ 7497.282002]  zpl_fsync+0xe5/0x140 [zfs]
[ 7497.285985]  do_fsync+0x38/0x70
[ 7497.289458]  __x64_sys_fsync+0x10/0x20
[ 7497.293208]  do_syscall_64+0x5b/0x1b0
[ 7497.296928]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.301260] RIP: 0033:0x7f9559073027
[ 7497.304920] Code: Unable to access opcode bytes at RIP
0x7f9559072ffd.
[ 7497.310015] RSP: 002b:00007ffdefcd0ff0 EFLAGS: 00000293 ORIG_RAX:
000000000000004a
[ 7497.317346] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9559073027
[ 7497.322722] RDX: 0000000000000000 RSI: 0000563adcbf2ac0 RDI:
0000000000000005
[ 7497.328126] RBP: 00007f94fb858000 R08: 0000000000000000 R09:
0000000000000000
[ 7497.333514] R10: 0000000000008000 R11: 0000000000000293 R12:
0000000000000003
[ 7497.338887] R13: 0000563adcbf2ac0 R14: 0000000000000000 R15:
0000563adcbf2ae8
[ 7497.344247] INFO: task fio:1101756 blocked for more than 120 seconds.
[ 7497.349327]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7497.357032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7497.364517] task:fio             state:D stack:    0 pid:1101756
ppid:1101744 flags:0x00004080
[ 7497.372310] Call Trace:
[ 7497.375433]  __schedule+0x2d1/0x870
[ 7497.379004]  schedule+0x55/0xf0
[ 7497.382454]  cv_wait_common+0x16d/0x280 [spl]
[ 7497.386477]  ? finish_wait+0x80/0x80
[ 7497.390137]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 7497.394816]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 7497.399397]  zfs_get_data+0x1a8/0x7e0 [zfs]
[ 7497.403515]  zil_lwb_commit+0x1a5/0x400 [zfs]
[ 7497.407712]  zil_lwb_write_close+0x408/0x630 [zfs]
[ 7497.412126]  zil_commit_waiter_timeout+0x16d/0x520 [zfs]
[ 7497.416801]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[ 7497.421139]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 7497.425294]  zfs_fsync+0x81/0xa0 [zfs]
[ 7497.429454]  zpl_fsync+0xe5/0x140 [zfs]
[ 7497.433396]  do_fsync+0x38/0x70
[ 7497.436878]  __x64_sys_fsync+0x10/0x20
[ 7497.440586]  do_syscall_64+0x5b/0x1b0
[ 7497.444313]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.448659] RIP: 0033:0x7f9559073027
[ 7497.452343] Code: Unable to access opcode bytes at RIP
0x7f9559072ffd.
[ 7497.457408] RSP: 002b:00007ffdefcd0ff0 EFLAGS: 00000293 ORIG_RAX:
000000000000004a
[ 7497.464724] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9559073027
[ 7497.470106] RDX: 0000000000000000 RSI: 0000563adcbf2ac0 RDI:
0000000000000005
[ 7497.475477] RBP: 00007f94fb89ca18 R08: 0000000000000000 R09:
0000000000000000
[ 7497.480806] R10: 00000000000b4cc0 R11: 0000000000000293 R12:
0000000000000003
[ 7497.486158] R13: 0000563adcbf2ac0 R14: 0000000000000000 R15:
0000563adcbf2ae8
[ 7619.459402] INFO: task z_wr_int:1098000 blocked for more than 120
seconds.
[ 7619.464605]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7619.472233] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.479659] task:z_wr_int        state:D stack:    0 pid:1098000
ppid:     2 flags:0x80004080
[ 7619.487518] Call Trace:
[ 7619.490650]  __schedule+0x2d1/0x870
[ 7619.494246]  schedule+0x55/0xf0
[ 7619.497719]  cv_wait_common+0x16d/0x280 [spl]
[ 7619.501749]  ? finish_wait+0x80/0x80
[ 7619.505411]  dmu_buf_direct_mixed_io_wait+0x73/0x190 [zfs]
[ 7619.510143]  dmu_write_direct_done+0x90/0x3a0 [zfs]
[ 7619.514603]  zio_done+0x373/0x1d40 [zfs]
[ 7619.518594]  zio_execute+0xee/0x210 [zfs]
[ 7619.522619]  taskq_thread+0x203/0x420 [spl]
[ 7619.526567]  ? wake_up_q+0x70/0x70
[ 7619.530208]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[ 7619.535302]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 7619.539385]  kthread+0x134/0x150
[ 7619.542873]  ? set_kthread_struct+0x50/0x50
[ 7619.546810]  ret_from_fork+0x35/0x40
[ 7619.550477] INFO: task txg_sync:1098025 blocked for more than 120
seconds.
[ 7619.555715]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7619.563415] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.570851] task:txg_sync        state:D stack:    0 pid:1098025
ppid:     2 flags:0x80004080
[ 7619.578606] Call Trace:
[ 7619.581742]  __schedule+0x2d1/0x870
[ 7619.585396]  schedule+0x55/0xf0
[ 7619.589006]  schedule_timeout+0x197/0x300
[ 7619.592916]  ? __next_timer_interrupt+0xf0/0xf0
[ 7619.597027]  io_schedule_timeout+0x19/0x40
[ 7619.600947]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7619.709878]  ? finish_wait+0x80/0x80
[ 7619.713565]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7619.717596]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7619.721567]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 7619.725657]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7619.730050]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 7619.734415]  spa_sync_iterate_to_convergence+0xcf/0x310 [zfs]
[ 7619.739268]  spa_sync+0x362/0x8d0 [zfs]
[ 7619.743270]  txg_sync_thread+0x274/0x3b0 [zfs]
[ 7619.747494]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 7619.751939]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 7619.756279]  thread_generic_wrapper+0x63/0x90 [spl]
[ 7619.760569]  kthread+0x134/0x150
[ 7619.764050]  ? set_kthread_struct+0x50/0x50
[ 7619.767978]  ret_from_fork+0x35/0x40
[ 7619.771639] INFO: task fio:1101750 blocked for more than 120 seconds.
[ 7619.776678]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7619.784324] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.791914] task:fio             state:D stack:    0 pid:1101750
ppid:1101741 flags:0x00004080
[ 7619.799712] Call Trace:
[ 7619.802816]  __schedule+0x2d1/0x870
[ 7619.806427]  schedule+0x55/0xf0
[ 7619.809867]  schedule_timeout+0x197/0x300
[ 7619.813760]  ? __next_timer_interrupt+0xf0/0xf0
[ 7619.817848]  io_schedule_timeout+0x19/0x40
[ 7619.821766]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7619.826097]  ? finish_wait+0x80/0x80
[ 7619.829780]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7619.833857]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7619.837838]  dmu_write_abd+0x174/0x1c0 [zfs]
[ 7619.842015]  dmu_write_uio_direct+0x79/0xf0 [zfs]
[ 7619.846388]  dmu_write_uio_dnode+0xa6/0x2d0 [zfs]
[ 7619.850760]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 7619.855011]  zfs_write+0x55f/0xea0 [zfs]
[ 7619.859008]  ? iov_iter_get_pages+0xe9/0x390
[ 7619.863036]  ? trylock_page+0xd/0x20 [zfs]
[ 7619.867084]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7619.871366]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7619.875715]  zpl_iter_write_direct+0xda/0x170 [zfs]
[ 7619.880164]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7619.884174]  zpl_iter_write+0xd5/0x110 [zfs]
[ 7619.888492]  new_sync_write+0x112/0x160
[ 7619.892285]  vfs_write+0xa5/0x1b0
[ 7619.895829]  ksys_write+0x4f/0xb0
[ 7619.899384]  do_syscall_64+0x5b/0x1b0
[ 7619.903071]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7619.907394] RIP: 0033:0x7f8771d72a17
[ 7619.911026] Code: Unable to access opcode bytes at RIP
0x7f8771d729ed.
[ 7619.916073] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[ 7619.923363] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72a17
[ 7619.928675] RDX: 000000000009b000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7619.934019] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7619.939354] R10: 0000000000000000 R11: 0000000000000293 R12:
000000000009b000
[ 7619.944775] R13: 000055b390afcac0 R14: 000000000009b000 R15:
000055b390afcae8
[ 7619.950175] INFO: task fio:1101751 blocked for more than 120 seconds.
[ 7619.955232]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7619.962889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.970301] task:fio             state:D stack:    0 pid:1101751
ppid:1101741 flags:0x00000080
[ 7619.978139] Call Trace:
[ 7619.981278]  __schedule+0x2d1/0x870
[ 7619.984872]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7619.989260]  schedule+0x55/0xf0
[ 7619.992725]  cv_wait_common+0x16d/0x280 [spl]
[ 7619.996754]  ? finish_wait+0x80/0x80
[ 7620.000414]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 7620.005050]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 7620.009617]  zfs_read+0xaf/0x3f0 [zfs]
[ 7620.013503]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7620.017489]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7620.021774]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7620.026091]  zpl_iter_read_direct+0xe0/0x180 [zfs]
[ 7620.030508]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7620.034497]  zpl_iter_read+0x94/0xb0 [zfs]
[ 7620.038579]  new_sync_read+0x10f/0x160
[ 7620.042325]  vfs_read+0x91/0x150
[ 7620.045809]  ksys_read+0x4f/0xb0
[ 7620.049273]  do_syscall_64+0x5b/0x1b0
[ 7620.052965]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7620.057354] RIP: 0033:0x7f8771d72ab4
[ 7620.060988] Code: Unable to access opcode bytes at RIP
0x7f8771d72a8a.
[ 7620.066041] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[ 7620.073256] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72ab4
[ 7620.078553] RDX: 0000000000002000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7620.083878] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7620.089353] R10: 00000001ffffffff R11: 0000000000000246 R12:
0000000000002000
[ 7620.094697] R13: 000055b390afcac0 R14: 0000000000002000 R15:
000055b390afcae8

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson committed Sep 9, 2024
1 parent 3436ba4 commit 9565350
Showing 5 changed files with 101 additions and 63 deletions.
8 changes: 8 additions & 0 deletions include/sys/dbuf.h
@@ -336,6 +336,14 @@ typedef struct dmu_buf_impl {

/* The buffer was partially read. More reads may follow. */
uint8_t db_partial_read;

/*
* This block is being held under a writer rangelock of a Direct I/O
* write that is waiting for previous buffered writes to sync out
* due to mixed buffered and O_DIRECT operations. This is needed to
* check whether to grab the rangelock in zfs_get_data().
*/
uint8_t db_mixed_io_dio_wait;
} dmu_buf_impl_t;

#define DBUF_HASH_MUTEX(h, idx) \
1 change: 1 addition & 0 deletions include/sys/dmu.h
@@ -1088,6 +1088,7 @@ typedef struct zgd {
struct blkptr *zgd_bp;
dmu_buf_t *zgd_db;
struct zfs_locked_range *zgd_lr;
boolean_t zgd_grabbed_rangelock;
void *zgd_private;
} zgd_t;

12 changes: 12 additions & 0 deletions module/zfs/dbuf.c
@@ -2804,13 +2804,24 @@ dmu_buf_direct_mixed_io_wait(dmu_buf_impl_t *db, uint64_t txg, boolean_t read)
* There must be an ARC buf associated with this Direct I/O
* write otherwise there is no reason to wait for previous
* dirty records to sync out.
*
* The db_state will temporarily be set to DB_CACHED so that
* any synchronous writes issued through the ZIL will
* still be handled properly. In particular, the call to
* dbuf_read() in dmu_sync_late_arrival() must account for the
* data still being in the ARC. After waiting here for previous
* TXGs to sync out, dmu_write_direct_done() will update the
* db_state.
*/
ASSERT3P(db->db_buf, !=, NULL);
ASSERT3U(txg, >, 0);
db->db_mixed_io_dio_wait = TRUE;
db->db_state = DB_CACHED;
while (dbuf_find_dirty_lte(db, txg) != NULL) {
DBUF_STAT_BUMP(direct_mixed_io_write_wait);
cv_wait(&db->db_changed, &db->db_mtx);
}
db->db_mixed_io_dio_wait = FALSE;
}
}

@@ -3516,6 +3527,7 @@ dbuf_create(dnode_t *dn, uint8_t level, uint64_t blkid,
db->db_user_immediate_evict = FALSE;
db->db_freed_in_flight = FALSE;
db->db_pending_evict = FALSE;
db->db_mixed_io_dio_wait = FALSE;

if (blkid == DMU_BONUS_BLKID) {
ASSERT3P(parent, ==, dn->dn_dbuf);
2 changes: 1 addition & 1 deletion module/zfs/dmu_direct.c
@@ -134,7 +134,7 @@ dmu_write_direct_done(zio_t *zio)
* The current contents of the dbuf are now stale.
*/
ASSERT3P(dr->dt.dl.dr_data, ==, NULL);
ASSERT(db->db.db_data == NULL);
ASSERT3P(db->db.db_data, ==, NULL);
db->db_state = DB_UNCACHED;
} else {
if (zio->io_flags & ZIO_FLAG_DIO_CHKSUM_ERR)
141 changes: 79 additions & 62 deletions module/zfs/zfs_vnops.c
@@ -1096,6 +1096,31 @@ zfs_get_data(void *arg, uint64_t gen, lr_write_t *lr, char *buf,
zgd->zgd_lwb = lwb;
zgd->zgd_private = zp;

dmu_buf_t *dbp;
error = dmu_buf_hold_noread(os, object, offset, zgd, &dbp);
zgd->zgd_db = dbp;
dmu_buf_impl_t *db = (dmu_buf_impl_t *)dbp;

if (error) {
zfs_get_done(zgd, error);
return (error);
}

/*
* If a Direct I/O write is waiting for previous dirty records to sync
* out in dmu_buf_direct_mixed_io_wait(), then the rangelock is already
* held across the entire block by the O_DIRECT write.
*
* The dirty record for this TXG will also be used to identify if this
* log record is associated with a Direct I/O write.
*/
mutex_enter(&db->db_mtx);
boolean_t rangelock_held = db->db_mixed_io_dio_wait;
zgd->zgd_grabbed_rangelock = !(rangelock_held);
dbuf_dirty_record_t *dr =
dbuf_find_dirty_eq(db, lr->lr_common.lrc_txg);
mutex_exit(&db->db_mtx);

/*
* Write records come in two flavors: immediate and indirect.
* For small writes it's cheaper to store the data with the
@@ -1104,8 +1129,10 @@ zfs_get_data(void *arg, uint64_t gen, lr_write_t *lr, char *buf,
* we don't have to write the data twice.
*/
if (buf != NULL) { /* immediate write */
zgd->zgd_lr = zfs_rangelock_enter(&zp->z_rangelock,
offset, size, RL_READER);
if (zgd->zgd_grabbed_rangelock) {
zgd->zgd_lr = zfs_rangelock_enter(&zp->z_rangelock,
offset, size, RL_READER);
}
/* test for truncation needs to be done while range locked */
if (offset >= zp->z_size) {
error = SET_ERROR(ENOENT);
@@ -1122,18 +1149,29 @@ zfs_get_data(void *arg, uint64_t gen, lr_write_t *lr, char *buf,
* that no one can change the data. We need to re-check
* blocksize after we get the lock in case it's changed!
*/
for (;;) {
uint64_t blkoff;
size = zp->z_blksz;
blkoff = ISP2(size) ? P2PHASE(offset, size) : offset;
offset -= blkoff;
zgd->zgd_lr = zfs_rangelock_enter(&zp->z_rangelock,
offset, size, RL_READER);
if (zp->z_blksz == size)
break;
offset += blkoff;
zfs_rangelock_exit(zgd->zgd_lr);
if (zgd->zgd_grabbed_rangelock) {
for (;;) {
uint64_t blkoff;
size = zp->z_blksz;
blkoff = ISP2(size) ? P2PHASE(offset, size) :
offset;
offset -= blkoff;
zgd->zgd_lr = zfs_rangelock_enter(
&zp->z_rangelock, offset, size, RL_READER);
if (zp->z_blksz == size)
break;
offset += blkoff;
zfs_rangelock_exit(zgd->zgd_lr);
}
ASSERT3U(dbp->db_size, ==, size);
ASSERT3U(dbp->db_offset, ==, offset);
} else {
/*
* A Direct I/O write always covers an entire block.
*/
ASSERT3U(dbp->db_size, ==, zp->z_blksz);
}

/* test for truncation needs to be done while range locked */
if (lr->lr_offset >= zp->z_size)
error = SET_ERROR(ENOENT);
@@ -1148,66 +1186,44 @@ zfs_get_data(void *arg, uint64_t gen, lr_write_t *lr, char *buf,
return (error);
}

dmu_buf_t *dbp;
error = dmu_buf_hold_noread(os, object, offset, zgd, &dbp);
if (error) {
zfs_get_done(zgd, error);
return (error);
}

zgd->zgd_db = dbp;

ASSERT3U(dbp->db_offset, ==, offset);
ASSERT3U(dbp->db_size, ==, size);

/*
* All O_DIRECT writes will have already completed and the
* All Direct I/O writes will have already completed and the
* block pointer can be immediately stored in the log record.
*/
dmu_buf_impl_t *db = (dmu_buf_impl_t *)dbp;
mutex_enter(&db->db_mtx);

dbuf_dirty_record_t *dr = dbuf_find_dirty_eq(db,
lr->lr_common.lrc_txg);

if (dr != NULL && dr->dt.dl.dr_data == NULL &&
dr->dt.dl.dr_override_state == DR_OVERRIDDEN) {
lr->lr_blkptr = dr->dt.dl.dr_overridden_by;
mutex_exit(&db->db_mtx);
zfs_get_done(zgd, 0);
return (0);
}
mutex_exit(&db->db_mtx);

if (error == 0) {
blkptr_t *bp = &lr->lr_blkptr;
zgd->zgd_bp = bp;
blkptr_t *bp = &lr->lr_blkptr;
zgd->zgd_bp = bp;

error = dmu_sync(zio, lr->lr_common.lrc_txg,
zfs_get_done, zgd);
ASSERT(error || lr->lr_length <= size);
error = dmu_sync(zio, lr->lr_common.lrc_txg,
zfs_get_done, zgd);
ASSERT(error || lr->lr_length <= size);

/*
* On success, we need to wait for the write I/O
* initiated by dmu_sync() to complete before we can
* release this dbuf. We will finish everything up
* in the zfs_get_done() callback.
*/
if (error == 0)
return (0);

if (error == EALREADY) {
lr->lr_common.lrc_txtype = TX_WRITE2;
/*
* On success, we need to wait for the write I/O
* initiated by dmu_sync() to complete before we can
* release this dbuf. We will finish everything up
* in the zfs_get_done() callback.
* TX_WRITE2 relies on the data previously
* written by the TX_WRITE that caused
* EALREADY. We zero out the BP because
* it is the old, currently-on-disk BP.
*/
if (error == 0)
return (0);

if (error == EALREADY) {
lr->lr_common.lrc_txtype = TX_WRITE2;
/*
* TX_WRITE2 relies on the data previously
* written by the TX_WRITE that caused
* EALREADY. We zero out the BP because
* it is the old, currently-on-disk BP.
*/
zgd->zgd_bp = NULL;
BP_ZERO(bp);
error = 0;
}
zgd->zgd_bp = NULL;
BP_ZERO(bp);
error = 0;
}
}

@@ -1223,10 +1239,11 @@ zfs_get_done(zgd_t *zgd, int error)
(void) error;
znode_t *zp = zgd->zgd_private;

if (zgd->zgd_db)
dmu_buf_rele(zgd->zgd_db, zgd);
ASSERT3P(zgd->zgd_db, !=, NULL);
dmu_buf_rele(zgd->zgd_db, zgd);

zfs_rangelock_exit(zgd->zgd_lr);
if (zgd->zgd_grabbed_rangelock)
zfs_rangelock_exit(zgd->zgd_lr);

/*
* Release the vnode asynchronously as we currently have the
