Fixing race condition between O_DIRECT and ZIL
There existed a race condition that was discovered through the
dio_random test. When running fio with --fsync=32, fsync is called on
the file after every 32 writes. When this happens, blocks committed to
the ZIL will be synced out. However, the code for the O_DIRECT write
was updated in 31983d2 to always wait for all previous TXGs to sync
out if there was an ARC buf associated with the dbuf.

There was an oversight with this update. While waiting on previous
TXGs to sync out, the O_DIRECT write holds the rangelock as a writer
the entire time. This causes an issue when the ZIL commit writes out a
block through `zfs_get_data()`, because it will try to grab the
rangelock as a reader. This leads to a deadlock.
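
To make the lock ordering concrete, here is a minimal user-space
analogue of the hang (a sketch only: POSIX threads stand in for the
ZFS rangelock, db_mtx, and db_changed, and the whole TXG-sync
dependency chain is collapsed into a single `dirty` flag; this is not
ZFS code and hangs by design when compiled with cc -pthread):

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t range_lock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t db_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t db_changed = PTHREAD_COND_INITIALIZER;
static int dirty = 1;

/* Analogue of the O_DIRECT writer stalled waiting on previous TXGs. */
static void *odirect_writer(void *arg)
{
	(void) arg;
	pthread_rwlock_wrlock(&range_lock);	/* rangelock held as WRITER */
	pthread_mutex_lock(&db_mtx);
	while (dirty)				/* wait for dirty records */
		pthread_cond_wait(&db_changed, &db_mtx);
	pthread_mutex_unlock(&db_mtx);
	pthread_rwlock_unlock(&range_lock);
	return (NULL);
}

/* Analogue of the ZIL commit path entering zfs_get_data(). */
static void *zil_commit_path(void *arg)
{
	(void) arg;
	pthread_rwlock_rdlock(&range_lock);	/* blocks: writer never drops it */
	pthread_mutex_lock(&db_mtx);
	dirty = 0;				/* never reached */
	pthread_cond_signal(&db_changed);
	pthread_mutex_unlock(&db_mtx);
	pthread_rwlock_unlock(&range_lock);
	return (NULL);
}

int main(void)
{
	pthread_t a, b;
	pthread_create(&a, NULL, odirect_writer, NULL);
	pthread_create(&b, NULL, zil_commit_path, NULL);
	pthread_join(a, NULL);	/* hangs: wait-while-holding deadlock */
	pthread_join(b, NULL);
	printf("never reached\n");
	return (0);
}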

In order to fix this race condition, I updated the `dmu_buf_impl_t`
struct to contain a uint8_t variable that is used to signal whether
the dbuf attached to an O_DIRECT write is stalled in this wait because
of mixed direct and buffered data. Using this new
`db_mixed_io_dio_wait` variable in the `dmu_buf_impl_t`, the code in
`zfs_get_data()` can tell that the rangelock is already held across
the entire block and that there is no need to grab the rangelock at
all. Because the rangelock is already held as a writer across the
entire block, no modifications can take place against the block as
long as the O_DIRECT write is stalled waiting in
`dmu_buf_direct_mixed_io_wait()`.
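
Condensed from the `zfs_get_data()` hunk in the diff below, the check
looks roughly like this (error handling and the immediate/indirect
write split are omitted):

mutex_enter(&db->db_mtx);
boolean_t rangelock_held = db->db_mixed_io_dio_wait;
zgd->zgd_grabbed_rangelock = !(rangelock_held);
mutex_exit(&db->db_mtx);

if (zgd->zgd_grabbed_rangelock) {
	/* Normal path: take the rangelock as a reader. */
	zgd->zgd_lr = zfs_rangelock_enter(&zp->z_rangelock,
	    offset, size, RL_READER);
}
/* Otherwise the stalled O_DIRECT write already holds it as a writer. */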

Also as part of this update, I realized the `db_state` in
`dmu_buf_direct_mixed_io_wait()` needs to be changed temporarily to
`DB_CACHED`. This is necessary so the logic in `dbuf_read()` is correct
if `dmu_sync_late_arrival()` is called by `dmu_sync()`. It is
completely valid to switch the `db_state` back to `DB_CACHED`, as
there is still an associated ARC buf that will not be freed until our
O_DIRECT write is completed, which will only happen after it leaves
`dmu_buf_direct_mixed_io_wait()`.
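
The corresponding wait loop in `dmu_buf_direct_mixed_io_wait()`,
condensed from the diff below (db_mtx is held around the loop;
assertions and kstat bumps are omitted):

db->db_mixed_io_dio_wait = TRUE;
db->db_state = DB_CACHED;	/* ARC buf is still attached and valid */
while (dbuf_find_dirty_lte(db, txg) != NULL)
	cv_wait(&db->db_changed, &db->db_mtx);
db->db_mixed_io_dio_wait = FALSE;
/* dmu_write_direct_done() updates db_state after the wait. */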

Here is the stack trace of the deadlock that happened with
`dio_random.ksh` before this commit:
[ 5513.663402] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 7496.580415] INFO: task z_wr_int:1098000 blocked for more than 120
seconds.
[ 7496.585709]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7496.593349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7496.600839] task:z_wr_int        state:D stack:    0 pid:1098000
ppid:     2 flags:0x80004080
[ 7496.608622] Call Trace:
[ 7496.611770]  __schedule+0x2d1/0x870
[ 7496.615404]  schedule+0x55/0xf0
[ 7496.618866]  cv_wait_common+0x16d/0x280 [spl]
[ 7496.622910]  ? finish_wait+0x80/0x80
[ 7496.626601]  dmu_buf_direct_mixed_io_wait+0x73/0x190 [zfs]
[ 7496.631327]  dmu_write_direct_done+0x90/0x3a0 [zfs]
[ 7496.635798]  zio_done+0x373/0x1d40 [zfs]
[ 7496.639795]  zio_execute+0xee/0x210 [zfs]
[ 7496.643840]  taskq_thread+0x203/0x420 [spl]
[ 7496.647836]  ? wake_up_q+0x70/0x70
[ 7496.651411]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[ 7496.656489]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 7496.660604]  kthread+0x134/0x150
[ 7496.664092]  ? set_kthread_struct+0x50/0x50
[ 7496.668080]  ret_from_fork+0x35/0x40
[ 7496.671745] INFO: task txg_sync:1098025 blocked for more than 120
seconds.
[ 7496.676991]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7496.684666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7496.692060] task:txg_sync        state:D stack:    0 pid:1098025
ppid:     2 flags:0x80004080
[ 7496.699888] Call Trace:
[ 7496.703012]  __schedule+0x2d1/0x870
[ 7496.706658]  schedule+0x55/0xf0
[ 7496.710093]  schedule_timeout+0x197/0x300
[ 7496.713982]  ? __next_timer_interrupt+0xf0/0xf0
[ 7496.718135]  io_schedule_timeout+0x19/0x40
[ 7496.722049]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7496.726349]  ? finish_wait+0x80/0x80
[ 7496.730039]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7496.734100]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7496.738082]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 7496.742205]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7496.746534]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 7496.750842]  spa_sync_iterate_to_convergence+0xcf/0x310 [zfs]
[ 7496.755742]  spa_sync+0x362/0x8d0 [zfs]
[ 7496.759689]  txg_sync_thread+0x274/0x3b0 [zfs]
[ 7496.763928]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 7496.768439]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 7496.772799]  thread_generic_wrapper+0x63/0x90 [spl]
[ 7496.777097]  kthread+0x134/0x150
[ 7496.780616]  ? set_kthread_struct+0x50/0x50
[ 7496.784549]  ret_from_fork+0x35/0x40
[ 7496.788204] INFO: task fio:1101750 blocked for more than 120 seconds.
[ 7496.895852]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7496.903765] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7496.911170] task:fio             state:D stack:    0 pid:1101750
ppid:1101741 flags:0x00004080
[ 7496.919033] Call Trace:
[ 7496.922136]  __schedule+0x2d1/0x870
[ 7496.925769]  schedule+0x55/0xf0
[ 7496.929245]  schedule_timeout+0x197/0x300
[ 7496.933120]  ? __next_timer_interrupt+0xf0/0xf0
[ 7496.937213]  io_schedule_timeout+0x19/0x40
[ 7496.941126]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7496.945444]  ? finish_wait+0x80/0x80
[ 7496.949125]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7496.953191]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7496.957180]  dmu_write_abd+0x174/0x1c0 [zfs]
[ 7496.961319]  dmu_write_uio_direct+0x79/0xf0 [zfs]
[ 7496.965731]  dmu_write_uio_dnode+0xa6/0x2d0 [zfs]
[ 7496.970043]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 7496.974305]  zfs_write+0x55f/0xea0 [zfs]
[ 7496.978325]  ? iov_iter_get_pages+0xe9/0x390
[ 7496.982333]  ? trylock_page+0xd/0x20 [zfs]
[ 7496.986451]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7496.990713]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7496.995031]  zpl_iter_write_direct+0xda/0x170 [zfs]
[ 7496.999489]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.003476]  zpl_iter_write+0xd5/0x110 [zfs]
[ 7497.007610]  new_sync_write+0x112/0x160
[ 7497.011429]  vfs_write+0xa5/0x1b0
[ 7497.014916]  ksys_write+0x4f/0xb0
[ 7497.018443]  do_syscall_64+0x5b/0x1b0
[ 7497.022150]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.026532] RIP: 0033:0x7f8771d72a17
[ 7497.030195] Code: Unable to access opcode bytes at RIP
0x7f8771d729ed.
[ 7497.035263] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[ 7497.042547] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72a17
[ 7497.047933] RDX: 000000000009b000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7497.053269] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7497.058660] R10: 0000000000000000 R11: 0000000000000293 R12:
000000000009b000
[ 7497.063960] R13: 000055b390afcac0 R14: 000000000009b000 R15:
000055b390afcae8
[ 7497.069334] INFO: task fio:1101751 blocked for more than 120 seconds.
[ 7497.074308]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7497.081973] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7497.089371] task:fio             state:D stack:    0 pid:1101751
ppid:1101741 flags:0x00000080
[ 7497.097147] Call Trace:
[ 7497.100263]  __schedule+0x2d1/0x870
[ 7497.103897]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.107878]  schedule+0x55/0xf0
[ 7497.111386]  cv_wait_common+0x16d/0x280 [spl]
[ 7497.115391]  ? finish_wait+0x80/0x80
[ 7497.119028]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 7497.123667]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 7497.128240]  zfs_read+0xaf/0x3f0 [zfs]
[ 7497.132146]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.136091]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7497.140366]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7497.144679]  zpl_iter_read_direct+0xe0/0x180 [zfs]
[ 7497.149054]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.153040]  zpl_iter_read+0x94/0xb0 [zfs]
[ 7497.157103]  new_sync_read+0x10f/0x160
[ 7497.160855]  vfs_read+0x91/0x150
[ 7497.164336]  ksys_read+0x4f/0xb0
[ 7497.168004]  do_syscall_64+0x5b/0x1b0
[ 7497.171706]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.176105] RIP: 0033:0x7f8771d72ab4
[ 7497.179742] Code: Unable to access opcode bytes at RIP
0x7f8771d72a8a.
[ 7497.184807] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[ 7497.192129] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72ab4
[ 7497.197485] RDX: 0000000000002000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7497.202922] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7497.208309] R10: 00000001ffffffff R11: 0000000000000246 R12:
0000000000002000
[ 7497.213694] R13: 000055b390afcac0 R14: 0000000000002000 R15:
000055b390afcae8
[ 7497.219063] INFO: task fio:1101755 blocked for more than 120 seconds.
[ 7497.224098]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7497.231786] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7497.239165] task:fio             state:D stack:    0 pid:1101755
ppid:1101744 flags:0x00000080
[ 7497.246989] Call Trace:
[ 7497.250121]  __schedule+0x2d1/0x870
[ 7497.253779]  schedule+0x55/0xf0
[ 7497.257240]  schedule_preempt_disabled+0xa/0x10
[ 7497.261344]  __mutex_lock.isra.7+0x349/0x420
[ 7497.265326]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7497.269674]  zil_commit_writer+0x89/0x230 [zfs]
[ 7497.273938]  zil_commit_impl+0x5f/0xd0 [zfs]
[ 7497.278101]  zfs_fsync+0x81/0xa0 [zfs]
[ 7497.282002]  zpl_fsync+0xe5/0x140 [zfs]
[ 7497.285985]  do_fsync+0x38/0x70
[ 7497.289458]  __x64_sys_fsync+0x10/0x20
[ 7497.293208]  do_syscall_64+0x5b/0x1b0
[ 7497.296928]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.301260] RIP: 0033:0x7f9559073027
[ 7497.304920] Code: Unable to access opcode bytes at RIP
0x7f9559072ffd.
[ 7497.310015] RSP: 002b:00007ffdefcd0ff0 EFLAGS: 00000293 ORIG_RAX:
000000000000004a
[ 7497.317346] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9559073027
[ 7497.322722] RDX: 0000000000000000 RSI: 0000563adcbf2ac0 RDI:
0000000000000005
[ 7497.328126] RBP: 00007f94fb858000 R08: 0000000000000000 R09:
0000000000000000
[ 7497.333514] R10: 0000000000008000 R11: 0000000000000293 R12:
0000000000000003
[ 7497.338887] R13: 0000563adcbf2ac0 R14: 0000000000000000 R15:
0000563adcbf2ae8
[ 7497.344247] INFO: task fio:1101756 blocked for more than 120 seconds.
[ 7497.349327]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7497.357032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7497.364517] task:fio             state:D stack:    0 pid:1101756
ppid:1101744 flags:0x00004080
[ 7497.372310] Call Trace:
[ 7497.375433]  __schedule+0x2d1/0x870
[ 7497.379004]  schedule+0x55/0xf0
[ 7497.382454]  cv_wait_common+0x16d/0x280 [spl]
[ 7497.386477]  ? finish_wait+0x80/0x80
[ 7497.390137]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 7497.394816]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 7497.399397]  zfs_get_data+0x1a8/0x7e0 [zfs]
[ 7497.403515]  zil_lwb_commit+0x1a5/0x400 [zfs]
[ 7497.407712]  zil_lwb_write_close+0x408/0x630 [zfs]
[ 7497.412126]  zil_commit_waiter_timeout+0x16d/0x520 [zfs]
[ 7497.416801]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[ 7497.421139]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 7497.425294]  zfs_fsync+0x81/0xa0 [zfs]
[ 7497.429454]  zpl_fsync+0xe5/0x140 [zfs]
[ 7497.433396]  do_fsync+0x38/0x70
[ 7497.436878]  __x64_sys_fsync+0x10/0x20
[ 7497.440586]  do_syscall_64+0x5b/0x1b0
[ 7497.444313]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.448659] RIP: 0033:0x7f9559073027
[ 7497.452343] Code: Unable to access opcode bytes at RIP
0x7f9559072ffd.
[ 7497.457408] RSP: 002b:00007ffdefcd0ff0 EFLAGS: 00000293 ORIG_RAX:
000000000000004a
[ 7497.464724] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9559073027
[ 7497.470106] RDX: 0000000000000000 RSI: 0000563adcbf2ac0 RDI:
0000000000000005
[ 7497.475477] RBP: 00007f94fb89ca18 R08: 0000000000000000 R09:
0000000000000000
[ 7497.480806] R10: 00000000000b4cc0 R11: 0000000000000293 R12:
0000000000000003
[ 7497.486158] R13: 0000563adcbf2ac0 R14: 0000000000000000 R15:
0000563adcbf2ae8
[ 7619.459402] INFO: task z_wr_int:1098000 blocked for more than 120
seconds.
[ 7619.464605]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7619.472233] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.479659] task:z_wr_int        state:D stack:    0 pid:1098000
ppid:     2 flags:0x80004080
[ 7619.487518] Call Trace:
[ 7619.490650]  __schedule+0x2d1/0x870
[ 7619.494246]  schedule+0x55/0xf0
[ 7619.497719]  cv_wait_common+0x16d/0x280 [spl]
[ 7619.501749]  ? finish_wait+0x80/0x80
[ 7619.505411]  dmu_buf_direct_mixed_io_wait+0x73/0x190 [zfs]
[ 7619.510143]  dmu_write_direct_done+0x90/0x3a0 [zfs]
[ 7619.514603]  zio_done+0x373/0x1d40 [zfs]
[ 7619.518594]  zio_execute+0xee/0x210 [zfs]
[ 7619.522619]  taskq_thread+0x203/0x420 [spl]
[ 7619.526567]  ? wake_up_q+0x70/0x70
[ 7619.530208]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[ 7619.535302]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 7619.539385]  kthread+0x134/0x150
[ 7619.542873]  ? set_kthread_struct+0x50/0x50
[ 7619.546810]  ret_from_fork+0x35/0x40
[ 7619.550477] INFO: task txg_sync:1098025 blocked for more than 120
seconds.
[ 7619.555715]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7619.563415] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.570851] task:txg_sync        state:D stack:    0 pid:1098025
ppid:     2 flags:0x80004080
[ 7619.578606] Call Trace:
[ 7619.581742]  __schedule+0x2d1/0x870
[ 7619.585396]  schedule+0x55/0xf0
[ 7619.589006]  schedule_timeout+0x197/0x300
[ 7619.592916]  ? __next_timer_interrupt+0xf0/0xf0
[ 7619.597027]  io_schedule_timeout+0x19/0x40
[ 7619.600947]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7619.709878]  ? finish_wait+0x80/0x80
[ 7619.713565]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7619.717596]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7619.721567]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 7619.725657]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7619.730050]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 7619.734415]  spa_sync_iterate_to_convergence+0xcf/0x310 [zfs]
[ 7619.739268]  spa_sync+0x362/0x8d0 [zfs]
[ 7619.743270]  txg_sync_thread+0x274/0x3b0 [zfs]
[ 7619.747494]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 7619.751939]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 7619.756279]  thread_generic_wrapper+0x63/0x90 [spl]
[ 7619.760569]  kthread+0x134/0x150
[ 7619.764050]  ? set_kthread_struct+0x50/0x50
[ 7619.767978]  ret_from_fork+0x35/0x40
[ 7619.771639] INFO: task fio:1101750 blocked for more than 120 seconds.
[ 7619.776678]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7619.784324] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.791914] task:fio             state:D stack:    0 pid:1101750
ppid:1101741 flags:0x00004080
[ 7619.799712] Call Trace:
[ 7619.802816]  __schedule+0x2d1/0x870
[ 7619.806427]  schedule+0x55/0xf0
[ 7619.809867]  schedule_timeout+0x197/0x300
[ 7619.813760]  ? __next_timer_interrupt+0xf0/0xf0
[ 7619.817848]  io_schedule_timeout+0x19/0x40
[ 7619.821766]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7619.826097]  ? finish_wait+0x80/0x80
[ 7619.829780]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7619.833857]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7619.837838]  dmu_write_abd+0x174/0x1c0 [zfs]
[ 7619.842015]  dmu_write_uio_direct+0x79/0xf0 [zfs]
[ 7619.846388]  dmu_write_uio_dnode+0xa6/0x2d0 [zfs]
[ 7619.850760]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 7619.855011]  zfs_write+0x55f/0xea0 [zfs]
[ 7619.859008]  ? iov_iter_get_pages+0xe9/0x390
[ 7619.863036]  ? trylock_page+0xd/0x20 [zfs]
[ 7619.867084]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7619.871366]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7619.875715]  zpl_iter_write_direct+0xda/0x170 [zfs]
[ 7619.880164]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7619.884174]  zpl_iter_write+0xd5/0x110 [zfs]
[ 7619.888492]  new_sync_write+0x112/0x160
[ 7619.892285]  vfs_write+0xa5/0x1b0
[ 7619.895829]  ksys_write+0x4f/0xb0
[ 7619.899384]  do_syscall_64+0x5b/0x1b0
[ 7619.903071]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7619.907394] RIP: 0033:0x7f8771d72a17
[ 7619.911026] Code: Unable to access opcode bytes at RIP
0x7f8771d729ed.
[ 7619.916073] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[ 7619.923363] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72a17
[ 7619.928675] RDX: 000000000009b000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7619.934019] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7619.939354] R10: 0000000000000000 R11: 0000000000000293 R12:
000000000009b000
[ 7619.944775] R13: 000055b390afcac0 R14: 000000000009b000 R15:
000055b390afcae8
[ 7619.950175] INFO: task fio:1101751 blocked for more than 120 seconds.
[ 7619.955232]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 #1
[ 7619.962889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.970301] task:fio             state:D stack:    0 pid:1101751
ppid:1101741 flags:0x00000080
[ 7619.978139] Call Trace:
[ 7619.981278]  __schedule+0x2d1/0x870
[ 7619.984872]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7619.989260]  schedule+0x55/0xf0
[ 7619.992725]  cv_wait_common+0x16d/0x280 [spl]
[ 7619.996754]  ? finish_wait+0x80/0x80
[ 7620.000414]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 7620.005050]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 7620.009617]  zfs_read+0xaf/0x3f0 [zfs]
[ 7620.013503]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7620.017489]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7620.021774]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7620.026091]  zpl_iter_read_direct+0xe0/0x180 [zfs]
[ 7620.030508]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7620.034497]  zpl_iter_read+0x94/0xb0 [zfs]
[ 7620.038579]  new_sync_read+0x10f/0x160
[ 7620.042325]  vfs_read+0x91/0x150
[ 7620.045809]  ksys_read+0x4f/0xb0
[ 7620.049273]  do_syscall_64+0x5b/0x1b0
[ 7620.052965]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7620.057354] RIP: 0033:0x7f8771d72ab4
[ 7620.060988] Code: Unable to access opcode bytes at RIP
0x7f8771d72a8a.
[ 7620.066041] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[ 7620.073256] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72ab4
[ 7620.078553] RDX: 0000000000002000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7620.083878] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7620.089353] R10: 00000001ffffffff R11: 0000000000000246 R12:
0000000000002000
[ 7620.094697] R13: 000055b390afcac0 R14: 0000000000002000 R15:
000055b390afcae8

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson committed Sep 9, 2024
1 parent 3436ba4 commit 9565350
Showing 5 changed files with 101 additions and 63 deletions.
8 changes: 8 additions & 0 deletions include/sys/dbuf.h
@@ -336,6 +336,14 @@ typedef struct dmu_buf_impl {

/* The buffer was partially read. More reads may follow. */
uint8_t db_partial_read;

/*
* This block is being held under a writer rangelock of a Direct I/O
* write that is waiting for previous buffered writes to sync out
* due to mixed buffered and O_DIRECT operations. This is needed to
* check whether to grab the rangelock in zfs_get_data().
*/
uint8_t db_mixed_io_dio_wait;
} dmu_buf_impl_t;

#define DBUF_HASH_MUTEX(h, idx) \
1 change: 1 addition & 0 deletions include/sys/dmu.h
@@ -1088,6 +1088,7 @@ typedef struct zgd {
struct blkptr *zgd_bp;
dmu_buf_t *zgd_db;
struct zfs_locked_range *zgd_lr;
boolean_t zgd_grabbed_rangelock;
void *zgd_private;
} zgd_t;

12 changes: 12 additions & 0 deletions module/zfs/dbuf.c
@@ -2804,13 +2804,24 @@ dmu_buf_direct_mixed_io_wait(dmu_buf_impl_t *db, uint64_t txg, boolean_t read)
* There must be an ARC buf associated with this Direct I/O
* write otherwise there is no reason to wait for previous
* dirty records to sync out.
*
* The db_state will temporarily be set to DB_CACHED so that
* any synchronous writes issued through the ZIL will
* still be handled properly. In particular, the call to
* dbuf_read() in dmu_sync_late_arrival() must account for the
* data still being in the ARC. After waiting here for previous
* TXGs to sync out, dmu_write_direct_done() will update the
* db_state.
*/
ASSERT3P(db->db_buf, !=, NULL);
ASSERT3U(txg, >, 0);
db->db_mixed_io_dio_wait = TRUE;
db->db_state = DB_CACHED;
while (dbuf_find_dirty_lte(db, txg) != NULL) {
DBUF_STAT_BUMP(direct_mixed_io_write_wait);
cv_wait(&db->db_changed, &db->db_mtx);
}
db->db_mixed_io_dio_wait = FALSE;
}
}

@@ -3516,6 +3527,7 @@ dbuf_create(dnode_t *dn, uint8_t level, uint64_t blkid,
db->db_user_immediate_evict = FALSE;
db->db_freed_in_flight = FALSE;
db->db_pending_evict = FALSE;
db->db_mixed_io_dio_wait = FALSE;

if (blkid == DMU_BONUS_BLKID) {
ASSERT3P(parent, ==, dn->dn_dbuf);
2 changes: 1 addition & 1 deletion module/zfs/dmu_direct.c
@@ -134,7 +134,7 @@ dmu_write_direct_done(zio_t *zio)
* The current contents of the dbuf are now stale.
*/
ASSERT3P(dr->dt.dl.dr_data, ==, NULL);
ASSERT(db->db.db_data == NULL);
ASSERT3P(db->db.db_data, ==, NULL);
db->db_state = DB_UNCACHED;
} else {
if (zio->io_flags & ZIO_FLAG_DIO_CHKSUM_ERR)
141 changes: 79 additions & 62 deletions module/zfs/zfs_vnops.c
@@ -1096,6 +1096,31 @@ zfs_get_data(void *arg, uint64_t gen, lr_write_t *lr, char *buf,
zgd->zgd_lwb = lwb;
zgd->zgd_private = zp;

dmu_buf_t *dbp;
error = dmu_buf_hold_noread(os, object, offset, zgd, &dbp);
zgd->zgd_db = dbp;
dmu_buf_impl_t *db = (dmu_buf_impl_t *)dbp;

if (error) {
zfs_get_done(zgd, error);
return (error);
}

/*
* If a Direct I/O write is waiting for previous dirty records to sync
* out in dmu_buf_direct_mixed_io_wait(), then the rangelock is already
* held across the entire block by the O_DIRECT write.
*
* The dirty record for this TXG will also be used to identify if this
* log record is associated with a Direct I/O write.
*/
mutex_enter(&db->db_mtx);
boolean_t rangelock_held = db->db_mixed_io_dio_wait;
zgd->zgd_grabbed_rangelock = !(rangelock_held);
dbuf_dirty_record_t *dr =
dbuf_find_dirty_eq(db, lr->lr_common.lrc_txg);
mutex_exit(&db->db_mtx);

/*
* Write records come in two flavors: immediate and indirect.
* For small writes it's cheaper to store the data with the
@@ -1104,8 +1129,10 @@ zfs_get_data(void *arg, uint64_t gen, lr_write_t *lr, char *buf,
* we don't have to write the data twice.
*/
if (buf != NULL) { /* immediate write */
zgd->zgd_lr = zfs_rangelock_enter(&zp->z_rangelock,
offset, size, RL_READER);
if (zgd->zgd_grabbed_rangelock) {
zgd->zgd_lr = zfs_rangelock_enter(&zp->z_rangelock,
offset, size, RL_READER);
}
/* test for truncation needs to be done while range locked */
if (offset >= zp->z_size) {
error = SET_ERROR(ENOENT);
@@ -1122,18 +1149,29 @@ zfs_get_data(void *arg, uint64_t gen, lr_write_t *lr, char *buf,
* that no one can change the data. We need to re-check
* blocksize after we get the lock in case it's changed!
*/
for (;;) {
uint64_t blkoff;
size = zp->z_blksz;
blkoff = ISP2(size) ? P2PHASE(offset, size) : offset;
offset -= blkoff;
zgd->zgd_lr = zfs_rangelock_enter(&zp->z_rangelock,
offset, size, RL_READER);
if (zp->z_blksz == size)
break;
offset += blkoff;
zfs_rangelock_exit(zgd->zgd_lr);
if (zgd->zgd_grabbed_rangelock) {
for (;;) {
uint64_t blkoff;
size = zp->z_blksz;
blkoff = ISP2(size) ? P2PHASE(offset, size) :
offset;
offset -= blkoff;
zgd->zgd_lr = zfs_rangelock_enter(
&zp->z_rangelock, offset, size, RL_READER);
if (zp->z_blksz == size)
break;
offset += blkoff;
zfs_rangelock_exit(zgd->zgd_lr);
}
ASSERT3U(dbp->db_size, ==, size);
ASSERT3U(dbp->db_offset, ==, offset);
} else {
/*
* A Direct I/O write always covers an entire block.
*/
ASSERT3U(dbp->db_size, ==, zp->z_blksz);
}

/* test for truncation needs to be done while range locked */
if (lr->lr_offset >= zp->z_size)
error = SET_ERROR(ENOENT);
@@ -1148,66 +1186,44 @@ zfs_get_data(void *arg, uint64_t gen, lr_write_t *lr, char *buf,
return (error);
}

dmu_buf_t *dbp;
error = dmu_buf_hold_noread(os, object, offset, zgd, &dbp);
if (error) {
zfs_get_done(zgd, error);
return (error);
}

zgd->zgd_db = dbp;

ASSERT3U(dbp->db_offset, ==, offset);
ASSERT3U(dbp->db_size, ==, size);

/*
* All O_DIRECT writes will have already completed and the
* All Direct I/O writes will have already completed and the
* block pointer can be immediately stored in the log record.
*/
dmu_buf_impl_t *db = (dmu_buf_impl_t *)dbp;
mutex_enter(&db->db_mtx);

dbuf_dirty_record_t *dr = dbuf_find_dirty_eq(db,
lr->lr_common.lrc_txg);

if (dr != NULL && dr->dt.dl.dr_data == NULL &&
dr->dt.dl.dr_override_state == DR_OVERRIDDEN) {
lr->lr_blkptr = dr->dt.dl.dr_overridden_by;
mutex_exit(&db->db_mtx);
zfs_get_done(zgd, 0);
return (0);
}
mutex_exit(&db->db_mtx);

if (error == 0) {
blkptr_t *bp = &lr->lr_blkptr;
zgd->zgd_bp = bp;
blkptr_t *bp = &lr->lr_blkptr;
zgd->zgd_bp = bp;

error = dmu_sync(zio, lr->lr_common.lrc_txg,
zfs_get_done, zgd);
ASSERT(error || lr->lr_length <= size);
error = dmu_sync(zio, lr->lr_common.lrc_txg,
zfs_get_done, zgd);
ASSERT(error || lr->lr_length <= size);

/*
* On success, we need to wait for the write I/O
* initiated by dmu_sync() to complete before we can
* release this dbuf. We will finish everything up
* in the zfs_get_done() callback.
*/
if (error == 0)
return (0);

if (error == EALREADY) {
lr->lr_common.lrc_txtype = TX_WRITE2;
/*
* On success, we need to wait for the write I/O
* initiated by dmu_sync() to complete before we can
* release this dbuf. We will finish everything up
* in the zfs_get_done() callback.
* TX_WRITE2 relies on the data previously
* written by the TX_WRITE that caused
* EALREADY. We zero out the BP because
* it is the old, currently-on-disk BP.
*/
if (error == 0)
return (0);

if (error == EALREADY) {
lr->lr_common.lrc_txtype = TX_WRITE2;
/*
* TX_WRITE2 relies on the data previously
* written by the TX_WRITE that caused
* EALREADY. We zero out the BP because
* it is the old, currently-on-disk BP.
*/
zgd->zgd_bp = NULL;
BP_ZERO(bp);
error = 0;
}
zgd->zgd_bp = NULL;
BP_ZERO(bp);
error = 0;
}
}

@@ -1223,10 +1239,11 @@ zfs_get_done(zgd_t *zgd, int error)
(void) error;
znode_t *zp = zgd->zgd_private;

if (zgd->zgd_db)
dmu_buf_rele(zgd->zgd_db, zgd);
ASSERT3P(zgd->zgd_db, !=, NULL);
dmu_buf_rele(zgd->zgd_db, zgd);

zfs_rangelock_exit(zgd->zgd_lr);
if (zgd->zgd_grabbed_rangelock)
zfs_rangelock_exit(zgd->zgd_lr);

/*
* Release the vnode asynchronously as we currently have the
