Fixing race condition with rangelocks
There existed a race condition between a Direct I/O write completing and a
sync operation being issued against the same block. A Direct I/O write would
sleep waiting on previous TXGs to sync out their dirty records associated
with a dbuf whenever an ARC buffer was associated with that dbuf. This was
necessary to safely destroy the ARC buffer, because a previous dirty
record's dr_data may have pointed at the dbuf's db_buf. The main issue with
this approach is that a Direct I/O write holds the rangelock across the
entire block, so when a sync on that same block was issued and tried to
grab the rangelock as reader, it was blocked indefinitely: the Direct I/O
write that was now sleeping held that same rangelock as writer. This led to
a complete deadlock.
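
A simplified view of the old ordering (paraphrased for illustration only;
the function names are the ones that appear in the stack traces below, not
quoted source):

    Direct I/O write path (z_wr_int in the traces):
        holds the block's rangelock as writer
        dmu_write_direct_done()
            dmu_buf_direct_mixed_io_wait()  <- sleeps until older TXGs have
                                               synced their dirty records

    Sync path for the same block (fio -> fsync -> zil_commit):
        zfs_get_data()
            zfs_rangelock_enter(..., reader)  <- blocks behind the writer

    The sleeping Direct I/O write never drops the rangelock, so the sync
    path can never proceed, and everything behind it (zil_commit waiters,
    txg_sync) stalls as shown in the traces below.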

This commit fixes this issue and removes the wait in
dmu_write_direct_done().

This is now handled by destroying the ARC buffer, if one is associated with
the dbuf, before ever issuing the Direct I/O write. This implementation
heavily borrows from the block cloning implementation.

A new function, dmu_buf_will_clone_or_dio(), is called in both
dmu_write_direct() and dmu_brt_clone() and does the following (a condensed
sketch is given below):
1. Undirties the dirty record for the dbuf, if one is currently associated
   with the current TXG.
2. Destroys the ARC buffer if the previous dirty record's dr_data does not
   point at the dbuf's ARC buffer (db_buf).
3. Sets the dbuf's data pointers to NULL.
4. Redirties the dbuf using db_state = DB_NOFILL.
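
For reference, a condensed sketch of those four steps, simplified from the
dmu_buf_will_clone_or_dio() hunk in module/zfs/dbuf.c shown further below.
Assertions, DBUF_VERIFY(), and DTRACE state tracking are omitted, the
dirty-record lookup is schematic, and the final redirty happens outside the
hunks visible here:

    void
    dmu_buf_will_clone_or_dio(dmu_buf_t *db_fake, dmu_tx_t *tx)
    {
            dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;

            mutex_enter(&db->db_mtx);

            /* (1) Undirty any open-context dirty record for this TXG. */
            VERIFY3B(dbuf_undirty(db, tx), ==, B_FALSE);

            if (db->db_buf != NULL) {
                    /*
                     * (2) Destroy the ARC buffer unless an older dirty
                     * record's dr_data still points at it.
                     */
                    dbuf_dirty_record_t *dr =
                        list_head(&db->db_dirty_records);
                    if (dr == NULL || dr->dt.dl.dr_data != db->db_buf)
                            arc_buf_destroy(db->db_buf, db);

                    /* (3) Clear the dbuf's data pointers. */
                    db->db_buf = NULL;
                    dbuf_clear_data(db);
            }

            /* (4) The dbuf is then redirtied with db_state = DB_NOFILL. */
            db->db_state = DB_NOFILL;
            mutex_exit(&db->db_mtx);
    }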

As part of this commit, the dmu_write_direct_done() function was also
cleaned up. dmu_sync_done() is now called before undirtying the dbuf dirty
record associated with a failed Direct I/O write. This is the correct
ordering, and how it always should have been.
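
A minimal, illustrative sketch of that error-path ordering follows.
dmu_write_direct_done() itself is not among the hunks shown below, so the
variable names and surrounding details here are placeholders rather than
the real ones:

    if (zio->io_error != 0) {
            /* Let dmu_sync_done() handle the failed sync first ... */
            dmu_sync_done(zio, NULL, dsa);

            /* ... and only then undirty the open-context dirty record. */
            mutex_enter(&db->db_mtx);
            VERIFY3B(dbuf_undirty(db, tx), ==, B_FALSE);
            mutex_exit(&db->db_mtx);
    }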

An additional benefit of these modifications is that a Direct I/O write no
longer stalls when the user mixes buffered and O_DIRECT operations. The
changes also unify the block cloning and Direct I/O write paths, as both
need to call dbuf_fix_old_data() before destroying the ARC buffer.

This commit also includes general code cleanup. Various dbuf stats were
removed because they are no longer necessary, and now-unused functions were
removed to keep the Direct I/O code paths cleaner.

Below is the race-condition stack trace that was consistently observed in
the CI runs for the dio_random test case and that prompted these changes:
[ 7795.294473] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 9954.769075] INFO: task z_wr_int:1051869 blocked for more than 120 seconds.
[ 9954.770512]       Tainted: P           OE     -------- -  -  4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.772159] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.773848] task:z_wr_int        state:D stack:0 pid:1051869 ppid:2      flags:0x80004080
[ 9954.775512] Call Trace:
[ 9954.776406]  __schedule+0x2d1/0x870
[ 9954.777386]  ? free_one_page+0x204/0x530
[ 9954.778466]  schedule+0x55/0xf0
[ 9954.779355]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.780491]  ? finish_wait+0x80/0x80
[ 9954.781450]  dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs]
[ 9954.782889]  dmu_write_direct_done+0x90/0x3b0 [zfs]
[ 9954.784255]  zio_done+0x373/0x1d50 [zfs]
[ 9954.785410]  zio_execute+0xee/0x210 [zfs]
[ 9954.786588]  taskq_thread+0x205/0x3f0 [spl]
[ 9954.787673]  ? wake_up_q+0x60/0x60
[ 9954.788571]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[ 9954.790079]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 9954.791199]  kthread+0x134/0x150
[ 9954.792082]  ? set_kthread_struct+0x50/0x50
[ 9954.793189]  ret_from_fork+0x35/0x40
[ 9954.794108] INFO: task txg_sync:1051894 blocked for more than 120 seconds.
[ 9954.795535]       Tainted: P           OE     -------- -  -  4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.797103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.798669] task:txg_sync        state:D stack:0 pid:1051894 ppid:2      flags:0x80004080
[ 9954.800267] Call Trace:
[ 9954.801096]  __schedule+0x2d1/0x870
[ 9954.801972]  ? __wake_up_common+0x7a/0x190
[ 9954.802963]  schedule+0x55/0xf0
[ 9954.803884]  schedule_timeout+0x19f/0x320
[ 9954.804837]  ? __next_timer_interrupt+0xf0/0xf0
[ 9954.805932]  ? taskq_dispatch+0xab/0x280 [spl]
[ 9954.806959]  io_schedule_timeout+0x19/0x40
[ 9954.807989]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 9954.809110]  ? finish_wait+0x80/0x80
[ 9954.810068]  __cv_timedwait_io+0x15/0x20 [spl]
[ 9954.811103]  zio_wait+0x1ad/0x4f0 [zfs]
[ 9954.812255]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 9954.813442]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 9954.814648]  spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
[ 9954.816023]  spa_sync+0x362/0x8f0 [zfs]
[ 9954.817110]  txg_sync_thread+0x27a/0x3b0 [zfs]
[ 9954.818267]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 9954.819510]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 9954.820643]  thread_generic_wrapper+0x63/0x90 [spl]
[ 9954.821709]  kthread+0x134/0x150
[ 9954.822590]  ? set_kthread_struct+0x50/0x50
[ 9954.823584]  ret_from_fork+0x35/0x40
[ 9954.824444] INFO: task fio:1055501 blocked for more than 120 seconds.
[ 9954.825781]       Tainted: P           OE     -------- -  -  4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.827315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.828871] task:fio             state:D stack:0 pid:1055501 ppid:1055490 flags:0x00004080
[ 9954.830463] Call Trace:
[ 9954.831280]  __schedule+0x2d1/0x870
[ 9954.832159]  ? dbuf_hold_copy+0xec/0x230 [zfs]
[ 9954.833396]  schedule+0x55/0xf0
[ 9954.834286]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.835291]  ? finish_wait+0x80/0x80
[ 9954.836235]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 9954.837543]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 9954.838838]  zfs_get_data+0x566/0x810 [zfs]
[ 9954.840034]  zil_lwb_commit+0x194/0x3f0 [zfs]
[ 9954.841154]  zil_lwb_write_issue+0x68/0xb90 [zfs]
[ 9954.842367]  ? __list_add+0x12/0x30 [zfs]
[ 9954.843496]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 9954.844665]  ? zil_alloc_lwb+0x217/0x360 [zfs]
[ 9954.845852]  zil_commit_waiter_timeout+0x1f3/0x570 [zfs]
[ 9954.847203]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[ 9954.848380]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 9954.849550]  zfs_fsync+0x66/0x90 [zfs]
[ 9954.850640]  zpl_fsync+0xe5/0x140 [zfs]
[ 9954.851729]  do_fsync+0x38/0x70
[ 9954.852585]  __x64_sys_fsync+0x10/0x20
[ 9954.853486]  do_syscall_64+0x5b/0x1b0
[ 9954.854416]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.855466] RIP: 0033:0x7eff236bb057
[ 9954.856388] Code: Unable to access opcode bytes at RIP 0x7eff236bb02d.
[ 9954.857651] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[ 9954.859141] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007eff236bb057
[ 9954.860496] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI: 0000000000000006
[ 9954.861945] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09: 0000000000000000
[ 9954.863327] R10: 0000000000056000 R11: 0000000000000293 R12: 0000000000000003
[ 9954.864765] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15: 000055e4d1f13ae8
[ 9954.866149] INFO: task fio:1055502 blocked for more than 120 seconds.
[ 9954.867490]       Tainted: P           OE     -------- -  -  4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.869029] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.870571] task:fio             state:D stack:0 pid:1055502 ppid:1055490 flags:0x00004080
[ 9954.872162] Call Trace:
[ 9954.872947]  __schedule+0x2d1/0x870
[ 9954.873844]  schedule+0x55/0xf0
[ 9954.874716]  schedule_timeout+0x19f/0x320
[ 9954.875645]  ? __next_timer_interrupt+0xf0/0xf0
[ 9954.876722]  io_schedule_timeout+0x19/0x40
[ 9954.877677]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 9954.878822]  ? finish_wait+0x80/0x80
[ 9954.879694]  __cv_timedwait_io+0x15/0x20 [spl]
[ 9954.880763]  zio_wait+0x1ad/0x4f0 [zfs]
[ 9954.881865]  dmu_write_abd+0x174/0x1c0 [zfs]
[ 9954.883074]  dmu_write_uio_direct+0x79/0x100 [zfs]
[ 9954.884285]  dmu_write_uio_dnode+0xb2/0x320 [zfs]
[ 9954.885507]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 9954.886687]  zfs_write+0x581/0xe20 [zfs]
[ 9954.887822]  ? iov_iter_get_pages+0xe9/0x390
[ 9954.888862]  ? trylock_page+0xd/0x20 [zfs]
[ 9954.890005]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 9954.891217]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 9954.892391]  zpl_iter_write_direct+0xd4/0x170 [zfs]
[ 9954.893663]  ? rrw_exit+0xc6/0x200 [zfs]
[ 9954.894764]  zpl_iter_write+0xd5/0x110 [zfs]
[ 9954.895911]  new_sync_write+0x112/0x160
[ 9954.896881]  vfs_write+0xa5/0x1b0
[ 9954.897701]  ksys_write+0x4f/0xb0
[ 9954.898569]  do_syscall_64+0x5b/0x1b0
[ 9954.899417]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.900515] RIP: 0033:0x7eff236baa47
[ 9954.901363] Code: Unable to access opcode bytes at RIP 0x7eff236baa1d.
[ 9954.902673] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 9954.904099] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007eff236baa47
[ 9954.905535] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI: 0000000000000005
[ 9954.906902] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09: 0000000000000000
[ 9954.908339] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000e4000
[ 9954.909705] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15: 000055e4d1f13ae8
[ 9954.911129] INFO: task fio:1055504 blocked for more than 120 seconds.
[ 9954.912381]       Tainted: P           OE     -------- -  -  4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.913978] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.915434] task:fio             state:D stack:0 pid:1055504 ppid:1055493 flags:0x00000080
[ 9954.917082] Call Trace:
[ 9954.917773]  __schedule+0x2d1/0x870
[ 9954.918648]  ? zilog_dirty+0x4f/0xc0 [zfs]
[ 9954.919831]  schedule+0x55/0xf0
[ 9954.920717]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.921704]  ? finish_wait+0x80/0x80
[ 9954.922639]  zfs_rangelock_enter_writer+0x46/0x1c0 [zfs]
[ 9954.923940]  zfs_rangelock_enter_impl+0x12a/0x1b0 [zfs]
[ 9954.925306]  zfs_write+0x703/0xe20 [zfs]
[ 9954.926406]  zpl_iter_write_buffered+0xb2/0x120 [zfs]
[ 9954.927687]  ? rrw_exit+0xc6/0x200 [zfs]
[ 9954.928821]  zpl_iter_write+0xbe/0x110 [zfs]
[ 9954.930028]  new_sync_write+0x112/0x160
[ 9954.930913]  vfs_write+0xa5/0x1b0
[ 9954.931758]  ksys_write+0x4f/0xb0
[ 9954.932666]  do_syscall_64+0x5b/0x1b0
[ 9954.933544]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.934689] RIP: 0033:0x7fcaee8f0a47
[ 9954.935551] Code: Unable to access opcode bytes at RIP 0x7fcaee8f0a1d.
[ 9954.936893] RSP: 002b:00007fff56b2c240 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 9954.938327] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fcaee8f0a47
[ 9954.939777] RDX: 000000000001d000 RSI: 00007fca8300b010 RDI: 0000000000000006
[ 9954.941187] RBP: 00007fca8300b010 R08: 0000000000000000 R09: 0000000000000000
[ 9954.942655] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000001d000
[ 9954.944062] R13: 0000557a2006bac0 R14: 000000000001d000 R15: 0000557a2006bae8
[ 9954.945525] INFO: task fio:1055505 blocked for more than 120 seconds.
[ 9954.946819]       Tainted: P           OE     -------- -  -  4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.948466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.949959] task:fio             state:D stack:0 pid:1055505 ppid:1055493 flags:0x00004080
[ 9954.951653] Call Trace:
[ 9954.952417]  __schedule+0x2d1/0x870
[ 9954.953393]  ? finish_wait+0x3e/0x80
[ 9954.954315]  schedule+0x55/0xf0
[ 9954.955212]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.956211]  ? finish_wait+0x80/0x80
[ 9954.957159]  zil_commit_waiter+0xfa/0x3b0 [zfs]
[ 9954.958343]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 9954.959524]  zfs_fsync+0x66/0x90 [zfs]
[ 9954.960626]  zpl_fsync+0xe5/0x140 [zfs]
[ 9954.961763]  do_fsync+0x38/0x70
[ 9954.962638]  __x64_sys_fsync+0x10/0x20
[ 9954.963520]  do_syscall_64+0x5b/0x1b0
[ 9954.964470]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.965567] RIP: 0033:0x7fcaee8f1057
[ 9954.966490] Code: Unable to access opcode bytes at RIP 0x7fcaee8f102d.
[ 9954.967752] RSP: 002b:00007fff56b2c230 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[ 9954.969260] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fcaee8f1057
[ 9954.970628] RDX: 0000000000000000 RSI: 0000557a2006bac0 RDI: 0000000000000005
[ 9954.972092] RBP: 00007fca84152a18 R08: 0000000000000000 R09: 0000000000000000
[ 9954.973484] R10: 0000000000035000 R11: 0000000000000293 R12: 0000000000000003
[ 9954.974958] R13: 0000557a2006bac0 R14: 0000000000000000 R15: 0000557a2006bae8
[10077.648150] INFO: task z_wr_int:1051869 blocked for more than 120 seconds.
[10077.649541]       Tainted: P           OE     -------- -  -  4.18.0-553.5.1.el8_10.x86_64 #1
[10077.651116] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10077.652782] task:z_wr_int        state:D stack:0 pid:1051869 ppid:2      flags:0x80004080
[10077.654420] Call Trace:
[10077.655267]  __schedule+0x2d1/0x870
[10077.656179]  ? free_one_page+0x204/0x530
[10077.657192]  schedule+0x55/0xf0
[10077.658004]  cv_wait_common+0x16d/0x280 [spl]
[10077.659018]  ? finish_wait+0x80/0x80
[10077.660013]  dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs]
[10077.661396]  dmu_write_direct_done+0x90/0x3b0 [zfs]
[10077.662617]  zio_done+0x373/0x1d50 [zfs]
[10077.663783]  zio_execute+0xee/0x210 [zfs]
[10077.664921]  taskq_thread+0x205/0x3f0 [spl]
[10077.665982]  ? wake_up_q+0x60/0x60
[10077.666842]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[10077.668295]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[10077.669360]  kthread+0x134/0x150
[10077.670191]  ? set_kthread_struct+0x50/0x50
[10077.671209]  ret_from_fork+0x35/0x40
[10077.672076] INFO: task txg_sync:1051894 blocked for more than 120 seconds.
[10077.673467]       Tainted: P           OE     -------- -  -  4.18.0-553.5.1.el8_10.x86_64 #1
[10077.675112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10077.676612] task:txg_sync        state:D stack:0 pid:1051894 ppid:2      flags:0x80004080
[10077.678288] Call Trace:
[10077.679024]  __schedule+0x2d1/0x870
[10077.679948]  ? __wake_up_common+0x7a/0x190
[10077.681042]  schedule+0x55/0xf0
[10077.681899]  schedule_timeout+0x19f/0x320
[10077.682951]  ? __next_timer_interrupt+0xf0/0xf0
[10077.684005]  ? taskq_dispatch+0xab/0x280 [spl]
[10077.685085]  io_schedule_timeout+0x19/0x40
[10077.686080]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[10077.687227]  ? finish_wait+0x80/0x80
[10077.688123]  __cv_timedwait_io+0x15/0x20 [spl]
[10077.689206]  zio_wait+0x1ad/0x4f0 [zfs]
[10077.690300]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[10077.691435]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[10077.692636]  spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
[10077.693997]  spa_sync+0x362/0x8f0 [zfs]
[10077.695112]  txg_sync_thread+0x27a/0x3b0 [zfs]
[10077.696239]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[10077.697512]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[10077.698639]  thread_generic_wrapper+0x63/0x90 [spl]
[10077.699687]  kthread+0x134/0x150
[10077.700567]  ? set_kthread_struct+0x50/0x50
[10077.701502]  ret_from_fork+0x35/0x40
[10077.702430] INFO: task fio:1055501 blocked for more than 120 seconds.
[10077.703697]       Tainted: P           OE     -------- -  -  4.18.0-553.5.1.el8_10.x86_64 #1
[10077.705309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10077.706780] task:fio             state:D stack:0 pid:1055501 ppid:1055490 flags:0x00004080
[10077.708479] Call Trace:
[10077.709231]  __schedule+0x2d1/0x870
[10077.710190]  ? dbuf_hold_copy+0xec/0x230 [zfs]
[10077.711368]  schedule+0x55/0xf0
[10077.712286]  cv_wait_common+0x16d/0x280 [spl]
[10077.713316]  ? finish_wait+0x80/0x80
[10077.714262]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[10077.715566]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[10077.716878]  zfs_get_data+0x566/0x810 [zfs]
[10077.718032]  zil_lwb_commit+0x194/0x3f0 [zfs]
[10077.719234]  zil_lwb_write_issue+0x68/0xb90 [zfs]
[10077.720413]  ? __list_add+0x12/0x30 [zfs]
[10077.721525]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[10077.722708]  ? zil_alloc_lwb+0x217/0x360 [zfs]
[10077.723931]  zil_commit_waiter_timeout+0x1f3/0x570 [zfs]
[10077.725273]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[10077.726438]  zil_commit_impl+0x6d/0xd0 [zfs]
[10077.727586]  zfs_fsync+0x66/0x90 [zfs]
[10077.728675]  zpl_fsync+0xe5/0x140 [zfs]
[10077.729755]  do_fsync+0x38/0x70
[10077.730607]  __x64_sys_fsync+0x10/0x20
[10077.731482]  do_syscall_64+0x5b/0x1b0
[10077.732415]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[10077.733487] RIP: 0033:0x7eff236bb057
[10077.734399] Code: Unable to access opcode bytes at RIP 0x7eff236bb02d.
[10077.735657] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[10077.737163] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007eff236bb057
[10077.738526] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI: 0000000000000006
[10077.739966] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09: 0000000000000000
[10077.741336] R10: 0000000000056000 R11: 0000000000000293 R12: 0000000000000003
[10077.742773] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15: 000055e4d1f13ae8
[10077.744168] INFO: task fio:1055502 blocked for more than 120 seconds.
[10077.745505]       Tainted: P           OE     -------- -  -  4.18.0-553.5.1.el8_10.x86_64 #1
[10077.747073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10077.748642] task:fio             state:D stack:0 pid:1055502 ppid:1055490 flags:0x00004080
[10077.750233] Call Trace:
[10077.751011]  __schedule+0x2d1/0x870
[10077.751915]  schedule+0x55/0xf0
[10077.752811]  schedule_timeout+0x19f/0x320
[10077.753762]  ? __next_timer_interrupt+0xf0/0xf0
[10077.754824]  io_schedule_timeout+0x19/0x40
[10077.755782]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[10077.756922]  ? finish_wait+0x80/0x80
[10077.757788]  __cv_timedwait_io+0x15/0x20 [spl]
[10077.758845]  zio_wait+0x1ad/0x4f0 [zfs]
[10077.759941]  dmu_write_abd+0x174/0x1c0 [zfs]
[10077.761144]  dmu_write_uio_direct+0x79/0x100 [zfs]
[10077.762327]  dmu_write_uio_dnode+0xb2/0x320 [zfs]
[10077.763523]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[10077.764749]  zfs_write+0x581/0xe20 [zfs]
[10077.765825]  ? iov_iter_get_pages+0xe9/0x390
[10077.766842]  ? trylock_page+0xd/0x20 [zfs]
[10077.767956]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[10077.769189]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[10077.770343]  zpl_iter_write_direct+0xd4/0x170 [zfs]
[10077.771570]  ? rrw_exit+0xc6/0x200 [zfs]
[10077.772674]  zpl_iter_write+0xd5/0x110 [zfs]
[10077.773834]  new_sync_write+0x112/0x160
[10077.774805]  vfs_write+0xa5/0x1b0
[10077.775634]  ksys_write+0x4f/0xb0
[10077.776526]  do_syscall_64+0x5b/0x1b0
[10077.777386]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[10077.778488] RIP: 0033:0x7eff236baa47
[10077.779339] Code: Unable to access opcode bytes at RIP 0x7eff236baa1d.
[10077.780655] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[10077.782056] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007eff236baa47
[10077.783507] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI: 0000000000000005
[10077.784890] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09: 0000000000000000
[10077.786303] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000e4000
[10077.787637] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15: 000055e4d1f13ae8

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson committed Sep 12, 2024
1 parent dd8fef6 commit 93c5c84
Showing 6 changed files with 154 additions and 340 deletions.
13 changes: 1 addition & 12 deletions include/sys/dbuf.h
@@ -336,14 +336,6 @@ typedef struct dmu_buf_impl {

/* The buffer was partially read. More reads may follow. */
uint8_t db_partial_read;

/*
* This block is being held under a writer rangelock of a Direct I/O
* write that is waiting for previous buffered writes to synced out
* due to mixed buffered and O_DIRECT operations. This is needed to
* check whether to grab the rangelock in zfs_get_data().
*/
uint8_t db_mixed_io_dio_wait;
} dmu_buf_impl_t;

#define DBUF_HASH_MUTEX(h, idx) \
@@ -392,7 +384,7 @@ dmu_buf_impl_t *dbuf_find(struct objset *os, uint64_t object, uint8_t level,
uint64_t blkid, uint64_t *hash_out);

int dbuf_read(dmu_buf_impl_t *db, zio_t *zio, uint32_t flags);
void dmu_buf_will_clone(dmu_buf_t *db, dmu_tx_t *tx);
void dmu_buf_will_clone_or_dio(dmu_buf_t *db, dmu_tx_t *tx);
void dmu_buf_will_not_fill(dmu_buf_t *db, dmu_tx_t *tx);
void dmu_buf_will_fill(dmu_buf_t *db, dmu_tx_t *tx, boolean_t canfail);
boolean_t dmu_buf_fill_done(dmu_buf_t *db, dmu_tx_t *tx, boolean_t failed);
@@ -401,9 +393,6 @@ dbuf_dirty_record_t *dbuf_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
dbuf_dirty_record_t *dbuf_dirty_lightweight(dnode_t *dn, uint64_t blkid,
dmu_tx_t *tx);
boolean_t dbuf_undirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
void dmu_buf_direct_mixed_io_wait(dmu_buf_impl_t *db, uint64_t txg,
boolean_t read);
void dmu_buf_undirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
blkptr_t *dmu_buf_get_bp_from_dbuf(dmu_buf_impl_t *db);
int dmu_buf_untransform_direct(dmu_buf_impl_t *db, spa_t *spa);
arc_buf_t *dbuf_loan_arcbuf(dmu_buf_impl_t *db);
1 change: 0 additions & 1 deletion include/sys/dmu.h
@@ -1088,7 +1088,6 @@ typedef struct zgd {
struct blkptr *zgd_bp;
dmu_buf_t *zgd_db;
struct zfs_locked_range *zgd_lr;
boolean_t zgd_grabbed_rangelock;
void *zgd_private;
} zgd_t;

210 changes: 56 additions & 154 deletions module/zfs/dbuf.c
@@ -82,13 +82,6 @@ typedef struct dbuf_stats {
*/
kstat_named_t cache_levels[DN_MAX_LEVELS];
kstat_named_t cache_levels_bytes[DN_MAX_LEVELS];
/*
* Statistics for Direct I/O.
*/
kstat_named_t direct_mixed_io_read_wait;
kstat_named_t direct_mixed_io_write_wait;
kstat_named_t direct_sync_wait;
kstat_named_t direct_undirty;
/*
* Statistics about the dbuf hash table.
*/
@@ -137,10 +130,6 @@ dbuf_stats_t dbuf_stats = {
{ "cache_total_evicts", KSTAT_DATA_UINT64 },
{ { "cache_levels_N", KSTAT_DATA_UINT64 } },
{ { "cache_levels_bytes_N", KSTAT_DATA_UINT64 } },
{ "direct_mixed_io_read_wait", KSTAT_DATA_UINT64 },
{ "direct_mixed_io_write_wait", KSTAT_DATA_UINT64 },
{ "direct_sync_wait", KSTAT_DATA_UINT64 },
{ "direct_undirty", KSTAT_DATA_UINT64 },
{ "hash_hits", KSTAT_DATA_UINT64 },
{ "hash_misses", KSTAT_DATA_UINT64 },
{ "hash_collisions", KSTAT_DATA_UINT64 },
@@ -162,10 +151,6 @@ struct {
wmsum_t cache_total_evicts;
wmsum_t cache_levels[DN_MAX_LEVELS];
wmsum_t cache_levels_bytes[DN_MAX_LEVELS];
wmsum_t direct_mixed_io_read_wait;
wmsum_t direct_mixed_io_write_wait;
wmsum_t direct_sync_wait;
wmsum_t direct_undirty;
wmsum_t hash_hits;
wmsum_t hash_misses;
wmsum_t hash_collisions;
@@ -911,14 +896,6 @@ dbuf_kstat_update(kstat_t *ksp, int rw)
ds->cache_levels_bytes[i].value.ui64 =
wmsum_value(&dbuf_sums.cache_levels_bytes[i]);
}
ds->direct_mixed_io_read_wait.value.ui64 =
wmsum_value(&dbuf_sums.direct_mixed_io_read_wait);
ds->direct_mixed_io_write_wait.value.ui64 =
wmsum_value(&dbuf_sums.direct_mixed_io_write_wait);
ds->direct_sync_wait.value.ui64 =
wmsum_value(&dbuf_sums.direct_sync_wait);
ds->direct_undirty.value.ui64 =
wmsum_value(&dbuf_sums.direct_undirty);
ds->hash_hits.value.ui64 =
wmsum_value(&dbuf_sums.hash_hits);
ds->hash_misses.value.ui64 =
@@ -1021,10 +998,6 @@ dbuf_init(void)
wmsum_init(&dbuf_sums.cache_levels[i], 0);
wmsum_init(&dbuf_sums.cache_levels_bytes[i], 0);
}
wmsum_init(&dbuf_sums.direct_mixed_io_read_wait, 0);
wmsum_init(&dbuf_sums.direct_mixed_io_write_wait, 0);
wmsum_init(&dbuf_sums.direct_sync_wait, 0);
wmsum_init(&dbuf_sums.direct_undirty, 0);
wmsum_init(&dbuf_sums.hash_hits, 0);
wmsum_init(&dbuf_sums.hash_misses, 0);
wmsum_init(&dbuf_sums.hash_collisions, 0);
@@ -1097,10 +1070,6 @@ dbuf_fini(void)
wmsum_fini(&dbuf_sums.cache_levels[i]);
wmsum_fini(&dbuf_sums.cache_levels_bytes[i]);
}
wmsum_fini(&dbuf_sums.direct_mixed_io_read_wait);
wmsum_fini(&dbuf_sums.direct_mixed_io_write_wait);
wmsum_fini(&dbuf_sums.direct_sync_wait);
wmsum_fini(&dbuf_sums.direct_undirty);
wmsum_fini(&dbuf_sums.hash_hits);
wmsum_fini(&dbuf_sums.hash_misses);
wmsum_fini(&dbuf_sums.hash_collisions);
@@ -1271,9 +1240,8 @@ dbuf_clear_data(dmu_buf_impl_t *db)
{
ASSERT(MUTEX_HELD(&db->db_mtx));
dbuf_evict_user(db);
/* Direct I/O writes may have data */
if (db->db_buf == NULL)
db->db.db_data = NULL;
ASSERT3P(db->db_buf, ==, NULL);
db->db.db_data = NULL;
if (db->db_state != DB_NOFILL) {
db->db_state = DB_UNCACHED;
DTRACE_SET_STATE(db, "clear data");
@@ -2789,93 +2757,6 @@ dmu_buf_is_dirty(dmu_buf_t *db_fake, dmu_tx_t *tx)
return (dr != NULL);
}

void
dmu_buf_direct_mixed_io_wait(dmu_buf_impl_t *db, uint64_t txg, boolean_t read)
{
ASSERT(MUTEX_HELD(&db->db_mtx));

if (read == B_TRUE) {
/*
* If a buffered read is in process, a Direct I/O read will
* wait for the buffered I/O to complete.
*/
ASSERT3U(txg, ==, 0);
while (db->db_state == DB_READ) {
DBUF_STAT_BUMP(direct_mixed_io_read_wait);
cv_wait(&db->db_changed, &db->db_mtx);
}
} else {
/*
* There must be an ARC buf associated with this Direct I/O
* write otherwise there is no reason to wait for previous
* dirty records to sync out.
*
* The db_state will temporarily be set to DB_CACHED so that
* that any synchronous writes issued through the ZIL will
* still be handled properly. In particular, the call to
* dbuf_read() in dmu_sync_late_arrival() must account for the
* data still being in the ARC. After waiting here for previous
* TXGs to sync out, dmu_write_direct_done() will update the
* db_state.
*/
ASSERT3P(db->db_buf, !=, NULL);
ASSERT3U(txg, >, 0);
db->db_mixed_io_dio_wait = TRUE;
db->db_state = DB_CACHED;
while (dbuf_find_dirty_lte(db, txg) != NULL) {
DBUF_STAT_BUMP(direct_mixed_io_write_wait);
cv_wait(&db->db_changed, &db->db_mtx);
}
db->db_mixed_io_dio_wait = FALSE;
}
}

/*
* Direct I/O writes may need to undirty the open-context dirty record
* associated with it in the event of an I/O error.
*/
void
dmu_buf_undirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
{
/*
* Direct I/O writes always happen in open-context.
*/
ASSERT(!dmu_tx_is_syncing(tx));
ASSERT(MUTEX_HELD(&db->db_mtx));
ASSERT(db->db_state == DB_NOFILL || db->db_state == DB_UNCACHED);


/*
* In the event of an I/O error we will handle the metaslab clean up in
* zio_done(). Also, the dirty record's dr_overridden_by BP is not
* currently set as that is done in dmu_sync_done(). Since the db_state
* is still set to DB_NOFILL, dbuf_unoverride() will not be called in
* dbuf_undirty() and the dirty record's BP will not be added the SPA's
* spa_free_bplist via zio_free().
*
* This function can also be called in the event that a Direct I/O
* write is overwriting a previous Direct I/O to the same block for
* this TXG. It is important to go ahead and free up the space
* accounting in this case through dbuf_undirty() -> dbuf_unoverride()
* -> zio_free(). This is necessary because the space accounting for
* determining if a write can occur in zfs_write() happens through
* dmu_tx_assign(). This can cause an issue with Direct I/O writes in
* the case of overwrites, because all DVA allocations are being done
* in open-context. Constanstly allowing Direct I/O overwrites to the
* same blocks can exhaust the pools available space leading to ENOSPC
* errors at the DVA allcoation part of the ZIO pipeline, which will
* eventually suspend the pool. By cleaning up space accounting now
* the ENOSPC pool suspend can be avoided.
*
* Since we are undirtying the record for the Direct I/O in
* open-context we must have a hold on the db, so it should never be
* evicted after calling dbuf_undirty().
*/
VERIFY3B(dbuf_undirty(db, tx), ==, B_FALSE);

DBUF_STAT_BUMP(direct_undirty);
}

/*
* Normally the db_blkptr points to the most recent on-disk content for the
* dbuf (and anything newer will be cached in the dbuf). However, a recent
@@ -2951,22 +2832,49 @@ dmu_buf_untransform_direct(dmu_buf_impl_t *db, spa_t *spa)
}

void
dmu_buf_will_clone(dmu_buf_t *db_fake, dmu_tx_t *tx)
dmu_buf_will_clone_or_dio(dmu_buf_t *db_fake, dmu_tx_t *tx)
{
dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
ASSERT0(db->db_level);
ASSERT(db->db_blkid != DMU_BONUS_BLKID);
ASSERT(db->db.db_object != DMU_META_DNODE_OBJECT);

/*
* Block cloning: We are going to clone into this block, so undirty
* modifications done to this block so far in this txg. This includes
* writes and clones into this block.
* Block clones and Direct I/O writes always happen in open-context.
*/
ASSERT(!dmu_tx_is_syncing(tx));
ASSERT0(db->db_level);
ASSERT(db->db_blkid != DMU_BONUS_BLKID);
ASSERT(db->db.db_object != DMU_META_DNODE_OBJECT);

mutex_enter(&db->db_mtx);
DBUF_VERIFY(db);
VERIFY(!dbuf_undirty(db, tx));

/*
* We are going to clone or issue a Direct I/O write on this block, so
* undirty modifications done to this block so far in this txg. This
* includes writes and clones into this block.
*
* If there is a dirty record associated with this txg from a previous
* Direct I/O write, then space accounting cleanup takes place. It is
* important to go ahead and free up the space accounting through
* dbuf_undirty() -> dbuf_unoverride() -> zio_free(). Space accounting
* for determining if a write can occur in zfs_write() happens through
* dmu_tx_assign(). This can cause an issue with Direct I/O writes in
* the case of overwriting the same block, because all DVA allocations
* are being done in open-context. Constantly allowing Direct I/O
* overwrites to the same block can exhaust the pool's available space,
* leading to ENOSPC errors at the DVA allocation part of the ZIO
* pipeline, which will eventually suspend the pool. By cleaning up
* space accounting now, the ENOSPC error can be avoided.
*
* Since we are undirtying the record in open-context, we must have a
* hold on the db, so it should never be evicted after calling
* dbuf_undirty().
*/
VERIFY3B(dbuf_undirty(db, tx), ==, B_FALSE);
ASSERT0P(dbuf_find_dirty_eq(db, tx->tx_txg));

if (db->db_buf != NULL) {
/*
* If there is an associated ARC buffer with this dbuf we can
@@ -2977,6 +2885,11 @@ dmu_buf_will_clone(dmu_buf_t *db_fake, dmu_tx_t *tx)
if (dr == NULL || dr->dt.dl.dr_data != db->db_buf)
arc_buf_destroy(db->db_buf, db);

/*
* Setting the dbuf's data pointers to NULL will force all
* future reads down to the devices to get the most up to date
* version of the data after a Direct I/O write has completed.
*/
db->db_buf = NULL;
dbuf_clear_data(db);
}
Expand All @@ -2985,7 +2898,8 @@ dmu_buf_will_clone(dmu_buf_t *db_fake, dmu_tx_t *tx)
ASSERT3P(db->db.db_data, ==, NULL);

db->db_state = DB_NOFILL;
DTRACE_SET_STATE(db, "allocating NOFILL buffer for clone");
DTRACE_SET_STATE(db,
"allocating NOFILL buffer for clone or direct I/O write");

DBUF_VERIFY(db);
mutex_exit(&db->db_mtx);
@@ -3532,7 +3446,6 @@ dbuf_create(dnode_t *dn, uint8_t level, uint64_t blkid,
db->db_user_immediate_evict = FALSE;
db->db_freed_in_flight = FALSE;
db->db_pending_evict = FALSE;
db->db_mixed_io_dio_wait = FALSE;

if (blkid == DMU_BONUS_BLKID) {
ASSERT3P(parent, ==, dn->dn_dbuf);
@@ -4788,25 +4701,6 @@ dbuf_sync_leaf(dbuf_dirty_record_t *dr, dmu_tx_t *tx)
dprintf_dbuf_bp(db, db->db_blkptr, "blkptr=%p", db->db_blkptr);

mutex_enter(&db->db_mtx);

/*
* It is possible a buffered read has come in after a Direct I/O
* write and is currently transistioning the db_state from DB_READ
* in dbuf_read_impl() to another state in dbuf_read_done(). We
* have to wait in order for the dbuf state to change from DB_READ
* before syncing the dirty record of the Direct I/O write.
*/
if (db->db_state == DB_READ && !dr->dt.dl.dr_brtwrite) {
ASSERT3P(*datap, ==, NULL);
ASSERT3P(db->db_buf, ==, NULL);
ASSERT3P(db->db.db_data, ==, NULL);
ASSERT3U(dr->dt.dl.dr_override_state, ==, DR_OVERRIDDEN);
while (db->db_state == DB_READ) {
DBUF_STAT_BUMP(direct_sync_wait);
cv_wait(&db->db_changed, &db->db_mtx);
}
}

/*
* To be synced, we must be dirtied. But we might have been freed
* after the dirty.
@@ -4819,13 +4713,21 @@ dbuf_sync_leaf(dbuf_dirty_record_t *dr, dmu_tx_t *tx)
ASSERT(db->db.db_data != dr->dt.dl.dr_data);
} else if (db->db_state == DB_READ) {
/*
* This buffer has a clone we need to write, and an in-flight
* read on the BP we're about to clone. Its safe to issue the
* write here because the read has already been issued and the
* contents won't change.
* This buffer was either cloned or had a Direct I/O write
* occur and has an in-flight read on the BP. It is safe to
* issue the write here, because the read has already been
* issued and the contents won't change.
*
* We can verify the case of both the clone and Direct I/O
* write by making sure the first dirty record for the dbuf
* has no ARC buffer associated with it.
*/
ASSERT(dr->dt.dl.dr_brtwrite &&
dr->dt.dl.dr_override_state == DR_OVERRIDDEN);
dbuf_dirty_record_t *dr_head =
list_head(&db->db_dirty_records);
ASSERT3P(db->db_buf, ==, NULL);
ASSERT3P(db->db.db_data, ==, NULL);
ASSERT3P(dr_head->dt.dl.dr_data, ==, NULL);
ASSERT3U(dr_head->dt.dl.dr_override_state, ==, DR_OVERRIDDEN);
} else {
ASSERT(db->db_state == DB_CACHED || db->db_state == DB_NOFILL);
}
@@ -5522,7 +5424,7 @@ EXPORT_SYMBOL(dbuf_dirty);
EXPORT_SYMBOL(dmu_buf_set_crypt_params);
EXPORT_SYMBOL(dmu_buf_will_dirty);
EXPORT_SYMBOL(dmu_buf_is_dirty);
EXPORT_SYMBOL(dmu_buf_will_clone);
EXPORT_SYMBOL(dmu_buf_will_clone_or_dio);
EXPORT_SYMBOL(dmu_buf_will_not_fill);
EXPORT_SYMBOL(dmu_buf_will_fill);
EXPORT_SYMBOL(dmu_buf_fill_done);
2 changes: 1 addition & 1 deletion module/zfs/dmu.c
@@ -2676,7 +2676,7 @@ dmu_brt_clone(objset_t *os, uint64_t object, uint64_t offset, uint64_t length,
ASSERT(db->db_blkid != DMU_SPILL_BLKID);
ASSERT(BP_IS_HOLE(bp) || dbuf->db_size == BP_GET_LSIZE(bp));

dmu_buf_will_clone(dbuf, tx);
dmu_buf_will_clone_or_dio(dbuf, tx);

mutex_enter(&db->db_mtx);

(Diffs for the remaining changed files are not shown.)
