Use Barriers in pre-2.6.24 kernels #1

Closed
behlendorf opened this issue May 14, 2010 · 1 comment
Labels
Type: Feature Feature request or new feature

Comments

@behlendorf
Contributor

Barrier support was added as of zfs-0.4.5. This has been implemented in the vdev_disk layer for 2.6.24+ kernels. This is where generic support first appears for empty bios in the kernel, which can be submitted as a WRITE_BARRIER io request. This prevents the elevator from reordering requests from one side of the barrier to the other, and the callback will only run once this io (and all previous ios) are physically on disk. For kernels prior to this there is a more primitive barrier mechanism, but the code has not been updated to use it and instead returns ENOTSUP to indicate there is no barrier support. This work needs to be done.

@behlendorf
Contributor Author

The updated code with the POSIX layer no longer builds with kernels older than 2.6.26, so this is no longer an issue. The new API became available in 2.6.24. Unless someone makes a very good case that older kernels need to be supported, this work does not need to happen.

@ghost ghost mentioned this issue Nov 6, 2011
@b333z b333z mentioned this issue Dec 14, 2011
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 9, 2024
Varada (varada.kari@gmail.com) pointed out an issue with O_DIRECT reads
with the following test case:

dd if=/dev/urandom of=/local_zpool/file2 bs=512 count=79
truncate -s 40382 /local_zpool/file2
zpool export local_zpool
zpool import -d ~/ local_zpool
dd if=/local_zpool/file2 of=/dev/null bs=1M iflag=direct

That led to the following panic:

[  307.769267] VERIFY(IS_P2ALIGNED(size, sizeof (uint32_t))) failed
[  307.782997] PANIC at zfs_fletcher.c:870:abd_fletcher_4_iter()
[  307.788743] Showing stack for process 9665
[  307.792834] CPU: 47 PID: 9665 Comm: z_rd_int_5 Kdump: loaded Tainted:
P           OE    --------- -  - 4.18.0-408.el8.x86_64 openzfs#1
[  307.804298] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21
10/08/2020
[  307.811682] Call Trace:
[  307.814131]  dump_stack+0x41/0x60
[  307.817449]  spl_panic+0xd0/0xe8 [spl]
[  307.821210]  ? irq_work_queue+0x9/0x20
[  307.824961]  ? wake_up_klogd.part.30+0x30/0x40
[  307.829407]  ? vprintk_emit+0x125/0x250
[  307.833246]  ? printk+0x58/0x6f
[  307.836391]  spl_assert.constprop.1+0x16/0x20 [zfs]
[  307.841438]  abd_fletcher_4_iter+0x6c/0x101 [zfs]
[  307.846343]  ? abd_fletcher_4_simd2scalar+0x83/0x83 [zfs]
[  307.851922]  abd_iterate_func+0xb1/0x170 [zfs]
[  307.856533]  abd_fletcher_4_impl+0x3f/0xa0 [zfs]
[  307.861334]  abd_fletcher_4_native+0x52/0x70 [zfs]
[  307.866302]  ? enqueue_entity+0xf1/0x6e0
[  307.870226]  ? select_idle_sibling+0x23/0x700
[  307.874587]  ? enqueue_task_fair+0x94/0x710
[  307.878771]  ? select_task_rq_fair+0x351/0x990
[  307.883208]  zio_checksum_error_impl+0xff/0x5f0 [zfs]
[  307.888435]  ? abd_fletcher_4_impl+0xa0/0xa0 [zfs]
[  307.893401]  ? spl_kmem_alloc_impl+0xce/0xf0 [spl]
[  307.898203]  ? __wake_up_common+0x7a/0x190
[  307.902300]  ? __switch_to_asm+0x41/0x70
[  307.906220]  ? __switch_to_asm+0x35/0x70
[  307.910145]  ? __switch_to_asm+0x41/0x70
[  307.914061]  ? __switch_to_asm+0x35/0x70
[  307.917980]  ? __switch_to_asm+0x41/0x70
[  307.921903]  ? __switch_to_asm+0x35/0x70
[  307.925821]  ? __switch_to_asm+0x35/0x70
[  307.929739]  ? __switch_to_asm+0x41/0x70
[  307.933658]  ? __switch_to_asm+0x35/0x70
[  307.937582]  zio_checksum_error+0x47/0xc0 [zfs]
[  307.942288]  raidz_checksum_verify+0x3a/0x70 [zfs]
[  307.947257]  vdev_raidz_io_done+0x4b/0x160 [zfs]
[  307.952049]  zio_vdev_io_done+0x7f/0x200 [zfs]
[  307.956669]  zio_execute+0xee/0x210 [zfs]
[  307.960855]  taskq_thread+0x203/0x420 [spl]
[  307.965048]  ? wake_up_q+0x70/0x70
[  307.968455]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[  307.974807]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[  307.979260]  kthread+0x10a/0x120
[  307.982485]  ? set_kthread_struct+0x40/0x40
[  307.986670]  ret_from_fork+0x35/0x40

The reason this was occurring is that the zpool export forced the initial O_DIRECT read to go down to disk. In this case the request was still valid, as bs=1M is page-size aligned; however, the file length was not. So when issuing the O_DIRECT read, even after calling make_abd_for_dbuf(), we had an extra page allocated in the original ABD along with the linear ABD attached at the end of the gang ABD from make_abd_for_dbuf().

This is an issue because our expectation with reads is that the block sizes being read are page aligned. Our check only validates the request itself, not the actual amount of data we may read, such as the entire file.

In order to remedy this situation, I updated zfs_read() to attempt to read as much as it can using O_DIRECT, based on whether the length is page-size aligned. Any additional bytes that are requested are then read into the ARC. This still keeps our semantics that I/O requests must be page-size aligned.
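The split described above can be sketched with a little arithmetic. This is a hedged illustration only (a hypothetical helper, assuming a 4 KiB page size), not the actual zfs_read() logic:

```python
# Illustrative arithmetic only, not the zfs_read() implementation:
# split a request into a page-aligned portion that can go down the
# O_DIRECT path and a tail that is read through the ARC.
PAGE_SIZE = 4096  # assumed page size

def split_direct_read(length):
    """Return (direct_bytes, buffered_bytes) for a request of `length`."""
    direct = (length // PAGE_SIZE) * PAGE_SIZE  # round down to a page boundary
    return direct, length - direct

# The reproducer's file is 40382 bytes: nine full pages (36864 bytes) can be
# read with O_DIRECT, and the stray 3518 bytes fall back to buffered I/O.
print(split_direct_read(40382))  # (36864, 3518)
```

This also shows why the single-block case reads twice: a request shorter than one page yields a zero-length direct portion and goes entirely through the ARC.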

There is a drawback here when only a single block is being read: that block will be read twice, once using O_DIRECT and then again using buffered I/O to fill in the remaining data for the user's request. However, this should not be a big issue most of the time. In the normal case a user asks for a lot of data from a file, and only the stray bytes at the end of the file have to be read using the ARC.

In order to make sure this case is completely covered, I added a new ZTS test case, dio_unaligned_filesize, to test this out. The main thing this test case verifies is that the first O_DIRECT read issues out three reads: two being O_DIRECT and the third being buffered for the remaining requested bytes.

As part of this commit, I also updated stride_dd to take an additional parameter, -e, which says to read the entire input file and ignore the count (-c) option. We need to use stride_dd on FreeBSD because dd does not make sure the buffer is page aligned. This update to stride_dd allows us to use it to test out this case in dio_unaligned_filesize on both Linux and FreeBSD.

While this may not be the most elegant solution, it does stick with the semantics and still reads all the data the user requested. I am fine with revisiting this; maybe we should just return a short read?

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 9, 2024
We were using the generic Linux calls to make sure that the page cache was cleaned out before issuing any Direct I/O reads or writes. However, this only matters in the event the file region being written/read with O_DIRECT was mmap'ed. One of the stipulations with O_DIRECT is that it is redirected through the ARC in the event the file range is mmap'ed. Because of this, it did not make sense to try to invalidate the page cache if we never intended O_DIRECT to work with mmap'ed regions. Also, calls into the generic Linux code on the write path would often lead to lockups, as the page lock is dropped in zfs_putpage(). See the stack dump below. To prevent this, we no longer use the generic Linux direct I/O wrappers or try to flush out the page cache.

Instead, if we find the file range has been mmap'ed in since the initial check in zfs_setup_direct(), we now handle that directly in zfs_read() and zfs_write(). In most cases zfs_setup_direct() will prevent O_DIRECT to mmap'ed regions of the file that have been page faulted in, but if that happens while we are issuing the direct I/O request, the normal ZFS paths will be taken to account for it.

It is highly suggested not to mmap a region of a file and then write or read directly to that file. In general, that is kind of an insane thing to do... However, we try our best to still have consistency with the ARC.

Also, before making this decision I explored whether we could just add a rangelock in zfs_fillpage(), but we cannot. The reason is that by the time the page reaches zfs_readpage_common() it has already been locked by the kernel. So if we try to grab the rangelock anywhere in that path, we can get stuck if another thread is issuing writes to the mmap'ed file region: update_pages() holds the rangelock and then tries to lock the page, while zfs_fillpage() holds the page lock and waits on the rangelock. Deadlock is unavoidable in this case.

[260136.244332] INFO: task fio:3791107 blocked for more than 120
seconds.
[260136.250867]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260136.258693] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260136.266607] task:fio             state:D stack:    0 pid:3791107
ppid:3790841 flags:0x00004080
[260136.275306] Call Trace:
[260136.277845]  __schedule+0x2d1/0x830
[260136.281432]  schedule+0x35/0xa0
[260136.284665]  io_schedule+0x12/0x40
[260136.288157]  wait_on_page_bit+0x123/0x220
[260136.292258]  ? xas_load+0x8/0x80
[260136.295577]  ? file_fdatawait_range+0x20/0x20
[260136.300024]  filemap_page_mkwrite+0x9b/0xb0
[260136.304295]  do_page_mkwrite+0x53/0x90
[260136.308135]  ? vm_normal_page+0x1a/0xc0
[260136.312062]  do_wp_page+0x298/0x350
[260136.315640]  __handle_mm_fault+0x44f/0x6c0
[260136.319826]  ? __switch_to_asm+0x41/0x70
[260136.323839]  handle_mm_fault+0xc1/0x1e0
[260136.327766]  do_user_addr_fault+0x1b5/0x440
[260136.332038]  do_page_fault+0x37/0x130
[260136.335792]  ? page_fault+0x8/0x30
[260136.339284]  page_fault+0x1e/0x30
[260136.342689] RIP: 0033:0x7f6deee7f1b4
[260136.346361] Code: Unable to access opcode bytes at RIP
0x7f6deee7f18a.
[260136.352977] RSP: 002b:00007fffe41b6538 EFLAGS: 00010202
[260136.358288] RAX: 00007f6d83049000 RBX: 0000556b63614ec0 RCX:
00007f6d83148fe0
[260136.365508] RDX: 00000000000acfe0 RSI: 00007f6d84e9c030 RDI:
00007f6d8309bfa0
[260136.372730] RBP: 00007f6d84f4a000 R08: ffffffffffffffe0 R09:
0000000000000000
[260136.379946] R10: 00007f6d84f8e810 R11: 00007f6d83049000 R12:
0000000000000001
[260136.387167] R13: 0000556b63614ec0 R14: 0000000000100000 R15:
0000556b63614ee8
[260136.394387] INFO: task fio:3791108 blocked for more than 120
seconds.
[260136.400911]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260136.408739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260136.416651] task:fio             state:D stack:    0 pid:3791108
ppid:3790835 flags:0x00004080
[260136.425343] Call Trace:
[260136.427883]  __schedule+0x2d1/0x830
[260136.431463]  ? cv_wait_common+0x12d/0x240 [spl]
[260136.436091]  schedule+0x35/0xa0
[260136.439321]  io_schedule+0x12/0x40
[260136.442814]  __lock_page+0x12d/0x230
[260136.446483]  ? file_fdatawait_range+0x20/0x20
[260136.450929]  zfs_putpage+0x148/0x590 [zfs]
[260136.455322]  ? rmap_walk_file+0x116/0x290
[260136.459421]  ? __mod_memcg_lruvec_state+0x5d/0x160
[260136.464300]  zpl_putpage+0x67/0xd0 [zfs]
[260136.468495]  write_cache_pages+0x197/0x420
[260136.472679]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260136.477732]  zpl_writepages+0x119/0x130 [zfs]
[260136.482352]  do_writepages+0xc2/0x1c0
[260136.486103]  ? flush_tlb_func_common.constprop.9+0x125/0x220
[260136.491850]  __filemap_fdatawrite_range+0xc7/0x100
[260136.496732]  filemap_write_and_wait_range+0x30/0x80
[260136.501695]  generic_file_direct_write+0x120/0x160
[260136.506575]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260136.510779]  zpl_iter_write+0xdd/0x160 [zfs]
[260136.515323]  new_sync_write+0x112/0x160
[260136.519255]  vfs_write+0xa5/0x1a0
[260136.522662]  ksys_write+0x4f/0xb0
[260136.526067]  do_syscall_64+0x5b/0x1a0
[260136.529818]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260136.534959] RIP: 0033:0x7f9d192c7a17
[260136.538625] Code: Unable to access opcode bytes at RIP
0x7f9d192c79ed.
[260136.545236] RSP: 002b:00007ffc8e4ba270 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[260136.552889] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9d192c7a17
[260136.560108] RDX: 0000000000100000 RSI: 00007f9caea03000 RDI:
0000000000000005
[260136.567329] RBP: 00007f9caea03000 R08: 0000000000000000 R09:
0000000000000000
[260136.574548] R10: 00005558e8975680 R11: 0000000000000293 R12:
0000000000100000
[260136.581767] R13: 00005558e8985ec0 R14: 0000000000100000 R15:
00005558e8985ee8
[260136.588989] INFO: task fio:3791109 blocked for more than 120
seconds.
[260136.595513]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260136.603337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260136.611250] task:fio             state:D stack:    0 pid:3791109
ppid:3790838 flags:0x00004080
[260136.619943] Call Trace:
[260136.622483]  __schedule+0x2d1/0x830
[260136.626064]  ? zfs_znode_held+0xe6/0x140 [zfs]
[260136.630777]  schedule+0x35/0xa0
[260136.634009]  cv_wait_common+0x153/0x240 [spl]
[260136.638466]  ? finish_wait+0x80/0x80
[260136.642129]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[260136.647712]  zfs_rangelock_enter_impl+0xbf/0x170 [zfs]
[260136.653121]  zfs_get_data+0x113/0x770 [zfs]
[260136.657567]  zil_lwb_commit+0x537/0x780 [zfs]
[260136.662187]  zil_process_commit_list+0x14c/0x460 [zfs]
[260136.667585]  zil_commit_writer+0xeb/0x160 [zfs]
[260136.672376]  zil_commit_impl+0x5d/0xa0 [zfs]
[260136.676910]  zfs_putpage+0x516/0x590 [zfs]
[260136.681279]  zpl_putpage+0x67/0xd0 [zfs]
[260136.685467]  write_cache_pages+0x197/0x420
[260136.689649]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260136.694705]  zpl_writepages+0x119/0x130 [zfs]
[260136.699322]  do_writepages+0xc2/0x1c0
[260136.703076]  __filemap_fdatawrite_range+0xc7/0x100
[260136.707952]  filemap_write_and_wait_range+0x30/0x80
[260136.712920]  zpl_iter_read_direct+0x86/0x1b0 [zfs]
[260136.717972]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260136.722174]  zpl_iter_read+0x90/0xb0 [zfs]
[260136.726536]  new_sync_read+0x10f/0x150
[260136.730376]  vfs_read+0x91/0x140
[260136.733693]  ksys_read+0x4f/0xb0
[260136.737012]  do_syscall_64+0x5b/0x1a0
[260136.740764]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260136.745906] RIP: 0033:0x7f1bd4687ab4
[260136.749574] Code: Unable to access opcode bytes at RIP
0x7f1bd4687a8a.
[260136.756181] RSP: 002b:00007fff63f65170 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[260136.763834] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f1bd4687ab4
[260136.771056] RDX: 0000000000100000 RSI: 00007f1b69dc3000 RDI:
0000000000000005
[260136.778274] RBP: 00007f1b69dc3000 R08: 0000000000000000 R09:
0000000000000000
[260136.785494] R10: 000000008fd0ea42 R11: 0000000000000246 R12:
0000000000100000
[260136.792714] R13: 000055ca4b405ec0 R14: 0000000000100000 R15:
000055ca4b405ee8
[260259.123003] INFO: task kworker/u128:0:3589938 blocked for more than
120 seconds.
[260259.130487]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260259.138313] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260259.146224] task:kworker/u128:0  state:D stack:    0 pid:3589938
ppid:     2 flags:0x80004080
[260259.154832] Workqueue: writeback wb_workfn (flush-zfs-540)
[260259.160411] Call Trace:
[260259.162950]  __schedule+0x2d1/0x830
[260259.166531]  schedule+0x35/0xa0
[260259.169765]  io_schedule+0x12/0x40
[260259.173257]  __lock_page+0x12d/0x230
[260259.176921]  ? file_fdatawait_range+0x20/0x20
[260259.181368]  write_cache_pages+0x1f2/0x420
[260259.185554]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260259.190633]  zpl_writepages+0x98/0x130 [zfs]
[260259.195183]  do_writepages+0xc2/0x1c0
[260259.198935]  __writeback_single_inode+0x39/0x2f0
[260259.203640]  writeback_sb_inodes+0x1e6/0x450
[260259.208002]  __writeback_inodes_wb+0x5f/0xc0
[260259.212359]  wb_writeback+0x247/0x2e0
[260259.216114]  ? get_nr_inodes+0x35/0x50
[260259.219953]  wb_workfn+0x37c/0x4d0
[260259.223443]  ? __switch_to_asm+0x35/0x70
[260259.227456]  ? __switch_to_asm+0x41/0x70
[260259.231469]  ? __switch_to_asm+0x35/0x70
[260259.235481]  ? __switch_to_asm+0x41/0x70
[260259.239495]  ? __switch_to_asm+0x35/0x70
[260259.243505]  ? __switch_to_asm+0x41/0x70
[260259.247518]  ? __switch_to_asm+0x35/0x70
[260259.251533]  ? __switch_to_asm+0x41/0x70
[260259.255545]  process_one_work+0x1a7/0x360
[260259.259645]  worker_thread+0x30/0x390
[260259.263396]  ? create_worker+0x1a0/0x1a0
[260259.267409]  kthread+0x10a/0x120
[260259.270730]  ? set_kthread_struct+0x40/0x40
[260259.275003]  ret_from_fork+0x35/0x40
[260259.278712] INFO: task fio:3791107 blocked for more than 120
seconds.
[260259.285240]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260259.293064] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260259.300976] task:fio             state:D stack:    0 pid:3791107
ppid:3790841 flags:0x00004080
[260259.309668] Call Trace:
[260259.312210]  __schedule+0x2d1/0x830
[260259.315787]  schedule+0x35/0xa0
[260259.319020]  io_schedule+0x12/0x40
[260259.322511]  wait_on_page_bit+0x123/0x220
[260259.326611]  ? xas_load+0x8/0x80
[260259.329930]  ? file_fdatawait_range+0x20/0x20
[260259.334376]  filemap_page_mkwrite+0x9b/0xb0
[260259.338650]  do_page_mkwrite+0x53/0x90
[260259.342489]  ? vm_normal_page+0x1a/0xc0
[260259.346415]  do_wp_page+0x298/0x350
[260259.349994]  __handle_mm_fault+0x44f/0x6c0
[260259.354181]  ? __switch_to_asm+0x41/0x70
[260259.358193]  handle_mm_fault+0xc1/0x1e0
[260259.362117]  do_user_addr_fault+0x1b5/0x440
[260259.366391]  do_page_fault+0x37/0x130
[260259.370145]  ? page_fault+0x8/0x30
[260259.373639]  page_fault+0x1e/0x30
[260259.377043] RIP: 0033:0x7f6deee7f1b4
[260259.380714] Code: Unable to access opcode bytes at RIP
0x7f6deee7f18a.
[260259.387323] RSP: 002b:00007fffe41b6538 EFLAGS: 00010202
[260259.392633] RAX: 00007f6d83049000 RBX: 0000556b63614ec0 RCX:
00007f6d83148fe0
[260259.399853] RDX: 00000000000acfe0 RSI: 00007f6d84e9c030 RDI:
00007f6d8309bfa0
[260259.407074] RBP: 00007f6d84f4a000 R08: ffffffffffffffe0 R09:
0000000000000000
[260259.414291] R10: 00007f6d84f8e810 R11: 00007f6d83049000 R12:
0000000000000001
[260259.421512] R13: 0000556b63614ec0 R14: 0000000000100000 R15:
0000556b63614ee8
[260259.428733] INFO: task fio:3791108 blocked for more than 120
seconds.
[260259.435258]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260259.443085] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260259.450997] task:fio             state:D stack:    0 pid:3791108
ppid:3790835 flags:0x00004080
[260259.459689] Call Trace:
[260259.462228]  __schedule+0x2d1/0x830
[260259.465808]  ? cv_wait_common+0x12d/0x240 [spl]
[260259.470435]  schedule+0x35/0xa0
[260259.473669]  io_schedule+0x12/0x40
[260259.477161]  __lock_page+0x12d/0x230
[260259.480828]  ? file_fdatawait_range+0x20/0x20
[260259.485274]  zfs_putpage+0x148/0x590 [zfs]
[260259.489640]  ? rmap_walk_file+0x116/0x290
[260259.493742]  ? __mod_memcg_lruvec_state+0x5d/0x160
[260259.498619]  zpl_putpage+0x67/0xd0 [zfs]
[260259.502813]  write_cache_pages+0x197/0x420
[260259.506998]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260259.512054]  zpl_writepages+0x119/0x130 [zfs]
[260259.516672]  do_writepages+0xc2/0x1c0
[260259.520423]  ? flush_tlb_func_common.constprop.9+0x125/0x220
[260259.526170]  __filemap_fdatawrite_range+0xc7/0x100
[260259.531050]  filemap_write_and_wait_range+0x30/0x80
[260259.536016]  generic_file_direct_write+0x120/0x160
[260259.540896]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260259.545099]  zpl_iter_write+0xdd/0x160 [zfs]
[260259.549639]  new_sync_write+0x112/0x160
[260259.553566]  vfs_write+0xa5/0x1a0
[260259.556971]  ksys_write+0x4f/0xb0
[260259.560379]  do_syscall_64+0x5b/0x1a0
[260259.564131]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260259.569269] RIP: 0033:0x7f9d192c7a17
[260259.572935] Code: Unable to access opcode bytes at RIP
0x7f9d192c79ed.
[260259.579549] RSP: 002b:00007ffc8e4ba270 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[260259.587200] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9d192c7a17
[260259.594419] RDX: 0000000000100000 RSI: 00007f9caea03000 RDI:
0000000000000005
[260259.601639] RBP: 00007f9caea03000 R08: 0000000000000000 R09:
0000000000000000
[260259.608859] R10: 00005558e8975680 R11: 0000000000000293 R12:
0000000000100000
[260259.616078] R13: 00005558e8985ec0 R14: 0000000000100000 R15:
00005558e8985ee8
[260259.623298] INFO: task fio:3791109 blocked for more than 120
seconds.
[260259.629827]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260259.637650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260259.645564] task:fio             state:D stack:    0 pid:3791109
ppid:3790838 flags:0x00004080
[260259.654254] Call Trace:
[260259.656794]  __schedule+0x2d1/0x830
[260259.660373]  ? zfs_znode_held+0xe6/0x140 [zfs]
[260259.665081]  schedule+0x35/0xa0
[260259.668313]  cv_wait_common+0x153/0x240 [spl]
[260259.672768]  ? finish_wait+0x80/0x80
[260259.676441]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[260259.682026]  zfs_rangelock_enter_impl+0xbf/0x170 [zfs]
[260259.687432]  zfs_get_data+0x113/0x770 [zfs]
[260259.691876]  zil_lwb_commit+0x537/0x780 [zfs]
[260259.696497]  zil_process_commit_list+0x14c/0x460 [zfs]
[260259.701895]  zil_commit_writer+0xeb/0x160 [zfs]
[260259.706689]  zil_commit_impl+0x5d/0xa0 [zfs]
[260259.711228]  zfs_putpage+0x516/0x590 [zfs]
[260259.715589]  zpl_putpage+0x67/0xd0 [zfs]
[260259.719775]  write_cache_pages+0x197/0x420
[260259.723959]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260259.729013]  zpl_writepages+0x119/0x130 [zfs]
[260259.733632]  do_writepages+0xc2/0x1c0
[260259.737384]  __filemap_fdatawrite_range+0xc7/0x100
[260259.742264]  filemap_write_and_wait_range+0x30/0x80
[260259.747229]  zpl_iter_read_direct+0x86/0x1b0 [zfs]
[260259.752286]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260259.756487]  zpl_iter_read+0x90/0xb0 [zfs]
[260259.760855]  new_sync_read+0x10f/0x150
[260259.764696]  vfs_read+0x91/0x140
[260259.768013]  ksys_read+0x4f/0xb0
[260259.771332]  do_syscall_64+0x5b/0x1a0
[260259.775087]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260259.780225] RIP: 0033:0x7f1bd4687ab4
[260259.783893] Code: Unable to access opcode bytes at RIP
0x7f1bd4687a8a.
[260259.790503] RSP: 002b:00007fff63f65170 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[260259.798157] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f1bd4687ab4
[260259.805377] RDX: 0000000000100000 RSI: 00007f1b69dc3000 RDI:
0000000000000005
[260259.812592] RBP: 00007f1b69dc3000 R08: 0000000000000000 R09:
0000000000000000
[260259.819814] R10: 000000008fd0ea42 R11: 0000000000000246 R12:
0000000000100000
[260259.827032] R13: 000055ca4b405ec0 R14: 0000000000100000 R15:
000055ca4b405ee8
[260382.001731] INFO: task kworker/u128:0:3589938 blocked for more than
120 seconds.
[260382.009227]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260382.017053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260382.024963] task:kworker/u128:0  state:D stack:    0 pid:3589938
ppid:     2 flags:0x80004080
[260382.033568] Workqueue: writeback wb_workfn (flush-zfs-540)
[260382.039141] Call Trace:
[260382.041683]  __schedule+0x2d1/0x830
[260382.045271]  schedule+0x35/0xa0
[260382.048503]  io_schedule+0x12/0x40
[260382.051994]  __lock_page+0x12d/0x230
[260382.055662]  ? file_fdatawait_range+0x20/0x20
[260382.060107]  write_cache_pages+0x1f2/0x420
[260382.064293]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260382.069379]  zpl_writepages+0x98/0x130 [zfs]
[260382.073919]  do_writepages+0xc2/0x1c0
[260382.077672]  __writeback_single_inode+0x39/0x2f0
[260382.082379]  writeback_sb_inodes+0x1e6/0x450
[260382.086738]  __writeback_inodes_wb+0x5f/0xc0
[260382.091097]  wb_writeback+0x247/0x2e0
[260382.094850]  ? get_nr_inodes+0x35/0x50
[260382.098689]  wb_workfn+0x37c/0x4d0
[260382.102181]  ? __switch_to_asm+0x35/0x70
[260382.106194]  ? __switch_to_asm+0x41/0x70
[260382.110207]  ? __switch_to_asm+0x35/0x70
[260382.114221]  ? __switch_to_asm+0x41/0x70
[260382.118231]  ? __switch_to_asm+0x35/0x70
[260382.122244]  ? __switch_to_asm+0x41/0x70
[260382.126256]  ? __switch_to_asm+0x35/0x70
[260382.130273]  ? __switch_to_asm+0x41/0x70
[260382.134284]  process_one_work+0x1a7/0x360
[260382.138384]  worker_thread+0x30/0x390
[260382.142136]  ? create_worker+0x1a0/0x1a0
[260382.146150]  kthread+0x10a/0x120
[260382.149469]  ? set_kthread_struct+0x40/0x40
[260382.153741]  ret_from_fork+0x35/0x40
[260382.157448] INFO: task fio:3791107 blocked for more than 120
seconds.
[260382.163977]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260382.171802] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260382.179715] task:fio             state:D stack:    0 pid:3791107
ppid:3790841 flags:0x00004080
[260382.188409] Call Trace:
[260382.190945]  __schedule+0x2d1/0x830
[260382.194527]  schedule+0x35/0xa0
[260382.197757]  io_schedule+0x12/0x40
[260382.201249]  wait_on_page_bit+0x123/0x220
[260382.205350]  ? xas_load+0x8/0x80
[260382.208668]  ? file_fdatawait_range+0x20/0x20
[260382.213114]  filemap_page_mkwrite+0x9b/0xb0
[260382.217386]  do_page_mkwrite+0x53/0x90
[260382.221227]  ? vm_normal_page+0x1a/0xc0
[260382.225152]  do_wp_page+0x298/0x350
[260382.228733]  __handle_mm_fault+0x44f/0x6c0
[260382.232919]  ? __switch_to_asm+0x41/0x70
[260382.236930]  handle_mm_fault+0xc1/0x1e0
[260382.240856]  do_user_addr_fault+0x1b5/0x440
[260382.245132]  do_page_fault+0x37/0x130
[260382.248883]  ? page_fault+0x8/0x30
[260382.252375]  page_fault+0x1e/0x30
[260382.255781] RIP: 0033:0x7f6deee7f1b4
[260382.259451] Code: Unable to access opcode bytes at RIP
0x7f6deee7f18a.
[260382.266059] RSP: 002b:00007fffe41b6538 EFLAGS: 00010202
[260382.271373] RAX: 00007f6d83049000 RBX: 0000556b63614ec0 RCX:
00007f6d83148fe0
[260382.278591] RDX: 00000000000acfe0 RSI: 00007f6d84e9c030 RDI:
00007f6d8309bfa0
[260382.285813] RBP: 00007f6d84f4a000 R08: ffffffffffffffe0 R09:
0000000000000000
[260382.293030] R10: 00007f6d84f8e810 R11: 00007f6d83049000 R12:
0000000000000001
[260382.300249] R13: 0000556b63614ec0 R14: 0000000000100000 R15:
0000556b63614ee8
[260382.307472] INFO: task fio:3791108 blocked for more than 120
seconds.
[260382.313997]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260382.321823] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260382.329734] task:fio             state:D stack:    0 pid:3791108
ppid:3790835 flags:0x00004080
[260382.338427] Call Trace:
[260382.340967]  __schedule+0x2d1/0x830
[260382.344547]  ? cv_wait_common+0x12d/0x240 [spl]
[260382.349173]  schedule+0x35/0xa0
[260382.352406]  io_schedule+0x12/0x40
[260382.355899]  __lock_page+0x12d/0x230
[260382.359563]  ? file_fdatawait_range+0x20/0x20
[260382.364010]  zfs_putpage+0x148/0x590 [zfs]
[260382.368379]  ? rmap_walk_file+0x116/0x290
[260382.372479]  ? __mod_memcg_lruvec_state+0x5d/0x160
[260382.377358]  zpl_putpage+0x67/0xd0 [zfs]
[260382.381552]  write_cache_pages+0x197/0x420
[260382.385739]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260382.390791]  zpl_writepages+0x119/0x130 [zfs]
[260382.395410]  do_writepages+0xc2/0x1c0
[260382.399161]  ? flush_tlb_func_common.constprop.9+0x125/0x220
[260382.404907]  __filemap_fdatawrite_range+0xc7/0x100
[260382.409790]  filemap_write_and_wait_range+0x30/0x80
[260382.414752]  generic_file_direct_write+0x120/0x160
[260382.419632]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260382.423838]  zpl_iter_write+0xdd/0x160 [zfs]
[260382.428379]  new_sync_write+0x112/0x160
[260382.432304]  vfs_write+0xa5/0x1a0
[260382.435711]  ksys_write+0x4f/0xb0
[260382.439115]  do_syscall_64+0x5b/0x1a0
[260382.442866]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260382.448007] RIP: 0033:0x7f9d192c7a17
[260382.451675] Code: Unable to access opcode bytes at RIP
0x7f9d192c79ed.
[260382.458286] RSP: 002b:00007ffc8e4ba270 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[260382.465938] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9d192c7a17
[260382.473158] RDX: 0000000000100000 RSI: 00007f9caea03000 RDI:
0000000000000005
[260382.480379] RBP: 00007f9caea03000 R08: 0000000000000000 R09:
0000000000000000
[260382.487597] R10: 00005558e8975680 R11: 0000000000000293 R12:
0000000000100000
[260382.494814] R13: 00005558e8985ec0 R14: 0000000000100000 R15:
00005558e8985ee8

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 9, 2024
In commit ba30ec9 I got a little overzealous in code cleanup. While I was trying to remove all the stable page code for Linux, I misinterpreted why Brian Behlendorf originally had the try rangelock, drop page lock, and acquire rangelock sequence in zfs_fillpage(). This is still necessary even without stable pages. It has to occur to avoid a race condition between direct IO writes and pages being faulted in for mmap
files. If the rangelock is not held, then a direct IO write can set
db->db_data = NULL either in:
 1. dmu_write_direct() -> dmu_buf_will_not_fill() ->
    dmu_buf_will_fill() -> dbuf_noread() -> dbuf_clear_data()
 2. dmu_write_direct_done()

Without the rangelock this can cause a panic, as dmu_read_impl() can get a NULL pointer for db->db_data when trying to do the memcpy. So this rangelock must be held in zfs_fillpage() no matter what.

There are further semantics on when the rangelock should be held in zfs_fillpage(): it must only be held when going through zfs_getpage() -> zfs_fillpage(). The reason for this is that mappedread() can call zfs_fillpage() if the page is not uptodate. This can occur because filemap_fault() will first add the pages to the inode's address_space mapping and then drop the page lock, leaving open a window where mappedread() can be called. Since this can occur, mappedread() holds both the page lock and the rangelock. This is perfectly valid and correct. However, it is important in this case to never grab the rangelock in zfs_fillpage(); if that happens, a deadlock will occur.

Finally, it is important to note that the rangelock is first attempted with zfs_rangelock_tryenter(). The reason is that the page lock must be dropped in order to grab the rangelock blocking in this case; otherwise there is a race between zfs_fillpage() and zfs_write() -> update_pages(). In update_pages() the rangelock is already held when the page lock is grabbed, so if the page lock is not dropped before acquiring the rangelock in zfs_fillpage(), there can be a deadlock.
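The tryenter/drop/retry ordering can be illustrated with a small sketch. The locks here are hypothetical stand-ins for the rangelock and the kernel page lock; this mirrors the pattern described above, not the actual zfs_fillpage()/update_pages() code:

```python
import threading

# Hypothetical stand-ins for the ZFS rangelock and the kernel page lock.
rangelock = threading.Lock()
page_lock = threading.Lock()

def fillpage():
    # Like zfs_fillpage(): we arrive with the page already locked.
    page_lock.acquire()
    if not rangelock.acquire(blocking=False):  # the "tryenter" step
        # A writer holds the rangelock and may be waiting on our page
        # lock. Drop the page lock first, then block on the rangelock,
        # restoring the rangelock -> page lock ordering.
        page_lock.release()
        rangelock.acquire()
        page_lock.acquire()
    # ... fill the page under both locks ...
    page_lock.release()
    rangelock.release()

def update_pages():
    # Like zfs_write() -> update_pages(): rangelock first, then page lock.
    with rangelock:
        with page_lock:
            pass  # ... copy the written data into the mapped page ...

threads = [threading.Thread(target=fillpage),
           threading.Thread(target=update_pages)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("no deadlock")
```

If fillpage() instead blocked on the rangelock while still holding the page lock, the two threads could each hold the lock the other needs, which is exactly the deadlock described above.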

Below is a stack trace showing the NULL pointer dereference that was occurring with the dio_mmap ZTS test case before this commit.

[ 7737.430658] BUG: unable to handle kernel NULL pointer dereference at
0000000000000000
[ 7737.438486] PGD 0 P4D 0
[ 7737.441024] Oops: 0000 [openzfs#1] SMP NOPTI
[ 7737.444692] CPU: 33 PID: 599346 Comm: fio Kdump: loaded Tainted: P
OE    --------- -  - 4.18.0-408.el8.x86_64 openzfs#1
[ 7737.455721] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21
10/08/2020
[ 7737.463106] RIP: 0010:__memcpy+0x12/0x20
[ 7737.467032] Code: ff 0f 31 48 c1 e2 20 48 09 c2 48 31 d3 e9 79 ff ff
ff 90 90 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2
07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 a4
[ 7737.485770] RSP: 0000:ffffc1db829e3b60 EFLAGS: 00010246
[ 7737.490987] RAX: ffff9ef195b6f000 RBX: 0000000000001000 RCX:
0000000000000200
[ 7737.498111] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff9ef195b6f000
[ 7737.505235] RBP: ffff9ef195b70000 R08: ffff9eef1d1d0000 R09:
ffff9eef1d1d0000
[ 7737.512358] R10: ffff9eef27968218 R11: 0000000000000000 R12:
0000000000000000
[ 7737.519481] R13: ffff9ef041adb6d8 R14: 00000000005e1000 R15:
0000000000000001
[ 7737.526607] FS:  00007f77fe2bae80(0000) GS:ffff9f0cae840000(0000)
knlGS:0000000000000000
[ 7737.534683] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7737.540423] CR2: 0000000000000000 CR3: 00000003387a6000 CR4:
0000000000350ee0
[ 7737.547553] Call Trace:
[ 7737.550003]  dmu_read_impl+0x11a/0x210 [zfs]
[ 7737.554464]  dmu_read+0x56/0x90 [zfs]
[ 7737.558292]  zfs_fillpage+0x76/0x190 [zfs]
[ 7737.562584]  zfs_getpage+0x4c/0x80 [zfs]
[ 7737.566691]  zpl_readpage_common+0x3b/0x80 [zfs]
[ 7737.571485]  filemap_fault+0x5d6/0xa10
[ 7737.575236]  ? obj_cgroup_charge_pages+0xba/0xd0
[ 7737.579856]  ? xas_load+0x8/0x80
[ 7737.583088]  ? xas_find+0x173/0x1b0
[ 7737.586579]  ? filemap_map_pages+0x84/0x410
[ 7737.590759]  __do_fault+0x38/0xb0
[ 7737.594077]  handle_pte_fault+0x559/0x870
[ 7737.598082]  __handle_mm_fault+0x44f/0x6c0
[ 7737.602181]  handle_mm_fault+0xc1/0x1e0
[ 7737.606019]  do_user_addr_fault+0x1b5/0x440
[ 7737.610207]  do_page_fault+0x37/0x130
[ 7737.613873]  ? page_fault+0x8/0x30
[ 7737.617277]  page_fault+0x1e/0x30
[ 7737.620589] RIP: 0033:0x7f77fbce9140

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 9, 2024
It is important to hold the dbuf mutex (db_mtx) when creating ZIO's in
dmu_read_abd(). The BP that is returned by dmu_buf_get_gp_from_dbuf()
may come from a previous direct IO write. In this case, it is attached
to a dirty record in the dbuf. When zio_read() is called, a copy of the
BP is made through io_bp_copy to io_bp in zio_create(). Without holding
the db_mtx though, the dirty record may be freed in dbuf_read_done().
This can result in garbage being placed in the BP for the ZIO created
through zio_read(). By holding the db_mtx, this race can be avoided.
Below is a stack trace of the issue that was occurring in
vdev_mirror_child_select() without holding the db_mtx while creating
the ZIO.
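The copy-under-mutex rule can be illustrated with a small userspace
model. This is a hedged sketch, not ZFS code: `snapshot_bp`,
`DirtyRecord`, and the dict-based dbuf are invented stand-ins for the
structures named in the commit message:

```python
import copy
import threading

db_mtx = threading.Lock()      # stands in for the dbuf's db_mtx

class DirtyRecord:
    """Minimal stand-in for a dbuf dirty record carrying a BP."""
    def __init__(self, bp):
        self.bp = bp

def snapshot_bp(dbuf):
    # Model of the fix: complete the BP copy while holding db_mtx,
    # since the dirty record can be freed (here: dropped) the moment
    # the mutex is released.
    with db_mtx:
        dr = dbuf.get("dirty_record")
        return copy.deepcopy(dr.bp) if dr is not None else None
```

Once the deep copy is made under the lock, later destruction of the
dirty record cannot corrupt the snapshot the reader works with.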

[29882.427056] VERIFY(zio->io_bp == NULL ||
BP_PHYSICAL_BIRTH(zio->io_bp) == txg) failed
[29882.434884] PANIC at vdev_mirror.c:545:vdev_mirror_child_select()
[29882.440976] Showing stack for process 1865540
[29882.445336] CPU: 57 PID: 1865540 Comm: fio Kdump: loaded Tainted: P
OE    --------- -  - 4.18.0-408.el8.x86_64 openzfs#1
[29882.456457] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21
10/08/2020
[29882.463844] Call Trace:
[29882.466296]  dump_stack+0x41/0x60
[29882.469618]  spl_panic+0xd0/0xe8 [spl]
[29882.473376]  ? __dprintf+0x10e/0x180 [zfs]
[29882.477674]  ? kfree+0xd3/0x250
[29882.480819]  ? __dprintf+0x10e/0x180 [zfs]
[29882.485103]  ? vdev_mirror_map_alloc+0x29/0x50 [zfs]
[29882.490250]  ? vdev_lookup_top+0x20/0x90 [zfs]
[29882.494878]  spl_assert+0x17/0x20 [zfs]
[29882.498893]  vdev_mirror_child_select+0x279/0x300 [zfs]
[29882.504289]  vdev_mirror_io_start+0x11f/0x2b0 [zfs]
[29882.509336]  zio_vdev_io_start+0x3ee/0x520 [zfs]
[29882.514137]  zio_nowait+0x105/0x290 [zfs]
[29882.518330]  dmu_read_abd+0x196/0x460 [zfs]
[29882.522691]  dmu_read_uio_direct+0x6d/0xf0 [zfs]
[29882.527472]  dmu_read_uio_dnode+0x12a/0x140 [zfs]
[29882.532345]  dmu_read_uio_dbuf+0x3f/0x60 [zfs]
[29882.536953]  zfs_read+0x238/0x3f0 [zfs]
[29882.540976]  zpl_iter_read_direct+0xe0/0x180 [zfs]
[29882.545952]  ? rrw_exit+0xc6/0x200 [zfs]
[29882.550058]  zpl_iter_read+0x90/0xb0 [zfs]
[29882.554340]  new_sync_read+0x10f/0x150
[29882.558094]  vfs_read+0x91/0x140
[29882.561325]  ksys_read+0x4f/0xb0
[29882.564557]  do_syscall_64+0x5b/0x1a0
[29882.568222]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[29882.573267] RIP: 0033:0x7f7fe0fa6ab4

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 9, 2024
There existed a race condition that was discovered through the
dio_random test. When doing fio with --fsync=32, after 32 writes fsync
is called on the file. When this happens, blocks committed to the ZIL
will be sync'ed out. However, the code for the O_DIRECT write was
updated in 31983d2 to always wait if there was an associated ARC buf
with the dbuf for all previous TXG's to sync out.

There was an oversight with this update. When waiting on previous TXG's
to sync out, the O_DIRECT write holds the rangelock as a writer the
entire time. This causes an issue when the ZIL commits writes out
through `zfs_get_data()`, because it will try to grab the rangelock as
reader. This leads to a deadlock.

In order to fix this race condition, I updated the `dmu_buf_impl_t`
struct to contain a uint8_t variable that is used to signal that the
dbuf attached to an O_DIRECT write is stalled waiting because of mixed
direct and buffered data. Using this new `db_mixed_io_dio_wait`
variable in the `dmu_buf_impl_t`, the code in `zfs_get_data()` can tell
that the rangelock is already held across the entire block and there is
no need to grab the rangelock at all. Because the rangelock is already
held as a writer across the entire block, no modifications can take
place against the block as long as the O_DIRECT write is stalled
waiting in `dmu_buf_direct_mixed_io_wait()`.
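The flag-based handoff can be sketched in userspace. This is an
illustrative model under stated assumptions: `Dbuf` and `get_data` are
invented stand-ins, and only the field name `db_mixed_io_dio_wait` is
taken from the commit message:

```python
import threading

rangelock = threading.Lock()   # stands in for the file's rangelock

class Dbuf:
    def __init__(self):
        # Models the new uint8_t field added to dmu_buf_impl_t.
        self.db_mixed_io_dio_wait = False

def get_data(db):
    # Model of the zfs_get_data() change: if a stalled Direct I/O
    # writer already holds the rangelock across the whole block, the
    # flag tells us to skip taking it, avoiding the deadlock.
    if db.db_mixed_io_dio_wait:
        return "skipped-rangelock"
    with rangelock:
        return "took-rangelock"
```

The flag is safe to rely on precisely because the writer holds the
rangelock for the whole block, so no concurrent modification can slip
in while the lock acquisition is skipped.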

Also as part of this update, I realized the `db_state` in
`dmu_buf_direct_mixed_io_wait()` needs to be changed temporarily to
`DB_CACHED`. This is necessary so the logic in `dbuf_read()` is correct
if `dmu_sync_late_arrival()` is called by `dmu_sync()`. It is
completely valid to switch the `db_state` back to `DB_CACHED`, as there
is still an associated ARC buf that will not be freed until the
O_DIRECT write completes, which only happens after it leaves
`dmu_buf_direct_mixed_io_wait()`.

Here is the stack trace of the deadlock that happened with
`dio_random.ksh` before this commit:
[ 5513.663402] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 7496.580415] INFO: task z_wr_int:1098000 blocked for more than 120
seconds.
[ 7496.585709]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7496.593349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7496.600839] task:z_wr_int        state:D stack:    0 pid:1098000
ppid:     2 flags:0x80004080
[ 7496.608622] Call Trace:
[ 7496.611770]  __schedule+0x2d1/0x870
[ 7496.615404]  schedule+0x55/0xf0
[ 7496.618866]  cv_wait_common+0x16d/0x280 [spl]
[ 7496.622910]  ? finish_wait+0x80/0x80
[ 7496.626601]  dmu_buf_direct_mixed_io_wait+0x73/0x190 [zfs]
[ 7496.631327]  dmu_write_direct_done+0x90/0x3a0 [zfs]
[ 7496.635798]  zio_done+0x373/0x1d40 [zfs]
[ 7496.639795]  zio_execute+0xee/0x210 [zfs]
[ 7496.643840]  taskq_thread+0x203/0x420 [spl]
[ 7496.647836]  ? wake_up_q+0x70/0x70
[ 7496.651411]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[ 7496.656489]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 7496.660604]  kthread+0x134/0x150
[ 7496.664092]  ? set_kthread_struct+0x50/0x50
[ 7496.668080]  ret_from_fork+0x35/0x40
[ 7496.671745] INFO: task txg_sync:1098025 blocked for more than 120
seconds.
[ 7496.676991]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7496.684666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7496.692060] task:txg_sync        state:D stack:    0 pid:1098025
ppid:     2 flags:0x80004080
[ 7496.699888] Call Trace:
[ 7496.703012]  __schedule+0x2d1/0x870
[ 7496.706658]  schedule+0x55/0xf0
[ 7496.710093]  schedule_timeout+0x197/0x300
[ 7496.713982]  ? __next_timer_interrupt+0xf0/0xf0
[ 7496.718135]  io_schedule_timeout+0x19/0x40
[ 7496.722049]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7496.726349]  ? finish_wait+0x80/0x80
[ 7496.730039]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7496.734100]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7496.738082]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 7496.742205]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7496.746534]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 7496.750842]  spa_sync_iterate_to_convergence+0xcf/0x310 [zfs]
[ 7496.755742]  spa_sync+0x362/0x8d0 [zfs]
[ 7496.759689]  txg_sync_thread+0x274/0x3b0 [zfs]
[ 7496.763928]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 7496.768439]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 7496.772799]  thread_generic_wrapper+0x63/0x90 [spl]
[ 7496.777097]  kthread+0x134/0x150
[ 7496.780616]  ? set_kthread_struct+0x50/0x50
[ 7496.784549]  ret_from_fork+0x35/0x40
[ 7496.788204] INFO: task fio:1101750 blocked for more than 120 seconds.
[ 7496.895852]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7496.903765] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7496.911170] task:fio             state:D stack:    0 pid:1101750
ppid:1101741 flags:0x00004080
[ 7496.919033] Call Trace:
[ 7496.922136]  __schedule+0x2d1/0x870
[ 7496.925769]  schedule+0x55/0xf0
[ 7496.929245]  schedule_timeout+0x197/0x300
[ 7496.933120]  ? __next_timer_interrupt+0xf0/0xf0
[ 7496.937213]  io_schedule_timeout+0x19/0x40
[ 7496.941126]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7496.945444]  ? finish_wait+0x80/0x80
[ 7496.949125]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7496.953191]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7496.957180]  dmu_write_abd+0x174/0x1c0 [zfs]
[ 7496.961319]  dmu_write_uio_direct+0x79/0xf0 [zfs]
[ 7496.965731]  dmu_write_uio_dnode+0xa6/0x2d0 [zfs]
[ 7496.970043]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 7496.974305]  zfs_write+0x55f/0xea0 [zfs]
[ 7496.978325]  ? iov_iter_get_pages+0xe9/0x390
[ 7496.982333]  ? trylock_page+0xd/0x20 [zfs]
[ 7496.986451]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7496.990713]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7496.995031]  zpl_iter_write_direct+0xda/0x170 [zfs]
[ 7496.999489]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.003476]  zpl_iter_write+0xd5/0x110 [zfs]
[ 7497.007610]  new_sync_write+0x112/0x160
[ 7497.011429]  vfs_write+0xa5/0x1b0
[ 7497.014916]  ksys_write+0x4f/0xb0
[ 7497.018443]  do_syscall_64+0x5b/0x1b0
[ 7497.022150]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.026532] RIP: 0033:0x7f8771d72a17
[ 7497.030195] Code: Unable to access opcode bytes at RIP
0x7f8771d729ed.
[ 7497.035263] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[ 7497.042547] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72a17
[ 7497.047933] RDX: 000000000009b000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7497.053269] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7497.058660] R10: 0000000000000000 R11: 0000000000000293 R12:
000000000009b000
[ 7497.063960] R13: 000055b390afcac0 R14: 000000000009b000 R15:
000055b390afcae8
[ 7497.069334] INFO: task fio:1101751 blocked for more than 120 seconds.
[ 7497.074308]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7497.081973] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7497.089371] task:fio             state:D stack:    0 pid:1101751
ppid:1101741 flags:0x00000080
[ 7497.097147] Call Trace:
[ 7497.100263]  __schedule+0x2d1/0x870
[ 7497.103897]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.107878]  schedule+0x55/0xf0
[ 7497.111386]  cv_wait_common+0x16d/0x280 [spl]
[ 7497.115391]  ? finish_wait+0x80/0x80
[ 7497.119028]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 7497.123667]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 7497.128240]  zfs_read+0xaf/0x3f0 [zfs]
[ 7497.132146]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.136091]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7497.140366]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7497.144679]  zpl_iter_read_direct+0xe0/0x180 [zfs]
[ 7497.149054]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.153040]  zpl_iter_read+0x94/0xb0 [zfs]
[ 7497.157103]  new_sync_read+0x10f/0x160
[ 7497.160855]  vfs_read+0x91/0x150
[ 7497.164336]  ksys_read+0x4f/0xb0
[ 7497.168004]  do_syscall_64+0x5b/0x1b0
[ 7497.171706]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.176105] RIP: 0033:0x7f8771d72ab4
[ 7497.179742] Code: Unable to access opcode bytes at RIP
0x7f8771d72a8a.
[ 7497.184807] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[ 7497.192129] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72ab4
[ 7497.197485] RDX: 0000000000002000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7497.202922] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7497.208309] R10: 00000001ffffffff R11: 0000000000000246 R12:
0000000000002000
[ 7497.213694] R13: 000055b390afcac0 R14: 0000000000002000 R15:
000055b390afcae8
[ 7497.219063] INFO: task fio:1101755 blocked for more than 120 seconds.
[ 7497.224098]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7497.231786] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7497.239165] task:fio             state:D stack:    0 pid:1101755
ppid:1101744 flags:0x00000080
[ 7497.246989] Call Trace:
[ 7497.250121]  __schedule+0x2d1/0x870
[ 7497.253779]  schedule+0x55/0xf0
[ 7497.257240]  schedule_preempt_disabled+0xa/0x10
[ 7497.261344]  __mutex_lock.isra.7+0x349/0x420
[ 7497.265326]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7497.269674]  zil_commit_writer+0x89/0x230 [zfs]
[ 7497.273938]  zil_commit_impl+0x5f/0xd0 [zfs]
[ 7497.278101]  zfs_fsync+0x81/0xa0 [zfs]
[ 7497.282002]  zpl_fsync+0xe5/0x140 [zfs]
[ 7497.285985]  do_fsync+0x38/0x70
[ 7497.289458]  __x64_sys_fsync+0x10/0x20
[ 7497.293208]  do_syscall_64+0x5b/0x1b0
[ 7497.296928]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.301260] RIP: 0033:0x7f9559073027
[ 7497.304920] Code: Unable to access opcode bytes at RIP
0x7f9559072ffd.
[ 7497.310015] RSP: 002b:00007ffdefcd0ff0 EFLAGS: 00000293 ORIG_RAX:
000000000000004a
[ 7497.317346] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9559073027
[ 7497.322722] RDX: 0000000000000000 RSI: 0000563adcbf2ac0 RDI:
0000000000000005
[ 7497.328126] RBP: 00007f94fb858000 R08: 0000000000000000 R09:
0000000000000000
[ 7497.333514] R10: 0000000000008000 R11: 0000000000000293 R12:
0000000000000003
[ 7497.338887] R13: 0000563adcbf2ac0 R14: 0000000000000000 R15:
0000563adcbf2ae8
[ 7497.344247] INFO: task fio:1101756 blocked for more than 120 seconds.
[ 7497.349327]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7497.357032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7497.364517] task:fio             state:D stack:    0 pid:1101756
ppid:1101744 flags:0x00004080
[ 7497.372310] Call Trace:
[ 7497.375433]  __schedule+0x2d1/0x870
[ 7497.379004]  schedule+0x55/0xf0
[ 7497.382454]  cv_wait_common+0x16d/0x280 [spl]
[ 7497.386477]  ? finish_wait+0x80/0x80
[ 7497.390137]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 7497.394816]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 7497.399397]  zfs_get_data+0x1a8/0x7e0 [zfs]
[ 7497.403515]  zil_lwb_commit+0x1a5/0x400 [zfs]
[ 7497.407712]  zil_lwb_write_close+0x408/0x630 [zfs]
[ 7497.412126]  zil_commit_waiter_timeout+0x16d/0x520 [zfs]
[ 7497.416801]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[ 7497.421139]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 7497.425294]  zfs_fsync+0x81/0xa0 [zfs]
[ 7497.429454]  zpl_fsync+0xe5/0x140 [zfs]
[ 7497.433396]  do_fsync+0x38/0x70
[ 7497.436878]  __x64_sys_fsync+0x10/0x20
[ 7497.440586]  do_syscall_64+0x5b/0x1b0
[ 7497.444313]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.448659] RIP: 0033:0x7f9559073027
[ 7497.452343] Code: Unable to access opcode bytes at RIP
0x7f9559072ffd.
[ 7497.457408] RSP: 002b:00007ffdefcd0ff0 EFLAGS: 00000293 ORIG_RAX:
000000000000004a
[ 7497.464724] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9559073027
[ 7497.470106] RDX: 0000000000000000 RSI: 0000563adcbf2ac0 RDI:
0000000000000005
[ 7497.475477] RBP: 00007f94fb89ca18 R08: 0000000000000000 R09:
0000000000000000
[ 7497.480806] R10: 00000000000b4cc0 R11: 0000000000000293 R12:
0000000000000003
[ 7497.486158] R13: 0000563adcbf2ac0 R14: 0000000000000000 R15:
0000563adcbf2ae8
[ 7619.459402] INFO: task z_wr_int:1098000 blocked for more than 120
seconds.
[ 7619.464605]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7619.472233] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.479659] task:z_wr_int        state:D stack:    0 pid:1098000
ppid:     2 flags:0x80004080
[ 7619.487518] Call Trace:
[ 7619.490650]  __schedule+0x2d1/0x870
[ 7619.494246]  schedule+0x55/0xf0
[ 7619.497719]  cv_wait_common+0x16d/0x280 [spl]
[ 7619.501749]  ? finish_wait+0x80/0x80
[ 7619.505411]  dmu_buf_direct_mixed_io_wait+0x73/0x190 [zfs]
[ 7619.510143]  dmu_write_direct_done+0x90/0x3a0 [zfs]
[ 7619.514603]  zio_done+0x373/0x1d40 [zfs]
[ 7619.518594]  zio_execute+0xee/0x210 [zfs]
[ 7619.522619]  taskq_thread+0x203/0x420 [spl]
[ 7619.526567]  ? wake_up_q+0x70/0x70
[ 7619.530208]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[ 7619.535302]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 7619.539385]  kthread+0x134/0x150
[ 7619.542873]  ? set_kthread_struct+0x50/0x50
[ 7619.546810]  ret_from_fork+0x35/0x40
[ 7619.550477] INFO: task txg_sync:1098025 blocked for more than 120
seconds.
[ 7619.555715]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7619.563415] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.570851] task:txg_sync        state:D stack:    0 pid:1098025
ppid:     2 flags:0x80004080
[ 7619.578606] Call Trace:
[ 7619.581742]  __schedule+0x2d1/0x870
[ 7619.585396]  schedule+0x55/0xf0
[ 7619.589006]  schedule_timeout+0x197/0x300
[ 7619.592916]  ? __next_timer_interrupt+0xf0/0xf0
[ 7619.597027]  io_schedule_timeout+0x19/0x40
[ 7619.600947]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7619.709878]  ? finish_wait+0x80/0x80
[ 7619.713565]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7619.717596]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7619.721567]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 7619.725657]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7619.730050]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 7619.734415]  spa_sync_iterate_to_convergence+0xcf/0x310 [zfs]
[ 7619.739268]  spa_sync+0x362/0x8d0 [zfs]
[ 7619.743270]  txg_sync_thread+0x274/0x3b0 [zfs]
[ 7619.747494]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 7619.751939]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 7619.756279]  thread_generic_wrapper+0x63/0x90 [spl]
[ 7619.760569]  kthread+0x134/0x150
[ 7619.764050]  ? set_kthread_struct+0x50/0x50
[ 7619.767978]  ret_from_fork+0x35/0x40
[ 7619.771639] INFO: task fio:1101750 blocked for more than 120 seconds.
[ 7619.776678]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7619.784324] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.791914] task:fio             state:D stack:    0 pid:1101750
ppid:1101741 flags:0x00004080
[ 7619.799712] Call Trace:
[ 7619.802816]  __schedule+0x2d1/0x870
[ 7619.806427]  schedule+0x55/0xf0
[ 7619.809867]  schedule_timeout+0x197/0x300
[ 7619.813760]  ? __next_timer_interrupt+0xf0/0xf0
[ 7619.817848]  io_schedule_timeout+0x19/0x40
[ 7619.821766]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7619.826097]  ? finish_wait+0x80/0x80
[ 7619.829780]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7619.833857]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7619.837838]  dmu_write_abd+0x174/0x1c0 [zfs]
[ 7619.842015]  dmu_write_uio_direct+0x79/0xf0 [zfs]
[ 7619.846388]  dmu_write_uio_dnode+0xa6/0x2d0 [zfs]
[ 7619.850760]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 7619.855011]  zfs_write+0x55f/0xea0 [zfs]
[ 7619.859008]  ? iov_iter_get_pages+0xe9/0x390
[ 7619.863036]  ? trylock_page+0xd/0x20 [zfs]
[ 7619.867084]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7619.871366]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7619.875715]  zpl_iter_write_direct+0xda/0x170 [zfs]
[ 7619.880164]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7619.884174]  zpl_iter_write+0xd5/0x110 [zfs]
[ 7619.888492]  new_sync_write+0x112/0x160
[ 7619.892285]  vfs_write+0xa5/0x1b0
[ 7619.895829]  ksys_write+0x4f/0xb0
[ 7619.899384]  do_syscall_64+0x5b/0x1b0
[ 7619.903071]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7619.907394] RIP: 0033:0x7f8771d72a17
[ 7619.911026] Code: Unable to access opcode bytes at RIP
0x7f8771d729ed.
[ 7619.916073] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[ 7619.923363] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72a17
[ 7619.928675] RDX: 000000000009b000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7619.934019] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7619.939354] R10: 0000000000000000 R11: 0000000000000293 R12:
000000000009b000
[ 7619.944775] R13: 000055b390afcac0 R14: 000000000009b000 R15:
000055b390afcae8
[ 7619.950175] INFO: task fio:1101751 blocked for more than 120 seconds.
[ 7619.955232]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7619.962889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.970301] task:fio             state:D stack:    0 pid:1101751
ppid:1101741 flags:0x00000080
[ 7619.978139] Call Trace:
[ 7619.981278]  __schedule+0x2d1/0x870
[ 7619.984872]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7619.989260]  schedule+0x55/0xf0
[ 7619.992725]  cv_wait_common+0x16d/0x280 [spl]
[ 7619.996754]  ? finish_wait+0x80/0x80
[ 7620.000414]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 7620.005050]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 7620.009617]  zfs_read+0xaf/0x3f0 [zfs]
[ 7620.013503]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7620.017489]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7620.021774]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7620.026091]  zpl_iter_read_direct+0xe0/0x180 [zfs]
[ 7620.030508]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7620.034497]  zpl_iter_read+0x94/0xb0 [zfs]
[ 7620.038579]  new_sync_read+0x10f/0x160
[ 7620.042325]  vfs_read+0x91/0x150
[ 7620.045809]  ksys_read+0x4f/0xb0
[ 7620.049273]  do_syscall_64+0x5b/0x1b0
[ 7620.052965]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7620.057354] RIP: 0033:0x7f8771d72ab4
[ 7620.060988] Code: Unable to access opcode bytes at RIP
0x7f8771d72a8a.
[ 7620.066041] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[ 7620.073256] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72ab4
[ 7620.078553] RDX: 0000000000002000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7620.083878] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7620.089353] R10: 00000001ffffffff R11: 0000000000000246 R12:
0000000000002000
[ 7620.094697] R13: 000055b390afcac0 R14: 0000000000002000 R15:
000055b390afcae8

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 9, 2024
995734e added a test for block cloning with mmap files. As a result I
began hitting a panic in that test in dbuf_unoverride(). The ASSERT was
that if the dirty record is from a cloned block, then its dr_data must
be set to NULL. This ASSERT was added in 86e115e. The point of that
commit was to make sure that if a cloned block is read before it is
synced out, then the associated ARC buffer is set in the dirty record.

This became an issue with the O_DIRECT code, because dr_data was set to
the ARC buf in dbuf_set_data() after the read. This is incorrect logic
for the cloned block, though. In order to fix this issue, I refined how
to determine if the dirty record is in fact from an O_DIRECT write by
making sure that dr_brtwrite is false. I created the function
dbuf_dirty_is_direct_write() to perform the proper check.

As part of this, I also cleaned up other code that did the exact same
check for an O_DIRECT write to make sure the proper check is taking
place everywhere.
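The refined check can be modeled as a tiny predicate. This is a
userspace sketch, not the ZFS function: the dict keys simply mirror the
field names `dr_data` and `dr_brtwrite` from the commit message:

```python
def dirty_is_direct_write(dr):
    # Model of the dbuf_dirty_is_direct_write() idea: a dirty record
    # with no ARC data is a Direct I/O write only when it is not a
    # block-clone record (dr_brtwrite), since cloned blocks also carry
    # a NULL dr_data.
    return dr.get("dr_data") is None and not dr.get("dr_brtwrite", False)
```

Without the `dr_brtwrite` test, a cloned block's record would be
misclassified as a Direct I/O write, which is exactly what tripped the
ASSERT.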

The trace of the ASSERT that was being tripped before this change is
below:
[3649972.811039] VERIFY0P(dr->dt.dl.dr_data) failed (NULL ==
ffff8d58e8183c80)
[3649972.817999] PANIC at dbuf.c:1999:dbuf_unoverride()
[3649972.822968] Showing stack for process 2365657
[3649972.827502] CPU: 0 PID: 2365657 Comm: clone_mmap_writ Kdump: loaded
Tainted: P           OE    --------- -  - 4.18.0-408.el8.x86_64 openzfs#1
[3649972.839749] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS
R21 10/08/2020
[3649972.847315] Call Trace:
[3649972.849935]  dump_stack+0x41/0x60
[3649972.853428]  spl_panic+0xd0/0xe8 [spl]
[3649972.857370]  ? cityhash4+0x75/0x90 [zfs]
[3649972.861649]  ? _cond_resched+0x15/0x30
[3649972.865577]  ? spl_kmem_alloc_impl+0xce/0xf0 [spl]
[3649972.870548]  ? __kmalloc_node+0x10d/0x300
[3649972.874735]  ? spl_kmem_alloc_impl+0xce/0xf0 [spl]
[3649972.879702]  ? __list_add+0x12/0x30 [zfs]
[3649972.884061]  dbuf_unoverride+0x1c1/0x1d0 [zfs]
[3649972.888856]  dbuf_redirty+0x3b/0xd0 [zfs]
[3649972.893204]  dbuf_dirty+0xeb1/0x1330 [zfs]
[3649972.897643]  ? _cond_resched+0x15/0x30
[3649972.901569]  ? mutex_lock+0xe/0x30
[3649972.905148]  ? dbuf_noread+0x117/0x240 [zfs]
[3649972.909760]  dmu_write_uio_dnode+0x1d2/0x320 [zfs]
[3649972.914900]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[3649972.919777]  zfs_write+0x57d/0xe00 [zfs]
[3649972.924076]  ? alloc_set_pte+0xb8/0x3e0
[3649972.928088]  zpl_iter_write_buffered+0xb2/0x120 [zfs]
[3649972.933507]  ? rrw_exit+0xc6/0x200 [zfs]
[3649972.937796]  zpl_iter_write+0xba/0x110 [zfs]
[3649972.942433]  new_sync_write+0x112/0x160
[3649972.946445]  vfs_write+0xa5/0x1a0
[3649972.949935]  ksys_pwrite64+0x61/0xa0
[3649972.953681]  do_syscall_64+0x5b/0x1a0
[3649972.957519]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[3649972.962745] RIP: 0033:0x7f610616f01b

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 9, 2024
Originally I was checking dr->dr_dbuf->db_level == 0 in
dbuf_dirty_is_direct_write(). However, this can lead to a NULL pointer
dereference if dr_dbuf is no longer set.

I updated dbuf_dirty_is_direct_write() to now also take a
dmu_buf_impl_t to check if db->db_level == 0. This failure was caught
on the Fedora 37 CI while running the enospc_rm test. Below is the
stack trace.

[ 9851.511608] BUG: kernel NULL pointer dereference, address:
0000000000000068
[ 9851.515922] #PF: supervisor read access in kernel mode
[ 9851.519462] #PF: error_code(0x0000) - not-present page
[ 9851.522992] PGD 0 P4D 0
[ 9851.525684] Oops: 0000 [openzfs#1] PREEMPT SMP PTI
[ 9851.528878] CPU: 0 PID: 1272993 Comm: fio Tainted: P           OE
6.5.12-100.fc37.x86_64 openzfs#1
[ 9851.535266] Hardware name: Amazon EC2 m5d.large/, BIOS 1.0 10/16/2017
[ 9851.539226] RIP: 0010:dbuf_dirty_is_direct_write+0xb/0x40 [zfs]
[ 9851.543379] Code: 10 74 02 31 c0 5b c3 cc cc cc cc 0f 1f 40 00 90 90
90 90 90 90 90 90 90 90 90 90 90 90 90 90 31 c0 48 85 ff 74 31 48 8b 57
20 <80> 7a 68 00 75 27 8b 87 64 01 00 00 85 c0 75 1b 83 bf 58 01 00 00
[ 9851.554719] RSP: 0018:ffff9b5b8305f8e8 EFLAGS: 00010286
[ 9851.558276] RAX: 0000000000000000 RBX: ffff9b5b8569b0b8 RCX:
0000000000000000
[ 9851.562481] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff8f2e97de9e00
[ 9851.566672] RBP: 0000000000020000 R08: 0000000000000000 R09:
ffff8f2f70e94000
[ 9851.570851] R10: 0000000000000001 R11: 0000000000000110 R12:
ffff8f2f774ae4c0
[ 9851.575032] R13: 0000000000000000 R14: 0000000000000000 R15:
0000000000000000
[ 9851.579209] FS:  00007f57c5542240(0000) GS:ffff8f2faa800000(0000)
knlGS:0000000000000000
[ 9851.585357] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9851.589064] CR2: 0000000000000068 CR3: 00000001f9a38001 CR4:
00000000007706f0
[ 9851.593256] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 9851.597440] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 9851.601618] PKRU: 55555554
[ 9851.604341] Call Trace:
[ 9851.606981]  <TASK>
[ 9851.609515]  ? __die+0x23/0x70
[ 9851.612388]  ? page_fault_oops+0x171/0x4e0
[ 9851.615571]  ? exc_page_fault+0x77/0x170
[ 9851.618704]  ? asm_exc_page_fault+0x26/0x30
[ 9851.621900]  ? dbuf_dirty_is_direct_write+0xb/0x40 [zfs]
[ 9851.625828]  zfs_get_data+0x407/0x820 [zfs]
[ 9851.629400]  zil_lwb_commit+0x18d/0x3f0 [zfs]
[ 9851.633026]  zil_lwb_write_issue+0x92/0xbb0 [zfs]
[ 9851.636758]  zil_commit_waiter_timeout+0x1f3/0x580 [zfs]
[ 9851.640696]  zil_commit_waiter+0x1ff/0x3a0 [zfs]
[ 9851.644402]  zil_commit_impl+0x71/0xd0 [zfs]
[ 9851.647998]  zfs_write+0xb51/0xdc0 [zfs]
[ 9851.651467]  zpl_iter_write_buffered+0xc9/0x140 [zfs]
[ 9851.655337]  zpl_iter_write+0xc0/0x110 [zfs]
[ 9851.658920]  vfs_write+0x23e/0x420
[ 9851.661871]  __x64_sys_pwrite64+0x98/0xd0
[ 9851.665013]  do_syscall_64+0x5f/0x90
[ 9851.668027]  ? ksys_fadvise64_64+0x57/0xa0
[ 9851.671212]  ? syscall_exit_to_user_mode+0x2b/0x40
[ 9851.674594]  ? do_syscall_64+0x6b/0x90
[ 9851.677655]  ? syscall_exit_to_user_mode+0x2b/0x40
[ 9851.681051]  ? do_syscall_64+0x6b/0x90
[ 9851.684128]  ? exc_page_fault+0x77/0x170
[ 9851.687256]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 9851.690759] RIP: 0033:0x7f57c563c377

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 9, 2024
There existed a race condition between when a Direct I/O write could
complete and when a sync operation was issued. This was due to the fact
that a Direct I/O write would sleep waiting on previous TXG's to sync
out their dirty records associated with a dbuf if there was an ARC
buffer associated with the dbuf. This was necessary to safely destroy
the ARC buffer in case a previous dirty record's dr_data pointed at the
db_buf. The main issue with this approach is that a Direct I/O write
holds the rangelock across the entire block, so when a sync on that
same block was issued and tried to grab the rangelock as reader, it
would be blocked indefinitely because the Direct I/O write that was now
sleeping was holding that same rangelock as writer. This led to a
complete deadlock.

This commit fixes this issue and removes the wait in
dmu_write_direct_done().

The way this is now handled is that the ARC buffer is destroyed, if
there is one associated with the dbuf, before ever issuing the Direct
I/O write. This implementation heavily borrows from the block cloning
implementation.

A new function dmu_buf_will_clone_or_dio() is called in both
dmu_write_direct() and dmu_brt_clone() that does the following:
1. Undirties a dirty record for that db if there is one currently
   associated with the current TXG.
2. Destroys the ARC buffer if the previous dirty record dr_data does not
   point at the dbufs ARC buffer (db_buf).
3. Sets the dbufs data pointers to NULL.
4. Redirties the dbuf using db_state = DB_NOFILL.
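The four steps above can be sketched as a userspace model. This is an
illustrative approximation under stated assumptions: the dict-based
`db`, the key names, and the function body are invented stand-ins, with
only the step ordering taken from the commit message:

```python
def will_clone_or_dio(db, txg):
    # Model of the four steps; db is a dict standing in for a
    # dmu_buf_impl_t, dirty records are keyed by TXG.
    db["dirty_records"].pop(txg, None)               # 1. undirty current TXG
    prev_txg = max(db["dirty_records"], default=None)
    prev_dr = db["dirty_records"].get(prev_txg)
    if db["db_buf"] is not None and (
            prev_dr is None or prev_dr["dr_data"] is not db["db_buf"]):
        db["db_buf"] = None                          # 2. destroy ARC buf if
                                                     #    no prior record
                                                     #    still points at it
    db["db_data"] = None                             # 3. clear data pointers
    db["db_state"] = "DB_NOFILL"                     # 4. redirty as NOFILL
    db["dirty_records"][txg] = {"dr_data": None}
```

The key invariant step 2 models is that the buffer is only torn down
when no earlier dirty record still references it; otherwise it must
survive until that record syncs out.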

As part of this commit, the dmu_write_direct_done() function was also
cleaned up. Now dmu_sync_done() is called before undirtying the dbuf
dirty record associated with a failed Direct I/O write. This is correct
logic and how it always should have been.

An additional benefit of these modifications is that there is no longer
a stall in a Direct I/O write if the user is mixing buffered and
O_DIRECT I/O together. They also unify the block cloning and Direct I/O
write paths, as both need to call dbuf_fix_old_data() before destroying
the ARC buffer.

As part of this commit, there is also general code cleanup. Various
dbuf stats were removed because they are no longer necessary.
Additionally, useless functions were removed to make the code paths
cleaner for Direct I/O.

Below is the race condition stack trace that was being consistently
observed in the CI runs for the dio_random test case that prompted
these changes:
[ 7795.294473] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 9954.769075] INFO: task z_wr_int:1051869 blocked for more than
120
seconds.
[ 9954.770512]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.772159] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.773848] task:z_wr_int        state:D stack:0
pid:1051869
ppid:2      flags:0x80004080
[ 9954.775512] Call Trace:
[ 9954.776406]  __schedule+0x2d1/0x870
[ 9954.777386]  ? free_one_page+0x204/0x530
[ 9954.778466]  schedule+0x55/0xf0
[ 9954.779355]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.780491]  ? finish_wait+0x80/0x80
[ 9954.781450]  dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs]
[ 9954.782889]  dmu_write_direct_done+0x90/0x3b0 [zfs]
[ 9954.784255]  zio_done+0x373/0x1d50 [zfs]
[ 9954.785410]  zio_execute+0xee/0x210 [zfs]
[ 9954.786588]  taskq_thread+0x205/0x3f0 [spl]
[ 9954.787673]  ? wake_up_q+0x60/0x60
[ 9954.788571]  ? zio_execute_stack_check.constprop.1+0x10/0x10
[zfs]
[ 9954.790079]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 9954.791199]  kthread+0x134/0x150
[ 9954.792082]  ? set_kthread_struct+0x50/0x50
[ 9954.793189]  ret_from_fork+0x35/0x40
[ 9954.794108] INFO: task txg_sync:1051894 blocked for more than
120
seconds.
[ 9954.795535]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.797103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.798669] task:txg_sync        state:D stack:0
pid:1051894
ppid:2      flags:0x80004080
[ 9954.800267] Call Trace:
[ 9954.801096]  __schedule+0x2d1/0x870
[ 9954.801972]  ? __wake_up_common+0x7a/0x190
[ 9954.802963]  schedule+0x55/0xf0
[ 9954.803884]  schedule_timeout+0x19f/0x320
[ 9954.804837]  ? __next_timer_interrupt+0xf0/0xf0
[ 9954.805932]  ? taskq_dispatch+0xab/0x280 [spl]
[ 9954.806959]  io_schedule_timeout+0x19/0x40
[ 9954.807989]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 9954.809110]  ? finish_wait+0x80/0x80
[ 9954.810068]  __cv_timedwait_io+0x15/0x20 [spl]
[ 9954.811103]  zio_wait+0x1ad/0x4f0 [zfs]
[ 9954.812255]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 9954.813442]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 9954.814648]  spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
[ 9954.816023]  spa_sync+0x362/0x8f0 [zfs]
[ 9954.817110]  txg_sync_thread+0x27a/0x3b0 [zfs]
[ 9954.818267]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 9954.819510]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 9954.820643]  thread_generic_wrapper+0x63/0x90 [spl]
[ 9954.821709]  kthread+0x134/0x150
[ 9954.822590]  ? set_kthread_struct+0x50/0x50
[ 9954.823584]  ret_from_fork+0x35/0x40
[ 9954.824444] INFO: task fio:1055501 blocked for more than 120
seconds.
[ 9954.825781]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.827315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.828871] task:fio             state:D stack:0
pid:1055501
ppid:1055490 flags:0x00004080
[ 9954.830463] Call Trace:
[ 9954.831280]  __schedule+0x2d1/0x870
[ 9954.832159]  ? dbuf_hold_copy+0xec/0x230 [zfs]
[ 9954.833396]  schedule+0x55/0xf0
[ 9954.834286]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.835291]  ? finish_wait+0x80/0x80
[ 9954.836235]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 9954.837543]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 9954.838838]  zfs_get_data+0x566/0x810 [zfs]
[ 9954.840034]  zil_lwb_commit+0x194/0x3f0 [zfs]
[ 9954.841154]  zil_lwb_write_issue+0x68/0xb90 [zfs]
[ 9954.842367]  ? __list_add+0x12/0x30 [zfs]
[ 9954.843496]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 9954.844665]  ? zil_alloc_lwb+0x217/0x360 [zfs]
[ 9954.845852]  zil_commit_waiter_timeout+0x1f3/0x570 [zfs]
[ 9954.847203]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[ 9954.848380]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 9954.849550]  zfs_fsync+0x66/0x90 [zfs]
[ 9954.850640]  zpl_fsync+0xe5/0x140 [zfs]
[ 9954.851729]  do_fsync+0x38/0x70
[ 9954.852585]  __x64_sys_fsync+0x10/0x20
[ 9954.853486]  do_syscall_64+0x5b/0x1b0
[ 9954.854416]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.855466] RIP: 0033:0x7eff236bb057
[ 9954.856388] Code: Unable to access opcode bytes at RIP
0x7eff236bb02d.
[ 9954.857651] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293
ORIG_RAX:
000000000000004a
[ 9954.859141] RAX: ffffffffffffffda RBX: 0000000000000006 RCX:
00007eff236bb057
[ 9954.860496] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI:
0000000000000006
[ 9954.861945] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09:
0000000000000000
[ 9954.863327] R10: 0000000000056000 R11: 0000000000000293 R12:
0000000000000003
[ 9954.864765] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15:
000055e4d1f13ae8
[ 9954.866149] INFO: task fio:1055502 blocked for more than 120
seconds.
[ 9954.867490]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.869029] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.870571] task:fio             state:D stack:0
pid:1055502
ppid:1055490 flags:0x00004080
[ 9954.872162] Call Trace:
[ 9954.872947]  __schedule+0x2d1/0x870
[ 9954.873844]  schedule+0x55/0xf0
[ 9954.874716]  schedule_timeout+0x19f/0x320
[ 9954.875645]  ? __next_timer_interrupt+0xf0/0xf0
[ 9954.876722]  io_schedule_timeout+0x19/0x40
[ 9954.877677]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 9954.878822]  ? finish_wait+0x80/0x80
[ 9954.879694]  __cv_timedwait_io+0x15/0x20 [spl]
[ 9954.880763]  zio_wait+0x1ad/0x4f0 [zfs]
[ 9954.881865]  dmu_write_abd+0x174/0x1c0 [zfs]
[ 9954.883074]  dmu_write_uio_direct+0x79/0x100 [zfs]
[ 9954.884285]  dmu_write_uio_dnode+0xb2/0x320 [zfs]
[ 9954.885507]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 9954.886687]  zfs_write+0x581/0xe20 [zfs]
[ 9954.887822]  ? iov_iter_get_pages+0xe9/0x390
[ 9954.888862]  ? trylock_page+0xd/0x20 [zfs]
[ 9954.890005]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 9954.891217]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 9954.892391]  zpl_iter_write_direct+0xd4/0x170 [zfs]
[ 9954.893663]  ? rrw_exit+0xc6/0x200 [zfs]
[ 9954.894764]  zpl_iter_write+0xd5/0x110 [zfs]
[ 9954.895911]  new_sync_write+0x112/0x160
[ 9954.896881]  vfs_write+0xa5/0x1b0
[ 9954.897701]  ksys_write+0x4f/0xb0
[ 9954.898569]  do_syscall_64+0x5b/0x1b0
[ 9954.899417]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.900515] RIP: 0033:0x7eff236baa47
[ 9954.901363] Code: Unable to access opcode bytes at RIP
0x7eff236baa1d.
[ 9954.902673] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293
ORIG_RAX:
0000000000000001
[ 9954.904099] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007eff236baa47
[ 9954.905535] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI:
0000000000000005
[ 9954.906902] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09:
0000000000000000
[ 9954.908339] R10: 0000000000000000 R11: 0000000000000293 R12:
00000000000e4000
[ 9954.909705] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15:
000055e4d1f13ae8
[ 9954.911129] INFO: task fio:1055504 blocked for more than 120
seconds.
[ 9954.912381]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.913978] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.915434] task:fio             state:D stack:0
pid:1055504
ppid:1055493 flags:0x00000080
[ 9954.917082] Call Trace:
[ 9954.917773]  __schedule+0x2d1/0x870
[ 9954.918648]  ? zilog_dirty+0x4f/0xc0 [zfs]
[ 9954.919831]  schedule+0x55/0xf0
[ 9954.920717]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.921704]  ? finish_wait+0x80/0x80
[ 9954.922639]  zfs_rangelock_enter_writer+0x46/0x1c0 [zfs]
[ 9954.923940]  zfs_rangelock_enter_impl+0x12a/0x1b0 [zfs]
[ 9954.925306]  zfs_write+0x703/0xe20 [zfs]
[ 9954.926406]  zpl_iter_write_buffered+0xb2/0x120 [zfs]
[ 9954.927687]  ? rrw_exit+0xc6/0x200 [zfs]
[ 9954.928821]  zpl_iter_write+0xbe/0x110 [zfs]
[ 9954.930028]  new_sync_write+0x112/0x160
[ 9954.930913]  vfs_write+0xa5/0x1b0
[ 9954.931758]  ksys_write+0x4f/0xb0
[ 9954.932666]  do_syscall_64+0x5b/0x1b0
[ 9954.933544]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.934689] RIP: 0033:0x7fcaee8f0a47
[ 9954.935551] Code: Unable to access opcode bytes at RIP
0x7fcaee8f0a1d.
[ 9954.936893] RSP: 002b:00007fff56b2c240 EFLAGS: 00000293
ORIG_RAX:
0000000000000001
[ 9954.938327] RAX: ffffffffffffffda RBX: 0000000000000006 RCX:
00007fcaee8f0a47
[ 9954.939777] RDX: 000000000001d000 RSI: 00007fca8300b010 RDI:
0000000000000006
[ 9954.941187] RBP: 00007fca8300b010 R08: 0000000000000000 R09:
0000000000000000
[ 9954.942655] R10: 0000000000000000 R11: 0000000000000293 R12:
000000000001d000
[ 9954.944062] R13: 0000557a2006bac0 R14: 000000000001d000 R15:
0000557a2006bae8
[ 9954.945525] INFO: task fio:1055505 blocked for more than 120
seconds.
[ 9954.946819]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.948466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.949959] task:fio             state:D stack:0
pid:1055505
ppid:1055493 flags:0x00004080
[ 9954.951653] Call Trace:
[ 9954.952417]  __schedule+0x2d1/0x870
[ 9954.953393]  ? finish_wait+0x3e/0x80
[ 9954.954315]  schedule+0x55/0xf0
[ 9954.955212]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.956211]  ? finish_wait+0x80/0x80
[ 9954.957159]  zil_commit_waiter+0xfa/0x3b0 [zfs]
[ 9954.958343]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 9954.959524]  zfs_fsync+0x66/0x90 [zfs]
[ 9954.960626]  zpl_fsync+0xe5/0x140 [zfs]
[ 9954.961763]  do_fsync+0x38/0x70
[ 9954.962638]  __x64_sys_fsync+0x10/0x20
[ 9954.963520]  do_syscall_64+0x5b/0x1b0
[ 9954.964470]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.965567] RIP: 0033:0x7fcaee8f1057
[ 9954.966490] Code: Unable to access opcode bytes at RIP
0x7fcaee8f102d.
[ 9954.967752] RSP: 002b:00007fff56b2c230 EFLAGS: 00000293
ORIG_RAX:
000000000000004a
[ 9954.969260] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007fcaee8f1057
[ 9954.970628] RDX: 0000000000000000 RSI: 0000557a2006bac0 RDI:
0000000000000005
[ 9954.972092] RBP: 00007fca84152a18 R08: 0000000000000000 R09:
0000000000000000
[ 9954.973484] R10: 0000000000035000 R11: 0000000000000293 R12:
0000000000000003
[ 9954.974958] R13: 0000557a2006bac0 R14: 0000000000000000 R15:
0000557a2006bae8
[10077.648150] INFO: task z_wr_int:1051869 blocked for more than
120
seconds.
[10077.649541]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[10077.651116] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10077.652782] task:z_wr_int        state:D stack:0
pid:1051869
ppid:2      flags:0x80004080
[10077.654420] Call Trace:
[10077.655267]  __schedule+0x2d1/0x870
[10077.656179]  ? free_one_page+0x204/0x530
[10077.657192]  schedule+0x55/0xf0
[10077.658004]  cv_wait_common+0x16d/0x280 [spl]
[10077.659018]  ? finish_wait+0x80/0x80
[10077.660013]  dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs]
[10077.661396]  dmu_write_direct_done+0x90/0x3b0 [zfs]
[10077.662617]  zio_done+0x373/0x1d50 [zfs]
[10077.663783]  zio_execute+0xee/0x210 [zfs]
[10077.664921]  taskq_thread+0x205/0x3f0 [spl]
[10077.665982]  ? wake_up_q+0x60/0x60
[10077.666842]  ? zio_execute_stack_check.constprop.1+0x10/0x10
[zfs]
[10077.668295]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[10077.669360]  kthread+0x134/0x150
[10077.670191]  ? set_kthread_struct+0x50/0x50
[10077.671209]  ret_from_fork+0x35/0x40
[10077.672076] INFO: task txg_sync:1051894 blocked for more than
120
seconds.
[10077.673467]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[10077.675112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10077.676612] task:txg_sync        state:D stack:0
pid:1051894
ppid:2      flags:0x80004080
[10077.678288] Call Trace:
[10077.679024]  __schedule+0x2d1/0x870
[10077.679948]  ? __wake_up_common+0x7a/0x190
[10077.681042]  schedule+0x55/0xf0
[10077.681899]  schedule_timeout+0x19f/0x320
[10077.682951]  ? __next_timer_interrupt+0xf0/0xf0
[10077.684005]  ? taskq_dispatch+0xab/0x280 [spl]
[10077.685085]  io_schedule_timeout+0x19/0x40
[10077.686080]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[10077.687227]  ? finish_wait+0x80/0x80
[10077.688123]  __cv_timedwait_io+0x15/0x20 [spl]
[10077.689206]  zio_wait+0x1ad/0x4f0 [zfs]
[10077.690300]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[10077.691435]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[10077.692636]  spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
[10077.693997]  spa_sync+0x362/0x8f0 [zfs]
[10077.695112]  txg_sync_thread+0x27a/0x3b0 [zfs]
[10077.696239]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[10077.697512]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[10077.698639]  thread_generic_wrapper+0x63/0x90 [spl]
[10077.699687]  kthread+0x134/0x150
[10077.700567]  ? set_kthread_struct+0x50/0x50
[10077.701502]  ret_from_fork+0x35/0x40
[10077.702430] INFO: task fio:1055501 blocked for more than 120
seconds.
[10077.703697]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[10077.705309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10077.706780] task:fio             state:D stack:0
pid:1055501
ppid:1055490 flags:0x00004080
[10077.708479] Call Trace:
[10077.709231]  __schedule+0x2d1/0x870
[10077.710190]  ? dbuf_hold_copy+0xec/0x230 [zfs]
[10077.711368]  schedule+0x55/0xf0
[10077.712286]  cv_wait_common+0x16d/0x280 [spl]
[10077.713316]  ? finish_wait+0x80/0x80
[10077.714262]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[10077.715566]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[10077.716878]  zfs_get_data+0x566/0x810 [zfs]
[10077.718032]  zil_lwb_commit+0x194/0x3f0 [zfs]
[10077.719234]  zil_lwb_write_issue+0x68/0xb90 [zfs]
[10077.720413]  ? __list_add+0x12/0x30 [zfs]
[10077.721525]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[10077.722708]  ? zil_alloc_lwb+0x217/0x360 [zfs]
[10077.723931]  zil_commit_waiter_timeout+0x1f3/0x570 [zfs]
[10077.725273]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[10077.726438]  zil_commit_impl+0x6d/0xd0 [zfs]
[10077.727586]  zfs_fsync+0x66/0x90 [zfs]
[10077.728675]  zpl_fsync+0xe5/0x140 [zfs]
[10077.729755]  do_fsync+0x38/0x70
[10077.730607]  __x64_sys_fsync+0x10/0x20
[10077.731482]  do_syscall_64+0x5b/0x1b0
[10077.732415]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[10077.733487] RIP: 0033:0x7eff236bb057
[10077.734399] Code: Unable to access opcode bytes at RIP
0x7eff236bb02d.
[10077.735657] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293
ORIG_RAX:
000000000000004a
[10077.737163] RAX: ffffffffffffffda RBX: 0000000000000006 RCX:
00007eff236bb057
[10077.738526] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI:
0000000000000006
[10077.739966] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09:
0000000000000000
[10077.741336] R10: 0000000000056000 R11: 0000000000000293 R12:
0000000000000003
[10077.742773] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15:
000055e4d1f13ae8
[10077.744168] INFO: task fio:1055502 blocked for more than 120
seconds.
[10077.745505]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[10077.747073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10077.748642] task:fio             state:D stack:0
pid:1055502
ppid:1055490 flags:0x00004080
[10077.750233] Call Trace:
[10077.751011]  __schedule+0x2d1/0x870
[10077.751915]  schedule+0x55/0xf0
[10077.752811]  schedule_timeout+0x19f/0x320
[10077.753762]  ? __next_timer_interrupt+0xf0/0xf0
[10077.754824]  io_schedule_timeout+0x19/0x40
[10077.755782]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[10077.756922]  ? finish_wait+0x80/0x80
[10077.757788]  __cv_timedwait_io+0x15/0x20 [spl]
[10077.758845]  zio_wait+0x1ad/0x4f0 [zfs]
[10077.759941]  dmu_write_abd+0x174/0x1c0 [zfs]
[10077.761144]  dmu_write_uio_direct+0x79/0x100 [zfs]
[10077.762327]  dmu_write_uio_dnode+0xb2/0x320 [zfs]
[10077.763523]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[10077.764749]  zfs_write+0x581/0xe20 [zfs]
[10077.765825]  ? iov_iter_get_pages+0xe9/0x390
[10077.766842]  ? trylock_page+0xd/0x20 [zfs]
[10077.767956]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[10077.769189]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[10077.770343]  zpl_iter_write_direct+0xd4/0x170 [zfs]
[10077.771570]  ? rrw_exit+0xc6/0x200 [zfs]
[10077.772674]  zpl_iter_write+0xd5/0x110 [zfs]
[10077.773834]  new_sync_write+0x112/0x160
[10077.774805]  vfs_write+0xa5/0x1b0
[10077.775634]  ksys_write+0x4f/0xb0
[10077.776526]  do_syscall_64+0x5b/0x1b0
[10077.777386]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[10077.778488] RIP: 0033:0x7eff236baa47
[10077.779339] Code: Unable to access opcode bytes at RIP
0x7eff236baa1d.
[10077.780655] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293
ORIG_RAX:
0000000000000001
[10077.782056] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007eff236baa47
[10077.783507] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI:
0000000000000005
[10077.784890] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09:
0000000000000000
[10077.786303] R10: 0000000000000000 R11: 0000000000000293 R12:
00000000000e4000
[10077.787637] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15:
000055e4d1f13ae8

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 10, 2024
Varada (varada.kari@gmail.com) pointed out an issue with O_DIRECT reads
with the following test case:

dd if=/dev/urandom of=/local_zpool/file2 bs=512 count=79
truncate -s 40382 /local_zpool/file2
zpool export local_zpool
zpool import -d ~/ local_zpool
dd if=/local_zpool/file2 of=/dev/null bs=1M iflag=direct

That led to following panic happening:

[  307.769267] VERIFY(IS_P2ALIGNED(size, sizeof (uint32_t))) failed
[  307.782997] PANIC at zfs_fletcher.c:870:abd_fletcher_4_iter()
[  307.788743] Showing stack for process 9665
[  307.792834] CPU: 47 PID: 9665 Comm: z_rd_int_5 Kdump: loaded Tainted:
P           OE    --------- -  - 4.18.0-408.el8.x86_64 openzfs#1
[  307.804298] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21
10/08/2020
[  307.811682] Call Trace:
[  307.814131]  dump_stack+0x41/0x60
[  307.817449]  spl_panic+0xd0/0xe8 [spl]
[  307.821210]  ? irq_work_queue+0x9/0x20
[  307.824961]  ? wake_up_klogd.part.30+0x30/0x40
[  307.829407]  ? vprintk_emit+0x125/0x250
[  307.833246]  ? printk+0x58/0x6f
[  307.836391]  spl_assert.constprop.1+0x16/0x20 [zfs]
[  307.841438]  abd_fletcher_4_iter+0x6c/0x101 [zfs]
[  307.846343]  ? abd_fletcher_4_simd2scalar+0x83/0x83 [zfs]
[  307.851922]  abd_iterate_func+0xb1/0x170 [zfs]
[  307.856533]  abd_fletcher_4_impl+0x3f/0xa0 [zfs]
[  307.861334]  abd_fletcher_4_native+0x52/0x70 [zfs]
[  307.866302]  ? enqueue_entity+0xf1/0x6e0
[  307.870226]  ? select_idle_sibling+0x23/0x700
[  307.874587]  ? enqueue_task_fair+0x94/0x710
[  307.878771]  ? select_task_rq_fair+0x351/0x990
[  307.883208]  zio_checksum_error_impl+0xff/0x5f0 [zfs]
[  307.888435]  ? abd_fletcher_4_impl+0xa0/0xa0 [zfs]
[  307.893401]  ? spl_kmem_alloc_impl+0xce/0xf0 [spl]
[  307.898203]  ? __wake_up_common+0x7a/0x190
[  307.902300]  ? __switch_to_asm+0x41/0x70
[  307.906220]  ? __switch_to_asm+0x35/0x70
[  307.910145]  ? __switch_to_asm+0x41/0x70
[  307.914061]  ? __switch_to_asm+0x35/0x70
[  307.917980]  ? __switch_to_asm+0x41/0x70
[  307.921903]  ? __switch_to_asm+0x35/0x70
[  307.925821]  ? __switch_to_asm+0x35/0x70
[  307.929739]  ? __switch_to_asm+0x41/0x70
[  307.933658]  ? __switch_to_asm+0x35/0x70
[  307.937582]  zio_checksum_error+0x47/0xc0 [zfs]
[  307.942288]  raidz_checksum_verify+0x3a/0x70 [zfs]
[  307.947257]  vdev_raidz_io_done+0x4b/0x160 [zfs]
[  307.952049]  zio_vdev_io_done+0x7f/0x200 [zfs]
[  307.956669]  zio_execute+0xee/0x210 [zfs]
[  307.960855]  taskq_thread+0x203/0x420 [spl]
[  307.965048]  ? wake_up_q+0x70/0x70
[  307.968455]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[  307.974807]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[  307.979260]  kthread+0x10a/0x120
[  307.982485]  ? set_kthread_struct+0x40/0x40
[  307.986670]  ret_from_fork+0x35/0x40

The reason this was occurring is that the zpool export forced the
initial O_DIRECT read to go down to disk. The request itself was still
valid, as bs=1M is page-size aligned; however, the file length was
not. So when issuing the O_DIRECT read, even after calling
make_abd_for_dbuf() we had an extra page allocated in the original ABD
along with the linear ABD attached at the end of the gang ABD from
make_abd_for_dbuf().

This is an issue because our expectation for reads is that the block
sizes being read are page aligned. The check only validated the
request itself, not the actual amount of data that might be read, such
as the entire file.

In order to remedy this situation, I updated zfs_read() to read as
much as it can using O_DIRECT, based on how much of the requested
length is page-size aligned. Any remaining bytes are then read into
the ARC. This preserves our semantics that Direct I/O requests must be
page-size aligned.
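
The splitting logic can be sketched abstractly as follows. This is a
simplified model, not the actual zfs_read() code; the PAGE_SIZE value
and the helper name are illustrative:

```python
PAGE_SIZE = 4096

def split_direct_read(offset, length):
    """Split a read request into a page-aligned portion that can be
    serviced with O_DIRECT and a tail that must go through the ARC.
    Assumes the offset is already page-aligned, as O_DIRECT requires.
    (Illustrative helper, not part of OpenZFS.)"""
    direct_len = length & ~(PAGE_SIZE - 1)   # round down to a page multiple
    buffered_len = length - direct_len       # stray tail bytes, if any
    return direct_len, buffered_len

# A read of the 40382-byte file from the test case: 9 full pages
# (36864 bytes) go through O_DIRECT, the last 3518 bytes are buffered.
print(split_direct_read(0, 40382))   # (36864, 3518)
```

A request smaller than one page degenerates to a purely buffered read,
which matches the single-block drawback described below.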

There is a drawback here when only a single block is being read: the
block will be read twice, once using O_DIRECT and then again buffered
to fill in the remaining data for the user's request. However, this
should not be a big issue most of the time. In the common case a user
asks for a lot of data from a file, and only the stray bytes at the
end of the file have to be read through the ARC.

To make sure this case is completely covered, I added a new ZTS test
case, dio_unaligned_filesize. The key point of the test is that the
first O_DIRECT read issues three reads: two using O_DIRECT and a third
buffered read for the remaining requested bytes.

As part of this commit, I also updated stride_dd to take an additional
parameter, -e, which reads the entire input file and ignores the
count (-c) option. We need to use stride_dd on FreeBSD because dd does
not guarantee the buffer is page aligned. This update to stride_dd
allows us to use it to test this case in dio_unaligned_filesize on
both Linux and FreeBSD.

While this may not be the most elegant solution, it sticks with the
semantics and still reads all the data the user requested. I am fine
with revisiting this; maybe we should just return a short read?

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 10, 2024
We were using the generic Linux calls to make sure that the page cache
was cleaned out before issuing any Direct I/O reads or writes.
However, this only matters when the file region being written or read
with O_DIRECT was mmap'ed. One of the stipulations of O_DIRECT is that
it is redirected through the ARC when the file range is mmap'ed.
Because of this, it did not make sense to try to invalidate the page
cache if we never intended O_DIRECT to work with mmap'ed regions.
Also, calls into the generic Linux write path would often lead to
lockups, as the page lock is dropped in zfs_putpage(); see the stack
dump below. To prevent this, we no longer use the generic Linux Direct
I/O wrappers or try to flush out the page cache.

Instead, if we find the file range has been mmap'ed in since the
initial check in zfs_setup_direct(), we now handle that directly in
zfs_read() and zfs_write(). In most cases zfs_setup_direct() will
prevent O_DIRECT to mmap'ed regions of the file that have been page
faulted in, but if that happens while we are issuing the Direct I/O
request, the normal ZFS paths will account for it.

It is highly suggested not to mmap a region of a file and then write
or read directly to that file. In general, that is kind of an insane
thing to do... However, we try our best to still stay consistent with
the ARC.

Also, before making this decision I explored whether we could just add
a rangelock in zfs_fillpage(), but we cannot. The reason is that by
the time we are in zfs_readpage_common() the page has already been
locked by the kernel. So, if we try to grab the rangelock anywhere in
that path, we can get stuck if another thread is issuing writes to the
mmap'ed file region: update_pages() holds the rangelock and then tries
to lock the page, while zfs_fillpage() holds the page lock and waits
on the rangelock. Deadlock is unavoidable in this case.
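
The lock-ordering problem can be modeled abstractly with plain
threading primitives. This is a toy sketch, not ZFS code; the function
names only mirror the roles of update_pages() and zfs_fillpage():

```python
import threading

rangelock = threading.Lock()   # models zfs_rangelock_enter()
page_lock = threading.Lock()   # models the kernel's page lock
done = []

def update_pages_path():
    # update_pages(): takes the rangelock first, then the page lock.
    with rangelock:
        with page_lock:
            done.append("writer")

def fillpage_path_safe():
    # A hypothetical fillpage path taking the same order
    # (rangelock -> page lock) completes fine. The real zfs_fillpage()
    # cannot do this: the kernel hands it the page already locked, so
    # it would be acquiring page_lock -> rangelock, the reverse of
    # update_pages(). Each thread could then hold one lock while
    # waiting on the other, which is the unavoidable deadlock.
    with rangelock:
        with page_lock:
            done.append("reader")

threads = [threading.Thread(target=f)
           for f in (update_pages_path, fillpage_path_safe)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("both paths finished:", done)
```

Consistent acquisition order is the standard way out of such cycles,
but it is unavailable here precisely because the kernel imposes the
page-lock-first order on the fillpage side.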

[260136.244332] INFO: task fio:3791107 blocked for more than 120
seconds.
[260136.250867]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260136.258693] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260136.266607] task:fio             state:D stack:    0 pid:3791107
ppid:3790841 flags:0x00004080
[260136.275306] Call Trace:
[260136.277845]  __schedule+0x2d1/0x830
[260136.281432]  schedule+0x35/0xa0
[260136.284665]  io_schedule+0x12/0x40
[260136.288157]  wait_on_page_bit+0x123/0x220
[260136.292258]  ? xas_load+0x8/0x80
[260136.295577]  ? file_fdatawait_range+0x20/0x20
[260136.300024]  filemap_page_mkwrite+0x9b/0xb0
[260136.304295]  do_page_mkwrite+0x53/0x90
[260136.308135]  ? vm_normal_page+0x1a/0xc0
[260136.312062]  do_wp_page+0x298/0x350
[260136.315640]  __handle_mm_fault+0x44f/0x6c0
[260136.319826]  ? __switch_to_asm+0x41/0x70
[260136.323839]  handle_mm_fault+0xc1/0x1e0
[260136.327766]  do_user_addr_fault+0x1b5/0x440
[260136.332038]  do_page_fault+0x37/0x130
[260136.335792]  ? page_fault+0x8/0x30
[260136.339284]  page_fault+0x1e/0x30
[260136.342689] RIP: 0033:0x7f6deee7f1b4
[260136.346361] Code: Unable to access opcode bytes at RIP
0x7f6deee7f18a.
[260136.352977] RSP: 002b:00007fffe41b6538 EFLAGS: 00010202
[260136.358288] RAX: 00007f6d83049000 RBX: 0000556b63614ec0 RCX:
00007f6d83148fe0
[260136.365508] RDX: 00000000000acfe0 RSI: 00007f6d84e9c030 RDI:
00007f6d8309bfa0
[260136.372730] RBP: 00007f6d84f4a000 R08: ffffffffffffffe0 R09:
0000000000000000
[260136.379946] R10: 00007f6d84f8e810 R11: 00007f6d83049000 R12:
0000000000000001
[260136.387167] R13: 0000556b63614ec0 R14: 0000000000100000 R15:
0000556b63614ee8
[260136.394387] INFO: task fio:3791108 blocked for more than 120
seconds.
[260136.400911]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260136.408739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260136.416651] task:fio             state:D stack:    0 pid:3791108
ppid:3790835 flags:0x00004080
[260136.425343] Call Trace:
[260136.427883]  __schedule+0x2d1/0x830
[260136.431463]  ? cv_wait_common+0x12d/0x240 [spl]
[260136.436091]  schedule+0x35/0xa0
[260136.439321]  io_schedule+0x12/0x40
[260136.442814]  __lock_page+0x12d/0x230
[260136.446483]  ? file_fdatawait_range+0x20/0x20
[260136.450929]  zfs_putpage+0x148/0x590 [zfs]
[260136.455322]  ? rmap_walk_file+0x116/0x290
[260136.459421]  ? __mod_memcg_lruvec_state+0x5d/0x160
[260136.464300]  zpl_putpage+0x67/0xd0 [zfs]
[260136.468495]  write_cache_pages+0x197/0x420
[260136.472679]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260136.477732]  zpl_writepages+0x119/0x130 [zfs]
[260136.482352]  do_writepages+0xc2/0x1c0
[260136.486103]  ? flush_tlb_func_common.constprop.9+0x125/0x220
[260136.491850]  __filemap_fdatawrite_range+0xc7/0x100
[260136.496732]  filemap_write_and_wait_range+0x30/0x80
[260136.501695]  generic_file_direct_write+0x120/0x160
[260136.506575]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260136.510779]  zpl_iter_write+0xdd/0x160 [zfs]
[260136.515323]  new_sync_write+0x112/0x160
[260136.519255]  vfs_write+0xa5/0x1a0
[260136.522662]  ksys_write+0x4f/0xb0
[260136.526067]  do_syscall_64+0x5b/0x1a0
[260136.529818]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260136.534959] RIP: 0033:0x7f9d192c7a17
[260136.538625] Code: Unable to access opcode bytes at RIP
0x7f9d192c79ed.
[260136.545236] RSP: 002b:00007ffc8e4ba270 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[260136.552889] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9d192c7a17
[260136.560108] RDX: 0000000000100000 RSI: 00007f9caea03000 RDI:
0000000000000005
[260136.567329] RBP: 00007f9caea03000 R08: 0000000000000000 R09:
0000000000000000
[260136.574548] R10: 00005558e8975680 R11: 0000000000000293 R12:
0000000000100000
[260136.581767] R13: 00005558e8985ec0 R14: 0000000000100000 R15:
00005558e8985ee8
[260136.588989] INFO: task fio:3791109 blocked for more than 120
seconds.
[260136.595513]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260136.603337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260136.611250] task:fio             state:D stack:    0 pid:3791109
ppid:3790838 flags:0x00004080
[260136.619943] Call Trace:
[260136.622483]  __schedule+0x2d1/0x830
[260136.626064]  ? zfs_znode_held+0xe6/0x140 [zfs]
[260136.630777]  schedule+0x35/0xa0
[260136.634009]  cv_wait_common+0x153/0x240 [spl]
[260136.638466]  ? finish_wait+0x80/0x80
[260136.642129]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[260136.647712]  zfs_rangelock_enter_impl+0xbf/0x170 [zfs]
[260136.653121]  zfs_get_data+0x113/0x770 [zfs]
[260136.657567]  zil_lwb_commit+0x537/0x780 [zfs]
[260136.662187]  zil_process_commit_list+0x14c/0x460 [zfs]
[260136.667585]  zil_commit_writer+0xeb/0x160 [zfs]
[260136.672376]  zil_commit_impl+0x5d/0xa0 [zfs]
[260136.676910]  zfs_putpage+0x516/0x590 [zfs]
[260136.681279]  zpl_putpage+0x67/0xd0 [zfs]
[260136.685467]  write_cache_pages+0x197/0x420
[260136.689649]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260136.694705]  zpl_writepages+0x119/0x130 [zfs]
[260136.699322]  do_writepages+0xc2/0x1c0
[260136.703076]  __filemap_fdatawrite_range+0xc7/0x100
[260136.707952]  filemap_write_and_wait_range+0x30/0x80
[260136.712920]  zpl_iter_read_direct+0x86/0x1b0 [zfs]
[260136.717972]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260136.722174]  zpl_iter_read+0x90/0xb0 [zfs]
[260136.726536]  new_sync_read+0x10f/0x150
[260136.730376]  vfs_read+0x91/0x140
[260136.733693]  ksys_read+0x4f/0xb0
[260136.737012]  do_syscall_64+0x5b/0x1a0
[260136.740764]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260136.745906] RIP: 0033:0x7f1bd4687ab4
[260136.749574] Code: Unable to access opcode bytes at RIP
0x7f1bd4687a8a.
[260136.756181] RSP: 002b:00007fff63f65170 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[260136.763834] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f1bd4687ab4
[260136.771056] RDX: 0000000000100000 RSI: 00007f1b69dc3000 RDI:
0000000000000005
[260136.778274] RBP: 00007f1b69dc3000 R08: 0000000000000000 R09:
0000000000000000
[260136.785494] R10: 000000008fd0ea42 R11: 0000000000000246 R12:
0000000000100000
[260136.792714] R13: 000055ca4b405ec0 R14: 0000000000100000 R15:
000055ca4b405ee8
[260259.123003] INFO: task kworker/u128:0:3589938 blocked for more than
120 seconds.
[260259.130487]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260259.138313] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260259.146224] task:kworker/u128:0  state:D stack:    0 pid:3589938
ppid:     2 flags:0x80004080
[260259.154832] Workqueue: writeback wb_workfn (flush-zfs-540)
[260259.160411] Call Trace:
[260259.162950]  __schedule+0x2d1/0x830
[260259.166531]  schedule+0x35/0xa0
[260259.169765]  io_schedule+0x12/0x40
[260259.173257]  __lock_page+0x12d/0x230
[260259.176921]  ? file_fdatawait_range+0x20/0x20
[260259.181368]  write_cache_pages+0x1f2/0x420
[260259.185554]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260259.190633]  zpl_writepages+0x98/0x130 [zfs]
[260259.195183]  do_writepages+0xc2/0x1c0
[260259.198935]  __writeback_single_inode+0x39/0x2f0
[260259.203640]  writeback_sb_inodes+0x1e6/0x450
[260259.208002]  __writeback_inodes_wb+0x5f/0xc0
[260259.212359]  wb_writeback+0x247/0x2e0
[260259.216114]  ? get_nr_inodes+0x35/0x50
[260259.219953]  wb_workfn+0x37c/0x4d0
[260259.223443]  ? __switch_to_asm+0x35/0x70
[260259.227456]  ? __switch_to_asm+0x41/0x70
[260259.231469]  ? __switch_to_asm+0x35/0x70
[260259.235481]  ? __switch_to_asm+0x41/0x70
[260259.239495]  ? __switch_to_asm+0x35/0x70
[260259.243505]  ? __switch_to_asm+0x41/0x70
[260259.247518]  ? __switch_to_asm+0x35/0x70
[260259.251533]  ? __switch_to_asm+0x41/0x70
[260259.255545]  process_one_work+0x1a7/0x360
[260259.259645]  worker_thread+0x30/0x390
[260259.263396]  ? create_worker+0x1a0/0x1a0
[260259.267409]  kthread+0x10a/0x120
[260259.270730]  ? set_kthread_struct+0x40/0x40
[260259.275003]  ret_from_fork+0x35/0x40
[260259.278712] INFO: task fio:3791107 blocked for more than 120
seconds.
[260259.285240]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260259.293064] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260259.300976] task:fio             state:D stack:    0 pid:3791107
ppid:3790841 flags:0x00004080
[260259.309668] Call Trace:
[260259.312210]  __schedule+0x2d1/0x830
[260259.315787]  schedule+0x35/0xa0
[260259.319020]  io_schedule+0x12/0x40
[260259.322511]  wait_on_page_bit+0x123/0x220
[260259.326611]  ? xas_load+0x8/0x80
[260259.329930]  ? file_fdatawait_range+0x20/0x20
[260259.334376]  filemap_page_mkwrite+0x9b/0xb0
[260259.338650]  do_page_mkwrite+0x53/0x90
[260259.342489]  ? vm_normal_page+0x1a/0xc0
[260259.346415]  do_wp_page+0x298/0x350
[260259.349994]  __handle_mm_fault+0x44f/0x6c0
[260259.354181]  ? __switch_to_asm+0x41/0x70
[260259.358193]  handle_mm_fault+0xc1/0x1e0
[260259.362117]  do_user_addr_fault+0x1b5/0x440
[260259.366391]  do_page_fault+0x37/0x130
[260259.370145]  ? page_fault+0x8/0x30
[260259.373639]  page_fault+0x1e/0x30
[260259.377043] RIP: 0033:0x7f6deee7f1b4
[260259.380714] Code: Unable to access opcode bytes at RIP
0x7f6deee7f18a.
[260259.387323] RSP: 002b:00007fffe41b6538 EFLAGS: 00010202
[260259.392633] RAX: 00007f6d83049000 RBX: 0000556b63614ec0 RCX:
00007f6d83148fe0
[260259.399853] RDX: 00000000000acfe0 RSI: 00007f6d84e9c030 RDI:
00007f6d8309bfa0
[260259.407074] RBP: 00007f6d84f4a000 R08: ffffffffffffffe0 R09:
0000000000000000
[260259.414291] R10: 00007f6d84f8e810 R11: 00007f6d83049000 R12:
0000000000000001
[260259.421512] R13: 0000556b63614ec0 R14: 0000000000100000 R15:
0000556b63614ee8
[260259.428733] INFO: task fio:3791108 blocked for more than 120
seconds.
[260259.435258]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260259.443085] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260259.450997] task:fio             state:D stack:    0 pid:3791108
ppid:3790835 flags:0x00004080
[260259.459689] Call Trace:
[260259.462228]  __schedule+0x2d1/0x830
[260259.465808]  ? cv_wait_common+0x12d/0x240 [spl]
[260259.470435]  schedule+0x35/0xa0
[260259.473669]  io_schedule+0x12/0x40
[260259.477161]  __lock_page+0x12d/0x230
[260259.480828]  ? file_fdatawait_range+0x20/0x20
[260259.485274]  zfs_putpage+0x148/0x590 [zfs]
[260259.489640]  ? rmap_walk_file+0x116/0x290
[260259.493742]  ? __mod_memcg_lruvec_state+0x5d/0x160
[260259.498619]  zpl_putpage+0x67/0xd0 [zfs]
[260259.502813]  write_cache_pages+0x197/0x420
[260259.506998]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260259.512054]  zpl_writepages+0x119/0x130 [zfs]
[260259.516672]  do_writepages+0xc2/0x1c0
[260259.520423]  ? flush_tlb_func_common.constprop.9+0x125/0x220
[260259.526170]  __filemap_fdatawrite_range+0xc7/0x100
[260259.531050]  filemap_write_and_wait_range+0x30/0x80
[260259.536016]  generic_file_direct_write+0x120/0x160
[260259.540896]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260259.545099]  zpl_iter_write+0xdd/0x160 [zfs]
[260259.549639]  new_sync_write+0x112/0x160
[260259.553566]  vfs_write+0xa5/0x1a0
[260259.556971]  ksys_write+0x4f/0xb0
[260259.560379]  do_syscall_64+0x5b/0x1a0
[260259.564131]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260259.569269] RIP: 0033:0x7f9d192c7a17
[260259.572935] Code: Unable to access opcode bytes at RIP
0x7f9d192c79ed.
[260259.579549] RSP: 002b:00007ffc8e4ba270 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[260259.587200] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9d192c7a17
[260259.594419] RDX: 0000000000100000 RSI: 00007f9caea03000 RDI:
0000000000000005
[260259.601639] RBP: 00007f9caea03000 R08: 0000000000000000 R09:
0000000000000000
[260259.608859] R10: 00005558e8975680 R11: 0000000000000293 R12:
0000000000100000
[260259.616078] R13: 00005558e8985ec0 R14: 0000000000100000 R15:
00005558e8985ee8
[260259.623298] INFO: task fio:3791109 blocked for more than 120
seconds.
[260259.629827]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260259.637650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260259.645564] task:fio             state:D stack:    0 pid:3791109
ppid:3790838 flags:0x00004080
[260259.654254] Call Trace:
[260259.656794]  __schedule+0x2d1/0x830
[260259.660373]  ? zfs_znode_held+0xe6/0x140 [zfs]
[260259.665081]  schedule+0x35/0xa0
[260259.668313]  cv_wait_common+0x153/0x240 [spl]
[260259.672768]  ? finish_wait+0x80/0x80
[260259.676441]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[260259.682026]  zfs_rangelock_enter_impl+0xbf/0x170 [zfs]
[260259.687432]  zfs_get_data+0x113/0x770 [zfs]
[260259.691876]  zil_lwb_commit+0x537/0x780 [zfs]
[260259.696497]  zil_process_commit_list+0x14c/0x460 [zfs]
[260259.701895]  zil_commit_writer+0xeb/0x160 [zfs]
[260259.706689]  zil_commit_impl+0x5d/0xa0 [zfs]
[260259.711228]  zfs_putpage+0x516/0x590 [zfs]
[260259.715589]  zpl_putpage+0x67/0xd0 [zfs]
[260259.719775]  write_cache_pages+0x197/0x420
[260259.723959]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260259.729013]  zpl_writepages+0x119/0x130 [zfs]
[260259.733632]  do_writepages+0xc2/0x1c0
[260259.737384]  __filemap_fdatawrite_range+0xc7/0x100
[260259.742264]  filemap_write_and_wait_range+0x30/0x80
[260259.747229]  zpl_iter_read_direct+0x86/0x1b0 [zfs]
[260259.752286]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260259.756487]  zpl_iter_read+0x90/0xb0 [zfs]
[260259.760855]  new_sync_read+0x10f/0x150
[260259.764696]  vfs_read+0x91/0x140
[260259.768013]  ksys_read+0x4f/0xb0
[260259.771332]  do_syscall_64+0x5b/0x1a0
[260259.775087]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260259.780225] RIP: 0033:0x7f1bd4687ab4
[260259.783893] Code: Unable to access opcode bytes at RIP
0x7f1bd4687a8a.
[260259.790503] RSP: 002b:00007fff63f65170 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[260259.798157] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f1bd4687ab4
[260259.805377] RDX: 0000000000100000 RSI: 00007f1b69dc3000 RDI:
0000000000000005
[260259.812592] RBP: 00007f1b69dc3000 R08: 0000000000000000 R09:
0000000000000000
[260259.819814] R10: 000000008fd0ea42 R11: 0000000000000246 R12:
0000000000100000
[260259.827032] R13: 000055ca4b405ec0 R14: 0000000000100000 R15:
000055ca4b405ee8
[260382.001731] INFO: task kworker/u128:0:3589938 blocked for more than
120 seconds.
[260382.009227]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260382.017053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260382.024963] task:kworker/u128:0  state:D stack:    0 pid:3589938
ppid:     2 flags:0x80004080
[260382.033568] Workqueue: writeback wb_workfn (flush-zfs-540)
[260382.039141] Call Trace:
[260382.041683]  __schedule+0x2d1/0x830
[260382.045271]  schedule+0x35/0xa0
[260382.048503]  io_schedule+0x12/0x40
[260382.051994]  __lock_page+0x12d/0x230
[260382.055662]  ? file_fdatawait_range+0x20/0x20
[260382.060107]  write_cache_pages+0x1f2/0x420
[260382.064293]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260382.069379]  zpl_writepages+0x98/0x130 [zfs]
[260382.073919]  do_writepages+0xc2/0x1c0
[260382.077672]  __writeback_single_inode+0x39/0x2f0
[260382.082379]  writeback_sb_inodes+0x1e6/0x450
[260382.086738]  __writeback_inodes_wb+0x5f/0xc0
[260382.091097]  wb_writeback+0x247/0x2e0
[260382.094850]  ? get_nr_inodes+0x35/0x50
[260382.098689]  wb_workfn+0x37c/0x4d0
[260382.102181]  ? __switch_to_asm+0x35/0x70
[260382.106194]  ? __switch_to_asm+0x41/0x70
[260382.110207]  ? __switch_to_asm+0x35/0x70
[260382.114221]  ? __switch_to_asm+0x41/0x70
[260382.118231]  ? __switch_to_asm+0x35/0x70
[260382.122244]  ? __switch_to_asm+0x41/0x70
[260382.126256]  ? __switch_to_asm+0x35/0x70
[260382.130273]  ? __switch_to_asm+0x41/0x70
[260382.134284]  process_one_work+0x1a7/0x360
[260382.138384]  worker_thread+0x30/0x390
[260382.142136]  ? create_worker+0x1a0/0x1a0
[260382.146150]  kthread+0x10a/0x120
[260382.149469]  ? set_kthread_struct+0x40/0x40
[260382.153741]  ret_from_fork+0x35/0x40
[260382.157448] INFO: task fio:3791107 blocked for more than 120
seconds.
[260382.163977]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260382.171802] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260382.179715] task:fio             state:D stack:    0 pid:3791107
ppid:3790841 flags:0x00004080
[260382.188409] Call Trace:
[260382.190945]  __schedule+0x2d1/0x830
[260382.194527]  schedule+0x35/0xa0
[260382.197757]  io_schedule+0x12/0x40
[260382.201249]  wait_on_page_bit+0x123/0x220
[260382.205350]  ? xas_load+0x8/0x80
[260382.208668]  ? file_fdatawait_range+0x20/0x20
[260382.213114]  filemap_page_mkwrite+0x9b/0xb0
[260382.217386]  do_page_mkwrite+0x53/0x90
[260382.221227]  ? vm_normal_page+0x1a/0xc0
[260382.225152]  do_wp_page+0x298/0x350
[260382.228733]  __handle_mm_fault+0x44f/0x6c0
[260382.232919]  ? __switch_to_asm+0x41/0x70
[260382.236930]  handle_mm_fault+0xc1/0x1e0
[260382.240856]  do_user_addr_fault+0x1b5/0x440
[260382.245132]  do_page_fault+0x37/0x130
[260382.248883]  ? page_fault+0x8/0x30
[260382.252375]  page_fault+0x1e/0x30
[260382.255781] RIP: 0033:0x7f6deee7f1b4
[260382.259451] Code: Unable to access opcode bytes at RIP
0x7f6deee7f18a.
[260382.266059] RSP: 002b:00007fffe41b6538 EFLAGS: 00010202
[260382.271373] RAX: 00007f6d83049000 RBX: 0000556b63614ec0 RCX:
00007f6d83148fe0
[260382.278591] RDX: 00000000000acfe0 RSI: 00007f6d84e9c030 RDI:
00007f6d8309bfa0
[260382.285813] RBP: 00007f6d84f4a000 R08: ffffffffffffffe0 R09:
0000000000000000
[260382.293030] R10: 00007f6d84f8e810 R11: 00007f6d83049000 R12:
0000000000000001
[260382.300249] R13: 0000556b63614ec0 R14: 0000000000100000 R15:
0000556b63614ee8
[260382.307472] INFO: task fio:3791108 blocked for more than 120
seconds.
[260382.313997]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260382.321823] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260382.329734] task:fio             state:D stack:    0 pid:3791108
ppid:3790835 flags:0x00004080
[260382.338427] Call Trace:
[260382.340967]  __schedule+0x2d1/0x830
[260382.344547]  ? cv_wait_common+0x12d/0x240 [spl]
[260382.349173]  schedule+0x35/0xa0
[260382.352406]  io_schedule+0x12/0x40
[260382.355899]  __lock_page+0x12d/0x230
[260382.359563]  ? file_fdatawait_range+0x20/0x20
[260382.364010]  zfs_putpage+0x148/0x590 [zfs]
[260382.368379]  ? rmap_walk_file+0x116/0x290
[260382.372479]  ? __mod_memcg_lruvec_state+0x5d/0x160
[260382.377358]  zpl_putpage+0x67/0xd0 [zfs]
[260382.381552]  write_cache_pages+0x197/0x420
[260382.385739]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260382.390791]  zpl_writepages+0x119/0x130 [zfs]
[260382.395410]  do_writepages+0xc2/0x1c0
[260382.399161]  ? flush_tlb_func_common.constprop.9+0x125/0x220
[260382.404907]  __filemap_fdatawrite_range+0xc7/0x100
[260382.409790]  filemap_write_and_wait_range+0x30/0x80
[260382.414752]  generic_file_direct_write+0x120/0x160
[260382.419632]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260382.423838]  zpl_iter_write+0xdd/0x160 [zfs]
[260382.428379]  new_sync_write+0x112/0x160
[260382.432304]  vfs_write+0xa5/0x1a0
[260382.435711]  ksys_write+0x4f/0xb0
[260382.439115]  do_syscall_64+0x5b/0x1a0
[260382.442866]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260382.448007] RIP: 0033:0x7f9d192c7a17
[260382.451675] Code: Unable to access opcode bytes at RIP
0x7f9d192c79ed.
[260382.458286] RSP: 002b:00007ffc8e4ba270 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[260382.465938] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9d192c7a17
[260382.473158] RDX: 0000000000100000 RSI: 00007f9caea03000 RDI:
0000000000000005
[260382.480379] RBP: 00007f9caea03000 R08: 0000000000000000 R09:
0000000000000000
[260382.487597] R10: 00005558e8975680 R11: 0000000000000293 R12:
0000000000100000
[260382.494814] R13: 00005558e8985ec0 R14: 0000000000100000 R15:
00005558e8985ee8

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 10, 2024
In commit ba30ec9 I got a little overzealous in code cleanup. While I
was trying to remove all the stable page code for Linux, I
misinterpreted why Brian Behlendorf originally had the try rangelock,
drop page lock, and acquire rangelock sequence in zfs_fillpage(). This
is still necessary even without stable pages. It has to occur to avoid
a race condition between direct IO writes and pages being faulted in
for mmap files. If the rangelock is not held, then a direct IO write
can set db->db_data = NULL either in:
 1. dmu_write_direct() -> dmu_buf_will_not_fill() ->
    dmu_buf_will_fill() -> dbuf_noread() -> dbuf_clear_data()
 2. dmu_write_direct_done()

Without the rangelock this can then cause a panic, as dmu_read_impl()
can get a NULL pointer for db->db_data when trying to do the memcpy.
So this rangelock must be held in zfs_fillpage() no matter what.

There are further semantics on when the rangelock should be held in
zfs_fillpage(). It must only be held when doing zfs_getpage() ->
zfs_fillpage(). The reason for this is that mappedread() can call
zfs_fillpage() if the page is not uptodate. This can occur because
filemap_fault() will first add the pages to the inode's address_space
mapping and then drop the page lock. This leaves open a window where
mappedread() can be called. Since this can occur, mappedread() will
hold both the page lock and the rangelock. This is perfectly valid and
correct. However, it is important in this case to never grab the
rangelock in zfs_fillpage(). If that happens, a deadlock will occur.

Finally, it is important to note that the rangelock is first attempted
with zfs_rangelock_tryenter(). The reason for this is that the page
lock must be dropped in order to grab the rangelock in this case.
Otherwise there is a race between zfs_fillpage() and zfs_write() ->
update_pages(). In update_pages() the rangelock is already held, and
it then grabs the page lock. So if the page lock is not dropped before
acquiring the rangelock in zfs_fillpage() there can be a deadlock.
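The try-lock, drop, reacquire ordering described above is a generic
deadlock-avoidance pattern. A rough userspace sketch with plain
pthreads (`fillpage_lock_both`, `page_lock`, and `range_lock` are
illustrative stand-ins, not the actual ZFS or kernel APIs):

```c
#include <pthread.h>

/* update_pages() takes the rangelock and then the page lock, so
 * zfs_fillpage(), which enters with the page lock held, must not
 * block on the rangelock while still holding the page lock. */
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t range_lock = PTHREAD_MUTEX_INITIALIZER;

/* Call with page_lock held; returns with both locks held.
 * Returns 1 if the try-lock fast path succeeded, 0 otherwise. */
static int
fillpage_lock_both(void)
{
	if (pthread_mutex_trylock(&range_lock) == 0)
		return (1); /* uncontended: lock order stays intact */

	/* Contended: drop the page lock first so the thread holding
	 * the rangelock can make progress, then block and retake both. */
	pthread_mutex_unlock(&page_lock);
	pthread_mutex_lock(&range_lock);
	pthread_mutex_lock(&page_lock);
	return (0);
}
```

The key point mirrored here is that the blocking acquisition of the
rangelock only ever happens after the page lock has been released.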

Below is a stack trace showing the NULL pointer dereference that was
occurring with the dio_mmap ZTS test case before this commit.

[ 7737.430658] BUG: unable to handle kernel NULL pointer dereference at
0000000000000000
[ 7737.438486] PGD 0 P4D 0
[ 7737.441024] Oops: 0000 [openzfs#1] SMP NOPTI
[ 7737.444692] CPU: 33 PID: 599346 Comm: fio Kdump: loaded Tainted: P
OE    --------- -  - 4.18.0-408.el8.x86_64 openzfs#1
[ 7737.455721] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21
10/08/2020
[ 7737.463106] RIP: 0010:__memcpy+0x12/0x20
[ 7737.467032] Code: ff 0f 31 48 c1 e2 20 48 09 c2 48 31 d3 e9 79 ff ff
ff 90 90 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2
07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 a4
[ 7737.485770] RSP: 0000:ffffc1db829e3b60 EFLAGS: 00010246
[ 7737.490987] RAX: ffff9ef195b6f000 RBX: 0000000000001000 RCX:
0000000000000200
[ 7737.498111] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff9ef195b6f000
[ 7737.505235] RBP: ffff9ef195b70000 R08: ffff9eef1d1d0000 R09:
ffff9eef1d1d0000
[ 7737.512358] R10: ffff9eef27968218 R11: 0000000000000000 R12:
0000000000000000
[ 7737.519481] R13: ffff9ef041adb6d8 R14: 00000000005e1000 R15:
0000000000000001
[ 7737.526607] FS:  00007f77fe2bae80(0000) GS:ffff9f0cae840000(0000)
knlGS:0000000000000000
[ 7737.534683] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7737.540423] CR2: 0000000000000000 CR3: 00000003387a6000 CR4:
0000000000350ee0
[ 7737.547553] Call Trace:
[ 7737.550003]  dmu_read_impl+0x11a/0x210 [zfs]
[ 7737.554464]  dmu_read+0x56/0x90 [zfs]
[ 7737.558292]  zfs_fillpage+0x76/0x190 [zfs]
[ 7737.562584]  zfs_getpage+0x4c/0x80 [zfs]
[ 7737.566691]  zpl_readpage_common+0x3b/0x80 [zfs]
[ 7737.571485]  filemap_fault+0x5d6/0xa10
[ 7737.575236]  ? obj_cgroup_charge_pages+0xba/0xd0
[ 7737.579856]  ? xas_load+0x8/0x80
[ 7737.583088]  ? xas_find+0x173/0x1b0
[ 7737.586579]  ? filemap_map_pages+0x84/0x410
[ 7737.590759]  __do_fault+0x38/0xb0
[ 7737.594077]  handle_pte_fault+0x559/0x870
[ 7737.598082]  __handle_mm_fault+0x44f/0x6c0
[ 7737.602181]  handle_mm_fault+0xc1/0x1e0
[ 7737.606019]  do_user_addr_fault+0x1b5/0x440
[ 7737.610207]  do_page_fault+0x37/0x130
[ 7737.613873]  ? page_fault+0x8/0x30
[ 7737.617277]  page_fault+0x1e/0x30
[ 7737.620589] RIP: 0033:0x7f77fbce9140

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 10, 2024
It is important to hold the dbuf mutex (db_mtx) when creating ZIOs in
dmu_read_abd(). The BP returned by dmu_buf_get_gp_from_dbuf() may
come from a previous direct IO write. In this case, it is attached to
a dirty record in the dbuf. When zio_read() is called, a copy of the
BP is made through io_bp_copy to io_bp in zio_create(). Without
holding the db_mtx, though, the dirty record may be freed in
dbuf_read_done(). This can result in garbage being placed in the BP
for the ZIO created through zio_read(). By holding the db_mtx, this
race can be avoided. Below is a stack trace of the issue that was
occurring in vdev_mirror_child_select() without holding the db_mtx
while creating the ZIO.
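The underlying rule is that a BP reachable through a dirty record must
be copied while the mutex protecting that record is held. A minimal
userspace analogue (pthreads; `fake_bp_t`, `fake_dbuf_t`, and
`snapshot_bp` are illustrative stand-ins, not the real blkptr_t or
dmu_buf_impl_t):

```c
#include <pthread.h>
#include <string.h>

/* Stand-in for a 128-byte block pointer. */
typedef struct { unsigned char data[128]; } fake_bp_t;

typedef struct {
	pthread_mutex_t db_mtx;
	fake_bp_t *dirty_bp; /* may be detached by another thread */
} fake_dbuf_t;

/* Returns 1 and fills *out with a stable copy, or 0 if the dirty
 * record is already gone. Holding db_mtx across the memcpy is what
 * prevents the copy from reading freed or reclaimed memory. */
static int
snapshot_bp(fake_dbuf_t *db, fake_bp_t *out)
{
	int valid = 0;

	pthread_mutex_lock(&db->db_mtx);
	if (db->dirty_bp != NULL) {
		memcpy(out, db->dirty_bp, sizeof (*out));
		valid = 1;
	}
	pthread_mutex_unlock(&db->db_mtx);
	return (valid);
}
```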

[29882.427056] VERIFY(zio->io_bp == NULL ||
BP_PHYSICAL_BIRTH(zio->io_bp) == txg) failed
[29882.434884] PANIC at vdev_mirror.c:545:vdev_mirror_child_select()
[29882.440976] Showing stack for process 1865540
[29882.445336] CPU: 57 PID: 1865540 Comm: fio Kdump: loaded Tainted: P
OE    --------- -  - 4.18.0-408.el8.x86_64 openzfs#1
[29882.456457] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21
10/08/2020
[29882.463844] Call Trace:
[29882.466296]  dump_stack+0x41/0x60
[29882.469618]  spl_panic+0xd0/0xe8 [spl]
[29882.473376]  ? __dprintf+0x10e/0x180 [zfs]
[29882.477674]  ? kfree+0xd3/0x250
[29882.480819]  ? __dprintf+0x10e/0x180 [zfs]
[29882.485103]  ? vdev_mirror_map_alloc+0x29/0x50 [zfs]
[29882.490250]  ? vdev_lookup_top+0x20/0x90 [zfs]
[29882.494878]  spl_assert+0x17/0x20 [zfs]
[29882.498893]  vdev_mirror_child_select+0x279/0x300 [zfs]
[29882.504289]  vdev_mirror_io_start+0x11f/0x2b0 [zfs]
[29882.509336]  zio_vdev_io_start+0x3ee/0x520 [zfs]
[29882.514137]  zio_nowait+0x105/0x290 [zfs]
[29882.518330]  dmu_read_abd+0x196/0x460 [zfs]
[29882.522691]  dmu_read_uio_direct+0x6d/0xf0 [zfs]
[29882.527472]  dmu_read_uio_dnode+0x12a/0x140 [zfs]
[29882.532345]  dmu_read_uio_dbuf+0x3f/0x60 [zfs]
[29882.536953]  zfs_read+0x238/0x3f0 [zfs]
[29882.540976]  zpl_iter_read_direct+0xe0/0x180 [zfs]
[29882.545952]  ? rrw_exit+0xc6/0x200 [zfs]
[29882.550058]  zpl_iter_read+0x90/0xb0 [zfs]
[29882.554340]  new_sync_read+0x10f/0x150
[29882.558094]  vfs_read+0x91/0x140
[29882.561325]  ksys_read+0x4f/0xb0
[29882.564557]  do_syscall_64+0x5b/0x1a0
[29882.568222]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[29882.573267] RIP: 0033:0x7f7fe0fa6ab4

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 10, 2024
There existed a race condition that was discovered through the
dio_random test. When doing fio with --fsync=32, fsync is called on
the file after every 32 writes. When this happens, blocks committed to
the ZIL will be synced out. However, the code for the O_DIRECT write
was updated in 31983d2 to always wait, if there is an associated ARC
buf with the dbuf, for all previous TXGs to sync out.

There was an oversight with this update. When waiting on previous TXGs
to sync out, the O_DIRECT write holds the rangelock as a writer the
entire time. This causes an issue when the ZIL commits writes out
through `zfs_get_data()`, because it will try to grab the rangelock as
a reader. This leads to a deadlock.

In order to fix this race condition, I updated the `dmu_buf_impl_t`
struct to contain a uint8_t variable that is used to signal whether
the dbuf attached to an O_DIRECT write is being held in a wait because
of mixed direct and buffered data. Using this new
`db_mixed_io_dio_wait` variable in the `dmu_buf_impl_t`, the code in
`zfs_get_data()` can tell that the rangelock is already being held
across the entire block and there is no need to grab the rangelock at
all. Because the rangelock is already held as a writer across the
entire block, no modifications can take place against the block as
long as the O_DIRECT write is stalled waiting in
`dmu_buf_direct_mixed_io_wait()`.

Also as part of this update, I realized the `db_state` in
`dmu_buf_direct_mixed_io_wait()` needs to be changed temporarily to
`DB_CACHED`. This is necessary so the logic in `dbuf_read()` is
correct if `dmu_sync_late_arrival()` is called by `dmu_sync()`. It is
completely valid to switch the `db_state` back to `DB_CACHED`, since
there is still an associated ARC buf that will not be freed until our
O_DIRECT write completes, which will only happen after it leaves
`dmu_buf_direct_mixed_io_wait()`.

Here is the stack trace of the deadlock that happened with
`dio_random.ksh` before this commit:
[ 5513.663402] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 7496.580415] INFO: task z_wr_int:1098000 blocked for more than 120
seconds.
[ 7496.585709]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7496.593349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7496.600839] task:z_wr_int        state:D stack:    0 pid:1098000
ppid:     2 flags:0x80004080
[ 7496.608622] Call Trace:
[ 7496.611770]  __schedule+0x2d1/0x870
[ 7496.615404]  schedule+0x55/0xf0
[ 7496.618866]  cv_wait_common+0x16d/0x280 [spl]
[ 7496.622910]  ? finish_wait+0x80/0x80
[ 7496.626601]  dmu_buf_direct_mixed_io_wait+0x73/0x190 [zfs]
[ 7496.631327]  dmu_write_direct_done+0x90/0x3a0 [zfs]
[ 7496.635798]  zio_done+0x373/0x1d40 [zfs]
[ 7496.639795]  zio_execute+0xee/0x210 [zfs]
[ 7496.643840]  taskq_thread+0x203/0x420 [spl]
[ 7496.647836]  ? wake_up_q+0x70/0x70
[ 7496.651411]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[ 7496.656489]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 7496.660604]  kthread+0x134/0x150
[ 7496.664092]  ? set_kthread_struct+0x50/0x50
[ 7496.668080]  ret_from_fork+0x35/0x40
[ 7496.671745] INFO: task txg_sync:1098025 blocked for more than 120
seconds.
[ 7496.676991]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7496.684666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7496.692060] task:txg_sync        state:D stack:    0 pid:1098025
ppid:     2 flags:0x80004080
[ 7496.699888] Call Trace:
[ 7496.703012]  __schedule+0x2d1/0x870
[ 7496.706658]  schedule+0x55/0xf0
[ 7496.710093]  schedule_timeout+0x197/0x300
[ 7496.713982]  ? __next_timer_interrupt+0xf0/0xf0
[ 7496.718135]  io_schedule_timeout+0x19/0x40
[ 7496.722049]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7496.726349]  ? finish_wait+0x80/0x80
[ 7496.730039]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7496.734100]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7496.738082]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 7496.742205]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7496.746534]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 7496.750842]  spa_sync_iterate_to_convergence+0xcf/0x310 [zfs]
[ 7496.755742]  spa_sync+0x362/0x8d0 [zfs]
[ 7496.759689]  txg_sync_thread+0x274/0x3b0 [zfs]
[ 7496.763928]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 7496.768439]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 7496.772799]  thread_generic_wrapper+0x63/0x90 [spl]
[ 7496.777097]  kthread+0x134/0x150
[ 7496.780616]  ? set_kthread_struct+0x50/0x50
[ 7496.784549]  ret_from_fork+0x35/0x40
[ 7496.788204] INFO: task fio:1101750 blocked for more than 120 seconds.
[ 7496.895852]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7496.903765] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7496.911170] task:fio             state:D stack:    0 pid:1101750
ppid:1101741 flags:0x00004080
[ 7496.919033] Call Trace:
[ 7496.922136]  __schedule+0x2d1/0x870
[ 7496.925769]  schedule+0x55/0xf0
[ 7496.929245]  schedule_timeout+0x197/0x300
[ 7496.933120]  ? __next_timer_interrupt+0xf0/0xf0
[ 7496.937213]  io_schedule_timeout+0x19/0x40
[ 7496.941126]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7496.945444]  ? finish_wait+0x80/0x80
[ 7496.949125]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7496.953191]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7496.957180]  dmu_write_abd+0x174/0x1c0 [zfs]
[ 7496.961319]  dmu_write_uio_direct+0x79/0xf0 [zfs]
[ 7496.965731]  dmu_write_uio_dnode+0xa6/0x2d0 [zfs]
[ 7496.970043]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 7496.974305]  zfs_write+0x55f/0xea0 [zfs]
[ 7496.978325]  ? iov_iter_get_pages+0xe9/0x390
[ 7496.982333]  ? trylock_page+0xd/0x20 [zfs]
[ 7496.986451]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7496.990713]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7496.995031]  zpl_iter_write_direct+0xda/0x170 [zfs]
[ 7496.999489]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.003476]  zpl_iter_write+0xd5/0x110 [zfs]
[ 7497.007610]  new_sync_write+0x112/0x160
[ 7497.011429]  vfs_write+0xa5/0x1b0
[ 7497.014916]  ksys_write+0x4f/0xb0
[ 7497.018443]  do_syscall_64+0x5b/0x1b0
[ 7497.022150]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.026532] RIP: 0033:0x7f8771d72a17
[ 7497.030195] Code: Unable to access opcode bytes at RIP
0x7f8771d729ed.
[ 7497.035263] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[ 7497.042547] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72a17
[ 7497.047933] RDX: 000000000009b000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7497.053269] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7497.058660] R10: 0000000000000000 R11: 0000000000000293 R12:
000000000009b000
[ 7497.063960] R13: 000055b390afcac0 R14: 000000000009b000 R15:
000055b390afcae8
[ 7497.069334] INFO: task fio:1101751 blocked for more than 120 seconds.
[ 7497.074308]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7497.081973] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7497.089371] task:fio             state:D stack:    0 pid:1101751
ppid:1101741 flags:0x00000080
[ 7497.097147] Call Trace:
[ 7497.100263]  __schedule+0x2d1/0x870
[ 7497.103897]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.107878]  schedule+0x55/0xf0
[ 7497.111386]  cv_wait_common+0x16d/0x280 [spl]
[ 7497.115391]  ? finish_wait+0x80/0x80
[ 7497.119028]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 7497.123667]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 7497.128240]  zfs_read+0xaf/0x3f0 [zfs]
[ 7497.132146]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.136091]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7497.140366]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7497.144679]  zpl_iter_read_direct+0xe0/0x180 [zfs]
[ 7497.149054]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.153040]  zpl_iter_read+0x94/0xb0 [zfs]
[ 7497.157103]  new_sync_read+0x10f/0x160
[ 7497.160855]  vfs_read+0x91/0x150
[ 7497.164336]  ksys_read+0x4f/0xb0
[ 7497.168004]  do_syscall_64+0x5b/0x1b0
[ 7497.171706]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.176105] RIP: 0033:0x7f8771d72ab4
[ 7497.179742] Code: Unable to access opcode bytes at RIP
0x7f8771d72a8a.
[ 7497.184807] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[ 7497.192129] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72ab4
[ 7497.197485] RDX: 0000000000002000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7497.202922] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7497.208309] R10: 00000001ffffffff R11: 0000000000000246 R12:
0000000000002000
[ 7497.213694] R13: 000055b390afcac0 R14: 0000000000002000 R15:
000055b390afcae8
[ 7497.219063] INFO: task fio:1101755 blocked for more than 120 seconds.
[ 7497.224098]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7497.231786] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7497.239165] task:fio             state:D stack:    0 pid:1101755
ppid:1101744 flags:0x00000080
[ 7497.246989] Call Trace:
[ 7497.250121]  __schedule+0x2d1/0x870
[ 7497.253779]  schedule+0x55/0xf0
[ 7497.257240]  schedule_preempt_disabled+0xa/0x10
[ 7497.261344]  __mutex_lock.isra.7+0x349/0x420
[ 7497.265326]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7497.269674]  zil_commit_writer+0x89/0x230 [zfs]
[ 7497.273938]  zil_commit_impl+0x5f/0xd0 [zfs]
[ 7497.278101]  zfs_fsync+0x81/0xa0 [zfs]
[ 7497.282002]  zpl_fsync+0xe5/0x140 [zfs]
[ 7497.285985]  do_fsync+0x38/0x70
[ 7497.289458]  __x64_sys_fsync+0x10/0x20
[ 7497.293208]  do_syscall_64+0x5b/0x1b0
[ 7497.296928]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.301260] RIP: 0033:0x7f9559073027
[ 7497.304920] Code: Unable to access opcode bytes at RIP
0x7f9559072ffd.
[ 7497.310015] RSP: 002b:00007ffdefcd0ff0 EFLAGS: 00000293 ORIG_RAX:
000000000000004a
[ 7497.317346] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9559073027
[ 7497.322722] RDX: 0000000000000000 RSI: 0000563adcbf2ac0 RDI:
0000000000000005
[ 7497.328126] RBP: 00007f94fb858000 R08: 0000000000000000 R09:
0000000000000000
[ 7497.333514] R10: 0000000000008000 R11: 0000000000000293 R12:
0000000000000003
[ 7497.338887] R13: 0000563adcbf2ac0 R14: 0000000000000000 R15:
0000563adcbf2ae8
[ 7497.344247] INFO: task fio:1101756 blocked for more than 120 seconds.
[ 7497.349327]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7497.357032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7497.364517] task:fio             state:D stack:    0 pid:1101756
ppid:1101744 flags:0x00004080
[ 7497.372310] Call Trace:
[ 7497.375433]  __schedule+0x2d1/0x870
[ 7497.379004]  schedule+0x55/0xf0
[ 7497.382454]  cv_wait_common+0x16d/0x280 [spl]
[ 7497.386477]  ? finish_wait+0x80/0x80
[ 7497.390137]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 7497.394816]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 7497.399397]  zfs_get_data+0x1a8/0x7e0 [zfs]
[ 7497.403515]  zil_lwb_commit+0x1a5/0x400 [zfs]
[ 7497.407712]  zil_lwb_write_close+0x408/0x630 [zfs]
[ 7497.412126]  zil_commit_waiter_timeout+0x16d/0x520 [zfs]
[ 7497.416801]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[ 7497.421139]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 7497.425294]  zfs_fsync+0x81/0xa0 [zfs]
[ 7497.429454]  zpl_fsync+0xe5/0x140 [zfs]
[ 7497.433396]  do_fsync+0x38/0x70
[ 7497.436878]  __x64_sys_fsync+0x10/0x20
[ 7497.440586]  do_syscall_64+0x5b/0x1b0
[ 7497.444313]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.448659] RIP: 0033:0x7f9559073027
[ 7497.452343] Code: Unable to access opcode bytes at RIP
0x7f9559072ffd.
[ 7497.457408] RSP: 002b:00007ffdefcd0ff0 EFLAGS: 00000293 ORIG_RAX:
000000000000004a
[ 7497.464724] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9559073027
[ 7497.470106] RDX: 0000000000000000 RSI: 0000563adcbf2ac0 RDI:
0000000000000005
[ 7497.475477] RBP: 00007f94fb89ca18 R08: 0000000000000000 R09:
0000000000000000
[ 7497.480806] R10: 00000000000b4cc0 R11: 0000000000000293 R12:
0000000000000003
[ 7497.486158] R13: 0000563adcbf2ac0 R14: 0000000000000000 R15:
0000563adcbf2ae8
[ 7619.459402] INFO: task z_wr_int:1098000 blocked for more than 120
seconds.
[ 7619.464605]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7619.472233] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.479659] task:z_wr_int        state:D stack:    0 pid:1098000
ppid:     2 flags:0x80004080
[ 7619.487518] Call Trace:
[ 7619.490650]  __schedule+0x2d1/0x870
[ 7619.494246]  schedule+0x55/0xf0
[ 7619.497719]  cv_wait_common+0x16d/0x280 [spl]
[ 7619.501749]  ? finish_wait+0x80/0x80
[ 7619.505411]  dmu_buf_direct_mixed_io_wait+0x73/0x190 [zfs]
[ 7619.510143]  dmu_write_direct_done+0x90/0x3a0 [zfs]
[ 7619.514603]  zio_done+0x373/0x1d40 [zfs]
[ 7619.518594]  zio_execute+0xee/0x210 [zfs]
[ 7619.522619]  taskq_thread+0x203/0x420 [spl]
[ 7619.526567]  ? wake_up_q+0x70/0x70
[ 7619.530208]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[ 7619.535302]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 7619.539385]  kthread+0x134/0x150
[ 7619.542873]  ? set_kthread_struct+0x50/0x50
[ 7619.546810]  ret_from_fork+0x35/0x40
[ 7619.550477] INFO: task txg_sync:1098025 blocked for more than 120
seconds.
[ 7619.555715]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7619.563415] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.570851] task:txg_sync        state:D stack:    0 pid:1098025
ppid:     2 flags:0x80004080
[ 7619.578606] Call Trace:
[ 7619.581742]  __schedule+0x2d1/0x870
[ 7619.585396]  schedule+0x55/0xf0
[ 7619.589006]  schedule_timeout+0x197/0x300
[ 7619.592916]  ? __next_timer_interrupt+0xf0/0xf0
[ 7619.597027]  io_schedule_timeout+0x19/0x40
[ 7619.600947]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7619.709878]  ? finish_wait+0x80/0x80
[ 7619.713565]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7619.717596]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7619.721567]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 7619.725657]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7619.730050]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 7619.734415]  spa_sync_iterate_to_convergence+0xcf/0x310 [zfs]
[ 7619.739268]  spa_sync+0x362/0x8d0 [zfs]
[ 7619.743270]  txg_sync_thread+0x274/0x3b0 [zfs]
[ 7619.747494]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 7619.751939]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 7619.756279]  thread_generic_wrapper+0x63/0x90 [spl]
[ 7619.760569]  kthread+0x134/0x150
[ 7619.764050]  ? set_kthread_struct+0x50/0x50
[ 7619.767978]  ret_from_fork+0x35/0x40
[ 7619.771639] INFO: task fio:1101750 blocked for more than 120 seconds.
[ 7619.776678]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7619.784324] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.791914] task:fio             state:D stack:    0 pid:1101750
ppid:1101741 flags:0x00004080
[ 7619.799712] Call Trace:
[ 7619.802816]  __schedule+0x2d1/0x870
[ 7619.806427]  schedule+0x55/0xf0
[ 7619.809867]  schedule_timeout+0x197/0x300
[ 7619.813760]  ? __next_timer_interrupt+0xf0/0xf0
[ 7619.817848]  io_schedule_timeout+0x19/0x40
[ 7619.821766]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7619.826097]  ? finish_wait+0x80/0x80
[ 7619.829780]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7619.833857]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7619.837838]  dmu_write_abd+0x174/0x1c0 [zfs]
[ 7619.842015]  dmu_write_uio_direct+0x79/0xf0 [zfs]
[ 7619.846388]  dmu_write_uio_dnode+0xa6/0x2d0 [zfs]
[ 7619.850760]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 7619.855011]  zfs_write+0x55f/0xea0 [zfs]
[ 7619.859008]  ? iov_iter_get_pages+0xe9/0x390
[ 7619.863036]  ? trylock_page+0xd/0x20 [zfs]
[ 7619.867084]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7619.871366]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7619.875715]  zpl_iter_write_direct+0xda/0x170 [zfs]
[ 7619.880164]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7619.884174]  zpl_iter_write+0xd5/0x110 [zfs]
[ 7619.888492]  new_sync_write+0x112/0x160
[ 7619.892285]  vfs_write+0xa5/0x1b0
[ 7619.895829]  ksys_write+0x4f/0xb0
[ 7619.899384]  do_syscall_64+0x5b/0x1b0
[ 7619.903071]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7619.907394] RIP: 0033:0x7f8771d72a17
[ 7619.911026] Code: Unable to access opcode bytes at RIP
0x7f8771d729ed.
[ 7619.916073] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[ 7619.923363] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72a17
[ 7619.928675] RDX: 000000000009b000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7619.934019] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7619.939354] R10: 0000000000000000 R11: 0000000000000293 R12:
000000000009b000
[ 7619.944775] R13: 000055b390afcac0 R14: 000000000009b000 R15:
000055b390afcae8
[ 7619.950175] INFO: task fio:1101751 blocked for more than 120 seconds.
[ 7619.955232]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7619.962889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.970301] task:fio             state:D stack:    0 pid:1101751
ppid:1101741 flags:0x00000080
[ 7619.978139] Call Trace:
[ 7619.981278]  __schedule+0x2d1/0x870
[ 7619.984872]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7619.989260]  schedule+0x55/0xf0
[ 7619.992725]  cv_wait_common+0x16d/0x280 [spl]
[ 7619.996754]  ? finish_wait+0x80/0x80
[ 7620.000414]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 7620.005050]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 7620.009617]  zfs_read+0xaf/0x3f0 [zfs]
[ 7620.013503]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7620.017489]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7620.021774]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7620.026091]  zpl_iter_read_direct+0xe0/0x180 [zfs]
[ 7620.030508]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7620.034497]  zpl_iter_read+0x94/0xb0 [zfs]
[ 7620.038579]  new_sync_read+0x10f/0x160
[ 7620.042325]  vfs_read+0x91/0x150
[ 7620.045809]  ksys_read+0x4f/0xb0
[ 7620.049273]  do_syscall_64+0x5b/0x1b0
[ 7620.052965]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7620.057354] RIP: 0033:0x7f8771d72ab4
[ 7620.060988] Code: Unable to access opcode bytes at RIP
0x7f8771d72a8a.
[ 7620.066041] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[ 7620.073256] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72ab4
[ 7620.078553] RDX: 0000000000002000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7620.083878] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7620.089353] R10: 00000001ffffffff R11: 0000000000000246 R12:
0000000000002000
[ 7620.094697] R13: 000055b390afcac0 R14: 0000000000002000 R15:
000055b390afcae8

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 10, 2024
995734e added a test for block cloning with mmap files. As a result I
began hitting a panic in that test in dbuf_unoverride(). The issue was
that if the dirty record was from a cloned block, then dr_data must be
set to NULL. This ASSERT was added in 86e115e. The point of that commit
was to
make sure that if a cloned block is read before it is synced out, then
the associated ARC buffer is set in the dirty record.

This became an issue with the O_DIRECT code, because dr_data was set to
the ARC buf in dbuf_set_data() after the read. This is the incorrect
logic for a cloned block, though. In order to fix this issue, I refined
how to determine whether the dirty record is in fact from an O_DIRECT
write by making sure that dr_brtwrite is false. I created the function
dbuf_dirty_is_direct_write() to perform the proper check.

As part of this, I also cleaned up other code that did the exact same
check for an O_DIRECT write to make sure the proper check is taking
place everywhere.

The trace of the ASSERT that was being tripped before this change is
below:
[3649972.811039] VERIFY0P(dr->dt.dl.dr_data) failed (NULL ==
ffff8d58e8183c80)
[3649972.817999] PANIC at dbuf.c:1999:dbuf_unoverride()
[3649972.822968] Showing stack for process 2365657
[3649972.827502] CPU: 0 PID: 2365657 Comm: clone_mmap_writ Kdump: loaded
Tainted: P           OE    --------- -  - 4.18.0-408.el8.x86_64 openzfs#1
[3649972.839749] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS
R21 10/08/2020
[3649972.847315] Call Trace:
[3649972.849935]  dump_stack+0x41/0x60
[3649972.853428]  spl_panic+0xd0/0xe8 [spl]
[3649972.857370]  ? cityhash4+0x75/0x90 [zfs]
[3649972.861649]  ? _cond_resched+0x15/0x30
[3649972.865577]  ? spl_kmem_alloc_impl+0xce/0xf0 [spl]
[3649972.870548]  ? __kmalloc_node+0x10d/0x300
[3649972.874735]  ? spl_kmem_alloc_impl+0xce/0xf0 [spl]
[3649972.879702]  ? __list_add+0x12/0x30 [zfs]
[3649972.884061]  dbuf_unoverride+0x1c1/0x1d0 [zfs]
[3649972.888856]  dbuf_redirty+0x3b/0xd0 [zfs]
[3649972.893204]  dbuf_dirty+0xeb1/0x1330 [zfs]
[3649972.897643]  ? _cond_resched+0x15/0x30
[3649972.901569]  ? mutex_lock+0xe/0x30
[3649972.905148]  ? dbuf_noread+0x117/0x240 [zfs]
[3649972.909760]  dmu_write_uio_dnode+0x1d2/0x320 [zfs]
[3649972.914900]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[3649972.919777]  zfs_write+0x57d/0xe00 [zfs]
[3649972.924076]  ? alloc_set_pte+0xb8/0x3e0
[3649972.928088]  zpl_iter_write_buffered+0xb2/0x120 [zfs]
[3649972.933507]  ? rrw_exit+0xc6/0x200 [zfs]
[3649972.937796]  zpl_iter_write+0xba/0x110 [zfs]
[3649972.942433]  new_sync_write+0x112/0x160
[3649972.946445]  vfs_write+0xa5/0x1a0
[3649972.949935]  ksys_pwrite64+0x61/0xa0
[3649972.953681]  do_syscall_64+0x5b/0x1a0
[3649972.957519]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[3649972.962745] RIP: 0033:0x7f610616f01b

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 10, 2024
Originally I was checking dr->dr_dbuf->db_level == 0 in
dbuf_dirty_is_direct_write(). However, this can lead to a NULL pointer
dereference if dr_dbuf is no longer set.

I updated dbuf_dirty_is_direct_write() to now also take a dmu_buf_impl_t
to check if db->db_level == 0. This failure was caught on the Fedora 37
CI running in test enospc_rm. Below is the stack trace.

[ 9851.511608] BUG: kernel NULL pointer dereference, address:
0000000000000068
[ 9851.515922] #PF: supervisor read access in kernel mode
[ 9851.519462] #PF: error_code(0x0000) - not-present page
[ 9851.522992] PGD 0 P4D 0
[ 9851.525684] Oops: 0000 [openzfs#1] PREEMPT SMP PTI
[ 9851.528878] CPU: 0 PID: 1272993 Comm: fio Tainted: P           OE
6.5.12-100.fc37.x86_64 openzfs#1
[ 9851.535266] Hardware name: Amazon EC2 m5d.large/, BIOS 1.0 10/16/2017
[ 9851.539226] RIP: 0010:dbuf_dirty_is_direct_write+0xb/0x40 [zfs]
[ 9851.543379] Code: 10 74 02 31 c0 5b c3 cc cc cc cc 0f 1f 40 00 90 90
90 90 90 90 90 90 90 90 90 90 90 90 90 90 31 c0 48 85 ff 74 31 48 8b 57
20 <80> 7a 68 00 75 27 8b 87 64 01 00 00 85 c0 75 1b 83 bf 58 01 00 00
[ 9851.554719] RSP: 0018:ffff9b5b8305f8e8 EFLAGS: 00010286
[ 9851.558276] RAX: 0000000000000000 RBX: ffff9b5b8569b0b8 RCX:
0000000000000000
[ 9851.562481] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff8f2e97de9e00
[ 9851.566672] RBP: 0000000000020000 R08: 0000000000000000 R09:
ffff8f2f70e94000
[ 9851.570851] R10: 0000000000000001 R11: 0000000000000110 R12:
ffff8f2f774ae4c0
[ 9851.575032] R13: 0000000000000000 R14: 0000000000000000 R15:
0000000000000000
[ 9851.579209] FS:  00007f57c5542240(0000) GS:ffff8f2faa800000(0000)
knlGS:0000000000000000
[ 9851.585357] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9851.589064] CR2: 0000000000000068 CR3: 00000001f9a38001 CR4:
00000000007706f0
[ 9851.593256] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 9851.597440] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 9851.601618] PKRU: 55555554
[ 9851.604341] Call Trace:
[ 9851.606981]  <TASK>
[ 9851.609515]  ? __die+0x23/0x70
[ 9851.612388]  ? page_fault_oops+0x171/0x4e0
[ 9851.615571]  ? exc_page_fault+0x77/0x170
[ 9851.618704]  ? asm_exc_page_fault+0x26/0x30
[ 9851.621900]  ? dbuf_dirty_is_direct_write+0xb/0x40 [zfs]
[ 9851.625828]  zfs_get_data+0x407/0x820 [zfs]
[ 9851.629400]  zil_lwb_commit+0x18d/0x3f0 [zfs]
[ 9851.633026]  zil_lwb_write_issue+0x92/0xbb0 [zfs]
[ 9851.636758]  zil_commit_waiter_timeout+0x1f3/0x580 [zfs]
[ 9851.640696]  zil_commit_waiter+0x1ff/0x3a0 [zfs]
[ 9851.644402]  zil_commit_impl+0x71/0xd0 [zfs]
[ 9851.647998]  zfs_write+0xb51/0xdc0 [zfs]
[ 9851.651467]  zpl_iter_write_buffered+0xc9/0x140 [zfs]
[ 9851.655337]  zpl_iter_write+0xc0/0x110 [zfs]
[ 9851.658920]  vfs_write+0x23e/0x420
[ 9851.661871]  __x64_sys_pwrite64+0x98/0xd0
[ 9851.665013]  do_syscall_64+0x5f/0x90
[ 9851.668027]  ? ksys_fadvise64_64+0x57/0xa0
[ 9851.671212]  ? syscall_exit_to_user_mode+0x2b/0x40
[ 9851.674594]  ? do_syscall_64+0x6b/0x90
[ 9851.677655]  ? syscall_exit_to_user_mode+0x2b/0x40
[ 9851.681051]  ? do_syscall_64+0x6b/0x90
[ 9851.684128]  ? exc_page_fault+0x77/0x170
[ 9851.687256]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 9851.690759] RIP: 0033:0x7f57c563c377

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 10, 2024
There existed a race condition between when a Direct I/O write could
complete and if a sync operation was issued. This was due to the fact
that a Direct I/O would sleep waiting on previous TXG's to sync out
their dirty records associated with a dbuf if there was an ARC buffer
associated with the dbuf. This was necessary to safely destroy the ARC
buffer in case a previous dirty record's dr_data pointed at the
db_buf. The main issue with this approach is that a Direct I/O write holds
the rangelock across the entire block, so when a sync on that same block
was issued and tried to grab the rangelock as reader, it would be
blocked indefinitely because the Direct I/O that was now sleeping was
holding that same rangelock as writer. This led to a complete deadlock.

This commit fixes this issue and removes the wait in
dmu_write_direct_done().

The way this is now handled is that the ARC buffer, if there is one
associated with the dbuf, is destroyed before ever issuing the Direct
I/O write. This implementation heavily borrows from the block cloning
implementation.

A new function dmu_buf_will_clone_or_dio() is called in both
dmu_write_direct() and dmu_brt_clone() that does the following:
1. Undirties a dirty record for that db if there is one currently
   associated with the current TXG.
2. Destroys the ARC buffer if the previous dirty record's dr_data does
   not point at the dbuf's ARC buffer (db_buf).
3. Sets the dbuf's data pointers to NULL.
4. Redirties the dbuf using db_state = DB_NOFILL.

As part of this commit, the dmu_write_direct_done() function was also
cleaned up. Now dmu_sync_done() is called before undirtying the dbuf
dirty record associated with a failed Direct I/O write. This is correct
logic and how it always should have been.

An additional benefit of these modifications is that there is no longer
a stall in a Direct I/O write if the user is mixing buffered and O_DIRECT
I/O. It also unifies the block cloning and Direct I/O write paths, as
they both need to call dbuf_fix_old_data() before destroying the ARC
buffer.

As part of this commit, there is also just general code cleanup. Various
dbuf stats were removed because they are no longer necessary.
Additionally, useless functions were removed to make the code paths
cleaner for Direct I/O.

Below is the race condition stack trace that was being consistently
observed in the CI runs for the dio_random test case that prompted
these changes:
[ 7795.294473] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 9954.769075] INFO: task z_wr_int:1051869 blocked for more than
120
seconds.
[ 9954.770512]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.772159] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.773848] task:z_wr_int        state:D stack:0
pid:1051869
ppid:2      flags:0x80004080
[ 9954.775512] Call Trace:
[ 9954.776406]  __schedule+0x2d1/0x870
[ 9954.777386]  ? free_one_page+0x204/0x530
[ 9954.778466]  schedule+0x55/0xf0
[ 9954.779355]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.780491]  ? finish_wait+0x80/0x80
[ 9954.781450]  dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs]
[ 9954.782889]  dmu_write_direct_done+0x90/0x3b0 [zfs]
[ 9954.784255]  zio_done+0x373/0x1d50 [zfs]
[ 9954.785410]  zio_execute+0xee/0x210 [zfs]
[ 9954.786588]  taskq_thread+0x205/0x3f0 [spl]
[ 9954.787673]  ? wake_up_q+0x60/0x60
[ 9954.788571]  ? zio_execute_stack_check.constprop.1+0x10/0x10
[zfs]
[ 9954.790079]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 9954.791199]  kthread+0x134/0x150
[ 9954.792082]  ? set_kthread_struct+0x50/0x50
[ 9954.793189]  ret_from_fork+0x35/0x40
[ 9954.794108] INFO: task txg_sync:1051894 blocked for more than
120
seconds.
[ 9954.795535]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.797103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.798669] task:txg_sync        state:D stack:0
pid:1051894
ppid:2      flags:0x80004080
[ 9954.800267] Call Trace:
[ 9954.801096]  __schedule+0x2d1/0x870
[ 9954.801972]  ? __wake_up_common+0x7a/0x190
[ 9954.802963]  schedule+0x55/0xf0
[ 9954.803884]  schedule_timeout+0x19f/0x320
[ 9954.804837]  ? __next_timer_interrupt+0xf0/0xf0
[ 9954.805932]  ? taskq_dispatch+0xab/0x280 [spl]
[ 9954.806959]  io_schedule_timeout+0x19/0x40
[ 9954.807989]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 9954.809110]  ? finish_wait+0x80/0x80
[ 9954.810068]  __cv_timedwait_io+0x15/0x20 [spl]
[ 9954.811103]  zio_wait+0x1ad/0x4f0 [zfs]
[ 9954.812255]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 9954.813442]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 9954.814648]  spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
[ 9954.816023]  spa_sync+0x362/0x8f0 [zfs]
[ 9954.817110]  txg_sync_thread+0x27a/0x3b0 [zfs]
[ 9954.818267]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 9954.819510]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 9954.820643]  thread_generic_wrapper+0x63/0x90 [spl]
[ 9954.821709]  kthread+0x134/0x150
[ 9954.822590]  ? set_kthread_struct+0x50/0x50
[ 9954.823584]  ret_from_fork+0x35/0x40
[ 9954.824444] INFO: task fio:1055501 blocked for more than 120
seconds.
[ 9954.825781]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.827315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.828871] task:fio             state:D stack:0
pid:1055501
ppid:1055490 flags:0x00004080
[ 9954.830463] Call Trace:
[ 9954.831280]  __schedule+0x2d1/0x870
[ 9954.832159]  ? dbuf_hold_copy+0xec/0x230 [zfs]
[ 9954.833396]  schedule+0x55/0xf0
[ 9954.834286]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.835291]  ? finish_wait+0x80/0x80
[ 9954.836235]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 9954.837543]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 9954.838838]  zfs_get_data+0x566/0x810 [zfs]
[ 9954.840034]  zil_lwb_commit+0x194/0x3f0 [zfs]
[ 9954.841154]  zil_lwb_write_issue+0x68/0xb90 [zfs]
[ 9954.842367]  ? __list_add+0x12/0x30 [zfs]
[ 9954.843496]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 9954.844665]  ? zil_alloc_lwb+0x217/0x360 [zfs]
[ 9954.845852]  zil_commit_waiter_timeout+0x1f3/0x570 [zfs]
[ 9954.847203]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[ 9954.848380]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 9954.849550]  zfs_fsync+0x66/0x90 [zfs]
[ 9954.850640]  zpl_fsync+0xe5/0x140 [zfs]
[ 9954.851729]  do_fsync+0x38/0x70
[ 9954.852585]  __x64_sys_fsync+0x10/0x20
[ 9954.853486]  do_syscall_64+0x5b/0x1b0
[ 9954.854416]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.855466] RIP: 0033:0x7eff236bb057
[ 9954.856388] Code: Unable to access opcode bytes at RIP
0x7eff236bb02d.
[ 9954.857651] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293
ORIG_RAX:
000000000000004a
[ 9954.859141] RAX: ffffffffffffffda RBX: 0000000000000006 RCX:
00007eff236bb057
[ 9954.860496] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI:
0000000000000006
[ 9954.861945] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09:
0000000000000000
[ 9954.863327] R10: 0000000000056000 R11: 0000000000000293 R12:
0000000000000003
[ 9954.864765] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15:
000055e4d1f13ae8
[ 9954.866149] INFO: task fio:1055502 blocked for more than 120
seconds.
[ 9954.867490]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.869029] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.870571] task:fio             state:D stack:0
pid:1055502
ppid:1055490 flags:0x00004080
[ 9954.872162] Call Trace:
[ 9954.872947]  __schedule+0x2d1/0x870
[ 9954.873844]  schedule+0x55/0xf0
[ 9954.874716]  schedule_timeout+0x19f/0x320
[ 9954.875645]  ? __next_timer_interrupt+0xf0/0xf0
[ 9954.876722]  io_schedule_timeout+0x19/0x40
[ 9954.877677]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 9954.878822]  ? finish_wait+0x80/0x80
[ 9954.879694]  __cv_timedwait_io+0x15/0x20 [spl]
[ 9954.880763]  zio_wait+0x1ad/0x4f0 [zfs]
[ 9954.881865]  dmu_write_abd+0x174/0x1c0 [zfs]
[ 9954.883074]  dmu_write_uio_direct+0x79/0x100 [zfs]
[ 9954.884285]  dmu_write_uio_dnode+0xb2/0x320 [zfs]
[ 9954.885507]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 9954.886687]  zfs_write+0x581/0xe20 [zfs]
[ 9954.887822]  ? iov_iter_get_pages+0xe9/0x390
[ 9954.888862]  ? trylock_page+0xd/0x20 [zfs]
[ 9954.890005]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 9954.891217]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 9954.892391]  zpl_iter_write_direct+0xd4/0x170 [zfs]
[ 9954.893663]  ? rrw_exit+0xc6/0x200 [zfs]
[ 9954.894764]  zpl_iter_write+0xd5/0x110 [zfs]
[ 9954.895911]  new_sync_write+0x112/0x160
[ 9954.896881]  vfs_write+0xa5/0x1b0
[ 9954.897701]  ksys_write+0x4f/0xb0
[ 9954.898569]  do_syscall_64+0x5b/0x1b0
[ 9954.899417]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.900515] RIP: 0033:0x7eff236baa47
[ 9954.901363] Code: Unable to access opcode bytes at RIP
0x7eff236baa1d.
[ 9954.902673] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293
ORIG_RAX:
0000000000000001
[ 9954.904099] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007eff236baa47
[ 9954.905535] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI:
0000000000000005
[ 9954.906902] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09:
0000000000000000
[ 9954.908339] R10: 0000000000000000 R11: 0000000000000293 R12:
00000000000e4000
[ 9954.909705] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15:
000055e4d1f13ae8
[ 9954.911129] INFO: task fio:1055504 blocked for more than 120
seconds.
[ 9954.912381]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.913978] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.915434] task:fio             state:D stack:0
pid:1055504
ppid:1055493 flags:0x00000080
[ 9954.917082] Call Trace:
[ 9954.917773]  __schedule+0x2d1/0x870
[ 9954.918648]  ? zilog_dirty+0x4f/0xc0 [zfs]
[ 9954.919831]  schedule+0x55/0xf0
[ 9954.920717]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.921704]  ? finish_wait+0x80/0x80
[ 9954.922639]  zfs_rangelock_enter_writer+0x46/0x1c0 [zfs]
[ 9954.923940]  zfs_rangelock_enter_impl+0x12a/0x1b0 [zfs]
[ 9954.925306]  zfs_write+0x703/0xe20 [zfs]
[ 9954.926406]  zpl_iter_write_buffered+0xb2/0x120 [zfs]
[ 9954.927687]  ? rrw_exit+0xc6/0x200 [zfs]
[ 9954.928821]  zpl_iter_write+0xbe/0x110 [zfs]
[ 9954.930028]  new_sync_write+0x112/0x160
[ 9954.930913]  vfs_write+0xa5/0x1b0
[ 9954.931758]  ksys_write+0x4f/0xb0
[ 9954.932666]  do_syscall_64+0x5b/0x1b0
[ 9954.933544]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.934689] RIP: 0033:0x7fcaee8f0a47
[ 9954.935551] Code: Unable to access opcode bytes at RIP
0x7fcaee8f0a1d.
[ 9954.936893] RSP: 002b:00007fff56b2c240 EFLAGS: 00000293
ORIG_RAX:
0000000000000001
[ 9954.938327] RAX: ffffffffffffffda RBX: 0000000000000006 RCX:
00007fcaee8f0a47
[ 9954.939777] RDX: 000000000001d000 RSI: 00007fca8300b010 RDI:
0000000000000006
[ 9954.941187] RBP: 00007fca8300b010 R08: 0000000000000000 R09:
0000000000000000
[ 9954.942655] R10: 0000000000000000 R11: 0000000000000293 R12:
000000000001d000
[ 9954.944062] R13: 0000557a2006bac0 R14: 000000000001d000 R15:
0000557a2006bae8
[ 9954.945525] INFO: task fio:1055505 blocked for more than 120
seconds.
[ 9954.946819]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.948466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.949959] task:fio             state:D stack:0
pid:1055505
ppid:1055493 flags:0x00004080
[ 9954.951653] Call Trace:
[ 9954.952417]  __schedule+0x2d1/0x870
[ 9954.953393]  ? finish_wait+0x3e/0x80
[ 9954.954315]  schedule+0x55/0xf0
[ 9954.955212]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.956211]  ? finish_wait+0x80/0x80
[ 9954.957159]  zil_commit_waiter+0xfa/0x3b0 [zfs]
[ 9954.958343]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 9954.959524]  zfs_fsync+0x66/0x90 [zfs]
[ 9954.960626]  zpl_fsync+0xe5/0x140 [zfs]
[ 9954.961763]  do_fsync+0x38/0x70
[ 9954.962638]  __x64_sys_fsync+0x10/0x20
[ 9954.963520]  do_syscall_64+0x5b/0x1b0
[ 9954.964470]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.965567] RIP: 0033:0x7fcaee8f1057
[ 9954.966490] Code: Unable to access opcode bytes at RIP
0x7fcaee8f102d.
[ 9954.967752] RSP: 002b:00007fff56b2c230 EFLAGS: 00000293
ORIG_RAX:
000000000000004a
[ 9954.969260] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007fcaee8f1057
[ 9954.970628] RDX: 0000000000000000 RSI: 0000557a2006bac0 RDI:
0000000000000005
[ 9954.972092] RBP: 00007fca84152a18 R08: 0000000000000000 R09:
0000000000000000
[ 9954.973484] R10: 0000000000035000 R11: 0000000000000293 R12:
0000000000000003
[ 9954.974958] R13: 0000557a2006bac0 R14: 0000000000000000 R15:
0000557a2006bae8
[10077.648150] INFO: task z_wr_int:1051869 blocked for more than
120
seconds.
[10077.649541]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[10077.651116] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10077.652782] task:z_wr_int        state:D stack:0
pid:1051869
ppid:2      flags:0x80004080
[10077.654420] Call Trace:
[10077.655267]  __schedule+0x2d1/0x870
[10077.656179]  ? free_one_page+0x204/0x530
[10077.657192]  schedule+0x55/0xf0
[10077.658004]  cv_wait_common+0x16d/0x280 [spl]
[10077.659018]  ? finish_wait+0x80/0x80
[10077.660013]  dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs]
[10077.661396]  dmu_write_direct_done+0x90/0x3b0 [zfs]
[10077.662617]  zio_done+0x373/0x1d50 [zfs]
[10077.663783]  zio_execute+0xee/0x210 [zfs]
[10077.664921]  taskq_thread+0x205/0x3f0 [spl]
[10077.665982]  ? wake_up_q+0x60/0x60
[10077.666842]  ? zio_execute_stack_check.constprop.1+0x10/0x10
[zfs]
[10077.668295]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[10077.669360]  kthread+0x134/0x150
[10077.670191]  ? set_kthread_struct+0x50/0x50
[10077.671209]  ret_from_fork+0x35/0x40
[10077.672076] INFO: task txg_sync:1051894 blocked for more than
120
seconds.
[10077.673467]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[10077.675112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10077.676612] task:txg_sync        state:D stack:0
pid:1051894
ppid:2      flags:0x80004080
[10077.678288] Call Trace:
[10077.679024]  __schedule+0x2d1/0x870
[10077.679948]  ? __wake_up_common+0x7a/0x190
[10077.681042]  schedule+0x55/0xf0
[10077.681899]  schedule_timeout+0x19f/0x320
[10077.682951]  ? __next_timer_interrupt+0xf0/0xf0
[10077.684005]  ? taskq_dispatch+0xab/0x280 [spl]
[10077.685085]  io_schedule_timeout+0x19/0x40
[10077.686080]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[10077.687227]  ? finish_wait+0x80/0x80
[10077.688123]  __cv_timedwait_io+0x15/0x20 [spl]
[10077.689206]  zio_wait+0x1ad/0x4f0 [zfs]
[10077.690300]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[10077.691435]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[10077.692636]  spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
[10077.693997]  spa_sync+0x362/0x8f0 [zfs]
[10077.695112]  txg_sync_thread+0x27a/0x3b0 [zfs]
[10077.696239]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[10077.697512]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[10077.698639]  thread_generic_wrapper+0x63/0x90 [spl]
[10077.699687]  kthread+0x134/0x150
[10077.700567]  ? set_kthread_struct+0x50/0x50
[10077.701502]  ret_from_fork+0x35/0x40
[10077.702430] INFO: task fio:1055501 blocked for more than 120
seconds.
[10077.703697]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[10077.705309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10077.706780] task:fio             state:D stack:0
pid:1055501
ppid:1055490 flags:0x00004080
[10077.708479] Call Trace:
[10077.709231]  __schedule+0x2d1/0x870
[10077.710190]  ? dbuf_hold_copy+0xec/0x230 [zfs]
[10077.711368]  schedule+0x55/0xf0
[10077.712286]  cv_wait_common+0x16d/0x280 [spl]
[10077.713316]  ? finish_wait+0x80/0x80
[10077.714262]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[10077.715566]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[10077.716878]  zfs_get_data+0x566/0x810 [zfs]
[10077.718032]  zil_lwb_commit+0x194/0x3f0 [zfs]
[10077.719234]  zil_lwb_write_issue+0x68/0xb90 [zfs]
[10077.720413]  ? __list_add+0x12/0x30 [zfs]
[10077.721525]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[10077.722708]  ? zil_alloc_lwb+0x217/0x360 [zfs]
[10077.723931]  zil_commit_waiter_timeout+0x1f3/0x570 [zfs]
[10077.725273]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[10077.726438]  zil_commit_impl+0x6d/0xd0 [zfs]
[10077.727586]  zfs_fsync+0x66/0x90 [zfs]
[10077.728675]  zpl_fsync+0xe5/0x140 [zfs]
[10077.729755]  do_fsync+0x38/0x70
[10077.730607]  __x64_sys_fsync+0x10/0x20
[10077.731482]  do_syscall_64+0x5b/0x1b0
[10077.732415]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[10077.733487] RIP: 0033:0x7eff236bb057
[10077.734399] Code: Unable to access opcode bytes at RIP
0x7eff236bb02d.
[10077.735657] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293
ORIG_RAX:
000000000000004a
[10077.737163] RAX: ffffffffffffffda RBX: 0000000000000006 RCX:
00007eff236bb057
[10077.738526] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI:
0000000000000006
[10077.739966] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09:
0000000000000000
[10077.741336] R10: 0000000000056000 R11: 0000000000000293 R12:
0000000000000003
[10077.742773] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15:
000055e4d1f13ae8
[10077.744168] INFO: task fio:1055502 blocked for more than 120
seconds.
[10077.745505]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[10077.747073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10077.748642] task:fio             state:D stack:0
pid:1055502
ppid:1055490 flags:0x00004080
[10077.750233] Call Trace:
[10077.751011]  __schedule+0x2d1/0x870
[10077.751915]  schedule+0x55/0xf0
[10077.752811]  schedule_timeout+0x19f/0x320
[10077.753762]  ? __next_timer_interrupt+0xf0/0xf0
[10077.754824]  io_schedule_timeout+0x19/0x40
[10077.755782]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[10077.756922]  ? finish_wait+0x80/0x80
[10077.757788]  __cv_timedwait_io+0x15/0x20 [spl]
[10077.758845]  zio_wait+0x1ad/0x4f0 [zfs]
[10077.759941]  dmu_write_abd+0x174/0x1c0 [zfs]
[10077.761144]  dmu_write_uio_direct+0x79/0x100 [zfs]
[10077.762327]  dmu_write_uio_dnode+0xb2/0x320 [zfs]
[10077.763523]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[10077.764749]  zfs_write+0x581/0xe20 [zfs]
[10077.765825]  ? iov_iter_get_pages+0xe9/0x390
[10077.766842]  ? trylock_page+0xd/0x20 [zfs]
[10077.767956]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[10077.769189]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[10077.770343]  zpl_iter_write_direct+0xd4/0x170 [zfs]
[10077.771570]  ? rrw_exit+0xc6/0x200 [zfs]
[10077.772674]  zpl_iter_write+0xd5/0x110 [zfs]
[10077.773834]  new_sync_write+0x112/0x160
[10077.774805]  vfs_write+0xa5/0x1b0
[10077.775634]  ksys_write+0x4f/0xb0
[10077.776526]  do_syscall_64+0x5b/0x1b0
[10077.777386]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[10077.778488] RIP: 0033:0x7eff236baa47
[10077.779339] Code: Unable to access opcode bytes at RIP
0x7eff236baa1d.
[10077.780655] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293
ORIG_RAX:
0000000000000001
[10077.782056] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007eff236baa47
[10077.783507] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI:
0000000000000005
[10077.784890] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09:
0000000000000000
[10077.786303] R10: 0000000000000000 R11: 0000000000000293 R12:
00000000000e4000
[10077.787637] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15:
000055e4d1f13ae8

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 12, 2024
Varada (varada.kari@gmail.com) pointed out an issue with O_DIRECT reads
with the following test case:

dd if=/dev/urandom of=/local_zpool/file2 bs=512 count=79
truncate -s 40382 /local_zpool/file2
zpool export local_zpool
zpool import -d ~/ local_zpool
dd if=/local_zpool/file2 of=/dev/null bs=1M iflag=direct

That led to the following panic:

[  307.769267] VERIFY(IS_P2ALIGNED(size, sizeof (uint32_t))) failed
[  307.782997] PANIC at zfs_fletcher.c:870:abd_fletcher_4_iter()
[  307.788743] Showing stack for process 9665
[  307.792834] CPU: 47 PID: 9665 Comm: z_rd_int_5 Kdump: loaded Tainted:
P           OE    --------- -  - 4.18.0-408.el8.x86_64 openzfs#1
[  307.804298] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21
10/08/2020
[  307.811682] Call Trace:
[  307.814131]  dump_stack+0x41/0x60
[  307.817449]  spl_panic+0xd0/0xe8 [spl]
[  307.821210]  ? irq_work_queue+0x9/0x20
[  307.824961]  ? wake_up_klogd.part.30+0x30/0x40
[  307.829407]  ? vprintk_emit+0x125/0x250
[  307.833246]  ? printk+0x58/0x6f
[  307.836391]  spl_assert.constprop.1+0x16/0x20 [zfs]
[  307.841438]  abd_fletcher_4_iter+0x6c/0x101 [zfs]
[  307.846343]  ? abd_fletcher_4_simd2scalar+0x83/0x83 [zfs]
[  307.851922]  abd_iterate_func+0xb1/0x170 [zfs]
[  307.856533]  abd_fletcher_4_impl+0x3f/0xa0 [zfs]
[  307.861334]  abd_fletcher_4_native+0x52/0x70 [zfs]
[  307.866302]  ? enqueue_entity+0xf1/0x6e0
[  307.870226]  ? select_idle_sibling+0x23/0x700
[  307.874587]  ? enqueue_task_fair+0x94/0x710
[  307.878771]  ? select_task_rq_fair+0x351/0x990
[  307.883208]  zio_checksum_error_impl+0xff/0x5f0 [zfs]
[  307.888435]  ? abd_fletcher_4_impl+0xa0/0xa0 [zfs]
[  307.893401]  ? spl_kmem_alloc_impl+0xce/0xf0 [spl]
[  307.898203]  ? __wake_up_common+0x7a/0x190
[  307.902300]  ? __switch_to_asm+0x41/0x70
[  307.906220]  ? __switch_to_asm+0x35/0x70
[  307.910145]  ? __switch_to_asm+0x41/0x70
[  307.914061]  ? __switch_to_asm+0x35/0x70
[  307.917980]  ? __switch_to_asm+0x41/0x70
[  307.921903]  ? __switch_to_asm+0x35/0x70
[  307.925821]  ? __switch_to_asm+0x35/0x70
[  307.929739]  ? __switch_to_asm+0x41/0x70
[  307.933658]  ? __switch_to_asm+0x35/0x70
[  307.937582]  zio_checksum_error+0x47/0xc0 [zfs]
[  307.942288]  raidz_checksum_verify+0x3a/0x70 [zfs]
[  307.947257]  vdev_raidz_io_done+0x4b/0x160 [zfs]
[  307.952049]  zio_vdev_io_done+0x7f/0x200 [zfs]
[  307.956669]  zio_execute+0xee/0x210 [zfs]
[  307.960855]  taskq_thread+0x203/0x420 [spl]
[  307.965048]  ? wake_up_q+0x70/0x70
[  307.968455]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[  307.974807]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[  307.979260]  kthread+0x10a/0x120
[  307.982485]  ? set_kthread_struct+0x40/0x40
[  307.986670]  ret_from_fork+0x35/0x40

This was occurring because the zpool export forced the initial O_DIRECT
read to go down to disk. In this case the request was still valid, as
bs=1M is page-size aligned; however, the file length was not. So when
issuing the O_DIRECT read, even after calling make_abd_for_dbuf() we had
an extra page allocated in the original ABD along with the linear ABD
attached at the end of the gang ABD from make_abd_for_dbuf().

This is an issue because our expectation for reads is that the block
sizes being read are page aligned. Our check only validates the request
itself, not the actual size of the data we may read, such as the entire
file.

In order to remedy this situation, I updated zfs_read() to attempt to
read as much as it can using O_DIRECT based on whether the length is
page-size aligned. Any additional bytes that are requested are then
read into the ARC. This still keeps our semantics that I/O requests
must be page-size aligned.

There is a drawback here if only a single block is being read. In that
case the block will be read twice: once using O_DIRECT and then again
buffered to fill in the remaining data for the user's request. However,
this should not be a big issue most of the time. In the normal case a
user will ask for a lot of data from a file and only the stray bytes at
the end of the file will have to be read using the ARC.
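The split described above is just page-alignment arithmetic. A minimal
sketch (in Python, assuming a 4 KiB page size; the helper name is mine,
not an actual zfs_read() function):

```python
PAGE_SIZE = 4096  # assumed page size; the kernel value is PAGE_SIZE

def split_direct_read(length):
    """Divide a read of `length` bytes into a page-aligned portion that
    can go down the O_DIRECT path and a remainder that must be read
    buffered (through the ARC). Hypothetical helper for illustration."""
    buffered_len = length % PAGE_SIZE           # stray tail bytes
    direct_len = length - buffered_len          # whole pages only
    return direct_len, buffered_len
```

For the 40382-byte file in the test case above, this would send
36864 bytes (nine whole pages) down the O_DIRECT path and read the
remaining 3518 bytes through the ARC.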

In order to make sure this case was completely covered, I added a new
ZTS test case, dio_unaligned_filesize, to test this out. The main thing
with that test case is that the single O_DIRECT read request is split
into multiple reads: O_DIRECT reads for the page-aligned portion and a
final buffered read for the remaining requested bytes.

As part of this commit, I also updated stride_dd to take an additional
flag, -e, which says to read the entire input file and ignore the count
(-c) option. We need to use stride_dd for FreeBSD, as dd does not make
sure the buffer is page aligned. This update to stride_dd allows us to
use it to test out this case in dio_unaligned_filesize on both Linux
and FreeBSD.

While this may not be the most elegant solution, it does stick with the
semantics and still reads all the data the user requested. I am fine
with revisiting this; maybe we should just return a short read?

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 12, 2024
We were using the generic Linux calls to make sure that the page cache
was cleaned out before issuing any Direct I/O reads or writes. However,
this only matters in the event the file region being written/read with
O_DIRECT was mmap'ed. One of the stipulations with O_DIRECT is that it
is redirected through the ARC in the event the file range is mmap'ed.
Because of this, it did not make sense to try and invalidate the page
cache if we were never intending O_DIRECT to work with mmap'ed regions.
Also, calls into the generic Linux code in the write path would often
lead to lockups, as the page lock is dropped in zfs_putpage(). See the
stack dump below. To prevent this, we no longer use the generic Linux
direct I/O wrappers or try to flush out the page cache.

Instead, if we find the file range has been mmap'ed in since the
initial check in zfs_setup_direct(), we now handle that directly in
zfs_read() and zfs_write(). In most cases zfs_setup_direct() will
prevent O_DIRECT to mmap'ed regions of the file that have been page
faulted in, but if that happens while we are issuing the direct I/O
request, the normal ZFS paths will be taken to account for it.

It is highly suggested not to mmap a region of a file and then write or
read directly to that file. In general, that is kind of an insane thing
to do... However, we try our best to still have consistency with the
ARC.

Also, before making this decision I did explore whether we could just
add a rangelock in zfs_fillpage(), but we cannot do that. The reason is
that by the time the page reaches zfs_readpage_common() it has already
been locked by the kernel. So, if we try to grab the rangelock anywhere
in that path we can get stuck if another thread is issuing writes to
the mmap'ed file region: update_pages() holds the rangelock and then
tries to lock the page, while zfs_fillpage() holds the page lock and
waits on the rangelock. Deadlock is unavoidable in this case.

[260136.244332] INFO: task fio:3791107 blocked for more than 120
seconds.
[260136.250867]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260136.258693] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260136.266607] task:fio             state:D stack:    0 pid:3791107
ppid:3790841 flags:0x00004080
[260136.275306] Call Trace:
[260136.277845]  __schedule+0x2d1/0x830
[260136.281432]  schedule+0x35/0xa0
[260136.284665]  io_schedule+0x12/0x40
[260136.288157]  wait_on_page_bit+0x123/0x220
[260136.292258]  ? xas_load+0x8/0x80
[260136.295577]  ? file_fdatawait_range+0x20/0x20
[260136.300024]  filemap_page_mkwrite+0x9b/0xb0
[260136.304295]  do_page_mkwrite+0x53/0x90
[260136.308135]  ? vm_normal_page+0x1a/0xc0
[260136.312062]  do_wp_page+0x298/0x350
[260136.315640]  __handle_mm_fault+0x44f/0x6c0
[260136.319826]  ? __switch_to_asm+0x41/0x70
[260136.323839]  handle_mm_fault+0xc1/0x1e0
[260136.327766]  do_user_addr_fault+0x1b5/0x440
[260136.332038]  do_page_fault+0x37/0x130
[260136.335792]  ? page_fault+0x8/0x30
[260136.339284]  page_fault+0x1e/0x30
[260136.342689] RIP: 0033:0x7f6deee7f1b4
[260136.346361] Code: Unable to access opcode bytes at RIP
0x7f6deee7f18a.
[260136.352977] RSP: 002b:00007fffe41b6538 EFLAGS: 00010202
[260136.358288] RAX: 00007f6d83049000 RBX: 0000556b63614ec0 RCX:
00007f6d83148fe0
[260136.365508] RDX: 00000000000acfe0 RSI: 00007f6d84e9c030 RDI:
00007f6d8309bfa0
[260136.372730] RBP: 00007f6d84f4a000 R08: ffffffffffffffe0 R09:
0000000000000000
[260136.379946] R10: 00007f6d84f8e810 R11: 00007f6d83049000 R12:
0000000000000001
[260136.387167] R13: 0000556b63614ec0 R14: 0000000000100000 R15:
0000556b63614ee8
[260136.394387] INFO: task fio:3791108 blocked for more than 120
seconds.
[260136.400911]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260136.408739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260136.416651] task:fio             state:D stack:    0 pid:3791108
ppid:3790835 flags:0x00004080
[260136.425343] Call Trace:
[260136.427883]  __schedule+0x2d1/0x830
[260136.431463]  ? cv_wait_common+0x12d/0x240 [spl]
[260136.436091]  schedule+0x35/0xa0
[260136.439321]  io_schedule+0x12/0x40
[260136.442814]  __lock_page+0x12d/0x230
[260136.446483]  ? file_fdatawait_range+0x20/0x20
[260136.450929]  zfs_putpage+0x148/0x590 [zfs]
[260136.455322]  ? rmap_walk_file+0x116/0x290
[260136.459421]  ? __mod_memcg_lruvec_state+0x5d/0x160
[260136.464300]  zpl_putpage+0x67/0xd0 [zfs]
[260136.468495]  write_cache_pages+0x197/0x420
[260136.472679]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260136.477732]  zpl_writepages+0x119/0x130 [zfs]
[260136.482352]  do_writepages+0xc2/0x1c0
[260136.486103]  ? flush_tlb_func_common.constprop.9+0x125/0x220
[260136.491850]  __filemap_fdatawrite_range+0xc7/0x100
[260136.496732]  filemap_write_and_wait_range+0x30/0x80
[260136.501695]  generic_file_direct_write+0x120/0x160
[260136.506575]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260136.510779]  zpl_iter_write+0xdd/0x160 [zfs]
[260136.515323]  new_sync_write+0x112/0x160
[260136.519255]  vfs_write+0xa5/0x1a0
[260136.522662]  ksys_write+0x4f/0xb0
[260136.526067]  do_syscall_64+0x5b/0x1a0
[260136.529818]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260136.534959] RIP: 0033:0x7f9d192c7a17
[260136.538625] Code: Unable to access opcode bytes at RIP
0x7f9d192c79ed.
[260136.545236] RSP: 002b:00007ffc8e4ba270 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[260136.552889] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9d192c7a17
[260136.560108] RDX: 0000000000100000 RSI: 00007f9caea03000 RDI:
0000000000000005
[260136.567329] RBP: 00007f9caea03000 R08: 0000000000000000 R09:
0000000000000000
[260136.574548] R10: 00005558e8975680 R11: 0000000000000293 R12:
0000000000100000
[260136.581767] R13: 00005558e8985ec0 R14: 0000000000100000 R15:
00005558e8985ee8
[260136.588989] INFO: task fio:3791109 blocked for more than 120
seconds.
[260136.595513]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260136.603337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260136.611250] task:fio             state:D stack:    0 pid:3791109
ppid:3790838 flags:0x00004080
[260136.619943] Call Trace:
[260136.622483]  __schedule+0x2d1/0x830
[260136.626064]  ? zfs_znode_held+0xe6/0x140 [zfs]
[260136.630777]  schedule+0x35/0xa0
[260136.634009]  cv_wait_common+0x153/0x240 [spl]
[260136.638466]  ? finish_wait+0x80/0x80
[260136.642129]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[260136.647712]  zfs_rangelock_enter_impl+0xbf/0x170 [zfs]
[260136.653121]  zfs_get_data+0x113/0x770 [zfs]
[260136.657567]  zil_lwb_commit+0x537/0x780 [zfs]
[260136.662187]  zil_process_commit_list+0x14c/0x460 [zfs]
[260136.667585]  zil_commit_writer+0xeb/0x160 [zfs]
[260136.672376]  zil_commit_impl+0x5d/0xa0 [zfs]
[260136.676910]  zfs_putpage+0x516/0x590 [zfs]
[260136.681279]  zpl_putpage+0x67/0xd0 [zfs]
[260136.685467]  write_cache_pages+0x197/0x420
[260136.689649]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260136.694705]  zpl_writepages+0x119/0x130 [zfs]
[260136.699322]  do_writepages+0xc2/0x1c0
[260136.703076]  __filemap_fdatawrite_range+0xc7/0x100
[260136.707952]  filemap_write_and_wait_range+0x30/0x80
[260136.712920]  zpl_iter_read_direct+0x86/0x1b0 [zfs]
[260136.717972]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260136.722174]  zpl_iter_read+0x90/0xb0 [zfs]
[260136.726536]  new_sync_read+0x10f/0x150
[260136.730376]  vfs_read+0x91/0x140
[260136.733693]  ksys_read+0x4f/0xb0
[260136.737012]  do_syscall_64+0x5b/0x1a0
[260136.740764]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260136.745906] RIP: 0033:0x7f1bd4687ab4
[260136.749574] Code: Unable to access opcode bytes at RIP
0x7f1bd4687a8a.
[260136.756181] RSP: 002b:00007fff63f65170 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[260136.763834] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f1bd4687ab4
[260136.771056] RDX: 0000000000100000 RSI: 00007f1b69dc3000 RDI:
0000000000000005
[260136.778274] RBP: 00007f1b69dc3000 R08: 0000000000000000 R09:
0000000000000000
[260136.785494] R10: 000000008fd0ea42 R11: 0000000000000246 R12:
0000000000100000
[260136.792714] R13: 000055ca4b405ec0 R14: 0000000000100000 R15:
000055ca4b405ee8
[260259.123003] INFO: task kworker/u128:0:3589938 blocked for more than
120 seconds.
[260259.130487]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260259.138313] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260259.146224] task:kworker/u128:0  state:D stack:    0 pid:3589938
ppid:     2 flags:0x80004080
[260259.154832] Workqueue: writeback wb_workfn (flush-zfs-540)
[260259.160411] Call Trace:
[260259.162950]  __schedule+0x2d1/0x830
[260259.166531]  schedule+0x35/0xa0
[260259.169765]  io_schedule+0x12/0x40
[260259.173257]  __lock_page+0x12d/0x230
[260259.176921]  ? file_fdatawait_range+0x20/0x20
[260259.181368]  write_cache_pages+0x1f2/0x420
[260259.185554]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260259.190633]  zpl_writepages+0x98/0x130 [zfs]
[260259.195183]  do_writepages+0xc2/0x1c0
[260259.198935]  __writeback_single_inode+0x39/0x2f0
[260259.203640]  writeback_sb_inodes+0x1e6/0x450
[260259.208002]  __writeback_inodes_wb+0x5f/0xc0
[260259.212359]  wb_writeback+0x247/0x2e0
[260259.216114]  ? get_nr_inodes+0x35/0x50
[260259.219953]  wb_workfn+0x37c/0x4d0
[260259.223443]  ? __switch_to_asm+0x35/0x70
[260259.227456]  ? __switch_to_asm+0x41/0x70
[260259.231469]  ? __switch_to_asm+0x35/0x70
[260259.235481]  ? __switch_to_asm+0x41/0x70
[260259.239495]  ? __switch_to_asm+0x35/0x70
[260259.243505]  ? __switch_to_asm+0x41/0x70
[260259.247518]  ? __switch_to_asm+0x35/0x70
[260259.251533]  ? __switch_to_asm+0x41/0x70
[260259.255545]  process_one_work+0x1a7/0x360
[260259.259645]  worker_thread+0x30/0x390
[260259.263396]  ? create_worker+0x1a0/0x1a0
[260259.267409]  kthread+0x10a/0x120
[260259.270730]  ? set_kthread_struct+0x40/0x40
[260259.275003]  ret_from_fork+0x35/0x40
[260259.278712] INFO: task fio:3791107 blocked for more than 120
seconds.
[260259.285240]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260259.293064] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260259.300976] task:fio             state:D stack:    0 pid:3791107
ppid:3790841 flags:0x00004080
[260259.309668] Call Trace:
[260259.312210]  __schedule+0x2d1/0x830
[260259.315787]  schedule+0x35/0xa0
[260259.319020]  io_schedule+0x12/0x40
[260259.322511]  wait_on_page_bit+0x123/0x220
[260259.326611]  ? xas_load+0x8/0x80
[260259.329930]  ? file_fdatawait_range+0x20/0x20
[260259.334376]  filemap_page_mkwrite+0x9b/0xb0
[260259.338650]  do_page_mkwrite+0x53/0x90
[260259.342489]  ? vm_normal_page+0x1a/0xc0
[260259.346415]  do_wp_page+0x298/0x350
[260259.349994]  __handle_mm_fault+0x44f/0x6c0
[260259.354181]  ? __switch_to_asm+0x41/0x70
[260259.358193]  handle_mm_fault+0xc1/0x1e0
[260259.362117]  do_user_addr_fault+0x1b5/0x440
[260259.366391]  do_page_fault+0x37/0x130
[260259.370145]  ? page_fault+0x8/0x30
[260259.373639]  page_fault+0x1e/0x30
[260259.377043] RIP: 0033:0x7f6deee7f1b4
[260259.380714] Code: Unable to access opcode bytes at RIP
0x7f6deee7f18a.
[260259.387323] RSP: 002b:00007fffe41b6538 EFLAGS: 00010202
[260259.392633] RAX: 00007f6d83049000 RBX: 0000556b63614ec0 RCX:
00007f6d83148fe0
[260259.399853] RDX: 00000000000acfe0 RSI: 00007f6d84e9c030 RDI:
00007f6d8309bfa0
[260259.407074] RBP: 00007f6d84f4a000 R08: ffffffffffffffe0 R09:
0000000000000000
[260259.414291] R10: 00007f6d84f8e810 R11: 00007f6d83049000 R12:
0000000000000001
[260259.421512] R13: 0000556b63614ec0 R14: 0000000000100000 R15:
0000556b63614ee8
[260259.428733] INFO: task fio:3791108 blocked for more than 120
seconds.
[260259.435258]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260259.443085] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260259.450997] task:fio             state:D stack:    0 pid:3791108
ppid:3790835 flags:0x00004080
[260259.459689] Call Trace:
[260259.462228]  __schedule+0x2d1/0x830
[260259.465808]  ? cv_wait_common+0x12d/0x240 [spl]
[260259.470435]  schedule+0x35/0xa0
[260259.473669]  io_schedule+0x12/0x40
[260259.477161]  __lock_page+0x12d/0x230
[260259.480828]  ? file_fdatawait_range+0x20/0x20
[260259.485274]  zfs_putpage+0x148/0x590 [zfs]
[260259.489640]  ? rmap_walk_file+0x116/0x290
[260259.493742]  ? __mod_memcg_lruvec_state+0x5d/0x160
[260259.498619]  zpl_putpage+0x67/0xd0 [zfs]
[260259.502813]  write_cache_pages+0x197/0x420
[260259.506998]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260259.512054]  zpl_writepages+0x119/0x130 [zfs]
[260259.516672]  do_writepages+0xc2/0x1c0
[260259.520423]  ? flush_tlb_func_common.constprop.9+0x125/0x220
[260259.526170]  __filemap_fdatawrite_range+0xc7/0x100
[260259.531050]  filemap_write_and_wait_range+0x30/0x80
[260259.536016]  generic_file_direct_write+0x120/0x160
[260259.540896]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260259.545099]  zpl_iter_write+0xdd/0x160 [zfs]
[260259.549639]  new_sync_write+0x112/0x160
[260259.553566]  vfs_write+0xa5/0x1a0
[260259.556971]  ksys_write+0x4f/0xb0
[260259.560379]  do_syscall_64+0x5b/0x1a0
[260259.564131]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260259.569269] RIP: 0033:0x7f9d192c7a17
[260259.572935] Code: Unable to access opcode bytes at RIP
0x7f9d192c79ed.
[260259.579549] RSP: 002b:00007ffc8e4ba270 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[260259.587200] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9d192c7a17
[260259.594419] RDX: 0000000000100000 RSI: 00007f9caea03000 RDI:
0000000000000005
[260259.601639] RBP: 00007f9caea03000 R08: 0000000000000000 R09:
0000000000000000
[260259.608859] R10: 00005558e8975680 R11: 0000000000000293 R12:
0000000000100000
[260259.616078] R13: 00005558e8985ec0 R14: 0000000000100000 R15:
00005558e8985ee8
[260259.623298] INFO: task fio:3791109 blocked for more than 120
seconds.
[260259.629827]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260259.637650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260259.645564] task:fio             state:D stack:    0 pid:3791109
ppid:3790838 flags:0x00004080
[260259.654254] Call Trace:
[260259.656794]  __schedule+0x2d1/0x830
[260259.660373]  ? zfs_znode_held+0xe6/0x140 [zfs]
[260259.665081]  schedule+0x35/0xa0
[260259.668313]  cv_wait_common+0x153/0x240 [spl]
[260259.672768]  ? finish_wait+0x80/0x80
[260259.676441]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[260259.682026]  zfs_rangelock_enter_impl+0xbf/0x170 [zfs]
[260259.687432]  zfs_get_data+0x113/0x770 [zfs]
[260259.691876]  zil_lwb_commit+0x537/0x780 [zfs]
[260259.696497]  zil_process_commit_list+0x14c/0x460 [zfs]
[260259.701895]  zil_commit_writer+0xeb/0x160 [zfs]
[260259.706689]  zil_commit_impl+0x5d/0xa0 [zfs]
[260259.711228]  zfs_putpage+0x516/0x590 [zfs]
[260259.715589]  zpl_putpage+0x67/0xd0 [zfs]
[260259.719775]  write_cache_pages+0x197/0x420
[260259.723959]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260259.729013]  zpl_writepages+0x119/0x130 [zfs]
[260259.733632]  do_writepages+0xc2/0x1c0
[260259.737384]  __filemap_fdatawrite_range+0xc7/0x100
[260259.742264]  filemap_write_and_wait_range+0x30/0x80
[260259.747229]  zpl_iter_read_direct+0x86/0x1b0 [zfs]
[260259.752286]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260259.756487]  zpl_iter_read+0x90/0xb0 [zfs]
[260259.760855]  new_sync_read+0x10f/0x150
[260259.764696]  vfs_read+0x91/0x140
[260259.768013]  ksys_read+0x4f/0xb0
[260259.771332]  do_syscall_64+0x5b/0x1a0
[260259.775087]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260259.780225] RIP: 0033:0x7f1bd4687ab4
[260259.783893] Code: Unable to access opcode bytes at RIP
0x7f1bd4687a8a.
[260259.790503] RSP: 002b:00007fff63f65170 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[260259.798157] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f1bd4687ab4
[260259.805377] RDX: 0000000000100000 RSI: 00007f1b69dc3000 RDI:
0000000000000005
[260259.812592] RBP: 00007f1b69dc3000 R08: 0000000000000000 R09:
0000000000000000
[260259.819814] R10: 000000008fd0ea42 R11: 0000000000000246 R12:
0000000000100000
[260259.827032] R13: 000055ca4b405ec0 R14: 0000000000100000 R15:
000055ca4b405ee8
[260382.001731] INFO: task kworker/u128:0:3589938 blocked for more than
120 seconds.
[260382.009227]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260382.017053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260382.024963] task:kworker/u128:0  state:D stack:    0 pid:3589938
ppid:     2 flags:0x80004080
[260382.033568] Workqueue: writeback wb_workfn (flush-zfs-540)
[260382.039141] Call Trace:
[260382.041683]  __schedule+0x2d1/0x830
[260382.045271]  schedule+0x35/0xa0
[260382.048503]  io_schedule+0x12/0x40
[260382.051994]  __lock_page+0x12d/0x230
[260382.055662]  ? file_fdatawait_range+0x20/0x20
[260382.060107]  write_cache_pages+0x1f2/0x420
[260382.064293]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260382.069379]  zpl_writepages+0x98/0x130 [zfs]
[260382.073919]  do_writepages+0xc2/0x1c0
[260382.077672]  __writeback_single_inode+0x39/0x2f0
[260382.082379]  writeback_sb_inodes+0x1e6/0x450
[260382.086738]  __writeback_inodes_wb+0x5f/0xc0
[260382.091097]  wb_writeback+0x247/0x2e0
[260382.094850]  ? get_nr_inodes+0x35/0x50
[260382.098689]  wb_workfn+0x37c/0x4d0
[260382.102181]  ? __switch_to_asm+0x35/0x70
[260382.106194]  ? __switch_to_asm+0x41/0x70
[260382.110207]  ? __switch_to_asm+0x35/0x70
[260382.114221]  ? __switch_to_asm+0x41/0x70
[260382.118231]  ? __switch_to_asm+0x35/0x70
[260382.122244]  ? __switch_to_asm+0x41/0x70
[260382.126256]  ? __switch_to_asm+0x35/0x70
[260382.130273]  ? __switch_to_asm+0x41/0x70
[260382.134284]  process_one_work+0x1a7/0x360
[260382.138384]  worker_thread+0x30/0x390
[260382.142136]  ? create_worker+0x1a0/0x1a0
[260382.146150]  kthread+0x10a/0x120
[260382.149469]  ? set_kthread_struct+0x40/0x40
[260382.153741]  ret_from_fork+0x35/0x40
[260382.157448] INFO: task fio:3791107 blocked for more than 120
seconds.
[260382.163977]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260382.171802] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260382.179715] task:fio             state:D stack:    0 pid:3791107
ppid:3790841 flags:0x00004080
[260382.188409] Call Trace:
[260382.190945]  __schedule+0x2d1/0x830
[260382.194527]  schedule+0x35/0xa0
[260382.197757]  io_schedule+0x12/0x40
[260382.201249]  wait_on_page_bit+0x123/0x220
[260382.205350]  ? xas_load+0x8/0x80
[260382.208668]  ? file_fdatawait_range+0x20/0x20
[260382.213114]  filemap_page_mkwrite+0x9b/0xb0
[260382.217386]  do_page_mkwrite+0x53/0x90
[260382.221227]  ? vm_normal_page+0x1a/0xc0
[260382.225152]  do_wp_page+0x298/0x350
[260382.228733]  __handle_mm_fault+0x44f/0x6c0
[260382.232919]  ? __switch_to_asm+0x41/0x70
[260382.236930]  handle_mm_fault+0xc1/0x1e0
[260382.240856]  do_user_addr_fault+0x1b5/0x440
[260382.245132]  do_page_fault+0x37/0x130
[260382.248883]  ? page_fault+0x8/0x30
[260382.252375]  page_fault+0x1e/0x30
[260382.255781] RIP: 0033:0x7f6deee7f1b4
[260382.259451] Code: Unable to access opcode bytes at RIP
0x7f6deee7f18a.
[260382.266059] RSP: 002b:00007fffe41b6538 EFLAGS: 00010202
[260382.271373] RAX: 00007f6d83049000 RBX: 0000556b63614ec0 RCX:
00007f6d83148fe0
[260382.278591] RDX: 00000000000acfe0 RSI: 00007f6d84e9c030 RDI:
00007f6d8309bfa0
[260382.285813] RBP: 00007f6d84f4a000 R08: ffffffffffffffe0 R09:
0000000000000000
[260382.293030] R10: 00007f6d84f8e810 R11: 00007f6d83049000 R12:
0000000000000001
[260382.300249] R13: 0000556b63614ec0 R14: 0000000000100000 R15:
0000556b63614ee8
[260382.307472] INFO: task fio:3791108 blocked for more than 120
seconds.
[260382.313997]       Tainted: P           OE    --------- -  -
4.18.0-408.el8.x86_64 openzfs#1
[260382.321823] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[260382.329734] task:fio             state:D stack:    0 pid:3791108
ppid:3790835 flags:0x00004080
[260382.338427] Call Trace:
[260382.340967]  __schedule+0x2d1/0x830
[260382.344547]  ? cv_wait_common+0x12d/0x240 [spl]
[260382.349173]  schedule+0x35/0xa0
[260382.352406]  io_schedule+0x12/0x40
[260382.355899]  __lock_page+0x12d/0x230
[260382.359563]  ? file_fdatawait_range+0x20/0x20
[260382.364010]  zfs_putpage+0x148/0x590 [zfs]
[260382.368379]  ? rmap_walk_file+0x116/0x290
[260382.372479]  ? __mod_memcg_lruvec_state+0x5d/0x160
[260382.377358]  zpl_putpage+0x67/0xd0 [zfs]
[260382.381552]  write_cache_pages+0x197/0x420
[260382.385739]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260382.390791]  zpl_writepages+0x119/0x130 [zfs]
[260382.395410]  do_writepages+0xc2/0x1c0
[260382.399161]  ? flush_tlb_func_common.constprop.9+0x125/0x220
[260382.404907]  __filemap_fdatawrite_range+0xc7/0x100
[260382.409790]  filemap_write_and_wait_range+0x30/0x80
[260382.414752]  generic_file_direct_write+0x120/0x160
[260382.419632]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260382.423838]  zpl_iter_write+0xdd/0x160 [zfs]
[260382.428379]  new_sync_write+0x112/0x160
[260382.432304]  vfs_write+0xa5/0x1a0
[260382.435711]  ksys_write+0x4f/0xb0
[260382.439115]  do_syscall_64+0x5b/0x1a0
[260382.442866]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260382.448007] RIP: 0033:0x7f9d192c7a17
[260382.451675] Code: Unable to access opcode bytes at RIP
0x7f9d192c79ed.
[260382.458286] RSP: 002b:00007ffc8e4ba270 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[260382.465938] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9d192c7a17
[260382.473158] RDX: 0000000000100000 RSI: 00007f9caea03000 RDI:
0000000000000005
[260382.480379] RBP: 00007f9caea03000 R08: 0000000000000000 R09:
0000000000000000
[260382.487597] R10: 00005558e8975680 R11: 0000000000000293 R12:
0000000000100000
[260382.494814] R13: 00005558e8985ec0 R14: 0000000000100000 R15:
00005558e8985ee8

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 12, 2024
In commit ba30ec9 I got a little overzealous in code cleanup. While I
was trying to remove all the stable page code for Linux, I
misinterpreted why Brian Behlendorf originally had the try rangelock,
drop page lock, and acquire rangelock sequence in zfs_fillpage(). This
is still necessary even without stable pages. It must happen to avoid a
race condition between direct I/O writes and pages being faulted in for
mmap'ed files. If the rangelock is not held, then a direct I/O write
can set db->db_data = NULL either in:
 1. dmu_write_direct() -> dmu_buf_will_not_fill() ->
    dmu_buf_will_fill() -> dbuf_noread() -> dbuf_clear_data()
 2. dmu_write_direct_done()

Without the rangelock, this can cause a panic, as dmu_read_impl() can
get a NULL pointer for db->db_data when trying to do the memcpy. So
this rangelock must be held in zfs_fillpage() no matter what.

There are further semantics on when the rangelock should be held in
zfs_fillpage(). It must only be held when going through zfs_getpage()
-> zfs_fillpage(). The reason for this is that mappedread() can call
zfs_fillpage() if the page is not uptodate. This can occur because
filemap_fault() will first add the pages to the inode's address_space
mapping and then drop the page lock. This leaves open a window where
mappedread() can be called. Since this can occur, mappedread() holds
both the page lock and the rangelock. This is perfectly valid and
correct. However, it is important in this case to never grab the
rangelock in zfs_fillpage(); if that happens, a deadlock will occur.

Finally, it is important to note that the rangelock is first attempted
with zfs_rangelock_tryenter(). The reason for this is that the page
lock must be dropped before blocking on the rangelock in this case;
otherwise there is a race between zfs_fillpage() and zfs_write() ->
update_pages(). In update_pages() the rangelock is already held when it
grabs the page lock, so if the page lock is not dropped before
acquiring the rangelock in zfs_fillpage() there can be a deadlock.
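The try-then-drop pattern can be sketched as follows. This is an
illustrative Python model with hypothetical names, not the actual
zfs_fillpage() code:

```python
import threading

def enter_range_with_page_locked(page_lock, rangelock):
    """Model of the pattern above. The caller already holds page_lock.
    Try the rangelock without blocking; on failure, drop the page lock
    before waiting, so a writer that holds the rangelock (the
    update_pages() path) can still take the page lock, then re-take the
    page lock afterwards. Names here are hypothetical."""
    if rangelock.acquire(blocking=False):  # zfs_rangelock_tryenter()
        return "tryenter"
    page_lock.release()    # avoid the lock-order inversion
    rangelock.acquire()    # now it is safe to block
    page_lock.acquire()    # re-acquire the page lock
    return "dropped-and-waited"
```

Either way the function returns holding both locks; the only difference
is whether the page lock had to be briefly released to wait safely.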

Below is a stack trace showing the NULL pointer dereference that was
occurring with the dio_mmap ZTS test case before this commit.

[ 7737.430658] BUG: unable to handle kernel NULL pointer dereference at
0000000000000000
[ 7737.438486] PGD 0 P4D 0
[ 7737.441024] Oops: 0000 [openzfs#1] SMP NOPTI
[ 7737.444692] CPU: 33 PID: 599346 Comm: fio Kdump: loaded Tainted: P
OE    --------- -  - 4.18.0-408.el8.x86_64 openzfs#1
[ 7737.455721] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21
10/08/2020
[ 7737.463106] RIP: 0010:__memcpy+0x12/0x20
[ 7737.467032] Code: ff 0f 31 48 c1 e2 20 48 09 c2 48 31 d3 e9 79 ff ff
ff 90 90 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2
07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 a4
[ 7737.485770] RSP: 0000:ffffc1db829e3b60 EFLAGS: 00010246
[ 7737.490987] RAX: ffff9ef195b6f000 RBX: 0000000000001000 RCX:
0000000000000200
[ 7737.498111] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff9ef195b6f000
[ 7737.505235] RBP: ffff9ef195b70000 R08: ffff9eef1d1d0000 R09:
ffff9eef1d1d0000
[ 7737.512358] R10: ffff9eef27968218 R11: 0000000000000000 R12:
0000000000000000
[ 7737.519481] R13: ffff9ef041adb6d8 R14: 00000000005e1000 R15:
0000000000000001
[ 7737.526607] FS:  00007f77fe2bae80(0000) GS:ffff9f0cae840000(0000)
knlGS:0000000000000000
[ 7737.534683] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7737.540423] CR2: 0000000000000000 CR3: 00000003387a6000 CR4:
0000000000350ee0
[ 7737.547553] Call Trace:
[ 7737.550003]  dmu_read_impl+0x11a/0x210 [zfs]
[ 7737.554464]  dmu_read+0x56/0x90 [zfs]
[ 7737.558292]  zfs_fillpage+0x76/0x190 [zfs]
[ 7737.562584]  zfs_getpage+0x4c/0x80 [zfs]
[ 7737.566691]  zpl_readpage_common+0x3b/0x80 [zfs]
[ 7737.571485]  filemap_fault+0x5d6/0xa10
[ 7737.575236]  ? obj_cgroup_charge_pages+0xba/0xd0
[ 7737.579856]  ? xas_load+0x8/0x80
[ 7737.583088]  ? xas_find+0x173/0x1b0
[ 7737.586579]  ? filemap_map_pages+0x84/0x410
[ 7737.590759]  __do_fault+0x38/0xb0
[ 7737.594077]  handle_pte_fault+0x559/0x870
[ 7737.598082]  __handle_mm_fault+0x44f/0x6c0
[ 7737.602181]  handle_mm_fault+0xc1/0x1e0
[ 7737.606019]  do_user_addr_fault+0x1b5/0x440
[ 7737.610207]  do_page_fault+0x37/0x130
[ 7737.613873]  ? page_fault+0x8/0x30
[ 7737.617277]  page_fault+0x1e/0x30
[ 7737.620589] RIP: 0033:0x7f77fbce9140

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 12, 2024
It is important to hold the dbuf mutex (db_mtx) when creating ZIOs in
dmu_read_abd(). The BP returned by dmu_buf_get_bp_from_dbuf() may come
from a previous direct I/O write. In that case, it is attached to a
dirty record in the dbuf. When zio_read() is called, a copy of the BP
is made through io_bp_copy to io_bp in zio_create(). Without holding
the db_mtx, though, the dirty record may be freed in dbuf_read_done().
This can result in garbage being placed in the BP for the ZIO created
through zio_read(). Holding the db_mtx avoids this race. Below is a
stack trace of the issue that was occurring in
vdev_mirror_child_select() when the ZIO was created without holding the
db_mtx.
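The fix is the classic copy-under-lock pattern: take a private copy of
shared state while holding its mutex, so a concurrent free cannot leave
garbage behind. A small illustrative Python model (all names are
stand-ins for the ZFS structures, not real API):

```python
import threading

class Dbuf:
    """Toy stand-in for dmu_buf_impl_t: a mutex guarding a BP that a
    concurrent path may free (modeled here by setting it to None)."""
    def __init__(self, bp):
        self.db_mtx = threading.Lock()
        self.dirty_bp = bp

def create_read_zio(dbuf):
    # Hold db_mtx while taking the io_bp_copy-style private copy, so
    # the dirty record cannot be freed mid-copy.
    with dbuf.db_mtx:
        bp_copy = dict(dbuf.dirty_bp)
    return bp_copy  # safe to use even after the dirty record is freed

def read_done(dbuf):
    # Models dbuf_read_done() freeing the dirty record.
    with dbuf.db_mtx:
        dbuf.dirty_bp = None
```

Because the copy is made entirely under the mutex, the ZIO keeps a
consistent BP no matter when the free happens relative to the read.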

[29882.427056] VERIFY(zio->io_bp == NULL ||
BP_PHYSICAL_BIRTH(zio->io_bp) == txg) failed
[29882.434884] PANIC at vdev_mirror.c:545:vdev_mirror_child_select()
[29882.440976] Showing stack for process 1865540
[29882.445336] CPU: 57 PID: 1865540 Comm: fio Kdump: loaded Tainted: P
OE    --------- -  - 4.18.0-408.el8.x86_64 openzfs#1
[29882.456457] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21
10/08/2020
[29882.463844] Call Trace:
[29882.466296]  dump_stack+0x41/0x60
[29882.469618]  spl_panic+0xd0/0xe8 [spl]
[29882.473376]  ? __dprintf+0x10e/0x180 [zfs]
[29882.477674]  ? kfree+0xd3/0x250
[29882.480819]  ? __dprintf+0x10e/0x180 [zfs]
[29882.485103]  ? vdev_mirror_map_alloc+0x29/0x50 [zfs]
[29882.490250]  ? vdev_lookup_top+0x20/0x90 [zfs]
[29882.494878]  spl_assert+0x17/0x20 [zfs]
[29882.498893]  vdev_mirror_child_select+0x279/0x300 [zfs]
[29882.504289]  vdev_mirror_io_start+0x11f/0x2b0 [zfs]
[29882.509336]  zio_vdev_io_start+0x3ee/0x520 [zfs]
[29882.514137]  zio_nowait+0x105/0x290 [zfs]
[29882.518330]  dmu_read_abd+0x196/0x460 [zfs]
[29882.522691]  dmu_read_uio_direct+0x6d/0xf0 [zfs]
[29882.527472]  dmu_read_uio_dnode+0x12a/0x140 [zfs]
[29882.532345]  dmu_read_uio_dbuf+0x3f/0x60 [zfs]
[29882.536953]  zfs_read+0x238/0x3f0 [zfs]
[29882.540976]  zpl_iter_read_direct+0xe0/0x180 [zfs]
[29882.545952]  ? rrw_exit+0xc6/0x200 [zfs]
[29882.550058]  zpl_iter_read+0x90/0xb0 [zfs]
[29882.554340]  new_sync_read+0x10f/0x150
[29882.558094]  vfs_read+0x91/0x140
[29882.561325]  ksys_read+0x4f/0xb0
[29882.564557]  do_syscall_64+0x5b/0x1a0
[29882.568222]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[29882.573267] RIP: 0033:0x7f7fe0fa6ab4

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 12, 2024
There existed a race condition that was discovered through the
dio_random test. When doing fio with --fsync=32, after 32 writes fsync
is called on the file. When this happens, blocks committed to the ZIL
will be sync'ed out. However, the code for the O_DIRECT write was
updated in 31983d2 to always wait if there was an associated ARC buf
with the dbuf for all previous TXG's to sync out.

There was an oversight in this update. While waiting on previous TXGs
to sync out, the O_DIRECT write holds the rangelock as a writer the
entire time. This becomes a problem when the ZIL commits writes out
through `zfs_get_data()`, because it tries to grab the rangelock as a
reader. This leads to a deadlock.

In order to fix this race condition, I updated the `dmu_buf_impl_t`
struct to contain a uint8_t variable that signals whether the dbuf
attached to an O_DIRECT write is being held in this wait because of
mixed direct and buffered data. Using this new `db_mixed_io_dio_wait`
variable in the `dmu_buf_impl_t`, the code in `zfs_get_data()` can tell
that the rangelock is already held across the entire block and there is
no need to grab the rangelock at all. Because the rangelock is already
held as a writer across the entire block, no modifications can take
place against the block as long as the O_DIRECT write is stalled
waiting in `dmu_buf_direct_mixed_io_wait()`.

Also as part of this update, I realized the `db_state` in
`dmu_buf_direct_mixed_io_wait()` needs to be temporarily changed to
`DB_CACHED`. This is necessary so the logic in `dbuf_read()` is correct
if `dmu_sync_late_arrival()` is called by `dmu_sync()`. It is completely
valid to switch the `db_state` back to `DB_CACHED`, since there is still
an associated ARC buf that will not be freed until our O_DIRECT write
completes, which can only happen after it leaves
`dmu_buf_direct_mixed_io_wait()`.

Here is the stack trace of the deadlock that occurred with
`dio_random.ksh` before this commit:
[ 5513.663402] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 7496.580415] INFO: task z_wr_int:1098000 blocked for more than 120
seconds.
[ 7496.585709]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7496.593349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7496.600839] task:z_wr_int        state:D stack:    0 pid:1098000
ppid:     2 flags:0x80004080
[ 7496.608622] Call Trace:
[ 7496.611770]  __schedule+0x2d1/0x870
[ 7496.615404]  schedule+0x55/0xf0
[ 7496.618866]  cv_wait_common+0x16d/0x280 [spl]
[ 7496.622910]  ? finish_wait+0x80/0x80
[ 7496.626601]  dmu_buf_direct_mixed_io_wait+0x73/0x190 [zfs]
[ 7496.631327]  dmu_write_direct_done+0x90/0x3a0 [zfs]
[ 7496.635798]  zio_done+0x373/0x1d40 [zfs]
[ 7496.639795]  zio_execute+0xee/0x210 [zfs]
[ 7496.643840]  taskq_thread+0x203/0x420 [spl]
[ 7496.647836]  ? wake_up_q+0x70/0x70
[ 7496.651411]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[ 7496.656489]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 7496.660604]  kthread+0x134/0x150
[ 7496.664092]  ? set_kthread_struct+0x50/0x50
[ 7496.668080]  ret_from_fork+0x35/0x40
[ 7496.671745] INFO: task txg_sync:1098025 blocked for more than 120
seconds.
[ 7496.676991]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7496.684666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7496.692060] task:txg_sync        state:D stack:    0 pid:1098025
ppid:     2 flags:0x80004080
[ 7496.699888] Call Trace:
[ 7496.703012]  __schedule+0x2d1/0x870
[ 7496.706658]  schedule+0x55/0xf0
[ 7496.710093]  schedule_timeout+0x197/0x300
[ 7496.713982]  ? __next_timer_interrupt+0xf0/0xf0
[ 7496.718135]  io_schedule_timeout+0x19/0x40
[ 7496.722049]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7496.726349]  ? finish_wait+0x80/0x80
[ 7496.730039]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7496.734100]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7496.738082]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 7496.742205]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7496.746534]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 7496.750842]  spa_sync_iterate_to_convergence+0xcf/0x310 [zfs]
[ 7496.755742]  spa_sync+0x362/0x8d0 [zfs]
[ 7496.759689]  txg_sync_thread+0x274/0x3b0 [zfs]
[ 7496.763928]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 7496.768439]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 7496.772799]  thread_generic_wrapper+0x63/0x90 [spl]
[ 7496.777097]  kthread+0x134/0x150
[ 7496.780616]  ? set_kthread_struct+0x50/0x50
[ 7496.784549]  ret_from_fork+0x35/0x40
[ 7496.788204] INFO: task fio:1101750 blocked for more than 120 seconds.
[ 7496.895852]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7496.903765] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7496.911170] task:fio             state:D stack:    0 pid:1101750
ppid:1101741 flags:0x00004080
[ 7496.919033] Call Trace:
[ 7496.922136]  __schedule+0x2d1/0x870
[ 7496.925769]  schedule+0x55/0xf0
[ 7496.929245]  schedule_timeout+0x197/0x300
[ 7496.933120]  ? __next_timer_interrupt+0xf0/0xf0
[ 7496.937213]  io_schedule_timeout+0x19/0x40
[ 7496.941126]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7496.945444]  ? finish_wait+0x80/0x80
[ 7496.949125]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7496.953191]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7496.957180]  dmu_write_abd+0x174/0x1c0 [zfs]
[ 7496.961319]  dmu_write_uio_direct+0x79/0xf0 [zfs]
[ 7496.965731]  dmu_write_uio_dnode+0xa6/0x2d0 [zfs]
[ 7496.970043]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 7496.974305]  zfs_write+0x55f/0xea0 [zfs]
[ 7496.978325]  ? iov_iter_get_pages+0xe9/0x390
[ 7496.982333]  ? trylock_page+0xd/0x20 [zfs]
[ 7496.986451]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7496.990713]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7496.995031]  zpl_iter_write_direct+0xda/0x170 [zfs]
[ 7496.999489]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.003476]  zpl_iter_write+0xd5/0x110 [zfs]
[ 7497.007610]  new_sync_write+0x112/0x160
[ 7497.011429]  vfs_write+0xa5/0x1b0
[ 7497.014916]  ksys_write+0x4f/0xb0
[ 7497.018443]  do_syscall_64+0x5b/0x1b0
[ 7497.022150]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.026532] RIP: 0033:0x7f8771d72a17
[ 7497.030195] Code: Unable to access opcode bytes at RIP
0x7f8771d729ed.
[ 7497.035263] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[ 7497.042547] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72a17
[ 7497.047933] RDX: 000000000009b000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7497.053269] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7497.058660] R10: 0000000000000000 R11: 0000000000000293 R12:
000000000009b000
[ 7497.063960] R13: 000055b390afcac0 R14: 000000000009b000 R15:
000055b390afcae8
[ 7497.069334] INFO: task fio:1101751 blocked for more than 120 seconds.
[ 7497.074308]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7497.081973] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7497.089371] task:fio             state:D stack:    0 pid:1101751
ppid:1101741 flags:0x00000080
[ 7497.097147] Call Trace:
[ 7497.100263]  __schedule+0x2d1/0x870
[ 7497.103897]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.107878]  schedule+0x55/0xf0
[ 7497.111386]  cv_wait_common+0x16d/0x280 [spl]
[ 7497.115391]  ? finish_wait+0x80/0x80
[ 7497.119028]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 7497.123667]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 7497.128240]  zfs_read+0xaf/0x3f0 [zfs]
[ 7497.132146]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.136091]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7497.140366]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7497.144679]  zpl_iter_read_direct+0xe0/0x180 [zfs]
[ 7497.149054]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7497.153040]  zpl_iter_read+0x94/0xb0 [zfs]
[ 7497.157103]  new_sync_read+0x10f/0x160
[ 7497.160855]  vfs_read+0x91/0x150
[ 7497.164336]  ksys_read+0x4f/0xb0
[ 7497.168004]  do_syscall_64+0x5b/0x1b0
[ 7497.171706]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.176105] RIP: 0033:0x7f8771d72ab4
[ 7497.179742] Code: Unable to access opcode bytes at RIP
0x7f8771d72a8a.
[ 7497.184807] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[ 7497.192129] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72ab4
[ 7497.197485] RDX: 0000000000002000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7497.202922] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7497.208309] R10: 00000001ffffffff R11: 0000000000000246 R12:
0000000000002000
[ 7497.213694] R13: 000055b390afcac0 R14: 0000000000002000 R15:
000055b390afcae8
[ 7497.219063] INFO: task fio:1101755 blocked for more than 120 seconds.
[ 7497.224098]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7497.231786] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7497.239165] task:fio             state:D stack:    0 pid:1101755
ppid:1101744 flags:0x00000080
[ 7497.246989] Call Trace:
[ 7497.250121]  __schedule+0x2d1/0x870
[ 7497.253779]  schedule+0x55/0xf0
[ 7497.257240]  schedule_preempt_disabled+0xa/0x10
[ 7497.261344]  __mutex_lock.isra.7+0x349/0x420
[ 7497.265326]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7497.269674]  zil_commit_writer+0x89/0x230 [zfs]
[ 7497.273938]  zil_commit_impl+0x5f/0xd0 [zfs]
[ 7497.278101]  zfs_fsync+0x81/0xa0 [zfs]
[ 7497.282002]  zpl_fsync+0xe5/0x140 [zfs]
[ 7497.285985]  do_fsync+0x38/0x70
[ 7497.289458]  __x64_sys_fsync+0x10/0x20
[ 7497.293208]  do_syscall_64+0x5b/0x1b0
[ 7497.296928]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.301260] RIP: 0033:0x7f9559073027
[ 7497.304920] Code: Unable to access opcode bytes at RIP
0x7f9559072ffd.
[ 7497.310015] RSP: 002b:00007ffdefcd0ff0 EFLAGS: 00000293 ORIG_RAX:
000000000000004a
[ 7497.317346] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9559073027
[ 7497.322722] RDX: 0000000000000000 RSI: 0000563adcbf2ac0 RDI:
0000000000000005
[ 7497.328126] RBP: 00007f94fb858000 R08: 0000000000000000 R09:
0000000000000000
[ 7497.333514] R10: 0000000000008000 R11: 0000000000000293 R12:
0000000000000003
[ 7497.338887] R13: 0000563adcbf2ac0 R14: 0000000000000000 R15:
0000563adcbf2ae8
[ 7497.344247] INFO: task fio:1101756 blocked for more than 120 seconds.
[ 7497.349327]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7497.357032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7497.364517] task:fio             state:D stack:    0 pid:1101756
ppid:1101744 flags:0x00004080
[ 7497.372310] Call Trace:
[ 7497.375433]  __schedule+0x2d1/0x870
[ 7497.379004]  schedule+0x55/0xf0
[ 7497.382454]  cv_wait_common+0x16d/0x280 [spl]
[ 7497.386477]  ? finish_wait+0x80/0x80
[ 7497.390137]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 7497.394816]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 7497.399397]  zfs_get_data+0x1a8/0x7e0 [zfs]
[ 7497.403515]  zil_lwb_commit+0x1a5/0x400 [zfs]
[ 7497.407712]  zil_lwb_write_close+0x408/0x630 [zfs]
[ 7497.412126]  zil_commit_waiter_timeout+0x16d/0x520 [zfs]
[ 7497.416801]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[ 7497.421139]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 7497.425294]  zfs_fsync+0x81/0xa0 [zfs]
[ 7497.429454]  zpl_fsync+0xe5/0x140 [zfs]
[ 7497.433396]  do_fsync+0x38/0x70
[ 7497.436878]  __x64_sys_fsync+0x10/0x20
[ 7497.440586]  do_syscall_64+0x5b/0x1b0
[ 7497.444313]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7497.448659] RIP: 0033:0x7f9559073027
[ 7497.452343] Code: Unable to access opcode bytes at RIP
0x7f9559072ffd.
[ 7497.457408] RSP: 002b:00007ffdefcd0ff0 EFLAGS: 00000293 ORIG_RAX:
000000000000004a
[ 7497.464724] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f9559073027
[ 7497.470106] RDX: 0000000000000000 RSI: 0000563adcbf2ac0 RDI:
0000000000000005
[ 7497.475477] RBP: 00007f94fb89ca18 R08: 0000000000000000 R09:
0000000000000000
[ 7497.480806] R10: 00000000000b4cc0 R11: 0000000000000293 R12:
0000000000000003
[ 7497.486158] R13: 0000563adcbf2ac0 R14: 0000000000000000 R15:
0000563adcbf2ae8
[ 7619.459402] INFO: task z_wr_int:1098000 blocked for more than 120
seconds.
[ 7619.464605]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7619.472233] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.479659] task:z_wr_int        state:D stack:    0 pid:1098000
ppid:     2 flags:0x80004080
[ 7619.487518] Call Trace:
[ 7619.490650]  __schedule+0x2d1/0x870
[ 7619.494246]  schedule+0x55/0xf0
[ 7619.497719]  cv_wait_common+0x16d/0x280 [spl]
[ 7619.501749]  ? finish_wait+0x80/0x80
[ 7619.505411]  dmu_buf_direct_mixed_io_wait+0x73/0x190 [zfs]
[ 7619.510143]  dmu_write_direct_done+0x90/0x3a0 [zfs]
[ 7619.514603]  zio_done+0x373/0x1d40 [zfs]
[ 7619.518594]  zio_execute+0xee/0x210 [zfs]
[ 7619.522619]  taskq_thread+0x203/0x420 [spl]
[ 7619.526567]  ? wake_up_q+0x70/0x70
[ 7619.530208]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[ 7619.535302]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 7619.539385]  kthread+0x134/0x150
[ 7619.542873]  ? set_kthread_struct+0x50/0x50
[ 7619.546810]  ret_from_fork+0x35/0x40
[ 7619.550477] INFO: task txg_sync:1098025 blocked for more than 120
seconds.
[ 7619.555715]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7619.563415] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.570851] task:txg_sync        state:D stack:    0 pid:1098025
ppid:     2 flags:0x80004080
[ 7619.578606] Call Trace:
[ 7619.581742]  __schedule+0x2d1/0x870
[ 7619.585396]  schedule+0x55/0xf0
[ 7619.589006]  schedule_timeout+0x197/0x300
[ 7619.592916]  ? __next_timer_interrupt+0xf0/0xf0
[ 7619.597027]  io_schedule_timeout+0x19/0x40
[ 7619.600947]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7619.709878]  ? finish_wait+0x80/0x80
[ 7619.713565]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7619.717596]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7619.721567]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 7619.725657]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7619.730050]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 7619.734415]  spa_sync_iterate_to_convergence+0xcf/0x310 [zfs]
[ 7619.739268]  spa_sync+0x362/0x8d0 [zfs]
[ 7619.743270]  txg_sync_thread+0x274/0x3b0 [zfs]
[ 7619.747494]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 7619.751939]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 7619.756279]  thread_generic_wrapper+0x63/0x90 [spl]
[ 7619.760569]  kthread+0x134/0x150
[ 7619.764050]  ? set_kthread_struct+0x50/0x50
[ 7619.767978]  ret_from_fork+0x35/0x40
[ 7619.771639] INFO: task fio:1101750 blocked for more than 120 seconds.
[ 7619.776678]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7619.784324] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.791914] task:fio             state:D stack:    0 pid:1101750
ppid:1101741 flags:0x00004080
[ 7619.799712] Call Trace:
[ 7619.802816]  __schedule+0x2d1/0x870
[ 7619.806427]  schedule+0x55/0xf0
[ 7619.809867]  schedule_timeout+0x197/0x300
[ 7619.813760]  ? __next_timer_interrupt+0xf0/0xf0
[ 7619.817848]  io_schedule_timeout+0x19/0x40
[ 7619.821766]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 7619.826097]  ? finish_wait+0x80/0x80
[ 7619.829780]  __cv_timedwait_io+0x15/0x20 [spl]
[ 7619.833857]  zio_wait+0x1a2/0x4d0 [zfs]
[ 7619.837838]  dmu_write_abd+0x174/0x1c0 [zfs]
[ 7619.842015]  dmu_write_uio_direct+0x79/0xf0 [zfs]
[ 7619.846388]  dmu_write_uio_dnode+0xa6/0x2d0 [zfs]
[ 7619.850760]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 7619.855011]  zfs_write+0x55f/0xea0 [zfs]
[ 7619.859008]  ? iov_iter_get_pages+0xe9/0x390
[ 7619.863036]  ? trylock_page+0xd/0x20 [zfs]
[ 7619.867084]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7619.871366]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7619.875715]  zpl_iter_write_direct+0xda/0x170 [zfs]
[ 7619.880164]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7619.884174]  zpl_iter_write+0xd5/0x110 [zfs]
[ 7619.888492]  new_sync_write+0x112/0x160
[ 7619.892285]  vfs_write+0xa5/0x1b0
[ 7619.895829]  ksys_write+0x4f/0xb0
[ 7619.899384]  do_syscall_64+0x5b/0x1b0
[ 7619.903071]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7619.907394] RIP: 0033:0x7f8771d72a17
[ 7619.911026] Code: Unable to access opcode bytes at RIP
0x7f8771d729ed.
[ 7619.916073] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[ 7619.923363] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72a17
[ 7619.928675] RDX: 000000000009b000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7619.934019] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7619.939354] R10: 0000000000000000 R11: 0000000000000293 R12:
000000000009b000
[ 7619.944775] R13: 000055b390afcac0 R14: 000000000009b000 R15:
000055b390afcae8
[ 7619.950175] INFO: task fio:1101751 blocked for more than 120 seconds.
[ 7619.955232]       Tainted: P           OE    --------- -  -
4.18.0-477.15.1.el8_8.x86_64 openzfs#1
[ 7619.962889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 7619.970301] task:fio             state:D stack:    0 pid:1101751
ppid:1101741 flags:0x00000080
[ 7619.978139] Call Trace:
[ 7619.981278]  __schedule+0x2d1/0x870
[ 7619.984872]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7619.989260]  schedule+0x55/0xf0
[ 7619.992725]  cv_wait_common+0x16d/0x280 [spl]
[ 7619.996754]  ? finish_wait+0x80/0x80
[ 7620.000414]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 7620.005050]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 7620.009617]  zfs_read+0xaf/0x3f0 [zfs]
[ 7620.013503]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7620.017489]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 7620.021774]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 7620.026091]  zpl_iter_read_direct+0xe0/0x180 [zfs]
[ 7620.030508]  ? rrw_exit+0xc6/0x200 [zfs]
[ 7620.034497]  zpl_iter_read+0x94/0xb0 [zfs]
[ 7620.038579]  new_sync_read+0x10f/0x160
[ 7620.042325]  vfs_read+0x91/0x150
[ 7620.045809]  ksys_read+0x4f/0xb0
[ 7620.049273]  do_syscall_64+0x5b/0x1b0
[ 7620.052965]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 7620.057354] RIP: 0033:0x7f8771d72ab4
[ 7620.060988] Code: Unable to access opcode bytes at RIP
0x7f8771d72a8a.
[ 7620.066041] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[ 7620.073256] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007f8771d72ab4
[ 7620.078553] RDX: 0000000000002000 RSI: 00007f8713454000 RDI:
0000000000000005
[ 7620.083878] RBP: 00007f8713454000 R08: 0000000000000000 R09:
0000000000000000
[ 7620.089353] R10: 00000001ffffffff R11: 0000000000000246 R12:
0000000000002000
[ 7620.094697] R13: 000055b390afcac0 R14: 0000000000002000 R15:
000055b390afcae8

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 12, 2024
995734e added a test for block cloning with mmap files. As a result I
began hitting a panic in that test in dbuf_unoverride(). The ASSERT was
that if the dirty record was from a cloned block, then dr_data must be
set to NULL. This ASSERT was added in 86e115e. The point of that commit
was to make sure that if a cloned block is read before it is synced out,
then the associated ARC buffer is set in the dirty record.

This became an issue with the O_DIRECT code, because dr_data was set to
the ARC buf in dbuf_set_data() after the read. This is incorrect logic
for the cloned block, though. In order to fix this issue, I refined how
to determine whether the dirty record is in fact from an O_DIRECT write
by making sure that dr_brtwrite is false. I created the function
dbuf_dirty_is_direct_write() to perform the proper check.

As part of this, I also cleaned up other code that did the exact same
check for an O_DIRECT write to make sure the proper check is taking
place everywhere.

The trace of the ASSERT that was being tripped before this change is
below:
[3649972.811039] VERIFY0P(dr->dt.dl.dr_data) failed (NULL ==
ffff8d58e8183c80)
[3649972.817999] PANIC at dbuf.c:1999:dbuf_unoverride()
[3649972.822968] Showing stack for process 2365657
[3649972.827502] CPU: 0 PID: 2365657 Comm: clone_mmap_writ Kdump: loaded
Tainted: P           OE    --------- -  - 4.18.0-408.el8.x86_64 openzfs#1
[3649972.839749] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS
R21 10/08/2020
[3649972.847315] Call Trace:
[3649972.849935]  dump_stack+0x41/0x60
[3649972.853428]  spl_panic+0xd0/0xe8 [spl]
[3649972.857370]  ? cityhash4+0x75/0x90 [zfs]
[3649972.861649]  ? _cond_resched+0x15/0x30
[3649972.865577]  ? spl_kmem_alloc_impl+0xce/0xf0 [spl]
[3649972.870548]  ? __kmalloc_node+0x10d/0x300
[3649972.874735]  ? spl_kmem_alloc_impl+0xce/0xf0 [spl]
[3649972.879702]  ? __list_add+0x12/0x30 [zfs]
[3649972.884061]  dbuf_unoverride+0x1c1/0x1d0 [zfs]
[3649972.888856]  dbuf_redirty+0x3b/0xd0 [zfs]
[3649972.893204]  dbuf_dirty+0xeb1/0x1330 [zfs]
[3649972.897643]  ? _cond_resched+0x15/0x30
[3649972.901569]  ? mutex_lock+0xe/0x30
[3649972.905148]  ? dbuf_noread+0x117/0x240 [zfs]
[3649972.909760]  dmu_write_uio_dnode+0x1d2/0x320 [zfs]
[3649972.914900]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[3649972.919777]  zfs_write+0x57d/0xe00 [zfs]
[3649972.924076]  ? alloc_set_pte+0xb8/0x3e0
[3649972.928088]  zpl_iter_write_buffered+0xb2/0x120 [zfs]
[3649972.933507]  ? rrw_exit+0xc6/0x200 [zfs]
[3649972.937796]  zpl_iter_write+0xba/0x110 [zfs]
[3649972.942433]  new_sync_write+0x112/0x160
[3649972.946445]  vfs_write+0xa5/0x1a0
[3649972.949935]  ksys_pwrite64+0x61/0xa0
[3649972.953681]  do_syscall_64+0x5b/0x1a0
[3649972.957519]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[3649972.962745] RIP: 0033:0x7f610616f01b

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 12, 2024
Originally I was checking dr->dr_dbuf->db_level == 0 in
dbuf_dirty_is_direct_write(). However, this can lead to a NULL pointer
dereference if dr_dbuf is no longer set.

I updated dbuf_dirty_is_direct_write() to also take a dmu_buf_impl_t so
it can check db->db_level == 0 directly. This failure was caught on the
Fedora 37 CI running the enospc_rm test. Below is the stack trace.

[ 9851.511608] BUG: kernel NULL pointer dereference, address:
0000000000000068
[ 9851.515922] #PF: supervisor read access in kernel mode
[ 9851.519462] #PF: error_code(0x0000) - not-present page
[ 9851.522992] PGD 0 P4D 0
[ 9851.525684] Oops: 0000 [openzfs#1] PREEMPT SMP PTI
[ 9851.528878] CPU: 0 PID: 1272993 Comm: fio Tainted: P           OE
6.5.12-100.fc37.x86_64 openzfs#1
[ 9851.535266] Hardware name: Amazon EC2 m5d.large/, BIOS 1.0 10/16/2017
[ 9851.539226] RIP: 0010:dbuf_dirty_is_direct_write+0xb/0x40 [zfs]
[ 9851.543379] Code: 10 74 02 31 c0 5b c3 cc cc cc cc 0f 1f 40 00 90 90
90 90 90 90 90 90 90 90 90 90 90 90 90 90 31 c0 48 85 ff 74 31 48 8b 57
20 <80> 7a 68 00 75 27 8b 87 64 01 00 00 85 c0 75 1b 83 bf 58 01 00 00
[ 9851.554719] RSP: 0018:ffff9b5b8305f8e8 EFLAGS: 00010286
[ 9851.558276] RAX: 0000000000000000 RBX: ffff9b5b8569b0b8 RCX:
0000000000000000
[ 9851.562481] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff8f2e97de9e00
[ 9851.566672] RBP: 0000000000020000 R08: 0000000000000000 R09:
ffff8f2f70e94000
[ 9851.570851] R10: 0000000000000001 R11: 0000000000000110 R12:
ffff8f2f774ae4c0
[ 9851.575032] R13: 0000000000000000 R14: 0000000000000000 R15:
0000000000000000
[ 9851.579209] FS:  00007f57c5542240(0000) GS:ffff8f2faa800000(0000)
knlGS:0000000000000000
[ 9851.585357] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9851.589064] CR2: 0000000000000068 CR3: 00000001f9a38001 CR4:
00000000007706f0
[ 9851.593256] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 9851.597440] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 9851.601618] PKRU: 55555554
[ 9851.604341] Call Trace:
[ 9851.606981]  <TASK>
[ 9851.609515]  ? __die+0x23/0x70
[ 9851.612388]  ? page_fault_oops+0x171/0x4e0
[ 9851.615571]  ? exc_page_fault+0x77/0x170
[ 9851.618704]  ? asm_exc_page_fault+0x26/0x30
[ 9851.621900]  ? dbuf_dirty_is_direct_write+0xb/0x40 [zfs]
[ 9851.625828]  zfs_get_data+0x407/0x820 [zfs]
[ 9851.629400]  zil_lwb_commit+0x18d/0x3f0 [zfs]
[ 9851.633026]  zil_lwb_write_issue+0x92/0xbb0 [zfs]
[ 9851.636758]  zil_commit_waiter_timeout+0x1f3/0x580 [zfs]
[ 9851.640696]  zil_commit_waiter+0x1ff/0x3a0 [zfs]
[ 9851.644402]  zil_commit_impl+0x71/0xd0 [zfs]
[ 9851.647998]  zfs_write+0xb51/0xdc0 [zfs]
[ 9851.651467]  zpl_iter_write_buffered+0xc9/0x140 [zfs]
[ 9851.655337]  zpl_iter_write+0xc0/0x110 [zfs]
[ 9851.658920]  vfs_write+0x23e/0x420
[ 9851.661871]  __x64_sys_pwrite64+0x98/0xd0
[ 9851.665013]  do_syscall_64+0x5f/0x90
[ 9851.668027]  ? ksys_fadvise64_64+0x57/0xa0
[ 9851.671212]  ? syscall_exit_to_user_mode+0x2b/0x40
[ 9851.674594]  ? do_syscall_64+0x6b/0x90
[ 9851.677655]  ? syscall_exit_to_user_mode+0x2b/0x40
[ 9851.681051]  ? do_syscall_64+0x6b/0x90
[ 9851.684128]  ? exc_page_fault+0x77/0x170
[ 9851.687256]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 9851.690759] RIP: 0033:0x7f57c563c377

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 12, 2024
There existed a race condition between when a Direct I/O write could
complete and when a sync operation was issued. This was due to the fact
that a Direct I/O write would sleep waiting on previous TXGs to sync out
their dirty records associated with a dbuf if there was an ARC buffer
associated with the dbuf. This was necessary to safely destroy the ARC
buffer in case a previous dirty record's dr_data pointed at the db_buf.
The main issue with this approach is that a Direct I/O write holds the
rangelock across the entire block, so when a sync on that same block
was issued and tried to grab the rangelock as a reader, it would be
blocked indefinitely because the Direct I/O write that was now sleeping
was holding that same rangelock as a writer. This led to a complete
deadlock.

This commit fixes this issue and removes the wait in
dmu_write_direct_done().

The way this is now handled is that the ARC buffer is destroyed, if
there is one associated with the dbuf, before ever issuing the Direct
I/O write. This implementation heavily borrows from the block cloning
implementation.

A new function dmu_buf_will_clone_or_dio() is called in both
dmu_write_direct() and dmu_brt_clone() that does the following:
1. Undirties a dirty record for the dbuf if one is currently associated
   with the current TXG.
2. Destroys the ARC buffer if the previous dirty record's dr_data does
   not point at the dbuf's ARC buffer (db_buf).
3. Sets the dbuf's data pointers to NULL.
4. Redirties the dbuf using db_state = DB_NOFILL.

As part of this commit, the dmu_write_direct_done() function was also
cleaned up. Now dmu_sync_done() is called before undirtying the dbuf
dirty record associated with a failed Direct I/O write. This is correct
logic and how it always should have been.

An additional benefit of these modifications is that there is no longer
a stall in a Direct I/O write when the user is mixing buffered and
O_DIRECT I/O together. They also unify the block cloning and Direct I/O
write paths, as both need to call dbuf_fix_old_data() before destroying
the ARC buffer.

This commit also includes general code cleanup. Various dbuf stats were
removed because they are no longer necessary, and unused functions were
removed to make the Direct I/O code paths cleaner.

Below is the race condition stack trace that was being consistently
observed in the CI runs for the dio_random test case that prompted
these changes:
[ 7795.294473] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 9954.769075] INFO: task z_wr_int:1051869 blocked for more than
120
seconds.
[ 9954.770512]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.772159] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.773848] task:z_wr_int        state:D stack:0
pid:1051869
ppid:2      flags:0x80004080
[ 9954.775512] Call Trace:
[ 9954.776406]  __schedule+0x2d1/0x870
[ 9954.777386]  ? free_one_page+0x204/0x530
[ 9954.778466]  schedule+0x55/0xf0
[ 9954.779355]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.780491]  ? finish_wait+0x80/0x80
[ 9954.781450]  dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs]
[ 9954.782889]  dmu_write_direct_done+0x90/0x3b0 [zfs]
[ 9954.784255]  zio_done+0x373/0x1d50 [zfs]
[ 9954.785410]  zio_execute+0xee/0x210 [zfs]
[ 9954.786588]  taskq_thread+0x205/0x3f0 [spl]
[ 9954.787673]  ? wake_up_q+0x60/0x60
[ 9954.788571]  ? zio_execute_stack_check.constprop.1+0x10/0x10
[zfs]
[ 9954.790079]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 9954.791199]  kthread+0x134/0x150
[ 9954.792082]  ? set_kthread_struct+0x50/0x50
[ 9954.793189]  ret_from_fork+0x35/0x40
[ 9954.794108] INFO: task txg_sync:1051894 blocked for more than
120
seconds.
[ 9954.795535]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.797103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.798669] task:txg_sync        state:D stack:0
pid:1051894
ppid:2      flags:0x80004080
[ 9954.800267] Call Trace:
[ 9954.801096]  __schedule+0x2d1/0x870
[ 9954.801972]  ? __wake_up_common+0x7a/0x190
[ 9954.802963]  schedule+0x55/0xf0
[ 9954.803884]  schedule_timeout+0x19f/0x320
[ 9954.804837]  ? __next_timer_interrupt+0xf0/0xf0
[ 9954.805932]  ? taskq_dispatch+0xab/0x280 [spl]
[ 9954.806959]  io_schedule_timeout+0x19/0x40
[ 9954.807989]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 9954.809110]  ? finish_wait+0x80/0x80
[ 9954.810068]  __cv_timedwait_io+0x15/0x20 [spl]
[ 9954.811103]  zio_wait+0x1ad/0x4f0 [zfs]
[ 9954.812255]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 9954.813442]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 9954.814648]  spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
[ 9954.816023]  spa_sync+0x362/0x8f0 [zfs]
[ 9954.817110]  txg_sync_thread+0x27a/0x3b0 [zfs]
[ 9954.818267]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 9954.819510]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 9954.820643]  thread_generic_wrapper+0x63/0x90 [spl]
[ 9954.821709]  kthread+0x134/0x150
[ 9954.822590]  ? set_kthread_struct+0x50/0x50
[ 9954.823584]  ret_from_fork+0x35/0x40
[ 9954.824444] INFO: task fio:1055501 blocked for more than 120
seconds.
[ 9954.825781]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.827315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.828871] task:fio             state:D stack:0
pid:1055501
ppid:1055490 flags:0x00004080
[ 9954.830463] Call Trace:
[ 9954.831280]  __schedule+0x2d1/0x870
[ 9954.832159]  ? dbuf_hold_copy+0xec/0x230 [zfs]
[ 9954.833396]  schedule+0x55/0xf0
[ 9954.834286]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.835291]  ? finish_wait+0x80/0x80
[ 9954.836235]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 9954.837543]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 9954.838838]  zfs_get_data+0x566/0x810 [zfs]
[ 9954.840034]  zil_lwb_commit+0x194/0x3f0 [zfs]
[ 9954.841154]  zil_lwb_write_issue+0x68/0xb90 [zfs]
[ 9954.842367]  ? __list_add+0x12/0x30 [zfs]
[ 9954.843496]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 9954.844665]  ? zil_alloc_lwb+0x217/0x360 [zfs]
[ 9954.845852]  zil_commit_waiter_timeout+0x1f3/0x570 [zfs]
[ 9954.847203]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[ 9954.848380]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 9954.849550]  zfs_fsync+0x66/0x90 [zfs]
[ 9954.850640]  zpl_fsync+0xe5/0x140 [zfs]
[ 9954.851729]  do_fsync+0x38/0x70
[ 9954.852585]  __x64_sys_fsync+0x10/0x20
[ 9954.853486]  do_syscall_64+0x5b/0x1b0
[ 9954.854416]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.855466] RIP: 0033:0x7eff236bb057
[ 9954.856388] Code: Unable to access opcode bytes at RIP
0x7eff236bb02d.
[ 9954.857651] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293
ORIG_RAX:
000000000000004a
[ 9954.859141] RAX: ffffffffffffffda RBX: 0000000000000006 RCX:
00007eff236bb057
[ 9954.860496] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI:
0000000000000006
[ 9954.861945] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09:
0000000000000000
[ 9954.863327] R10: 0000000000056000 R11: 0000000000000293 R12:
0000000000000003
[ 9954.864765] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15:
000055e4d1f13ae8
[ 9954.866149] INFO: task fio:1055502 blocked for more than 120
seconds.
[ 9954.867490]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.869029] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.870571] task:fio             state:D stack:0
pid:1055502
ppid:1055490 flags:0x00004080
[ 9954.872162] Call Trace:
[ 9954.872947]  __schedule+0x2d1/0x870
[ 9954.873844]  schedule+0x55/0xf0
[ 9954.874716]  schedule_timeout+0x19f/0x320
[ 9954.875645]  ? __next_timer_interrupt+0xf0/0xf0
[ 9954.876722]  io_schedule_timeout+0x19/0x40
[ 9954.877677]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 9954.878822]  ? finish_wait+0x80/0x80
[ 9954.879694]  __cv_timedwait_io+0x15/0x20 [spl]
[ 9954.880763]  zio_wait+0x1ad/0x4f0 [zfs]
[ 9954.881865]  dmu_write_abd+0x174/0x1c0 [zfs]
[ 9954.883074]  dmu_write_uio_direct+0x79/0x100 [zfs]
[ 9954.884285]  dmu_write_uio_dnode+0xb2/0x320 [zfs]
[ 9954.885507]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 9954.886687]  zfs_write+0x581/0xe20 [zfs]
[ 9954.887822]  ? iov_iter_get_pages+0xe9/0x390
[ 9954.888862]  ? trylock_page+0xd/0x20 [zfs]
[ 9954.890005]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 9954.891217]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 9954.892391]  zpl_iter_write_direct+0xd4/0x170 [zfs]
[ 9954.893663]  ? rrw_exit+0xc6/0x200 [zfs]
[ 9954.894764]  zpl_iter_write+0xd5/0x110 [zfs]
[ 9954.895911]  new_sync_write+0x112/0x160
[ 9954.896881]  vfs_write+0xa5/0x1b0
[ 9954.897701]  ksys_write+0x4f/0xb0
[ 9954.898569]  do_syscall_64+0x5b/0x1b0
[ 9954.899417]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.900515] RIP: 0033:0x7eff236baa47
[ 9954.901363] Code: Unable to access opcode bytes at RIP
0x7eff236baa1d.
[ 9954.902673] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293
ORIG_RAX:
0000000000000001
[ 9954.904099] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007eff236baa47
[ 9954.905535] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI:
0000000000000005
[ 9954.906902] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09:
0000000000000000
[ 9954.908339] R10: 0000000000000000 R11: 0000000000000293 R12:
00000000000e4000
[ 9954.909705] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15:
000055e4d1f13ae8
[ 9954.911129] INFO: task fio:1055504 blocked for more than 120
seconds.
[ 9954.912381]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.913978] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.915434] task:fio             state:D stack:0
pid:1055504
ppid:1055493 flags:0x00000080
[ 9954.917082] Call Trace:
[ 9954.917773]  __schedule+0x2d1/0x870
[ 9954.918648]  ? zilog_dirty+0x4f/0xc0 [zfs]
[ 9954.919831]  schedule+0x55/0xf0
[ 9954.920717]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.921704]  ? finish_wait+0x80/0x80
[ 9954.922639]  zfs_rangelock_enter_writer+0x46/0x1c0 [zfs]
[ 9954.923940]  zfs_rangelock_enter_impl+0x12a/0x1b0 [zfs]
[ 9954.925306]  zfs_write+0x703/0xe20 [zfs]
[ 9954.926406]  zpl_iter_write_buffered+0xb2/0x120 [zfs]
[ 9954.927687]  ? rrw_exit+0xc6/0x200 [zfs]
[ 9954.928821]  zpl_iter_write+0xbe/0x110 [zfs]
[ 9954.930028]  new_sync_write+0x112/0x160
[ 9954.930913]  vfs_write+0xa5/0x1b0
[ 9954.931758]  ksys_write+0x4f/0xb0
[ 9954.932666]  do_syscall_64+0x5b/0x1b0
[ 9954.933544]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.934689] RIP: 0033:0x7fcaee8f0a47
[ 9954.935551] Code: Unable to access opcode bytes at RIP
0x7fcaee8f0a1d.
[ 9954.936893] RSP: 002b:00007fff56b2c240 EFLAGS: 00000293
ORIG_RAX:
0000000000000001
[ 9954.938327] RAX: ffffffffffffffda RBX: 0000000000000006 RCX:
00007fcaee8f0a47
[ 9954.939777] RDX: 000000000001d000 RSI: 00007fca8300b010 RDI:
0000000000000006
[ 9954.941187] RBP: 00007fca8300b010 R08: 0000000000000000 R09:
0000000000000000
[ 9954.942655] R10: 0000000000000000 R11: 0000000000000293 R12:
000000000001d000
[ 9954.944062] R13: 0000557a2006bac0 R14: 000000000001d000 R15:
0000557a2006bae8
[ 9954.945525] INFO: task fio:1055505 blocked for more than 120
seconds.
[ 9954.946819]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[ 9954.948466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.949959] task:fio             state:D stack:0
pid:1055505
ppid:1055493 flags:0x00004080
[ 9954.951653] Call Trace:
[ 9954.952417]  __schedule+0x2d1/0x870
[ 9954.953393]  ? finish_wait+0x3e/0x80
[ 9954.954315]  schedule+0x55/0xf0
[ 9954.955212]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.956211]  ? finish_wait+0x80/0x80
[ 9954.957159]  zil_commit_waiter+0xfa/0x3b0 [zfs]
[ 9954.958343]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 9954.959524]  zfs_fsync+0x66/0x90 [zfs]
[ 9954.960626]  zpl_fsync+0xe5/0x140 [zfs]
[ 9954.961763]  do_fsync+0x38/0x70
[ 9954.962638]  __x64_sys_fsync+0x10/0x20
[ 9954.963520]  do_syscall_64+0x5b/0x1b0
[ 9954.964470]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.965567] RIP: 0033:0x7fcaee8f1057
[ 9954.966490] Code: Unable to access opcode bytes at RIP
0x7fcaee8f102d.
[ 9954.967752] RSP: 002b:00007fff56b2c230 EFLAGS: 00000293
ORIG_RAX:
000000000000004a
[ 9954.969260] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007fcaee8f1057
[ 9954.970628] RDX: 0000000000000000 RSI: 0000557a2006bac0 RDI:
0000000000000005
[ 9954.972092] RBP: 00007fca84152a18 R08: 0000000000000000 R09:
0000000000000000
[ 9954.973484] R10: 0000000000035000 R11: 0000000000000293 R12:
0000000000000003
[ 9954.974958] R13: 0000557a2006bac0 R14: 0000000000000000 R15:
0000557a2006bae8
[10077.648150] INFO: task z_wr_int:1051869 blocked for more than
120
seconds.
[10077.649541]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[10077.651116] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10077.652782] task:z_wr_int        state:D stack:0
pid:1051869
ppid:2      flags:0x80004080
[10077.654420] Call Trace:
[10077.655267]  __schedule+0x2d1/0x870
[10077.656179]  ? free_one_page+0x204/0x530
[10077.657192]  schedule+0x55/0xf0
[10077.658004]  cv_wait_common+0x16d/0x280 [spl]
[10077.659018]  ? finish_wait+0x80/0x80
[10077.660013]  dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs]
[10077.661396]  dmu_write_direct_done+0x90/0x3b0 [zfs]
[10077.662617]  zio_done+0x373/0x1d50 [zfs]
[10077.663783]  zio_execute+0xee/0x210 [zfs]
[10077.664921]  taskq_thread+0x205/0x3f0 [spl]
[10077.665982]  ? wake_up_q+0x60/0x60
[10077.666842]  ? zio_execute_stack_check.constprop.1+0x10/0x10
[zfs]
[10077.668295]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[10077.669360]  kthread+0x134/0x150
[10077.670191]  ? set_kthread_struct+0x50/0x50
[10077.671209]  ret_from_fork+0x35/0x40
[10077.672076] INFO: task txg_sync:1051894 blocked for more than
120
seconds.
[10077.673467]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[10077.675112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10077.676612] task:txg_sync        state:D stack:0
pid:1051894
ppid:2      flags:0x80004080
[10077.678288] Call Trace:
[10077.679024]  __schedule+0x2d1/0x870
[10077.679948]  ? __wake_up_common+0x7a/0x190
[10077.681042]  schedule+0x55/0xf0
[10077.681899]  schedule_timeout+0x19f/0x320
[10077.682951]  ? __next_timer_interrupt+0xf0/0xf0
[10077.684005]  ? taskq_dispatch+0xab/0x280 [spl]
[10077.685085]  io_schedule_timeout+0x19/0x40
[10077.686080]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[10077.687227]  ? finish_wait+0x80/0x80
[10077.688123]  __cv_timedwait_io+0x15/0x20 [spl]
[10077.689206]  zio_wait+0x1ad/0x4f0 [zfs]
[10077.690300]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[10077.691435]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[10077.692636]  spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
[10077.693997]  spa_sync+0x362/0x8f0 [zfs]
[10077.695112]  txg_sync_thread+0x27a/0x3b0 [zfs]
[10077.696239]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[10077.697512]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[10077.698639]  thread_generic_wrapper+0x63/0x90 [spl]
[10077.699687]  kthread+0x134/0x150
[10077.700567]  ? set_kthread_struct+0x50/0x50
[10077.701502]  ret_from_fork+0x35/0x40
[10077.702430] INFO: task fio:1055501 blocked for more than 120
seconds.
[10077.703697]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[10077.705309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10077.706780] task:fio             state:D stack:0
pid:1055501
ppid:1055490 flags:0x00004080
[10077.708479] Call Trace:
[10077.709231]  __schedule+0x2d1/0x870
[10077.710190]  ? dbuf_hold_copy+0xec/0x230 [zfs]
[10077.711368]  schedule+0x55/0xf0
[10077.712286]  cv_wait_common+0x16d/0x280 [spl]
[10077.713316]  ? finish_wait+0x80/0x80
[10077.714262]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[10077.715566]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[10077.716878]  zfs_get_data+0x566/0x810 [zfs]
[10077.718032]  zil_lwb_commit+0x194/0x3f0 [zfs]
[10077.719234]  zil_lwb_write_issue+0x68/0xb90 [zfs]
[10077.720413]  ? __list_add+0x12/0x30 [zfs]
[10077.721525]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[10077.722708]  ? zil_alloc_lwb+0x217/0x360 [zfs]
[10077.723931]  zil_commit_waiter_timeout+0x1f3/0x570 [zfs]
[10077.725273]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[10077.726438]  zil_commit_impl+0x6d/0xd0 [zfs]
[10077.727586]  zfs_fsync+0x66/0x90 [zfs]
[10077.728675]  zpl_fsync+0xe5/0x140 [zfs]
[10077.729755]  do_fsync+0x38/0x70
[10077.730607]  __x64_sys_fsync+0x10/0x20
[10077.731482]  do_syscall_64+0x5b/0x1b0
[10077.732415]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[10077.733487] RIP: 0033:0x7eff236bb057
[10077.734399] Code: Unable to access opcode bytes at RIP
0x7eff236bb02d.
[10077.735657] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293
ORIG_RAX:
000000000000004a
[10077.737163] RAX: ffffffffffffffda RBX: 0000000000000006 RCX:
00007eff236bb057
[10077.738526] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI:
0000000000000006
[10077.739966] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09:
0000000000000000
[10077.741336] R10: 0000000000056000 R11: 0000000000000293 R12:
0000000000000003
[10077.742773] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15:
000055e4d1f13ae8
[10077.744168] INFO: task fio:1055502 blocked for more than 120
seconds.
[10077.745505]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 openzfs#1
[10077.747073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10077.748642] task:fio             state:D stack:0
pid:1055502
ppid:1055490 flags:0x00004080
[10077.750233] Call Trace:
[10077.751011]  __schedule+0x2d1/0x870
[10077.751915]  schedule+0x55/0xf0
[10077.752811]  schedule_timeout+0x19f/0x320
[10077.753762]  ? __next_timer_interrupt+0xf0/0xf0
[10077.754824]  io_schedule_timeout+0x19/0x40
[10077.755782]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[10077.756922]  ? finish_wait+0x80/0x80
[10077.757788]  __cv_timedwait_io+0x15/0x20 [spl]
[10077.758845]  zio_wait+0x1ad/0x4f0 [zfs]
[10077.759941]  dmu_write_abd+0x174/0x1c0 [zfs]
[10077.761144]  dmu_write_uio_direct+0x79/0x100 [zfs]
[10077.762327]  dmu_write_uio_dnode+0xb2/0x320 [zfs]
[10077.763523]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[10077.764749]  zfs_write+0x581/0xe20 [zfs]
[10077.765825]  ? iov_iter_get_pages+0xe9/0x390
[10077.766842]  ? trylock_page+0xd/0x20 [zfs]
[10077.767956]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[10077.769189]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[10077.770343]  zpl_iter_write_direct+0xd4/0x170 [zfs]
[10077.771570]  ? rrw_exit+0xc6/0x200 [zfs]
[10077.772674]  zpl_iter_write+0xd5/0x110 [zfs]
[10077.773834]  new_sync_write+0x112/0x160
[10077.774805]  vfs_write+0xa5/0x1b0
[10077.775634]  ksys_write+0x4f/0xb0
[10077.776526]  do_syscall_64+0x5b/0x1b0
[10077.777386]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[10077.778488] RIP: 0033:0x7eff236baa47
[10077.779339] Code: Unable to access opcode bytes at RIP
0x7eff236baa1d.
[10077.780655] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293
ORIG_RAX:
0000000000000001
[10077.782056] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007eff236baa47
[10077.783507] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI:
0000000000000005
[10077.784890] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09:
0000000000000000
[10077.786303] R10: 0000000000000000 R11: 0000000000000293 R12:
00000000000e4000
[10077.787637] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15:
000055e4d1f13ae8

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>