Use Barriers in pre-2.6.24 kernels #1
Labels: Type: Feature (feature request or new feature)
The updated code with the POSIX layer no longer builds with kernels older than 2.6.26, so this is no longer an issue. The new API became available in 2.6.24. Unless someone makes a very good case that older kernels need to be supported, this work does not need to happen.
bwatkinson added a commit to bwatkinson/zfs that referenced this issue on Sep 9, 2024
Varada (varada.kari@gmail.com) pointed out an issue with O_DIRECT reads with the following test case:

```
dd if=/dev/urandom of=/local_zpool/file2 bs=512 count=79
truncate -s 40382 /local_zpool/file2
zpool export local_zpool
zpool import -d ~/ local_zpool
dd if=/local_zpool/file2 of=/dev/null bs=1M iflag=direct
```

That led to the following panic:

```
[  307.769267] VERIFY(IS_P2ALIGNED(size, sizeof (uint32_t))) failed
[  307.782997] PANIC at zfs_fletcher.c:870:abd_fletcher_4_iter()
[  307.788743] Showing stack for process 9665
[  307.792834] CPU: 47 PID: 9665 Comm: z_rd_int_5 Kdump: loaded Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1
[  307.804298] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 10/08/2020
[  307.811682] Call Trace:
[  307.814131]  dump_stack+0x41/0x60
[  307.817449]  spl_panic+0xd0/0xe8 [spl]
[  307.821210]  ? irq_work_queue+0x9/0x20
[  307.824961]  ? wake_up_klogd.part.30+0x30/0x40
[  307.829407]  ? vprintk_emit+0x125/0x250
[  307.833246]  ? printk+0x58/0x6f
[  307.836391]  spl_assert.constprop.1+0x16/0x20 [zfs]
[  307.841438]  abd_fletcher_4_iter+0x6c/0x101 [zfs]
[  307.846343]  ? abd_fletcher_4_simd2scalar+0x83/0x83 [zfs]
[  307.851922]  abd_iterate_func+0xb1/0x170 [zfs]
[  307.856533]  abd_fletcher_4_impl+0x3f/0xa0 [zfs]
[  307.861334]  abd_fletcher_4_native+0x52/0x70 [zfs]
[  307.866302]  ? enqueue_entity+0xf1/0x6e0
[  307.870226]  ? select_idle_sibling+0x23/0x700
[  307.874587]  ? enqueue_task_fair+0x94/0x710
[  307.878771]  ? select_task_rq_fair+0x351/0x990
[  307.883208]  zio_checksum_error_impl+0xff/0x5f0 [zfs]
[  307.888435]  ? abd_fletcher_4_impl+0xa0/0xa0 [zfs]
[  307.893401]  ? spl_kmem_alloc_impl+0xce/0xf0 [spl]
[  307.898203]  ? __wake_up_common+0x7a/0x190
[  307.902300]  ? __switch_to_asm+0x41/0x70
[  307.906220]  ? __switch_to_asm+0x35/0x70
[  307.910145]  ? __switch_to_asm+0x41/0x70
[  307.914061]  ? __switch_to_asm+0x35/0x70
[  307.917980]  ? __switch_to_asm+0x41/0x70
[  307.921903]  ? __switch_to_asm+0x35/0x70
[  307.925821]  ? __switch_to_asm+0x35/0x70
[  307.929739]  ? __switch_to_asm+0x41/0x70
[  307.933658]  ? __switch_to_asm+0x35/0x70
[  307.937582]  zio_checksum_error+0x47/0xc0 [zfs]
[  307.942288]  raidz_checksum_verify+0x3a/0x70 [zfs]
[  307.947257]  vdev_raidz_io_done+0x4b/0x160 [zfs]
[  307.952049]  zio_vdev_io_done+0x7f/0x200 [zfs]
[  307.956669]  zio_execute+0xee/0x210 [zfs]
[  307.960855]  taskq_thread+0x203/0x420 [spl]
[  307.965048]  ? wake_up_q+0x70/0x70
[  307.968455]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[  307.974807]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[  307.979260]  kthread+0x10a/0x120
[  307.982485]  ? set_kthread_struct+0x40/0x40
[  307.986670]  ret_from_fork+0x35/0x40
```

This was occurring because the zpool export forced the initial O_DIRECT read to go down to disk. The request itself was still valid, as bs=1M is page-size aligned; however, the file length was not. So when issuing the O_DIRECT read, even after calling make_abd_for_dbuf(), we had an extra page allocated in the original ABD along with the linear ABD attached at the end of the gang ABD from make_abd_for_dbuf(). This is an issue because our expectation with reads is that the block sizes being read are page aligned. The check only covered the request, not the actual size of the data that may be read, such as the entire file.

To remedy this, I updated zfs_read() to attempt to read as much as it can using O_DIRECT, based on whether the length is page-size aligned. Any additional bytes requested are then read into the ARC. This keeps our semantics that I/O requests must be page-size aligned. There is a drawback when only a single block is being read: that block will be read twice, once using O_DIRECT and then again buffered to fill in the remaining data for the user's request. However, this should not be a big issue most of the time. In the normal case a user asks for a lot of data from a file, and only the stray bytes at the end of the file have to be read through the ARC.

To make sure this case is completely covered, I added a new ZTS test case, dio_unaligned_filesize. The main thing it exercises is that the first O_DIRECT read is issued as multiple reads: two being O_DIRECT and a third being buffered for the remaining requested bytes. As part of this commit, I also updated stride_dd to take an additional parameter, -e, which says to read the entire input file and ignore the count (-c) option. We need stride_dd on FreeBSD because its dd does not make sure the buffer is page aligned. This update to stride_dd allows us to use it to test this case in dio_unaligned_filesize on both Linux and FreeBSD.

While this may not be the most elegant solution, it sticks with the semantics and still reads all the data the user requested. I am fine with revisiting this; maybe we just return a short read?

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue on Sep 9, 2024
We were using the generic Linux calls to make sure the page cache was cleaned out before issuing any Direct I/O reads or writes. However, this only matters when the file region being written or read with O_DIRECT is mmap'ed. One of the stipulations with O_DIRECT is that it is redirected through the ARC when the file range is mmap'ed. Because of this, it did not make sense to try to invalidate the page cache if we never intended O_DIRECT to work with mmap'ed regions. Also, the calls into the generic Linux code on writes would often lead to lockups, because the page lock is dropped in zfs_putpage(). See the stack dump below.

To prevent this, we no longer use the generic Linux direct I/O wrappers or try to flush out the page cache. Instead, if we find the file range has been mmap'ed in since the initial check in zfs_setup_direct(), we now handle that directly in zfs_read() and zfs_write(). In most cases zfs_setup_direct() will prevent O_DIRECT to mmap'ed regions of the file that have been page faulted in, but if that happens while we are issuing the direct I/O request, the normal ZFS paths are taken to account for it. It is highly suggested not to mmap a region of a file and then write or read directly to that file. In general, that is kind of an insane thing to do... However, we try our best to still have consistency with the ARC.

Before making this decision, I explored whether we could just add a rangelock in zfs_fillpage(), but we cannot. By the time we are in zfs_readpage_common() the page has already been locked by the kernel, so if we try to grab the rangelock anywhere in that path we can get stuck if another thread is issuing writes to the mmap'ed file region. update_pages() holds the rangelock and then tries to lock the page, while zfs_fillpage() holds the page lock and is stuck waiting on the rangelock.
Deadlock is unavoidable in this case.

```
[260136.244332] INFO: task fio:3791107 blocked for more than 120 seconds.
[260136.250867] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1
[260136.258693] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[260136.266607] task:fio state:D stack: 0 pid:3791107 ppid:3790841 flags:0x00004080
[260136.275306] Call Trace:
[260136.277845]  __schedule+0x2d1/0x830
[260136.281432]  schedule+0x35/0xa0
[260136.284665]  io_schedule+0x12/0x40
[260136.288157]  wait_on_page_bit+0x123/0x220
[260136.292258]  ? xas_load+0x8/0x80
[260136.295577]  ? file_fdatawait_range+0x20/0x20
[260136.300024]  filemap_page_mkwrite+0x9b/0xb0
[260136.304295]  do_page_mkwrite+0x53/0x90
[260136.308135]  ? vm_normal_page+0x1a/0xc0
[260136.312062]  do_wp_page+0x298/0x350
[260136.315640]  __handle_mm_fault+0x44f/0x6c0
[260136.319826]  ? __switch_to_asm+0x41/0x70
[260136.323839]  handle_mm_fault+0xc1/0x1e0
[260136.327766]  do_user_addr_fault+0x1b5/0x440
[260136.332038]  do_page_fault+0x37/0x130
[260136.335792]  ? page_fault+0x8/0x30
[260136.339284]  page_fault+0x1e/0x30
[260136.342689] RIP: 0033:0x7f6deee7f1b4
[260136.346361] Code: Unable to access opcode bytes at RIP 0x7f6deee7f18a.
[260136.352977] RSP: 002b:00007fffe41b6538 EFLAGS: 00010202
[260136.358288] RAX: 00007f6d83049000 RBX: 0000556b63614ec0 RCX: 00007f6d83148fe0
[260136.365508] RDX: 00000000000acfe0 RSI: 00007f6d84e9c030 RDI: 00007f6d8309bfa0
[260136.372730] RBP: 00007f6d84f4a000 R08: ffffffffffffffe0 R09: 0000000000000000
[260136.379946] R10: 00007f6d84f8e810 R11: 00007f6d83049000 R12: 0000000000000001
[260136.387167] R13: 0000556b63614ec0 R14: 0000000000100000 R15: 0000556b63614ee8
[260136.394387] INFO: task fio:3791108 blocked for more than 120 seconds.
[260136.400911] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1
[260136.408739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[260136.416651] task:fio state:D stack: 0 pid:3791108 ppid:3790835 flags:0x00004080
[260136.425343] Call Trace:
[260136.427883]  __schedule+0x2d1/0x830
[260136.431463]  ? cv_wait_common+0x12d/0x240 [spl]
[260136.436091]  schedule+0x35/0xa0
[260136.439321]  io_schedule+0x12/0x40
[260136.442814]  __lock_page+0x12d/0x230
[260136.446483]  ? file_fdatawait_range+0x20/0x20
[260136.450929]  zfs_putpage+0x148/0x590 [zfs]
[260136.455322]  ? rmap_walk_file+0x116/0x290
[260136.459421]  ? __mod_memcg_lruvec_state+0x5d/0x160
[260136.464300]  zpl_putpage+0x67/0xd0 [zfs]
[260136.468495]  write_cache_pages+0x197/0x420
[260136.472679]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260136.477732]  zpl_writepages+0x119/0x130 [zfs]
[260136.482352]  do_writepages+0xc2/0x1c0
[260136.486103]  ? flush_tlb_func_common.constprop.9+0x125/0x220
[260136.491850]  __filemap_fdatawrite_range+0xc7/0x100
[260136.496732]  filemap_write_and_wait_range+0x30/0x80
[260136.501695]  generic_file_direct_write+0x120/0x160
[260136.506575]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260136.510779]  zpl_iter_write+0xdd/0x160 [zfs]
[260136.515323]  new_sync_write+0x112/0x160
[260136.519255]  vfs_write+0xa5/0x1a0
[260136.522662]  ksys_write+0x4f/0xb0
[260136.526067]  do_syscall_64+0x5b/0x1a0
[260136.529818]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260136.534959] RIP: 0033:0x7f9d192c7a17
[260136.538625] Code: Unable to access opcode bytes at RIP 0x7f9d192c79ed.
[260136.545236] RSP: 002b:00007ffc8e4ba270 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[260136.552889] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f9d192c7a17
[260136.560108] RDX: 0000000000100000 RSI: 00007f9caea03000 RDI: 0000000000000005
[260136.567329] RBP: 00007f9caea03000 R08: 0000000000000000 R09: 0000000000000000
[260136.574548] R10: 00005558e8975680 R11: 0000000000000293 R12: 0000000000100000
[260136.581767] R13: 00005558e8985ec0 R14: 0000000000100000 R15: 00005558e8985ee8
[260136.588989] INFO: task fio:3791109 blocked for more than 120 seconds.
[260136.595513] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1
[260136.603337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[260136.611250] task:fio state:D stack: 0 pid:3791109 ppid:3790838 flags:0x00004080
[260136.619943] Call Trace:
[260136.622483]  __schedule+0x2d1/0x830
[260136.626064]  ? zfs_znode_held+0xe6/0x140 [zfs]
[260136.630777]  schedule+0x35/0xa0
[260136.634009]  cv_wait_common+0x153/0x240 [spl]
[260136.638466]  ? finish_wait+0x80/0x80
[260136.642129]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[260136.647712]  zfs_rangelock_enter_impl+0xbf/0x170 [zfs]
[260136.653121]  zfs_get_data+0x113/0x770 [zfs]
[260136.657567]  zil_lwb_commit+0x537/0x780 [zfs]
[260136.662187]  zil_process_commit_list+0x14c/0x460 [zfs]
[260136.667585]  zil_commit_writer+0xeb/0x160 [zfs]
[260136.672376]  zil_commit_impl+0x5d/0xa0 [zfs]
[260136.676910]  zfs_putpage+0x516/0x590 [zfs]
[260136.681279]  zpl_putpage+0x67/0xd0 [zfs]
[260136.685467]  write_cache_pages+0x197/0x420
[260136.689649]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260136.694705]  zpl_writepages+0x119/0x130 [zfs]
[260136.699322]  do_writepages+0xc2/0x1c0
[260136.703076]  __filemap_fdatawrite_range+0xc7/0x100
[260136.707952]  filemap_write_and_wait_range+0x30/0x80
[260136.712920]  zpl_iter_read_direct+0x86/0x1b0 [zfs]
[260136.717972]  ? rrw_exit+0xb0/0x1c0 [zfs]
[260136.722174]  zpl_iter_read+0x90/0xb0 [zfs]
[260136.726536]  new_sync_read+0x10f/0x150
[260136.730376]  vfs_read+0x91/0x140
[260136.733693]  ksys_read+0x4f/0xb0
[260136.737012]  do_syscall_64+0x5b/0x1a0
[260136.740764]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[260136.745906] RIP: 0033:0x7f1bd4687ab4
[260136.749574] Code: Unable to access opcode bytes at RIP 0x7f1bd4687a8a.
[260136.756181] RSP: 002b:00007fff63f65170 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[260136.763834] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f1bd4687ab4
[260136.771056] RDX: 0000000000100000 RSI: 00007f1b69dc3000 RDI: 0000000000000005
[260136.778274] RBP: 00007f1b69dc3000 R08: 0000000000000000 R09: 0000000000000000
[260136.785494] R10: 000000008fd0ea42 R11: 0000000000000246 R12: 0000000000100000
[260136.792714] R13: 000055ca4b405ec0 R14: 0000000000100000 R15: 000055ca4b405ee8
[260259.123003] INFO: task kworker/u128:0:3589938 blocked for more than 120 seconds.
[260259.130487] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1
[260259.138313] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[260259.146224] task:kworker/u128:0 state:D stack: 0 pid:3589938 ppid: 2 flags:0x80004080
[260259.154832] Workqueue: writeback wb_workfn (flush-zfs-540)
[260259.160411] Call Trace:
[260259.162950]  __schedule+0x2d1/0x830
[260259.166531]  schedule+0x35/0xa0
[260259.169765]  io_schedule+0x12/0x40
[260259.173257]  __lock_page+0x12d/0x230
[260259.176921]  ? file_fdatawait_range+0x20/0x20
[260259.181368]  write_cache_pages+0x1f2/0x420
[260259.185554]  ? zpl_readpage_filler+0x10/0x10 [zfs]
[260259.190633]  zpl_writepages+0x98/0x130 [zfs]
[260259.195183]  do_writepages+0xc2/0x1c0
[260259.198935]  __writeback_single_inode+0x39/0x2f0
[260259.203640]  writeback_sb_inodes+0x1e6/0x450
[260259.208002]  __writeback_inodes_wb+0x5f/0xc0
[260259.212359]  wb_writeback+0x247/0x2e0
[260259.216114]  ? get_nr_inodes+0x35/0x50
[260259.219953]  wb_workfn+0x37c/0x4d0
[260259.223443]  ? __switch_to_asm+0x35/0x70
[260259.227456]  ? __switch_to_asm+0x41/0x70
[260259.231469]  ? __switch_to_asm+0x35/0x70
[260259.235481]  ? __switch_to_asm+0x41/0x70
[260259.239495]  ? __switch_to_asm+0x35/0x70
[260259.243505]  ? __switch_to_asm+0x41/0x70
[260259.247518]  ? __switch_to_asm+0x35/0x70
[260259.251533]  ? __switch_to_asm+0x41/0x70
[260259.255545]  process_one_work+0x1a7/0x360
[260259.259645]  worker_thread+0x30/0x390
[260259.263396]  ? create_worker+0x1a0/0x1a0
[260259.267409]  kthread+0x10a/0x120
[260259.270730]  ? set_kthread_struct+0x40/0x40
[260259.275003]  ret_from_fork+0x35/0x40
[... the same hung-task reports for fio:3791107, fio:3791108, fio:3791109, and the writeback kworker repeat every 120 seconds ...]
```

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue on Sep 9, 2024
In commit ba30ec9 I got a little overzealous in code cleanup. While trying to remove all the stable-page code for Linux, I misinterpreted why Brian Behlendorf originally had the try-rangelock, drop page lock, and acquire rangelock sequence in zfs_fillpage(). This is still necessary even without stable pages, to avoid a race condition between Direct I/O writes and pages being faulted in for mmap'ed files. If the rangelock is not held, a Direct I/O write can set db->db_data = NULL either in:

1. dmu_write_direct() -> dmu_buf_will_not_fill() -> dmu_buf_will_fill() -> dbuf_noread() -> dbuf_clear_data()
2. dmu_write_direct_done()

Without the rangelock this can then cause the panic, because dmu_read_impl() can get a NULL pointer for db->db_data when trying to do the memcpy. So this rangelock must be held in zfs_fillpage() no matter what.

There are further semantics on when the rangelock should be held in zfs_fillpage(): it must only be held on the zfs_getpage() -> zfs_fillpage() path. The reason is that mappedread() can call zfs_fillpage() if the page is not uptodate. This can occur because filemap_fault() will first add the pages to the inode's address_space mapping and then drop the page lock, leaving open a window where mappedread() can be called. Since this can occur, mappedread() will hold both the page lock and the rangelock. This is perfectly valid and correct. However, it is important in this case to never grab the rangelock in zfs_fillpage(); if that happens, a deadlock will occur.

Finally, it is important to note that the rangelock is first attempted with zfs_rangelock_tryenter(). The reason is that the page lock must be dropped in order to grab the rangelock in this case. Otherwise there is a race between zfs_fillpage() and zfs_write() -> update_pages(): in update_pages() the rangelock is already held and it then grabs the page lock, so if the page lock is not dropped before acquiring the rangelock in zfs_fillpage() there can be a deadlock.

Below is a stack trace showing the NULL pointer dereference that was occurring with the dio_mmap ZTS test case before this commit.

```
[ 7737.430658] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 7737.438486] PGD 0 P4D 0
[ 7737.441024] Oops: 0000 [openzfs#1] SMP NOPTI
[ 7737.444692] CPU: 33 PID: 599346 Comm: fio Kdump: loaded Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1
[ 7737.455721] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 10/08/2020
[ 7737.463106] RIP: 0010:__memcpy+0x12/0x20
[ 7737.467032] Code: ff 0f 31 48 c1 e2 20 48 09 c2 48 31 d3 e9 79 ff ff ff 90 90 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 a4
[ 7737.485770] RSP: 0000:ffffc1db829e3b60 EFLAGS: 00010246
[ 7737.490987] RAX: ffff9ef195b6f000 RBX: 0000000000001000 RCX: 0000000000000200
[ 7737.498111] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ef195b6f000
[ 7737.505235] RBP: ffff9ef195b70000 R08: ffff9eef1d1d0000 R09: ffff9eef1d1d0000
[ 7737.512358] R10: ffff9eef27968218 R11: 0000000000000000 R12: 0000000000000000
[ 7737.519481] R13: ffff9ef041adb6d8 R14: 00000000005e1000 R15: 0000000000000001
[ 7737.526607] FS: 00007f77fe2bae80(0000) GS:ffff9f0cae840000(0000) knlGS:0000000000000000
[ 7737.534683] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7737.540423] CR2: 0000000000000000 CR3: 00000003387a6000 CR4: 0000000000350ee0
[ 7737.547553] Call Trace:
[ 7737.550003]  dmu_read_impl+0x11a/0x210 [zfs]
[ 7737.554464]  dmu_read+0x56/0x90 [zfs]
[ 7737.558292]  zfs_fillpage+0x76/0x190 [zfs]
[ 7737.562584]  zfs_getpage+0x4c/0x80 [zfs]
[ 7737.566691]  zpl_readpage_common+0x3b/0x80 [zfs]
[ 7737.571485]  filemap_fault+0x5d6/0xa10
[ 7737.575236]  ? obj_cgroup_charge_pages+0xba/0xd0
[ 7737.579856]  ? xas_load+0x8/0x80
[ 7737.583088]  ? xas_find+0x173/0x1b0
[ 7737.586579]  ? filemap_map_pages+0x84/0x410
[ 7737.590759]  __do_fault+0x38/0xb0
[ 7737.594077]  handle_pte_fault+0x559/0x870
[ 7737.598082]  __handle_mm_fault+0x44f/0x6c0
[ 7737.602181]  handle_mm_fault+0xc1/0x1e0
[ 7737.606019]  do_user_addr_fault+0x1b5/0x440
[ 7737.610207]  do_page_fault+0x37/0x130
[ 7737.613873]  ? page_fault+0x8/0x30
[ 7737.617277]  page_fault+0x1e/0x30
[ 7737.620589] RIP: 0033:0x7f77fbce9140
```

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue on Sep 9, 2024
It is important to hold the dbuf mutex (db_mtx) when creating ZIO's in dmu_read_abd(). The BP that is returned dmu_buf_get_gp_from_dbuf() may come from a previous direct IO write. In this case, it is attached to a dirty record in the dbuf. When zio_read() is called, a copy of the BP is made through io_bp_copy to io_bp in zio_create(). Without holding the db_mtx though, the dirty record may be freed in dbuf_read_done(). This can result in garbage beening place BP for the ZIO creatd through zio_read(). By holding the db_mtx, this race can be avoided. Below is a stack trace of the issue that was occuring in vdev_mirror_child_select() without holding the db_mtx and creating the the ZIO. [29882.427056] VERIFY(zio->io_bp == NULL || BP_PHYSICAL_BIRTH(zio->io_bp) == txg) failed [29882.434884] PANIC at vdev_mirror.c:545:vdev_mirror_child_select() [29882.440976] Showing stack for process 1865540 [29882.445336] CPU: 57 PID: 1865540 Comm: fio Kdump: loaded Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [29882.456457] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 10/08/2020 [29882.463844] Call Trace: [29882.466296] dump_stack+0x41/0x60 [29882.469618] spl_panic+0xd0/0xe8 [spl] [29882.473376] ? __dprintf+0x10e/0x180 [zfs] [29882.477674] ? kfree+0xd3/0x250 [29882.480819] ? __dprintf+0x10e/0x180 [zfs] [29882.485103] ? vdev_mirror_map_alloc+0x29/0x50 [zfs] [29882.490250] ? vdev_lookup_top+0x20/0x90 [zfs] [29882.494878] spl_assert+0x17/0x20 [zfs] [29882.498893] vdev_mirror_child_select+0x279/0x300 [zfs] [29882.504289] vdev_mirror_io_start+0x11f/0x2b0 [zfs] [29882.509336] zio_vdev_io_start+0x3ee/0x520 [zfs] [29882.514137] zio_nowait+0x105/0x290 [zfs] [29882.518330] dmu_read_abd+0x196/0x460 [zfs] [29882.522691] dmu_read_uio_direct+0x6d/0xf0 [zfs] [29882.527472] dmu_read_uio_dnode+0x12a/0x140 [zfs] [29882.532345] dmu_read_uio_dbuf+0x3f/0x60 [zfs] [29882.536953] zfs_read+0x238/0x3f0 [zfs] [29882.540976] zpl_iter_read_direct+0xe0/0x180 [zfs] [29882.545952] ? 
rrw_exit+0xc6/0x200 [zfs] [29882.550058] zpl_iter_read+0x90/0xb0 [zfs] [29882.554340] new_sync_read+0x10f/0x150 [29882.558094] vfs_read+0x91/0x140 [29882.561325] ksys_read+0x4f/0xb0 [29882.564557] do_syscall_64+0x5b/0x1a0 [29882.568222] entry_SYSCALL_64_after_hwframe+0x65/0xca [29882.573267] RIP: 0033:0x7f7fe0fa6ab4 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
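The race described in the commit above (copy a BP out of a dirty record while another thread may free that record) can be modeled in user space. This is a minimal sketch, not the OpenZFS API: the `toy_*` names and the snapshot helper are illustrative stand-ins for the db_mtx/io_bp_copy pattern.

```c
#include <pthread.h>
#include <string.h>
#include <stdint.h>

/* Toy model of the fix: the block pointer that dmu_read_abd() hands to
 * zio_read() lives inside a dirty record that another thread
 * (dbuf_read_done()) can free.  Taking the copy while the dbuf mutex is
 * held guarantees the copy comes from live memory. */
typedef struct { uint64_t bp[4]; } toy_blkptr_t;

typedef struct {
	pthread_mutex_t db_mtx;
	toy_blkptr_t   *db_bp;	/* NULL once the dirty record is freed */
} toy_dbuf_t;

int
toy_bp_snapshot(toy_dbuf_t *db, toy_blkptr_t *out)
{
	pthread_mutex_lock(&db->db_mtx);
	if (db->db_bp == NULL) {	/* dirty record already gone */
		pthread_mutex_unlock(&db->db_mtx);
		return (-1);
	}
	memcpy(out, db->db_bp, sizeof (*out));	/* io_bp_copy analogue */
	pthread_mutex_unlock(&db->db_mtx);
	return (0);
}
```

Any thread that frees the record must take the same mutex first, so the snapshot can never read freed memory.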
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 9, 2024
There existed a race condition that was discovered through the dio_random test. When doing fio with --fsync=32, fsync is called on the file after every 32 writes. When this happens, blocks committed to the ZIL will be synced out. However, the code for the O_DIRECT write was updated in 31983d2 to always wait for all previous TXG's to sync out if there was an associated ARC buf with the dbuf. There was an oversight with this update. While waiting on previous TXG's to sync out, the O_DIRECT write is holding the rangelock as a writer the entire time. This causes an issue when the ZIL commits writes out through `zfs_get_data()`, because it will try to grab the rangelock as a reader. This leads to a deadlock. In order to fix this race condition, I updated the `dmu_buf_impl_t` struct to contain a uint8_t variable that is used to signal that the dbuf attached to an O_DIRECT write is in a wait hold because of mixed direct and buffered data. Using this new `db_mixed_io_dio_wait` variable in the `dmu_buf_impl_t`, the code in `zfs_get_data()` can tell that the rangelock is already being held across the entire block and there is no need to grab the rangelock at all. Because the rangelock is already being held as a writer across the entire block, no modifications can take place against the block as long as the O_DIRECT write is stalled waiting in `dmu_buf_direct_mixed_io_wait()`. Also as part of this update, I realized the `db_state` in `dmu_buf_direct_mixed_io_wait()` needs to be changed temporarily to `DB_CACHED`. This is necessary so the logic in `dbuf_read()` is correct if `dmu_sync_late_arrival()` is called by `dmu_sync()`. It is completely valid to switch the `db_state` back to `DB_CACHED`, as there is still an associated ARC buf that will not be freed until our O_DIRECT write is completed, which only happens after it leaves `dmu_buf_direct_mixed_io_wait()`.
Here is the stack trace of the deadlock that happen with `dio_random.ksh` before this commit: [ 5513.663402] sd 0:0:0:0: [sda] Synchronizing SCSI cache [ 7496.580415] INFO: task z_wr_int:1098000 blocked for more than 120 seconds. [ 7496.585709] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7496.593349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7496.600839] task:z_wr_int state:D stack: 0 pid:1098000 ppid: 2 flags:0x80004080 [ 7496.608622] Call Trace: [ 7496.611770] __schedule+0x2d1/0x870 [ 7496.615404] schedule+0x55/0xf0 [ 7496.618866] cv_wait_common+0x16d/0x280 [spl] [ 7496.622910] ? finish_wait+0x80/0x80 [ 7496.626601] dmu_buf_direct_mixed_io_wait+0x73/0x190 [zfs] [ 7496.631327] dmu_write_direct_done+0x90/0x3a0 [zfs] [ 7496.635798] zio_done+0x373/0x1d40 [zfs] [ 7496.639795] zio_execute+0xee/0x210 [zfs] [ 7496.643840] taskq_thread+0x203/0x420 [spl] [ 7496.647836] ? wake_up_q+0x70/0x70 [ 7496.651411] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [ 7496.656489] ? taskq_lowest_id+0xc0/0xc0 [spl] [ 7496.660604] kthread+0x134/0x150 [ 7496.664092] ? set_kthread_struct+0x50/0x50 [ 7496.668080] ret_from_fork+0x35/0x40 [ 7496.671745] INFO: task txg_sync:1098025 blocked for more than 120 seconds. [ 7496.676991] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7496.684666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7496.692060] task:txg_sync state:D stack: 0 pid:1098025 ppid: 2 flags:0x80004080 [ 7496.699888] Call Trace: [ 7496.703012] __schedule+0x2d1/0x870 [ 7496.706658] schedule+0x55/0xf0 [ 7496.710093] schedule_timeout+0x197/0x300 [ 7496.713982] ? __next_timer_interrupt+0xf0/0xf0 [ 7496.718135] io_schedule_timeout+0x19/0x40 [ 7496.722049] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 7496.726349] ? 
finish_wait+0x80/0x80 [ 7496.730039] __cv_timedwait_io+0x15/0x20 [spl] [ 7496.734100] zio_wait+0x1a2/0x4d0 [zfs] [ 7496.738082] dsl_pool_sync+0xcb/0x6c0 [zfs] [ 7496.742205] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7496.746534] ? spa_errlog_sync+0x2f0/0x3d0 [zfs] [ 7496.750842] spa_sync_iterate_to_convergence+0xcf/0x310 [zfs] [ 7496.755742] spa_sync+0x362/0x8d0 [zfs] [ 7496.759689] txg_sync_thread+0x274/0x3b0 [zfs] [ 7496.763928] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] [ 7496.768439] ? spl_assert.constprop.0+0x20/0x20 [spl] [ 7496.772799] thread_generic_wrapper+0x63/0x90 [spl] [ 7496.777097] kthread+0x134/0x150 [ 7496.780616] ? set_kthread_struct+0x50/0x50 [ 7496.784549] ret_from_fork+0x35/0x40 [ 7496.788204] INFO: task fio:1101750 blocked for more than 120 seconds. [ 7496.895852] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7496.903765] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7496.911170] task:fio state:D stack: 0 pid:1101750 ppid:1101741 flags:0x00004080 [ 7496.919033] Call Trace: [ 7496.922136] __schedule+0x2d1/0x870 [ 7496.925769] schedule+0x55/0xf0 [ 7496.929245] schedule_timeout+0x197/0x300 [ 7496.933120] ? __next_timer_interrupt+0xf0/0xf0 [ 7496.937213] io_schedule_timeout+0x19/0x40 [ 7496.941126] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 7496.945444] ? finish_wait+0x80/0x80 [ 7496.949125] __cv_timedwait_io+0x15/0x20 [spl] [ 7496.953191] zio_wait+0x1a2/0x4d0 [zfs] [ 7496.957180] dmu_write_abd+0x174/0x1c0 [zfs] [ 7496.961319] dmu_write_uio_direct+0x79/0xf0 [zfs] [ 7496.965731] dmu_write_uio_dnode+0xa6/0x2d0 [zfs] [ 7496.970043] dmu_write_uio_dbuf+0x47/0x60 [zfs] [ 7496.974305] zfs_write+0x55f/0xea0 [zfs] [ 7496.978325] ? iov_iter_get_pages+0xe9/0x390 [ 7496.982333] ? trylock_page+0xd/0x20 [zfs] [ 7496.986451] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7496.990713] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [ 7496.995031] zpl_iter_write_direct+0xda/0x170 [zfs] [ 7496.999489] ? 
rrw_exit+0xc6/0x200 [zfs] [ 7497.003476] zpl_iter_write+0xd5/0x110 [zfs] [ 7497.007610] new_sync_write+0x112/0x160 [ 7497.011429] vfs_write+0xa5/0x1b0 [ 7497.014916] ksys_write+0x4f/0xb0 [ 7497.018443] do_syscall_64+0x5b/0x1b0 [ 7497.022150] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7497.026532] RIP: 0033:0x7f8771d72a17 [ 7497.030195] Code: Unable to access opcode bytes at RIP 0x7f8771d729ed. [ 7497.035263] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 7497.042547] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8771d72a17 [ 7497.047933] RDX: 000000000009b000 RSI: 00007f8713454000 RDI: 0000000000000005 [ 7497.053269] RBP: 00007f8713454000 R08: 0000000000000000 R09: 0000000000000000 [ 7497.058660] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000009b000 [ 7497.063960] R13: 000055b390afcac0 R14: 000000000009b000 R15: 000055b390afcae8 [ 7497.069334] INFO: task fio:1101751 blocked for more than 120 seconds. [ 7497.074308] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7497.081973] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7497.089371] task:fio state:D stack: 0 pid:1101751 ppid:1101741 flags:0x00000080 [ 7497.097147] Call Trace: [ 7497.100263] __schedule+0x2d1/0x870 [ 7497.103897] ? rrw_exit+0xc6/0x200 [zfs] [ 7497.107878] schedule+0x55/0xf0 [ 7497.111386] cv_wait_common+0x16d/0x280 [spl] [ 7497.115391] ? finish_wait+0x80/0x80 [ 7497.119028] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [ 7497.123667] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [ 7497.128240] zfs_read+0xaf/0x3f0 [zfs] [ 7497.132146] ? rrw_exit+0xc6/0x200 [zfs] [ 7497.136091] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7497.140366] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [ 7497.144679] zpl_iter_read_direct+0xe0/0x180 [zfs] [ 7497.149054] ? 
rrw_exit+0xc6/0x200 [zfs] [ 7497.153040] zpl_iter_read+0x94/0xb0 [zfs] [ 7497.157103] new_sync_read+0x10f/0x160 [ 7497.160855] vfs_read+0x91/0x150 [ 7497.164336] ksys_read+0x4f/0xb0 [ 7497.168004] do_syscall_64+0x5b/0x1b0 [ 7497.171706] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7497.176105] RIP: 0033:0x7f8771d72ab4 [ 7497.179742] Code: Unable to access opcode bytes at RIP 0x7f8771d72a8a. [ 7497.184807] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 7497.192129] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8771d72ab4 [ 7497.197485] RDX: 0000000000002000 RSI: 00007f8713454000 RDI: 0000000000000005 [ 7497.202922] RBP: 00007f8713454000 R08: 0000000000000000 R09: 0000000000000000 [ 7497.208309] R10: 00000001ffffffff R11: 0000000000000246 R12: 0000000000002000 [ 7497.213694] R13: 000055b390afcac0 R14: 0000000000002000 R15: 000055b390afcae8 [ 7497.219063] INFO: task fio:1101755 blocked for more than 120 seconds. [ 7497.224098] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7497.231786] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7497.239165] task:fio state:D stack: 0 pid:1101755 ppid:1101744 flags:0x00000080 [ 7497.246989] Call Trace: [ 7497.250121] __schedule+0x2d1/0x870 [ 7497.253779] schedule+0x55/0xf0 [ 7497.257240] schedule_preempt_disabled+0xa/0x10 [ 7497.261344] __mutex_lock.isra.7+0x349/0x420 [ 7497.265326] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7497.269674] zil_commit_writer+0x89/0x230 [zfs] [ 7497.273938] zil_commit_impl+0x5f/0xd0 [zfs] [ 7497.278101] zfs_fsync+0x81/0xa0 [zfs] [ 7497.282002] zpl_fsync+0xe5/0x140 [zfs] [ 7497.285985] do_fsync+0x38/0x70 [ 7497.289458] __x64_sys_fsync+0x10/0x20 [ 7497.293208] do_syscall_64+0x5b/0x1b0 [ 7497.296928] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7497.301260] RIP: 0033:0x7f9559073027 [ 7497.304920] Code: Unable to access opcode bytes at RIP 0x7f9559072ffd. 
[ 7497.310015] RSP: 002b:00007ffdefcd0ff0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [ 7497.317346] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f9559073027 [ 7497.322722] RDX: 0000000000000000 RSI: 0000563adcbf2ac0 RDI: 0000000000000005 [ 7497.328126] RBP: 00007f94fb858000 R08: 0000000000000000 R09: 0000000000000000 [ 7497.333514] R10: 0000000000008000 R11: 0000000000000293 R12: 0000000000000003 [ 7497.338887] R13: 0000563adcbf2ac0 R14: 0000000000000000 R15: 0000563adcbf2ae8 [ 7497.344247] INFO: task fio:1101756 blocked for more than 120 seconds. [ 7497.349327] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7497.357032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7497.364517] task:fio state:D stack: 0 pid:1101756 ppid:1101744 flags:0x00004080 [ 7497.372310] Call Trace: [ 7497.375433] __schedule+0x2d1/0x870 [ 7497.379004] schedule+0x55/0xf0 [ 7497.382454] cv_wait_common+0x16d/0x280 [spl] [ 7497.386477] ? finish_wait+0x80/0x80 [ 7497.390137] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [ 7497.394816] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [ 7497.399397] zfs_get_data+0x1a8/0x7e0 [zfs] [ 7497.403515] zil_lwb_commit+0x1a5/0x400 [zfs] [ 7497.407712] zil_lwb_write_close+0x408/0x630 [zfs] [ 7497.412126] zil_commit_waiter_timeout+0x16d/0x520 [zfs] [ 7497.416801] zil_commit_waiter+0x1d2/0x3b0 [zfs] [ 7497.421139] zil_commit_impl+0x6d/0xd0 [zfs] [ 7497.425294] zfs_fsync+0x81/0xa0 [zfs] [ 7497.429454] zpl_fsync+0xe5/0x140 [zfs] [ 7497.433396] do_fsync+0x38/0x70 [ 7497.436878] __x64_sys_fsync+0x10/0x20 [ 7497.440586] do_syscall_64+0x5b/0x1b0 [ 7497.444313] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7497.448659] RIP: 0033:0x7f9559073027 [ 7497.452343] Code: Unable to access opcode bytes at RIP 0x7f9559072ffd. 
[ 7497.457408] RSP: 002b:00007ffdefcd0ff0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [ 7497.464724] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f9559073027 [ 7497.470106] RDX: 0000000000000000 RSI: 0000563adcbf2ac0 RDI: 0000000000000005 [ 7497.475477] RBP: 00007f94fb89ca18 R08: 0000000000000000 R09: 0000000000000000 [ 7497.480806] R10: 00000000000b4cc0 R11: 0000000000000293 R12: 0000000000000003 [ 7497.486158] R13: 0000563adcbf2ac0 R14: 0000000000000000 R15: 0000563adcbf2ae8 [ 7619.459402] INFO: task z_wr_int:1098000 blocked for more than 120 seconds. [ 7619.464605] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7619.472233] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7619.479659] task:z_wr_int state:D stack: 0 pid:1098000 ppid: 2 flags:0x80004080 [ 7619.487518] Call Trace: [ 7619.490650] __schedule+0x2d1/0x870 [ 7619.494246] schedule+0x55/0xf0 [ 7619.497719] cv_wait_common+0x16d/0x280 [spl] [ 7619.501749] ? finish_wait+0x80/0x80 [ 7619.505411] dmu_buf_direct_mixed_io_wait+0x73/0x190 [zfs] [ 7619.510143] dmu_write_direct_done+0x90/0x3a0 [zfs] [ 7619.514603] zio_done+0x373/0x1d40 [zfs] [ 7619.518594] zio_execute+0xee/0x210 [zfs] [ 7619.522619] taskq_thread+0x203/0x420 [spl] [ 7619.526567] ? wake_up_q+0x70/0x70 [ 7619.530208] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [ 7619.535302] ? taskq_lowest_id+0xc0/0xc0 [spl] [ 7619.539385] kthread+0x134/0x150 [ 7619.542873] ? set_kthread_struct+0x50/0x50 [ 7619.546810] ret_from_fork+0x35/0x40 [ 7619.550477] INFO: task txg_sync:1098025 blocked for more than 120 seconds. [ 7619.555715] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7619.563415] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[ 7619.570851] task:txg_sync state:D stack: 0 pid:1098025 ppid: 2 flags:0x80004080 [ 7619.578606] Call Trace: [ 7619.581742] __schedule+0x2d1/0x870 [ 7619.585396] schedule+0x55/0xf0 [ 7619.589006] schedule_timeout+0x197/0x300 [ 7619.592916] ? __next_timer_interrupt+0xf0/0xf0 [ 7619.597027] io_schedule_timeout+0x19/0x40 [ 7619.600947] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 7619.709878] ? finish_wait+0x80/0x80 [ 7619.713565] __cv_timedwait_io+0x15/0x20 [spl] [ 7619.717596] zio_wait+0x1a2/0x4d0 [zfs] [ 7619.721567] dsl_pool_sync+0xcb/0x6c0 [zfs] [ 7619.725657] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7619.730050] ? spa_errlog_sync+0x2f0/0x3d0 [zfs] [ 7619.734415] spa_sync_iterate_to_convergence+0xcf/0x310 [zfs] [ 7619.739268] spa_sync+0x362/0x8d0 [zfs] [ 7619.743270] txg_sync_thread+0x274/0x3b0 [zfs] [ 7619.747494] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] [ 7619.751939] ? spl_assert.constprop.0+0x20/0x20 [spl] [ 7619.756279] thread_generic_wrapper+0x63/0x90 [spl] [ 7619.760569] kthread+0x134/0x150 [ 7619.764050] ? set_kthread_struct+0x50/0x50 [ 7619.767978] ret_from_fork+0x35/0x40 [ 7619.771639] INFO: task fio:1101750 blocked for more than 120 seconds. [ 7619.776678] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7619.784324] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7619.791914] task:fio state:D stack: 0 pid:1101750 ppid:1101741 flags:0x00004080 [ 7619.799712] Call Trace: [ 7619.802816] __schedule+0x2d1/0x870 [ 7619.806427] schedule+0x55/0xf0 [ 7619.809867] schedule_timeout+0x197/0x300 [ 7619.813760] ? __next_timer_interrupt+0xf0/0xf0 [ 7619.817848] io_schedule_timeout+0x19/0x40 [ 7619.821766] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 7619.826097] ? 
finish_wait+0x80/0x80 [ 7619.829780] __cv_timedwait_io+0x15/0x20 [spl] [ 7619.833857] zio_wait+0x1a2/0x4d0 [zfs] [ 7619.837838] dmu_write_abd+0x174/0x1c0 [zfs] [ 7619.842015] dmu_write_uio_direct+0x79/0xf0 [zfs] [ 7619.846388] dmu_write_uio_dnode+0xa6/0x2d0 [zfs] [ 7619.850760] dmu_write_uio_dbuf+0x47/0x60 [zfs] [ 7619.855011] zfs_write+0x55f/0xea0 [zfs] [ 7619.859008] ? iov_iter_get_pages+0xe9/0x390 [ 7619.863036] ? trylock_page+0xd/0x20 [zfs] [ 7619.867084] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7619.871366] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [ 7619.875715] zpl_iter_write_direct+0xda/0x170 [zfs] [ 7619.880164] ? rrw_exit+0xc6/0x200 [zfs] [ 7619.884174] zpl_iter_write+0xd5/0x110 [zfs] [ 7619.888492] new_sync_write+0x112/0x160 [ 7619.892285] vfs_write+0xa5/0x1b0 [ 7619.895829] ksys_write+0x4f/0xb0 [ 7619.899384] do_syscall_64+0x5b/0x1b0 [ 7619.903071] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7619.907394] RIP: 0033:0x7f8771d72a17 [ 7619.911026] Code: Unable to access opcode bytes at RIP 0x7f8771d729ed. [ 7619.916073] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 7619.923363] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8771d72a17 [ 7619.928675] RDX: 000000000009b000 RSI: 00007f8713454000 RDI: 0000000000000005 [ 7619.934019] RBP: 00007f8713454000 R08: 0000000000000000 R09: 0000000000000000 [ 7619.939354] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000009b000 [ 7619.944775] R13: 000055b390afcac0 R14: 000000000009b000 R15: 000055b390afcae8 [ 7619.950175] INFO: task fio:1101751 blocked for more than 120 seconds. [ 7619.955232] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7619.962889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7619.970301] task:fio state:D stack: 0 pid:1101751 ppid:1101741 flags:0x00000080 [ 7619.978139] Call Trace: [ 7619.981278] __schedule+0x2d1/0x870 [ 7619.984872] ? 
rrw_exit+0xc6/0x200 [zfs] [ 7619.989260] schedule+0x55/0xf0 [ 7619.992725] cv_wait_common+0x16d/0x280 [spl] [ 7619.996754] ? finish_wait+0x80/0x80 [ 7620.000414] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [ 7620.005050] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [ 7620.009617] zfs_read+0xaf/0x3f0 [zfs] [ 7620.013503] ? rrw_exit+0xc6/0x200 [zfs] [ 7620.017489] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7620.021774] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [ 7620.026091] zpl_iter_read_direct+0xe0/0x180 [zfs] [ 7620.030508] ? rrw_exit+0xc6/0x200 [zfs] [ 7620.034497] zpl_iter_read+0x94/0xb0 [zfs] [ 7620.038579] new_sync_read+0x10f/0x160 [ 7620.042325] vfs_read+0x91/0x150 [ 7620.045809] ksys_read+0x4f/0xb0 [ 7620.049273] do_syscall_64+0x5b/0x1b0 [ 7620.052965] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7620.057354] RIP: 0033:0x7f8771d72ab4 [ 7620.060988] Code: Unable to access opcode bytes at RIP 0x7f8771d72a8a. [ 7620.066041] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 7620.073256] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8771d72ab4 [ 7620.078553] RDX: 0000000000002000 RSI: 00007f8713454000 RDI: 0000000000000005 [ 7620.083878] RBP: 00007f8713454000 R08: 0000000000000000 R09: 0000000000000000 [ 7620.089353] R10: 00000001ffffffff R11: 0000000000000246 R12: 0000000000002000 [ 7620.094697] R13: 000055b390afcac0 R14: 0000000000002000 R15: 000055b390afcae8 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 9, 2024
995734e added a test for block cloning with mmap files. As a result I began hitting a panic in that test in dbuf_unoverride(). The ASSERT was that if the dirty record was from a cloned block, then dr_data must be set to NULL. This ASSERT was added in 86e115e. The point of that commit was to make sure that if a cloned block is read before it is synced out, then the associated ARC buffer is set in the dirty record. This became an issue with the O_DIRECT code, because dr_data was set to the ARC buf in dbuf_set_data() after the read. This is the incorrect logic for a cloned block, though. In order to fix this issue, I refined how to determine whether the dirty record is in fact from an O_DIRECT write by making sure that dr_brtwrite is false. I created the function dbuf_dirty_is_direct_write() to perform the proper check. As part of this, I also cleaned up other code that performed the same O_DIRECT write check to make sure the proper check takes place everywhere. The trace of the ASSERT that was being tripped before this change is below: [3649972.811039] VERIFY0P(dr->dt.dl.dr_data) failed (NULL == ffff8d58e8183c80) [3649972.817999] PANIC at dbuf.c:1999:dbuf_unoverride() [3649972.822968] Showing stack for process 2365657 [3649972.827502] CPU: 0 PID: 2365657 Comm: clone_mmap_writ Kdump: loaded Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [3649972.839749] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 10/08/2020 [3649972.847315] Call Trace: [3649972.849935] dump_stack+0x41/0x60 [3649972.853428] spl_panic+0xd0/0xe8 [spl] [3649972.857370] ? cityhash4+0x75/0x90 [zfs] [3649972.861649] ? _cond_resched+0x15/0x30 [3649972.865577] ? spl_kmem_alloc_impl+0xce/0xf0 [spl] [3649972.870548] ? __kmalloc_node+0x10d/0x300 [3649972.874735] ? spl_kmem_alloc_impl+0xce/0xf0 [spl] [3649972.879702] ?
__list_add+0x12/0x30 [zfs] [3649972.884061] dbuf_unoverride+0x1c1/0x1d0 [zfs] [3649972.888856] dbuf_redirty+0x3b/0xd0 [zfs] [3649972.893204] dbuf_dirty+0xeb1/0x1330 [zfs] [3649972.897643] ? _cond_resched+0x15/0x30 [3649972.901569] ? mutex_lock+0xe/0x30 [3649972.905148] ? dbuf_noread+0x117/0x240 [zfs] [3649972.909760] dmu_write_uio_dnode+0x1d2/0x320 [zfs] [3649972.914900] dmu_write_uio_dbuf+0x47/0x60 [zfs] [3649972.919777] zfs_write+0x57d/0xe00 [zfs] [3649972.924076] ? alloc_set_pte+0xb8/0x3e0 [3649972.928088] zpl_iter_write_buffered+0xb2/0x120 [zfs] [3649972.933507] ? rrw_exit+0xc6/0x200 [zfs] [3649972.937796] zpl_iter_write+0xba/0x110 [zfs] [3649972.942433] new_sync_write+0x112/0x160 [3649972.946445] vfs_write+0xa5/0x1a0 [3649972.949935] ksys_pwrite64+0x61/0xa0 [3649972.953681] do_syscall_64+0x5b/0x1a0 [3649972.957519] entry_SYSCALL_64_after_hwframe+0x65/0xca [3649972.962745] RIP: 0033:0x7f610616f01b Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
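The predicate this commit introduces (a dirty record counts as a Direct I/O write only when it is not a block-clone record) can be sketched as a toy model. The `toy_*` struct and field names below are illustrative stand-ins, not the real `dbuf_dirty_record_t` layout.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of dbuf_dirty_is_direct_write(): a level-0 overridden
 * dirty record is a Direct I/O write only if it did NOT come from
 * block cloning (dr_brtwrite). */
typedef struct {
	bool dr_brtwrite;	/* record came from a block clone */
	bool dr_overridden;	/* stand-in for DR_OVERRIDDEN state */
	int  db_level;		/* indirection level of the dbuf */
} toy_dr_t;

bool
toy_dirty_is_direct_write(const toy_dr_t *dr)
{
	if (dr == NULL)
		return (false);
	return (dr->db_level == 0 && dr->dr_overridden && !dr->dr_brtwrite);
}
```

Centralizing the check in one helper is the cleanup the commit describes: every caller now asks the same question the same way instead of open-coding it.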
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 9, 2024
Originally I was checking dr->dr_dbuf->db_level == 0 in dbuf_dirty_is_direct_write(). However, this can lead to a NULL pointer dereference if dr_dbuf is no longer set. I updated dbuf_dirty_is_direct_write() to also take a dmu_buf_impl_t so it can check db->db_level == 0. This failure was caught on the Fedora 37 CI running the enospc_rm test. Below is the stack trace. [ 9851.511608] BUG: kernel NULL pointer dereference, address: 0000000000000068 [ 9851.515922] #PF: supervisor read access in kernel mode [ 9851.519462] #PF: error_code(0x0000) - not-present page [ 9851.522992] PGD 0 P4D 0 [ 9851.525684] Oops: 0000 [openzfs#1] PREEMPT SMP PTI [ 9851.528878] CPU: 0 PID: 1272993 Comm: fio Tainted: P OE 6.5.12-100.fc37.x86_64 openzfs#1 [ 9851.535266] Hardware name: Amazon EC2 m5d.large/, BIOS 1.0 10/16/2017 [ 9851.539226] RIP: 0010:dbuf_dirty_is_direct_write+0xb/0x40 [zfs] [ 9851.543379] Code: 10 74 02 31 c0 5b c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 31 c0 48 85 ff 74 31 48 8b 57 20 <80> 7a 68 00 75 27 8b 87 64 01 00 00 85 c0 75 1b 83 bf 58 01 00 00 [ 9851.554719] RSP: 0018:ffff9b5b8305f8e8 EFLAGS: 00010286 [ 9851.558276] RAX: 0000000000000000 RBX: ffff9b5b8569b0b8 RCX: 0000000000000000 [ 9851.562481] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8f2e97de9e00 [ 9851.566672] RBP: 0000000000020000 R08: 0000000000000000 R09: ffff8f2f70e94000 [ 9851.570851] R10: 0000000000000001 R11: 0000000000000110 R12: ffff8f2f774ae4c0 [ 9851.575032] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 9851.579209] FS: 00007f57c5542240(0000) GS:ffff8f2faa800000(0000) knlGS:0000000000000000 [ 9851.585357] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9851.589064] CR2: 0000000000000068 CR3: 00000001f9a38001 CR4: 00000000007706f0 [ 9851.593256] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 9851.597440] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 9851.601618] PKRU: 55555554 [
9851.604341] Call Trace: [ 9851.606981] <TASK> [ 9851.609515] ? __die+0x23/0x70 [ 9851.612388] ? page_fault_oops+0x171/0x4e0 [ 9851.615571] ? exc_page_fault+0x77/0x170 [ 9851.618704] ? asm_exc_page_fault+0x26/0x30 [ 9851.621900] ? dbuf_dirty_is_direct_write+0xb/0x40 [zfs] [ 9851.625828] zfs_get_data+0x407/0x820 [zfs] [ 9851.629400] zil_lwb_commit+0x18d/0x3f0 [zfs] [ 9851.633026] zil_lwb_write_issue+0x92/0xbb0 [zfs] [ 9851.636758] zil_commit_waiter_timeout+0x1f3/0x580 [zfs] [ 9851.640696] zil_commit_waiter+0x1ff/0x3a0 [zfs] [ 9851.644402] zil_commit_impl+0x71/0xd0 [zfs] [ 9851.647998] zfs_write+0xb51/0xdc0 [zfs] [ 9851.651467] zpl_iter_write_buffered+0xc9/0x140 [zfs] [ 9851.655337] zpl_iter_write+0xc0/0x110 [zfs] [ 9851.658920] vfs_write+0x23e/0x420 [ 9851.661871] __x64_sys_pwrite64+0x98/0xd0 [ 9851.665013] do_syscall_64+0x5f/0x90 [ 9851.668027] ? ksys_fadvise64_64+0x57/0xa0 [ 9851.671212] ? syscall_exit_to_user_mode+0x2b/0x40 [ 9851.674594] ? do_syscall_64+0x6b/0x90 [ 9851.677655] ? syscall_exit_to_user_mode+0x2b/0x40 [ 9851.681051] ? do_syscall_64+0x6b/0x90 [ 9851.684128] ? exc_page_fault+0x77/0x170 [ 9851.687256] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 9851.690759] RIP: 0033:0x7f57c563c377 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
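The shape of this fix (pass the dbuf explicitly instead of reaching through a pointer that may already be NULL) can be sketched as follows. The `fix_*` names are illustrative, not the OpenZFS signatures.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct fix_db { int db_level; } fix_db_t;

typedef struct {
	fix_db_t *dr_dbuf;	/* may already be NULL (the crash above) */
	bool      dr_brtwrite;
} fix_dr_t;

/* After the fix: the caller supplies the dmu_buf_impl_t, so the
 * predicate never dereferences dr->dr_dbuf. */
bool
fix_dirty_is_direct_write(const fix_db_t *db, const fix_dr_t *dr)
{
	if (db == NULL || dr == NULL)
		return (false);
	return (db->db_level == 0 && !dr->dr_brtwrite);
}
```

The crash in the trace corresponds to the old form `dr->dr_dbuf->db_level`: with `dr_dbuf == NULL`, reading `db_level` is the faulting load at offset 0x68.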
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 9, 2024
There existed a race condition between a Direct I/O write completing and a sync operation being issued. This was due to the fact that a Direct I/O write would sleep waiting on previous TXG's to sync out the dirty records associated with a dbuf if there was an ARC buffer associated with the dbuf. This was necessary to safely destroy the ARC buffer in case a previous dirty record's dr_data pointed at the db_buf. The main issue with this approach is that a Direct I/O write holds the rangelock across the entire block, so when a sync on that same block was issued and tried to grab the rangelock as a reader, it would be blocked indefinitely because the Direct I/O write that was now sleeping held that same rangelock as a writer. This led to a complete deadlock. This commit fixes this issue and removes the wait in dmu_write_direct_done(). The way this is now handled is that the ARC buffer is destroyed, if there is one associated with the dbuf, before ever issuing the Direct I/O write. This implementation heavily borrows from the block cloning implementation. A new function dmu_buf_will_clone_or_dio() is called in both dmu_write_direct() and dmu_brt_clone() that does the following: 1. Undirties a dirty record for that db if there is one currently associated with the current TXG. 2. Destroys the ARC buffer if the previous dirty record's dr_data does not point at the dbuf's ARC buffer (db_buf). 3. Sets the dbuf's data pointers to NULL. 4. Redirties the dbuf using db_state = DB_NOFILL. As part of this commit, the dmu_write_direct_done() function was also cleaned up. Now dmu_sync_done() is called before undirtying the dbuf dirty record associated with a failed Direct I/O write. This is correct logic and how it always should have been. An additional benefit of these modifications is that there is no longer a stall in a Direct I/O write if the user is mixing buffered and O_DIRECT I/O together.
Also, it unifies the block cloning and Direct I/O write paths, as they both need to call dbuf_fix_old_data() before destroying the ARC buffer. This commit also includes general code cleanup: various dbuf stats were removed because they are no longer necessary, and unused functions were removed to make the Direct I/O code paths cleaner. Below is the race condition stack trace that was consistently observed in the CI runs for the dio_random test case and that prompted these changes: [ 7795.294473] sd 0:0:0:0: [sda] Synchronizing SCSI cache [ 9954.769075] INFO: task z_wr_int:1051869 blocked for more than 120 seconds. [ 9954.770512] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.772159] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.773848] task:z_wr_int state:D stack:0 pid:1051869 ppid:2 flags:0x80004080 [ 9954.775512] Call Trace: [ 9954.776406] __schedule+0x2d1/0x870 [ 9954.777386] ? free_one_page+0x204/0x530 [ 9954.778466] schedule+0x55/0xf0 [ 9954.779355] cv_wait_common+0x16d/0x280 [spl] [ 9954.780491] ? finish_wait+0x80/0x80 [ 9954.781450] dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs] [ 9954.782889] dmu_write_direct_done+0x90/0x3b0 [zfs] [ 9954.784255] zio_done+0x373/0x1d50 [zfs] [ 9954.785410] zio_execute+0xee/0x210 [zfs] [ 9954.786588] taskq_thread+0x205/0x3f0 [spl] [ 9954.787673] ? wake_up_q+0x60/0x60 [ 9954.788571] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [ 9954.790079] ? taskq_lowest_id+0xc0/0xc0 [spl] [ 9954.791199] kthread+0x134/0x150 [ 9954.792082] ? set_kthread_struct+0x50/0x50 [ 9954.793189] ret_from_fork+0x35/0x40 [ 9954.794108] INFO: task txg_sync:1051894 blocked for more than 120 seconds. [ 9954.795535] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.797103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.798669] task:txg_sync state:D stack:0 pid:1051894 ppid:2 flags:0x80004080 [ 9954.800267] Call Trace: [ 9954.801096] __schedule+0x2d1/0x870 [ 9954.801972] ? __wake_up_common+0x7a/0x190 [ 9954.802963] schedule+0x55/0xf0 [ 9954.803884] schedule_timeout+0x19f/0x320 [ 9954.804837] ? __next_timer_interrupt+0xf0/0xf0 [ 9954.805932] ? taskq_dispatch+0xab/0x280 [spl] [ 9954.806959] io_schedule_timeout+0x19/0x40 [ 9954.807989] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 9954.809110] ? finish_wait+0x80/0x80 [ 9954.810068] __cv_timedwait_io+0x15/0x20 [spl] [ 9954.811103] zio_wait+0x1ad/0x4f0 [zfs] [ 9954.812255] dsl_pool_sync+0xcb/0x6c0 [zfs] [ 9954.813442] ? spa_errlog_sync+0x2f0/0x3d0 [zfs] [ 9954.814648] spa_sync_iterate_to_convergence+0xcb/0x310 [zfs] [ 9954.816023] spa_sync+0x362/0x8f0 [zfs] [ 9954.817110] txg_sync_thread+0x27a/0x3b0 [zfs] [ 9954.818267] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] [ 9954.819510] ? spl_assert.constprop.0+0x20/0x20 [spl] [ 9954.820643] thread_generic_wrapper+0x63/0x90 [spl] [ 9954.821709] kthread+0x134/0x150 [ 9954.822590] ? set_kthread_struct+0x50/0x50 [ 9954.823584] ret_from_fork+0x35/0x40 [ 9954.824444] INFO: task fio:1055501 blocked for more than 120 seconds. [ 9954.825781] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.827315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.828871] task:fio state:D stack:0 pid:1055501 ppid:1055490 flags:0x00004080 [ 9954.830463] Call Trace: [ 9954.831280] __schedule+0x2d1/0x870 [ 9954.832159] ? dbuf_hold_copy+0xec/0x230 [zfs] [ 9954.833396] schedule+0x55/0xf0 [ 9954.834286] cv_wait_common+0x16d/0x280 [spl] [ 9954.835291] ? finish_wait+0x80/0x80 [ 9954.836235] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [ 9954.837543] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [ 9954.838838] zfs_get_data+0x566/0x810 [zfs] [ 9954.840034] zil_lwb_commit+0x194/0x3f0 [zfs] [ 9954.841154] zil_lwb_write_issue+0x68/0xb90 [zfs] [ 9954.842367] ? 
__list_add+0x12/0x30 [zfs] [ 9954.843496] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 9954.844665] ? zil_alloc_lwb+0x217/0x360 [zfs] [ 9954.845852] zil_commit_waiter_timeout+0x1f3/0x570 [zfs] [ 9954.847203] zil_commit_waiter+0x1d2/0x3b0 [zfs] [ 9954.848380] zil_commit_impl+0x6d/0xd0 [zfs] [ 9954.849550] zfs_fsync+0x66/0x90 [zfs] [ 9954.850640] zpl_fsync+0xe5/0x140 [zfs] [ 9954.851729] do_fsync+0x38/0x70 [ 9954.852585] __x64_sys_fsync+0x10/0x20 [ 9954.853486] do_syscall_64+0x5b/0x1b0 [ 9954.854416] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 9954.855466] RIP: 0033:0x7eff236bb057 [ 9954.856388] Code: Unable to access opcode bytes at RIP 0x7eff236bb02d. [ 9954.857651] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [ 9954.859141] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007eff236bb057 [ 9954.860496] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI: 0000000000000006 [ 9954.861945] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09: 0000000000000000 [ 9954.863327] R10: 0000000000056000 R11: 0000000000000293 R12: 0000000000000003 [ 9954.864765] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15: 000055e4d1f13ae8 [ 9954.866149] INFO: task fio:1055502 blocked for more than 120 seconds. [ 9954.867490] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.869029] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.870571] task:fio state:D stack:0 pid:1055502 ppid:1055490 flags:0x00004080 [ 9954.872162] Call Trace: [ 9954.872947] __schedule+0x2d1/0x870 [ 9954.873844] schedule+0x55/0xf0 [ 9954.874716] schedule_timeout+0x19f/0x320 [ 9954.875645] ? __next_timer_interrupt+0xf0/0xf0 [ 9954.876722] io_schedule_timeout+0x19/0x40 [ 9954.877677] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 9954.878822] ? 
finish_wait+0x80/0x80 [ 9954.879694] __cv_timedwait_io+0x15/0x20 [spl] [ 9954.880763] zio_wait+0x1ad/0x4f0 [zfs] [ 9954.881865] dmu_write_abd+0x174/0x1c0 [zfs] [ 9954.883074] dmu_write_uio_direct+0x79/0x100 [zfs] [ 9954.884285] dmu_write_uio_dnode+0xb2/0x320 [zfs] [ 9954.885507] dmu_write_uio_dbuf+0x47/0x60 [zfs] [ 9954.886687] zfs_write+0x581/0xe20 [zfs] [ 9954.887822] ? iov_iter_get_pages+0xe9/0x390 [ 9954.888862] ? trylock_page+0xd/0x20 [zfs] [ 9954.890005] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 9954.891217] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [ 9954.892391] zpl_iter_write_direct+0xd4/0x170 [zfs] [ 9954.893663] ? rrw_exit+0xc6/0x200 [zfs] [ 9954.894764] zpl_iter_write+0xd5/0x110 [zfs] [ 9954.895911] new_sync_write+0x112/0x160 [ 9954.896881] vfs_write+0xa5/0x1b0 [ 9954.897701] ksys_write+0x4f/0xb0 [ 9954.898569] do_syscall_64+0x5b/0x1b0 [ 9954.899417] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 9954.900515] RIP: 0033:0x7eff236baa47 [ 9954.901363] Code: Unable to access opcode bytes at RIP 0x7eff236baa1d. [ 9954.902673] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 9954.904099] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007eff236baa47 [ 9954.905535] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI: 0000000000000005 [ 9954.906902] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09: 0000000000000000 [ 9954.908339] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000e4000 [ 9954.909705] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15: 000055e4d1f13ae8 [ 9954.911129] INFO: task fio:1055504 blocked for more than 120 seconds. [ 9954.912381] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.913978] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.915434] task:fio state:D stack:0 pid:1055504 ppid:1055493 flags:0x00000080 [ 9954.917082] Call Trace: [ 9954.917773] __schedule+0x2d1/0x870 [ 9954.918648] ? 
zilog_dirty+0x4f/0xc0 [zfs] [ 9954.919831] schedule+0x55/0xf0 [ 9954.920717] cv_wait_common+0x16d/0x280 [spl] [ 9954.921704] ? finish_wait+0x80/0x80 [ 9954.922639] zfs_rangelock_enter_writer+0x46/0x1c0 [zfs] [ 9954.923940] zfs_rangelock_enter_impl+0x12a/0x1b0 [zfs] [ 9954.925306] zfs_write+0x703/0xe20 [zfs] [ 9954.926406] zpl_iter_write_buffered+0xb2/0x120 [zfs] [ 9954.927687] ? rrw_exit+0xc6/0x200 [zfs] [ 9954.928821] zpl_iter_write+0xbe/0x110 [zfs] [ 9954.930028] new_sync_write+0x112/0x160 [ 9954.930913] vfs_write+0xa5/0x1b0 [ 9954.931758] ksys_write+0x4f/0xb0 [ 9954.932666] do_syscall_64+0x5b/0x1b0 [ 9954.933544] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 9954.934689] RIP: 0033:0x7fcaee8f0a47 [ 9954.935551] Code: Unable to access opcode bytes at RIP 0x7fcaee8f0a1d. [ 9954.936893] RSP: 002b:00007fff56b2c240 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 9954.938327] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fcaee8f0a47 [ 9954.939777] RDX: 000000000001d000 RSI: 00007fca8300b010 RDI: 0000000000000006 [ 9954.941187] RBP: 00007fca8300b010 R08: 0000000000000000 R09: 0000000000000000 [ 9954.942655] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000001d000 [ 9954.944062] R13: 0000557a2006bac0 R14: 000000000001d000 R15: 0000557a2006bae8 [ 9954.945525] INFO: task fio:1055505 blocked for more than 120 seconds. [ 9954.946819] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.948466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.949959] task:fio state:D stack:0 pid:1055505 ppid:1055493 flags:0x00004080 [ 9954.951653] Call Trace: [ 9954.952417] __schedule+0x2d1/0x870 [ 9954.953393] ? finish_wait+0x3e/0x80 [ 9954.954315] schedule+0x55/0xf0 [ 9954.955212] cv_wait_common+0x16d/0x280 [spl] [ 9954.956211] ? 
finish_wait+0x80/0x80 [ 9954.957159] zil_commit_waiter+0xfa/0x3b0 [zfs] [ 9954.958343] zil_commit_impl+0x6d/0xd0 [zfs] [ 9954.959524] zfs_fsync+0x66/0x90 [zfs] [ 9954.960626] zpl_fsync+0xe5/0x140 [zfs] [ 9954.961763] do_fsync+0x38/0x70 [ 9954.962638] __x64_sys_fsync+0x10/0x20 [ 9954.963520] do_syscall_64+0x5b/0x1b0 [ 9954.964470] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 9954.965567] RIP: 0033:0x7fcaee8f1057 [ 9954.966490] Code: Unable to access opcode bytes at RIP 0x7fcaee8f102d. [ 9954.967752] RSP: 002b:00007fff56b2c230 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [ 9954.969260] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fcaee8f1057 [ 9954.970628] RDX: 0000000000000000 RSI: 0000557a2006bac0 RDI: 0000000000000005 [ 9954.972092] RBP: 00007fca84152a18 R08: 0000000000000000 R09: 0000000000000000 [ 9954.973484] R10: 0000000000035000 R11: 0000000000000293 R12: 0000000000000003 [ 9954.974958] R13: 0000557a2006bac0 R14: 0000000000000000 R15: 0000557a2006bae8 [10077.648150] INFO: task z_wr_int:1051869 blocked for more than 120 seconds. [10077.649541] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [10077.651116] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10077.652782] task:z_wr_int state:D stack:0 pid:1051869 ppid:2 flags:0x80004080 [10077.654420] Call Trace: [10077.655267] __schedule+0x2d1/0x870 [10077.656179] ? free_one_page+0x204/0x530 [10077.657192] schedule+0x55/0xf0 [10077.658004] cv_wait_common+0x16d/0x280 [spl] [10077.659018] ? finish_wait+0x80/0x80 [10077.660013] dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs] [10077.661396] dmu_write_direct_done+0x90/0x3b0 [zfs] [10077.662617] zio_done+0x373/0x1d50 [zfs] [10077.663783] zio_execute+0xee/0x210 [zfs] [10077.664921] taskq_thread+0x205/0x3f0 [spl] [10077.665982] ? wake_up_q+0x60/0x60 [10077.666842] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [10077.668295] ? taskq_lowest_id+0xc0/0xc0 [spl] [10077.669360] kthread+0x134/0x150 [10077.670191] ? 
set_kthread_struct+0x50/0x50 [10077.671209] ret_from_fork+0x35/0x40 [10077.672076] INFO: task txg_sync:1051894 blocked for more than 120 seconds. [10077.673467] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [10077.675112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10077.676612] task:txg_sync state:D stack:0 pid:1051894 ppid:2 flags:0x80004080 [10077.678288] Call Trace: [10077.679024] __schedule+0x2d1/0x870 [10077.679948] ? __wake_up_common+0x7a/0x190 [10077.681042] schedule+0x55/0xf0 [10077.681899] schedule_timeout+0x19f/0x320 [10077.682951] ? __next_timer_interrupt+0xf0/0xf0 [10077.684005] ? taskq_dispatch+0xab/0x280 [spl] [10077.685085] io_schedule_timeout+0x19/0x40 [10077.686080] __cv_timedwait_common+0x19e/0x2c0 [spl] [10077.687227] ? finish_wait+0x80/0x80 [10077.688123] __cv_timedwait_io+0x15/0x20 [spl] [10077.689206] zio_wait+0x1ad/0x4f0 [zfs] [10077.690300] dsl_pool_sync+0xcb/0x6c0 [zfs] [10077.691435] ? spa_errlog_sync+0x2f0/0x3d0 [zfs] [10077.692636] spa_sync_iterate_to_convergence+0xcb/0x310 [zfs] [10077.693997] spa_sync+0x362/0x8f0 [zfs] [10077.695112] txg_sync_thread+0x27a/0x3b0 [zfs] [10077.696239] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] [10077.697512] ? spl_assert.constprop.0+0x20/0x20 [spl] [10077.698639] thread_generic_wrapper+0x63/0x90 [spl] [10077.699687] kthread+0x134/0x150 [10077.700567] ? set_kthread_struct+0x50/0x50 [10077.701502] ret_from_fork+0x35/0x40 [10077.702430] INFO: task fio:1055501 blocked for more than 120 seconds. [10077.703697] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [10077.705309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10077.706780] task:fio state:D stack:0 pid:1055501 ppid:1055490 flags:0x00004080 [10077.708479] Call Trace: [10077.709231] __schedule+0x2d1/0x870 [10077.710190] ? dbuf_hold_copy+0xec/0x230 [zfs] [10077.711368] schedule+0x55/0xf0 [10077.712286] cv_wait_common+0x16d/0x280 [spl] [10077.713316] ? 
finish_wait+0x80/0x80 [10077.714262] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [10077.715566] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [10077.716878] zfs_get_data+0x566/0x810 [zfs] [10077.718032] zil_lwb_commit+0x194/0x3f0 [zfs] [10077.719234] zil_lwb_write_issue+0x68/0xb90 [zfs] [10077.720413] ? __list_add+0x12/0x30 [zfs] [10077.721525] ? __raw_spin_unlock+0x5/0x10 [zfs] [10077.722708] ? zil_alloc_lwb+0x217/0x360 [zfs] [10077.723931] zil_commit_waiter_timeout+0x1f3/0x570 [zfs] [10077.725273] zil_commit_waiter+0x1d2/0x3b0 [zfs] [10077.726438] zil_commit_impl+0x6d/0xd0 [zfs] [10077.727586] zfs_fsync+0x66/0x90 [zfs] [10077.728675] zpl_fsync+0xe5/0x140 [zfs] [10077.729755] do_fsync+0x38/0x70 [10077.730607] __x64_sys_fsync+0x10/0x20 [10077.731482] do_syscall_64+0x5b/0x1b0 [10077.732415] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [10077.733487] RIP: 0033:0x7eff236bb057 [10077.734399] Code: Unable to access opcode bytes at RIP 0x7eff236bb02d. [10077.735657] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [10077.737163] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007eff236bb057 [10077.738526] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI: 0000000000000006 [10077.739966] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09: 0000000000000000 [10077.741336] R10: 0000000000056000 R11: 0000000000000293 R12: 0000000000000003 [10077.742773] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15: 000055e4d1f13ae8 [10077.744168] INFO: task fio:1055502 blocked for more than 120 seconds. [10077.745505] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [10077.747073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10077.748642] task:fio state:D stack:0 pid:1055502 ppid:1055490 flags:0x00004080 [10077.750233] Call Trace: [10077.751011] __schedule+0x2d1/0x870 [10077.751915] schedule+0x55/0xf0 [10077.752811] schedule_timeout+0x19f/0x320 [10077.753762] ? 
__next_timer_interrupt+0xf0/0xf0 [10077.754824] io_schedule_timeout+0x19/0x40 [10077.755782] __cv_timedwait_common+0x19e/0x2c0 [spl] [10077.756922] ? finish_wait+0x80/0x80 [10077.757788] __cv_timedwait_io+0x15/0x20 [spl] [10077.758845] zio_wait+0x1ad/0x4f0 [zfs] [10077.759941] dmu_write_abd+0x174/0x1c0 [zfs] [10077.761144] dmu_write_uio_direct+0x79/0x100 [zfs] [10077.762327] dmu_write_uio_dnode+0xb2/0x320 [zfs] [10077.763523] dmu_write_uio_dbuf+0x47/0x60 [zfs] [10077.764749] zfs_write+0x581/0xe20 [zfs] [10077.765825] ? iov_iter_get_pages+0xe9/0x390 [10077.766842] ? trylock_page+0xd/0x20 [zfs] [10077.767956] ? __raw_spin_unlock+0x5/0x10 [zfs] [10077.769189] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [10077.770343] zpl_iter_write_direct+0xd4/0x170 [zfs] [10077.771570] ? rrw_exit+0xc6/0x200 [zfs] [10077.772674] zpl_iter_write+0xd5/0x110 [zfs] [10077.773834] new_sync_write+0x112/0x160 [10077.774805] vfs_write+0xa5/0x1b0 [10077.775634] ksys_write+0x4f/0xb0 [10077.776526] do_syscall_64+0x5b/0x1b0 [10077.777386] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [10077.778488] RIP: 0033:0x7eff236baa47 [10077.779339] Code: Unable to access opcode bytes at RIP 0x7eff236baa1d. [10077.780655] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [10077.782056] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007eff236baa47 [10077.783507] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI: 0000000000000005 [10077.784890] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09: 0000000000000000 [10077.786303] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000e4000 [10077.787637] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15: 000055e4d1f13ae8 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue on Sep 10, 2024
Varada (varada.kari@gmail.com) pointed out an issue with O_DIRECT reads with the following test case: dd if=/dev/urandom of=/local_zpool/file2 bs=512 count=79 truncate -s 40382 /local_zpool/file2 zpool export local_zpool zpool import -d ~/ local_zpool dd if=/local_zpool/file2 of=/dev/null bs=1M iflag=direct That led to following panic happening: [ 307.769267] VERIFY(IS_P2ALIGNED(size, sizeof (uint32_t))) failed [ 307.782997] PANIC at zfs_fletcher.c:870:abd_fletcher_4_iter() [ 307.788743] Showing stack for process 9665 [ 307.792834] CPU: 47 PID: 9665 Comm: z_rd_int_5 Kdump: loaded Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [ 307.804298] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 10/08/2020 [ 307.811682] Call Trace: [ 307.814131] dump_stack+0x41/0x60 [ 307.817449] spl_panic+0xd0/0xe8 [spl] [ 307.821210] ? irq_work_queue+0x9/0x20 [ 307.824961] ? wake_up_klogd.part.30+0x30/0x40 [ 307.829407] ? vprintk_emit+0x125/0x250 [ 307.833246] ? printk+0x58/0x6f [ 307.836391] spl_assert.constprop.1+0x16/0x20 [zfs] [ 307.841438] abd_fletcher_4_iter+0x6c/0x101 [zfs] [ 307.846343] ? abd_fletcher_4_simd2scalar+0x83/0x83 [zfs] [ 307.851922] abd_iterate_func+0xb1/0x170 [zfs] [ 307.856533] abd_fletcher_4_impl+0x3f/0xa0 [zfs] [ 307.861334] abd_fletcher_4_native+0x52/0x70 [zfs] [ 307.866302] ? enqueue_entity+0xf1/0x6e0 [ 307.870226] ? select_idle_sibling+0x23/0x700 [ 307.874587] ? enqueue_task_fair+0x94/0x710 [ 307.878771] ? select_task_rq_fair+0x351/0x990 [ 307.883208] zio_checksum_error_impl+0xff/0x5f0 [zfs] [ 307.888435] ? abd_fletcher_4_impl+0xa0/0xa0 [zfs] [ 307.893401] ? spl_kmem_alloc_impl+0xce/0xf0 [spl] [ 307.898203] ? __wake_up_common+0x7a/0x190 [ 307.902300] ? __switch_to_asm+0x41/0x70 [ 307.906220] ? __switch_to_asm+0x35/0x70 [ 307.910145] ? __switch_to_asm+0x41/0x70 [ 307.914061] ? __switch_to_asm+0x35/0x70 [ 307.917980] ? __switch_to_asm+0x41/0x70 [ 307.921903] ? __switch_to_asm+0x35/0x70 [ 307.925821] ? 
__switch_to_asm+0x35/0x70 [ 307.929739] ? __switch_to_asm+0x41/0x70 [ 307.933658] ? __switch_to_asm+0x35/0x70 [ 307.937582] zio_checksum_error+0x47/0xc0 [zfs] [ 307.942288] raidz_checksum_verify+0x3a/0x70 [zfs] [ 307.947257] vdev_raidz_io_done+0x4b/0x160 [zfs] [ 307.952049] zio_vdev_io_done+0x7f/0x200 [zfs] [ 307.956669] zio_execute+0xee/0x210 [zfs] [ 307.960855] taskq_thread+0x203/0x420 [spl] [ 307.965048] ? wake_up_q+0x70/0x70 [ 307.968455] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [ 307.974807] ? taskq_lowest_id+0xc0/0xc0 [spl] [ 307.979260] kthread+0x10a/0x120 [ 307.982485] ? set_kthread_struct+0x40/0x40 [ 307.986670] ret_from_fork+0x35/0x40 The reason this was occurring was that the zpool export forced the initial O_DIRECT read to go down to disk. In this case the request was still valid, as bs=1M is page-size aligned; however, the file length was not. So when issuing the O_DIRECT read, even after calling make_abd_for_dbuf() we had an extra page allocated in the original ABD along with the linear ABD attached at the end of the gang ABD from make_abd_for_dbuf(). This is an issue because our expectation with reads is that the block sizes being read are page aligned. Our check only validates the request, not the actual size of the data we may read, such as the entire file. To remedy this, I updated zfs_read() to read as much as it can using O_DIRECT, based on whether the length is page-size aligned. Any additional bytes that are requested are then read into the ARC. This still keeps our semantics that I/O requests must be page-size aligned. There is a drawback when only a single block is being read: in this case the block will be read twice, once using O_DIRECT and then again buffered to fill in the remaining data for the user's request. However, this should not be a big issue most of the time.
In the normal case a user may ask for a lot of data from a file, and only the stray bytes at the end of the file will have to be read using the ARC. To make sure this case is completely covered, I added a new ZTS test case, dio_unaligned_filesize, to test this out. The main thing in that test case is that the first O_DIRECT read will issue three reads: two being O_DIRECT and the third being buffered for the remaining requested bytes. As part of this commit, I also updated stride_dd to take an additional parameter, -e, which says to read the entire input file and ignore the count (-c) option. We need to use stride_dd on FreeBSD because dd does not make sure the buffer is page aligned. This update to stride_dd allows us to use it to test this case in dio_unaligned_filesize on both Linux and FreeBSD. While this may not be the most elegant solution, it sticks with the semantics and still reads all the data the user requested. I am fine with revisiting this, and maybe we just return a short read? Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue on Sep 10, 2024
We were using the generic Linux calls to make sure the page cache was cleaned out before issuing any Direct I/O reads or writes. However, this only matters when the file region being written to or read from using O_DIRECT was mmap'ed. One of the stipulations of O_DIRECT is that it is redirected through the ARC when the file range is mmap'ed. Because of this, it did not make sense to try to invalidate the page cache if we never intended O_DIRECT to work with mmap'ed regions. Also, calls into the generic Linux write paths would often lead to lockups, as the page lock is dropped in zfs_putpage(). See the stack dump below. To prevent this, we no longer use the generic Linux direct I/O wrappers or try to flush out the page cache. Instead, if we find the file range has been mmap'ed in since the initial check in zfs_setup_direct(), we now handle that directly in zfs_read() and zfs_write(). In most cases zfs_setup_direct() will prevent O_DIRECT to mmap'ed regions of the file that have been page faulted in, but if that happens while we are issuing the direct I/O request, the normal ZFS paths will be taken to account for it. It is highly suggested not to mmap a region of a file and then write or read directly to that file. In general, that is a rather insane thing to do... However, we try our best to still maintain consistency with the ARC. Also, before making this decision I explored whether we could just add a rangelock in zfs_fillpage(), but we cannot. The reason is that by the time the page is in zfs_readpage_common() it has already been locked by the kernel. So if we try to grab the rangelock anywhere in that path, we can get stuck if another thread is issuing writes to the mmap'ed file region. The reason is that update_pages() holds the rangelock and then tries to lock the page. In that case zfs_fillpage() holds the page lock but is stuck waiting on the rangelock while still holding the page lock.
Deadlock is unavoidable in this case. [260136.244332] INFO: task fio:3791107 blocked for more than 120 seconds. [260136.250867] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260136.258693] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260136.266607] task:fio state:D stack: 0 pid:3791107 ppid:3790841 flags:0x00004080 [260136.275306] Call Trace: [260136.277845] __schedule+0x2d1/0x830 [260136.281432] schedule+0x35/0xa0 [260136.284665] io_schedule+0x12/0x40 [260136.288157] wait_on_page_bit+0x123/0x220 [260136.292258] ? xas_load+0x8/0x80 [260136.295577] ? file_fdatawait_range+0x20/0x20 [260136.300024] filemap_page_mkwrite+0x9b/0xb0 [260136.304295] do_page_mkwrite+0x53/0x90 [260136.308135] ? vm_normal_page+0x1a/0xc0 [260136.312062] do_wp_page+0x298/0x350 [260136.315640] __handle_mm_fault+0x44f/0x6c0 [260136.319826] ? __switch_to_asm+0x41/0x70 [260136.323839] handle_mm_fault+0xc1/0x1e0 [260136.327766] do_user_addr_fault+0x1b5/0x440 [260136.332038] do_page_fault+0x37/0x130 [260136.335792] ? page_fault+0x8/0x30 [260136.339284] page_fault+0x1e/0x30 [260136.342689] RIP: 0033:0x7f6deee7f1b4 [260136.346361] Code: Unable to access opcode bytes at RIP 0x7f6deee7f18a. [260136.352977] RSP: 002b:00007fffe41b6538 EFLAGS: 00010202 [260136.358288] RAX: 00007f6d83049000 RBX: 0000556b63614ec0 RCX: 00007f6d83148fe0 [260136.365508] RDX: 00000000000acfe0 RSI: 00007f6d84e9c030 RDI: 00007f6d8309bfa0 [260136.372730] RBP: 00007f6d84f4a000 R08: ffffffffffffffe0 R09: 0000000000000000 [260136.379946] R10: 00007f6d84f8e810 R11: 00007f6d83049000 R12: 0000000000000001 [260136.387167] R13: 0000556b63614ec0 R14: 0000000000100000 R15: 0000556b63614ee8 [260136.394387] INFO: task fio:3791108 blocked for more than 120 seconds. [260136.400911] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260136.408739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[260136.416651] task:fio state:D stack: 0 pid:3791108 ppid:3790835 flags:0x00004080 [260136.425343] Call Trace: [260136.427883] __schedule+0x2d1/0x830 [260136.431463] ? cv_wait_common+0x12d/0x240 [spl] [260136.436091] schedule+0x35/0xa0 [260136.439321] io_schedule+0x12/0x40 [260136.442814] __lock_page+0x12d/0x230 [260136.446483] ? file_fdatawait_range+0x20/0x20 [260136.450929] zfs_putpage+0x148/0x590 [zfs] [260136.455322] ? rmap_walk_file+0x116/0x290 [260136.459421] ? __mod_memcg_lruvec_state+0x5d/0x160 [260136.464300] zpl_putpage+0x67/0xd0 [zfs] [260136.468495] write_cache_pages+0x197/0x420 [260136.472679] ? zpl_readpage_filler+0x10/0x10 [zfs] [260136.477732] zpl_writepages+0x119/0x130 [zfs] [260136.482352] do_writepages+0xc2/0x1c0 [260136.486103] ? flush_tlb_func_common.constprop.9+0x125/0x220 [260136.491850] __filemap_fdatawrite_range+0xc7/0x100 [260136.496732] filemap_write_and_wait_range+0x30/0x80 [260136.501695] generic_file_direct_write+0x120/0x160 [260136.506575] ? rrw_exit+0xb0/0x1c0 [zfs] [260136.510779] zpl_iter_write+0xdd/0x160 [zfs] [260136.515323] new_sync_write+0x112/0x160 [260136.519255] vfs_write+0xa5/0x1a0 [260136.522662] ksys_write+0x4f/0xb0 [260136.526067] do_syscall_64+0x5b/0x1a0 [260136.529818] entry_SYSCALL_64_after_hwframe+0x65/0xca [260136.534959] RIP: 0033:0x7f9d192c7a17 [260136.538625] Code: Unable to access opcode bytes at RIP 0x7f9d192c79ed. [260136.545236] RSP: 002b:00007ffc8e4ba270 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [260136.552889] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f9d192c7a17 [260136.560108] RDX: 0000000000100000 RSI: 00007f9caea03000 RDI: 0000000000000005 [260136.567329] RBP: 00007f9caea03000 R08: 0000000000000000 R09: 0000000000000000 [260136.574548] R10: 00005558e8975680 R11: 0000000000000293 R12: 0000000000100000 [260136.581767] R13: 00005558e8985ec0 R14: 0000000000100000 R15: 00005558e8985ee8 [260136.588989] INFO: task fio:3791109 blocked for more than 120 seconds. 
[260136.595513] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260136.603337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260136.611250] task:fio state:D stack: 0 pid:3791109 ppid:3790838 flags:0x00004080 [260136.619943] Call Trace: [260136.622483] __schedule+0x2d1/0x830 [260136.626064] ? zfs_znode_held+0xe6/0x140 [zfs] [260136.630777] schedule+0x35/0xa0 [260136.634009] cv_wait_common+0x153/0x240 [spl] [260136.638466] ? finish_wait+0x80/0x80 [260136.642129] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [260136.647712] zfs_rangelock_enter_impl+0xbf/0x170 [zfs] [260136.653121] zfs_get_data+0x113/0x770 [zfs] [260136.657567] zil_lwb_commit+0x537/0x780 [zfs] [260136.662187] zil_process_commit_list+0x14c/0x460 [zfs] [260136.667585] zil_commit_writer+0xeb/0x160 [zfs] [260136.672376] zil_commit_impl+0x5d/0xa0 [zfs] [260136.676910] zfs_putpage+0x516/0x590 [zfs] [260136.681279] zpl_putpage+0x67/0xd0 [zfs] [260136.685467] write_cache_pages+0x197/0x420 [260136.689649] ? zpl_readpage_filler+0x10/0x10 [zfs] [260136.694705] zpl_writepages+0x119/0x130 [zfs] [260136.699322] do_writepages+0xc2/0x1c0 [260136.703076] __filemap_fdatawrite_range+0xc7/0x100 [260136.707952] filemap_write_and_wait_range+0x30/0x80 [260136.712920] zpl_iter_read_direct+0x86/0x1b0 [zfs] [260136.717972] ? rrw_exit+0xb0/0x1c0 [zfs] [260136.722174] zpl_iter_read+0x90/0xb0 [zfs] [260136.726536] new_sync_read+0x10f/0x150 [260136.730376] vfs_read+0x91/0x140 [260136.733693] ksys_read+0x4f/0xb0 [260136.737012] do_syscall_64+0x5b/0x1a0 [260136.740764] entry_SYSCALL_64_after_hwframe+0x65/0xca [260136.745906] RIP: 0033:0x7f1bd4687ab4 [260136.749574] Code: Unable to access opcode bytes at RIP 0x7f1bd4687a8a. 
[260136.756181] RSP: 002b:00007fff63f65170 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [260136.763834] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f1bd4687ab4 [260136.771056] RDX: 0000000000100000 RSI: 00007f1b69dc3000 RDI: 0000000000000005 [260136.778274] RBP: 00007f1b69dc3000 R08: 0000000000000000 R09: 0000000000000000 [260136.785494] R10: 000000008fd0ea42 R11: 0000000000000246 R12: 0000000000100000 [260136.792714] R13: 000055ca4b405ec0 R14: 0000000000100000 R15: 000055ca4b405ee8 [260259.123003] INFO: task kworker/u128:0:3589938 blocked for more than 120 seconds. [260259.130487] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260259.138313] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260259.146224] task:kworker/u128:0 state:D stack: 0 pid:3589938 ppid: 2 flags:0x80004080 [260259.154832] Workqueue: writeback wb_workfn (flush-zfs-540) [260259.160411] Call Trace: [260259.162950] __schedule+0x2d1/0x830 [260259.166531] schedule+0x35/0xa0 [260259.169765] io_schedule+0x12/0x40 [260259.173257] __lock_page+0x12d/0x230 [260259.176921] ? file_fdatawait_range+0x20/0x20 [260259.181368] write_cache_pages+0x1f2/0x420 [260259.185554] ? zpl_readpage_filler+0x10/0x10 [zfs] [260259.190633] zpl_writepages+0x98/0x130 [zfs] [260259.195183] do_writepages+0xc2/0x1c0 [260259.198935] __writeback_single_inode+0x39/0x2f0 [260259.203640] writeback_sb_inodes+0x1e6/0x450 [260259.208002] __writeback_inodes_wb+0x5f/0xc0 [260259.212359] wb_writeback+0x247/0x2e0 [260259.216114] ? get_nr_inodes+0x35/0x50 [260259.219953] wb_workfn+0x37c/0x4d0 [260259.223443] ? __switch_to_asm+0x35/0x70 [260259.227456] ? __switch_to_asm+0x41/0x70 [260259.231469] ? __switch_to_asm+0x35/0x70 [260259.235481] ? __switch_to_asm+0x41/0x70 [260259.239495] ? __switch_to_asm+0x35/0x70 [260259.243505] ? __switch_to_asm+0x41/0x70 [260259.247518] ? __switch_to_asm+0x35/0x70 [260259.251533] ? 
__switch_to_asm+0x41/0x70 [260259.255545] process_one_work+0x1a7/0x360 [260259.259645] worker_thread+0x30/0x390 [260259.263396] ? create_worker+0x1a0/0x1a0 [260259.267409] kthread+0x10a/0x120 [260259.270730] ? set_kthread_struct+0x40/0x40 [260259.275003] ret_from_fork+0x35/0x40 [260259.278712] INFO: task fio:3791107 blocked for more than 120 seconds. [260259.285240] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260259.293064] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260259.300976] task:fio state:D stack: 0 pid:3791107 ppid:3790841 flags:0x00004080 [260259.309668] Call Trace: [260259.312210] __schedule+0x2d1/0x830 [260259.315787] schedule+0x35/0xa0 [260259.319020] io_schedule+0x12/0x40 [260259.322511] wait_on_page_bit+0x123/0x220 [260259.326611] ? xas_load+0x8/0x80 [260259.329930] ? file_fdatawait_range+0x20/0x20 [260259.334376] filemap_page_mkwrite+0x9b/0xb0 [260259.338650] do_page_mkwrite+0x53/0x90 [260259.342489] ? vm_normal_page+0x1a/0xc0 [260259.346415] do_wp_page+0x298/0x350 [260259.349994] __handle_mm_fault+0x44f/0x6c0 [260259.354181] ? __switch_to_asm+0x41/0x70 [260259.358193] handle_mm_fault+0xc1/0x1e0 [260259.362117] do_user_addr_fault+0x1b5/0x440 [260259.366391] do_page_fault+0x37/0x130 [260259.370145] ? page_fault+0x8/0x30 [260259.373639] page_fault+0x1e/0x30 [260259.377043] RIP: 0033:0x7f6deee7f1b4 [260259.380714] Code: Unable to access opcode bytes at RIP 0x7f6deee7f18a. [260259.387323] RSP: 002b:00007fffe41b6538 EFLAGS: 00010202 [260259.392633] RAX: 00007f6d83049000 RBX: 0000556b63614ec0 RCX: 00007f6d83148fe0 [260259.399853] RDX: 00000000000acfe0 RSI: 00007f6d84e9c030 RDI: 00007f6d8309bfa0 [260259.407074] RBP: 00007f6d84f4a000 R08: ffffffffffffffe0 R09: 0000000000000000 [260259.414291] R10: 00007f6d84f8e810 R11: 00007f6d83049000 R12: 0000000000000001 [260259.421512] R13: 0000556b63614ec0 R14: 0000000000100000 R15: 0000556b63614ee8 [260259.428733] INFO: task fio:3791108 blocked for more than 120 seconds. 
[260259.435258] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260259.443085] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260259.450997] task:fio state:D stack: 0 pid:3791108 ppid:3790835 flags:0x00004080 [260259.459689] Call Trace: [260259.462228] __schedule+0x2d1/0x830 [260259.465808] ? cv_wait_common+0x12d/0x240 [spl] [260259.470435] schedule+0x35/0xa0 [260259.473669] io_schedule+0x12/0x40 [260259.477161] __lock_page+0x12d/0x230 [260259.480828] ? file_fdatawait_range+0x20/0x20 [260259.485274] zfs_putpage+0x148/0x590 [zfs] [260259.489640] ? rmap_walk_file+0x116/0x290 [260259.493742] ? __mod_memcg_lruvec_state+0x5d/0x160 [260259.498619] zpl_putpage+0x67/0xd0 [zfs] [260259.502813] write_cache_pages+0x197/0x420 [260259.506998] ? zpl_readpage_filler+0x10/0x10 [zfs] [260259.512054] zpl_writepages+0x119/0x130 [zfs] [260259.516672] do_writepages+0xc2/0x1c0 [260259.520423] ? flush_tlb_func_common.constprop.9+0x125/0x220 [260259.526170] __filemap_fdatawrite_range+0xc7/0x100 [260259.531050] filemap_write_and_wait_range+0x30/0x80 [260259.536016] generic_file_direct_write+0x120/0x160 [260259.540896] ? rrw_exit+0xb0/0x1c0 [zfs] [260259.545099] zpl_iter_write+0xdd/0x160 [zfs] [260259.549639] new_sync_write+0x112/0x160 [260259.553566] vfs_write+0xa5/0x1a0 [260259.556971] ksys_write+0x4f/0xb0 [260259.560379] do_syscall_64+0x5b/0x1a0 [260259.564131] entry_SYSCALL_64_after_hwframe+0x65/0xca [260259.569269] RIP: 0033:0x7f9d192c7a17 [260259.572935] Code: Unable to access opcode bytes at RIP 0x7f9d192c79ed. 
[260259.579549] RSP: 002b:00007ffc8e4ba270 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [260259.587200] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f9d192c7a17 [260259.594419] RDX: 0000000000100000 RSI: 00007f9caea03000 RDI: 0000000000000005 [260259.601639] RBP: 00007f9caea03000 R08: 0000000000000000 R09: 0000000000000000 [260259.608859] R10: 00005558e8975680 R11: 0000000000000293 R12: 0000000000100000 [260259.616078] R13: 00005558e8985ec0 R14: 0000000000100000 R15: 00005558e8985ee8 [260259.623298] INFO: task fio:3791109 blocked for more than 120 seconds. [260259.629827] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260259.637650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260259.645564] task:fio state:D stack: 0 pid:3791109 ppid:3790838 flags:0x00004080 [260259.654254] Call Trace: [260259.656794] __schedule+0x2d1/0x830 [260259.660373] ? zfs_znode_held+0xe6/0x140 [zfs] [260259.665081] schedule+0x35/0xa0 [260259.668313] cv_wait_common+0x153/0x240 [spl] [260259.672768] ? finish_wait+0x80/0x80 [260259.676441] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [260259.682026] zfs_rangelock_enter_impl+0xbf/0x170 [zfs] [260259.687432] zfs_get_data+0x113/0x770 [zfs] [260259.691876] zil_lwb_commit+0x537/0x780 [zfs] [260259.696497] zil_process_commit_list+0x14c/0x460 [zfs] [260259.701895] zil_commit_writer+0xeb/0x160 [zfs] [260259.706689] zil_commit_impl+0x5d/0xa0 [zfs] [260259.711228] zfs_putpage+0x516/0x590 [zfs] [260259.715589] zpl_putpage+0x67/0xd0 [zfs] [260259.719775] write_cache_pages+0x197/0x420 [260259.723959] ? zpl_readpage_filler+0x10/0x10 [zfs] [260259.729013] zpl_writepages+0x119/0x130 [zfs] [260259.733632] do_writepages+0xc2/0x1c0 [260259.737384] __filemap_fdatawrite_range+0xc7/0x100 [260259.742264] filemap_write_and_wait_range+0x30/0x80 [260259.747229] zpl_iter_read_direct+0x86/0x1b0 [zfs] [260259.752286] ? 
rrw_exit+0xb0/0x1c0 [zfs] [260259.756487] zpl_iter_read+0x90/0xb0 [zfs] [260259.760855] new_sync_read+0x10f/0x150 [260259.764696] vfs_read+0x91/0x140 [260259.768013] ksys_read+0x4f/0xb0 [260259.771332] do_syscall_64+0x5b/0x1a0 [260259.775087] entry_SYSCALL_64_after_hwframe+0x65/0xca [260259.780225] RIP: 0033:0x7f1bd4687ab4 [260259.783893] Code: Unable to access opcode bytes at RIP 0x7f1bd4687a8a. [260259.790503] RSP: 002b:00007fff63f65170 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [260259.798157] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f1bd4687ab4 [260259.805377] RDX: 0000000000100000 RSI: 00007f1b69dc3000 RDI: 0000000000000005 [260259.812592] RBP: 00007f1b69dc3000 R08: 0000000000000000 R09: 0000000000000000 [260259.819814] R10: 000000008fd0ea42 R11: 0000000000000246 R12: 0000000000100000 [260259.827032] R13: 000055ca4b405ec0 R14: 0000000000100000 R15: 000055ca4b405ee8 [260382.001731] INFO: task kworker/u128:0:3589938 blocked for more than 120 seconds. [260382.009227] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260382.017053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260382.024963] task:kworker/u128:0 state:D stack: 0 pid:3589938 ppid: 2 flags:0x80004080 [260382.033568] Workqueue: writeback wb_workfn (flush-zfs-540) [260382.039141] Call Trace: [260382.041683] __schedule+0x2d1/0x830 [260382.045271] schedule+0x35/0xa0 [260382.048503] io_schedule+0x12/0x40 [260382.051994] __lock_page+0x12d/0x230 [260382.055662] ? file_fdatawait_range+0x20/0x20 [260382.060107] write_cache_pages+0x1f2/0x420 [260382.064293] ? zpl_readpage_filler+0x10/0x10 [zfs] [260382.069379] zpl_writepages+0x98/0x130 [zfs] [260382.073919] do_writepages+0xc2/0x1c0 [260382.077672] __writeback_single_inode+0x39/0x2f0 [260382.082379] writeback_sb_inodes+0x1e6/0x450 [260382.086738] __writeback_inodes_wb+0x5f/0xc0 [260382.091097] wb_writeback+0x247/0x2e0 [260382.094850] ? 
get_nr_inodes+0x35/0x50 [260382.098689] wb_workfn+0x37c/0x4d0 [260382.102181] ? __switch_to_asm+0x35/0x70 [260382.106194] ? __switch_to_asm+0x41/0x70 [260382.110207] ? __switch_to_asm+0x35/0x70 [260382.114221] ? __switch_to_asm+0x41/0x70 [260382.118231] ? __switch_to_asm+0x35/0x70 [260382.122244] ? __switch_to_asm+0x41/0x70 [260382.126256] ? __switch_to_asm+0x35/0x70 [260382.130273] ? __switch_to_asm+0x41/0x70 [260382.134284] process_one_work+0x1a7/0x360 [260382.138384] worker_thread+0x30/0x390 [260382.142136] ? create_worker+0x1a0/0x1a0 [260382.146150] kthread+0x10a/0x120 [260382.149469] ? set_kthread_struct+0x40/0x40 [260382.153741] ret_from_fork+0x35/0x40 [260382.157448] INFO: task fio:3791107 blocked for more than 120 seconds. [260382.163977] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260382.171802] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260382.179715] task:fio state:D stack: 0 pid:3791107 ppid:3790841 flags:0x00004080 [260382.188409] Call Trace: [260382.190945] __schedule+0x2d1/0x830 [260382.194527] schedule+0x35/0xa0 [260382.197757] io_schedule+0x12/0x40 [260382.201249] wait_on_page_bit+0x123/0x220 [260382.205350] ? xas_load+0x8/0x80 [260382.208668] ? file_fdatawait_range+0x20/0x20 [260382.213114] filemap_page_mkwrite+0x9b/0xb0 [260382.217386] do_page_mkwrite+0x53/0x90 [260382.221227] ? vm_normal_page+0x1a/0xc0 [260382.225152] do_wp_page+0x298/0x350 [260382.228733] __handle_mm_fault+0x44f/0x6c0 [260382.232919] ? __switch_to_asm+0x41/0x70 [260382.236930] handle_mm_fault+0xc1/0x1e0 [260382.240856] do_user_addr_fault+0x1b5/0x440 [260382.245132] do_page_fault+0x37/0x130 [260382.248883] ? page_fault+0x8/0x30 [260382.252375] page_fault+0x1e/0x30 [260382.255781] RIP: 0033:0x7f6deee7f1b4 [260382.259451] Code: Unable to access opcode bytes at RIP 0x7f6deee7f18a. 
[260382.266059] RSP: 002b:00007fffe41b6538 EFLAGS: 00010202 [260382.271373] RAX: 00007f6d83049000 RBX: 0000556b63614ec0 RCX: 00007f6d83148fe0 [260382.278591] RDX: 00000000000acfe0 RSI: 00007f6d84e9c030 RDI: 00007f6d8309bfa0 [260382.285813] RBP: 00007f6d84f4a000 R08: ffffffffffffffe0 R09: 0000000000000000 [260382.293030] R10: 00007f6d84f8e810 R11: 00007f6d83049000 R12: 0000000000000001 [260382.300249] R13: 0000556b63614ec0 R14: 0000000000100000 R15: 0000556b63614ee8 [260382.307472] INFO: task fio:3791108 blocked for more than 120 seconds. [260382.313997] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260382.321823] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260382.329734] task:fio state:D stack: 0 pid:3791108 ppid:3790835 flags:0x00004080 [260382.338427] Call Trace: [260382.340967] __schedule+0x2d1/0x830 [260382.344547] ? cv_wait_common+0x12d/0x240 [spl] [260382.349173] schedule+0x35/0xa0 [260382.352406] io_schedule+0x12/0x40 [260382.355899] __lock_page+0x12d/0x230 [260382.359563] ? file_fdatawait_range+0x20/0x20 [260382.364010] zfs_putpage+0x148/0x590 [zfs] [260382.368379] ? rmap_walk_file+0x116/0x290 [260382.372479] ? __mod_memcg_lruvec_state+0x5d/0x160 [260382.377358] zpl_putpage+0x67/0xd0 [zfs] [260382.381552] write_cache_pages+0x197/0x420 [260382.385739] ? zpl_readpage_filler+0x10/0x10 [zfs] [260382.390791] zpl_writepages+0x119/0x130 [zfs] [260382.395410] do_writepages+0xc2/0x1c0 [260382.399161] ? flush_tlb_func_common.constprop.9+0x125/0x220 [260382.404907] __filemap_fdatawrite_range+0xc7/0x100 [260382.409790] filemap_write_and_wait_range+0x30/0x80 [260382.414752] generic_file_direct_write+0x120/0x160 [260382.419632] ? 
rrw_exit+0xb0/0x1c0 [zfs] [260382.423838] zpl_iter_write+0xdd/0x160 [zfs] [260382.428379] new_sync_write+0x112/0x160 [260382.432304] vfs_write+0xa5/0x1a0 [260382.435711] ksys_write+0x4f/0xb0 [260382.439115] do_syscall_64+0x5b/0x1a0 [260382.442866] entry_SYSCALL_64_after_hwframe+0x65/0xca [260382.448007] RIP: 0033:0x7f9d192c7a17 [260382.451675] Code: Unable to access opcode bytes at RIP 0x7f9d192c79ed. [260382.458286] RSP: 002b:00007ffc8e4ba270 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [260382.465938] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f9d192c7a17 [260382.473158] RDX: 0000000000100000 RSI: 00007f9caea03000 RDI: 0000000000000005 [260382.480379] RBP: 00007f9caea03000 R08: 0000000000000000 R09: 0000000000000000 [260382.487597] R10: 00005558e8975680 R11: 0000000000000293 R12: 0000000000100000 [260382.494814] R13: 00005558e8985ec0 R14: 0000000000100000 R15: 00005558e8985ee8 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue on Sep 10, 2024
In commit ba30ec9 I got a little overzealous in code cleanup. While I was trying to remove all the stable page code for Linux, I misinterpreted why Brian Behlendorf originally had the try rangelock, drop page lock, and acquire rangelock sequence in zfs_fillpage(). This is still necessary even without stable pages, to avoid a race condition between direct IO writes and pages being faulted in for mmap files. If the rangelock is not held, then a direct IO write can set db->db_data = NULL either in: 1. dmu_write_direct() -> dmu_buf_will_not_fill() -> dmu_buf_will_fill() -> dbuf_noread() -> dbuf_clear_data() 2. dmu_write_direct_done() Without the rangelock this can cause a panic, as dmu_read_impl() can get a NULL pointer for db->db_data when trying to do the memcpy. So this rangelock must be held in zfs_fillpage() no matter what. There are further semantics on when the rangelock should be held in zfs_fillpage(): it must only be held when doing zfs_getpage() -> zfs_fillpage(). The reason is that mappedread() can call zfs_fillpage() if the page is not uptodate. This can occur because filemap_fault() will first add the pages to the inode's address_space mapping and then drop the page lock, which leaves open a window where mappedread() can be called. Since this can occur, mappedread() will hold both the page lock and the rangelock. That is perfectly valid and correct; however, it is important in this case to never grab the rangelock in zfs_fillpage(), or a deadlock will occur. Finally, it is important to note that the rangelock is first attempted with zfs_rangelock_tryenter(). The reason is that the page lock must be dropped in order to grab the rangelock in this case. Otherwise there is a race between zfs_fillpage() and zfs_write() -> update_pages(): in update_pages() the rangelock is already held, and it then grabs the page lock.
So if the page lock is not dropped before acquiring the rangelock in zfs_fillpage() there can be a deadlock. Below is a stack trace showing the NULL pointer dereference that was occurring with the dio_mmap ZTS test case before this commit. [ 7737.430658] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [ 7737.438486] PGD 0 P4D 0 [ 7737.441024] Oops: 0000 [openzfs#1] SMP NOPTI [ 7737.444692] CPU: 33 PID: 599346 Comm: fio Kdump: loaded Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [ 7737.455721] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 10/08/2020 [ 7737.463106] RIP: 0010:__memcpy+0x12/0x20 [ 7737.467032] Code: ff 0f 31 48 c1 e2 20 48 09 c2 48 31 d3 e9 79 ff ff ff 90 90 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 a4 [ 7737.485770] RSP: 0000:ffffc1db829e3b60 EFLAGS: 00010246 [ 7737.490987] RAX: ffff9ef195b6f000 RBX: 0000000000001000 RCX: 0000000000000200 [ 7737.498111] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ef195b6f000 [ 7737.505235] RBP: ffff9ef195b70000 R08: ffff9eef1d1d0000 R09: ffff9eef1d1d0000 [ 7737.512358] R10: ffff9eef27968218 R11: 0000000000000000 R12: 0000000000000000 [ 7737.519481] R13: ffff9ef041adb6d8 R14: 00000000005e1000 R15: 0000000000000001 [ 7737.526607] FS: 00007f77fe2bae80(0000) GS:ffff9f0cae840000(0000) knlGS:0000000000000000 [ 7737.534683] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7737.540423] CR2: 0000000000000000 CR3: 00000003387a6000 CR4: 0000000000350ee0 [ 7737.547553] Call Trace: [ 7737.550003] dmu_read_impl+0x11a/0x210 [zfs] [ 7737.554464] dmu_read+0x56/0x90 [zfs] [ 7737.558292] zfs_fillpage+0x76/0x190 [zfs] [ 7737.562584] zfs_getpage+0x4c/0x80 [zfs] [ 7737.566691] zpl_readpage_common+0x3b/0x80 [zfs] [ 7737.571485] filemap_fault+0x5d6/0xa10 [ 7737.575236] ? obj_cgroup_charge_pages+0xba/0xd0 [ 7737.579856] ? xas_load+0x8/0x80 [ 7737.583088] ? xas_find+0x173/0x1b0 [ 7737.586579] ? 
filemap_map_pages+0x84/0x410 [ 7737.590759] __do_fault+0x38/0xb0 [ 7737.594077] handle_pte_fault+0x559/0x870 [ 7737.598082] __handle_mm_fault+0x44f/0x6c0 [ 7737.602181] handle_mm_fault+0xc1/0x1e0 [ 7737.606019] do_user_addr_fault+0x1b5/0x440 [ 7737.610207] do_page_fault+0x37/0x130 [ 7737.613873] ? page_fault+0x8/0x30 [ 7737.617277] page_fault+0x1e/0x30 [ 7737.620589] RIP: 0033:0x7f77fbce9140 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue on Sep 10, 2024
It is important to hold the dbuf mutex (db_mtx) when creating ZIOs in dmu_read_abd(). The BP returned by dmu_buf_get_bp_from_dbuf() may come from a previous direct IO write; in that case, it is attached to a dirty record in the dbuf. When zio_read() is called, a copy of the BP is made through io_bp_copy to io_bp in zio_create(). Without holding the db_mtx, though, the dirty record may be freed in dbuf_read_done(). This can result in garbage being placed in the BP for the ZIO created through zio_read(). By holding the db_mtx, this race can be avoided. Below is a stack trace of the issue that was occurring in vdev_mirror_child_select() when the ZIO was created without holding the db_mtx. [29882.427056] VERIFY(zio->io_bp == NULL || BP_PHYSICAL_BIRTH(zio->io_bp) == txg) failed [29882.434884] PANIC at vdev_mirror.c:545:vdev_mirror_child_select() [29882.440976] Showing stack for process 1865540 [29882.445336] CPU: 57 PID: 1865540 Comm: fio Kdump: loaded Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [29882.456457] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 10/08/2020 [29882.463844] Call Trace: [29882.466296] dump_stack+0x41/0x60 [29882.469618] spl_panic+0xd0/0xe8 [spl] [29882.473376] ? __dprintf+0x10e/0x180 [zfs] [29882.477674] ? kfree+0xd3/0x250 [29882.480819] ? __dprintf+0x10e/0x180 [zfs] [29882.485103] ? vdev_mirror_map_alloc+0x29/0x50 [zfs] [29882.490250] ? vdev_lookup_top+0x20/0x90 [zfs] [29882.494878] spl_assert+0x17/0x20 [zfs] [29882.498893] vdev_mirror_child_select+0x279/0x300 [zfs] [29882.504289] vdev_mirror_io_start+0x11f/0x2b0 [zfs] [29882.509336] zio_vdev_io_start+0x3ee/0x520 [zfs] [29882.514137] zio_nowait+0x105/0x290 [zfs] [29882.518330] dmu_read_abd+0x196/0x460 [zfs] [29882.522691] dmu_read_uio_direct+0x6d/0xf0 [zfs] [29882.527472] dmu_read_uio_dnode+0x12a/0x140 [zfs] [29882.532345] dmu_read_uio_dbuf+0x3f/0x60 [zfs] [29882.536953] zfs_read+0x238/0x3f0 [zfs] [29882.540976] zpl_iter_read_direct+0xe0/0x180 [zfs] [29882.545952] ? 
rrw_exit+0xc6/0x200 [zfs] [29882.550058] zpl_iter_read+0x90/0xb0 [zfs] [29882.554340] new_sync_read+0x10f/0x150 [29882.558094] vfs_read+0x91/0x140 [29882.561325] ksys_read+0x4f/0xb0 [29882.564557] do_syscall_64+0x5b/0x1a0 [29882.568222] entry_SYSCALL_64_after_hwframe+0x65/0xca [29882.573267] RIP: 0033:0x7f7fe0fa6ab4 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue on Sep 10, 2024
There existed a race condition that was discovered through the dio_random test. When doing fio with --fsync=32, fsync is called on the file after every 32 writes. When this happens, blocks committed to the ZIL will be synced out. However, the code for the O_DIRECT write was updated in 31983d2 to always wait for all previous TXGs to sync out if there was an ARC buf associated with the dbuf. There was an oversight with this update: while waiting on previous TXGs to sync out, the O_DIRECT write holds the rangelock as a writer the entire time. This causes an issue when the ZIL commits writes out through `zfs_get_data()`, because it will try to grab the rangelock as a reader, leading to a deadlock. In order to fix this race condition, I updated the `dmu_buf_impl_t` struct to contain a uint8_t variable that signals that the dbuf attached to an O_DIRECT write is waiting because of mixed direct and buffered data. Using this new `db_mixed_io_dio_wait` variable in the `dmu_buf_impl_t`, the code in `zfs_get_data()` can tell that the rangelock is already held across the entire block and there is no need to grab the rangelock at all. Because the rangelock is held as a writer across the entire block already, no modifications can take place against the block as long as the O_DIRECT write is stalled waiting in `dmu_buf_direct_mixed_io_wait()`. Also as part of this update, I realized the `db_state` in `dmu_buf_direct_mixed_io_wait()` needs to be changed temporarily to `DB_CACHED`. This is necessary so the logic in `dbuf_read()` is correct if `dmu_sync_late_arrival()` is called by `dmu_sync()`. It is completely valid to switch the `db_state` back to `DB_CACHED` because there is still an associated ARC buf that will not be freed until our O_DIRECT write is completed, which will only happen after it leaves `dmu_buf_direct_mixed_io_wait()`.
Here is the stack trace of the deadlock that happen with `dio_random.ksh` before this commit: [ 5513.663402] sd 0:0:0:0: [sda] Synchronizing SCSI cache [ 7496.580415] INFO: task z_wr_int:1098000 blocked for more than 120 seconds. [ 7496.585709] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7496.593349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7496.600839] task:z_wr_int state:D stack: 0 pid:1098000 ppid: 2 flags:0x80004080 [ 7496.608622] Call Trace: [ 7496.611770] __schedule+0x2d1/0x870 [ 7496.615404] schedule+0x55/0xf0 [ 7496.618866] cv_wait_common+0x16d/0x280 [spl] [ 7496.622910] ? finish_wait+0x80/0x80 [ 7496.626601] dmu_buf_direct_mixed_io_wait+0x73/0x190 [zfs] [ 7496.631327] dmu_write_direct_done+0x90/0x3a0 [zfs] [ 7496.635798] zio_done+0x373/0x1d40 [zfs] [ 7496.639795] zio_execute+0xee/0x210 [zfs] [ 7496.643840] taskq_thread+0x203/0x420 [spl] [ 7496.647836] ? wake_up_q+0x70/0x70 [ 7496.651411] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [ 7496.656489] ? taskq_lowest_id+0xc0/0xc0 [spl] [ 7496.660604] kthread+0x134/0x150 [ 7496.664092] ? set_kthread_struct+0x50/0x50 [ 7496.668080] ret_from_fork+0x35/0x40 [ 7496.671745] INFO: task txg_sync:1098025 blocked for more than 120 seconds. [ 7496.676991] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7496.684666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7496.692060] task:txg_sync state:D stack: 0 pid:1098025 ppid: 2 flags:0x80004080 [ 7496.699888] Call Trace: [ 7496.703012] __schedule+0x2d1/0x870 [ 7496.706658] schedule+0x55/0xf0 [ 7496.710093] schedule_timeout+0x197/0x300 [ 7496.713982] ? __next_timer_interrupt+0xf0/0xf0 [ 7496.718135] io_schedule_timeout+0x19/0x40 [ 7496.722049] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 7496.726349] ? 
finish_wait+0x80/0x80 [ 7496.730039] __cv_timedwait_io+0x15/0x20 [spl] [ 7496.734100] zio_wait+0x1a2/0x4d0 [zfs] [ 7496.738082] dsl_pool_sync+0xcb/0x6c0 [zfs] [ 7496.742205] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7496.746534] ? spa_errlog_sync+0x2f0/0x3d0 [zfs] [ 7496.750842] spa_sync_iterate_to_convergence+0xcf/0x310 [zfs] [ 7496.755742] spa_sync+0x362/0x8d0 [zfs] [ 7496.759689] txg_sync_thread+0x274/0x3b0 [zfs] [ 7496.763928] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] [ 7496.768439] ? spl_assert.constprop.0+0x20/0x20 [spl] [ 7496.772799] thread_generic_wrapper+0x63/0x90 [spl] [ 7496.777097] kthread+0x134/0x150 [ 7496.780616] ? set_kthread_struct+0x50/0x50 [ 7496.784549] ret_from_fork+0x35/0x40 [ 7496.788204] INFO: task fio:1101750 blocked for more than 120 seconds. [ 7496.895852] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7496.903765] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7496.911170] task:fio state:D stack: 0 pid:1101750 ppid:1101741 flags:0x00004080 [ 7496.919033] Call Trace: [ 7496.922136] __schedule+0x2d1/0x870 [ 7496.925769] schedule+0x55/0xf0 [ 7496.929245] schedule_timeout+0x197/0x300 [ 7496.933120] ? __next_timer_interrupt+0xf0/0xf0 [ 7496.937213] io_schedule_timeout+0x19/0x40 [ 7496.941126] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 7496.945444] ? finish_wait+0x80/0x80 [ 7496.949125] __cv_timedwait_io+0x15/0x20 [spl] [ 7496.953191] zio_wait+0x1a2/0x4d0 [zfs] [ 7496.957180] dmu_write_abd+0x174/0x1c0 [zfs] [ 7496.961319] dmu_write_uio_direct+0x79/0xf0 [zfs] [ 7496.965731] dmu_write_uio_dnode+0xa6/0x2d0 [zfs] [ 7496.970043] dmu_write_uio_dbuf+0x47/0x60 [zfs] [ 7496.974305] zfs_write+0x55f/0xea0 [zfs] [ 7496.978325] ? iov_iter_get_pages+0xe9/0x390 [ 7496.982333] ? trylock_page+0xd/0x20 [zfs] [ 7496.986451] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7496.990713] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [ 7496.995031] zpl_iter_write_direct+0xda/0x170 [zfs] [ 7496.999489] ? 
rrw_exit+0xc6/0x200 [zfs] [ 7497.003476] zpl_iter_write+0xd5/0x110 [zfs] [ 7497.007610] new_sync_write+0x112/0x160 [ 7497.011429] vfs_write+0xa5/0x1b0 [ 7497.014916] ksys_write+0x4f/0xb0 [ 7497.018443] do_syscall_64+0x5b/0x1b0 [ 7497.022150] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7497.026532] RIP: 0033:0x7f8771d72a17 [ 7497.030195] Code: Unable to access opcode bytes at RIP 0x7f8771d729ed. [ 7497.035263] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 7497.042547] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8771d72a17 [ 7497.047933] RDX: 000000000009b000 RSI: 00007f8713454000 RDI: 0000000000000005 [ 7497.053269] RBP: 00007f8713454000 R08: 0000000000000000 R09: 0000000000000000 [ 7497.058660] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000009b000 [ 7497.063960] R13: 000055b390afcac0 R14: 000000000009b000 R15: 000055b390afcae8 [ 7497.069334] INFO: task fio:1101751 blocked for more than 120 seconds. [ 7497.074308] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7497.081973] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7497.089371] task:fio state:D stack: 0 pid:1101751 ppid:1101741 flags:0x00000080 [ 7497.097147] Call Trace: [ 7497.100263] __schedule+0x2d1/0x870 [ 7497.103897] ? rrw_exit+0xc6/0x200 [zfs] [ 7497.107878] schedule+0x55/0xf0 [ 7497.111386] cv_wait_common+0x16d/0x280 [spl] [ 7497.115391] ? finish_wait+0x80/0x80 [ 7497.119028] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [ 7497.123667] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [ 7497.128240] zfs_read+0xaf/0x3f0 [zfs] [ 7497.132146] ? rrw_exit+0xc6/0x200 [zfs] [ 7497.136091] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7497.140366] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [ 7497.144679] zpl_iter_read_direct+0xe0/0x180 [zfs] [ 7497.149054] ? 
rrw_exit+0xc6/0x200 [zfs] [ 7497.153040] zpl_iter_read+0x94/0xb0 [zfs] [ 7497.157103] new_sync_read+0x10f/0x160 [ 7497.160855] vfs_read+0x91/0x150 [ 7497.164336] ksys_read+0x4f/0xb0 [ 7497.168004] do_syscall_64+0x5b/0x1b0 [ 7497.171706] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7497.176105] RIP: 0033:0x7f8771d72ab4 [ 7497.179742] Code: Unable to access opcode bytes at RIP 0x7f8771d72a8a. [ 7497.184807] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 7497.192129] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8771d72ab4 [ 7497.197485] RDX: 0000000000002000 RSI: 00007f8713454000 RDI: 0000000000000005 [ 7497.202922] RBP: 00007f8713454000 R08: 0000000000000000 R09: 0000000000000000 [ 7497.208309] R10: 00000001ffffffff R11: 0000000000000246 R12: 0000000000002000 [ 7497.213694] R13: 000055b390afcac0 R14: 0000000000002000 R15: 000055b390afcae8 [ 7497.219063] INFO: task fio:1101755 blocked for more than 120 seconds. [ 7497.224098] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7497.231786] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7497.239165] task:fio state:D stack: 0 pid:1101755 ppid:1101744 flags:0x00000080 [ 7497.246989] Call Trace: [ 7497.250121] __schedule+0x2d1/0x870 [ 7497.253779] schedule+0x55/0xf0 [ 7497.257240] schedule_preempt_disabled+0xa/0x10 [ 7497.261344] __mutex_lock.isra.7+0x349/0x420 [ 7497.265326] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7497.269674] zil_commit_writer+0x89/0x230 [zfs] [ 7497.273938] zil_commit_impl+0x5f/0xd0 [zfs] [ 7497.278101] zfs_fsync+0x81/0xa0 [zfs] [ 7497.282002] zpl_fsync+0xe5/0x140 [zfs] [ 7497.285985] do_fsync+0x38/0x70 [ 7497.289458] __x64_sys_fsync+0x10/0x20 [ 7497.293208] do_syscall_64+0x5b/0x1b0 [ 7497.296928] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7497.301260] RIP: 0033:0x7f9559073027 [ 7497.304920] Code: Unable to access opcode bytes at RIP 0x7f9559072ffd. 
[ 7497.310015] RSP: 002b:00007ffdefcd0ff0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [ 7497.317346] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f9559073027 [ 7497.322722] RDX: 0000000000000000 RSI: 0000563adcbf2ac0 RDI: 0000000000000005 [ 7497.328126] RBP: 00007f94fb858000 R08: 0000000000000000 R09: 0000000000000000 [ 7497.333514] R10: 0000000000008000 R11: 0000000000000293 R12: 0000000000000003 [ 7497.338887] R13: 0000563adcbf2ac0 R14: 0000000000000000 R15: 0000563adcbf2ae8 [ 7497.344247] INFO: task fio:1101756 blocked for more than 120 seconds. [ 7497.349327] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7497.357032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7497.364517] task:fio state:D stack: 0 pid:1101756 ppid:1101744 flags:0x00004080 [ 7497.372310] Call Trace: [ 7497.375433] __schedule+0x2d1/0x870 [ 7497.379004] schedule+0x55/0xf0 [ 7497.382454] cv_wait_common+0x16d/0x280 [spl] [ 7497.386477] ? finish_wait+0x80/0x80 [ 7497.390137] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [ 7497.394816] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [ 7497.399397] zfs_get_data+0x1a8/0x7e0 [zfs] [ 7497.403515] zil_lwb_commit+0x1a5/0x400 [zfs] [ 7497.407712] zil_lwb_write_close+0x408/0x630 [zfs] [ 7497.412126] zil_commit_waiter_timeout+0x16d/0x520 [zfs] [ 7497.416801] zil_commit_waiter+0x1d2/0x3b0 [zfs] [ 7497.421139] zil_commit_impl+0x6d/0xd0 [zfs] [ 7497.425294] zfs_fsync+0x81/0xa0 [zfs] [ 7497.429454] zpl_fsync+0xe5/0x140 [zfs] [ 7497.433396] do_fsync+0x38/0x70 [ 7497.436878] __x64_sys_fsync+0x10/0x20 [ 7497.440586] do_syscall_64+0x5b/0x1b0 [ 7497.444313] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7497.448659] RIP: 0033:0x7f9559073027 [ 7497.452343] Code: Unable to access opcode bytes at RIP 0x7f9559072ffd. 
[ 7497.457408] RSP: 002b:00007ffdefcd0ff0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [ 7497.464724] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f9559073027 [ 7497.470106] RDX: 0000000000000000 RSI: 0000563adcbf2ac0 RDI: 0000000000000005 [ 7497.475477] RBP: 00007f94fb89ca18 R08: 0000000000000000 R09: 0000000000000000 [ 7497.480806] R10: 00000000000b4cc0 R11: 0000000000000293 R12: 0000000000000003 [ 7497.486158] R13: 0000563adcbf2ac0 R14: 0000000000000000 R15: 0000563adcbf2ae8 [ 7619.459402] INFO: task z_wr_int:1098000 blocked for more than 120 seconds. [ 7619.464605] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7619.472233] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7619.479659] task:z_wr_int state:D stack: 0 pid:1098000 ppid: 2 flags:0x80004080 [ 7619.487518] Call Trace: [ 7619.490650] __schedule+0x2d1/0x870 [ 7619.494246] schedule+0x55/0xf0 [ 7619.497719] cv_wait_common+0x16d/0x280 [spl] [ 7619.501749] ? finish_wait+0x80/0x80 [ 7619.505411] dmu_buf_direct_mixed_io_wait+0x73/0x190 [zfs] [ 7619.510143] dmu_write_direct_done+0x90/0x3a0 [zfs] [ 7619.514603] zio_done+0x373/0x1d40 [zfs] [ 7619.518594] zio_execute+0xee/0x210 [zfs] [ 7619.522619] taskq_thread+0x203/0x420 [spl] [ 7619.526567] ? wake_up_q+0x70/0x70 [ 7619.530208] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [ 7619.535302] ? taskq_lowest_id+0xc0/0xc0 [spl] [ 7619.539385] kthread+0x134/0x150 [ 7619.542873] ? set_kthread_struct+0x50/0x50 [ 7619.546810] ret_from_fork+0x35/0x40 [ 7619.550477] INFO: task txg_sync:1098025 blocked for more than 120 seconds. [ 7619.555715] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7619.563415] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[ 7619.570851] task:txg_sync state:D stack: 0 pid:1098025 ppid: 2 flags:0x80004080 [ 7619.578606] Call Trace: [ 7619.581742] __schedule+0x2d1/0x870 [ 7619.585396] schedule+0x55/0xf0 [ 7619.589006] schedule_timeout+0x197/0x300 [ 7619.592916] ? __next_timer_interrupt+0xf0/0xf0 [ 7619.597027] io_schedule_timeout+0x19/0x40 [ 7619.600947] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 7619.709878] ? finish_wait+0x80/0x80 [ 7619.713565] __cv_timedwait_io+0x15/0x20 [spl] [ 7619.717596] zio_wait+0x1a2/0x4d0 [zfs] [ 7619.721567] dsl_pool_sync+0xcb/0x6c0 [zfs] [ 7619.725657] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7619.730050] ? spa_errlog_sync+0x2f0/0x3d0 [zfs] [ 7619.734415] spa_sync_iterate_to_convergence+0xcf/0x310 [zfs] [ 7619.739268] spa_sync+0x362/0x8d0 [zfs] [ 7619.743270] txg_sync_thread+0x274/0x3b0 [zfs] [ 7619.747494] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] [ 7619.751939] ? spl_assert.constprop.0+0x20/0x20 [spl] [ 7619.756279] thread_generic_wrapper+0x63/0x90 [spl] [ 7619.760569] kthread+0x134/0x150 [ 7619.764050] ? set_kthread_struct+0x50/0x50 [ 7619.767978] ret_from_fork+0x35/0x40 [ 7619.771639] INFO: task fio:1101750 blocked for more than 120 seconds. [ 7619.776678] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7619.784324] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7619.791914] task:fio state:D stack: 0 pid:1101750 ppid:1101741 flags:0x00004080 [ 7619.799712] Call Trace: [ 7619.802816] __schedule+0x2d1/0x870 [ 7619.806427] schedule+0x55/0xf0 [ 7619.809867] schedule_timeout+0x197/0x300 [ 7619.813760] ? __next_timer_interrupt+0xf0/0xf0 [ 7619.817848] io_schedule_timeout+0x19/0x40 [ 7619.821766] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 7619.826097] ? 
finish_wait+0x80/0x80 [ 7619.829780] __cv_timedwait_io+0x15/0x20 [spl] [ 7619.833857] zio_wait+0x1a2/0x4d0 [zfs] [ 7619.837838] dmu_write_abd+0x174/0x1c0 [zfs] [ 7619.842015] dmu_write_uio_direct+0x79/0xf0 [zfs] [ 7619.846388] dmu_write_uio_dnode+0xa6/0x2d0 [zfs] [ 7619.850760] dmu_write_uio_dbuf+0x47/0x60 [zfs] [ 7619.855011] zfs_write+0x55f/0xea0 [zfs] [ 7619.859008] ? iov_iter_get_pages+0xe9/0x390 [ 7619.863036] ? trylock_page+0xd/0x20 [zfs] [ 7619.867084] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7619.871366] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [ 7619.875715] zpl_iter_write_direct+0xda/0x170 [zfs] [ 7619.880164] ? rrw_exit+0xc6/0x200 [zfs] [ 7619.884174] zpl_iter_write+0xd5/0x110 [zfs] [ 7619.888492] new_sync_write+0x112/0x160 [ 7619.892285] vfs_write+0xa5/0x1b0 [ 7619.895829] ksys_write+0x4f/0xb0 [ 7619.899384] do_syscall_64+0x5b/0x1b0 [ 7619.903071] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7619.907394] RIP: 0033:0x7f8771d72a17 [ 7619.911026] Code: Unable to access opcode bytes at RIP 0x7f8771d729ed. [ 7619.916073] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 7619.923363] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8771d72a17 [ 7619.928675] RDX: 000000000009b000 RSI: 00007f8713454000 RDI: 0000000000000005 [ 7619.934019] RBP: 00007f8713454000 R08: 0000000000000000 R09: 0000000000000000 [ 7619.939354] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000009b000 [ 7619.944775] R13: 000055b390afcac0 R14: 000000000009b000 R15: 000055b390afcae8 [ 7619.950175] INFO: task fio:1101751 blocked for more than 120 seconds. [ 7619.955232] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7619.962889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7619.970301] task:fio state:D stack: 0 pid:1101751 ppid:1101741 flags:0x00000080 [ 7619.978139] Call Trace: [ 7619.981278] __schedule+0x2d1/0x870 [ 7619.984872] ? 
rrw_exit+0xc6/0x200 [zfs] [ 7619.989260] schedule+0x55/0xf0 [ 7619.992725] cv_wait_common+0x16d/0x280 [spl] [ 7619.996754] ? finish_wait+0x80/0x80 [ 7620.000414] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [ 7620.005050] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [ 7620.009617] zfs_read+0xaf/0x3f0 [zfs] [ 7620.013503] ? rrw_exit+0xc6/0x200 [zfs] [ 7620.017489] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7620.021774] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [ 7620.026091] zpl_iter_read_direct+0xe0/0x180 [zfs] [ 7620.030508] ? rrw_exit+0xc6/0x200 [zfs] [ 7620.034497] zpl_iter_read+0x94/0xb0 [zfs] [ 7620.038579] new_sync_read+0x10f/0x160 [ 7620.042325] vfs_read+0x91/0x150 [ 7620.045809] ksys_read+0x4f/0xb0 [ 7620.049273] do_syscall_64+0x5b/0x1b0 [ 7620.052965] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7620.057354] RIP: 0033:0x7f8771d72ab4 [ 7620.060988] Code: Unable to access opcode bytes at RIP 0x7f8771d72a8a. [ 7620.066041] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 7620.073256] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8771d72ab4 [ 7620.078553] RDX: 0000000000002000 RSI: 00007f8713454000 RDI: 0000000000000005 [ 7620.083878] RBP: 00007f8713454000 R08: 0000000000000000 R09: 0000000000000000 [ 7620.089353] R10: 00000001ffffffff R11: 0000000000000246 R12: 0000000000002000 [ 7620.094697] R13: 000055b390afcac0 R14: 0000000000002000 R15: 000055b390afcae8 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue on Sep 10, 2024
995734e added a test for block cloning with mmap files. As a result I began hitting a panic in that test in dbuf_unoverride(). The ASSERT was that if the dirty record was from a cloned block, then dr_data must be set to NULL. This ASSERT was added in 86e115e. The point of that commit was to make sure that if a cloned block is read before it is synced out, then the associated ARC buffer is set in the dirty record. This became an issue with the O_DIRECT code, because dr_data was set to the ARC buf in dbuf_set_data() after the read, which is incorrect logic for a cloned block. In order to fix this issue, I refined how to determine whether the dirty record is in fact from an O_DIRECT write by making sure that dr_brtwrite is false. I created the function dbuf_dirty_is_direct_write() to perform the proper check. As part of this, I also cleaned up other code that did the exact same check for an O_DIRECT write to make sure the proper check is taking place everywhere. The trace of the ASSERT that was being tripped before this change is below: [3649972.811039] VERIFY0P(dr->dt.dl.dr_data) failed (NULL == ffff8d58e8183c80) [3649972.817999] PANIC at dbuf.c:1999:dbuf_unoverride() [3649972.822968] Showing stack for process 2365657 [3649972.827502] CPU: 0 PID: 2365657 Comm: clone_mmap_writ Kdump: loaded Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [3649972.839749] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 10/08/2020 [3649972.847315] Call Trace: [3649972.849935] dump_stack+0x41/0x60 [3649972.853428] spl_panic+0xd0/0xe8 [spl] [3649972.857370] ? cityhash4+0x75/0x90 [zfs] [3649972.861649] ? _cond_resched+0x15/0x30 [3649972.865577] ? spl_kmem_alloc_impl+0xce/0xf0 [spl] [3649972.870548] ? __kmalloc_node+0x10d/0x300 [3649972.874735] ? spl_kmem_alloc_impl+0xce/0xf0 [spl] [3649972.879702] ? 
__list_add+0x12/0x30 [zfs] [3649972.884061] dbuf_unoverride+0x1c1/0x1d0 [zfs] [3649972.888856] dbuf_redirty+0x3b/0xd0 [zfs] [3649972.893204] dbuf_dirty+0xeb1/0x1330 [zfs] [3649972.897643] ? _cond_resched+0x15/0x30 [3649972.901569] ? mutex_lock+0xe/0x30 [3649972.905148] ? dbuf_noread+0x117/0x240 [zfs] [3649972.909760] dmu_write_uio_dnode+0x1d2/0x320 [zfs] [3649972.914900] dmu_write_uio_dbuf+0x47/0x60 [zfs] [3649972.919777] zfs_write+0x57d/0xe00 [zfs] [3649972.924076] ? alloc_set_pte+0xb8/0x3e0 [3649972.928088] zpl_iter_write_buffered+0xb2/0x120 [zfs] [3649972.933507] ? rrw_exit+0xc6/0x200 [zfs] [3649972.937796] zpl_iter_write+0xba/0x110 [zfs] [3649972.942433] new_sync_write+0x112/0x160 [3649972.946445] vfs_write+0xa5/0x1a0 [3649972.949935] ksys_pwrite64+0x61/0xa0 [3649972.953681] do_syscall_64+0x5b/0x1a0 [3649972.957519] entry_SYSCALL_64_after_hwframe+0x65/0xca [3649972.962745] RIP: 0033:0x7f610616f01b Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 10, 2024
Originally I was checking dr->dr_dbuf->db_level == 0 in dbuf_dirty_is_direct_write(). However, this can lead to a NULL pointer dereference if the dr_dbuf is no longer set. I updated dbuf_dirty_is_direct_write() to now also take a dmu_buf_impl_t to check if db->db_level == 0. This failure was caught on the Fedora 37 CI running the test enospc_rm. Below is the stack trace. [ 9851.511608] BUG: kernel NULL pointer dereference, address: 0000000000000068 [ 9851.515922] #PF: supervisor read access in kernel mode [ 9851.519462] #PF: error_code(0x0000) - not-present page [ 9851.522992] PGD 0 P4D 0 [ 9851.525684] Oops: 0000 [openzfs#1] PREEMPT SMP PTI [ 9851.528878] CPU: 0 PID: 1272993 Comm: fio Tainted: P OE 6.5.12-100.fc37.x86_64 openzfs#1 [ 9851.535266] Hardware name: Amazon EC2 m5d.large/, BIOS 1.0 10/16/2017 [ 9851.539226] RIP: 0010:dbuf_dirty_is_direct_write+0xb/0x40 [zfs] [ 9851.543379] Code: 10 74 02 31 c0 5b c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 31 c0 48 85 ff 74 31 48 8b 57 20 <80> 7a 68 00 75 27 8b 87 64 01 00 00 85 c0 75 1b 83 bf 58 01 00 00 [ 9851.554719] RSP: 0018:ffff9b5b8305f8e8 EFLAGS: 00010286 [ 9851.558276] RAX: 0000000000000000 RBX: ffff9b5b8569b0b8 RCX: 0000000000000000 [ 9851.562481] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8f2e97de9e00 [ 9851.566672] RBP: 0000000000020000 R08: 0000000000000000 R09: ffff8f2f70e94000 [ 9851.570851] R10: 0000000000000001 R11: 0000000000000110 R12: ffff8f2f774ae4c0 [ 9851.575032] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 9851.579209] FS: 00007f57c5542240(0000) GS:ffff8f2faa800000(0000) knlGS:0000000000000000 [ 9851.585357] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9851.589064] CR2: 0000000000000068 CR3: 00000001f9a38001 CR4: 00000000007706f0 [ 9851.593256] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 9851.597440] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 9851.601618] PKRU: 55555554 [ 
9851.604341] Call Trace: [ 9851.606981] <TASK> [ 9851.609515] ? __die+0x23/0x70 [ 9851.612388] ? page_fault_oops+0x171/0x4e0 [ 9851.615571] ? exc_page_fault+0x77/0x170 [ 9851.618704] ? asm_exc_page_fault+0x26/0x30 [ 9851.621900] ? dbuf_dirty_is_direct_write+0xb/0x40 [zfs] [ 9851.625828] zfs_get_data+0x407/0x820 [zfs] [ 9851.629400] zil_lwb_commit+0x18d/0x3f0 [zfs] [ 9851.633026] zil_lwb_write_issue+0x92/0xbb0 [zfs] [ 9851.636758] zil_commit_waiter_timeout+0x1f3/0x580 [zfs] [ 9851.640696] zil_commit_waiter+0x1ff/0x3a0 [zfs] [ 9851.644402] zil_commit_impl+0x71/0xd0 [zfs] [ 9851.647998] zfs_write+0xb51/0xdc0 [zfs] [ 9851.651467] zpl_iter_write_buffered+0xc9/0x140 [zfs] [ 9851.655337] zpl_iter_write+0xc0/0x110 [zfs] [ 9851.658920] vfs_write+0x23e/0x420 [ 9851.661871] __x64_sys_pwrite64+0x98/0xd0 [ 9851.665013] do_syscall_64+0x5f/0x90 [ 9851.668027] ? ksys_fadvise64_64+0x57/0xa0 [ 9851.671212] ? syscall_exit_to_user_mode+0x2b/0x40 [ 9851.674594] ? do_syscall_64+0x6b/0x90 [ 9851.677655] ? syscall_exit_to_user_mode+0x2b/0x40 [ 9851.681051] ? do_syscall_64+0x6b/0x90 [ 9851.684128] ? exc_page_fault+0x77/0x170 [ 9851.687256] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 9851.690759] RIP: 0033:0x7f57c563c377 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
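The reworked guard can be sketched as follows. These are hypothetical simplified types (the `dr_direct` field collapses the full O_DIRECT conditions for brevity); the key point, as the commit describes, is that the caller passes in the dbuf it already holds, so the check never chases `dr->dr_dbuf`, which may be stale.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in types for illustration; not the real OpenZFS definitions. */
typedef struct dmu_buf_impl { int db_level; } dmu_buf_impl_t;

typedef struct dbuf_dirty_record {
    struct dmu_buf_impl *dr_dbuf; /* may no longer be valid; never dereferenced here */
    bool dr_direct;               /* stand-in for the full O_DIRECT conditions */
} dbuf_dirty_record_t;

/*
 * The level-0 check uses the caller-supplied db->db_level instead of
 * dr->dr_dbuf->db_level, avoiding the NULL pointer dereference hit in
 * the enospc_rm test.
 */
static bool
dbuf_dirty_is_direct_write(const dmu_buf_impl_t *db, const dbuf_dirty_record_t *dr)
{
    if (dr == NULL || db->db_level != 0)
        return (false);
    return (dr->dr_direct);
}
```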
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 10, 2024
There existed a race condition between when a Direct I/O write could complete and when a sync operation was issued. This was due to the fact that a Direct I/O write would sleep waiting on previous TXGs to sync out their dirty records associated with a dbuf if there was an ARC buffer associated with the dbuf. This was necessary to safely destroy the ARC buffer in case a previous dirty record's dr_data pointed at the db_buf. The main issue with this approach is that a Direct I/O write holds the rangelock across the entire block, so when a sync on that same block was issued and tried to grab the rangelock as reader, it would be blocked indefinitely because the Direct I/O write that was now sleeping was holding that same rangelock as writer. This led to a complete deadlock. This commit fixes this issue and removes the wait in dmu_write_direct_done(). The way this is now handled is the ARC buffer is destroyed, if there is one associated with the dbuf, before ever issuing the Direct I/O write. This implementation heavily borrows from the block cloning implementation. A new function dmu_buf_will_clone_or_dio() is called in both dmu_write_direct() and dmu_brt_clone() that does the following: 1. Undirties a dirty record for that dbuf if there is one currently associated with the current TXG. 2. Destroys the ARC buffer if the previous dirty record's dr_data does not point at the dbuf's ARC buffer (db_buf). 3. Sets the dbuf's data pointers to NULL. 4. Redirties the dbuf using db_state = DB_NOFILL. As part of this commit, the dmu_write_direct_done() function was also cleaned up. Now dmu_sync_done() is called before undirtying the dbuf dirty record associated with a failed Direct I/O write. This is correct logic and how it always should have been. An additional benefit of these modifications is that there is no longer a stall in a Direct I/O write if the user is mixing buffered and O_DIRECT I/O together. 
Also, it unifies the block cloning and Direct I/O write paths, as they both need to call dbuf_fix_old_data() before destroying the ARC buffer. As part of this commit, there is also just general code cleanup. Various dbuf stats were removed because they are no longer necessary. Additionally, useless functions were removed to make the code paths cleaner for Direct I/O. Below is the race condition stack trace that was being consistently observed in the CI runs for the dio_random test case that prompted these changes: trace: [ 7795.294473] sd 0:0:0:0: [sda] Synchronizing SCSI cache [ 9954.769075] INFO: task z_wr_int:1051869 blocked for more than 120 seconds. [ 9954.770512] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.772159] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.773848] task:z_wr_int state:D stack:0 pid:1051869 ppid:2 flags:0x80004080 [ 9954.775512] Call Trace: [ 9954.776406] __schedule+0x2d1/0x870 [ 9954.777386] ? free_one_page+0x204/0x530 [ 9954.778466] schedule+0x55/0xf0 [ 9954.779355] cv_wait_common+0x16d/0x280 [spl] [ 9954.780491] ? finish_wait+0x80/0x80 [ 9954.781450] dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs] [ 9954.782889] dmu_write_direct_done+0x90/0x3b0 [zfs] [ 9954.784255] zio_done+0x373/0x1d50 [zfs] [ 9954.785410] zio_execute+0xee/0x210 [zfs] [ 9954.786588] taskq_thread+0x205/0x3f0 [spl] [ 9954.787673] ? wake_up_q+0x60/0x60 [ 9954.788571] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [ 9954.790079] ? taskq_lowest_id+0xc0/0xc0 [spl] [ 9954.791199] kthread+0x134/0x150 [ 9954.792082] ? set_kthread_struct+0x50/0x50 [ 9954.793189] ret_from_fork+0x35/0x40 [ 9954.794108] INFO: task txg_sync:1051894 blocked for more than 120 seconds. [ 9954.795535] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.797103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[ 9954.798669] task:txg_sync state:D stack:0 pid:1051894 ppid:2 flags:0x80004080 [ 9954.800267] Call Trace: [ 9954.801096] __schedule+0x2d1/0x870 [ 9954.801972] ? __wake_up_common+0x7a/0x190 [ 9954.802963] schedule+0x55/0xf0 [ 9954.803884] schedule_timeout+0x19f/0x320 [ 9954.804837] ? __next_timer_interrupt+0xf0/0xf0 [ 9954.805932] ? taskq_dispatch+0xab/0x280 [spl] [ 9954.806959] io_schedule_timeout+0x19/0x40 [ 9954.807989] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 9954.809110] ? finish_wait+0x80/0x80 [ 9954.810068] __cv_timedwait_io+0x15/0x20 [spl] [ 9954.811103] zio_wait+0x1ad/0x4f0 [zfs] [ 9954.812255] dsl_pool_sync+0xcb/0x6c0 [zfs] [ 9954.813442] ? spa_errlog_sync+0x2f0/0x3d0 [zfs] [ 9954.814648] spa_sync_iterate_to_convergence+0xcb/0x310 [zfs] [ 9954.816023] spa_sync+0x362/0x8f0 [zfs] [ 9954.817110] txg_sync_thread+0x27a/0x3b0 [zfs] [ 9954.818267] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] [ 9954.819510] ? spl_assert.constprop.0+0x20/0x20 [spl] [ 9954.820643] thread_generic_wrapper+0x63/0x90 [spl] [ 9954.821709] kthread+0x134/0x150 [ 9954.822590] ? set_kthread_struct+0x50/0x50 [ 9954.823584] ret_from_fork+0x35/0x40 [ 9954.824444] INFO: task fio:1055501 blocked for more than 120 seconds. [ 9954.825781] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.827315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.828871] task:fio state:D stack:0 pid:1055501 ppid:1055490 flags:0x00004080 [ 9954.830463] Call Trace: [ 9954.831280] __schedule+0x2d1/0x870 [ 9954.832159] ? dbuf_hold_copy+0xec/0x230 [zfs] [ 9954.833396] schedule+0x55/0xf0 [ 9954.834286] cv_wait_common+0x16d/0x280 [spl] [ 9954.835291] ? finish_wait+0x80/0x80 [ 9954.836235] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [ 9954.837543] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [ 9954.838838] zfs_get_data+0x566/0x810 [zfs] [ 9954.840034] zil_lwb_commit+0x194/0x3f0 [zfs] [ 9954.841154] zil_lwb_write_issue+0x68/0xb90 [zfs] [ 9954.842367] ? 
__list_add+0x12/0x30 [zfs] [ 9954.843496] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 9954.844665] ? zil_alloc_lwb+0x217/0x360 [zfs] [ 9954.845852] zil_commit_waiter_timeout+0x1f3/0x570 [zfs] [ 9954.847203] zil_commit_waiter+0x1d2/0x3b0 [zfs] [ 9954.848380] zil_commit_impl+0x6d/0xd0 [zfs] [ 9954.849550] zfs_fsync+0x66/0x90 [zfs] [ 9954.850640] zpl_fsync+0xe5/0x140 [zfs] [ 9954.851729] do_fsync+0x38/0x70 [ 9954.852585] __x64_sys_fsync+0x10/0x20 [ 9954.853486] do_syscall_64+0x5b/0x1b0 [ 9954.854416] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 9954.855466] RIP: 0033:0x7eff236bb057 [ 9954.856388] Code: Unable to access opcode bytes at RIP 0x7eff236bb02d. [ 9954.857651] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [ 9954.859141] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007eff236bb057 [ 9954.860496] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI: 0000000000000006 [ 9954.861945] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09: 0000000000000000 [ 9954.863327] R10: 0000000000056000 R11: 0000000000000293 R12: 0000000000000003 [ 9954.864765] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15: 000055e4d1f13ae8 [ 9954.866149] INFO: task fio:1055502 blocked for more than 120 seconds. [ 9954.867490] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.869029] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.870571] task:fio state:D stack:0 pid:1055502 ppid:1055490 flags:0x00004080 [ 9954.872162] Call Trace: [ 9954.872947] __schedule+0x2d1/0x870 [ 9954.873844] schedule+0x55/0xf0 [ 9954.874716] schedule_timeout+0x19f/0x320 [ 9954.875645] ? __next_timer_interrupt+0xf0/0xf0 [ 9954.876722] io_schedule_timeout+0x19/0x40 [ 9954.877677] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 9954.878822] ? 
finish_wait+0x80/0x80 [ 9954.879694] __cv_timedwait_io+0x15/0x20 [spl] [ 9954.880763] zio_wait+0x1ad/0x4f0 [zfs] [ 9954.881865] dmu_write_abd+0x174/0x1c0 [zfs] [ 9954.883074] dmu_write_uio_direct+0x79/0x100 [zfs] [ 9954.884285] dmu_write_uio_dnode+0xb2/0x320 [zfs] [ 9954.885507] dmu_write_uio_dbuf+0x47/0x60 [zfs] [ 9954.886687] zfs_write+0x581/0xe20 [zfs] [ 9954.887822] ? iov_iter_get_pages+0xe9/0x390 [ 9954.888862] ? trylock_page+0xd/0x20 [zfs] [ 9954.890005] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 9954.891217] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [ 9954.892391] zpl_iter_write_direct+0xd4/0x170 [zfs] [ 9954.893663] ? rrw_exit+0xc6/0x200 [zfs] [ 9954.894764] zpl_iter_write+0xd5/0x110 [zfs] [ 9954.895911] new_sync_write+0x112/0x160 [ 9954.896881] vfs_write+0xa5/0x1b0 [ 9954.897701] ksys_write+0x4f/0xb0 [ 9954.898569] do_syscall_64+0x5b/0x1b0 [ 9954.899417] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 9954.900515] RIP: 0033:0x7eff236baa47 [ 9954.901363] Code: Unable to access opcode bytes at RIP 0x7eff236baa1d. [ 9954.902673] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 9954.904099] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007eff236baa47 [ 9954.905535] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI: 0000000000000005 [ 9954.906902] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09: 0000000000000000 [ 9954.908339] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000e4000 [ 9954.909705] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15: 000055e4d1f13ae8 [ 9954.911129] INFO: task fio:1055504 blocked for more than 120 seconds. [ 9954.912381] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.913978] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.915434] task:fio state:D stack:0 pid:1055504 ppid:1055493 flags:0x00000080 [ 9954.917082] Call Trace: [ 9954.917773] __schedule+0x2d1/0x870 [ 9954.918648] ? 
zilog_dirty+0x4f/0xc0 [zfs] [ 9954.919831] schedule+0x55/0xf0 [ 9954.920717] cv_wait_common+0x16d/0x280 [spl] [ 9954.921704] ? finish_wait+0x80/0x80 [ 9954.922639] zfs_rangelock_enter_writer+0x46/0x1c0 [zfs] [ 9954.923940] zfs_rangelock_enter_impl+0x12a/0x1b0 [zfs] [ 9954.925306] zfs_write+0x703/0xe20 [zfs] [ 9954.926406] zpl_iter_write_buffered+0xb2/0x120 [zfs] [ 9954.927687] ? rrw_exit+0xc6/0x200 [zfs] [ 9954.928821] zpl_iter_write+0xbe/0x110 [zfs] [ 9954.930028] new_sync_write+0x112/0x160 [ 9954.930913] vfs_write+0xa5/0x1b0 [ 9954.931758] ksys_write+0x4f/0xb0 [ 9954.932666] do_syscall_64+0x5b/0x1b0 [ 9954.933544] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 9954.934689] RIP: 0033:0x7fcaee8f0a47 [ 9954.935551] Code: Unable to access opcode bytes at RIP 0x7fcaee8f0a1d. [ 9954.936893] RSP: 002b:00007fff56b2c240 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 9954.938327] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fcaee8f0a47 [ 9954.939777] RDX: 000000000001d000 RSI: 00007fca8300b010 RDI: 0000000000000006 [ 9954.941187] RBP: 00007fca8300b010 R08: 0000000000000000 R09: 0000000000000000 [ 9954.942655] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000001d000 [ 9954.944062] R13: 0000557a2006bac0 R14: 000000000001d000 R15: 0000557a2006bae8 [ 9954.945525] INFO: task fio:1055505 blocked for more than 120 seconds. [ 9954.946819] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.948466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.949959] task:fio state:D stack:0 pid:1055505 ppid:1055493 flags:0x00004080 [ 9954.951653] Call Trace: [ 9954.952417] __schedule+0x2d1/0x870 [ 9954.953393] ? finish_wait+0x3e/0x80 [ 9954.954315] schedule+0x55/0xf0 [ 9954.955212] cv_wait_common+0x16d/0x280 [spl] [ 9954.956211] ? 
finish_wait+0x80/0x80 [ 9954.957159] zil_commit_waiter+0xfa/0x3b0 [zfs] [ 9954.958343] zil_commit_impl+0x6d/0xd0 [zfs] [ 9954.959524] zfs_fsync+0x66/0x90 [zfs] [ 9954.960626] zpl_fsync+0xe5/0x140 [zfs] [ 9954.961763] do_fsync+0x38/0x70 [ 9954.962638] __x64_sys_fsync+0x10/0x20 [ 9954.963520] do_syscall_64+0x5b/0x1b0 [ 9954.964470] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 9954.965567] RIP: 0033:0x7fcaee8f1057 [ 9954.966490] Code: Unable to access opcode bytes at RIP 0x7fcaee8f102d. [ 9954.967752] RSP: 002b:00007fff56b2c230 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [ 9954.969260] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fcaee8f1057 [ 9954.970628] RDX: 0000000000000000 RSI: 0000557a2006bac0 RDI: 0000000000000005 [ 9954.972092] RBP: 00007fca84152a18 R08: 0000000000000000 R09: 0000000000000000 [ 9954.973484] R10: 0000000000035000 R11: 0000000000000293 R12: 0000000000000003 [ 9954.974958] R13: 0000557a2006bac0 R14: 0000000000000000 R15: 0000557a2006bae8 [10077.648150] INFO: task z_wr_int:1051869 blocked for more than 120 seconds. [10077.649541] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [10077.651116] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10077.652782] task:z_wr_int state:D stack:0 pid:1051869 ppid:2 flags:0x80004080 [10077.654420] Call Trace: [10077.655267] __schedule+0x2d1/0x870 [10077.656179] ? free_one_page+0x204/0x530 [10077.657192] schedule+0x55/0xf0 [10077.658004] cv_wait_common+0x16d/0x280 [spl] [10077.659018] ? finish_wait+0x80/0x80 [10077.660013] dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs] [10077.661396] dmu_write_direct_done+0x90/0x3b0 [zfs] [10077.662617] zio_done+0x373/0x1d50 [zfs] [10077.663783] zio_execute+0xee/0x210 [zfs] [10077.664921] taskq_thread+0x205/0x3f0 [spl] [10077.665982] ? wake_up_q+0x60/0x60 [10077.666842] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [10077.668295] ? taskq_lowest_id+0xc0/0xc0 [spl] [10077.669360] kthread+0x134/0x150 [10077.670191] ? 
set_kthread_struct+0x50/0x50 [10077.671209] ret_from_fork+0x35/0x40 [10077.672076] INFO: task txg_sync:1051894 blocked for more than 120 seconds. [10077.673467] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [10077.675112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10077.676612] task:txg_sync state:D stack:0 pid:1051894 ppid:2 flags:0x80004080 [10077.678288] Call Trace: [10077.679024] __schedule+0x2d1/0x870 [10077.679948] ? __wake_up_common+0x7a/0x190 [10077.681042] schedule+0x55/0xf0 [10077.681899] schedule_timeout+0x19f/0x320 [10077.682951] ? __next_timer_interrupt+0xf0/0xf0 [10077.684005] ? taskq_dispatch+0xab/0x280 [spl] [10077.685085] io_schedule_timeout+0x19/0x40 [10077.686080] __cv_timedwait_common+0x19e/0x2c0 [spl] [10077.687227] ? finish_wait+0x80/0x80 [10077.688123] __cv_timedwait_io+0x15/0x20 [spl] [10077.689206] zio_wait+0x1ad/0x4f0 [zfs] [10077.690300] dsl_pool_sync+0xcb/0x6c0 [zfs] [10077.691435] ? spa_errlog_sync+0x2f0/0x3d0 [zfs] [10077.692636] spa_sync_iterate_to_convergence+0xcb/0x310 [zfs] [10077.693997] spa_sync+0x362/0x8f0 [zfs] [10077.695112] txg_sync_thread+0x27a/0x3b0 [zfs] [10077.696239] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] [10077.697512] ? spl_assert.constprop.0+0x20/0x20 [spl] [10077.698639] thread_generic_wrapper+0x63/0x90 [spl] [10077.699687] kthread+0x134/0x150 [10077.700567] ? set_kthread_struct+0x50/0x50 [10077.701502] ret_from_fork+0x35/0x40 [10077.702430] INFO: task fio:1055501 blocked for more than 120 seconds. [10077.703697] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [10077.705309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10077.706780] task:fio state:D stack:0 pid:1055501 ppid:1055490 flags:0x00004080 [10077.708479] Call Trace: [10077.709231] __schedule+0x2d1/0x870 [10077.710190] ? dbuf_hold_copy+0xec/0x230 [zfs] [10077.711368] schedule+0x55/0xf0 [10077.712286] cv_wait_common+0x16d/0x280 [spl] [10077.713316] ? 
finish_wait+0x80/0x80 [10077.714262] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [10077.715566] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [10077.716878] zfs_get_data+0x566/0x810 [zfs] [10077.718032] zil_lwb_commit+0x194/0x3f0 [zfs] [10077.719234] zil_lwb_write_issue+0x68/0xb90 [zfs] [10077.720413] ? __list_add+0x12/0x30 [zfs] [10077.721525] ? __raw_spin_unlock+0x5/0x10 [zfs] [10077.722708] ? zil_alloc_lwb+0x217/0x360 [zfs] [10077.723931] zil_commit_waiter_timeout+0x1f3/0x570 [zfs] [10077.725273] zil_commit_waiter+0x1d2/0x3b0 [zfs] [10077.726438] zil_commit_impl+0x6d/0xd0 [zfs] [10077.727586] zfs_fsync+0x66/0x90 [zfs] [10077.728675] zpl_fsync+0xe5/0x140 [zfs] [10077.729755] do_fsync+0x38/0x70 [10077.730607] __x64_sys_fsync+0x10/0x20 [10077.731482] do_syscall_64+0x5b/0x1b0 [10077.732415] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [10077.733487] RIP: 0033:0x7eff236bb057 [10077.734399] Code: Unable to access opcode bytes at RIP 0x7eff236bb02d. [10077.735657] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [10077.737163] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007eff236bb057 [10077.738526] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI: 0000000000000006 [10077.739966] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09: 0000000000000000 [10077.741336] R10: 0000000000056000 R11: 0000000000000293 R12: 0000000000000003 [10077.742773] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15: 000055e4d1f13ae8 [10077.744168] INFO: task fio:1055502 blocked for more than 120 seconds. [10077.745505] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [10077.747073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10077.748642] task:fio state:D stack:0 pid:1055502 ppid:1055490 flags:0x00004080 [10077.750233] Call Trace: [10077.751011] __schedule+0x2d1/0x870 [10077.751915] schedule+0x55/0xf0 [10077.752811] schedule_timeout+0x19f/0x320 [10077.753762] ? 
__next_timer_interrupt+0xf0/0xf0 [10077.754824] io_schedule_timeout+0x19/0x40 [10077.755782] __cv_timedwait_common+0x19e/0x2c0 [spl] [10077.756922] ? finish_wait+0x80/0x80 [10077.757788] __cv_timedwait_io+0x15/0x20 [spl] [10077.758845] zio_wait+0x1ad/0x4f0 [zfs] [10077.759941] dmu_write_abd+0x174/0x1c0 [zfs] [10077.761144] dmu_write_uio_direct+0x79/0x100 [zfs] [10077.762327] dmu_write_uio_dnode+0xb2/0x320 [zfs] [10077.763523] dmu_write_uio_dbuf+0x47/0x60 [zfs] [10077.764749] zfs_write+0x581/0xe20 [zfs] [10077.765825] ? iov_iter_get_pages+0xe9/0x390 [10077.766842] ? trylock_page+0xd/0x20 [zfs] [10077.767956] ? __raw_spin_unlock+0x5/0x10 [zfs] [10077.769189] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [10077.770343] zpl_iter_write_direct+0xd4/0x170 [zfs] [10077.771570] ? rrw_exit+0xc6/0x200 [zfs] [10077.772674] zpl_iter_write+0xd5/0x110 [zfs] [10077.773834] new_sync_write+0x112/0x160 [10077.774805] vfs_write+0xa5/0x1b0 [10077.775634] ksys_write+0x4f/0xb0 [10077.776526] do_syscall_64+0x5b/0x1b0 [10077.777386] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [10077.778488] RIP: 0033:0x7eff236baa47 [10077.779339] Code: Unable to access opcode bytes at RIP 0x7eff236baa1d. [10077.780655] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [10077.782056] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007eff236baa47 [10077.783507] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI: 0000000000000005 [10077.784890] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09: 0000000000000000 [10077.786303] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000e4000 [10077.787637] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15: 000055e4d1f13ae8 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
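The four steps the commit message lists for dmu_buf_will_clone_or_dio() can be sketched roughly as below. All types and field names here are simplified stand-ins for illustration (the real OpenZFS dbuf carries dirty records on a list and destroys the ARC buffer via arc_buf_destroy()); the sketch only shows the ordering of the four steps.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in types for illustration only. */
typedef enum { DB_CACHED, DB_NOFILL } db_state_t;

typedef struct dmu_buf_impl {
    db_state_t db_state;
    void *db_buf;       /* ARC buffer attached to the dbuf */
    void *db_data;      /* data pointer into db_buf */
    int db_dirty_txg;   /* txg of the pending dirty record, 0 if none */
} dmu_buf_impl_t;

/*
 * Sketch of the sequence both dmu_write_direct() and dmu_brt_clone()
 * perform before overwriting a block out-of-band:
 *   1. undirty any record for the current txg,
 *   2. drop the ARC buffer (destroyed in the real code),
 *   3. clear the dbuf's data pointers,
 *   4. redirty with db_state = DB_NOFILL.
 */
static void
dmu_buf_will_clone_or_dio(dmu_buf_impl_t *db, int txg)
{
    if (db->db_dirty_txg == txg)    /* 1. undirty current-txg record */
        db->db_dirty_txg = 0;
    db->db_buf = NULL;              /* 2. drop the ARC buffer */
    db->db_data = NULL;             /* 3. clear data pointers */
    db->db_state = DB_NOFILL;       /* 4. redirty as NOFILL */
    db->db_dirty_txg = txg;
}
```

Because the ARC buffer is gone before the Direct I/O write is ever issued, dmu_write_direct_done() no longer has to sleep waiting for earlier TXGs, which is what removes the rangelock deadlock shown in the trace above.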
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 12, 2024
Varada (varada.kari@gmail.com) pointed out an issue with O_DIRECT reads with the following test case: dd if=/dev/urandom of=/local_zpool/file2 bs=512 count=79 truncate -s 40382 /local_zpool/file2 zpool export local_zpool zpool import -d ~/ local_zpool dd if=/local_zpool/file2 of=/dev/null bs=1M iflag=direct That led to the following panic: [ 307.769267] VERIFY(IS_P2ALIGNED(size, sizeof (uint32_t))) failed [ 307.782997] PANIC at zfs_fletcher.c:870:abd_fletcher_4_iter() [ 307.788743] Showing stack for process 9665 [ 307.792834] CPU: 47 PID: 9665 Comm: z_rd_int_5 Kdump: loaded Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [ 307.804298] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 10/08/2020 [ 307.811682] Call Trace: [ 307.814131] dump_stack+0x41/0x60 [ 307.817449] spl_panic+0xd0/0xe8 [spl] [ 307.821210] ? irq_work_queue+0x9/0x20 [ 307.824961] ? wake_up_klogd.part.30+0x30/0x40 [ 307.829407] ? vprintk_emit+0x125/0x250 [ 307.833246] ? printk+0x58/0x6f [ 307.836391] spl_assert.constprop.1+0x16/0x20 [zfs] [ 307.841438] abd_fletcher_4_iter+0x6c/0x101 [zfs] [ 307.846343] ? abd_fletcher_4_simd2scalar+0x83/0x83 [zfs] [ 307.851922] abd_iterate_func+0xb1/0x170 [zfs] [ 307.856533] abd_fletcher_4_impl+0x3f/0xa0 [zfs] [ 307.861334] abd_fletcher_4_native+0x52/0x70 [zfs] [ 307.866302] ? enqueue_entity+0xf1/0x6e0 [ 307.870226] ? select_idle_sibling+0x23/0x700 [ 307.874587] ? enqueue_task_fair+0x94/0x710 [ 307.878771] ? select_task_rq_fair+0x351/0x990 [ 307.883208] zio_checksum_error_impl+0xff/0x5f0 [zfs] [ 307.888435] ? abd_fletcher_4_impl+0xa0/0xa0 [zfs] [ 307.893401] ? spl_kmem_alloc_impl+0xce/0xf0 [spl] [ 307.898203] ? __wake_up_common+0x7a/0x190 [ 307.902300] ? __switch_to_asm+0x41/0x70 [ 307.906220] ? __switch_to_asm+0x35/0x70 [ 307.910145] ? __switch_to_asm+0x41/0x70 [ 307.914061] ? __switch_to_asm+0x35/0x70 [ 307.917980] ? __switch_to_asm+0x41/0x70 [ 307.921903] ? __switch_to_asm+0x35/0x70 [ 307.925821] ? 
__switch_to_asm+0x35/0x70 [ 307.929739] ? __switch_to_asm+0x41/0x70 [ 307.933658] ? __switch_to_asm+0x35/0x70 [ 307.937582] zio_checksum_error+0x47/0xc0 [zfs] [ 307.942288] raidz_checksum_verify+0x3a/0x70 [zfs] [ 307.947257] vdev_raidz_io_done+0x4b/0x160 [zfs] [ 307.952049] zio_vdev_io_done+0x7f/0x200 [zfs] [ 307.956669] zio_execute+0xee/0x210 [zfs] [ 307.960855] taskq_thread+0x203/0x420 [spl] [ 307.965048] ? wake_up_q+0x70/0x70 [ 307.968455] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [ 307.974807] ? taskq_lowest_id+0xc0/0xc0 [spl] [ 307.979260] kthread+0x10a/0x120 [ 307.982485] ? set_kthread_struct+0x40/0x40 [ 307.986670] ret_from_fork+0x35/0x40 The reason this was occurring was that the zpool export forced the initial O_DIRECT read to go down to disk. In this case the request was still valid, as bs=1M is page size aligned; however, the file length was not. So when issuing the O_DIRECT read, even after calling make_abd_for_dbuf() we had an extra page allocated in the original ABD along with the linear ABD attached at the end of the gang ABD from make_abd_for_dbuf(). This is an issue because our expectation with reads is that the block sizes being read are page aligned. When we do our check we are only checking the request but not the actual size of the data we may read, such as the entire file. In order to remedy this situation, I updated zfs_read() to attempt to read as much as it can using O_DIRECT based on whether the length is page aligned. Any additional bytes that are requested are then read into the ARC. This still stays with our semantics that I/O requests must be page size aligned. There is a bit of a drawback here if there is only a single block being read. In that case the block will be read twice: once using O_DIRECT and then using buffered I/O to fill in the remaining data for the user's request. However, this should not be a big issue most of the time. 
In the normal case a user may ask for a lot of data from a file and only the stray bytes at the end of the file will have to be read using the ARC. In order to make sure this case was completely covered, I added a new ZTS test case, dio_unaligned_filesize, to test this out. The main thing with that test case is the first O_DIRECT read will issue three reads, two being O_DIRECT and the third being buffered for the remaining requested bytes. As part of this commit, I also updated stride_dd to take an additional parameter, -e, which says to read the entire input file and ignore the count (-c) option. We need to use stride_dd for FreeBSD, as dd does not make sure the buffer is page aligned. This update to stride_dd allows us to use it to test out this case in dio_unaligned_filesize for both Linux and FreeBSD. While this may not be the most elegant solution, it does stick with the semantics and still reads all the data the user requested. I am fine with revisiting this; maybe we just return a short read? Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
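The split the commit describes, reading the page-aligned prefix with O_DIRECT and the unaligned tail through the ARC, comes down to simple alignment arithmetic. A sketch, assuming a 4 KiB page (PAGE_SIZE_ASSUMED is a stand-in name, not the kernel macro):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_ASSUMED 4096UL  /* stand-in; the kernel uses PAGE_SIZE */

/* Bytes of the request that can be serviced with O_DIRECT. */
static size_t
dio_portion(size_t nbytes)
{
    return (nbytes & ~(PAGE_SIZE_ASSUMED - 1));
}

/* Remaining unaligned tail, read through the ARC instead. */
static size_t
arc_tail(size_t nbytes)
{
    return (nbytes - dio_portion(nbytes));
}
```

For the 40382-byte file from the reproducer, the first 36864 bytes (nine pages) would go down the O_DIRECT path and the trailing 3518 bytes would be read through the ARC, which matches the "stray bytes at the end of the file" case described above.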
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 12, 2024
We were using the generic Linux calls to make sure that the page cache was cleaned out before issuing any Direct I/O reads or writes. However, this only matters in the event the file region being written/read using O_DIRECT was mmap'ed. One of the stipulations with O_DIRECT is that it is redirected through the ARC in the event the file range is mmap'ed. Because of this, it did not make sense to try and invalidate the page cache if we were never intending O_DIRECT to work with mmap'ed regions. Also, calls into the generic Linux calls in writes would often lead to lockups, as the page lock is dropped in zfs_putpage(). See the stack dump below. In order to prevent this, we will no longer use the generic Linux direct I/O wrappers or try to flush out the page cache. Instead, if we find the file range has been mmap'ed in since the initial check in zfs_setup_direct(), we will now directly handle that in zfs_read() and zfs_write(). In most cases zfs_setup_direct() will prevent O_DIRECT to mmap'ed regions of the file that have been page faulted in, but if that happens while we are issuing the direct I/O request, the normal parts of the ZFS paths will be taken to account for it. It is highly suggested not to mmap a region of a file and then write or read directly to that file. In general, that is kind of an insane thing to do... However, we try our best to still have consistency with the ARC. Also, before making this decision I did explore whether we could just add a rangelock in zfs_fillpage(), but we cannot do that. The reason is that when the page is in zfs_readpage_common() it has already been locked by the kernel. So, if we try to grab the rangelock anywhere in that path we can get stuck if another thread is issuing writes to the file region that was mmap'ed in. The reason is that update_pages() holds the rangelock and then tries to lock the page, while zfs_fillpage() holds the page lock and is stuck waiting on the rangelock. 
Deadlock is unavoidable in this case. [260136.244332] INFO: task fio:3791107 blocked for more than 120 seconds. [260136.250867] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260136.258693] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260136.266607] task:fio state:D stack: 0 pid:3791107 ppid:3790841 flags:0x00004080 [260136.275306] Call Trace: [260136.277845] __schedule+0x2d1/0x830 [260136.281432] schedule+0x35/0xa0 [260136.284665] io_schedule+0x12/0x40 [260136.288157] wait_on_page_bit+0x123/0x220 [260136.292258] ? xas_load+0x8/0x80 [260136.295577] ? file_fdatawait_range+0x20/0x20 [260136.300024] filemap_page_mkwrite+0x9b/0xb0 [260136.304295] do_page_mkwrite+0x53/0x90 [260136.308135] ? vm_normal_page+0x1a/0xc0 [260136.312062] do_wp_page+0x298/0x350 [260136.315640] __handle_mm_fault+0x44f/0x6c0 [260136.319826] ? __switch_to_asm+0x41/0x70 [260136.323839] handle_mm_fault+0xc1/0x1e0 [260136.327766] do_user_addr_fault+0x1b5/0x440 [260136.332038] do_page_fault+0x37/0x130 [260136.335792] ? page_fault+0x8/0x30 [260136.339284] page_fault+0x1e/0x30 [260136.342689] RIP: 0033:0x7f6deee7f1b4 [260136.346361] Code: Unable to access opcode bytes at RIP 0x7f6deee7f18a. [260136.352977] RSP: 002b:00007fffe41b6538 EFLAGS: 00010202 [260136.358288] RAX: 00007f6d83049000 RBX: 0000556b63614ec0 RCX: 00007f6d83148fe0 [260136.365508] RDX: 00000000000acfe0 RSI: 00007f6d84e9c030 RDI: 00007f6d8309bfa0 [260136.372730] RBP: 00007f6d84f4a000 R08: ffffffffffffffe0 R09: 0000000000000000 [260136.379946] R10: 00007f6d84f8e810 R11: 00007f6d83049000 R12: 0000000000000001 [260136.387167] R13: 0000556b63614ec0 R14: 0000000000100000 R15: 0000556b63614ee8 [260136.394387] INFO: task fio:3791108 blocked for more than 120 seconds. [260136.400911] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260136.408739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[260136.416651] task:fio state:D stack: 0 pid:3791108 ppid:3790835 flags:0x00004080 [260136.425343] Call Trace: [260136.427883] __schedule+0x2d1/0x830 [260136.431463] ? cv_wait_common+0x12d/0x240 [spl] [260136.436091] schedule+0x35/0xa0 [260136.439321] io_schedule+0x12/0x40 [260136.442814] __lock_page+0x12d/0x230 [260136.446483] ? file_fdatawait_range+0x20/0x20 [260136.450929] zfs_putpage+0x148/0x590 [zfs] [260136.455322] ? rmap_walk_file+0x116/0x290 [260136.459421] ? __mod_memcg_lruvec_state+0x5d/0x160 [260136.464300] zpl_putpage+0x67/0xd0 [zfs] [260136.468495] write_cache_pages+0x197/0x420 [260136.472679] ? zpl_readpage_filler+0x10/0x10 [zfs] [260136.477732] zpl_writepages+0x119/0x130 [zfs] [260136.482352] do_writepages+0xc2/0x1c0 [260136.486103] ? flush_tlb_func_common.constprop.9+0x125/0x220 [260136.491850] __filemap_fdatawrite_range+0xc7/0x100 [260136.496732] filemap_write_and_wait_range+0x30/0x80 [260136.501695] generic_file_direct_write+0x120/0x160 [260136.506575] ? rrw_exit+0xb0/0x1c0 [zfs] [260136.510779] zpl_iter_write+0xdd/0x160 [zfs] [260136.515323] new_sync_write+0x112/0x160 [260136.519255] vfs_write+0xa5/0x1a0 [260136.522662] ksys_write+0x4f/0xb0 [260136.526067] do_syscall_64+0x5b/0x1a0 [260136.529818] entry_SYSCALL_64_after_hwframe+0x65/0xca [260136.534959] RIP: 0033:0x7f9d192c7a17 [260136.538625] Code: Unable to access opcode bytes at RIP 0x7f9d192c79ed. [260136.545236] RSP: 002b:00007ffc8e4ba270 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [260136.552889] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f9d192c7a17 [260136.560108] RDX: 0000000000100000 RSI: 00007f9caea03000 RDI: 0000000000000005 [260136.567329] RBP: 00007f9caea03000 R08: 0000000000000000 R09: 0000000000000000 [260136.574548] R10: 00005558e8975680 R11: 0000000000000293 R12: 0000000000100000 [260136.581767] R13: 00005558e8985ec0 R14: 0000000000100000 R15: 00005558e8985ee8 [260136.588989] INFO: task fio:3791109 blocked for more than 120 seconds. 
[260136.595513] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260136.603337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260136.611250] task:fio state:D stack: 0 pid:3791109 ppid:3790838 flags:0x00004080 [260136.619943] Call Trace: [260136.622483] __schedule+0x2d1/0x830 [260136.626064] ? zfs_znode_held+0xe6/0x140 [zfs] [260136.630777] schedule+0x35/0xa0 [260136.634009] cv_wait_common+0x153/0x240 [spl] [260136.638466] ? finish_wait+0x80/0x80 [260136.642129] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [260136.647712] zfs_rangelock_enter_impl+0xbf/0x170 [zfs] [260136.653121] zfs_get_data+0x113/0x770 [zfs] [260136.657567] zil_lwb_commit+0x537/0x780 [zfs] [260136.662187] zil_process_commit_list+0x14c/0x460 [zfs] [260136.667585] zil_commit_writer+0xeb/0x160 [zfs] [260136.672376] zil_commit_impl+0x5d/0xa0 [zfs] [260136.676910] zfs_putpage+0x516/0x590 [zfs] [260136.681279] zpl_putpage+0x67/0xd0 [zfs] [260136.685467] write_cache_pages+0x197/0x420 [260136.689649] ? zpl_readpage_filler+0x10/0x10 [zfs] [260136.694705] zpl_writepages+0x119/0x130 [zfs] [260136.699322] do_writepages+0xc2/0x1c0 [260136.703076] __filemap_fdatawrite_range+0xc7/0x100 [260136.707952] filemap_write_and_wait_range+0x30/0x80 [260136.712920] zpl_iter_read_direct+0x86/0x1b0 [zfs] [260136.717972] ? rrw_exit+0xb0/0x1c0 [zfs] [260136.722174] zpl_iter_read+0x90/0xb0 [zfs] [260136.726536] new_sync_read+0x10f/0x150 [260136.730376] vfs_read+0x91/0x140 [260136.733693] ksys_read+0x4f/0xb0 [260136.737012] do_syscall_64+0x5b/0x1a0 [260136.740764] entry_SYSCALL_64_after_hwframe+0x65/0xca [260136.745906] RIP: 0033:0x7f1bd4687ab4 [260136.749574] Code: Unable to access opcode bytes at RIP 0x7f1bd4687a8a. 
[260136.756181] RSP: 002b:00007fff63f65170 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [260136.763834] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f1bd4687ab4 [260136.771056] RDX: 0000000000100000 RSI: 00007f1b69dc3000 RDI: 0000000000000005 [260136.778274] RBP: 00007f1b69dc3000 R08: 0000000000000000 R09: 0000000000000000 [260136.785494] R10: 000000008fd0ea42 R11: 0000000000000246 R12: 0000000000100000 [260136.792714] R13: 000055ca4b405ec0 R14: 0000000000100000 R15: 000055ca4b405ee8 [260259.123003] INFO: task kworker/u128:0:3589938 blocked for more than 120 seconds. [260259.130487] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260259.138313] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260259.146224] task:kworker/u128:0 state:D stack: 0 pid:3589938 ppid: 2 flags:0x80004080 [260259.154832] Workqueue: writeback wb_workfn (flush-zfs-540) [260259.160411] Call Trace: [260259.162950] __schedule+0x2d1/0x830 [260259.166531] schedule+0x35/0xa0 [260259.169765] io_schedule+0x12/0x40 [260259.173257] __lock_page+0x12d/0x230 [260259.176921] ? file_fdatawait_range+0x20/0x20 [260259.181368] write_cache_pages+0x1f2/0x420 [260259.185554] ? zpl_readpage_filler+0x10/0x10 [zfs] [260259.190633] zpl_writepages+0x98/0x130 [zfs] [260259.195183] do_writepages+0xc2/0x1c0 [260259.198935] __writeback_single_inode+0x39/0x2f0 [260259.203640] writeback_sb_inodes+0x1e6/0x450 [260259.208002] __writeback_inodes_wb+0x5f/0xc0 [260259.212359] wb_writeback+0x247/0x2e0 [260259.216114] ? get_nr_inodes+0x35/0x50 [260259.219953] wb_workfn+0x37c/0x4d0 [260259.223443] ? __switch_to_asm+0x35/0x70 [260259.227456] ? __switch_to_asm+0x41/0x70 [260259.231469] ? __switch_to_asm+0x35/0x70 [260259.235481] ? __switch_to_asm+0x41/0x70 [260259.239495] ? __switch_to_asm+0x35/0x70 [260259.243505] ? __switch_to_asm+0x41/0x70 [260259.247518] ? __switch_to_asm+0x35/0x70 [260259.251533] ? 
__switch_to_asm+0x41/0x70 [260259.255545] process_one_work+0x1a7/0x360 [260259.259645] worker_thread+0x30/0x390 [260259.263396] ? create_worker+0x1a0/0x1a0 [260259.267409] kthread+0x10a/0x120 [260259.270730] ? set_kthread_struct+0x40/0x40 [260259.275003] ret_from_fork+0x35/0x40 [260259.278712] INFO: task fio:3791107 blocked for more than 120 seconds. [260259.285240] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260259.293064] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260259.300976] task:fio state:D stack: 0 pid:3791107 ppid:3790841 flags:0x00004080 [260259.309668] Call Trace: [260259.312210] __schedule+0x2d1/0x830 [260259.315787] schedule+0x35/0xa0 [260259.319020] io_schedule+0x12/0x40 [260259.322511] wait_on_page_bit+0x123/0x220 [260259.326611] ? xas_load+0x8/0x80 [260259.329930] ? file_fdatawait_range+0x20/0x20 [260259.334376] filemap_page_mkwrite+0x9b/0xb0 [260259.338650] do_page_mkwrite+0x53/0x90 [260259.342489] ? vm_normal_page+0x1a/0xc0 [260259.346415] do_wp_page+0x298/0x350 [260259.349994] __handle_mm_fault+0x44f/0x6c0 [260259.354181] ? __switch_to_asm+0x41/0x70 [260259.358193] handle_mm_fault+0xc1/0x1e0 [260259.362117] do_user_addr_fault+0x1b5/0x440 [260259.366391] do_page_fault+0x37/0x130 [260259.370145] ? page_fault+0x8/0x30 [260259.373639] page_fault+0x1e/0x30 [260259.377043] RIP: 0033:0x7f6deee7f1b4 [260259.380714] Code: Unable to access opcode bytes at RIP 0x7f6deee7f18a. [260259.387323] RSP: 002b:00007fffe41b6538 EFLAGS: 00010202 [260259.392633] RAX: 00007f6d83049000 RBX: 0000556b63614ec0 RCX: 00007f6d83148fe0 [260259.399853] RDX: 00000000000acfe0 RSI: 00007f6d84e9c030 RDI: 00007f6d8309bfa0 [260259.407074] RBP: 00007f6d84f4a000 R08: ffffffffffffffe0 R09: 0000000000000000 [260259.414291] R10: 00007f6d84f8e810 R11: 00007f6d83049000 R12: 0000000000000001 [260259.421512] R13: 0000556b63614ec0 R14: 0000000000100000 R15: 0000556b63614ee8 [260259.428733] INFO: task fio:3791108 blocked for more than 120 seconds. 
[260259.435258] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260259.443085] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260259.450997] task:fio state:D stack: 0 pid:3791108 ppid:3790835 flags:0x00004080 [260259.459689] Call Trace: [260259.462228] __schedule+0x2d1/0x830 [260259.465808] ? cv_wait_common+0x12d/0x240 [spl] [260259.470435] schedule+0x35/0xa0 [260259.473669] io_schedule+0x12/0x40 [260259.477161] __lock_page+0x12d/0x230 [260259.480828] ? file_fdatawait_range+0x20/0x20 [260259.485274] zfs_putpage+0x148/0x590 [zfs] [260259.489640] ? rmap_walk_file+0x116/0x290 [260259.493742] ? __mod_memcg_lruvec_state+0x5d/0x160 [260259.498619] zpl_putpage+0x67/0xd0 [zfs] [260259.502813] write_cache_pages+0x197/0x420 [260259.506998] ? zpl_readpage_filler+0x10/0x10 [zfs] [260259.512054] zpl_writepages+0x119/0x130 [zfs] [260259.516672] do_writepages+0xc2/0x1c0 [260259.520423] ? flush_tlb_func_common.constprop.9+0x125/0x220 [260259.526170] __filemap_fdatawrite_range+0xc7/0x100 [260259.531050] filemap_write_and_wait_range+0x30/0x80 [260259.536016] generic_file_direct_write+0x120/0x160 [260259.540896] ? rrw_exit+0xb0/0x1c0 [zfs] [260259.545099] zpl_iter_write+0xdd/0x160 [zfs] [260259.549639] new_sync_write+0x112/0x160 [260259.553566] vfs_write+0xa5/0x1a0 [260259.556971] ksys_write+0x4f/0xb0 [260259.560379] do_syscall_64+0x5b/0x1a0 [260259.564131] entry_SYSCALL_64_after_hwframe+0x65/0xca [260259.569269] RIP: 0033:0x7f9d192c7a17 [260259.572935] Code: Unable to access opcode bytes at RIP 0x7f9d192c79ed. 
[260259.579549] RSP: 002b:00007ffc8e4ba270 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [260259.587200] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f9d192c7a17 [260259.594419] RDX: 0000000000100000 RSI: 00007f9caea03000 RDI: 0000000000000005 [260259.601639] RBP: 00007f9caea03000 R08: 0000000000000000 R09: 0000000000000000 [260259.608859] R10: 00005558e8975680 R11: 0000000000000293 R12: 0000000000100000 [260259.616078] R13: 00005558e8985ec0 R14: 0000000000100000 R15: 00005558e8985ee8 [260259.623298] INFO: task fio:3791109 blocked for more than 120 seconds. [260259.629827] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260259.637650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260259.645564] task:fio state:D stack: 0 pid:3791109 ppid:3790838 flags:0x00004080 [260259.654254] Call Trace: [260259.656794] __schedule+0x2d1/0x830 [260259.660373] ? zfs_znode_held+0xe6/0x140 [zfs] [260259.665081] schedule+0x35/0xa0 [260259.668313] cv_wait_common+0x153/0x240 [spl] [260259.672768] ? finish_wait+0x80/0x80 [260259.676441] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [260259.682026] zfs_rangelock_enter_impl+0xbf/0x170 [zfs] [260259.687432] zfs_get_data+0x113/0x770 [zfs] [260259.691876] zil_lwb_commit+0x537/0x780 [zfs] [260259.696497] zil_process_commit_list+0x14c/0x460 [zfs] [260259.701895] zil_commit_writer+0xeb/0x160 [zfs] [260259.706689] zil_commit_impl+0x5d/0xa0 [zfs] [260259.711228] zfs_putpage+0x516/0x590 [zfs] [260259.715589] zpl_putpage+0x67/0xd0 [zfs] [260259.719775] write_cache_pages+0x197/0x420 [260259.723959] ? zpl_readpage_filler+0x10/0x10 [zfs] [260259.729013] zpl_writepages+0x119/0x130 [zfs] [260259.733632] do_writepages+0xc2/0x1c0 [260259.737384] __filemap_fdatawrite_range+0xc7/0x100 [260259.742264] filemap_write_and_wait_range+0x30/0x80 [260259.747229] zpl_iter_read_direct+0x86/0x1b0 [zfs] [260259.752286] ? 
rrw_exit+0xb0/0x1c0 [zfs] [260259.756487] zpl_iter_read+0x90/0xb0 [zfs] [260259.760855] new_sync_read+0x10f/0x150 [260259.764696] vfs_read+0x91/0x140 [260259.768013] ksys_read+0x4f/0xb0 [260259.771332] do_syscall_64+0x5b/0x1a0 [260259.775087] entry_SYSCALL_64_after_hwframe+0x65/0xca [260259.780225] RIP: 0033:0x7f1bd4687ab4 [260259.783893] Code: Unable to access opcode bytes at RIP 0x7f1bd4687a8a. [260259.790503] RSP: 002b:00007fff63f65170 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [260259.798157] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f1bd4687ab4 [260259.805377] RDX: 0000000000100000 RSI: 00007f1b69dc3000 RDI: 0000000000000005 [260259.812592] RBP: 00007f1b69dc3000 R08: 0000000000000000 R09: 0000000000000000 [260259.819814] R10: 000000008fd0ea42 R11: 0000000000000246 R12: 0000000000100000 [260259.827032] R13: 000055ca4b405ec0 R14: 0000000000100000 R15: 000055ca4b405ee8 [260382.001731] INFO: task kworker/u128:0:3589938 blocked for more than 120 seconds. [260382.009227] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260382.017053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260382.024963] task:kworker/u128:0 state:D stack: 0 pid:3589938 ppid: 2 flags:0x80004080 [260382.033568] Workqueue: writeback wb_workfn (flush-zfs-540) [260382.039141] Call Trace: [260382.041683] __schedule+0x2d1/0x830 [260382.045271] schedule+0x35/0xa0 [260382.048503] io_schedule+0x12/0x40 [260382.051994] __lock_page+0x12d/0x230 [260382.055662] ? file_fdatawait_range+0x20/0x20 [260382.060107] write_cache_pages+0x1f2/0x420 [260382.064293] ? zpl_readpage_filler+0x10/0x10 [zfs] [260382.069379] zpl_writepages+0x98/0x130 [zfs] [260382.073919] do_writepages+0xc2/0x1c0 [260382.077672] __writeback_single_inode+0x39/0x2f0 [260382.082379] writeback_sb_inodes+0x1e6/0x450 [260382.086738] __writeback_inodes_wb+0x5f/0xc0 [260382.091097] wb_writeback+0x247/0x2e0 [260382.094850] ? 
get_nr_inodes+0x35/0x50 [260382.098689] wb_workfn+0x37c/0x4d0 [260382.102181] ? __switch_to_asm+0x35/0x70 [260382.106194] ? __switch_to_asm+0x41/0x70 [260382.110207] ? __switch_to_asm+0x35/0x70 [260382.114221] ? __switch_to_asm+0x41/0x70 [260382.118231] ? __switch_to_asm+0x35/0x70 [260382.122244] ? __switch_to_asm+0x41/0x70 [260382.126256] ? __switch_to_asm+0x35/0x70 [260382.130273] ? __switch_to_asm+0x41/0x70 [260382.134284] process_one_work+0x1a7/0x360 [260382.138384] worker_thread+0x30/0x390 [260382.142136] ? create_worker+0x1a0/0x1a0 [260382.146150] kthread+0x10a/0x120 [260382.149469] ? set_kthread_struct+0x40/0x40 [260382.153741] ret_from_fork+0x35/0x40 [260382.157448] INFO: task fio:3791107 blocked for more than 120 seconds. [260382.163977] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260382.171802] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260382.179715] task:fio state:D stack: 0 pid:3791107 ppid:3790841 flags:0x00004080 [260382.188409] Call Trace: [260382.190945] __schedule+0x2d1/0x830 [260382.194527] schedule+0x35/0xa0 [260382.197757] io_schedule+0x12/0x40 [260382.201249] wait_on_page_bit+0x123/0x220 [260382.205350] ? xas_load+0x8/0x80 [260382.208668] ? file_fdatawait_range+0x20/0x20 [260382.213114] filemap_page_mkwrite+0x9b/0xb0 [260382.217386] do_page_mkwrite+0x53/0x90 [260382.221227] ? vm_normal_page+0x1a/0xc0 [260382.225152] do_wp_page+0x298/0x350 [260382.228733] __handle_mm_fault+0x44f/0x6c0 [260382.232919] ? __switch_to_asm+0x41/0x70 [260382.236930] handle_mm_fault+0xc1/0x1e0 [260382.240856] do_user_addr_fault+0x1b5/0x440 [260382.245132] do_page_fault+0x37/0x130 [260382.248883] ? page_fault+0x8/0x30 [260382.252375] page_fault+0x1e/0x30 [260382.255781] RIP: 0033:0x7f6deee7f1b4 [260382.259451] Code: Unable to access opcode bytes at RIP 0x7f6deee7f18a. 
[260382.266059] RSP: 002b:00007fffe41b6538 EFLAGS: 00010202 [260382.271373] RAX: 00007f6d83049000 RBX: 0000556b63614ec0 RCX: 00007f6d83148fe0 [260382.278591] RDX: 00000000000acfe0 RSI: 00007f6d84e9c030 RDI: 00007f6d8309bfa0 [260382.285813] RBP: 00007f6d84f4a000 R08: ffffffffffffffe0 R09: 0000000000000000 [260382.293030] R10: 00007f6d84f8e810 R11: 00007f6d83049000 R12: 0000000000000001 [260382.300249] R13: 0000556b63614ec0 R14: 0000000000100000 R15: 0000556b63614ee8 [260382.307472] INFO: task fio:3791108 blocked for more than 120 seconds. [260382.313997] Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [260382.321823] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [260382.329734] task:fio state:D stack: 0 pid:3791108 ppid:3790835 flags:0x00004080 [260382.338427] Call Trace: [260382.340967] __schedule+0x2d1/0x830 [260382.344547] ? cv_wait_common+0x12d/0x240 [spl] [260382.349173] schedule+0x35/0xa0 [260382.352406] io_schedule+0x12/0x40 [260382.355899] __lock_page+0x12d/0x230 [260382.359563] ? file_fdatawait_range+0x20/0x20 [260382.364010] zfs_putpage+0x148/0x590 [zfs] [260382.368379] ? rmap_walk_file+0x116/0x290 [260382.372479] ? __mod_memcg_lruvec_state+0x5d/0x160 [260382.377358] zpl_putpage+0x67/0xd0 [zfs] [260382.381552] write_cache_pages+0x197/0x420 [260382.385739] ? zpl_readpage_filler+0x10/0x10 [zfs] [260382.390791] zpl_writepages+0x119/0x130 [zfs] [260382.395410] do_writepages+0xc2/0x1c0 [260382.399161] ? flush_tlb_func_common.constprop.9+0x125/0x220 [260382.404907] __filemap_fdatawrite_range+0xc7/0x100 [260382.409790] filemap_write_and_wait_range+0x30/0x80 [260382.414752] generic_file_direct_write+0x120/0x160 [260382.419632] ? 
rrw_exit+0xb0/0x1c0 [zfs] [260382.423838] zpl_iter_write+0xdd/0x160 [zfs] [260382.428379] new_sync_write+0x112/0x160 [260382.432304] vfs_write+0xa5/0x1a0 [260382.435711] ksys_write+0x4f/0xb0 [260382.439115] do_syscall_64+0x5b/0x1a0 [260382.442866] entry_SYSCALL_64_after_hwframe+0x65/0xca [260382.448007] RIP: 0033:0x7f9d192c7a17 [260382.451675] Code: Unable to access opcode bytes at RIP 0x7f9d192c79ed. [260382.458286] RSP: 002b:00007ffc8e4ba270 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [260382.465938] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f9d192c7a17 [260382.473158] RDX: 0000000000100000 RSI: 00007f9caea03000 RDI: 0000000000000005 [260382.480379] RBP: 00007f9caea03000 R08: 0000000000000000 R09: 0000000000000000 [260382.487597] R10: 00005558e8975680 R11: 0000000000000293 R12: 0000000000100000 [260382.494814] R13: 00005558e8985ec0 R14: 0000000000100000 R15: 00005558e8985ee8 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue on Sep 12, 2024
In commit ba30ec9 I got a little overzealous in code cleanup. While I was trying to remove all the stable page code for Linux, I misinterpreted why Brian Behlendorf originally had the try rangelock, drop page lock, and acquire rangelock sequence in zfs_fillpage(). This is still necessary even without stable pages. It has to occur to avoid a race condition between direct IO writes and pages being faulted in for mmap files. If the rangelock is not held, then a direct IO write can set db->db_data = NULL either in:

1. dmu_write_direct() -> dmu_buf_will_not_fill() -> dmu_buf_will_fill() -> dbuf_noread() -> dbuf_clear_data()
2. dmu_write_direct_done()

Without the rangelock this can then cause a panic, as dmu_read_impl() can get a NULL pointer for db->db_data when trying to do the memcpy. So the rangelock must be held in zfs_fillpage() no matter what.

There are further semantics on when the rangelock should be taken in zfs_fillpage(). It must only be taken in the zfs_getpage() -> zfs_fillpage() path. The reason is that mappedread() can call zfs_fillpage() if the page is not uptodate. This can happen because filemap_fault() will first add the pages to the inode's address_space mapping and then drop the page lock, leaving open a window where mappedread() can be called. Since this can occur, mappedread() will hold both the page lock and the rangelock. This is perfectly valid and correct. However, in that case zfs_fillpage() must never grab the rangelock itself; if it did, a deadlock would occur.

Finally, it is important to note that the rangelock is first attempted with zfs_rangelock_tryenter(). The reason is that the page lock must be dropped in order to block on the rangelock here. Otherwise there is a race between zfs_fillpage() and zfs_write() -> update_pages(): in update_pages() the rangelock is already held, and it then grabs the page lock.
So if the page lock is not dropped before acquiring the rangelock in zfs_fillpage() there can be a deadlock. Below is a stack trace showing the NULL pointer dereference that was occuring with the dio_mmap ZTS test case before this commit. [ 7737.430658] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [ 7737.438486] PGD 0 P4D 0 [ 7737.441024] Oops: 0000 [openzfs#1] SMP NOPTI [ 7737.444692] CPU: 33 PID: 599346 Comm: fio Kdump: loaded Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [ 7737.455721] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 10/08/2020 [ 7737.463106] RIP: 0010:__memcpy+0x12/0x20 [ 7737.467032] Code: ff 0f 31 48 c1 e2 20 48 09 c2 48 31 d3 e9 79 ff ff ff 90 90 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 a4 [ 7737.485770] RSP: 0000:ffffc1db829e3b60 EFLAGS: 00010246 [ 7737.490987] RAX: ffff9ef195b6f000 RBX: 0000000000001000 RCX: 0000000000000200 [ 7737.498111] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ef195b6f000 [ 7737.505235] RBP: ffff9ef195b70000 R08: ffff9eef1d1d0000 R09: ffff9eef1d1d0000 [ 7737.512358] R10: ffff9eef27968218 R11: 0000000000000000 R12: 0000000000000000 [ 7737.519481] R13: ffff9ef041adb6d8 R14: 00000000005e1000 R15: 0000000000000001 [ 7737.526607] FS: 00007f77fe2bae80(0000) GS:ffff9f0cae840000(0000) knlGS:0000000000000000 [ 7737.534683] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7737.540423] CR2: 0000000000000000 CR3: 00000003387a6000 CR4: 0000000000350ee0 [ 7737.547553] Call Trace: [ 7737.550003] dmu_read_impl+0x11a/0x210 [zfs] [ 7737.554464] dmu_read+0x56/0x90 [zfs] [ 7737.558292] zfs_fillpage+0x76/0x190 [zfs] [ 7737.562584] zfs_getpage+0x4c/0x80 [zfs] [ 7737.566691] zpl_readpage_common+0x3b/0x80 [zfs] [ 7737.571485] filemap_fault+0x5d6/0xa10 [ 7737.575236] ? obj_cgroup_charge_pages+0xba/0xd0 [ 7737.579856] ? xas_load+0x8/0x80 [ 7737.583088] ? xas_find+0x173/0x1b0 [ 7737.586579] ? 
filemap_map_pages+0x84/0x410 [ 7737.590759] __do_fault+0x38/0xb0 [ 7737.594077] handle_pte_fault+0x559/0x870 [ 7737.598082] __handle_mm_fault+0x44f/0x6c0 [ 7737.602181] handle_mm_fault+0xc1/0x1e0 [ 7737.606019] do_user_addr_fault+0x1b5/0x440 [ 7737.610207] do_page_fault+0x37/0x130 [ 7737.613873] ? page_fault+0x8/0x30 [ 7737.617277] page_fault+0x1e/0x30 [ 7737.620589] RIP: 0033:0x7f77fbce9140 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue on Sep 12, 2024
It is important to hold the dbuf mutex (db_mtx) when creating ZIOs in dmu_read_abd(). The BP returned by dmu_buf_get_bp_from_dbuf() may come from a previous direct IO write. In this case, it is attached to a dirty record in the dbuf. When zio_read() is called, a copy of the BP is made through io_bp_copy to io_bp in zio_create(). Without holding the db_mtx, though, the dirty record may be freed in dbuf_read_done(). This can result in garbage being placed in the BP for the ZIO created through zio_read(). By holding the db_mtx, this race can be avoided. Below is a stack trace of the issue that was occurring in vdev_mirror_child_select() when the ZIO was created without holding the db_mtx. [29882.427056] VERIFY(zio->io_bp == NULL || BP_PHYSICAL_BIRTH(zio->io_bp) == txg) failed [29882.434884] PANIC at vdev_mirror.c:545:vdev_mirror_child_select() [29882.440976] Showing stack for process 1865540 [29882.445336] CPU: 57 PID: 1865540 Comm: fio Kdump: loaded Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [29882.456457] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 10/08/2020 [29882.463844] Call Trace: [29882.466296] dump_stack+0x41/0x60 [29882.469618] spl_panic+0xd0/0xe8 [spl] [29882.473376] ? __dprintf+0x10e/0x180 [zfs] [29882.477674] ? kfree+0xd3/0x250 [29882.480819] ? __dprintf+0x10e/0x180 [zfs] [29882.485103] ? vdev_mirror_map_alloc+0x29/0x50 [zfs] [29882.490250] ? vdev_lookup_top+0x20/0x90 [zfs] [29882.494878] spl_assert+0x17/0x20 [zfs] [29882.498893] vdev_mirror_child_select+0x279/0x300 [zfs] [29882.504289] vdev_mirror_io_start+0x11f/0x2b0 [zfs] [29882.509336] zio_vdev_io_start+0x3ee/0x520 [zfs] [29882.514137] zio_nowait+0x105/0x290 [zfs] [29882.518330] dmu_read_abd+0x196/0x460 [zfs] [29882.522691] dmu_read_uio_direct+0x6d/0xf0 [zfs] [29882.527472] dmu_read_uio_dnode+0x12a/0x140 [zfs] [29882.532345] dmu_read_uio_dbuf+0x3f/0x60 [zfs] [29882.536953] zfs_read+0x238/0x3f0 [zfs] [29882.540976] zpl_iter_read_direct+0xe0/0x180 [zfs] [29882.545952] ?
rrw_exit+0xc6/0x200 [zfs] [29882.550058] zpl_iter_read+0x90/0xb0 [zfs] [29882.554340] new_sync_read+0x10f/0x150 [29882.558094] vfs_read+0x91/0x140 [29882.561325] ksys_read+0x4f/0xb0 [29882.564557] do_syscall_64+0x5b/0x1a0 [29882.568222] entry_SYSCALL_64_after_hwframe+0x65/0xca [29882.573267] RIP: 0033:0x7f7fe0fa6ab4 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue on Sep 12, 2024
There existed a race condition that was discovered through the dio_random test. When doing fio with --fsync=32, fsync is called on the file after every 32 writes. When this happens, blocks committed to the ZIL will be synced out. However, the code for the O_DIRECT write was updated in 31983d2 to always wait, if there was an associated ARC buf with the dbuf, for all previous TXGs to sync out. There was an oversight with this update. While waiting on previous TXGs to sync out, the O_DIRECT write holds the rangelock as a writer the entire time. This causes an issue when the ZIL commit writes out blocks through `zfs_get_data()`, because that path will try to grab the rangelock as a reader. This leads to a deadlock.

In order to fix this race condition, I updated the `dmu_buf_impl_t` struct to contain a uint8_t variable that is used to signal whether the dbuf attached to an O_DIRECT write is in the wait hold because of mixed direct and buffered data. Using this new `db_mixed_io_dio_wait` variable in the `dmu_buf_impl_t`, the code in `zfs_get_data()` can tell that the rangelock is already being held across the entire block and there is no need to grab the rangelock at all. Because the rangelock is already held as a writer across the entire block, no modifications can take place against the block as long as the O_DIRECT write is stalled waiting in `dmu_buf_direct_mixed_io_wait()`.

Also as part of this update, I realized the `db_state` in `dmu_buf_direct_mixed_io_wait()` needs to be changed temporarily to `DB_CACHED`. This is necessary so the logic in `dbuf_read()` is correct if `dmu_sync_late_arrival()` is called by `dmu_sync()`. It is completely valid to switch the `db_state` back to `DB_CACHED`, as there is still an associated ARC buf that will not be freed until our O_DIRECT write is completed, which will only happen after it leaves `dmu_buf_direct_mixed_io_wait()`.
Here is the stack trace of the deadlock that happen with `dio_random.ksh` before this commit: [ 5513.663402] sd 0:0:0:0: [sda] Synchronizing SCSI cache [ 7496.580415] INFO: task z_wr_int:1098000 blocked for more than 120 seconds. [ 7496.585709] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7496.593349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7496.600839] task:z_wr_int state:D stack: 0 pid:1098000 ppid: 2 flags:0x80004080 [ 7496.608622] Call Trace: [ 7496.611770] __schedule+0x2d1/0x870 [ 7496.615404] schedule+0x55/0xf0 [ 7496.618866] cv_wait_common+0x16d/0x280 [spl] [ 7496.622910] ? finish_wait+0x80/0x80 [ 7496.626601] dmu_buf_direct_mixed_io_wait+0x73/0x190 [zfs] [ 7496.631327] dmu_write_direct_done+0x90/0x3a0 [zfs] [ 7496.635798] zio_done+0x373/0x1d40 [zfs] [ 7496.639795] zio_execute+0xee/0x210 [zfs] [ 7496.643840] taskq_thread+0x203/0x420 [spl] [ 7496.647836] ? wake_up_q+0x70/0x70 [ 7496.651411] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [ 7496.656489] ? taskq_lowest_id+0xc0/0xc0 [spl] [ 7496.660604] kthread+0x134/0x150 [ 7496.664092] ? set_kthread_struct+0x50/0x50 [ 7496.668080] ret_from_fork+0x35/0x40 [ 7496.671745] INFO: task txg_sync:1098025 blocked for more than 120 seconds. [ 7496.676991] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7496.684666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7496.692060] task:txg_sync state:D stack: 0 pid:1098025 ppid: 2 flags:0x80004080 [ 7496.699888] Call Trace: [ 7496.703012] __schedule+0x2d1/0x870 [ 7496.706658] schedule+0x55/0xf0 [ 7496.710093] schedule_timeout+0x197/0x300 [ 7496.713982] ? __next_timer_interrupt+0xf0/0xf0 [ 7496.718135] io_schedule_timeout+0x19/0x40 [ 7496.722049] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 7496.726349] ? 
finish_wait+0x80/0x80 [ 7496.730039] __cv_timedwait_io+0x15/0x20 [spl] [ 7496.734100] zio_wait+0x1a2/0x4d0 [zfs] [ 7496.738082] dsl_pool_sync+0xcb/0x6c0 [zfs] [ 7496.742205] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7496.746534] ? spa_errlog_sync+0x2f0/0x3d0 [zfs] [ 7496.750842] spa_sync_iterate_to_convergence+0xcf/0x310 [zfs] [ 7496.755742] spa_sync+0x362/0x8d0 [zfs] [ 7496.759689] txg_sync_thread+0x274/0x3b0 [zfs] [ 7496.763928] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] [ 7496.768439] ? spl_assert.constprop.0+0x20/0x20 [spl] [ 7496.772799] thread_generic_wrapper+0x63/0x90 [spl] [ 7496.777097] kthread+0x134/0x150 [ 7496.780616] ? set_kthread_struct+0x50/0x50 [ 7496.784549] ret_from_fork+0x35/0x40 [ 7496.788204] INFO: task fio:1101750 blocked for more than 120 seconds. [ 7496.895852] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7496.903765] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7496.911170] task:fio state:D stack: 0 pid:1101750 ppid:1101741 flags:0x00004080 [ 7496.919033] Call Trace: [ 7496.922136] __schedule+0x2d1/0x870 [ 7496.925769] schedule+0x55/0xf0 [ 7496.929245] schedule_timeout+0x197/0x300 [ 7496.933120] ? __next_timer_interrupt+0xf0/0xf0 [ 7496.937213] io_schedule_timeout+0x19/0x40 [ 7496.941126] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 7496.945444] ? finish_wait+0x80/0x80 [ 7496.949125] __cv_timedwait_io+0x15/0x20 [spl] [ 7496.953191] zio_wait+0x1a2/0x4d0 [zfs] [ 7496.957180] dmu_write_abd+0x174/0x1c0 [zfs] [ 7496.961319] dmu_write_uio_direct+0x79/0xf0 [zfs] [ 7496.965731] dmu_write_uio_dnode+0xa6/0x2d0 [zfs] [ 7496.970043] dmu_write_uio_dbuf+0x47/0x60 [zfs] [ 7496.974305] zfs_write+0x55f/0xea0 [zfs] [ 7496.978325] ? iov_iter_get_pages+0xe9/0x390 [ 7496.982333] ? trylock_page+0xd/0x20 [zfs] [ 7496.986451] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7496.990713] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [ 7496.995031] zpl_iter_write_direct+0xda/0x170 [zfs] [ 7496.999489] ? 
rrw_exit+0xc6/0x200 [zfs] [ 7497.003476] zpl_iter_write+0xd5/0x110 [zfs] [ 7497.007610] new_sync_write+0x112/0x160 [ 7497.011429] vfs_write+0xa5/0x1b0 [ 7497.014916] ksys_write+0x4f/0xb0 [ 7497.018443] do_syscall_64+0x5b/0x1b0 [ 7497.022150] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7497.026532] RIP: 0033:0x7f8771d72a17 [ 7497.030195] Code: Unable to access opcode bytes at RIP 0x7f8771d729ed. [ 7497.035263] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 7497.042547] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8771d72a17 [ 7497.047933] RDX: 000000000009b000 RSI: 00007f8713454000 RDI: 0000000000000005 [ 7497.053269] RBP: 00007f8713454000 R08: 0000000000000000 R09: 0000000000000000 [ 7497.058660] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000009b000 [ 7497.063960] R13: 000055b390afcac0 R14: 000000000009b000 R15: 000055b390afcae8 [ 7497.069334] INFO: task fio:1101751 blocked for more than 120 seconds. [ 7497.074308] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7497.081973] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7497.089371] task:fio state:D stack: 0 pid:1101751 ppid:1101741 flags:0x00000080 [ 7497.097147] Call Trace: [ 7497.100263] __schedule+0x2d1/0x870 [ 7497.103897] ? rrw_exit+0xc6/0x200 [zfs] [ 7497.107878] schedule+0x55/0xf0 [ 7497.111386] cv_wait_common+0x16d/0x280 [spl] [ 7497.115391] ? finish_wait+0x80/0x80 [ 7497.119028] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [ 7497.123667] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [ 7497.128240] zfs_read+0xaf/0x3f0 [zfs] [ 7497.132146] ? rrw_exit+0xc6/0x200 [zfs] [ 7497.136091] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7497.140366] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [ 7497.144679] zpl_iter_read_direct+0xe0/0x180 [zfs] [ 7497.149054] ? 
rrw_exit+0xc6/0x200 [zfs] [ 7497.153040] zpl_iter_read+0x94/0xb0 [zfs] [ 7497.157103] new_sync_read+0x10f/0x160 [ 7497.160855] vfs_read+0x91/0x150 [ 7497.164336] ksys_read+0x4f/0xb0 [ 7497.168004] do_syscall_64+0x5b/0x1b0 [ 7497.171706] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7497.176105] RIP: 0033:0x7f8771d72ab4 [ 7497.179742] Code: Unable to access opcode bytes at RIP 0x7f8771d72a8a. [ 7497.184807] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 7497.192129] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8771d72ab4 [ 7497.197485] RDX: 0000000000002000 RSI: 00007f8713454000 RDI: 0000000000000005 [ 7497.202922] RBP: 00007f8713454000 R08: 0000000000000000 R09: 0000000000000000 [ 7497.208309] R10: 00000001ffffffff R11: 0000000000000246 R12: 0000000000002000 [ 7497.213694] R13: 000055b390afcac0 R14: 0000000000002000 R15: 000055b390afcae8 [ 7497.219063] INFO: task fio:1101755 blocked for more than 120 seconds. [ 7497.224098] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7497.231786] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7497.239165] task:fio state:D stack: 0 pid:1101755 ppid:1101744 flags:0x00000080 [ 7497.246989] Call Trace: [ 7497.250121] __schedule+0x2d1/0x870 [ 7497.253779] schedule+0x55/0xf0 [ 7497.257240] schedule_preempt_disabled+0xa/0x10 [ 7497.261344] __mutex_lock.isra.7+0x349/0x420 [ 7497.265326] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7497.269674] zil_commit_writer+0x89/0x230 [zfs] [ 7497.273938] zil_commit_impl+0x5f/0xd0 [zfs] [ 7497.278101] zfs_fsync+0x81/0xa0 [zfs] [ 7497.282002] zpl_fsync+0xe5/0x140 [zfs] [ 7497.285985] do_fsync+0x38/0x70 [ 7497.289458] __x64_sys_fsync+0x10/0x20 [ 7497.293208] do_syscall_64+0x5b/0x1b0 [ 7497.296928] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7497.301260] RIP: 0033:0x7f9559073027 [ 7497.304920] Code: Unable to access opcode bytes at RIP 0x7f9559072ffd. 
[ 7497.310015] RSP: 002b:00007ffdefcd0ff0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [ 7497.317346] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f9559073027 [ 7497.322722] RDX: 0000000000000000 RSI: 0000563adcbf2ac0 RDI: 0000000000000005 [ 7497.328126] RBP: 00007f94fb858000 R08: 0000000000000000 R09: 0000000000000000 [ 7497.333514] R10: 0000000000008000 R11: 0000000000000293 R12: 0000000000000003 [ 7497.338887] R13: 0000563adcbf2ac0 R14: 0000000000000000 R15: 0000563adcbf2ae8 [ 7497.344247] INFO: task fio:1101756 blocked for more than 120 seconds. [ 7497.349327] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7497.357032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7497.364517] task:fio state:D stack: 0 pid:1101756 ppid:1101744 flags:0x00004080 [ 7497.372310] Call Trace: [ 7497.375433] __schedule+0x2d1/0x870 [ 7497.379004] schedule+0x55/0xf0 [ 7497.382454] cv_wait_common+0x16d/0x280 [spl] [ 7497.386477] ? finish_wait+0x80/0x80 [ 7497.390137] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [ 7497.394816] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [ 7497.399397] zfs_get_data+0x1a8/0x7e0 [zfs] [ 7497.403515] zil_lwb_commit+0x1a5/0x400 [zfs] [ 7497.407712] zil_lwb_write_close+0x408/0x630 [zfs] [ 7497.412126] zil_commit_waiter_timeout+0x16d/0x520 [zfs] [ 7497.416801] zil_commit_waiter+0x1d2/0x3b0 [zfs] [ 7497.421139] zil_commit_impl+0x6d/0xd0 [zfs] [ 7497.425294] zfs_fsync+0x81/0xa0 [zfs] [ 7497.429454] zpl_fsync+0xe5/0x140 [zfs] [ 7497.433396] do_fsync+0x38/0x70 [ 7497.436878] __x64_sys_fsync+0x10/0x20 [ 7497.440586] do_syscall_64+0x5b/0x1b0 [ 7497.444313] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7497.448659] RIP: 0033:0x7f9559073027 [ 7497.452343] Code: Unable to access opcode bytes at RIP 0x7f9559072ffd. 
[ 7497.457408] RSP: 002b:00007ffdefcd0ff0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [ 7497.464724] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f9559073027 [ 7497.470106] RDX: 0000000000000000 RSI: 0000563adcbf2ac0 RDI: 0000000000000005 [ 7497.475477] RBP: 00007f94fb89ca18 R08: 0000000000000000 R09: 0000000000000000 [ 7497.480806] R10: 00000000000b4cc0 R11: 0000000000000293 R12: 0000000000000003 [ 7497.486158] R13: 0000563adcbf2ac0 R14: 0000000000000000 R15: 0000563adcbf2ae8 [ 7619.459402] INFO: task z_wr_int:1098000 blocked for more than 120 seconds. [ 7619.464605] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7619.472233] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7619.479659] task:z_wr_int state:D stack: 0 pid:1098000 ppid: 2 flags:0x80004080 [ 7619.487518] Call Trace: [ 7619.490650] __schedule+0x2d1/0x870 [ 7619.494246] schedule+0x55/0xf0 [ 7619.497719] cv_wait_common+0x16d/0x280 [spl] [ 7619.501749] ? finish_wait+0x80/0x80 [ 7619.505411] dmu_buf_direct_mixed_io_wait+0x73/0x190 [zfs] [ 7619.510143] dmu_write_direct_done+0x90/0x3a0 [zfs] [ 7619.514603] zio_done+0x373/0x1d40 [zfs] [ 7619.518594] zio_execute+0xee/0x210 [zfs] [ 7619.522619] taskq_thread+0x203/0x420 [spl] [ 7619.526567] ? wake_up_q+0x70/0x70 [ 7619.530208] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [ 7619.535302] ? taskq_lowest_id+0xc0/0xc0 [spl] [ 7619.539385] kthread+0x134/0x150 [ 7619.542873] ? set_kthread_struct+0x50/0x50 [ 7619.546810] ret_from_fork+0x35/0x40 [ 7619.550477] INFO: task txg_sync:1098025 blocked for more than 120 seconds. [ 7619.555715] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7619.563415] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[ 7619.570851] task:txg_sync state:D stack: 0 pid:1098025 ppid: 2 flags:0x80004080 [ 7619.578606] Call Trace: [ 7619.581742] __schedule+0x2d1/0x870 [ 7619.585396] schedule+0x55/0xf0 [ 7619.589006] schedule_timeout+0x197/0x300 [ 7619.592916] ? __next_timer_interrupt+0xf0/0xf0 [ 7619.597027] io_schedule_timeout+0x19/0x40 [ 7619.600947] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 7619.709878] ? finish_wait+0x80/0x80 [ 7619.713565] __cv_timedwait_io+0x15/0x20 [spl] [ 7619.717596] zio_wait+0x1a2/0x4d0 [zfs] [ 7619.721567] dsl_pool_sync+0xcb/0x6c0 [zfs] [ 7619.725657] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7619.730050] ? spa_errlog_sync+0x2f0/0x3d0 [zfs] [ 7619.734415] spa_sync_iterate_to_convergence+0xcf/0x310 [zfs] [ 7619.739268] spa_sync+0x362/0x8d0 [zfs] [ 7619.743270] txg_sync_thread+0x274/0x3b0 [zfs] [ 7619.747494] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] [ 7619.751939] ? spl_assert.constprop.0+0x20/0x20 [spl] [ 7619.756279] thread_generic_wrapper+0x63/0x90 [spl] [ 7619.760569] kthread+0x134/0x150 [ 7619.764050] ? set_kthread_struct+0x50/0x50 [ 7619.767978] ret_from_fork+0x35/0x40 [ 7619.771639] INFO: task fio:1101750 blocked for more than 120 seconds. [ 7619.776678] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7619.784324] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7619.791914] task:fio state:D stack: 0 pid:1101750 ppid:1101741 flags:0x00004080 [ 7619.799712] Call Trace: [ 7619.802816] __schedule+0x2d1/0x870 [ 7619.806427] schedule+0x55/0xf0 [ 7619.809867] schedule_timeout+0x197/0x300 [ 7619.813760] ? __next_timer_interrupt+0xf0/0xf0 [ 7619.817848] io_schedule_timeout+0x19/0x40 [ 7619.821766] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 7619.826097] ? 
finish_wait+0x80/0x80 [ 7619.829780] __cv_timedwait_io+0x15/0x20 [spl] [ 7619.833857] zio_wait+0x1a2/0x4d0 [zfs] [ 7619.837838] dmu_write_abd+0x174/0x1c0 [zfs] [ 7619.842015] dmu_write_uio_direct+0x79/0xf0 [zfs] [ 7619.846388] dmu_write_uio_dnode+0xa6/0x2d0 [zfs] [ 7619.850760] dmu_write_uio_dbuf+0x47/0x60 [zfs] [ 7619.855011] zfs_write+0x55f/0xea0 [zfs] [ 7619.859008] ? iov_iter_get_pages+0xe9/0x390 [ 7619.863036] ? trylock_page+0xd/0x20 [zfs] [ 7619.867084] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7619.871366] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [ 7619.875715] zpl_iter_write_direct+0xda/0x170 [zfs] [ 7619.880164] ? rrw_exit+0xc6/0x200 [zfs] [ 7619.884174] zpl_iter_write+0xd5/0x110 [zfs] [ 7619.888492] new_sync_write+0x112/0x160 [ 7619.892285] vfs_write+0xa5/0x1b0 [ 7619.895829] ksys_write+0x4f/0xb0 [ 7619.899384] do_syscall_64+0x5b/0x1b0 [ 7619.903071] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7619.907394] RIP: 0033:0x7f8771d72a17 [ 7619.911026] Code: Unable to access opcode bytes at RIP 0x7f8771d729ed. [ 7619.916073] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 7619.923363] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8771d72a17 [ 7619.928675] RDX: 000000000009b000 RSI: 00007f8713454000 RDI: 0000000000000005 [ 7619.934019] RBP: 00007f8713454000 R08: 0000000000000000 R09: 0000000000000000 [ 7619.939354] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000009b000 [ 7619.944775] R13: 000055b390afcac0 R14: 000000000009b000 R15: 000055b390afcae8 [ 7619.950175] INFO: task fio:1101751 blocked for more than 120 seconds. [ 7619.955232] Tainted: P OE --------- - - 4.18.0-477.15.1.el8_8.x86_64 openzfs#1 [ 7619.962889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7619.970301] task:fio state:D stack: 0 pid:1101751 ppid:1101741 flags:0x00000080 [ 7619.978139] Call Trace: [ 7619.981278] __schedule+0x2d1/0x870 [ 7619.984872] ? 
rrw_exit+0xc6/0x200 [zfs] [ 7619.989260] schedule+0x55/0xf0 [ 7619.992725] cv_wait_common+0x16d/0x280 [spl] [ 7619.996754] ? finish_wait+0x80/0x80 [ 7620.000414] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [ 7620.005050] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [ 7620.009617] zfs_read+0xaf/0x3f0 [zfs] [ 7620.013503] ? rrw_exit+0xc6/0x200 [zfs] [ 7620.017489] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 7620.021774] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [ 7620.026091] zpl_iter_read_direct+0xe0/0x180 [zfs] [ 7620.030508] ? rrw_exit+0xc6/0x200 [zfs] [ 7620.034497] zpl_iter_read+0x94/0xb0 [zfs] [ 7620.038579] new_sync_read+0x10f/0x160 [ 7620.042325] vfs_read+0x91/0x150 [ 7620.045809] ksys_read+0x4f/0xb0 [ 7620.049273] do_syscall_64+0x5b/0x1b0 [ 7620.052965] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 7620.057354] RIP: 0033:0x7f8771d72ab4 [ 7620.060988] Code: Unable to access opcode bytes at RIP 0x7f8771d72a8a. [ 7620.066041] RSP: 002b:00007fffa5b930e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 7620.073256] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8771d72ab4 [ 7620.078553] RDX: 0000000000002000 RSI: 00007f8713454000 RDI: 0000000000000005 [ 7620.083878] RBP: 00007f8713454000 R08: 0000000000000000 R09: 0000000000000000 [ 7620.089353] R10: 00000001ffffffff R11: 0000000000000246 R12: 0000000000002000 [ 7620.094697] R13: 000055b390afcac0 R14: 0000000000002000 R15: 000055b390afcae8 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 12, 2024
995734e added a test for block cloning with mmap files. As a result I began hitting a panic in that test in dbuf_unoverride(). The ASSERT was that if the dirty record was from a cloned block, then dr_data must be set to NULL. This ASSERT was added in 86e115e. The point of that commit was to make sure that if a cloned block is read before it is synced out, then the associated ARC buffer is set in the dirty record. This became an issue with the O_DIRECT code, because dr_data was set to the ARC buf in dbuf_set_data() after the read. This logic is incorrect, though, for a cloned block. In order to fix this issue, I refined how to determine if the dirty record is in fact from an O_DIRECT write by making sure that dr_brtwrite is false. I created the function dbuf_dirty_is_direct_write() to perform the proper check. As part of this, I also cleaned up other code that did the exact same check for an O_DIRECT write to make sure the proper check is taking place everywhere. The trace of the ASSERT that was being tripped before this change is below: [3649972.811039] VERIFY0P(dr->dt.dl.dr_data) failed (NULL == ffff8d58e8183c80) [3649972.817999] PANIC at dbuf.c:1999:dbuf_unoverride() [3649972.822968] Showing stack for process 2365657 [3649972.827502] CPU: 0 PID: 2365657 Comm: clone_mmap_writ Kdump: loaded Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [3649972.839749] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 10/08/2020 [3649972.847315] Call Trace: [3649972.849935] dump_stack+0x41/0x60 [3649972.853428] spl_panic+0xd0/0xe8 [spl] [3649972.857370] ? cityhash4+0x75/0x90 [zfs] [3649972.861649] ? _cond_resched+0x15/0x30 [3649972.865577] ? spl_kmem_alloc_impl+0xce/0xf0 [spl] [3649972.870548] ? __kmalloc_node+0x10d/0x300 [3649972.874735] ? spl_kmem_alloc_impl+0xce/0xf0 [spl] [3649972.879702] ? 
__list_add+0x12/0x30 [zfs] [3649972.884061] dbuf_unoverride+0x1c1/0x1d0 [zfs] [3649972.888856] dbuf_redirty+0x3b/0xd0 [zfs] [3649972.893204] dbuf_dirty+0xeb1/0x1330 [zfs] [3649972.897643] ? _cond_resched+0x15/0x30 [3649972.901569] ? mutex_lock+0xe/0x30 [3649972.905148] ? dbuf_noread+0x117/0x240 [zfs] [3649972.909760] dmu_write_uio_dnode+0x1d2/0x320 [zfs] [3649972.914900] dmu_write_uio_dbuf+0x47/0x60 [zfs] [3649972.919777] zfs_write+0x57d/0xe00 [zfs] [3649972.924076] ? alloc_set_pte+0xb8/0x3e0 [3649972.928088] zpl_iter_write_buffered+0xb2/0x120 [zfs] [3649972.933507] ? rrw_exit+0xc6/0x200 [zfs] [3649972.937796] zpl_iter_write+0xba/0x110 [zfs] [3649972.942433] new_sync_write+0x112/0x160 [3649972.946445] vfs_write+0xa5/0x1a0 [3649972.949935] ksys_pwrite64+0x61/0xa0 [3649972.953681] do_syscall_64+0x5b/0x1a0 [3649972.957519] entry_SYSCALL_64_after_hwframe+0x65/0xca [3649972.962745] RIP: 0033:0x7f610616f01b Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 12, 2024
Originally I was checking dr->dr_dbuf->db_level == 0 in dbuf_dirty_is_direct_write(). However, this can lead to a NULL pointer dereference if dr_dbuf is no longer set. I updated dbuf_dirty_is_direct_write() to also take a dmu_buf_impl_t so it can check db->db_level == 0. This failure was caught on the Fedora 37 CI running the enospc_rm test. Below is the stack trace. [ 9851.511608] BUG: kernel NULL pointer dereference, address: 0000000000000068 [ 9851.515922] #PF: supervisor read access in kernel mode [ 9851.519462] #PF: error_code(0x0000) - not-present page [ 9851.522992] PGD 0 P4D 0 [ 9851.525684] Oops: 0000 [openzfs#1] PREEMPT SMP PTI [ 9851.528878] CPU: 0 PID: 1272993 Comm: fio Tainted: P OE 6.5.12-100.fc37.x86_64 openzfs#1 [ 9851.535266] Hardware name: Amazon EC2 m5d.large/, BIOS 1.0 10/16/2017 [ 9851.539226] RIP: 0010:dbuf_dirty_is_direct_write+0xb/0x40 [zfs] [ 9851.543379] Code: 10 74 02 31 c0 5b c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 31 c0 48 85 ff 74 31 48 8b 57 20 <80> 7a 68 00 75 27 8b 87 64 01 00 00 85 c0 75 1b 83 bf 58 01 00 00 [ 9851.554719] RSP: 0018:ffff9b5b8305f8e8 EFLAGS: 00010286 [ 9851.558276] RAX: 0000000000000000 RBX: ffff9b5b8569b0b8 RCX: 0000000000000000 [ 9851.562481] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8f2e97de9e00 [ 9851.566672] RBP: 0000000000020000 R08: 0000000000000000 R09: ffff8f2f70e94000 [ 9851.570851] R10: 0000000000000001 R11: 0000000000000110 R12: ffff8f2f774ae4c0 [ 9851.575032] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 9851.579209] FS: 00007f57c5542240(0000) GS:ffff8f2faa800000(0000) knlGS:0000000000000000 [ 9851.585357] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9851.589064] CR2: 0000000000000068 CR3: 00000001f9a38001 CR4: 00000000007706f0 [ 9851.593256] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 9851.597440] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 9851.601618] PKRU: 55555554 [ 
9851.604341] Call Trace: [ 9851.606981] <TASK> [ 9851.609515] ? __die+0x23/0x70 [ 9851.612388] ? page_fault_oops+0x171/0x4e0 [ 9851.615571] ? exc_page_fault+0x77/0x170 [ 9851.618704] ? asm_exc_page_fault+0x26/0x30 [ 9851.621900] ? dbuf_dirty_is_direct_write+0xb/0x40 [zfs] [ 9851.625828] zfs_get_data+0x407/0x820 [zfs] [ 9851.629400] zil_lwb_commit+0x18d/0x3f0 [zfs] [ 9851.633026] zil_lwb_write_issue+0x92/0xbb0 [zfs] [ 9851.636758] zil_commit_waiter_timeout+0x1f3/0x580 [zfs] [ 9851.640696] zil_commit_waiter+0x1ff/0x3a0 [zfs] [ 9851.644402] zil_commit_impl+0x71/0xd0 [zfs] [ 9851.647998] zfs_write+0xb51/0xdc0 [zfs] [ 9851.651467] zpl_iter_write_buffered+0xc9/0x140 [zfs] [ 9851.655337] zpl_iter_write+0xc0/0x110 [zfs] [ 9851.658920] vfs_write+0x23e/0x420 [ 9851.661871] __x64_sys_pwrite64+0x98/0xd0 [ 9851.665013] do_syscall_64+0x5f/0x90 [ 9851.668027] ? ksys_fadvise64_64+0x57/0xa0 [ 9851.671212] ? syscall_exit_to_user_mode+0x2b/0x40 [ 9851.674594] ? do_syscall_64+0x6b/0x90 [ 9851.677655] ? syscall_exit_to_user_mode+0x2b/0x40 [ 9851.681051] ? do_syscall_64+0x6b/0x90 [ 9851.684128] ? exc_page_fault+0x77/0x170 [ 9851.687256] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 9851.690759] RIP: 0033:0x7f57c563c377 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
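Taken together with the previous commit, the check described above can be sketched as C-style pseudocode. This is a sketch assuming only the fields named in the commit messages (dr_brtwrite, db_level), not the exact OpenZFS function body:

```c
/* Pseudocode sketch: is this level-0 dirty record from an O_DIRECT write?
 * Per the commit messages, the dbuf is now passed in so db_level can be
 * checked without dereferencing dr->dr_dbuf (which may be NULL). */
static boolean_t
dbuf_dirty_is_direct_write(dmu_buf_impl_t *db, dbuf_dirty_record_t *dr)
{
	return (dr != NULL && db->db_level == 0 &&
	    dr->dt.dl.dr_brtwrite == B_FALSE);	/* excludes cloned blocks */
}
```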
bwatkinson added a commit to bwatkinson/zfs that referenced this issue Sep 12, 2024
There existed a race condition between when a Direct I/O write could complete and when a sync operation was issued. This was due to the fact that a Direct I/O write would sleep waiting on previous TXGs to sync out their dirty records associated with a dbuf if there was an ARC buffer associated with the dbuf. This was necessary to safely destroy the ARC buffer, in case a previous dirty record's dr_data pointed at the db_buf. The main issue with this approach is that a Direct I/O write holds the rangelock across the entire block, so when a sync on that same block was issued and tried to grab the rangelock as reader, it was blocked indefinitely because the Direct I/O write that was now sleeping held that same rangelock as writer. This led to a complete deadlock. This commit fixes this issue and removes the wait in dmu_write_direct_done(). The way this is now handled is that the ARC buffer is destroyed, if there is one associated with the dbuf, before ever issuing the Direct I/O write. This implementation heavily borrows from the block cloning implementation. A new function dmu_buf_will_clone_or_dio() is called in both dmu_write_direct() and dmu_brt_clone() that does the following:
1. Undirties a dirty record for that dbuf if there is one currently associated with the current TXG.
2. Destroys the ARC buffer if the previous dirty record's dr_data does not point at the dbuf's ARC buffer (db_buf).
3. Sets the dbuf's data pointers to NULL.
4. Redirties the dbuf using db_state = DB_NOFILL.
As part of this commit, the dmu_write_direct_done() function was also cleaned up. Now dmu_sync_done() is called before undirtying the dbuf dirty record associated with a failed Direct I/O write. This is correct logic and how it always should have been. An additional benefit of these modifications is that there is no longer a stall in a Direct I/O write if the user is mixing buffered and O_DIRECT I/O together. 
It also unifies the block cloning and Direct I/O write paths, as they both need to call dbuf_fix_old_data() before destroying the ARC buffer. As part of this commit, there is also general code cleanup. Various dbuf stats were removed because they are no longer necessary. Additionally, unused functions were removed to make the code paths cleaner for Direct I/O. Below is the race condition stack trace that was consistently observed in the CI runs for the dio_random test case that prompted these changes: trace: [ 7795.294473] sd 0:0:0:0: [sda] Synchronizing SCSI cache [ 9954.769075] INFO: task z_wr_int:1051869 blocked for more than 120 seconds. [ 9954.770512] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.772159] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.773848] task:z_wr_int state:D stack:0 pid:1051869 ppid:2 flags:0x80004080 [ 9954.775512] Call Trace: [ 9954.776406] __schedule+0x2d1/0x870 [ 9954.777386] ? free_one_page+0x204/0x530 [ 9954.778466] schedule+0x55/0xf0 [ 9954.779355] cv_wait_common+0x16d/0x280 [spl] [ 9954.780491] ? finish_wait+0x80/0x80 [ 9954.781450] dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs] [ 9954.782889] dmu_write_direct_done+0x90/0x3b0 [zfs] [ 9954.784255] zio_done+0x373/0x1d50 [zfs] [ 9954.785410] zio_execute+0xee/0x210 [zfs] [ 9954.786588] taskq_thread+0x205/0x3f0 [spl] [ 9954.787673] ? wake_up_q+0x60/0x60 [ 9954.788571] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [ 9954.790079] ? taskq_lowest_id+0xc0/0xc0 [spl] [ 9954.791199] kthread+0x134/0x150 [ 9954.792082] ? set_kthread_struct+0x50/0x50 [ 9954.793189] ret_from_fork+0x35/0x40 [ 9954.794108] INFO: task txg_sync:1051894 blocked for more than 120 seconds. [ 9954.795535] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.797103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[ 9954.798669] task:txg_sync state:D stack:0 pid:1051894 ppid:2 flags:0x80004080 [ 9954.800267] Call Trace: [ 9954.801096] __schedule+0x2d1/0x870 [ 9954.801972] ? __wake_up_common+0x7a/0x190 [ 9954.802963] schedule+0x55/0xf0 [ 9954.803884] schedule_timeout+0x19f/0x320 [ 9954.804837] ? __next_timer_interrupt+0xf0/0xf0 [ 9954.805932] ? taskq_dispatch+0xab/0x280 [spl] [ 9954.806959] io_schedule_timeout+0x19/0x40 [ 9954.807989] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 9954.809110] ? finish_wait+0x80/0x80 [ 9954.810068] __cv_timedwait_io+0x15/0x20 [spl] [ 9954.811103] zio_wait+0x1ad/0x4f0 [zfs] [ 9954.812255] dsl_pool_sync+0xcb/0x6c0 [zfs] [ 9954.813442] ? spa_errlog_sync+0x2f0/0x3d0 [zfs] [ 9954.814648] spa_sync_iterate_to_convergence+0xcb/0x310 [zfs] [ 9954.816023] spa_sync+0x362/0x8f0 [zfs] [ 9954.817110] txg_sync_thread+0x27a/0x3b0 [zfs] [ 9954.818267] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] [ 9954.819510] ? spl_assert.constprop.0+0x20/0x20 [spl] [ 9954.820643] thread_generic_wrapper+0x63/0x90 [spl] [ 9954.821709] kthread+0x134/0x150 [ 9954.822590] ? set_kthread_struct+0x50/0x50 [ 9954.823584] ret_from_fork+0x35/0x40 [ 9954.824444] INFO: task fio:1055501 blocked for more than 120 seconds. [ 9954.825781] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.827315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.828871] task:fio state:D stack:0 pid:1055501 ppid:1055490 flags:0x00004080 [ 9954.830463] Call Trace: [ 9954.831280] __schedule+0x2d1/0x870 [ 9954.832159] ? dbuf_hold_copy+0xec/0x230 [zfs] [ 9954.833396] schedule+0x55/0xf0 [ 9954.834286] cv_wait_common+0x16d/0x280 [spl] [ 9954.835291] ? finish_wait+0x80/0x80 [ 9954.836235] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [ 9954.837543] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [ 9954.838838] zfs_get_data+0x566/0x810 [zfs] [ 9954.840034] zil_lwb_commit+0x194/0x3f0 [zfs] [ 9954.841154] zil_lwb_write_issue+0x68/0xb90 [zfs] [ 9954.842367] ? 
__list_add+0x12/0x30 [zfs] [ 9954.843496] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 9954.844665] ? zil_alloc_lwb+0x217/0x360 [zfs] [ 9954.845852] zil_commit_waiter_timeout+0x1f3/0x570 [zfs] [ 9954.847203] zil_commit_waiter+0x1d2/0x3b0 [zfs] [ 9954.848380] zil_commit_impl+0x6d/0xd0 [zfs] [ 9954.849550] zfs_fsync+0x66/0x90 [zfs] [ 9954.850640] zpl_fsync+0xe5/0x140 [zfs] [ 9954.851729] do_fsync+0x38/0x70 [ 9954.852585] __x64_sys_fsync+0x10/0x20 [ 9954.853486] do_syscall_64+0x5b/0x1b0 [ 9954.854416] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 9954.855466] RIP: 0033:0x7eff236bb057 [ 9954.856388] Code: Unable to access opcode bytes at RIP 0x7eff236bb02d. [ 9954.857651] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [ 9954.859141] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007eff236bb057 [ 9954.860496] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI: 0000000000000006 [ 9954.861945] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09: 0000000000000000 [ 9954.863327] R10: 0000000000056000 R11: 0000000000000293 R12: 0000000000000003 [ 9954.864765] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15: 000055e4d1f13ae8 [ 9954.866149] INFO: task fio:1055502 blocked for more than 120 seconds. [ 9954.867490] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.869029] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.870571] task:fio state:D stack:0 pid:1055502 ppid:1055490 flags:0x00004080 [ 9954.872162] Call Trace: [ 9954.872947] __schedule+0x2d1/0x870 [ 9954.873844] schedule+0x55/0xf0 [ 9954.874716] schedule_timeout+0x19f/0x320 [ 9954.875645] ? __next_timer_interrupt+0xf0/0xf0 [ 9954.876722] io_schedule_timeout+0x19/0x40 [ 9954.877677] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 9954.878822] ? 
finish_wait+0x80/0x80 [ 9954.879694] __cv_timedwait_io+0x15/0x20 [spl] [ 9954.880763] zio_wait+0x1ad/0x4f0 [zfs] [ 9954.881865] dmu_write_abd+0x174/0x1c0 [zfs] [ 9954.883074] dmu_write_uio_direct+0x79/0x100 [zfs] [ 9954.884285] dmu_write_uio_dnode+0xb2/0x320 [zfs] [ 9954.885507] dmu_write_uio_dbuf+0x47/0x60 [zfs] [ 9954.886687] zfs_write+0x581/0xe20 [zfs] [ 9954.887822] ? iov_iter_get_pages+0xe9/0x390 [ 9954.888862] ? trylock_page+0xd/0x20 [zfs] [ 9954.890005] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 9954.891217] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [ 9954.892391] zpl_iter_write_direct+0xd4/0x170 [zfs] [ 9954.893663] ? rrw_exit+0xc6/0x200 [zfs] [ 9954.894764] zpl_iter_write+0xd5/0x110 [zfs] [ 9954.895911] new_sync_write+0x112/0x160 [ 9954.896881] vfs_write+0xa5/0x1b0 [ 9954.897701] ksys_write+0x4f/0xb0 [ 9954.898569] do_syscall_64+0x5b/0x1b0 [ 9954.899417] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 9954.900515] RIP: 0033:0x7eff236baa47 [ 9954.901363] Code: Unable to access opcode bytes at RIP 0x7eff236baa1d. [ 9954.902673] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 9954.904099] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007eff236baa47 [ 9954.905535] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI: 0000000000000005 [ 9954.906902] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09: 0000000000000000 [ 9954.908339] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000e4000 [ 9954.909705] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15: 000055e4d1f13ae8 [ 9954.911129] INFO: task fio:1055504 blocked for more than 120 seconds. [ 9954.912381] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.913978] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.915434] task:fio state:D stack:0 pid:1055504 ppid:1055493 flags:0x00000080 [ 9954.917082] Call Trace: [ 9954.917773] __schedule+0x2d1/0x870 [ 9954.918648] ? 
zilog_dirty+0x4f/0xc0 [zfs] [ 9954.919831] schedule+0x55/0xf0 [ 9954.920717] cv_wait_common+0x16d/0x280 [spl] [ 9954.921704] ? finish_wait+0x80/0x80 [ 9954.922639] zfs_rangelock_enter_writer+0x46/0x1c0 [zfs] [ 9954.923940] zfs_rangelock_enter_impl+0x12a/0x1b0 [zfs] [ 9954.925306] zfs_write+0x703/0xe20 [zfs] [ 9954.926406] zpl_iter_write_buffered+0xb2/0x120 [zfs] [ 9954.927687] ? rrw_exit+0xc6/0x200 [zfs] [ 9954.928821] zpl_iter_write+0xbe/0x110 [zfs] [ 9954.930028] new_sync_write+0x112/0x160 [ 9954.930913] vfs_write+0xa5/0x1b0 [ 9954.931758] ksys_write+0x4f/0xb0 [ 9954.932666] do_syscall_64+0x5b/0x1b0 [ 9954.933544] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 9954.934689] RIP: 0033:0x7fcaee8f0a47 [ 9954.935551] Code: Unable to access opcode bytes at RIP 0x7fcaee8f0a1d. [ 9954.936893] RSP: 002b:00007fff56b2c240 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 9954.938327] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fcaee8f0a47 [ 9954.939777] RDX: 000000000001d000 RSI: 00007fca8300b010 RDI: 0000000000000006 [ 9954.941187] RBP: 00007fca8300b010 R08: 0000000000000000 R09: 0000000000000000 [ 9954.942655] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000001d000 [ 9954.944062] R13: 0000557a2006bac0 R14: 000000000001d000 R15: 0000557a2006bae8 [ 9954.945525] INFO: task fio:1055505 blocked for more than 120 seconds. [ 9954.946819] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [ 9954.948466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.949959] task:fio state:D stack:0 pid:1055505 ppid:1055493 flags:0x00004080 [ 9954.951653] Call Trace: [ 9954.952417] __schedule+0x2d1/0x870 [ 9954.953393] ? finish_wait+0x3e/0x80 [ 9954.954315] schedule+0x55/0xf0 [ 9954.955212] cv_wait_common+0x16d/0x280 [spl] [ 9954.956211] ? 
finish_wait+0x80/0x80 [ 9954.957159] zil_commit_waiter+0xfa/0x3b0 [zfs] [ 9954.958343] zil_commit_impl+0x6d/0xd0 [zfs] [ 9954.959524] zfs_fsync+0x66/0x90 [zfs] [ 9954.960626] zpl_fsync+0xe5/0x140 [zfs] [ 9954.961763] do_fsync+0x38/0x70 [ 9954.962638] __x64_sys_fsync+0x10/0x20 [ 9954.963520] do_syscall_64+0x5b/0x1b0 [ 9954.964470] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 9954.965567] RIP: 0033:0x7fcaee8f1057 [ 9954.966490] Code: Unable to access opcode bytes at RIP 0x7fcaee8f102d. [ 9954.967752] RSP: 002b:00007fff56b2c230 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [ 9954.969260] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fcaee8f1057 [ 9954.970628] RDX: 0000000000000000 RSI: 0000557a2006bac0 RDI: 0000000000000005 [ 9954.972092] RBP: 00007fca84152a18 R08: 0000000000000000 R09: 0000000000000000 [ 9954.973484] R10: 0000000000035000 R11: 0000000000000293 R12: 0000000000000003 [ 9954.974958] R13: 0000557a2006bac0 R14: 0000000000000000 R15: 0000557a2006bae8 [10077.648150] INFO: task z_wr_int:1051869 blocked for more than 120 seconds. [10077.649541] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [10077.651116] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10077.652782] task:z_wr_int state:D stack:0 pid:1051869 ppid:2 flags:0x80004080 [10077.654420] Call Trace: [10077.655267] __schedule+0x2d1/0x870 [10077.656179] ? free_one_page+0x204/0x530 [10077.657192] schedule+0x55/0xf0 [10077.658004] cv_wait_common+0x16d/0x280 [spl] [10077.659018] ? finish_wait+0x80/0x80 [10077.660013] dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs] [10077.661396] dmu_write_direct_done+0x90/0x3b0 [zfs] [10077.662617] zio_done+0x373/0x1d50 [zfs] [10077.663783] zio_execute+0xee/0x210 [zfs] [10077.664921] taskq_thread+0x205/0x3f0 [spl] [10077.665982] ? wake_up_q+0x60/0x60 [10077.666842] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [10077.668295] ? taskq_lowest_id+0xc0/0xc0 [spl] [10077.669360] kthread+0x134/0x150 [10077.670191] ? 
set_kthread_struct+0x50/0x50 [10077.671209] ret_from_fork+0x35/0x40 [10077.672076] INFO: task txg_sync:1051894 blocked for more than 120 seconds. [10077.673467] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [10077.675112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10077.676612] task:txg_sync state:D stack:0 pid:1051894 ppid:2 flags:0x80004080 [10077.678288] Call Trace: [10077.679024] __schedule+0x2d1/0x870 [10077.679948] ? __wake_up_common+0x7a/0x190 [10077.681042] schedule+0x55/0xf0 [10077.681899] schedule_timeout+0x19f/0x320 [10077.682951] ? __next_timer_interrupt+0xf0/0xf0 [10077.684005] ? taskq_dispatch+0xab/0x280 [spl] [10077.685085] io_schedule_timeout+0x19/0x40 [10077.686080] __cv_timedwait_common+0x19e/0x2c0 [spl] [10077.687227] ? finish_wait+0x80/0x80 [10077.688123] __cv_timedwait_io+0x15/0x20 [spl] [10077.689206] zio_wait+0x1ad/0x4f0 [zfs] [10077.690300] dsl_pool_sync+0xcb/0x6c0 [zfs] [10077.691435] ? spa_errlog_sync+0x2f0/0x3d0 [zfs] [10077.692636] spa_sync_iterate_to_convergence+0xcb/0x310 [zfs] [10077.693997] spa_sync+0x362/0x8f0 [zfs] [10077.695112] txg_sync_thread+0x27a/0x3b0 [zfs] [10077.696239] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] [10077.697512] ? spl_assert.constprop.0+0x20/0x20 [spl] [10077.698639] thread_generic_wrapper+0x63/0x90 [spl] [10077.699687] kthread+0x134/0x150 [10077.700567] ? set_kthread_struct+0x50/0x50 [10077.701502] ret_from_fork+0x35/0x40 [10077.702430] INFO: task fio:1055501 blocked for more than 120 seconds. [10077.703697] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [10077.705309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10077.706780] task:fio state:D stack:0 pid:1055501 ppid:1055490 flags:0x00004080 [10077.708479] Call Trace: [10077.709231] __schedule+0x2d1/0x870 [10077.710190] ? dbuf_hold_copy+0xec/0x230 [zfs] [10077.711368] schedule+0x55/0xf0 [10077.712286] cv_wait_common+0x16d/0x280 [spl] [10077.713316] ? 
finish_wait+0x80/0x80 [10077.714262] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [10077.715566] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [10077.716878] zfs_get_data+0x566/0x810 [zfs] [10077.718032] zil_lwb_commit+0x194/0x3f0 [zfs] [10077.719234] zil_lwb_write_issue+0x68/0xb90 [zfs] [10077.720413] ? __list_add+0x12/0x30 [zfs] [10077.721525] ? __raw_spin_unlock+0x5/0x10 [zfs] [10077.722708] ? zil_alloc_lwb+0x217/0x360 [zfs] [10077.723931] zil_commit_waiter_timeout+0x1f3/0x570 [zfs] [10077.725273] zil_commit_waiter+0x1d2/0x3b0 [zfs] [10077.726438] zil_commit_impl+0x6d/0xd0 [zfs] [10077.727586] zfs_fsync+0x66/0x90 [zfs] [10077.728675] zpl_fsync+0xe5/0x140 [zfs] [10077.729755] do_fsync+0x38/0x70 [10077.730607] __x64_sys_fsync+0x10/0x20 [10077.731482] do_syscall_64+0x5b/0x1b0 [10077.732415] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [10077.733487] RIP: 0033:0x7eff236bb057 [10077.734399] Code: Unable to access opcode bytes at RIP 0x7eff236bb02d. [10077.735657] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [10077.737163] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007eff236bb057 [10077.738526] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI: 0000000000000006 [10077.739966] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09: 0000000000000000 [10077.741336] R10: 0000000000056000 R11: 0000000000000293 R12: 0000000000000003 [10077.742773] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15: 000055e4d1f13ae8 [10077.744168] INFO: task fio:1055502 blocked for more than 120 seconds. [10077.745505] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 openzfs#1 [10077.747073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10077.748642] task:fio state:D stack:0 pid:1055502 ppid:1055490 flags:0x00004080 [10077.750233] Call Trace: [10077.751011] __schedule+0x2d1/0x870 [10077.751915] schedule+0x55/0xf0 [10077.752811] schedule_timeout+0x19f/0x320 [10077.753762] ? 
__next_timer_interrupt+0xf0/0xf0 [10077.754824] io_schedule_timeout+0x19/0x40 [10077.755782] __cv_timedwait_common+0x19e/0x2c0 [spl] [10077.756922] ? finish_wait+0x80/0x80 [10077.757788] __cv_timedwait_io+0x15/0x20 [spl] [10077.758845] zio_wait+0x1ad/0x4f0 [zfs] [10077.759941] dmu_write_abd+0x174/0x1c0 [zfs] [10077.761144] dmu_write_uio_direct+0x79/0x100 [zfs] [10077.762327] dmu_write_uio_dnode+0xb2/0x320 [zfs] [10077.763523] dmu_write_uio_dbuf+0x47/0x60 [zfs] [10077.764749] zfs_write+0x581/0xe20 [zfs] [10077.765825] ? iov_iter_get_pages+0xe9/0x390 [10077.766842] ? trylock_page+0xd/0x20 [zfs] [10077.767956] ? __raw_spin_unlock+0x5/0x10 [zfs] [10077.769189] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [10077.770343] zpl_iter_write_direct+0xd4/0x170 [zfs] [10077.771570] ? rrw_exit+0xc6/0x200 [zfs] [10077.772674] zpl_iter_write+0xd5/0x110 [zfs] [10077.773834] new_sync_write+0x112/0x160 [10077.774805] vfs_write+0xa5/0x1b0 [10077.775634] ksys_write+0x4f/0xb0 [10077.776526] do_syscall_64+0x5b/0x1b0 [10077.777386] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [10077.778488] RIP: 0033:0x7eff236baa47 [10077.779339] Code: Unable to access opcode bytes at RIP 0x7eff236baa1d. [10077.780655] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [10077.782056] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007eff236baa47 [10077.783507] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI: 0000000000000005 [10077.784890] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09: 0000000000000000 [10077.786303] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000e4000 [10077.787637] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15: 000055e4d1f13ae8 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
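For clarity, the four steps listed in the commit message above can be sketched as C-style pseudocode. This is not the actual OpenZFS implementation; `arc_buf_in_use_by_older_dr()` is a hypothetical helper standing in for the real dirty-record ownership check:

```c
/* Pseudocode sketch of dmu_buf_will_clone_or_dio(), following the four
 * steps in the commit message above; not the real OpenZFS function body. */
void
dmu_buf_will_clone_or_dio(dmu_buf_t *db_fake, dmu_tx_t *tx)
{
	dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;

	/* 1. Undirty any dirty record for this dbuf in the current TXG. */
	(void) dbuf_undirty(db, tx);

	/* 2. Destroy the ARC buffer, unless a previous dirty record's
	 *    dr_data still points at it (db->db_buf). */
	if (db->db_buf != NULL &&
	    !arc_buf_in_use_by_older_dr(db))	/* hypothetical helper */
		arc_buf_destroy(db->db_buf, db);

	/* 3. Clear the dbuf's data pointers. */
	db->db_buf = NULL;
	db->db.db_data = NULL;

	/* 4. Redirty the dbuf as DB_NOFILL so the write bypasses the ARC. */
	db->db_state = DB_NOFILL;
	dmu_buf_will_not_fill(db_fake, tx);
}
```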
Barrier support was added as of zfs-0.4.5. It is implemented in the vdev_disk layer for 2.6.24 and later kernels. This is where generic support first appears in the kernel for empty bios, which can be submitted as a WRITE_BARRIER io request. Such a barrier prevents the elevator from reordering requests from one side of the barrier to the other, and its completion callback runs only once this io (and all previous ios) are physically on disk. Kernels prior to 2.6.24 provide a more primitive barrier mechanism, but the code has not been updated to use it and instead returns ENOTSUP to indicate there is no barrier support. This work still needs to be done.
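For reference, the 2.6.24+ mechanism described above amounts to allocating a data-less bio and submitting it with the WRITE_BARRIER flag. Below is a minimal sketch of that pattern; it assumes the historical 2.6.24-era block-layer API (two-argument bi_end_io, submit_bio(WRITE_BARRIER, ...)) and will not build on modern kernels, and the function names and zio glue are illustrative rather than copied from the actual vdev_disk code:

```c
/* Sketch: issue an empty barrier bio on a ~2.6.24 Linux kernel.
 * WRITE_BARRIER was later replaced by REQ_FLUSH/REQ_PREFLUSH. */
#include <linux/bio.h>
#include <linux/blkdev.h>

static void
vdev_disk_barrier_completion(struct bio *bio, int error)	/* illustrative */
{
	zio_t *zio = bio->bi_private;

	/* Runs only once this bio and all prior writes are on stable storage. */
	zio->io_error = -error;
	bio_put(bio);
	zio_interrupt(zio);
}

static int
vdev_disk_issue_barrier(struct block_device *bdev, zio_t *zio)	/* illustrative */
{
	struct bio *bio = bio_alloc(GFP_NOIO, 0);	/* zero data pages */

	if (bio == NULL)
		return (ENOMEM);

	bio->bi_bdev = bdev;
	bio->bi_private = zio;
	bio->bi_end_io = vdev_disk_barrier_completion;

	/* WRITE_BARRIER == WRITE | (1 << BIO_RW_BARRIER): the elevator
	 * may not reorder requests across this bio. */
	submit_bio(WRITE_BARRIER, bio);
	return (0);
}
```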