Limited parallelism due to locking #2023

Closed · mahkoh opened this issue on Oct 6, 2021 · 6 comments
Labels: area: performance (How fast things go) · type: enhancement (New feature or request)

mahkoh (Contributor) commented Oct 6, 2021

I have a program that calls queue_write_buffer concurrently. Most of the time the threads are blocked with the following stack trace:

        [...]
        [ffffffff84545650] do_futex
        [ffffffff845465fe] __x64_sys_futex
        [ffffffff84eb813c] do_syscall_64
        [ffffffff8500007c] entry_SYSCALL_64_after_hwframe
        [7eff1731418d] syscall
        [55ad678f2968] parking_lot_core::thread_parker::imp::ThreadParker::futex_wait
        [55ad678f2968] <parking_lot_core::thread_parker::imp::ThreadParker as parking_lot_core::thread_parker::ThreadParkerT>::park
        [55ad678f2968] parking_lot_core::parking_lot::park::{{closure}}
        [55ad678f2968] parking_lot_core::parking_lot::with_thread_data
        [55ad678f2968] parking_lot_core::parking_lot::park
        [55ad678f2968] parking_lot::raw_rwlock::RawRwLock::lock_common
        [55ad678f2968] parking_lot::raw_rwlock::RawRwLock::lock_exclusive_slow
        [55ad67c492fe] <parking_lot::raw_rwlock::RawRwLock as lock_api::rwlock::RawRwLock>::lock_exclusive
        [55ad67c492fe] lock_api::rwlock::RwLock<R,T>::write
        [55ad67c492fe] wgpu_core::hub::Registry<T,I,F>::write
        [55ad67c492fe] wgpu_core::device::queue::<impl wgpu_core::hub::Global<G>>::queue_write_buffer
        [55ad67bf590d] <wgpu::backend::direct::Context as wgpu::Context>::queue_write_buffer
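
For context, a minimal sketch (not the original program) of the concurrent pattern described above. It assumes the wgpu Device and Queue are shared across threads via Arc; buffer sizes, usages, and the thread count are arbitrary:

        use std::sync::Arc;

        /// Sketch: several threads calling queue.write_buffer at the same time.
        /// wgpu handles are Send + Sync on the native backends.
        fn concurrent_writes(device: Arc<wgpu::Device>, queue: Arc<wgpu::Queue>) {
            const CHUNK: u64 = 64 * 1024;
            const THREADS: usize = 8;

            // One destination buffer per thread, so the writes are trivially disjoint.
            let buffers: Vec<wgpu::Buffer> = (0..THREADS)
                .map(|_| {
                    device.create_buffer(&wgpu::BufferDescriptor {
                        label: Some("upload target"),
                        size: CHUNK,
                        usage: wgpu::BufferUsages::COPY_DST | wgpu::BufferUsages::STORAGE,
                        mapped_at_creation: false,
                    })
                })
                .collect();

            std::thread::scope(|s| {
                for buffer in &buffers {
                    let queue = &queue;
                    s.spawn(move || {
                        let data = vec![0u8; CHUNK as usize];
                        // Each call takes a write lock on one of wgpu-core's hub
                        // registries, which is where the futex contention in the
                        // trace above comes from.
                        queue.write_buffer(buffer, 0, &data);
                    });
                }
            });
        }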

After switching to thread-local StagingBelts and CommandEncoders, the improvements were only marginal, because

        let (mut cmd_buf_guard, mut token) = hub.command_buffers.write(&mut token);

in command_encoder_copy_buffer_to_buffer acquires a global lock. This seems unnecessary since CommandEncoders are !Send + !Sync.
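
A sketch of the thread-local workaround mentioned above, assuming one wgpu::util::StagingBelt (and one CommandEncoder) per thread; the 1 MiB chunk size and the helper name are illustrative:

        use std::cell::RefCell;

        thread_local! {
            // One StagingBelt per thread, so belts never contend with each other.
            // The 1 MiB chunk size is an arbitrary choice for this sketch.
            static BELT: RefCell<wgpu::util::StagingBelt> =
                RefCell::new(wgpu::util::StagingBelt::new(1 << 20));
        }

        /// Copies `data` into `target` through this thread's private StagingBelt,
        /// recording the copy into a CommandEncoder that the caller also keeps
        /// per thread.
        fn stage_write(
            device: &wgpu::Device,
            encoder: &mut wgpu::CommandEncoder,
            target: &wgpu::Buffer,
            offset: wgpu::BufferAddress,
            data: &[u8],
        ) {
            BELT.with(|belt| {
                let mut belt = belt.borrow_mut();
                let size = wgpu::BufferSize::new(data.len() as u64).expect("non-empty write");
                belt.write_buffer(encoder, target, offset, size, device)
                    .copy_from_slice(data);
                // The belt still has to be finish()ed before the encoder is submitted,
                // and recall()ed once the submission completes.
            });
        }

Even so, the copy recorded here still funnels through the global command-buffer lock in wgpu-core, which is why the gain was only marginal.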

Furthermore, my program also calls create_bind_group concurrently. This also acquires a global lock in the Vulkan backend:

        let mut vk_sets = self.desc_allocator.lock().allocate(

I locally addressed these issues by

  1. Storing command buffers inside UnsafeCells inside hub.command_buffers and replacing most hub.command_buffers.write calls with read calls.
  2. Using thread-local descriptor allocators in the Vulkan backend (see the sketch below).

Together these changes reduced the runtime of the encoding step by 50%.
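
A rough sketch of the thread-local allocator pattern from item 2, using a hypothetical DescriptorAllocator type as a stand-in for the backend's real one (the actual gfx-hal/wgpu types differ):

        use std::cell::RefCell;

        /// Hypothetical stand-in for the Vulkan backend's descriptor allocator;
        /// the real type has a different shape and calls vkAllocateDescriptorSets.
        struct DescriptorAllocator;
        struct DescriptorSet;

        impl DescriptorAllocator {
            fn new() -> Self {
                DescriptorAllocator
            }

            /// Allocates descriptor sets from pools owned exclusively by this allocator.
            fn allocate(&mut self, count: usize) -> Vec<DescriptorSet> {
                (0..count).map(|_| DescriptorSet).collect()
            }
        }

        thread_local! {
            // Each thread gets its own allocator, so create_bind_group no longer
            // has to take the process-wide lock seen in the desc_allocator.lock()
            // call quoted above.
            static DESC_ALLOCATOR: RefCell<DescriptorAllocator> =
                RefCell::new(DescriptorAllocator::new());
        }

        fn allocate_sets(count: usize) -> Vec<DescriptorSet> {
            DESC_ALLOCATOR.with(|alloc| alloc.borrow_mut().allocate(count))
        }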

Is using queue.write_buffer and device.create_bind_group concurrently supposed to scale, or is this a pattern that should be avoided?

cwfitzgerald (Member) commented:

They definitely should scale, as much as the underlying driver lets them. We are aware that we currently lock far too heavily, and there is an in-flight refactor that will help with most cases by locking only exactly the resources we need instead of the whole world.

write_buffer is going to need a bit of special work to get going, but it totally should be possible to call it in parallel.

        CommandEncoders are !Send + !Sync.

They actually no longer need to be, so this will change. Locking the world for the command encoder is still unnecessary, though; this will be addressed in the refactor.

mahkoh (Contributor, Author) commented Oct 6, 2021

Great. Another thing I noticed is that all StagingBelt writes seem to synchronize with each other on the GPU, causing the following timeline:

[Screenshot: GPU timeline capture showing the StagingBelt transfers executing serially]

cwfitzgerald (Member) commented:

This is a known problem, and unfortunately it isn't the easiest to solve generically. I've laid out some ideas in #1260 (see the "extras" section for discussion of write_buffer/transfers). Transfers are probably going to be the easiest case to solve, since determining whether two transfers are disjoint isn't terribly complicated.

mahkoh (Contributor, Author) commented Oct 7, 2021

I tried to understand why these barriers completely blocked other transfers even though they looked like fine-grained, buffer-local barriers. Then I found https://www.mail-archive.com/mesa-commit@lists.freedesktop.org/msg117032.html, which suggests that drivers don't really implement this part of Vulkan and fall back to full memory barriers.
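
For illustration, this is roughly what such a "fine-grained" buffer-local barrier looks like at the Vulkan level (a sketch using the ash crate's builder API, not wgpu's actual backend code). Per the linked thread, several drivers apparently ignore the buffer/offset/size scope and treat it like a full memory barrier anyway:

        use ash::vk;

        /// In the spec's model this only orders transfer writes to one range of
        /// `buffer` against later shader reads; in practice drivers may flush
        /// caches and drain the pipeline globally regardless.
        unsafe fn buffer_local_barrier(
            device: &ash::Device,
            cmd: vk::CommandBuffer,
            buffer: vk::Buffer,
            offset: vk::DeviceSize,
            size: vk::DeviceSize,
        ) {
            let barrier = vk::BufferMemoryBarrier::builder()
                .src_access_mask(vk::AccessFlags::TRANSFER_WRITE)
                .dst_access_mask(vk::AccessFlags::SHADER_READ)
                .src_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
                .dst_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
                .buffer(buffer)
                .offset(offset)
                .size(size)
                .build();

            device.cmd_pipeline_barrier(
                cmd,
                vk::PipelineStageFlags::TRANSFER,
                vk::PipelineStageFlags::VERTEX_SHADER,
                vk::DependencyFlags::empty(),
                &[],        // global memory barriers
                &[barrier], // buffer-local barriers, scoped to one buffer range
                &[],        // image barriers
            );
        }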

cwfitzgerald (Member) commented:

Yeah, barriers are basically just "should I flush the L1/L2 cache and wait for the pipeline to drain"; the hardware doesn't have fine-grained control. We need to batch barriers as much as it is valid to do so.
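
To illustrate what batching means here (a sketch with the ash crate, not wgpu's actual barrier tracker): instead of recording one vkCmdPipelineBarrier per transfer, the barriers for a group of disjoint transfers can be collected and recorded in a single call, paying the flush/drain cost once:

        use ash::vk;

        /// Records a single pipeline barrier covering many buffer ranges, rather
        /// than one cmd_pipeline_barrier call per transfer.
        unsafe fn batched_transfer_barrier(
            device: &ash::Device,
            cmd: vk::CommandBuffer,
            ranges: &[(vk::Buffer, vk::DeviceSize, vk::DeviceSize)], // (buffer, offset, size)
        ) {
            let barriers: Vec<vk::BufferMemoryBarrier> = ranges
                .iter()
                .map(|&(buffer, offset, size)| {
                    vk::BufferMemoryBarrier::builder()
                        .src_access_mask(vk::AccessFlags::TRANSFER_WRITE)
                        .dst_access_mask(vk::AccessFlags::SHADER_READ)
                        .src_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
                        .dst_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
                        .buffer(buffer)
                        .offset(offset)
                        .size(size)
                        .build()
                })
                .collect();

            device.cmd_pipeline_barrier(
                cmd,
                vk::PipelineStageFlags::TRANSFER,
                vk::PipelineStageFlags::VERTEX_SHADER,
                vk::DependencyFlags::empty(),
                &[],
                &barriers,
                &[],
            );
        }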

cwfitzgerald (Member) commented:

Obsoleted by #2710

teoxoy closed this as not planned on Oct 4, 2023.