Limited parallelism due to locking #2023

Closed · mahkoh opened this issue on Oct 6, 2021 · 6 comments
Labels: area: performance (How fast things go) · type: enhancement (New feature or request)

mahkoh (Contributor) commented Oct 6, 2021

I have a program that calls queue_write_buffer concurrently. Most of the time the threads are blocked with the following stack trace:

        [...]
        [ffffffff84545650] do_futex
        [ffffffff845465fe] __x64_sys_futex
        [ffffffff84eb813c] do_syscall_64
        [ffffffff8500007c] entry_SYSCALL_64_after_hwframe
        [7eff1731418d] syscall
        [55ad678f2968] parking_lot_core::thread_parker::imp::ThreadParker::futex_wait
        [55ad678f2968] <parking_lot_core::thread_parker::imp::ThreadParker as parking_lot_core::thread_parker::ThreadParkerT>::park
        [55ad678f2968] parking_lot_core::parking_lot::park::{{closure}}
        [55ad678f2968] parking_lot_core::parking_lot::with_thread_data
        [55ad678f2968] parking_lot_core::parking_lot::park
        [55ad678f2968] parking_lot::raw_rwlock::RawRwLock::lock_common
        [55ad678f2968] parking_lot::raw_rwlock::RawRwLock::lock_exclusive_slow
        [55ad67c492fe] <parking_lot::raw_rwlock::RawRwLock as lock_api::rwlock::RawRwLock>::lock_exclusive
        [55ad67c492fe] lock_api::rwlock::RwLock<R,T>::write
        [55ad67c492fe] wgpu_core::hub::Registry<T,I,F>::write
        [55ad67c492fe] wgpu_core::device::queue::<impl wgpu_core::hub::Global<G>>::queue_write_buffer
        [55ad67bf590d] <wgpu::backend::direct::Context as wgpu::Context>::queue_write_buffer
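
For context, a minimal sketch (not the original program) of the concurrent pattern described above. It assumes the wgpu Device and Queue are shared across threads via Arc; buffer sizes, usages, and the thread count are arbitrary:

        use std::sync::Arc;

        /// Sketch: several threads calling queue.write_buffer at the same time.
        /// wgpu handles are Send + Sync on the native backends.
        fn concurrent_writes(device: Arc<wgpu::Device>, queue: Arc<wgpu::Queue>) {
            const CHUNK: u64 = 64 * 1024;
            const THREADS: usize = 8;

            // One destination buffer per thread, so the writes are trivially disjoint.
            let buffers: Vec<wgpu::Buffer> = (0..THREADS)
                .map(|_| {
                    device.create_buffer(&wgpu::BufferDescriptor {
                        label: Some("upload target"),
                        size: CHUNK,
                        usage: wgpu::BufferUsages::COPY_DST | wgpu::BufferUsages::STORAGE,
                        mapped_at_creation: false,
                    })
                })
                .collect();

            std::thread::scope(|s| {
                for buffer in &buffers {
                    let queue = &queue;
                    s.spawn(move || {
                        let data = vec![0u8; CHUNK as usize];
                        // Each call takes a write lock on one of wgpu-core's hub
                        // registries, which is where the futex contention in the
                        // trace above comes from.
                        queue.write_buffer(buffer, 0, &data);
                    });
                }
            });
        }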

After switching to thread-local StagingBelts and CommandEncoders, the improvements were only marginal, because

        let (mut cmd_buf_guard, mut token) = hub.command_buffers.write(&mut token);

in command_encoder_copy_buffer_to_buffer acquires a global lock. This seems unnecessary since CommandEncoders are !Send + !Sync.
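
A sketch of the thread-local workaround mentioned above, assuming one wgpu::util::StagingBelt (and one CommandEncoder) per thread; the 1 MiB chunk size and the helper name are illustrative:

        use std::cell::RefCell;

        thread_local! {
            // One StagingBelt per thread, so belts never contend with each other.
            // The 1 MiB chunk size is an arbitrary choice for this sketch.
            static BELT: RefCell<wgpu::util::StagingBelt> =
                RefCell::new(wgpu::util::StagingBelt::new(1 << 20));
        }

        /// Copies `data` into `target` through this thread's private StagingBelt,
        /// recording the copy into a CommandEncoder that the caller also keeps
        /// per thread.
        fn stage_write(
            device: &wgpu::Device,
            encoder: &mut wgpu::CommandEncoder,
            target: &wgpu::Buffer,
            offset: wgpu::BufferAddress,
            data: &[u8],
        ) {
            BELT.with(|belt| {
                let mut belt = belt.borrow_mut();
                let size = wgpu::BufferSize::new(data.len() as u64).expect("non-empty write");
                belt.write_buffer(encoder, target, offset, size, device)
                    .copy_from_slice(data);
                // The belt still has to be finish()ed before the encoder is submitted,
                // and recall()ed once the submission completes.
            });
        }

Even so, the copy recorded here still funnels through the global command-buffer lock in wgpu-core, which is why the gain was only marginal.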

Furthermore, my program also calls create_bind_group concurrently. This also acquires a global lock in the Vulkan backend:

        let mut vk_sets = self.desc_allocator.lock().allocate(

I locally addressed these issues by

  1. Storing command buffers inside UnsafeCells inside hub.command_buffers and replacing most hub.command_buffers.write calls with read calls.
  2. Using thread-local descriptor allocators in the Vulkan backend (see the sketch below).

Together these changes reduced the runtime of the encoding step by 50%.
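
A rough sketch of the thread-local allocator pattern from item 2, using a hypothetical DescriptorAllocator type as a stand-in for the backend's real one (the actual gfx-hal/wgpu types differ):

        use std::cell::RefCell;

        /// Hypothetical stand-in for the Vulkan backend's descriptor allocator;
        /// the real type has a different shape and calls vkAllocateDescriptorSets.
        struct DescriptorAllocator;
        struct DescriptorSet;

        impl DescriptorAllocator {
            fn new() -> Self {
                DescriptorAllocator
            }

            /// Allocates descriptor sets from pools owned exclusively by this allocator.
            fn allocate(&mut self, count: usize) -> Vec<DescriptorSet> {
                (0..count).map(|_| DescriptorSet).collect()
            }
        }

        thread_local! {
            // Each thread gets its own allocator, so create_bind_group no longer
            // has to take the process-wide lock seen in the desc_allocator.lock()
            // call quoted above.
            static DESC_ALLOCATOR: RefCell<DescriptorAllocator> =
                RefCell::new(DescriptorAllocator::new());
        }

        fn allocate_sets(count: usize) -> Vec<DescriptorSet> {
            DESC_ALLOCATOR.with(|alloc| alloc.borrow_mut().allocate(count))
        }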

Is using queue.write_buffer and device.create_bind_group concurrently supposed to scale, or is this a pattern that should be avoided?

cwfitzgerald (Member) commented:

They definitely should scale, as much as the underlying driver lets them. We are aware that we currently lock far too heavily, and there is an in-flight refactor that will help with most cases by locking only exactly the resources we need instead of the whole world.

write_buffer is going to need a bit of special work to get going, but it totally should be possible to call it in parallel.

        CommandEncoders are !Send + !Sync.

They actually no longer need to be, so this will change. Locking the world for the command encoder is still unnecessary, though; this will be addressed in the refactor.

mahkoh (Contributor, Author) commented Oct 6, 2021

Great. Another thing I noticed is that all StagingBelt writes seem to synchronize with each other on the GPU, causing the following timeline:

[Screenshot: GPU timeline capture showing the StagingBelt transfers executing serially]

cwfitzgerald (Member) commented:

This is a known problem, and unfortunately it isn't the easiest to solve generically. I've laid out some ideas in #1260 (see the "extras" section for discussion of write_buffer/transfers). Transfers are probably going to be the easiest case to solve, since determining whether two transfers are disjoint isn't terribly complicated.

mahkoh (Contributor, Author) commented Oct 7, 2021

I tried to understand why these barriers completely blocked other transfers even though they looked like fine-grained, buffer-local barriers. Then I found https://www.mail-archive.com/mesa-commit@lists.freedesktop.org/msg117032.html, which suggests that drivers don't really implement this part of Vulkan and fall back to full memory barriers.
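
For illustration, this is roughly what such a "fine-grained" buffer-local barrier looks like at the Vulkan level (a sketch using the ash crate's builder API, not wgpu's actual backend code). Per the linked thread, several drivers apparently ignore the buffer/offset/size scope and treat it like a full memory barrier anyway:

        use ash::vk;

        /// In the spec's model this only orders transfer writes to one range of
        /// `buffer` against later shader reads; in practice drivers may flush
        /// caches and drain the pipeline globally regardless.
        unsafe fn buffer_local_barrier(
            device: &ash::Device,
            cmd: vk::CommandBuffer,
            buffer: vk::Buffer,
            offset: vk::DeviceSize,
            size: vk::DeviceSize,
        ) {
            let barrier = vk::BufferMemoryBarrier::builder()
                .src_access_mask(vk::AccessFlags::TRANSFER_WRITE)
                .dst_access_mask(vk::AccessFlags::SHADER_READ)
                .src_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
                .dst_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
                .buffer(buffer)
                .offset(offset)
                .size(size)
                .build();

            device.cmd_pipeline_barrier(
                cmd,
                vk::PipelineStageFlags::TRANSFER,
                vk::PipelineStageFlags::VERTEX_SHADER,
                vk::DependencyFlags::empty(),
                &[],        // global memory barriers
                &[barrier], // buffer-local barriers, scoped to one buffer range
                &[],        // image barriers
            );
        }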

cwfitzgerald (Member) commented:

Yeah, barriers are basically just "should I flush the L1/L2 cache and wait for the pipeline to drain"; the hardware doesn't have fine-grained control. We need to batch barriers as much as it is valid to do so.
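
To illustrate what batching means here (a sketch with the ash crate, not wgpu's actual barrier tracker): instead of recording one vkCmdPipelineBarrier per transfer, the barriers for a group of disjoint transfers can be collected and recorded in a single call, paying the flush/drain cost once:

        use ash::vk;

        /// Records a single pipeline barrier covering many buffer ranges, rather
        /// than one cmd_pipeline_barrier call per transfer.
        unsafe fn batched_transfer_barrier(
            device: &ash::Device,
            cmd: vk::CommandBuffer,
            ranges: &[(vk::Buffer, vk::DeviceSize, vk::DeviceSize)], // (buffer, offset, size)
        ) {
            let barriers: Vec<vk::BufferMemoryBarrier> = ranges
                .iter()
                .map(|&(buffer, offset, size)| {
                    vk::BufferMemoryBarrier::builder()
                        .src_access_mask(vk::AccessFlags::TRANSFER_WRITE)
                        .dst_access_mask(vk::AccessFlags::SHADER_READ)
                        .src_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
                        .dst_queue_family_index(vk::QUEUE_FAMILY_IGNORED)
                        .buffer(buffer)
                        .offset(offset)
                        .size(size)
                        .build()
                })
                .collect();

            device.cmd_pipeline_barrier(
                cmd,
                vk::PipelineStageFlags::TRANSFER,
                vk::PipelineStageFlags::VERTEX_SHADER,
                vk::DependencyFlags::empty(),
                &[],
                &barriers,
                &[],
            );
        }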

cwfitzgerald (Member) commented:

Obsoleted by #2710

teoxoy closed this as not planned on Oct 4, 2023.