wgpu panics with "Epoch mismatch for .." #1996

Closed
AnotherNathan opened this issue Sep 25, 2021 · 1 comment · Fixed by #1999

@AnotherNathan

Description
I really do not know what else to say besides: it crashes.
My only guess is that it is a concurrency issue, since removing the extra thread (see the example) makes the panic go away.

Repro steps
Run the example. You may have to run it more than once, since the panic does not happen consistently.

Expected vs observed behavior
It should run without any issues.

Extra materials

Example: wgpu-epoch-bug.zip

Trace: trace.zip

Panic message and backtrace:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `2`,
 right: `1`: Epoch mismatch for Valid((1, 1, Vulkan))', C:\Users\BergerN\.cargo\registry\src\github.com-1ecc6299db9ec823\wgpu-core-0.10.4\src\track\mod.rs:241:21
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/c8dfcfe046a7680554bf4eb612bad840e7631c4b\/library\std\src\panicking.rs:515
   1: core::panicking::panic_fmt
             at /rustc/c8dfcfe046a7680554bf4eb612bad840e7631c4b\/library\core\src\panicking.rs:92
   2: core::fmt::Arguments::new_v1
             at /rustc/c8dfcfe046a7680554bf4eb612bad840e7631c4b\/library\core\src\fmt\mod.rs:341
   3: core::panicking::assert_failed_inner
             at /rustc/c8dfcfe046a7680554bf4eb612bad840e7631c4b\/library\core\src\panicking.rs:154
   4: core::panicking::assert_failed<u32,u32>
             at /rustc/c8dfcfe046a7680554bf4eb612bad840e7631c4b\library\core\src\panicking.rs:117
   5: wgpu_core::track::ResourceTracker<core::marker::PhantomData<wgpu_core::id::Id<wgpu_core::resource::TextureView<wgpu_hal::empty::Api> > > >::remove_abandoned<core::marker::PhantomData<wgpu_core::id::Id<wgpu_core::resource::TextureView<wgpu_hal::empty::Api>
             at C:\Users\BergerN\.cargo\registry\src\github.com-1ecc6299db9ec823\wgpu-core-0.10.4\src\track\mod.rs:241
   6: wgpu_core::device::life::LifetimeTracker<wgpu_hal::vulkan::Api>::triage_suspected<wgpu_hal::vulkan::Api,wgpu_core::hub::IdentityManagerFactory>
             at C:\Users\BergerN\.cargo\registry\src\github.com-1ecc6299db9ec823\wgpu-core-0.10.4\src\device\life.rs:388
   7: wgpu_core::device::Device<wgpu_hal::vulkan::Api>::maintain<wgpu_hal::vulkan::Api,wgpu_core::hub::IdentityManagerFactory>
             at C:\Users\BergerN\.cargo\registry\src\github.com-1ecc6299db9ec823\wgpu-core-0.10.4\src\device\mod.rs:374
   8: wgpu_core::hub::Global<wgpu_core::hub::IdentityManagerFactory>::queue_submit<wgpu_core::hub::IdentityManagerFactory,wgpu_hal::vulkan::Api>
             at C:\Users\BergerN\.cargo\registry\src\github.com-1ecc6299db9ec823\wgpu-core-0.10.4\src\device\queue.rs:816
   9: wgpu::backend::direct::impl$3::queue_submit<core::iter::adapters::map::Map<core::option::IntoIter<wgpu::CommandBuffer>,wgpu::impl$59::submit::closure$0> >
             at C:\Users\BergerN\.cargo\registry\src\github.com-1ecc6299db9ec823\wgpu-0.10.2\src\backend\direct.rs:2074
  10: wgpu::Queue::submit<enum$<core::option::Option<wgpu::CommandBuffer>, 1, 18446744073709551615, Some> >
             at C:\Users\BergerN\.cargo\registry\src\github.com-1ecc6299db9ec823\wgpu-0.10.2\src\lib.rs:3042
  11: wgpu_epoch_bug::main
             at .\src\main.rs:145
  12: core::ops::function::FnOnce::call_once<void (*)(),tuple$<> >
             at /rustc/c8dfcfe046a7680554bf4eb612bad840e7631c4b\library\core\src\ops\function.rs:227

Platform
Windows 10
wgpu-0.10 Vulkan backend
RX 570 driver version 21.6.1
Ryzen 7 3700X

@kvark kvark added the type: bug Something isn't working label Sep 25, 2021
@kvark kvark self-assigned this Sep 25, 2021
@kvark
Member

kvark commented Sep 25, 2021

This is a very interesting issue. Thank you for reducing the test case; it's hugely important for us to get this right!
The general logic for dropping resources looks like this:

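pub fn xxx_drop<A: HalApi>(&self, id) {
    let hub = A::hub(self);
    let mut token = Token::root();

    // First half: under the resource storage write lock,
    // release the "user" ref count.
    let (last_submit_index, device_id) = {
        let (mut guard, _) = hub.xxx.write(&mut token);
        match guard.get_mut(id) {
            Ok(resource) => {
                ...
            }
        }
    };
    // The storage lock is released here, but the life tracker is not locked yet.

    // Second half: queue the ID on the device's suspected list.
    let (device_guard, mut token) = hub.devices.read(&mut token);
    let device = &device_guard[device_id];
    device
        .lock_life(&mut token)
        .suspected_resources
        .xxx
        .push(id::Valid(id));
}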

What happened here is:

  1. The drop quickly gets through the first half, where we remove the "user" ref count.
  2. Before it reaches the lock_life() step, the resource has already been fully removed.
  3. So we add this ID to the suspect list, even though the ID is already out of sync with the device tracker.
  4. Next, triage_suspected sees this stale ID and trips the epoch assertion (the panic above).
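
The interleaving above can be replayed deterministically in a small model. This is only a sketch and not wgpu-core code: `Id`, `Tracker`, `remove_strict`, and `replay_race` are hypothetical names, the two threads are simulated by running their steps in the racy order on one thread, and the strict check returns an error instead of panicking so the result can be inspected. The epoch values (1 for the stale ID, 2 in the tracker) are chosen to echo the panic message.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a resource ID: a slot index plus an epoch
/// that is bumped whenever the slot is reused.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Id {
    index: u32,
    epoch: u32,
}

/// Hypothetical tracker: maps a slot index to the epoch currently stored there.
#[derive(Default)]
struct Tracker {
    live: HashMap<u32, u32>,
}

impl Tracker {
    fn insert(&mut self, id: Id) {
        self.live.insert(id.index, id.epoch);
    }

    /// Strict removal: treats an epoch mismatch as a hard error.
    /// Returns Ok(true) if removed, Ok(false) if the slot is not tracked.
    fn remove_strict(&mut self, id: Id) -> Result<bool, String> {
        match self.live.get(&id.index).copied() {
            Some(epoch) if epoch != id.epoch => Err(format!(
                "epoch mismatch for {:?}: tracker holds epoch {}",
                id, epoch
            )),
            Some(_) => {
                self.live.remove(&id.index);
                Ok(true)
            }
            None => Ok(false),
        }
    }
}

/// Replays steps 1-4 in the racy order on a single thread.
fn replay_race() -> Result<bool, String> {
    let mut tracker = Tracker::default();
    let mut suspected: Vec<Id> = Vec::new();

    let old = Id { index: 1, epoch: 1 };
    tracker.insert(old);

    // Step 1 (thread A): the first half of the drop finishes.
    // Step 2 (thread B): the resource is fully removed in the gap...
    tracker.remove_strict(old)?;
    // ...and the slot is reused by a new resource with a bumped epoch.
    tracker.insert(Id { index: 1, epoch: 2 });

    // Step 3 (thread A): the second half pushes the now-stale ID.
    suspected.push(old);

    // Step 4: the next triage looks at the stale ID.
    let stale = suspected.pop().expect("one suspected ID was pushed");
    tracker.remove_strict(stale)
}

fn main() {
    match replay_race() {
        Ok(removed) => println!("removed: {}", removed),
        Err(e) => println!("triage failed: {}", e),
    }
}
```

In this model the final strict check fails because the tracker's slot 1 now belongs to a newer resource, which is the same shape of failure as the assertion in the backtrace.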

I can think of two possible solutions here: either lock the life tracker more aggressively, which would mean more locking on every resource drop (yikes!), or just ignore suspected IDs that aren't relevant any more. I think the latter is an OK way to go for now, and we should revisit this if/once the Arcanization work lands (driven by @pythonesque).
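
As a rough illustration of the second option, a triage pass could skip any suspected ID whose epoch no longer matches what the tracker holds. This is a sketch under assumptions, not the actual fix: `Id` and this `triage_suspected` are hypothetical, simplified stand-ins, and the tracker is modeled as a plain map from slot index to current epoch.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a resource ID (slot index plus epoch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Id {
    index: u32,
    epoch: u32,
}

/// Hypothetical triage pass: drains the suspected list and removes only
/// the IDs whose epoch still matches the tracker. Stale IDs are skipped
/// instead of being treated as a fatal error.
fn triage_suspected(live: &mut HashMap<u32, u32>, suspected: &mut Vec<Id>) -> Vec<Id> {
    let mut removed = Vec::new();
    for id in suspected.drain(..) {
        if live.get(&id.index) == Some(&id.epoch) {
            live.remove(&id.index);
            removed.push(id);
        }
        // Otherwise the slot is gone or was reused by a newer resource,
        // so the suspected ID is no longer relevant and is ignored.
    }
    removed
}

fn main() {
    // Slot 1 was reused (epoch 2); slot 7 still holds its original epoch.
    let mut live = HashMap::from([(1, 2), (7, 1)]);
    let mut suspected = vec![Id { index: 1, epoch: 1 }, Id { index: 7, epoch: 1 }];

    let removed = triage_suspected(&mut live, &mut suspected);
    println!("removed: {:?}, still tracked: {:?}", removed, live);
}
```

The trade-off is the one described above: no extra locking on the drop path, at the cost of tolerating stale entries in the suspected list.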

cwfitzgerald pushed a commit that referenced this issue Oct 25, 2023
Fixes #1745: Support out-of-order module scope declarations in WGSL
Fixes #1044: Forbid local variable shadowing in WGSL
Fixes #2076: [wgsl-in] no error for duplicated type definition
Fixes #2071: Global item does not support 'const'
Fixes #2105: [wgsl-in] Type aliases for a vecN<T> doesn't work when constructing vec from a single argument
Fixes #1775: Referencing a function without a return type yields an unknown identifier error.
Fixes #2089: Error span reported on the declaration of a variable instead of its use
Fixes #1996: [wgsl-in] Confusing error: "expected unsigned/signed integer literal, found '1'"

Separate parsing from lowering by generating an AST, which desugars as
much as possible down to something like Naga IR. The AST is then used
to resolve identifiers while lowering to Naga IR.

Co-authored-by: Teodor Tanasoaia <28601907+teoxoy@users.noreply.github.com>
Co-authored-by: Jim Blandy <jimb@red-bean.com>