Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Probably naga bug in multiDrawIndirectCount and multiDrawIndirectIndexedCount #592

Open
5 tasks
fyellin opened this issue Sep 20, 2024 · 21 comments
Open
5 tasks
Labels
bug Something isn't working

Comments

@fyellin
Copy link
Contributor

fyellin commented Sep 20, 2024

AK: problems fixed, now to get the fixes to our users:

  • Wait for gfx-rs/wgpu to create a new release (out of our control).
  • Have wgpu-native update to that release (depending on Rust skills, can help do this if changes are not too hard).
  • Release wgpu-native.
  • Update wgpu-py to new wgpu-native.
  • Release wgpu-py.

After implementing multi_draw_indirect and multi_draw_indexed_indirect, I decided to try the ..._count version of these functions. These functions do not exist on the Mac, but I could get use my test_wgpu_vertex_instance.py file to test my implementation on the GitHub linux machines.

The implementation is straightforward. In extra.py

def multi_draw_indirect_count(
     render_pass_encoder, buffer,*,offset=0,count_buffer, count_buffer_offset=0, max_count
):
    render_pass_encoder._multi_draw_indirect_count(
        buffer, offset, count_buffer, count_buffer_offset, max_count
    )
def multi_draw_indexed_indirect_count(
     render_pass_encoder, buffer,*,offset=0,count_buffer, count_buffer_offset=0, max_count
):
    render_pass_encoder._multi_draw_indexed_indirect_count(
        buffer, offset, count_buffer, count_buffer_offset, max_count
    )

and in api.py:

def _multi_draw_indirect_count(
        self, buffer, offset, count_buffer, count_buffer_offset, max_count
):
        libf.wgpuRenderPassEncoderMultiDrawIndirectCount(
            self._internal, buffer._internal, int(offset), count_buffer._internal, int(count_buffer_offset), int(max_count),
        )

and identically for multi_draw_indexed_indirect_count adding _indexed in the function name and function call.

This definition of these functions are identical to the the non _count versions of the function, except for the determination of the number of draws to perform. That number is the minimum of max_count and the u32 value at byte offset count_buffer_offset in count_buffer.

The format of the buffer is supposed to be the same as the non_count versions of these functions. In particular for the non-indexed version, they are supposed to be packed groups of 4 u32s, starting at byte offset offset. Each group of 4 is
[vertex_count, instance_count, first_vertex, first_index]. So my vertex/instance test should generate the set

{ itertools.product(range(first_vertex, first_vertex + vertex_count), range(first_instance, first_instance + instance_count))}

for each of these groups of four. This works correctly on all functions so far.

I've discovered two anomalies running tests on github.

First, the contents of count_buffer is completely ignored. Instead, the count value is fetched from count_buffer_offset in buffer. I double checked that there wasn't a bug in my code and that I wasn't passing buffer as an argument where count_buffer was expected.

I rewrote my tests to put the count into an unused space in buffer. The _indexed_ version worked, but the non-_indexed_ version continued to fail.
It seems to be completely mis-parsing the arguments in buffer. It is taking each group of five numbers (not four) [a, b, c, d, e] and then acting as if these were [1, b, d, e] as described above. Then after visiting the (vertex, instance) pairs as described above, it throws in some extra (vertex, instance) pairs that I can't make any sense of. [I did this by creating the buffer 1, 2, 3, .... 100 and then trying to make sense of the results for various offsets.]

I was hoping I could dig through the naga source code and find where the actual count is determined and where the parsing of the values in the buffer occurred. I completely failed. I was hoping to either submit a bug report saying "here's the mistake". I don't know Rust well enough to generate a Rust test case.

I will submit as a PR my current code as a work-in-progress.

@fyellin fyellin added the bug Something isn't working label Sep 20, 2024
@Korijn
Copy link
Collaborator

Korijn commented Sep 20, 2024

If you can trigger the bug with a unit test in python, that would already be very useful in enabling others to investigate.

@fyellin
Copy link
Contributor Author

fyellin commented Sep 20, 2024

Thanks.

And now I think I'm going mad. After working on this for several hours, the code is suddenly working and my tests are passing. And I haven't (intentionally) changed anything! I'm going to close this bug for now and see if I can work out what has changed.

I was clearly tickling something to cause the bizarre behavior. I just have no idea what. Maybe I'll re-open this tomorrow.

@fyellin fyellin closed this as completed Sep 20, 2024
@almarklein
Copy link
Member

You hunted the bug, now the bug will hunt you 😉

@Vipitis
Copy link
Contributor

Vipitis commented Sep 20, 2024

I am not sure if related, but I encountered another fatal error where the rust stack trace points towards wgpu_render_bundle_draw_indexed_indirect when running this shader with wgpu-shadertoy.

Don't have the time to minimize and backtrack right now since a deadline is coming up on Monday - but might be some overlap?
Will have time next week to do some investigation myself and open an issue on all the panic/fatal errors

E: it's unrelated I rootcaused this to be something else

@fyellin fyellin reopened this Sep 20, 2024
@fyellin
Copy link
Contributor Author

fyellin commented Sep 20, 2024

The bug is still there. I had accidentally neutralized a test. I'll update my code to show that that first bug is real! Putting the count into the data buffer makes things work.

Where is the right place to report this bug?

@Vipitis
Copy link
Contributor

Vipitis commented Sep 20, 2024

if the issue is within naga or the rust side of wgpu, it will be https://github.com/gfx-rs/wgpu, otherwide wgpu-native if you can recreate and maybe fix it there.

@fyellin
Copy link
Contributor Author

fyellin commented Sep 20, 2024

@Vipitis Not sure if this helps, but I get get crashes in the Rust code when I forget to set the pipeline in a pass. In particular, forgetting to set the pipeline in a RenderEncoder (which don't inherit the pipeline from the surrounding code) can be an issue. Not sure if that's what's causing your problems.

@Vipitis
Copy link
Contributor

Vipitis commented Sep 20, 2024

@fyellin honestly not sure. Open an issue in wgpu with a minimal reproducer, and if they don't tag it in the next few days they might redirect you elsewhere.
I will have time on Monday to investigate which commit changed the behavior for the issue I encounter.

I posted a variety of translation issues with naga and some of them took half a year before getting addressed. Sadly I am not comfortable myself with rust to try and fix them. 😕

@fyellin
Copy link
Contributor Author

fyellin commented Sep 21, 2024

Just out of curiosity. When we test this on a Github Linux device, do we have a machine with an actually GPU in it, or are we testing it against an emulator? I could easily believe this was an error in the emulation code.

Looking for every occurrence of draw_indexed_indirect_count everywhere in the gfx-rx/ workspace, the only thing I find is functions calling other functions and passing along the arguments they were received. I haven't yet found where the code actually does something GPU specific

@Vipitis
Copy link
Contributor

Vipitis commented Sep 21, 2024

@fyellin currently all CI runners for this org are CPU only. So they use lava pipe to run graphics.

I think using real GPUs and other backend has been proposed, but not done yet: #459

If you believe there is an issue that only show sup locally on a real GPU but passes on CI that's definitely a possibility. We were having windows specific problems with surfaces in the recent wgpu-native update, and CI was all green.

@fyellin
Copy link
Contributor Author

fyellin commented Sep 21, 2024

@Vipitis: Just a hypothesis. I admit my search wasn't completely thorough, but I couldn't find any mistakes in the gfx-rs/ code. I have a Mac, and metal doesn't support those two instructions (yet). Any chance that someone with a real Linux box with a real GPU could try grabbing this code running the test tests/test_wgpu_vertex_instance.py, and letting me know what happens? At least we'd know if the bug was gfx-rs or the emulator.

@fyellin
Copy link
Contributor Author

fyellin commented Sep 21, 2024

@Vipitis @almarklein @Korijn

I've created a small test program that illustrates the bugs without any need to pull in my branch. It can be run on any machine that has a recent wgpu. If one of you has a Linux machine with a hardware GPU, would you mind running it and let me know what the output is? This would be definitive as to whether the problem is in gfx-rs or lavapipe.

The code is just a stripped down version of test_wgpu_vertex_instance, and I call libf.wgpuXXX directly. MultiDrawIndirect and MultiDrawIndirectCount should give the same results when the count of the former is the same as the computed count of the latter. This runs both and prints out both results.

And crap. I forgot github won't let me upload a file with extension .py. (Why?) I've renamed it to runner.txt, but it's really runner.py

runner.txt

@fyellin
Copy link
Contributor Author

fyellin commented Sep 23, 2024

Opened gfx-rs/wgpu-native#429

I should probably close this.

@almarklein
Copy link
Member

almarklein commented Sep 24, 2024

I ran the runner.py on a Linux machine. It has integrated graphics, so it does run on hardware (no emulation), but not a dedicated GPU. I'm not sure if this counts as a "real GPU".

Intel(R) UHD Graphics 730 (ADL-S GT1) (IntegratedGPU) via Vulkan
Running script: "/home/almar/Downloads/runner.py"
-------------------------
data=[0, 0, 1, 2, 3, 4, 5, 6, 7, 8], offset=8, count_data=[1], count_buffer_offset=0, max_count=1
DrawIndirect:      [[3, 4], [3, 5]]
DrawIndirectCount: []
-------------------------
data=[1, 0, 1, 2, 3, 4, 5, 6, 7, 8], offset=8, count_data=[0], count_buffer_offset=0, max_count=1
DrawIndirect:      [[3, 4], [3, 5]]
DrawIndirectCount: [[4, 5], [4, 6]]
-------------------------
data=[2, 0, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6], offset=8, count_data=[0], count_buffer_offset=0, max_count=2
DrawIndirect:      [[3, 4], [3, 5], [4, 5], [4, 6], [4, 7], [5, 5], [5, 6], [5, 7]]
DrawIndirectCount: [[3, 4], [3, 5], [3, 6], [3, 7], [4, 2], [4, 3]]
-------------------------
data=[2, 0, 1, 2, 3, 4, 0, 2, 3, 4, 5, 0, 3, 4, 5, 6, 0], offset=8, count_data=[0], count_buffer_offset=0, max_count=2
DrawIndirect:      [[3, 4], [3, 5]]
DrawIndirectCount: [[4, 0], [4, 1], [5, 0], [5, 1], [5, 2]]

I also ran it in software mode, by tweaking one line in your script:

        # adapter = wgpu.gpu.request_adapter(power_preference="high-performance")
        adapter = wgpu.gpu.enumerate_adapters()[1]
        print(adapter.summary)

This produces a result that is wrong in a different way, I think.

llvmpipe (LLVM 15.0.7, 256 bits) (CPU) via Vulkan
-------------------------
data=[0, 0, 1, 2, 3, 4, 5, 6, 7, 8], offset=8, count_data=[1], count_buffer_offset=0, max_count=1
DrawIndirect:      [[3, 4], [3, 5]]
DrawIndirectCount: []
-------------------------
data=[1, 0, 1, 2, 3, 4, 5, 6, 7, 8], offset=8, count_data=[0], count_buffer_offset=0, max_count=1
DrawIndirect:      [[3, 4], [3, 5]]
DrawIndirectCount: [[3, 4], [3, 5]]
-------------------------
data=[2, 0, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6], offset=8, count_data=[0], count_buffer_offset=0, max_count=2
DrawIndirect:      [[3, 4], [3, 5], [4, 5], [4, 6], [4, 7], [5, 5], [5, 6], [5, 7]]
DrawIndirectCount: [[3, 4], [3, 5], [5, 3], [5, 4], [5, 5], [5, 6], [6, 3], [6, 4], [6, 5], [6, 6], [7, 3], [7, 4], [7, 5], [7, 6]]
-------------------------
data=[2, 0, 1, 2, 3, 4, 0, 2, 3, 4, 5, 0, 3, 4, 5, 6, 0], offset=8, count_data=[0], count_buffer_offset=0, max_count=2
DrawIndirect:      [[3, 4], [3, 5]]
DrawIndirectCount: [[3, 4], [3, 5], [4, 5], [4, 6], [4, 7], [5, 5], [5, 6], [5, 7]]

@almarklein
Copy link
Member

Let me know if you have other code to run.

@Vipitis
Copy link
Contributor

Vipitis commented Sep 26, 2024

At that point you hit the Graphics APIs (Vulkan, metal, dx12) and have to read up on their specs.

@fyellin
Copy link
Contributor Author

fyellin commented Sep 26, 2024

I deleted my comment because I realized I was in Windows Land.

I think I've given up on finding the bug. I think I have ample evidence that it is occurring, though.

@fyellin
Copy link
Contributor Author

fyellin commented Sep 26, 2024

I am curious if there are unit tests for the Rust code (or for the C interface). Maybe I can see something I'm missing.

@fyellin
Copy link
Contributor Author

fyellin commented Sep 28, 2024

Bug has been found. Unfortunately, it's in two different workspaces.

Count being pulled from the wrong buffer:
wgpu-core/src/command/render.rs.
In two different places the code has:

     let count_buffer = buffers
            .get_owned(buffer_id)
            .map_err(|_| RenderPassErrorInner::InvalidBuffer(count_buffer_id))
            .map_pass_err(scope)?;

The second line needs to have buffer_id replaced with count_buffer_id.

The second issue is more straightforward.

In wgpu-native/src/lib.rs, the implementation of multiDrawIndirectCount was calling multiDrawIndexedIndirectCount. It was staring me in the face and I didn't see it.

Apparently no one is using the multi-draw-...count commands

@fyellin
Copy link
Contributor Author

fyellin commented Sep 28, 2024

I've submitted a PR to wgpu-native.
gfx-rs/wgpu-native#432

The change to wgpu had already been fixed and was submitted last month
gfx-rs/wgpu#6194
Can we make sure that this change makes it into our release?

@almarklein
Copy link
Member

So it looks like all bugs are fixed and now its a matter of getting upstream releases. I added a small tasklist to the top post.

I think gfx-rs/wgpu-native#429 can be closed, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants