Does not work on Raspberry Pi 4 #14

Open
turol opened this issue Jan 10, 2023 · 10 comments

Comments

@turol
Contributor

turol commented Jan 10, 2023

Does not run on Raspberry Pi 4. Fails with:

maxColorAttachments >= 8 required!

maxColorAttachments is only 4. It should only refuse to run the tests which actually require that many attachments.
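
The per-test gating requested above could look roughly like the following. This is a minimal sketch of the idea only, not vkoverhead's actual code (which is C): the `BenchCase` type, the `split_runnable` helper, and the 8-attachment test name are hypothetical, and the device limit is passed in as a plain integer rather than queried from Vulkan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchCase:
    name: str
    color_attachments: int  # number of color attachments the test binds at once


def split_runnable(tests, max_color_attachments):
    """Partition tests into (runnable, skipped) instead of aborting the whole run."""
    runnable = [t for t in tests if t.color_attachments <= max_color_attachments]
    skipped = [t for t in tests if t.color_attachments > max_color_attachments]
    return runnable, skipped


# Hypothetical test table; the last name and all counts are illustrative only.
TESTS = [
    BenchCase("draw", 1),
    BenchCase("draw_vertex", 1),
    BenchCase("renderpass_8_attachments", 8),
]

# 4 is the maxColorAttachments value reported for the Raspberry Pi 4 above.
runnable, skipped = split_runnable(TESTS, max_color_attachments=4)
for t in skipped:
    print(f"skipping {t.name}: needs {t.color_attachments} color attachments")
```

With a limit of 4, only the test that needs 8 attachments is skipped and the rest still run.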

@zmike zmike closed this as completed in f41b399 Jan 10, 2023
@turol
Contributor Author

turol commented Jan 11, 2023

You didn't actually remove the check from renderpass.c

@turol
Contributor Author

turol commented Jan 11, 2023

Manually removing the check still doesn't work. Now it crashes with an assertion failure inside the driver:

../src/broadcom/vulkan/v3dvx_pipeline.c:86: pack_blend: Assertion `pipeline->subpass->color_count == cb_info->attachmentCount' failed.

However I can't tell if this is a vkoverhead problem or a driver problem because the OS comes with an old version of the validation layer which doesn't understand several extensions and compiling it myself is a massive PITA.

zmike added a commit that referenced this issue Jan 11, 2023
and remove the abort() this time

ref #14
@zmike
Owner

zmike commented Jan 11, 2023

Should be fixed now.

@turol
Contributor Author

turol commented Jan 12, 2023

Well it starts running now but eventually crashes in test 6.

Failed to allocate device memory for BO
vkoverhead: ../src/broadcom/vulkan/v3dv_cmd_buffer.c:1755: v3dv_cmd_buffer_subpass_resume: Assertion `subpass_idx < cmd_buffer->state.pass->subpass_count' failed.

Running test 6 alone passes. Again not sure if vkoverhead or driver bug and at this point I'm not going to debug it any further.

@zmike zmike reopened this Jan 12, 2023
@itoral

itoral commented Jan 16, 2023

I am looking into this from the perspective of v3dv. The issue is not specific to any test in particular, but to accumulated memory usage over time. As far as I can see, there is an ever increasing number of BOs being allocated. At the point it fails to allocate, I am seeing almost 32K BOs allocated that take ~200MB of memory. 200MB is not too much, but I think 32K BOs might be hitting some limits for the number of BO handles we can allocate in the kernel. I'd have to confirm this.

With that said, I wonder if this ever increasing BO allocation number is expected or may point to vkoverhead leaking GPU resources from the tests.

FWIW, if I run vkoverhead on my Intel laptop it progresses much further, but ends up crashing too (not sure if for the same reason though).

@zmike
Owner

zmike commented Jan 16, 2023

It'd be interesting to know what's creating so many BOs, whether it's just command stream recording or something else. I don't think vkoverhead itself creates anywhere near that many?

@itoral

itoral commented Jan 17, 2023

We're still looking into it but vkoverhead does create some pretty large command buffers and doesn't seem to immediately release these resources after each test, so I think this is in line with the growing number of BOs we see.

With that said, I think there might be some issue within the kernel side that is causing us to fail BO allocation without an obvious reason that we are trying to track down.

@itoral

itoral commented Jan 19, 2023

I have some more info to share:

First, memory requirements from vkoverhead can be quite high with some tests. In particular, the render pass tests hit a worst-case scenario for us, since they create a render pass for each draw call and record many thousands of these commands into each command buffer. For us, each render pass requires allocating some BOs, so the BO count and memory usage blow up. At the point of failure I have seen it reach between 20K and 30K BOs and up to 1.9GB of memory just for BOs.

This is made worse by the fact that we also have a BO cache in the user-space driver to help with performance. The BO cache is freed if the kernel fails to allocate a BO, so in theory it should not add to the problem. In practice, however, when we run with the cache enabled we end up failing to allocate even after freeing the BO cache completely. This is surprising because when we disable the BO cache from the start we don't run into this problem. I was, in fact, able to complete execution of vkoverhead by disabling this cache with the following environment variable:

V3DV_MAX_BO_CACHE_SIZE=0
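
For anyone scripting runs, one way to apply this setting is to build the environment explicitly and pass it to the benchmark process. The sketch below assumes the binary is at `./vkoverhead` (a guess; adjust to your build), and the helper names are made up for illustration. Only the variable name and value come from the comment above.

```python
import os
import subprocess


def env_without_bo_cache(base_env=None):
    """Return a copy of the environment with the v3dv BO cache disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["V3DV_MAX_BO_CACHE_SIZE"] = "0"
    return env


def run_vkoverhead(binary="./vkoverhead", extra_args=()):
    """Launch the benchmark with the BO cache disabled.

    The binary path is an assumption; extra_args is passed through unchanged.
    """
    return subprocess.run([binary, *extra_args], env=env_without_bo_cache(), check=False)
```

Calling `run_vkoverhead()` is equivalent to prefixing the command with the variable in a shell; the caller's own environment is left untouched.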

For what it's worth, we have not been able to reproduce the problem if we use an upstream kernel, so the issue may be related to Raspberry Pi's downstream kernel changes (we are investigating this).

With all that said, I do have a few suggestions for vkoverhead:

  1. vkoverhead could do a better job cleaning up at exit. I was trying to use valgrind to check for memory leaks and the report is pretty much useless because there are tons of leaks reported due to vkoverhead not cleaning up on exit (i.e. destroying the device, command pools, pipelines, descriptors, etc).

  2. vkoverhead doesn't always end recording in a command buffer by calling vkEndCommandBuffer(). I think it might be a good idea to ensure this is always called. Particularly, because by doing this vkoverhead could check the return value to identify if there has been an OOM during command buffer recording (assuming drivers are robust enough to not crash in that scenario of course). If vkEndCommandBuffer() returns OOM, then vkoverhead could, for example, immediately free all its command pools and retry the test.
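
The control flow suggested in point 2 can be sketched as follows. This is a language-neutral model written in Python, not vkoverhead code: `record_and_end` and `free_command_pools` are hypothetical callbacks standing in for "record the test and return the result of vkEndCommandBuffer()" and "release all command pools". The numeric result codes are the values the Vulkan specification assigns to these `VkResult` names.

```python
# VkResult values as defined by the Vulkan specification.
VK_SUCCESS = 0
VK_ERROR_OUT_OF_HOST_MEMORY = -1
VK_ERROR_OUT_OF_DEVICE_MEMORY = -2

OOM_RESULTS = {VK_ERROR_OUT_OF_HOST_MEMORY, VK_ERROR_OUT_OF_DEVICE_MEMORY}


def run_test_with_oom_retry(record_and_end, free_command_pools, max_retries=1):
    """Record a test, and on OOM free the command pools and try again.

    record_and_end: hypothetical callback that records the command buffer and
        returns the VkResult from vkEndCommandBuffer().
    free_command_pools: hypothetical callback that releases pooled resources.
    Returns the last VkResult observed.
    """
    retries = 0
    while True:
        result = record_and_end()
        if result not in OOM_RESULTS or retries >= max_retries:
            return result
        free_command_pools()
        retries += 1
```

As the comment notes, this only helps if the driver reports the OOM through the return value instead of crashing during recording.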

@itoral

itoral commented Jan 19, 2023

BTW, I also observe some weird behavior with vkoverhead: when running without parameters, the draw_vertex test shows about 50% of the draw calls of the base draw test, which doesn't make sense since from the point of view of the driver there is no significant difference between the two. However, if I run the tests separately (using the -test parameter) both tests score similarly, which would be the expected result. This is with the CPU governor set to performance to avoid CPU throttling. The same occurs for other tests: they all score better in number of draw calls when they are executed standalone with the -test parameter.

UPDATE: this behavior seems specific to Raspberry Pi though, on my Intel laptop scores with -test are about the same as when running the whole suite.

@zmike
Owner

zmike commented Jan 19, 2023

Regarding memory requirements, I'm wondering if we might want to just use smaller iteration numbers on your builds? It could be a build-time configuration thing so that different embedded devices could tune the loops a bit--maybe only iterating 500 or 1000 times instead of 10-20x those numbers.

Freeing everything on exit is a bit cumbersome with the way the code is structured. Historically I've found vkoverhead leaks by checking for memory ballooning; with how fast the loops iterate, they show up pretty fast.

It's intentional that vkEndCommandBuffer is always called at the end of a command buffer. If there are cases where that's not happening then I need to fix them.

> BTW, I also observe a weird behavior with vkoverhead, when running without parameters I see that the draw_vertex test shows about 50% of the draw calls of the base draw test,

I haven't seen this behavior on any other driver.

mairacanal added a commit to mairacanal/linux-rpi that referenced this issue Jul 4, 2024
Currently, we are using an alignment of 128 kB to insert a node, which
ends up wasting memory as we perform plenty of small BO allocations
(<= 4 kB). We require that allocations are aligned to 128 kB, so for any
allocation smaller than that, we are wasting the difference.

This implies that we cannot effectively use the whole 4 GB address space
available for the GPU in the RPi 4. Currently, we can allocate up to
32000 BOs of 4 kB (~140 MB) and 3000 BOs of 400 kB (~1.3 GB). This can be
quite limiting for applications that have a high memory requirement, such
as vkoverhead [1].

By reducing the page alignment to 4 kB, we can allocate up to 1000000 BOs
of 4 kB (~4 GB) and 10000 BOs of 400 kB (~4 GB). Moreover, by performing
benchmarks, we were able to attest that reducing the page alignment to
4 kB can provide a general performance improvement in OpenGL
applications (e.g. glmark2).

Therefore, this patch reduces the alignment of the node allocation to 4
kB, which will allow RPi users to explore the whole 4 GB virtual
address space provided by the hardware. Also, this patch allows users to
fully run vkoverhead in the RPi 4/5, solving the issue reported in [1].

[1] zmike/vkoverhead#14

Signed-off-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
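
The small-BO figures in this commit message follow from simple address-space arithmetic, worked through below. This is an idealized model, not the kernel's allocator: it treats kB and GB as binary units (KiB, GiB) and assumes each BO occupies its size rounded up to the alignment. It counts address-space slots only, so it does not reproduce every figure quoted above. For 400 kB BOs at 128 kB alignment, for instance, it gives a higher bound than the roughly 3000 reported.

```python
KIB = 1024
GIB = 1024 ** 3


def align_up(size, alignment):
    """Round size up to the next multiple of alignment."""
    return -(-size // alignment) * alignment


def max_bos(address_space, bo_size, alignment):
    """Upper bound on BOs that fit when each one occupies an aligned slot."""
    return address_space // align_up(bo_size, alignment)


ADDRESS_SPACE = 4 * GIB  # GPU virtual address space quoted for the RPi 4
BO_SIZE = 4 * KIB        # the "small BO" case from the commit message

for alignment in (128 * KIB, 4 * KIB):
    count = max_bos(ADDRESS_SPACE, BO_SIZE, alignment)
    wasted = align_up(BO_SIZE, alignment) - BO_SIZE
    print(f"{alignment // KIB} KiB alignment: {count} BOs, "
          f"{wasted // KIB} KiB of address space wasted per BO")
```

Under these assumptions, 128 KiB alignment caps 4 KiB BOs at 32768 and 4 KiB alignment raises the cap to 1048576. Those line up with the commit message's "32000" and "1000000", and the first also matches the "almost 32K BOs" itoral saw at the point of failure.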
mairacanal added a commit to mairacanal/linux-rpi that referenced this issue Jul 12, 2024
mairacanal added a commit to mairacanal/linux-rpi that referenced this issue Aug 27, 2024
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this issue Aug 29, 2024
mairacanal added a commit to mairacanal/linux-rpi that referenced this issue Aug 29, 2024
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this issue Sep 23, 2024

3 participants