Does not work on Raspberry Pi 4 #14

turol · 2023-01-10T13:58:12Z

Does not run on Raspberry Pi 4. Fails with

maxColorAttachments >= 8 required!

maxColorAttachments is only 4. It should only refuse to run the tests which actually require that many attachments.
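A minimal sketch of the per-test check suggested above, assuming each test declares how many color attachments it needs. The `test_case` struct and the function names are hypothetical and are not taken from vkoverhead; the limit itself would come from `VkPhysicalDeviceLimits::maxColorAttachments`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical test descriptor; vkoverhead's real test table may differ. */
struct test_case {
   const char *name;
   uint32_t required_color_attachments;
};

/* A test can run if the device exposes at least as many color attachments
 * as the test needs. max_color_attachments is expected to come from
 * VkPhysicalDeviceLimits::maxColorAttachments. */
static bool
test_can_run(const struct test_case *test, uint32_t max_color_attachments)
{
   assert(test != NULL);
   return test->required_color_attachments <= max_color_attachments;
}

/* Count the tests that are runnable on this device, instead of refusing
 * to run anything as soon as one test cannot be supported. */
static size_t
count_runnable_tests(const struct test_case *tests, size_t num_tests,
                     uint32_t max_color_attachments)
{
   size_t runnable = 0;
   for (size_t i = 0; i < num_tests; i++) {
      if (test_can_run(&tests[i], max_color_attachments))
         runnable++;
   }
   return runnable;
}
```

With a limit of 4, as reported for the Raspberry Pi 4, only the tests that need more than four attachments would be skipped and the rest would still run.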


turol · 2023-01-11T09:55:42Z

You didn't actually remove the check from renderpass.c

turol · 2023-01-11T12:15:45Z

Manually removing the check still doesn't work. Now it crashes with an assertion failure inside the driver.

../src/broadcom/vulkan/v3dvx_pipeline.c:86: pack_blend: Assertion `pipeline->subpass->color_count == cb_info->attachmentCount' failed.

However, I can't tell whether this is a vkoverhead problem or a driver problem, because the OS comes with an old version of the validation layer that doesn't understand several extensions, and compiling it myself is a massive PITA.


zmike · 2023-01-11T15:01:54Z

Should be fixed now.

turol · 2023-01-12T10:39:00Z

Well it starts running now but eventually crashes in test 6.

Failed to allocate device memory for BO
vkoverhead: ../src/broadcom/vulkan/v3dv_cmd_buffer.c:1755: v3dv_cmd_buffer_subpass_resume: Assertion `subpass_idx < cmd_buffer->state.pass->subpass_count' failed.

Running test 6 alone passes. Again not sure if vkoverhead or driver bug and at this point I'm not going to debug it any further.

itoral · 2023-01-16T10:13:09Z

I am looking into this from the perspective of v3dv. The issue is not specific to any test in particular, but to accumulated memory usage over time. As far as I can see, there is an ever-increasing number of BOs being allocated. At the point it fails to allocate, I am seeing almost 32K BOs allocated that take ~200MB of memory. 200MB is not too much, but I think 32K BOs might be hitting some limit on the number of BO handles we can allocate in the kernel. I'd have to confirm this.

With that said, I wonder if this ever-increasing BO count is expected, or whether it points to vkoverhead leaking GPU resources from the tests.

FWIW, if I run vkoverhead on my Intel laptop it progresses much further, but ends up crashing too (not sure if for the same reason though).

zmike · 2023-01-16T14:31:55Z

It'd be interesting to know what's creating so many BOs, whether it's just command stream recording or something else. I don't think vkoverhead itself creates anywhere near that many?

itoral · 2023-01-17T11:25:37Z

We're still looking into it but vkoverhead does create some pretty large command buffers and doesn't seem to immediately release these resources after each test, so I think this is in line with the growing number of BOs we see.

With that said, I think there might be an issue on the kernel side that is causing BO allocation to fail for no obvious reason, and we are trying to track it down.

itoral · 2023-01-19T10:01:48Z

I have some more info to share:

First, memory requirements from vkoverhead can be quite high with some tests. In particular, the render pass tests hit a worst-case scenario for us, since they create a render pass for each draw call and record many thousands of these commands into each command buffer. For us, each render pass requires allocating some BOs, so the BO count and memory usage blow up. At the point of failure I have seen it reach between 20K and 30K BOs and up to 1.9GB of memory just for BOs.

This is made worse by the fact that we also have a BO cache in the user-space driver to help with performance. The BO cache is freed if the kernel fails to allocate a BO, so in theory it should not add to the problem. In practice, however, when we run with the cache enabled we end up failing to allocate even after freeing the BO cache completely. This is surprising because when we disable the BO cache from the start we don't run into this problem. I was, in fact, able to complete execution of vkoverhead by disabling this cache with the following environment variable:

V3DV_MAX_BO_CACHE_SIZE=0

For what it's worth, we have not been able to reproduce the problem if we use an upstream kernel, so the issue may be related to Raspberry Pi's downstream kernel changes (we are investigating this).

With all that said, I do have a few suggestions for vkoverhead:

vkoverhead could do a better job cleaning up at exit. I was trying to use valgrind to check for memory leaks and the report is pretty much useless because there are tons of leaks reported due to vkoverhead not cleaning up on exit (i.e. destroying the device, command pools, pipelines, descriptors, etc).
vkoverhead doesn't always end recording in a command buffer by calling vkEndCommandBuffer(). I think it might be a good idea to ensure this is always called. Particularly, because by doing this vkoverhead could check the return value to identify if there has been an OOM during command buffer recording (assuming drivers are robust enough to not crash in that scenario of course). If vkEndCommandBuffer() returns OOM, then vkoverhead could, for example, immediately free all its command pools and retry the test.
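The retry idea in the second suggestion could look roughly like the sketch below. It is written against hypothetical callbacks rather than real vkoverhead code: in a Vulkan program, `record` would finish with `vkEndCommandBuffer()` and map `VK_ERROR_OUT_OF_HOST_MEMORY` or `VK_ERROR_OUT_OF_DEVICE_MEMORY` to `RECORD_OOM`, and `release` might reset the command pools with `VK_COMMAND_POOL_RESET_RELEASE_RESOURCES_BIT`. The `fake_*` helpers are only stand-ins so the control flow can be exercised without a device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; a real program would derive
 * them from the VkResult returned by vkEndCommandBuffer(). */
enum record_result {
   RECORD_OK,
   RECORD_OOM,
   RECORD_FAILED,
};

/* Hypothetical callbacks: record_fn records and ends one command buffer,
 * release_fn frees resources held by the application's command pools. */
typedef enum record_result (*record_fn)(void *ctx);
typedef void (*release_fn)(void *ctx);

/* Record once; on OOM, release resources and retry up to max_retries times.
 * Any other result (success or a non-OOM failure) is returned immediately. */
static enum record_result
record_with_oom_retry(record_fn record, release_fn release, void *ctx,
                      unsigned max_retries)
{
   assert(record != NULL && release != NULL);

   enum record_result res = record(ctx);
   for (unsigned attempt = 0; res == RECORD_OOM && attempt < max_retries;
        attempt++) {
      release(ctx);
      res = record(ctx);
   }
   return res;
}

/* Stand-in state for demonstration: recording fails with OOM until the
 * resources have been released once. */
struct fake_state {
   bool memory_exhausted;
   unsigned releases;
};

static enum record_result
fake_record(void *ctx)
{
   struct fake_state *state = ctx;
   return state->memory_exhausted ? RECORD_OOM : RECORD_OK;
}

static void
fake_release(void *ctx)
{
   struct fake_state *state = ctx;
   state->memory_exhausted = false;
   state->releases++;
}
```

Calling `record_with_oom_retry(fake_record, fake_release, &state, 1)` on an exhausted `fake_state` releases once and then succeeds. As the suggestion itself notes, this only helps if the driver reports OOM instead of crashing during recording.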

itoral · 2023-01-19T11:12:48Z

BTW, I also observe a weird behavior with vkoverhead: when running without parameters, the draw_vertex test shows about 50% of the draw calls of the base draw test, which doesn't make sense since from the point of view of the driver there is no significant difference between the two. However, if I run the tests separately (using the -test parameter) both tests score similarly, which would be the expected result. This is with the CPU governor set to performance to avoid CPU throttling. The same occurs for other tests: they all score better in number of draw calls when they are executed standalone with the -test parameter.

UPDATE: this behavior seems specific to Raspberry Pi though; on my Intel laptop, scores with -test are about the same as when running the whole suite.

zmike · 2023-01-19T19:32:24Z

Regarding memory requirements, I'm wondering if we might want to just use smaller iteration numbers on your builds? It could be a build-time configuration thing so that different embedded devices could tune the loops a bit--maybe only iterating 500 or 1000 times instead of 10-20x those numbers.
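One way the build-time tuning mentioned above could be expressed is a compile-time ceiling on loop counts. This is only a sketch: the macro name `BENCH_MAX_ITERATIONS`, its default of 10000, and the helper function are assumptions for illustration and do not exist in vkoverhead as far as this thread shows.

```c
#include <assert.h>

/* Hypothetical macro; override it at build time, for example by passing
 * -DBENCH_MAX_ITERATIONS=500 to the compiler for a small embedded device. */
#ifndef BENCH_MAX_ITERATIONS
#define BENCH_MAX_ITERATIONS 10000u
#endif

/* Clamp the iteration count a test asks for to the build-time ceiling. */
static unsigned
effective_iterations(unsigned requested)
{
   assert(requested > 0);
   return requested < BENCH_MAX_ITERATIONS ? requested : BENCH_MAX_ITERATIONS;
}
```

A test loop would then iterate `effective_iterations(n)` times instead of `n`, so builds for devices with less memory record fewer commands per command buffer.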

Freeing everything on exit is a bit cumbersome with the way the code is structured. Historically I've found vkoverhead leaks by checking for memory ballooning; with how fast the loops iterate, they show up pretty fast.

It's intentional that vkEndCommandBuffer is always called at the end of a command buffer. If there are cases where that's not happening then I need to fix them.

> BTW, I also observe a weird behavior with vkoverhead, when running without parameters I see that the draw_vertex test shows about 50% of the draw calls of the base draw test,

I haven't seen this behavior on any other driver.

Currently, we are using an alignment of 128 kB to insert a node, which ends up wasting memory as we perform plenty of small BO allocations (<= 4 kB). We require that allocations are aligned to 128 kB, so for any allocation smaller than that, we are wasting the difference. This implies that we cannot effectively use the whole 4 GB address space available for the GPU in the RPi 4.

Currently, we can allocate up to 32000 BOs of 4 kB (~140 MB) and 3000 BOs of 400 kB (~1.3 GB). This can be quite limiting for applications that have a high memory requirement, such as vkoverhead [1].

By reducing the page alignment to 4 kB, we can allocate up to 1000000 BOs of 4 kB (~4 GB) and 10000 BOs of 400 kB (~4 GB). Moreover, by performing benchmarks, we were able to attest that reducing the page alignment to 4 kB can provide a general performance improvement in OpenGL applications (e.g. glmark2).

Therefore, this patch reduces the alignment of the node allocation to 4 kB, which will allow RPi users to explore the whole 4 GB virtual address space provided by the hardware. Also, this patch allows users to fully run vkoverhead in the RPi 4/5, solving the issue reported in [1].

[1] zmike/vkoverhead#14

Signed-off-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
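The 4 kB figures in the commit message can be checked with a little arithmetic. The sketch below computes an idealized upper bound on the number of BOs that fit in the address space when each BO is rounded up to the allocation alignment. It assumes binary units (KiB, GiB) and ignores fragmentation and any other users of the address space; the function names are made up for illustration and are not kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define KIB UINT64_C(1024)
#define GIB (UINT64_C(1024) * 1024 * 1024)

/* Round size up to the next multiple of alignment. */
static uint64_t
align_up(uint64_t size, uint64_t alignment)
{
   assert(alignment != 0);
   return (size + alignment - 1) / alignment * alignment;
}

/* Idealized upper bound on how many BOs of bo_size fit into address_space
 * when every BO occupies a multiple of alignment. */
static uint64_t
max_bo_count(uint64_t address_space, uint64_t bo_size, uint64_t alignment)
{
   assert(bo_size != 0);
   return address_space / align_up(bo_size, alignment);
}
```

For 4 KiB BOs in a 4 GiB address space, `max_bo_count` gives 32768 with 128 KiB alignment and 1048576 with 4 KiB alignment, in line with the roughly 32000 and 1000000 quoted above and with the "almost 32K BOs" observed at the point of failure earlier in the thread. The quoted 400 kB figure under 128 kB alignment is lower than this idealized bound, so it presumably reflects factors the sketch does not model.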

zmike closed this as completed in f41b399 Jan 10, 2023

zmike added a commit that referenced this issue Jan 11, 2023

more comprehensively handle maxColorAttachments < 8

5ae3a31

and remove the abort() this time ref #14

zmike reopened this Jan 12, 2023
