egl-wayland: Fix an unbounded array growth issue #120

dkorkmazturk · 2024-07-26T15:39:21Z

We do not seem to shrink the dynamically allocated streamImages array that stores the resources associated with the swapchain images. When destroy_stream_image() destroys an entry in this array, it simply resets the entry's fields to default/invalid values, and add_surface_image() later tries to recycle such entries. This approach is error-prone because valid and invalid entries are kept side by side, so every access has to check the entry's validity first. In addition, when explicit sync is in use, a bug in the entry destruction logic makes the array keep growing over time, especially during application window resizes.

To solve these problems, this change converts the dynamically allocated streamImages array into a linked list, which simplifies insertions and deletions. Each entry is removed from the list and deallocated as soon as it is no longer needed, so every entry in the list stays valid.
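
A minimal sketch of the "unlink and free in one step" idea described above. The struct and function names are hypothetical stand-ins, not the PR's actual code (which may well use a ready-made list such as libwayland's wl_list); the sketch uses a hand-rolled circular list only to stay dependency-free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for WlEglStreamImage; only the list links matter here. */
struct image_node {
    struct image_node *prev;
    struct image_node *next;
    int id; /* placeholder for the real per-image resources */
};

/* Circular list with a sentinel head, so insert and remove need no special cases. */
struct image_list {
    struct image_node head;
};

static void image_list_init(struct image_list *list)
{
    list->head.prev = &list->head;
    list->head.next = &list->head;
}

/* Allocate a new entry and link it at the front; returns NULL if allocation fails. */
static struct image_node *image_list_add(struct image_list *list, int id)
{
    struct image_node *node = calloc(1, sizeof(*node));
    if (!node)
        return NULL;
    node->id = id;
    node->next = list->head.next;
    node->prev = &list->head;
    list->head.next->prev = node;
    list->head.next = node;
    return node;
}

/* Unlink and free together, so an invalid entry never remains in the list. */
static void image_list_destroy_entry(struct image_node *node)
{
    node->prev->next = node->next;
    node->next->prev = node->prev;
    free(node);
}

/* Count the entries; every node reached here is valid by construction. */
static size_t image_list_length(const struct image_list *list)
{
    size_t n = 0;
    const struct image_node *it;
    for (it = list->head.next; it != &list->head; it = it->next)
        n++;
    return n;
}
```

Because destruction removes the node, traversal code no longer needs a per-entry validity check, and the list length tracks the number of live images instead of growing.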

Per-entry mutexes are replaced with a single mutex that guards access to the entire list, so that the list cannot be corrupted when it is accessed from multiple threads. Such concurrent access only happens when explicit sync is not in use. The critical sections protected by the new mutex are very small. To check whether this change introduces lock contention, the weston-simple-egl application was run with a swap interval of 0 on a Wayland compositor that does not support explicit sync, in order to create as much lock contention as possible. No measurable difference in performance was observed with this change applied.
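
For illustration, a small sketch of one mutex guarding a whole list with short critical sections. All names here are hypothetical; only the locking pattern is meant to mirror the description above, not the actual egl-wayland structures.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical list entry; "busy" stands in for per-image state. */
struct entry {
    struct entry *next;
    int key;
    int busy;
};

struct guarded_list {
    pthread_mutex_t mutex; /* one lock for the whole list, not one per entry */
    struct entry *first;
};

static void guarded_list_init(struct guarded_list *gl)
{
    pthread_mutex_init(&gl->mutex, NULL);
    gl->first = NULL;
}

/* Link a caller-owned entry at the front while holding the list lock. */
static void guarded_list_push(struct guarded_list *gl, struct entry *e)
{
    pthread_mutex_lock(&gl->mutex);
    e->next = gl->first;
    gl->first = e;
    pthread_mutex_unlock(&gl->mutex);
}

/* Mark the entry with the given key as no longer busy.
 * Returns 1 if the entry was found, 0 otherwise.
 * The lock is held only for the lookup and the flag update. */
static int guarded_list_release(struct guarded_list *gl, int key)
{
    int found = 0;
    struct entry *e;

    pthread_mutex_lock(&gl->mutex);
    for (e = gl->first; e != NULL; e = e->next) {
        if (e->key == key) {
            e->busy = 0;
            found = 1;
            break;
        }
    }
    pthread_mutex_unlock(&gl->mutex);
    return found;
}
```

With a handful of swapchain images per surface, the traversal under the lock is short, which is consistent with the observation that no contention was measurable.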

As a side effect, this change fixes a bug in the wlEglSurfaceCheckReleasePoints() function, which wrongly assumed that all entries in the streamImages array were valid. That bug caused us to pass destroyed DRM syncobjs to drmSyncobjTimelineWait(), which in some cases led to random application freezes because it prevented images from being released back to the EGL stream.

Another side effect is that the maximum number of entries in the list is now known when explicit sync is in use, which lets us avoid dynamically allocating the arrays of DRM syncobjs and timeline points in wlEglSurfaceCheckReleasePoints(). This fixes a memory leak that could occur if only one of these allocations failed.
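
A rough sketch of why a known upper bound removes the leak: with two separate heap allocations, a failure of the second one leaks the first unless the error path frees it, whereas fixed-size arrays have no such path. The MAX_IMAGES value, the struct, and the function name below are assumptions made for the sketch, not taken from the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_IMAGES 8 /* assumed bound for this sketch; the real value may differ */

/* Hypothetical per-image data: a syncobj handle and its release timeline point. */
struct image {
    uint32_t syncobj;
    uint64_t point;
    int attached;
};

/* Gather handle/point pairs of attached images into caller-provided
 * fixed-size arrays. Returns the number of pairs written.
 * No heap allocation happens, so there is nothing to leak on failure. */
static size_t collect_release_points(const struct image *images, size_t count,
                                     uint32_t handles[MAX_IMAGES],
                                     uint64_t points[MAX_IMAGES])
{
    size_t n = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        if (!images[i].attached)
            continue;
        if (n >= MAX_IMAGES)
            break; /* never write past the fixed-size arrays */
        handles[n] = images[i].syncobj;
        points[n] = images[i].point;
        n++;
    }
    return n;
}
```

A caller would declare both arrays on the stack with MAX_IMAGES elements and pass them in, so the function has a single exit path with no cleanup.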

amshafer

Thanks for implementing this! I'm happy that this cleaned up a lot of the ugly array handling.

amshafer · 2024-07-29T14:25:23Z

src/wayland-eglsurface.c

-                               DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE,
-                               &firstSignaled) != 0) {
-        goto end;
+    {


What's the purpose of this bracketed block now that the if condition is gone? Is it just so syncobjWaitRes has a bounded lifetime? If so I think we should get rid of the extra brackets and declare syncobjWaitRes above with the rest of the local variables.

I believe it is good practice to keep the scope of variables minimal. The rest of the code in this function does not need to know the returned value. But I do not mind removing this scope and moving the declaration of the variable if you prefer it that way.

Limiting variable scope is nice when you can do it, but not at the readability cost of disrupting the block structure.

Stylistically, this isn't something we really seem to do, but if you want to attain the same scope reduction, we could create a small helper function to wrap drmSyncobjTimelineWait. I think that could look nicer than this block scoping.

In this case, you could also remove syncobjWaitRes entirely and just check errno in your assert.

Checking errno makes sense instead of using a separate variable, which helps to clean this up. Thank you for all the suggestions.
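
To make the two suggestions in this thread concrete, here is a sketch of a small helper that keeps the result variable out of the caller and checks errno in an assert. The wait function is a hypothetical stand-in written for this example; its return and errno convention is an assumption and may not match drmSyncobjTimelineWait.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the real wait call. Assumed convention:
 * returns 0 and sets *first_signaled if any point is at or below
 * signaled_up_to, otherwise returns -1 with errno set to ETIME. */
static int fake_timeline_wait(const uint64_t *points, unsigned count,
                              uint64_t signaled_up_to,
                              uint32_t *first_signaled)
{
    unsigned i;

    for (i = 0; i < count; i++) {
        if (points[i] <= signaled_up_to) {
            *first_signaled = i;
            return 0;
        }
    }
    errno = ETIME;
    return -1;
}

/* Helper: the raw result lives only inside this function, so the caller
 * needs neither an extra variable nor a bare block scope. */
static bool any_point_signaled(const uint64_t *points, unsigned count,
                               uint64_t signaled_up_to,
                               uint32_t *first_signaled)
{
    int res = fake_timeline_wait(points, count, signaled_up_to, first_signaled);

    /* In this sketch, anything other than success or a timeout is treated
     * as a programming error. */
    assert(res == 0 || errno == ETIME);
    return res == 0;
}
```

The caller then reads as a plain `if (any_point_signaled(...))`, which keeps the block structure of the surrounding function intact.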

amshafer · 2024-07-29T14:27:41Z

src/wayland-eglsurface.c

+    pthread_mutex_lock(&surface->ctx.streamImagesMutex);
+
+    // Locate the corresponding WlEglStreamImage
+    {


Same question about the brackets here.

Similarly, scope reduction. The rest of the code does not need to know if the image is found or not.

amshafer · 2024-07-29T14:31:53Z

src/wayland-eglsurface.c

-    if (!image) {
-        goto fail_destroy_sync;
+        if (!found) {
+            pthread_mutex_unlock(&surface->ctx.streamImagesMutex);


I think it would make sense to move this pthread_mutex_unlock into fail_destroy_sync. This matches how we unlock the mutex in fail_release and also prevents us from accidentally forgetting to unlock it if we use fail_destroy_sync from a second location in the future.

I believe jumping to fail_release actually makes more sense since the eglImage here is already acquired. It also looks like fail_release needed some changes for how image is handled now.
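
A sketch of the pattern discussed here: the failure label owns both the unlock and the release, so every early exit that jumps to it leaves the mutex unlocked. The struct and function are hypothetical; only the label name fail_release is borrowed from the review.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical surface state for the sketch. */
struct surface_sketch {
    pthread_mutex_t mutex;
    const int *ids;   /* ids of the images currently in the list */
    size_t count;
    int acquired;     /* outstanding acquired images, observable by callers */
};

/* Acquire an image, then look it up under the lock.
 * Returns true if the image is known, false otherwise. */
static bool attach_image(struct surface_sketch *s, int wanted_id)
{
    bool found = false;
    size_t i;

    s->acquired++; /* stands in for acquiring the image before the lookup */
    pthread_mutex_lock(&s->mutex);

    for (i = 0; i < s->count; i++) {
        if (s->ids[i] == wanted_id) {
            found = true;
            break;
        }
    }
    if (!found)
        goto fail_release;

    /* ... the success path would use the located image here ... */
    pthread_mutex_unlock(&s->mutex);
    return true;

fail_release:
    /* Single place that unlocks and releases, so a second jump site added
     * later cannot forget either step. */
    pthread_mutex_unlock(&s->mutex);
    s->acquired--;
    return false;
}
```

Keeping the unlock next to the release also makes it obvious at a glance which resources are held whenever control reaches the label.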

kbrenneman · 2024-07-29T16:27:10Z

src/wayland-eglsurface.c

-             */
-            syncPoints[i] = UINT64_MAX;
+
+            if (numSyncPoints >= MAX_IMAGES) {


This is incorrect if we have exactly MAX_IMAGES elements, because we'll have already incremented numSyncPoints to MAX_IMAGES.

Instead, this check needs to go before the assignment and increment above.

Or alternatively, we could have an extra loop to count the number of elements we'll need and then call alloca to allocate the arrays for them.

numSyncPoints is incremented to n right after the element at index n-1 is assigned. We check this condition right after the increment and before the loop iteration in which the element at index n is assigned. So as long as the initial value of numSyncPoints is less than MAX_IMAGES, which it currently is, this should work. That said, I believe moving the check above the assignment, as you suggested, would be less error-prone, since we would no longer rely on an assumption about the initial value of numSyncPoints.
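
A minimal sketch of the "check before the assignment" ordering both comments converge on. The struct, the function name, and the MAX_IMAGES value are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_IMAGES 4 /* assumed bound for this sketch; the real value may differ */

struct sync_points {
    uint64_t values[MAX_IMAGES];
    size_t count;
};

/* Append a timeline point. The capacity check runs before the write, so:
 *  - exactly MAX_IMAGES elements are accepted without a false error, and
 *  - correctness does not depend on the initial value of count. */
static bool sync_points_append(struct sync_points *sp, uint64_t value)
{
    if (sp->count >= MAX_IMAGES)
        return false; /* full: refuse instead of writing out of bounds */
    sp->values[sp->count++] = value;
    return true;
}
```

With the check placed after the increment instead, filling the array to exactly MAX_IMAGES entries would already trip the condition, even though nothing has overflowed.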


amshafer

Thanks, I think this looks good now pending Kyle confirming the response to his feedback.

kbrenneman · 2024-07-30T15:59:15Z

src/wayland-eglsurface.c

         */
-        if (image->buffer) {


Probably not something for this change, but I think if we made WlEglStreamImage refcounted, then we could simplify this teardown dance quite a lot.
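
As a rough idea of what a refcounted image could look like, here is a sketch with hypothetical names. It is not thread-safe on its own; it assumes the count is only touched under a lock such as the list mutex, or that atomics would be used instead.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted image; destroy_count lets callers observe teardown. */
struct ref_image {
    int refcount;
    int *destroy_count;
};

/* Create an image holding one reference; returns NULL if allocation fails. */
static struct ref_image *ref_image_create(int *destroy_count)
{
    struct ref_image *img = calloc(1, sizeof(*img));
    if (!img)
        return NULL;
    img->refcount = 1;
    img->destroy_count = destroy_count;
    return img;
}

/* Take an additional reference, e.g. while a buffer is still attached. */
static struct ref_image *ref_image_ref(struct ref_image *img)
{
    assert(img->refcount > 0);
    img->refcount++;
    return img;
}

/* Drop a reference; the last holder to let go triggers the teardown. */
static void ref_image_unref(struct ref_image *img)
{
    assert(img->refcount > 0);
    if (--img->refcount == 0) {
        (*img->destroy_count)++; /* real code would release resources here */
        free(img);
    }
}
```

With this shape, each party that still needs the image simply holds a reference, and the order in which they let go no longer has to be coordinated by hand.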

kbrenneman

I think this looks right.

dkorkmazturk requested review from cubanismo, kbrenneman and amshafer July 26, 2024 15:39

amshafer reviewed Jul 29, 2024

kbrenneman reviewed Jul 29, 2024

dkorkmazturk force-pushed the master branch from 45e374e to 81012a4 Compare July 29, 2024 18:11

dkorkmazturk force-pushed the master branch from 81012a4 to bbf4a82 Compare July 30, 2024 13:23

amshafer approved these changes Jul 30, 2024

kbrenneman reviewed Jul 30, 2024

kbrenneman approved these changes Jul 30, 2024

amshafer merged commit 59a60d6 into NVIDIA:master Jul 31, 2024

dkorkmazturk mentioned this pull request Jul 31, 2024

Qt apps freeze on master egl-wayland #111

Open


