Re-acquire locks via iterative instead of recursive execution #447 #450

HeikoKlare · 2023-05-12T13:01:03Z

When multiple OrderedLocks are acquired by different threads, the deadlock recovery mechanism, which suspends and reacquires the locks, may need an unbounded number of tries until one thread holds all required locks. Since reacquisition is performed by recursive method invocation, the stack can grow without bound, with the chance of resulting in a StackOverflowError.

These changes replace the recursive lock acquisition with an iterative one. An unbounded number of tries may still be required to acquire a set of locks, but the chance of a resulting StackOverflowError is eliminated. OrderedLockTest.testComplex, which was randomly failing on Windows systems due to the recursive implementation and was thus disabled in #455, is re-enabled.
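To illustrate the difference between the two retry styles, here is a minimal sketch. It does not use the actual LockManager or OrderedLock code: the class `IterativeReacquire`, the helper `tryAcquireAll`, and the use of `java.util.concurrent.locks.ReentrantLock` with a simple back-out-and-retry strategy are assumptions made purely for illustration.

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class IterativeReacquire {

    // Hypothetical helper: try to take every lock without blocking. On failure,
    // release whatever was taken so that other threads can make progress.
    static boolean tryAcquireAll(List<ReentrantLock> locks) {
        for (int i = 0; i < locks.size(); i++) {
            if (!locks.get(i).tryLock()) {
                for (int j = i - 1; j >= 0; j--) {
                    locks.get(j).unlock();
                }
                return false;
            }
        }
        return true;
    }

    // Recursive retry: every failed attempt adds a stack frame, so an unbounded
    // number of retries can end in a StackOverflowError.
    static int acquireRecursively(List<ReentrantLock> locks, int attempt) {
        if (tryAcquireAll(locks)) {
            return attempt;
        }
        Thread.yield();
        return acquireRecursively(locks, attempt + 1);
    }

    // Iterative retry: the same unbounded number of attempts, but the stack
    // depth stays constant no matter how often the loop runs.
    static int acquireIteratively(List<ReentrantLock> locks) {
        int attempt = 1;
        while (!tryAcquireAll(locks)) {
            Thread.yield();
            attempt++;
        }
        return attempt;
    }

    public static void main(String[] args) {
        List<ReentrantLock> locks = List.of(new ReentrantLock(), new ReentrantLock());
        int attempts = acquireIteratively(locks);
        System.out.println("acquired all locks after " + attempts + " attempt(s)");
        for (ReentrantLock lock : locks) {
            lock.unlock();
        }
    }
}
```

Both methods may retry arbitrarily often. Only the recursive one ties the number of retries to the stack depth, which is the property the PR removes.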

The added test case does not deterministically reproduce the erroneous behavior. A proper regression test is very hard to define, since a specific lock order across a multitude of retries would have to be ensured and coordinated between different threads. The test case at least executes more sophisticated locking scenarios to ensure proper lock retrieval and deadlock management.
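For a rough idea of what such a non-deterministic locking scenario can look like, here is a minimal sketch. It is not the test case added in this PR: the class `LockStressSketch`, its parameters, and the use of `ReentrantLock` with all-or-nothing `tryLock` retries (instead of the OrderedLock deadlock recovery) are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class LockStressSketch {

    // All-or-nothing acquisition: back out completely if any lock is busy.
    static boolean tryAcquireAll(List<ReentrantLock> locks) {
        for (int i = 0; i < locks.size(); i++) {
            if (!locks.get(i).tryLock()) {
                for (int j = i - 1; j >= 0; j--) {
                    locks.get(j).unlock();
                }
                return false;
            }
        }
        return true;
    }

    // Starts threadCount threads that each acquire all locks roundsPerThread
    // times and returns the total number of completed rounds.
    static int run(int lockCount, int threadCount, int roundsPerThread) {
        List<ReentrantLock> locks = new ArrayList<>();
        for (int i = 0; i < lockCount; i++) {
            locks.add(new ReentrantLock());
        }
        AtomicInteger completedRounds = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < threadCount; t++) {
            // Each thread uses its own lock order to provoke contention.
            List<ReentrantLock> order = new ArrayList<>(locks);
            Collections.shuffle(order, new Random(t));
            Thread thread = new Thread(() -> {
                for (int round = 0; round < roundsPerThread; round++) {
                    while (!tryAcquireAll(order)) {
                        Thread.yield(); // iterative retry, constant stack depth
                    }
                    completedRounds.incrementAndGet();
                    for (ReentrantLock lock : order) {
                        lock.unlock();
                    }
                }
            });
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return completedRounds.get();
    }

    public static void main(String[] args) {
        System.out.println(run(3, 4, 100) + " rounds completed");
    }
}
```

Like the test described above, this only exercises contended acquisition. It does not force the particular interleaving that would trigger a deep retry chain.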

Fixes #447. In particular, with this fix the build timeouts after 6h (which were only worked around by #455) should hopefully disappear.

github-actions · 2023-05-12T13:17:10Z

Test Results

30 files ±0, 30 suites ±0, duration 46m 7s ⏱️ (+5m 10s)
2 380 tests (+1): 2 377 ✔️ passed (±0), 2 💤 skipped (±0), 1 ❌ failed (+1)
7 143 runs (+3): 7 118 ✔️ passed (+3), 24 💤 skipped (-1), 1 ❌ failed (+1)

For more details on these failures, see this check.

Results for commit 05feb7f. ± Comparison against base commit 8b87e40.

♻️ This comment has been updated with latest results.

runtime/bundles/org.eclipse.core.jobs/src/org/eclipse/core/internal/jobs/LockManager.java

fedejeanne · 2023-05-15T04:56:12Z

Another occurrence: https://github.com/eclipse-platform/eclipse.platform/actions/runs/4957027490/jobs/8868205300#step:7:6021

iloveeclipse · 2023-05-15T11:17:15Z

The change is not trivial, and since this is not a recent regression, I would propose to postpone it to 4.29 M1. Objections?

HeikoKlare · 2023-05-15T17:19:10Z

No objections. Makes sense to postpone the merge until after the 4.28 release 👍

I propose to have a separate PR that temporarily disables the problematic test case on Windows systems. Then we hopefully already have (more) stable builds for the next weeks while still being aware of regressions (by still executing the test on the other platforms). This PR will then enable the test on Windows again. If you think that's a bad idea, let me know (either here or in the separate PR to come).

iloveeclipse · 2023-05-15T17:33:36Z

> I propose to have a separate PR that temporarily disables the problematic test case on Windows systems.

Sure. Please link to this issue.

HeikoKlare · 2023-05-15T18:36:31Z

I've temporarily disabled the problematic test case on Windows in #455 and changed this PR to re-enable the test.

HeikoKlare · 2023-06-16T12:28:24Z

@iloveeclipse We postponed merging this until after the 4.28 release. Any objections to merging it now?

iloveeclipse · 2023-06-18T21:45:53Z

> @iloveeclipse We postponed merging this until after the 4.28 release. Any objections to merging it now?

I've rebased to see if there is something to update after release change.

HeikoKlare · 2023-06-21T16:12:22Z

For documentation: I made a minor change after the review, only affecting the added test case.

Slow CI hardware made the new test case run for quite a long time, particularly on macOS machines, taking up to 60 seconds or even running into timeouts.

Thus, I have reduced the number of locks and threads used in that test case to reach acceptable execution times.

HeikoKlare · 2023-06-22T09:19:51Z

Failing test documented in #488.

HeikoKlare force-pushed the issue-447 branch from e51284d to cfd9261 Compare May 12, 2023 14:08

HeikoKlare marked this pull request as ready for review May 12, 2023 14:32

iloveeclipse reviewed May 12, 2023

runtime/bundles/org.eclipse.core.jobs/src/org/eclipse/core/internal/jobs/LockManager.java

HeikoKlare force-pushed the issue-447 branch 2 times, most recently from 228a38b to d052003 Compare May 13, 2023 07:58

HeikoKlare mentioned this pull request May 15, 2023

Temporarily disable OrderedLockTest.testComplex on Windows #447 #455

Merged

HeikoKlare force-pushed the issue-447 branch from d052003 to b4948dc Compare May 15, 2023 18:34

iloveeclipse force-pushed the issue-447 branch from b4948dc to e9262ea Compare June 18, 2023 21:44

HeikoKlare force-pushed the issue-447 branch 5 times, most recently from 79e36d3 to 356e22c Compare June 21, 2023 16:05

HeikoKlare force-pushed the issue-447 branch from 356e22c to c115542 Compare June 21, 2023 16:25

HeikoKlare force-pushed the issue-447 branch from c115542 to 05feb7f Compare June 22, 2023 07:25

HeikoKlare merged commit 65bdb6f into eclipse-platform:master Jun 22, 2023

HeikoKlare deleted the issue-447 branch September 12, 2023 08:22
