Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-acquire locks via iterative instead of recursive execution #447 #450

Merged
merged 1 commit into from
Jun 22, 2023

Conversation

HeikoKlare
Copy link
Contributor

@HeikoKlare HeikoKlare commented May 12, 2023

When multiple OrderedLocks are acquired by different threads, the deadlock recovery mechanism suspending and reacquiring the locks requires an indefinite number of tries until one thread holds all required locks. Since reacquisition is performed by recursive method invocation, the stack can become infinitely large with the change of resulting in a StackOverflowError.

These changes replace the recursive lock acqusition by an iterative one, such that still an indefinite number of tries for acquiring a set of locks is required but the chance of resulting in an error is eliminated. The OrderedLockTest.testComplex, which was randomly failing on Windows systems due to the recursive implementation and thus disabled in #455, is re-enabled.

The added test case does not deterministically reproduce the erroneous behavior, but since a proper regression test is very hard to define (as specific lock order across a magnitude of retries has to be ensured and coordinated between different threads), it at least executes more sophisticated locking scenarios to ensure proper lock retrieval and deadlock management.

Fixes #447. In particular, with this fix the build timeouts after 6h should hopefully disappear (which were only worked around by #455).

@github-actions
Copy link
Contributor

github-actions bot commented May 12, 2023

Test Results

     30 files  ±0       30 suites  ±0   46m 7s ⏱️ + 5m 10s
2 380 tests +1  2 377 ✔️ ±0    2 💤 ±0  1 +1 
7 143 runs  +3  7 118 ✔️ +3  24 💤  - 1  1 +1 

For more details on these failures, see this check.

Results for commit 05feb7f. ± Comparison against base commit 8b87e40.

♻️ This comment has been updated with latest results.

@HeikoKlare HeikoKlare marked this pull request as ready for review May 12, 2023 14:32
@HeikoKlare HeikoKlare force-pushed the issue-447 branch 2 times, most recently from 228a38b to d052003 Compare May 13, 2023 07:58
@fedejeanne
Copy link
Contributor

@iloveeclipse
Copy link
Member

The change is not trivial, and since this is not a recent regression, I would propose to postpone with it for 4.29 M1. Objections?

@HeikoKlare
Copy link
Contributor Author

No objections. Makes sense to postpone merge to after 4.28 release 👍

I propose to have a separate PR that temporarily disables the problematic test case on Windows systems. Then we hopefully already have (more) stable builds for the next weeks while still being aware of regressions (by still executing the test on the other platforms). This PR will then enable the test on Windows again. If you think that's a bad idea, let me know (either here or in the separate PR to come).

@iloveeclipse
Copy link
Member

I propose to have a separate PR that temporarily disables the problematic test case on Windows systems.

Sure. Please link to this issue.

@HeikoKlare
Copy link
Contributor Author

I've temporarily disabled the problematic test case on Windows in #455 and changed this PR to re-enable the test.

@HeikoKlare
Copy link
Contributor Author

@iloveeclipse We postponed merging this after 4.28 release. Any objections on merging this now?

@iloveeclipse
Copy link
Member

@iloveeclipse We postponed merging this after 4.28 release. Any objections on merging this now?

I've rebased to see if there is something to update after release change.

@HeikoKlare HeikoKlare force-pushed the issue-447 branch 5 times, most recently from 79e36d3 to 356e22c Compare June 21, 2023 16:05
@HeikoKlare
Copy link
Contributor Author

For documentation: I made a minor change after the review, only affecting the added test case.

Slow CI hardware made the new test case run for quite a long time, particularly on macOS machines, ranging to 60 seconds or even timeouts.

Thus, I have reduced the number of locks and threads used in that test case to reach acceptable execution times.

…-platform#447

When multiple OrderedLocks are acquired by different threads, the
deadlock recovery mechanism suspending and reacquiring the locks
requires an indefinite number of tries until one thread holds all
required locks. Since reacquisition is performed by recursive method
invocation, the stack can become infinitely large with the change of
resulting in a StackOverflowError.

These changes replace the recursive lock acqusition by an iterative one,
such that still an indefinite number of tries for acquiring a set of
locks is required but the chance of resulting in an error is eliminated.
The OrderedLockTest.testComplex, which was randomly failing on Windows
systems due to the recursive implementation, is re-enabled.

The added test case does not deterministically reproduce the erroneous
behavior, but since a proper regression test is very hard to define (as
specific lock order across a magnitude of retries has to be ensured and
coordinated between different threads), it at least executes more
sophisticated locking scenarios to ensure proper lock retrieval and
deadlock management.
@HeikoKlare
Copy link
Contributor Author

Failing test documented in #488.

@HeikoKlare HeikoKlare merged commit 65bdb6f into eclipse-platform:master Jun 22, 2023
@HeikoKlare HeikoKlare deleted the issue-447 branch September 12, 2023 08:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

StackOverflowError in LockManager.resumeSuspendedLocks
3 participants