Re-acquire locks via iterative instead of recursive execution #447 #450
Test Results: 30 files ±0, 30 suites ±0, 46m 7s ⏱️ +5m 10s
For more details on these failures, see this check.
Results for commit 05feb7f. ± Comparison against base commit 8b87e40.
♻️ This comment has been updated with latest results.
runtime/bundles/org.eclipse.core.jobs/src/org/eclipse/core/internal/jobs/LockManager.java
228a38b to d052003
The change is not trivial, and since this is not a recent regression, I propose postponing it to 4.29 M1. Objections?
No objections. Makes sense to postpone the merge until after the 4.28 release 👍 I propose to have a separate PR that temporarily disables the problematic test case on Windows systems. Then we hopefully already have (more) stable builds for the next weeks while still being aware of regressions (by still executing the test on the other platforms). This PR will then enable the test on Windows again. If you think that's a bad idea, let me know (either here or in the separate PR to come).
Sure. Please link to this issue.
I've temporarily disabled the problematic test case on Windows in #455 and changed this PR to re-enable the test.
@iloveeclipse We postponed merging this until after the 4.28 release. Any objections to merging it now?
I've rebased to see if there is something to update after the release change.
79e36d3 to 356e22c
For documentation: I made a minor change after the review, affecting only the added test case. On slow CI hardware, particularly the macOS machines, the new test case ran for quite a long time, taking up to 60 seconds or even timing out. I have therefore reduced the number of locks and threads used in that test case to reach acceptable execution times.
…-platform#447 When multiple OrderedLocks are acquired by different threads, the deadlock recovery mechanism, which suspends and reacquires the locks, may need an unbounded number of tries until one thread holds all required locks. Since reacquisition is performed by recursive method invocation, the stack can grow without bound, with the chance of resulting in a StackOverflowError. These changes replace the recursive lock acquisition with an iterative one, so that an unbounded number of tries may still be required to acquire a set of locks, but the chance of a StackOverflowError is eliminated. The OrderedLockTest.testComplex, which was randomly failing on Windows systems due to the recursive implementation, is re-enabled. The added test case does not deterministically reproduce the erroneous behavior, but since a proper regression test is very hard to define (as a specific lock order across a large number of retries has to be ensured and coordinated between different threads), it at least executes more sophisticated locking scenarios to ensure proper lock retrieval and deadlock management.
Failing test documented in #488.
When multiple OrderedLocks are acquired by different threads, the deadlock recovery mechanism, which suspends and reacquires the locks, may need an unbounded number of tries until one thread holds all required locks. Since reacquisition is performed by recursive method invocation, the stack can grow without bound, with the chance of resulting in a StackOverflowError.
These changes replace the recursive lock acquisition with an iterative one, so that an unbounded number of tries may still be required to acquire a set of locks, but the chance of a StackOverflowError is eliminated. The OrderedLockTest.testComplex, which was randomly failing on Windows systems due to the recursive implementation and thus disabled in #455, is re-enabled.
The added test case does not deterministically reproduce the erroneous behavior, but since a proper regression test is very hard to define (as a specific lock order across a large number of retries has to be ensured and coordinated between different threads), it at least executes more sophisticated locking scenarios to ensure proper lock retrieval and deadlock management.
Fixes #447. In particular, with this fix the build timeouts after 6h should hopefully disappear (which were only worked around by #455).
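To illustrate the idea behind the change described above, here is a minimal sketch of turning a recursive retry into a loop. It is not the actual LockManager code: the class ReacquireSketch, its methods, and the tryAcquireAll supplier are hypothetical names chosen for this example, and the supplier merely simulates an attempt to reacquire all suspended locks.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; none of these names exist in org.eclipse.core.jobs.
final class ReacquireSketch {

    // Recursive retry: every failed attempt adds one stack frame, so a long
    // run of failed attempts can end in a StackOverflowError.
    static int acquireRecursively(BooleanSupplier tryAcquireAll, int failedSoFar) {
        if (tryAcquireAll.getAsBoolean()) {
            return failedSoFar + 1; // total number of attempts made
        }
        return acquireRecursively(tryAcquireAll, failedSoFar + 1);
    }

    // Iterative retry: the number of attempts is still unbounded,
    // but the stack depth stays constant.
    static int acquireIteratively(BooleanSupplier tryAcquireAll) {
        int attempts = 1;
        while (!tryAcquireAll.getAsBoolean()) {
            attempts++;
        }
        return attempts;
    }

    // Simulated acquisition that fails the given number of times, then succeeds.
    static BooleanSupplier succeedsAfter(int failures) {
        int[] remaining = {failures};
        return () -> remaining[0]-- <= 0;
    }

    public static void main(String[] args) {
        // Both variants agree when only a few retries are needed.
        System.out.println(acquireRecursively(succeedsAfter(2), 0));
        System.out.println(acquireIteratively(succeedsAfter(2)));
        // The loop also copes with a retry count that would exhaust a typical thread stack.
        System.out.println(acquireIteratively(succeedsAfter(1_000_000)));
    }
}
```

The loop does not bound the number of retries; it only ensures that a long sequence of failed attempts no longer translates into stack depth.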