-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: handle BerkeleyJE DB interruption [tp-tests] #4425
fix: handle BerkeleyJE DB interruption [tp-tests] #4425
Conversation
|
6472db4
to
d17a1fe
Compare
@porunov @li-boxuan struggling on how to test this, the test that I copied from #3990 pass but when I check it actually doesn't do anything. |
d36b5de
to
c9e5d4d
Compare
Update: I was able to write a test to consistently trigger the error n prove that my change should work. |
be7dc33
to
88f00f5
Compare
Have made a change to only re/initialize on interruption. |
912bc67
to
771d46e
Compare
Oops, forgot to enable my test since it works now, have done so. |
771d46e
to
c7e7eca
Compare
@li-boxuan @porunov all tests passed, can you guys take a look 👀? Keen to get this merged soon 💪. |
I am sorry but I seem to have missed some context. Do you know why @mad 's previous PR didn't pass the test, and what did you do differently to solve the problem? |
@li-boxuan I'm not sure why that one didn't pass the test. I'm using a completely different approach in this PR that involves a lot less changes. |
...h-berkeleyje/src/main/java/org/janusgraph/diskstorage/berkeleyje/BerkeleyJEStoreManager.java
Outdated
Show resolved
Hide resolved
c7e7eca
to
fb85ab0
Compare
Thank you @mad, I have committed the suggested changes. |
@mad Look like doing this will actually fail the tests. I think it's because BerkeleyJE DB technically still does not support interruption, as in it will throw an error upon being interrupted. We are only reopening the environment here so that future requests won't all fail. |
fb85ab0
to
2e7eaf3
Compare
@mad @li-boxuan I have implemented the interruption retry logic. |
3a56be5
to
7c176ce
Compare
Co-authored-by: Pavel Ershov <owner.mad.epa@gmail.com> Signed-off-by: Tiến Nguyễn Khắc <tien.nguyenkhac@icloud.com>
7c176ce
to
4b86cbd
Compare
Have made the best effort implementation, so BerkeleyJE DB environment will be reset after interruption & also that race condition around thread are covered so that the interruption test from tp-test will always works. |
@li-boxuan @porunov all tests passed :D There's the coverage check, but I don't think I can get it any higher though, cuz a lot of the uncovered code are catch clauses. Which are impossible to trigger consistently. |
Codecov sometimes seems buggy and we don't know why. Don't worry about that. Congrats on passing all tests! @mad would you be able to take another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for finishing the job!
Awesome, thanks for all the helps and supports guys 🙏 |
I am gonna merge this following lazy merge consensus. Thank you @tien for fixing this! |
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions ?Please refer to the Backport tool documentation and see the Github Action logs for details |
After merging this PR
I will try to re-run those tests again. If they don't pass we should revert this commit. |
I'm pretty sure this is just a flaky error due to Berkeley DB can't handle many concurrent reads/writes, which now happen cuz we no longer exclude it from some of the tests. Like I can see that this pass https://github.com/JanusGraph/janusgraph/actions/runs/9075171201/job/24965523610 Look like a whole other issue: #1623 |
Yes, tests for berkeleyje with Java 8 passed after restart, but failed again for Java 11. I restarted the job again for Java 11. |
@porunov I can see that it failed again :( It's like a flaky race condition where the test try to create an entirely new Berkeley DB graph when the previous one is still invalid/re-intializing (Probably due to the lock issue). This isn't handled by this PR: the BerkeleyDB store doesn't have control over other instances created before or after it. This won't happen in production because you'll most likely & should have only one Janusgraph with Berkeley DB instance up and running. Maybe we should just continue what we did before and disable this particular test for BerkeleyDB? Or push on with the flaky test, It has always pass for me when I run this locally :p |
Though like you said @porunov increasing the lock time-out might just make this problem go away :p |
…s [tp-tests] Related to JanusGraph#1623 and JanusGraph#4425 Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
…s [tp-tests] Related to JanusGraph#1623 and JanusGraph#4425 Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
…s [tp-tests] Related to JanusGraph#1623 and JanusGraph#4425 Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
This PR is going to be reverted in #4702 after I unsure that the revert fixed |
Related to JanusGraph#1623 and JanusGraph#4425 Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
Related to JanusGraph#1623 and JanusGraph#4425 Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
Currently, any interruption to BerkeleyJE DB will cause Janusgraph to stop working entirely & require a complete restart of Janusgraph. This PR will make it so interruption to BerkeleyJE DB will close & reopen the Berkeley's environment instead.
Fixes #2120