-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
All Translog inner closes should happen after tragedy exception is set #32674
All Translog inner closes should happen after tragedy exception is set #32674
Conversation
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. I've left some comments. I think we should also beef up the testing here to ensure that we indeed always have a tragic exception at hand in case a tragic event happened. A good starting point for this is TranslogTests#testWithRandomException
, where we can sprinkle assertNotNull(failableTLog.getTragicException());
over the place (this would have actually found the rollGeneration
one), and call some more of the methods on failableTLog (e.g. trimOperations
) so that we get good coverage.
private synchronized void closeWithTragicEvent(final Exception ex) { | ||
assert ex != null; | ||
if (tragedy == null) { | ||
tragedy = ex; | ||
} else if (tragedy != ex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this case got lost
@@ -310,8 +313,21 @@ public boolean isOpen() { | |||
return closed.get() == false; | |||
} | |||
|
|||
private boolean calledFromOutsideOrViaTragedyClose(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this method static?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, can we test this method somehow?
@ywelsch please also discuss how far this can be backported (I only see labels for 7.0.0) |
Could you please review my changes? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the extra test. I've left some more comments.
@@ -51,7 +51,7 @@ | |||
/* the number of translog operations written to this file */ | |||
private volatile int operationCounter; | |||
/* if we hit an exception that we can't recover from we assign it to this var and ship it with every AlreadyClosedException we throw */ | |||
private volatile Exception tragedy; | |||
private TragicExceptionHolder tragedy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this final
@@ -2508,8 +2509,10 @@ public void testWithRandomException() throws IOException { | |||
// fair enough | |||
} catch (IOException ex) { | |||
assertEquals(ex.getMessage(), "__FAKE__ no space left on device"); | |||
assertEquals(failableTLog.getTragicException(), ex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about the case where we catch TranslogException | MockDirectoryWrapper.FakeIOException
?
@@ -2931,6 +2934,46 @@ public void testCloseSnapshotTwice() throws Exception { | |||
} | |||
} | |||
|
|||
public void testTranslogCloseInvariant() throws IOException { | |||
//close method should never be called directly from Translog (the only exception is closeOnTragicEvent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a space behind //
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also move this comment outside of the method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
finally, add an assumeTrue("test only works with assertions enabled", Assertions.ENABLED);
to the beginning of the method
close(); | ||
} catch (final IOException | RuntimeException e) { | ||
ex.addSuppressed(e); | ||
if (tragedy.setTragicException(ex)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change the semantics here to only call close when setting the tragedy the first time? Let's keep the existing semantics and make setTragicException
return void
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ywelsch what is the reason to call close twice if every time we set tragic exception close is called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simpler as it decouples these two things. And no need for having this method return a boolean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ywelsch ok, it makes sense. There is already "closed" atomic that will prevent multiple closes.
tragedy.set(ex); | ||
return true; | ||
} else { | ||
if (tragedy.get() != ex) { //to ensure there is no self-suppression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add space after //
public boolean setTragicException(Exception ex){ | ||
assert ex != null; | ||
if (tragedy.compareAndSet(null, ex)) { | ||
tragedy.set(ex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why double-set here? Isn't the previous line already doing this?
@ywelsch thank you for the review! |
@ywelsch done with changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Let's backport this to 6.4.1 and 6.5.0 as well.
* Sets the tragic exception or if the tragic exception is already set adds passed exception as suppressed exception | ||
* @param ex tragic exception to set | ||
*/ | ||
public void setTragicException(Exception ex){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space missing between )
and {
try { | ||
close(); | ||
} catch (final IOException | RuntimeException e) { | ||
ex.addSuppressed(e); | ||
} | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove extra newline
Merged to master, pending backport to 6.5.0 and 6.4.1 |
* master: Generalize remote license checker (#32971) Trim translog when safe commit advanced (#32967) Fix an inaccuracy in the dynamic templates documentation. (#32890) Logging: Use settings when building daemon threads (#32751) All Translog inner closes should happen after tragedy exception is set (#32674) HLREST: AwaitsFix ML Test Pass DiscoveryNode to initiateChannel (#32958) Add mzn and dz to unsupported locales (#32957) Use settings from the context in BootstrapChecks (#32908) Update docs for node specifications (#30468) HLRC: Forbid all Elasticsearch logging infra (#32784) Only configure publishing if it's applied externally (#32351) Fixes libs:dissect when in eclipse Protect ScriptedMetricIT test cases against failures on 0-doc shards (#32959) (#32968) [Kerberos] Add documentation for Kerberos realm (#32662) Watcher: Properly find next valid date in cron expressions (#32734) Fix some small issues in the getting started docs (#30346) Set forbidden APIs target compatibility to compiler java version (#32935) Move connection listener to ConnectionManager (#32956)
…e-types * elastic/master: (89 commits) Fix assertion in AbstractSimpleTransportTestCase (elastic#32991) [DOC] Splits role mapping APIs into separate pages (elastic#32797) HLRC: ML Close Job (elastic#32943) Generalize remote license checker (elastic#32971) Trim translog when safe commit advanced (elastic#32967) Fix an inaccuracy in the dynamic templates documentation. (elastic#32890) Logging: Use settings when building daemon threads (elastic#32751) All Translog inner closes should happen after tragedy exception is set (elastic#32674) HLREST: AwaitsFix ML Test Pass DiscoveryNode to initiateChannel (elastic#32958) Add mzn and dz to unsupported locales (elastic#32957) Use settings from the context in BootstrapChecks (elastic#32908) Update docs for node specifications (elastic#30468) HLRC: Forbid all Elasticsearch logging infra (elastic#32784) Only configure publishing if it's applied externally (elastic#32351) Fixes libs:dissect when in eclipse Protect ScriptedMetricIT test cases against failures on 0-doc shards (elastic#32959) (elastic#32968) [Kerberos] Add documentation for Kerberos realm (elastic#32662) Watcher: Properly find next valid date in cron expressions (elastic#32734) Fix some small issues in the getting started docs (elastic#30346) ...
#32674) We faced with the nasty race condition. See #32526 InternalEngine.failOnTragic method has thrown AssertionError. If you carefully look at if branches in this method, you will spot that its only possible, if either Lucene IndexWriterhas closed from inside or Translog, has closed from inside, but tragedy exception is not set. For now, let us concentrate on the Translog class. We found out that there are two methods in Translog - namely rollGeneration and trimOperations that are closing Translog in case of Exception without tragedy exception being set. This commit fixes these 2 methods. To fix it, we pull tragedyException from TranslogWriter up-to Translog class, because in these 2 methods IndexWriter could be innocent, but still Translog needs to be closed. Also, tragedyException is wrapped with TragicExceptionHolder to reuse CAS/addSuppresed functionality in Translog and TranslogWriter. Also to protect us in the future and make sure close method is never called from inside Translog special assertion examining stack trace is added. Since we're still targeting Java 8 for runtime - no StackWalker API is used in the implementation. In the stack-trace checking method, we're considering inner caller not only Translog methods but Translog child classes methods as well. It does mean that Translog is meant for extending it, but it's needed to be able to test this method. Closes #32526 (cherry picked from commit 0749b18)
#32674) We faced with the nasty race condition. See #32526 InternalEngine.failOnTragic method has thrown AssertionError. If you carefully look at if branches in this method, you will spot that its only possible, if either Lucene IndexWriterhas closed from inside or Translog, has closed from inside, but tragedy exception is not set. For now, let us concentrate on the Translog class. We found out that there are two methods in Translog - namely rollGeneration and trimOperations that are closing Translog in case of Exception without tragedy exception being set. This commit fixes these 2 methods. To fix it, we pull tragedyException from TranslogWriter up-to Translog class, because in these 2 methods IndexWriter could be innocent, but still Translog needs to be closed. Also, tragedyException is wrapped with TragicExceptionHolder to reuse CAS/addSuppresed functionality in Translog and TranslogWriter. Also to protect us in the future and make sure close method is never called from inside Translog special assertion examining stack trace is added. Since we're still targeting Java 8 for runtime - no StackWalker API is used in the implementation. In the stack-trace checking method, we're considering inner caller not only Translog methods but Translog child classes methods as well. It does mean that Translog is meant for extending it, but it's needed to be able to test this method. Closes #32526 (cherry picked from commit 0749b18)
Backported to 6.x and 6.4.1 |
We faced with the nasty race condition. See #32526
InternalEngine.failOnTragic
method has thrownAssertionError
.If you carefully look at if branches in this method, you will spot that its only possible, if either Lucene
IndexWriter
has closed from inside orTranslog
, has closed from inside, but tragedy exception is not set.For now, let us concentrate on the
Translog
class.We found out that there are two methods in Translog - namely
rollGeneration
andtrimOperations
that are closing Translog in case of Exception without tragedy exception being set.This pull request fixes these 2 methods. To fix it, I've decided to pull
tragedyException
fromTranslogWriter
up-toTranslog
class, because in these 2 methods IndexWriter could be innocent, but still Translog needs to be closed.Also to protect us in the future and make sure
close
method is never called from insideTranslog
special assertion examining stack trace is added. Since we're still targeting Java 8 for runtime - no StackWalker API is used in the implementation.