Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All Translog inner closes should happen after tragedy exception is set #32674

Merged
merged 13 commits into from
Aug 20, 2018

Conversation

andrershov
Copy link
Contributor

@andrershov andrershov commented Aug 7, 2018

We faced with the nasty race condition. See #32526
InternalEngine.failOnTragic method has thrown AssertionError.
If you carefully look at if branches in this method, you will spot that its only possible, if either Lucene IndexWriter has closed from inside or Translog, has closed from inside, but tragedy exception is not set.
For now, let us concentrate on the Translog class.
We found out that there are two methods in Translog - namely rollGeneration and trimOperations that are closing Translog in case of Exception without tragedy exception being set.
This pull request fixes these 2 methods. To fix it, I've decided to pull tragedyException from TranslogWriter up-to Translog class, because in these 2 methods IndexWriter could be innocent, but still Translog needs to be closed.
Also to protect us in the future and make sure close method is never called from inside Translog special assertion examining stack trace is added. Since we're still targeting Java 8 for runtime - no StackWalker API is used in the implementation.

@andrershov andrershov added the :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. label Aug 7, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@andrershov andrershov added the >bug label Aug 7, 2018
@andrershov andrershov changed the title All Translog inner closes should be because tragedy exception has happened All Translog inner closes should happen after tragedy exception is set Aug 7, 2018
@andrershov andrershov requested review from bleskes and ywelsch August 7, 2018 10:54
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I've left some comments. I think we should also beef up the testing here to ensure that we indeed always have a tragic exception at hand in case a tragic event happened. A good starting point for this is TranslogTests#testWithRandomException, where we can sprinkle assertNotNull(failableTLog.getTragicException()); over the place (this would have actually found the rollGeneration one), and call some more of the methods on failableTLog (e.g. trimOperations) so that we get good coverage.

private synchronized void closeWithTragicEvent(final Exception ex) {
assert ex != null;
if (tragedy == null) {
tragedy = ex;
} else if (tragedy != ex) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this case got lost

@@ -310,8 +313,21 @@ public boolean isOpen() {
return closed.get() == false;
}

private boolean calledFromOutsideOrViaTragedyClose(){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this method static?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can we test this method somehow?

@bleskes
Copy link
Contributor

bleskes commented Aug 10, 2018

@ywelsch please also discuss how far this can be backported (I only see labels for 7.0.0)

@andrershov
Copy link
Contributor Author

@ywelsch

  • I've wrapped AtomicReference with the tragic exception with TragedyExceptionHolder to make sure we do not lose, but instead suppress subsequent exceptions
  • Made stack checking method static
  • Added a test for stack checking method - the best thing we can do here is subclassing and check for subclasses in stack checking method
  • Added non-null tragic exception assertions to tests that might fail during rollGeneration and trimOperations.

Could you please review my changes?

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the extra test. I've left some more comments.

@@ -51,7 +51,7 @@
/* the number of translog operations written to this file */
private volatile int operationCounter;
/* if we hit an exception that we can't recover from we assign it to this var and ship it with every AlreadyClosedException we throw */
private volatile Exception tragedy;
private TragicExceptionHolder tragedy;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this final

@@ -2508,8 +2509,10 @@ public void testWithRandomException() throws IOException {
// fair enough
} catch (IOException ex) {
assertEquals(ex.getMessage(), "__FAKE__ no space left on device");
assertEquals(failableTLog.getTragicException(), ex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about the case where we catch TranslogException | MockDirectoryWrapper.FakeIOException?

@@ -2931,6 +2934,46 @@ public void testCloseSnapshotTwice() throws Exception {
}
}

public void testTranslogCloseInvariant() throws IOException {
//close method should never be called directly from Translog (the only exception is closeOnTragicEvent)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a space behind //

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also move this comment outside of the method

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finally, add an assumeTrue("test only works with assertions enabled", Assertions.ENABLED); to the beginning of the method

close();
} catch (final IOException | RuntimeException e) {
ex.addSuppressed(e);
if (tragedy.setTragicException(ex)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change the semantics here to only call close when setting the tragedy the first time? Let's keep the existing semantics and make setTragicException return void

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ywelsch what is the reason to call close twice if every time we set tragic exception close is called?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simpler as it decouples these two things. And no need for having this method return a boolean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ywelsch ok, it makes sense. There is already "closed" atomic that will prevent multiple closes.

tragedy.set(ex);
return true;
} else {
if (tragedy.get() != ex) { //to ensure there is no self-suppression
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add space after //

public boolean setTragicException(Exception ex){
assert ex != null;
if (tragedy.compareAndSet(null, ex)) {
tragedy.set(ex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why double-set here? Isn't the previous line already doing this?

@andrershov
Copy link
Contributor Author

@ywelsch thank you for the review!
I've fixed everything except close once. I've commented on it.

@andrershov
Copy link
Contributor Author

@ywelsch done with changes

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Let's backport this to 6.4.1 and 6.5.0 as well.

* Sets the tragic exception or if the tragic exception is already set adds passed exception as suppressed exception
* @param ex tragic exception to set
*/
public void setTragicException(Exception ex){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space missing between ) and {

try {
close();
} catch (final IOException | RuntimeException e) {
ex.addSuppressed(e);
}
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra newline

@andrershov andrershov merged commit 0749b18 into elastic:master Aug 20, 2018
@andrershov
Copy link
Contributor Author

Merged to master, pending backport to 6.5.0 and 6.4.1

jasontedor added a commit that referenced this pull request Aug 20, 2018
* master:
  Generalize remote license checker (#32971)
  Trim translog when safe commit advanced (#32967)
  Fix an inaccuracy in the dynamic templates documentation. (#32890)
  Logging: Use settings when building daemon threads (#32751)
  All Translog inner closes should happen after tragedy exception is set (#32674)
  HLREST: AwaitsFix ML Test
  Pass DiscoveryNode to initiateChannel (#32958)
  Add mzn and dz to unsupported locales (#32957)
  Use settings from the context in BootstrapChecks (#32908)
  Update docs for node specifications (#30468)
  HLRC: Forbid all Elasticsearch logging infra (#32784)
  Only configure publishing if it's applied externally (#32351)
  Fixes libs:dissect when in eclipse
  Protect ScriptedMetricIT test cases against failures on 0-doc shards (#32959) (#32968)
  [Kerberos] Add documentation for Kerberos realm (#32662)
  Watcher: Properly find next valid date in cron expressions (#32734)
  Fix some small issues in the getting started docs (#30346)
  Set forbidden APIs target compatibility to compiler java version   (#32935)
  Move connection listener to ConnectionManager (#32956)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Aug 21, 2018
…e-types

* elastic/master: (89 commits)
  Fix assertion in AbstractSimpleTransportTestCase (elastic#32991)
  [DOC] Splits role mapping APIs into separate pages (elastic#32797)
  HLRC: ML Close Job (elastic#32943)
  Generalize remote license checker (elastic#32971)
  Trim translog when safe commit advanced (elastic#32967)
  Fix an inaccuracy in the dynamic templates documentation. (elastic#32890)
  Logging: Use settings when building daemon threads (elastic#32751)
  All Translog inner closes should happen after tragedy exception is set (elastic#32674)
  HLREST: AwaitsFix ML Test
  Pass DiscoveryNode to initiateChannel (elastic#32958)
  Add mzn and dz to unsupported locales (elastic#32957)
  Use settings from the context in BootstrapChecks (elastic#32908)
  Update docs for node specifications (elastic#30468)
  HLRC: Forbid all Elasticsearch logging infra (elastic#32784)
  Only configure publishing if it's applied externally (elastic#32351)
  Fixes libs:dissect when in eclipse
  Protect ScriptedMetricIT test cases against failures on 0-doc shards (elastic#32959) (elastic#32968)
  [Kerberos] Add documentation for Kerberos realm (elastic#32662)
  Watcher: Properly find next valid date in cron expressions (elastic#32734)
  Fix some small issues in the getting started docs (elastic#30346)
  ...
andrershov pushed a commit that referenced this pull request Aug 21, 2018
#32674)

We faced with the nasty race condition. See #32526
InternalEngine.failOnTragic method has thrown AssertionError.
If you carefully look at if branches in this method, you will spot that its only possible, if either Lucene IndexWriterhas closed from inside or Translog, has closed from inside, but tragedy exception is not set.
For now, let us concentrate on the Translog class.
We found out that there are two methods in Translog - namely rollGeneration and trimOperations that are closing Translog in case of Exception without tragedy exception being set.
This commit fixes these 2 methods. To fix it, we pull tragedyException from TranslogWriter up-to Translog class, because in these 2 methods IndexWriter could be innocent, but still Translog needs to be closed. Also, tragedyException is wrapped with TragicExceptionHolder to reuse CAS/addSuppresed functionality in Translog and TranslogWriter.
Also to protect us in the future and make sure close method is never called from inside Translog special assertion examining stack trace is added. Since we're still targeting Java 8 for runtime - no StackWalker API is used in the implementation.
In the stack-trace checking method, we're considering inner caller not only Translog methods but Translog child classes methods as well. It does mean that Translog is meant for extending it, but it's needed to be able to test this method.

Closes #32526

(cherry picked from commit 0749b18)
andrershov pushed a commit that referenced this pull request Aug 21, 2018
#32674)

We faced with the nasty race condition. See #32526
InternalEngine.failOnTragic method has thrown AssertionError.
If you carefully look at if branches in this method, you will spot that its only possible, if either Lucene IndexWriterhas closed from inside or Translog, has closed from inside, but tragedy exception is not set.
For now, let us concentrate on the Translog class.
We found out that there are two methods in Translog - namely rollGeneration and trimOperations that are closing Translog in case of Exception without tragedy exception being set.
This commit fixes these 2 methods. To fix it, we pull tragedyException from TranslogWriter up-to Translog class, because in these 2 methods IndexWriter could be innocent, but still Translog needs to be closed. Also, tragedyException is wrapped with TragicExceptionHolder to reuse CAS/addSuppresed functionality in Translog and TranslogWriter.
Also to protect us in the future and make sure close method is never called from inside Translog special assertion examining stack trace is added. Since we're still targeting Java 8 for runtime - no StackWalker API is used in the implementation.
In the stack-trace checking method, we're considering inner caller not only Translog methods but Translog child classes methods as well. It does mean that Translog is meant for extending it, but it's needed to be able to test this method.

Closes #32526

(cherry picked from commit 0749b18)
@andrershov
Copy link
Contributor Author

Backported to 6.x and 6.4.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v6.4.1 v6.5.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants