Suppress interrupted status during pool closure #13521

Closed

@werkt werkt commented May 27, 2021

awaitTermination throws InterruptedException if the thread's interrupted
status is already set when it is called, even if no wait is required.
Pool closure should not honor an already-set interrupted status when it
shuts down and awaits termination as a result of the call from
executionPhaseEnding, which occurs during abnormal exits from
ExecutionTool. Clear this status on entry and restore the flag when the
factory close exits. An external interrupt that arrives during
awaitTermination will still trigger an InterruptedException, as
expected.
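
A minimal sketch of the clear-then-restore pattern described above, not the code from this change. It assumes a hypothetical FakeChannel whose termination is tracked by a CountDownLatch; CountDownLatch.await(timeout, unit) is documented to throw InterruptedException when the interrupted status is set on entry, even if the count is already zero. The names FakeChannel and closeSuppressingInterrupt are illustrative only.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class InterruptSuppressionSketch {
  // Hypothetical stand-in for a channel; termination is signalled through a latch.
  static final class FakeChannel {
    private final CountDownLatch terminated = new CountDownLatch(1);

    void shutdown() {
      terminated.countDown();
    }

    boolean awaitTermination(long timeout, TimeUnit unit) throws InterruptedException {
      // Throws immediately if the caller's interrupted status is already set.
      return terminated.await(timeout, unit);
    }
  }

  // Closes the channel while ignoring an interrupted status that was set before the call.
  static boolean closeSuppressingInterrupt(FakeChannel channel) throws InterruptedException {
    // Thread.interrupted() returns the current status and clears it.
    boolean wasInterrupted = Thread.interrupted();
    try {
      channel.shutdown();
      // An interrupt that arrives during this wait still throws InterruptedException.
      return channel.awaitTermination(1, TimeUnit.SECONDS);
    } finally {
      if (wasInterrupted) {
        // Restore the flag so callers further up can still observe the interrupt.
        Thread.currentThread().interrupt();
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Thread.currentThread().interrupt();
    boolean terminated = closeSuppressingInterrupt(new FakeChannel());
    System.out.println("terminated=" + terminated
        + ", interrupt restored=" + Thread.interrupted());
  }
}
```

Without the Thread.interrupted() call on entry, the same awaitTermination would throw before checking whether the channel had already terminated.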

Fixes #13512

@google-cla google-cla bot added the cla: yes label May 27, 2021
@shirchen shirchen mentioned this pull request Jun 1, 2021
@coeuvre coeuvre self-assigned this Jun 2, 2021
@coeuvre coeuvre self-requested a review June 2, 2021 03:57
@coeuvre coeuvre added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Jun 2, 2021

@coeuvre coeuvre left a comment

Thanks for the PR!

@bazel-io bazel-io closed this in fd9cffd Jun 2, 2021
@werkt werkt deleted the channel-await-termination-interrupted branch June 9, 2021 17:45
katre pushed a commit that referenced this pull request Jul 12, 2021
Closes #13521.

PiperOrigin-RevId: 377006347
katre pushed a commit to katre/bazel that referenced this pull request Jul 13, 2021

gregmagolan commented Apr 20, 2023

@werkt If an InterruptedException occurs during awaitTermination, would you expect it to bubble up into a RuntimeException and kill the Bazel server?

That seems to be the case we hit below.

Bazel version is 6.1.0.

The line in the stack trace "com.google.devtools.build.lib.remote.grpc.ChannelConnectionFactory$ChannelConnection.close(ChannelConnectionFactory.java:56)" corresponds to the code https://github.com/bazelbuild/bazel/blob/release-6.1.0/src/main/java/com/google/devtools/build/lib/remote/grpc/ChannelConnectionFactory.java#L56.

230420 18:22:31.802:I 2700 [com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec] Exit status was DetailedExitCode{exitCode=BLAZE_INTERNAL_ERROR, failureDetail=message: "Crashed: (java.lang.RuntimeException) java.util.concurrent.ExecutionException: java.lang.AssertionError, (java.util.concurrent.ExecutionException) java.lang.AssertionError, (java.lang.AssertionError) , (java.io.IOException) , (java.lang.InterruptedException) "
crash {
  causes {
    throwable_class: "java.lang.RuntimeException"
    message: "java.util.concurrent.ExecutionException: java.lang.AssertionError"
    stack_trace: "com.google.devtools.build.lib.runtime.BlockWaitingModule.afterCommand(BlockWaitingModule.java:99)"
    stack_trace: "com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:625)"
    stack_trace: "com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:634)"
    stack_trace: "com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:234)"
    stack_trace: "com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:550)"
    stack_trace: "com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$1(GrpcServerImpl.java:614)"
    stack_trace: "io.grpc.Context$1.run(Context.java:566)"
    stack_trace: "java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)"
    stack_trace: "java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)"
    stack_trace: "java.base/java.lang.Thread.run(Unknown Source)"
  }
  causes {
    throwable_class: "java.util.concurrent.ExecutionException"
    message: "java.lang.AssertionError"
    stack_trace: "java.base/java.util.concurrent.FutureTask.report(Unknown Source)"
    stack_trace: "java.base/java.util.concurrent.FutureTask.get(Unknown Source)"
    stack_trace: "com.google.devtools.build.lib.runtime.BlockWaitingModule.afterCommand(BlockWaitingModule.java:90)"
    stack_trace: "com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:625)"
    stack_trace: "com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:634)"
    stack_trace: "com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:234)"
    stack_trace: "com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:550)"
    stack_trace: "com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$1(GrpcServerImpl.java:614)"
    stack_trace: "io.grpc.Context$1.run(Context.java:566)"
    stack_trace: "java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)"
    stack_trace: "java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)"
    stack_trace: "java.base/java.lang.Thread.run(Unknown Source)"
  }
  causes {
    throwable_class: "java.lang.AssertionError"
    stack_trace: "com.google.devtools.build.lib.remote.ReferenceCountedChannel$1.deallocate(ReferenceCountedChannel.java:51)"
    stack_trace: "io.netty.util.AbstractReferenceCounted.handleRelease(AbstractReferenceCounted.java:86)"
    stack_trace: "io.netty.util.AbstractReferenceCounted.release(AbstractReferenceCounted.java:76)"
    stack_trace: "com.google.devtools.build.lib.remote.ReferenceCountedChannel.release(ReferenceCountedChannel.java:148)"
    stack_trace: "com.google.devtools.build.lib.remote.GrpcCacheClient.close(GrpcCacheClient.java:167)"
    stack_trace: "com.google.devtools.build.lib.remote.disk.DiskAndRemoteCacheClient.close(DiskAndRemoteCacheClient.java:72)"
    stack_trace: "com.google.devtools.build.lib.remote.RemoteCache.deallocate(RemoteCache.java:429)"
    stack_trace: "io.netty.util.AbstractReferenceCounted.handleRelease(AbstractReferenceCounted.java:86)"
    stack_trace: "io.netty.util.AbstractReferenceCounted.release(AbstractReferenceCounted.java:76)"
    stack_trace: "com.google.devtools.build.lib.remote.RemoteExecutionService.shutdown(RemoteExecutionService.java:1513)"
    stack_trace: "com.google.devtools.build.lib.remote.RemoteActionContextProvider.afterCommand(RemoteActionContextProvider.java:223)"
    stack_trace: "com.google.devtools.build.lib.remote.RemoteModule.afterCommandTask(RemoteModule.java:821)"
    stack_trace: "com.google.devtools.build.lib.remote.RemoteModule.lambda$afterCommand$1(RemoteModule.java:799)"
    stack_trace: "com.google.devtools.build.lib.runtime.BlockWaitingModule.lambda$submit$0(BlockWaitingModule.java:73)"
    stack_trace: "java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)"
    stack_trace: "java.base/java.util.concurrent.FutureTask.run(Unknown Source)"
    stack_trace: "java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)"
    stack_trace: "java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)"
    stack_trace: "java.base/java.lang.Thread.run(Unknown Source)"
  }
  causes {
    throwable_class: "java.io.IOException"
    stack_trace: "com.google.devtools.build.lib.remote.grpc.ChannelConnectionFactory$ChannelConnection.close(ChannelConnectionFactory.java:58)"
    stack_trace: "com.google.devtools.build.lib.remote.grpc.SharedConnectionFactory.close(SharedConnectionFactory.java:75)"
    stack_trace: "com.google.devtools.build.lib.remote.grpc.DynamicConnectionPool.close(DynamicConnectionPool.java:64)"
    stack_trace: "com.google.devtools.build.lib.remote.ReferenceCountedChannel$1.deallocate(ReferenceCountedChannel.java:49)"
    stack_trace: "io.netty.util.AbstractReferenceCounted.handleRelease(AbstractReferenceCounted.java:86)"
    stack_trace: "io.netty.util.AbstractReferenceCounted.release(AbstractReferenceCounted.java:76)"
    stack_trace: "com.google.devtools.build.lib.remote.ReferenceCountedChannel.release(ReferenceCountedChannel.java:148)"
    stack_trace: "com.google.devtools.build.lib.remote.GrpcCacheClient.close(GrpcCacheClient.java:167)"
    stack_trace: "com.google.devtools.build.lib.remote.disk.DiskAndRemoteCacheClient.close(DiskAndRemoteCacheClient.java:72)"
    stack_trace: "com.google.devtools.build.lib.remote.RemoteCache.deallocate(RemoteCache.java:429)"
    stack_trace: "io.netty.util.AbstractReferenceCounted.handleRelease(AbstractReferenceCounted.java:86)"
    stack_trace: "io.netty.util.AbstractReferenceCounted.release(AbstractReferenceCounted.java:76)"
    stack_trace: "com.google.devtools.build.lib.remote.RemoteExecutionService.shutdown(RemoteExecutionService.java:1513)"
    stack_trace: "com.google.devtools.build.lib.remote.RemoteActionContextProvider.afterCommand(RemoteActionContextProvider.java:223)"
    stack_trace: "com.google.devtools.build.lib.remote.RemoteModule.afterCommandTask(RemoteModule.java:821)"
    stack_trace: "com.google.devtools.build.lib.remote.RemoteModule.lambda$afterCommand$1(RemoteModule.java:799)"
    stack_trace: "com.google.devtools.build.lib.runtime.BlockWaitingModule.lambda$submit$0(BlockWaitingModule.java:73)"
    stack_trace: "java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)"
    stack_trace: "java.base/java.util.concurrent.FutureTask.run(Unknown Source)"
    stack_trace: "java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)"
    stack_trace: "java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)"
    stack_trace: "java.base/java.lang.Thread.run(Unknown Source)"
  }
  causes {
    throwable_class: "java.lang.InterruptedException"
    stack_trace: "java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(Unknown Source)"
    stack_trace: "java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(Unknown Source)"
    stack_trace: "java.base/java.util.concurrent.CountDownLatch.await(Unknown Source)"
    stack_trace: "io.grpc.internal.ManagedChannelImpl.awaitTermination(ManagedChannelImpl.java:909)"
    stack_trace: "io.grpc.internal.ForwardingManagedChannel.awaitTermination(ForwardingManagedChannel.java:57)"
    stack_trace: "com.google.devtools.build.lib.remote.grpc.ChannelConnectionFactory$ChannelConnection.close(ChannelConnectionFactory.java:56)"
    stack_trace: "com.google.devtools.build.lib.remote.grpc.SharedConnectionFactory.close(SharedConnectionFactory.java:75)"
    stack_trace: "com.google.devtools.build.lib.remote.grpc.DynamicConnectionPool.close(DynamicConnectionPool.java:64)"
    stack_trace: "com.google.devtools.build.lib.remote.ReferenceCountedChannel$1.deallocate(ReferenceCountedChannel.java:49)"
    stack_trace: "io.netty.util.AbstractReferenceCounted.handleRelease(AbstractReferenceCounted.java:86)"
    stack_trace: "io.netty.util.AbstractReferenceCounted.release(AbstractReferenceCounted.java:76)"
    stack_trace: "com.google.devtools.build.lib.remote.ReferenceCountedChannel.release(ReferenceCountedChannel.java:148)"
    stack_trace: "com.google.devtools.build.lib.remote.GrpcCacheClient.close(GrpcCacheClient.java:167)"
    stack_trace: "com.google.devtools.build.lib.remote.disk.DiskAndRemoteCacheClient.close(DiskAndRemoteCacheClient.java:72)"
    stack_trace: "com.google.devtools.build.lib.remote.RemoteCache.deallocate(RemoteCache.java:429)"
    stack_trace: "io.netty.util.AbstractReferenceCounted.handleRelease(AbstractReferenceCounted.java:86)"
    stack_trace: "io.netty.util.AbstractReferenceCounted.release(AbstractReferenceCounted.java:76)"
    stack_trace: "com.google.devtools.build.lib.remote.RemoteExecutionService.shutdown(RemoteExecutionService.java:1513)"
    stack_trace: "com.google.devtools.build.lib.remote.RemoteActionContextProvider.afterCommand(RemoteActionContextProvider.java:223)"
    stack_trace: "com.google.devtools.build.lib.remote.RemoteModule.afterCommandTask(RemoteModule.java:821)"
    stack_trace: "com.google.devtools.build.lib.remote.RemoteModule.lambda$afterCommand$1(RemoteModule.java:799)"
    stack_trace: "com.google.devtools.build.lib.runtime.BlockWaitingModule.lambda$submit$0(BlockWaitingModule.java:73)"
    stack_trace: "java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)"
    stack_trace: "java.base/java.util.concurrent.FutureTask.run(Unknown Source)"
    stack_trace: "java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)"
    stack_trace: "java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)"
    stack_trace: "java.base/java.lang.Thread.run(Unknown Source)"
  }
}
}

werkt commented Apr 20, 2023

Yes. From your stack traces, and from checking the sources, that is what I expect. In the block of BlockWaitingModule that processes submittedTasks, only AbruptExitExceptions are handled in a non-terminating fashion, so your Interrupted -> IO -> Assertion -> Execution -> Runtime exception path results in a shutdown in BlazeCommandDispatcher with the message "Shutting down due to exception". Arguably, you don't want anything but a complete failure and shutdown of the Bazel daemon here: it just failed to shut down a pool, so it would leak resources if it continued.
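
To make that wrapping path concrete, here is a rough sketch that reproduces the same cause chain with plain JDK classes. It is not Bazel code: closeConnection, deallocate, and runAfterCommandTask are hypothetical stand-ins for the frames in the trace, and the InterruptedException is thrown directly rather than by a real wait.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class CauseChainSketch {
  // Hypothetical: a close that is interrupted while waiting and reports it as IOException.
  static void closeConnection() throws IOException {
    try {
      throw new InterruptedException();
    } catch (InterruptedException e) {
      throw new IOException(e);
    }
  }

  // Hypothetical: a deallocation hook that treats a failed close as a programming error.
  static void deallocate() {
    try {
      closeConnection();
    } catch (IOException e) {
      throw new AssertionError(e);
    }
  }

  // Hypothetical: runs the cleanup on another thread and wraps any failure.
  static RuntimeException runAfterCommandTask() throws InterruptedException {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Runnable cleanup = CauseChainSketch::deallocate;
      Future<?> task = executor.submit(cleanup);
      try {
        task.get();
        return null;
      } catch (ExecutionException e) {
        // Future.get() reports the task's Throwable as the cause of ExecutionException.
        return new RuntimeException(e);
      }
    } finally {
      executor.shutdown();
    }
  }

  // Collects the simple class names along the getCause() chain, outermost first.
  static List<String> causeNames(Throwable t) {
    List<String> names = new ArrayList<>();
    for (Throwable cur = t; cur != null; cur = cur.getCause()) {
      names.add(cur.getClass().getSimpleName());
    }
    return names;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(causeNames(runAfterCommandTask()));
  }
}
```

Walking getCause() from the outermost exception gives the order in which the causes blocks appear in the crash report above.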

@gregmagolan

Roger. Thanks for confirming.

Labels
cla: yes team-Remote-Exec Issues and PRs for the Execution (Remote) team
Development

Successfully merging this pull request may close these issues.

Failing to rebuild monorepo with GRPC cache with 4.1.0
3 participants