
Failed tests executed with REv2 hang indefinitely with Bazel 0.25 #8320

Closed
cphang99 opened this issue May 14, 2019 · 17 comments
Labels: team-Remote-Exec (Issues and PRs for the Execution (Remote) team), untriaged

Comments

cphang99 commented May 14, 2019

Description of the problem / feature request:

Bazel 0.25 has introduced a regression, causing the reporting of failed
tests executed using REv2 to hang indefinitely.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Launch a recent version of Buildbarn.
On a Linux-based system, this can be done, for example, by using
the Docker Compose based setup.

Create a simple Bazel workspace that contains the following files:

WORKSPACE: May remain empty.

BUILD.bazel:

cc_test(
    name = "cc_failure_test",
    srcs = ["cc_failure_test.c"],
)

py_test(
    name = "py_failure_test",
    srcs = ["py_failure_test.py"],
)

cc_failure_test.c:

#include <stdio.h>

int main() {
  printf("Fail!\n");
  return 1;
}

py_failure_test.py:

import sys

print("Fail!")
sys.exit(1)

.bazelrc:

build --experimental_strict_action_env
build --genrule_strategy=remote
build --spawn_strategy=remote
build --strategy=Closure=remote
build --strategy=Javac=remote

build --bes_backend=localhost:8983
build --bes_results_url=http://localhost:7984/build_events/bb-event-service/
build --jobs=8
build --remote_executor=localhost:8980
build --remote_instance_name=local

Now run bazel test ... to execute both unit tests. Buildbarn will
immediately execute all build actions and report that the execution of
both tests has failed. Logging/tracing on Buildbarn's side confirms
this. Still, Bazel remains stuck in a state where it does not
process the results it receives from the build cluster:

$ bazel test ...
INFO: Analyzed 2 targets (20 packages loaded, 505 targets configured).
INFO: Found 2 test targets...
[8 / 10] 2 actions running
    Testing //:py_failure_test; 72s remote
    Testing //:cc_failure_test; 72s remote

The invocation of Bazel never terminates, regardless of any timeouts
configured. Bazel does not voluntarily shut down when Ctrl+C is pressed;
it only terminates after Ctrl+C is pressed three times.

This regression was introduced by commit
4a5e1b7. It only affects the execution
of failing tests; tests that succeed complete properly.

Below is a list of Bazel versions tested.

What operating system are you running Bazel on?

macOS 10.14.x and Ubuntu 18.04.

Have you found anything relevant by searching the web?

Apart from the issues linked from the messages of the commits
above (e.g., #6394), nothing relevant was found.

buchgr (Contributor) commented May 16, 2019

@cphang99 thanks for the detailed bug report. Do I understand correctly that you managed to pinpoint the root cause to 4a5e1b7?

cphang99 (Author) commented:

@buchgr yes. Reverting 4a5e1b7 (and the associated commits needed to resolve conflicts) results in bazel test once again reporting test failures correctly, without hanging.

dslomov added the team-Remote-Exec and untriaged labels May 17, 2019
buchgr (Contributor) commented May 21, 2019

@ulfjack are you aware of this? Has this been fixed in master?

ulfjack (Contributor) commented May 21, 2019

This is the first I've heard of this. 4a5e1b7 adds a separate implementation and should not result in any user-visible change. However, 895c43d enables the new implementation by default.

ulfjack (Contributor) commented May 21, 2019

A stack trace would be useful.

ulfjack (Contributor) commented May 21, 2019

It's odd that this only seems to happen on buildbarn. We don't have any other reports, AFAIK.

ulfjack (Contributor) commented May 21, 2019

I can't get buildbarn to work:

ERROR: Failed to query remote execution capabilities: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception

That's literally all the error message I get.

ulfjack (Contributor) commented May 21, 2019

While I'm here: it's not a good idea to attempt a network connection in BlazeModule.beforeCommand.

ulfjack (Contributor) commented May 21, 2019

java.io.IOException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
	at com.google.devtools.build.lib.remote.RemoteServerCapabilities.get(RemoteServerCapabilities.java:79)
	at com.google.devtools.build.lib.remote.RemoteModule.beforeCommand(RemoteModule.java:228)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:313)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:206)
	at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:749)
	at com.google.devtools.build.lib.server.GrpcServerImpl.access$1600(GrpcServerImpl.java:103)
	at com.google.devtools.build.lib.server.GrpcServerImpl$2.lambda$run$0(GrpcServerImpl.java:818)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:233)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:214)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:139)
	at build.bazel.remote.execution.v2.CapabilitiesGrpc$CapabilitiesBlockingStub.getCapabilities(CapabilitiesGrpc.java:200)
	at com.google.devtools.build.lib.remote.RemoteServerCapabilities.lambda$get$0(RemoteServerCapabilities.java:77)
	at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:237)
	at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:104)
	at com.google.devtools.build.lib.remote.RemoteServerCapabilities.get(RemoteServerCapabilities.java:77)
	... 9 more
Caused by: io.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record: 000000040000000000
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1204)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1272)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:337)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:677)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:612)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:529)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:491)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	... 1 more

ulfjack (Contributor) commented May 21, 2019

That's with TLS disabled, AFAICT, using all the TLS-disabling options I could find.

ulfjack (Contributor) commented May 21, 2019

Ah, I got it. It was picking up options from my global bazelrc. However, it still doesn't work:

ERROR: /usr/local/google/home/ulfjack/Projects/test-bazel/cpp/BUILD:6:1: undeclared inclusion(s) in rule '//cpp:cc_failure_test':
this rule is missing dependency declarations for the following files included by 'cpp/cc_failure_test.c':
  '/usr/local/lib/clang/8.0.0/include/stddef.h'
  '/usr/local/lib/clang/8.0.0/include/stdarg.h'

I had to upgrade to a newer version of the bazel_toolchains, as 0.25.2 didn't work with the old one.

ulfjack (Contributor) commented May 21, 2019

I can reproduce with a shell test. Here's a partial stack trace:

"skyframe-evaluator 59" #415 prio=5 os_prio=0 cpu=13643.69ms elapsed=13.86s tid=0x00007f7434001000 nid=0x220fd runnable  [0x00007f74966e4000]
   java.lang.Thread.State: RUNNABLE
	at com.google.devtools.build.lib.unix.NativePosixFiles.write(Native Method)
	at com.google.devtools.build.lib.unix.UnixFileSystem$NativeFileOutputStream.write(UnixFileSystem.java:494)
	at com.google.devtools.build.lib.unix.UnixFileSystem$ProfiledNativeFileOutputStream.write(UnixFileSystem.java:510)
	- locked <0x00000005c59d2478> (a com.google.devtools.build.lib.unix.UnixFileSystem$ProfiledNativeFileOutputStream)
	at com.google.common.io.ByteStreams.copy(ByteStreams.java:113)
	at com.google.devtools.build.lib.exec.StandaloneTestStrategy.appendStderr(StandaloneTestStrategy.java:332)
	at com.google.devtools.build.lib.exec.StandaloneTestStrategy.access$400(StandaloneTestStrategy.java:71)
	at com.google.devtools.build.lib.exec.StandaloneTestStrategy$BazelTestAttemptContinuation.execute(StandaloneTestStrategy.java:566)
	at com.google.devtools.build.lib.exec.StandaloneTestStrategy.beginTestAttempt(StandaloneTestStrategy.java:316)

ulfjack (Contributor) commented May 21, 2019

Ok, I found it. Sorry about that.

buchgr (Contributor) commented May 21, 2019

The fix for this will need a patch release for 0.25 and 0.26.

ulfjack (Contributor) commented May 21, 2019

The problem is right here:
4a5e1b7#diff-7a4f8f06ca574919bd90a2fae061a43bR402

dkelmer pushed a commit that referenced this issue May 21, 2019
As of 4a5e1b7, it was using getErrorPath twice, which could cause it to
loop indefinitely, trying to append the error file to itself. This
happened rarely, as the test runner script redirects stderr to stdout.
However, it could happen if the SpawnRunner wrote any extra output to
stderr, which the RemoteSpawnRunner does in some cases.

I have manually checked that this fixes the issue, and also added a
regression test.

Fixes #8320.

PiperOrigin-RevId: 249258656
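
For illustration only, here is a minimal, self-contained Java sketch of the failure mode described in the commit message above. It is not Bazel's actual code (Bazel uses ByteStreams.copy in StandaloneTestStrategy, as the stack trace shows), and the file name is made up: when the same path is opened as both the source and the append destination, the reader keeps seeing the bytes the writer just appended and never reaches EOF, so the copy loop runs forever whenever the file starts out non-empty.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class SelfAppendDemo {
  public static void main(String[] args) throws IOException {
    // Hypothetical file name; any file with existing content triggers the loop.
    String errPath = "test.err";

    // Open the SAME path for reading and for appending, mirroring the bug
    // where getErrorPath() was used as both the source and the destination.
    try (InputStream in = new FileInputStream(errPath);
        OutputStream out = new FileOutputStream(errPath, /* append= */ true)) {
      byte[] buf = new byte[8192];
      int n;
      // Each chunk read is appended back to the same file, so the reader
      // never observes EOF: the file keeps growing and the loop never ends.
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
    }
  }
}

With an empty test.err the very first read returns -1 and the copy finishes immediately, which matches the observation that the hang only shows up when the SpawnRunner writes extra output to stderr.
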
dkelmer (Contributor) commented May 21, 2019

@ulfjack The added test fails on Buildkite and locally when cherry-picked into the 0.25 release: https://buildkite.com/bazel/bazel-bazel/builds/8388#b6c12c75-4480-4fe8-93c4-f5ec9a99e4b7

ulfjack (Contributor) commented May 22, 2019

@dkelmer Sorry about that. The test relies on the directory being created in the test strategy, which is only true after fb04c5b. You can either ignore the failure, disable the test in the release branch, or cherry-pick fb04c5b as well. It applies cleanly and also fixes the test failure, but I did not try running all the tests.

Technically, we could also patch the test to create the directory before writing to the test.err, but I think disabling the test is fine, too.
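
As a rough sketch of the workaround mentioned above (patching the test to create the directory before writing to test.err), the following minimal Java snippet shows the idea; the path is hypothetical and this is not the actual test code.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CreateErrDirFirst {
  public static void main(String[] args) throws IOException {
    // Hypothetical location of the test's .err file; the real path comes
    // from the test strategy's output directory.
    Path testErr = Paths.get("testlogs/some_test/test.err");

    // Create any missing parent directories before writing; without
    // fb04c5b the test strategy does not create them for us.
    Files.createDirectories(testErr.getParent());
    Files.write(testErr, "extra stderr output\n".getBytes(StandardCharsets.UTF_8));
  }
}
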

aehlig pushed commits that referenced this issue May 22–24, 2019 (same commit message as above)
irengrig pushed a commit to irengrig/bazel that referenced this issue Jun 18, 2019 (same commit message as above)