
Failed tests executed with REv2 hang indefinitely with Bazel 0.25 #8320

Closed
cphang99 opened this issue May 14, 2019 · 17 comments
Labels: team-Remote-Exec (Issues and PRs for the Execution (Remote) team), untriaged

Comments

cphang99 commented May 14, 2019

Description of the problem / feature request:

Bazel 0.25 has introduced a regression, causing the reporting of failed
tests executed using REv2 to hang indefinitely.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Launch a recent version of Buildbarn.
On a Linux-based system, this can be done, for example, by using
the Docker Compose based setup.

Create a simple Bazel workspace that contains the following files:

WORKSPACE: May remain empty.

BUILD.bazel:

cc_test(
    name = "cc_failure_test",
    srcs = ["cc_failure_test.c"],
)

py_test(
    name = "py_failure_test",
    srcs = ["py_failure_test.py"],
)

cc_failure_test.c:

#include <stdio.h>

int main() {
  printf("Fail!\n");
  return 1;
}

py_failure_test.py:

import sys

print("Fail!")
sys.exit(1)

.bazelrc:

build --experimental_strict_action_env
build --genrule_strategy=remote
build --spawn_strategy=remote
build --strategy=Closure=remote
build --strategy=Javac=remote

build --bes_backend=localhost:8983
build --bes_results_url=http://localhost:7984/build_events/bb-event-service/
build --jobs=8
build --remote_executor=localhost:8980
build --remote_instance_name=local

Now run bazel test ... to execute both unit tests. Buildbarn will
immediately execute all build actions and report that the execution of
both tests has failed. Logging/tracing on Buildbarn's side confirms
this. Still, Bazel remains stuck in a state where it does not
process the results it receives from the build cluster:

$ bazel test ...
INFO: Analyzed 2 targets (20 packages loaded, 505 targets configured).
INFO: Found 2 test targets...
[8 / 10] 2 actions running
    Testing //:py_failure_test; 72s remote
    Testing //:cc_failure_test; 72s remote

The invocation of Bazel never terminates, regardless of any timeouts
configured. Bazel does not voluntarily shut down when Ctrl+C is pressed;
it only terminates after Ctrl+C is pressed three times.

This regression was introduced by commit
4a5e1b7. It only affects the execution
of failing tests; tests that succeed complete properly.

Below is a list of Bazel versions tested.

What operating system are you running Bazel on?

macOS 10.14.x and Ubuntu 18.04.

Have you found anything relevant by searching the web?

Apart from the issues linked from the messages of the commits
above (e.g., #6394), nothing relevant was found.

buchgr (Contributor) commented May 16, 2019

@cphang99 thanks for the detailed bug report. Do I understand correctly that you managed to pinpoint the root cause to 4a5e1b7?

cphang99 (Author) commented:

@buchgr yes. Reverting 4a5e1b7 (and the associated commits needed to resolve conflicts) results in bazel test once again reporting test failures correctly, without hanging.

dslomov added the team-Remote-Exec and untriaged labels May 17, 2019
buchgr (Contributor) commented May 21, 2019

@ulfjack are you aware of this? Has this been fixed in master?

ulfjack (Contributor) commented May 21, 2019

This is the first I've heard of this. 4a5e1b7 adds a separate implementation and should not result in any user-visible change. However, 895c43d enables the new implementation by default.

ulfjack (Contributor) commented May 21, 2019

A stack trace would be useful.

ulfjack (Contributor) commented May 21, 2019

It's odd that this only seems to happen on buildbarn. We don't have any other reports, AFAIK.

ulfjack (Contributor) commented May 21, 2019

I can't get buildbarn to work:

ERROR: Failed to query remote execution capabilities: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception

That's literally all the error message I get.

ulfjack (Contributor) commented May 21, 2019

While I'm here: it's not a good idea to attempt a network connection in BlazeModule.beforeCommand.

ulfjack (Contributor) commented May 21, 2019

java.io.IOException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
	at com.google.devtools.build.lib.remote.RemoteServerCapabilities.get(RemoteServerCapabilities.java:79)
	at com.google.devtools.build.lib.remote.RemoteModule.beforeCommand(RemoteModule.java:228)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:313)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:206)
	at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:749)
	at com.google.devtools.build.lib.server.GrpcServerImpl.access$1600(GrpcServerImpl.java:103)
	at com.google.devtools.build.lib.server.GrpcServerImpl$2.lambda$run$0(GrpcServerImpl.java:818)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:233)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:214)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:139)
	at build.bazel.remote.execution.v2.CapabilitiesGrpc$CapabilitiesBlockingStub.getCapabilities(CapabilitiesGrpc.java:200)
	at com.google.devtools.build.lib.remote.RemoteServerCapabilities.lambda$get$0(RemoteServerCapabilities.java:77)
	at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:237)
	at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:104)
	at com.google.devtools.build.lib.remote.RemoteServerCapabilities.get(RemoteServerCapabilities.java:77)
	... 9 more
Caused by: io.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record: 000000040000000000
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1204)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1272)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:337)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:677)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:612)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:529)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:491)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	... 1 more

ulfjack (Contributor) commented May 21, 2019

That's with TLS disabled, AFAICT, using all the TLS-disabling options I could find.

ulfjack (Contributor) commented May 21, 2019

Ah, I got it. It was picking up options from my global bazelrc. However, it still doesn't work:

ERROR: /usr/local/google/home/ulfjack/Projects/test-bazel/cpp/BUILD:6:1: undeclared inclusion(s) in rule '//cpp:cc_failure_test':
this rule is missing dependency declarations for the following files included by 'cpp/cc_failure_test.c':
  '/usr/local/lib/clang/8.0.0/include/stddef.h'
  '/usr/local/lib/clang/8.0.0/include/stdarg.h'

I had to upgrade to a newer version of the bazel_toolchains, as 0.25.2 didn't work with the old one.

ulfjack (Contributor) commented May 21, 2019

I can reproduce with a shell test. Here's a partial stack trace:

"skyframe-evaluator 59" #415 prio=5 os_prio=0 cpu=13643.69ms elapsed=13.86s tid=0x00007f7434001000 nid=0x220fd runnable  [0x00007f74966e4000]
   java.lang.Thread.State: RUNNABLE
	at com.google.devtools.build.lib.unix.NativePosixFiles.write(Native Method)
	at com.google.devtools.build.lib.unix.UnixFileSystem$NativeFileOutputStream.write(UnixFileSystem.java:494)
	at com.google.devtools.build.lib.unix.UnixFileSystem$ProfiledNativeFileOutputStream.write(UnixFileSystem.java:510)
	- locked <0x00000005c59d2478> (a com.google.devtools.build.lib.unix.UnixFileSystem$ProfiledNativeFileOutputStream)
	at com.google.common.io.ByteStreams.copy(ByteStreams.java:113)
	at com.google.devtools.build.lib.exec.StandaloneTestStrategy.appendStderr(StandaloneTestStrategy.java:332)
	at com.google.devtools.build.lib.exec.StandaloneTestStrategy.access$400(StandaloneTestStrategy.java:71)
	at com.google.devtools.build.lib.exec.StandaloneTestStrategy$BazelTestAttemptContinuation.execute(StandaloneTestStrategy.java:566)
	at com.google.devtools.build.lib.exec.StandaloneTestStrategy.beginTestAttempt(StandaloneTestStrategy.java:316)

ulfjack (Contributor) commented May 21, 2019

Ok, I found it. Sorry about that.

buchgr (Contributor) commented May 21, 2019

The fix for this will need a patch release for 0.25 and 0.26.

ulfjack (Contributor) commented May 21, 2019

The problem is right here:
4a5e1b7#diff-7a4f8f06ca574919bd90a2fae061a43bR402

dkelmer pushed a commit that referenced this issue May 21, 2019
As of 4a5e1b7, it was using getErrorPath twice, which could cause it to
loop indefinitely, trying to append the error file to itself. This
happened rarely, as the test runner script redirects stderr to stdout.
However, it could happen if the SpawnRunner wrote any extra output to
stderr, which the RemoteSpawnRunner does in some cases.

I have manually checked that this fixes the issue, and also added a
regression test.

Fixes #8320.

PiperOrigin-RevId: 249258656
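
For illustration only, here is a minimal, self-contained Java sketch of the failure mode described in the commit message above. It is not Bazel's actual code (Bazel uses ByteStreams.copy in StandaloneTestStrategy, as the stack trace shows), and the file name is made up: when the same path is opened as both the source and the append destination, the reader keeps seeing the bytes the writer just appended and never reaches EOF, so the copy loop runs forever whenever the file starts out non-empty.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class SelfAppendDemo {
  public static void main(String[] args) throws IOException {
    // Hypothetical file name; any file with existing content triggers the loop.
    String errPath = "test.err";

    // Open the SAME path for reading and for appending, mirroring the bug
    // where getErrorPath() was used as both the source and the destination.
    try (InputStream in = new FileInputStream(errPath);
        OutputStream out = new FileOutputStream(errPath, /* append= */ true)) {
      byte[] buf = new byte[8192];
      int n;
      // Each chunk read is appended back to the same file, so the reader
      // never observes EOF: the file keeps growing and the loop never ends.
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
    }
  }
}

With an empty test.err the very first read returns -1 and the copy finishes immediately, which matches the observation that the hang only shows up when the SpawnRunner writes extra output to stderr.
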
dkelmer (Contributor) commented May 21, 2019

@ulfjack The added test fails on Buildkite and locally when cherry-picked into the 0.25 release: https://buildkite.com/bazel/bazel-bazel/builds/8388#b6c12c75-4480-4fe8-93c4-f5ec9a99e4b7

ulfjack (Contributor) commented May 22, 2019

@dkelmer Sorry about that. The test relies on the directory being created in the test strategy, which is only true after fb04c5b. You can either ignore the failure, disable the test in the release branch, or cherry-pick fb04c5b as well. It applies cleanly and also fixes the test failure, but I did not try running all the tests.

Technically, we could also patch the test to create the directory before writing to the test.err, but I think disabling the test is fine, too.
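
As a rough sketch of the workaround mentioned above (patching the test to create the directory before writing to test.err), the following minimal Java snippet shows the idea; the path is hypothetical and this is not the actual test code.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CreateErrDirFirst {
  public static void main(String[] args) throws IOException {
    // Hypothetical location of the test's .err file; the real path comes
    // from the test strategy's output directory.
    Path testErr = Paths.get("testlogs/some_test/test.err");

    // Create any missing parent directories before writing; without
    // fb04c5b the test strategy does not create them for us.
    Files.createDirectories(testErr.getParent());
    Files.write(testErr, "extra stderr output\n".getBytes(StandardCharsets.UTF_8));
  }
}
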

aehlig pushed commits that referenced this issue May 22–24, 2019 (same commit message as above)
irengrig pushed a commit to irengrig/bazel that referenced this issue Jun 18, 2019 (same commit message as above)