Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected Exception 'Must not call uploadBlobs after shutdown.' when closing BEP transports, this is a bug. #12575

Closed
siggisim opened this issue Nov 26, 2020 · 6 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Core Skyframe, bazel query, BEP, options parsing, bazelrc

Comments

@siggisim
Copy link
Contributor

Description of the problem:

With a gRPC remote cache and BES backend enabled, Bazel intermittently fails to write build event file with the following error:

java.lang.RuntimeException: Unexpected Exception 'Must not call uploadBlobs after shutdown.' when closing BEP transports, this is a bug.
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:503)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.closeBepTransports(BuildEventServiceModule.java:581)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.afterCommand(BuildEventServiceModule.java:599)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:626)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:604)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:231)
	at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:543)
	at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$1(GrpcServerImpl.java:606)
	at io.grpc.Context$1.run(Context.java:579)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Must not call uploadBlobs after shutdown.
	at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:564)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:545)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:102)
	at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:237)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:487)
	... 11 more

I've traced this down FileTransport.java and BuildEventServiceUploader.java both calling shutdown on the same ByteStreamBuildEventArtifactUploader. Normally this is fine, because they shutdown around the same time, after all uploads are complete.

However, if findMissingDigests takes a while, FileTransport's shutdown is called which shuts down the ByteStreamBuildEventArtifactUploader. Then when findMissingDigests returns - upload is attempted on an uploader that has already been shut down.

This happens in practice on builds with thousands of outputs that need to be uploaded. It can also be triggered artificially by adding an intermittent sleep into Bazel in GrpcCacheClient here before returning the missing digests.

if (Math.random() < .1) {
    try { Thread.sleep((long)(Math.random() * 3000)); } catch(InterruptedException e) {}
}

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Create the following BUILD file that generates 10 dummy outputs to be downloaded.

$ cat BUILD
[genrule(
    name = "target-{}".format(i),
    outs = ["output-{}.txt".format(i)],
    cmd = "echo {} > $@".format(i),
    visibility = ["//visibility:public"],
) for i in range(10)]

Add intermittent sleep to Bazel in GrpcCacheClient here before returning the missing digests.

if (Math.random() < .1) {
    try { Thread.sleep((long)(Math.random() * 3000)); } catch(InterruptedException e) {}
}

Run a build with --remote_cache, --bes_backend, --build_event_json_file set. Setting a random --remote_instance_name makes sure that the outputs will be freshly uploaded on each run. This fails >50% of the time.

bazel build //... --remote_cache=cloud.buildbuddy.dev --bes_backend=cloud.buildbuddy.dev --build_event_json_file=/tmp/bep.json --remote_instance_name=$(date +%s)

What operating system are you running Bazel on?

Linux

What's the output of bazel info release?

release 3.7.1

I've reproduced on 3.1.0 and 3.7.1, but haven't tried outside of that range. It's easiest to reproduce with a custom Bazel version with the sleep.

Any other information, logs, or outputs that you want to share?

Some less than ideal fixes I've found for this include removing the shutdown call from FileTransport, or wrapping uploadLocalFiles in BuildEventArtifactUploader with uploader.retain() and uploader.release(). Don't love either of these.

Would love any advice from someone who is more familiar with this code on what a good fix would look like - would be happy to send a pull request. I've seen this error with 4 different companies we've been working with.

Not sure if this this is related, but I'm hoping that the fix here might also fix another bug which we see much more frequently and doesn't seem to require setting build_event_json_file, but I've been having a harder time reproducing reliably enough to file a detailed bug report.

FAILED: Build did NOT complete successfully
WARNING: BES was not properly closed
Internal error thrown during build. 
Printing stack trace: java.util.concurrent.RejectedExecutionException: 
Task com.google.common.util.concurrent.TrustedListenableFutureTask@12febc19
[status=PENDING, info=[task=[running=[NOT STARTED YET], 
com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader$$Lambda$1058/0x00000008008b5c40@633e8b27]]] 
rejected from java.util.concurrent.ThreadPoolExecutor@b0ef88a
[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 172]

My hunch for that one is that after commit d82341d that ThreadPool is shut down and some race condition is causing uploads to be added to the ThreadPool after it has been shutdown.

Really appreciate any and all help - would be happy to send pull requests for fixes. Just looking for some guidance on what the right fix looks like.

@siggisim
Copy link
Contributor Author

CC @michaeledgar who seems to know a thing or two about BES artifact uploading

@aiuto aiuto added team-Core Skyframe, bazel query, BEP, options parsing, bazelrc untriaged labels Nov 30, 2020
@janakdr janakdr added P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Nov 30, 2020
@michaeledgar
Copy link
Contributor

I've successfully reproduced this issue using the steps provided. I'm looking into a fix now. Thanks for reporting this!

@siggisim
Copy link
Contributor Author

siggisim commented Dec 4, 2020

Thanks @michaeledgar ! Really appreciate you taking the time to look into this. Let me know if I can help in any way.

If you have any thoughts on the ThreadPool shutdown stacktrace at the bottom (whether the fix for this might address that or not) I'd be super appreciative. Happy to open a second issue if not - working furiously on finding a reliable repro case for that one as we speak.

@michaeledgar
Copy link
Contributor

I have a fix that eliminates the crash similar to the ideas you proposed: the BuildEventArtifactUploader should be ReferenceCounted. In the scenario you describe - with both the BES transport and the File transport enabled, a single BuildEventArtifactUploader is shared between two instances that are currently both calling shutdown.

@siggisim
Copy link
Contributor Author

Ah that's great news, huge thank you for looking into this @michaeledgar!

Hopefully the reference counting will address the other issue as well, since it seems like in that scenario shutdown is called on the uploadExecutor too early, which is triggered by the same BuildEventArtifactUploader shutdown call (still haven't managed to figure out how or when).

If it doesn't, I'll open a separate issue - which we should be able to tackle with a little more reference counting.

Thanks again for your help!

@michaeledgar
Copy link
Contributor

The fix is currently in review; the patch may be accessed early at https://bazel-review.googlesource.com/c/bazel/+/149490.

philwo pushed a commit that referenced this issue Mar 15, 2021
…t enabled.

When both BES uploading and File BEP output are enabled, a single
BuildEventArtifactUploader object is shared by two different
BuildEventTransports. Both were calling #shutdown() which in turn called
ByteStreamUploader#shutdown(). If shutdown is called while one transport is
still uploading, the ByteStreamUploader will fail an assertion and crash.

This change adds reference counting to the BuildEventArtifactUploader interface
and ensures the reference counts are maintained correctly when sharing a
BuildEventArtifactUploader across multiple independent BuildEventTransport
threads.

Fixes #12575.

RELNOTES: None.

TESTED=Made repro modifications to GrpcCacheClient.java described in
#12575 and confirmed crash without this
change. Implemented this fix, observed crash was no longer reproducible. Added logs
to ByteStreamBuildEventArtifactUploader#deallocate() to verify deallocation happened
after both BuildEventTransports had completed.
PiperOrigin-RevId: 349589743
philwo pushed a commit that referenced this issue Mar 15, 2021
…t enabled.

When both BES uploading and File BEP output are enabled, a single
BuildEventArtifactUploader object is shared by two different
BuildEventTransports. Both were calling #shutdown() which in turn called
ByteStreamUploader#shutdown(). If shutdown is called while one transport is
still uploading, the ByteStreamUploader will fail an assertion and crash.

This change adds reference counting to the BuildEventArtifactUploader interface
and ensures the reference counts are maintained correctly when sharing a
BuildEventArtifactUploader across multiple independent BuildEventTransport
threads.

Fixes #12575.

RELNOTES: None.

TESTED=Made repro modifications to GrpcCacheClient.java described in
#12575 and confirmed crash without this
change. Implemented this fix, observed crash was no longer reproducible. Added logs
to ByteStreamBuildEventArtifactUploader#deallocate() to verify deallocation happened
after both BuildEventTransports had completed.
PiperOrigin-RevId: 349589743
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Core Skyframe, bazel query, BEP, options parsing, bazelrc
Projects
None yet
Development

No branches or pull requests

4 participants