Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build event uploader is not robust to upload failures and does not respect no-cache tag #11473

Closed
scele opened this issue May 24, 2020 · 11 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@scele
Copy link
Contributor

scele commented May 24, 2020

Description of the problem / feature request:

The BEP code that uploads missing files to remote cache has problematic behavior in two cases:

  1. Remote cache uploads that happen during bazel build are robust to upload failures: if a CAS blob upload fails even after retries, bazel will print a warning message but continue the build. However, the BEP uploader stage will also try to upload any missing blobs referenced in build events, and if the upload fails there, the whole build fails.
  2. While no-cache tag works correctly in that an AC entry never gets written to the remote cache, the BEP uploader's "upload-referenced-blobs" behavior means that no-cache cannot be used to prevent writing the action outputs (= CAS blobs) to the remote cache. In particular, we would like to use no-cache tags for actions that generate large artifacts that change often, and therefore not upload them to remote cache.

--nobuild_event_json_file_path_conversion fixes these issues, but it has the downside that none of the entries in the BES report reference the files in the remote cache, even if they were uploaded there as part of the build process.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Run a remote cache at grpc://localhost:8980 (I used bb-storage).

Create two targets that output a large file, and add a no-cache tag to the other one.

$ cat BUILD 
genrule(
   name = "bigfile",
   outs = ["bigfile.img"],
   cmd = "fallocate -l 1G \"$@\"",
)
genrule(
   name = "bigfile_nocache",
   outs = ["bigfile_nocache.img"],
   cmd = "fallocate -l 1G \"$@\"",
   tags = ["no-cache"],
)

We simulate upload failures by disabling retries and using a small timeout value. Without --build_event_json_file the build of //:bigfile passes with warnings:

$ bazel build //:bigfile --remote_cache=grpc://localhost:8980 --remote_instance_name=default --remote_retries=0 --remote_timeout=3
INFO: Invocation ID: 5a9ad001-0658-4bc6-8167-bdb9be6ecc3a
INFO: Analyzed target //:bigfile (5 packages loaded, 7 targets configured).
INFO: Found 1 target...
WARNING: Writing to Remote Cache:
BulkTransferException
Target //:bigfile up-to-date:
  bazel-bin/bigfile.img
INFO: Elapsed time: 6.079s, Critical Path: 5.93s
INFO: 1 process: 1 linux-sandbox.
INFO: Build completed successfully, 2 total actions
$ echo $?
0

With --build_event_json_file the build fails:

$ bazel build //:bigfile --remote_cache=grpc://localhost:8980 --remote_instance_name=default --remote_retries=0 --remote_timeout=3 --build_event_json_file=be.json
INFO: Invocation ID: 468776d9-4ca8-4501-8a1b-b610df0b67ca
INFO: Analyzed target //:bigfile (5 packages loaded, 7 targets configured).
INFO: Found 1 target...
WARNING: Writing to Remote Cache:
BulkTransferException
Target //:bigfile up-to-date:
  bazel-bin/bigfile.img
INFO: Elapsed time: 6.120s, Critical Path: 5.96s
INFO: 1 process: 1 linux-sandbox.
ERROR: Unable to write all BEP events to file due to 'java.io.IOException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 2999124047ns'
INFO: Build completed successfully, 2 total actions
$ echo $?
38

The BEP uploader also doesn't respect the no-cache tag (notice that the build does not emit BulkTransferException anymore, because it doesn't attempt to upload the results due to the no-cache tag):

$ bazel build //:bigfile_nocache --remote_cache=grpc://localhost:8980 --remote_instance_name=default --remote_retries=0 --remote_timeout=3 --build_event_json_file=be.json
INFO: Invocation ID: af3a755b-6f51-4333-9f3c-95deb342c731
INFO: Analyzed target //:bigfile_nocache (5 packages loaded, 7 targets configured).
INFO: Found 1 target...
Target //:bigfile_nocache up-to-date:
  bazel-bin/bigfile_nocache.img
INFO: Elapsed time: 3.074s, Critical Path: 2.92s
INFO: 1 process: 1 linux-sandbox.
ERROR: Unable to write all BEP events to file due to 'java.io.IOException: io.grpc.StatusRuntimeException: UNKNOWN: context deadline exceeded'
INFO: Build completed successfully, 2 total actions
$ echo $?
38

What operating system are you running Bazel on?

Ubuntu 18.04.3 LTS

What's the output of bazel info release?

release 3.1.0

Have you found anything relevant by searching the web?

No.

Any other information, logs, or outputs that you want to share?

I wonder if the BES uploader should behave as if --nobuild_event_json_file_path_conversion was passed for the files that have not been uploaded to the remote cache by the build?

@aiuto aiuto added team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged labels May 26, 2020
@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) type: bug and removed untriaged labels Dec 9, 2020
@brentleyjones
Copy link
Contributor

This is pretty annoying. We also set some actions to be no-cache to prevent uploading large-and-changes-with-every-build blobs. We don't want to set --experimental_build_event_upload_strategy=local as that prevents uploading the profile.json and test logs.

@EitoWayve80
Copy link

I also needed to use it to prevent build failures

@coeuvre
Copy link
Member

coeuvre commented Aug 3, 2021

  1. Remote cache uploads that happen during bazel build are robust to upload failures: if a CAS blob upload fails even after retries, bazel will print a warning message but continue the build. However, the BEP uploader stage will also try to upload any missing blobs referenced in build events, and if the upload fails there, the whole build fails.

I think it's reasonable to print a warning message and continue the build in this case, just like CAS upload. I can look into this.

  1. While no-cache tag works correctly in that an AC entry never gets written to the remote cache, the BEP uploader's "upload-referenced-blobs" behavior means that no-cache cannot be used to prevent writing the action outputs (= CAS blobs) to the remote cache. In particular, we would like to use no-cache tags for actions that generate large artifacts that change often, and therefore not upload them to remote cache.

IIUC, no-cache by definition only prevents writing to remote cache within action execution. Uploading outputs of no-cache actions should be valid for BEP uploader.

However, I do understand there are cases where we don't want some outputs be uploaded. We probably need another way to do this:

  1. Add a flag, e.g. --remote_bes_ignore_no_cache, to tell the uploader provided by remote module to ignore outputs of no-cache actions.
  2. Add a tag, e.g. no-bes-upload, to prevent all the BEP uploaders uploading outputs of the action.

cc @michaeledgar

@brentleyjones
Copy link
Contributor

However, I do understand there are cases where we don't want some outputs be uploaded. We probably need another way to do this:

  1. Add a flag, e.g. --remote_bes_ignore_no_cache, to tell the uploader provided by remote module to ignore outputs of no-cache actions.
  2. Add a tag, e.g. no-bes-upload, to prevent all the BEP uploaders uploading outputs of the action.

I think both of those would be good to have, and if I had to choose just one, it would be the new flag.

bazel-io pushed a commit that referenced this issue Sep 14, 2021
…oad errors

Local files referenced by build events are uploaded to remote cache with `ByteStreamBuildEventArtifactUploader` which uses `ByteStreamUploader` and `MissingDigestsFinder` internally.

This PR changes `ByteStreamBuildEventArtifactUploader` to use `RemoteCache` directly in order to benefit from recently improvements to `RemoteCache`.

Upload error that used to crash Bazel or cause a non-zero exit code is now caught and reported.

Fixes #13920.

Related #11473.

Closes #13959.

PiperOrigin-RevId: 396526152
@brentleyjones
Copy link
Contributor

I just wanted to bump this, since it's been nearly 3 months since the last comment. We would love to have our timing profile, and other BES related files, uploaded, but we can't because we don't want BES to cause our no-remote-cache items to be uploaded 😕.

@coeuvre
Copy link
Member

coeuvre commented Oct 28, 2021

Sorry for the delay. I will try to make this into 5.0.

@brentleyjones
Copy link
Contributor

Thank you @coeuvre!

@brentleyjones
Copy link
Contributor

Hi @coeuvre, do you still think this will make 5.0?

@coeuvre
Copy link
Member

coeuvre commented Nov 16, 2021

I do think. My fix is almost done.

@brentleyjones
Copy link
Contributor

That's great to hear!

@Wyverald
Copy link
Member

Wyverald commented Dec 7, 2021

Fixed by #14338 and cherrypicked into 5.0.0rc2 by #14389.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
7 participants