Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bazel build //X:all records target completion events in the BEP stream with file:// URIs, but sometimes doesn't write files to disk #20737

Closed
froydnj opened this issue Jan 3, 2024 · 20 comments
Assignees
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug untriaged

Comments

@froydnj
Copy link

froydnj commented Jan 3, 2024

Description of the bug:

We use --build_event_json_file to watch for target completion events and run some post-processing on the produced files in tandem with Bazel running. With Bazel 6.3.2, this worked fine, though we did run into #17798, and therefore had to add some polling to wait for the files being written to disk. This polling has a built-in (per-target) timeout of 3 minutes, which we ~never see hit (post-processing generally completes within seconds).

When attempting to upgrade to Bazel 7.0.0, we are seeing this timeout regularly getting hit, despite remote cache hits and an otherwise quiescent system. That is, we are seeing target completion events in the BEP stream, but for a small number of those targets, the relevant files never appear on disk (at least within the 3-minute timeout; bazel build finishes within ~40s, so the machine is not particularly overloaded). There does not seem to be any commonality between the files that are not present between different bazel build runs; a given target might not be downloaded in one run and be downloaded in another. This behavior occurs with both fully-cached and mostly-cached builds.

The same behavior occurs when building with //X:....

We do occasionally see successful builds with Bazel 7, but they are rare (~1%, maybe less). It is possible, of course, that raising the timeout would work, but given that we ~never hit the timeout with Bazel 6 and ~always hit the timeout with Bazel 7, increasing the timeout would represent a significant performance hit.

We have checked the BEP streams between Bazel 6 and Bazel 7 (fully cached build for both) and there are only trivial differences. We took GRPC logs from our remote cache (fully cached build for both) and again there are only trivial differences.

We are using --remote_download_toplevel with both versions. This issue happens with remote caching enabled and remote execution disabled, which is our CI configuration (we know...); we have a separate configuration with remote caching and remote execution enabled that does not appear to exhibit the problem.

We do not see this behavior if we manually expand //X:all on the command line, i.e. listing out all the relevant targets that //X:all would have built.

cc @sushain97 @ngroman @xieyuxi-stripe @coeuvre

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

We haven't been able to come up with a simple example that reproduces this outside of our codebase.

Which operating system are you running Bazel on?

x86-64 Linux

What is the output of bazel info release?

release 7.0.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

Yes, this is a regression. We have not tried to isolate the relevant commit(s).

Have you found anything relevant by searching the web?

I have looked through the last several months of Github issues, but do not see anything relevant.

bazel-discuss does not appear to have anything related.

Any other information, logs, or outputs that you want to share?

No response

@coeuvre
Copy link
Member

coeuvre commented Jan 4, 2024

I can't reproduce this issue with simple genrule example neither. Maybe this requires some complexities in the rules (e.g. macro, alias, etc.) to trigger. cc @tjgq

@tjgq
Copy link
Contributor

tjgq commented Jan 4, 2024

This is a bit of a shot in the dark, but does setting --noexperimental_merged_skyframe_analysis_execution make the issue go away?

@froydnj
Copy link
Author

froydnj commented Jan 4, 2024

This is a bit of a shot in the dark, but does setting --noexperimental_merged_skyframe_analysis_execution make the issue go away?

That's a pretty good shot in the dark! It does indeed make the issue go away (multiple successful builds with Bazel 7, with no failures).

@meteorcloudy meteorcloudy added the soft-release-blocker Soft release blockers that are nice to have, but shouldn't block the release if it's the last one. label Jan 8, 2024
@meteorcloudy meteorcloudy added this to the 7.0.1 release blockers milestone Jan 8, 2024
@meteorcloudy meteorcloudy removed the soft-release-blocker Soft release blockers that are nice to have, but shouldn't block the release if it's the last one. label Jan 8, 2024
@meteorcloudy
Copy link
Member

@bazel-io fork 7.0.1

@meteorcloudy
Copy link
Member

I added this issue as a soft release blocker for 7.0.1, since the fix is still unclear and there is a workaround. We can maybe have a 7.0.2 if this doesn't get fix in time.

@iancha1992
Copy link
Member

@bazel-io fork 7.1.0

@coeuvre
Copy link
Member

coeuvre commented Jan 11, 2024

@froydnj We looked into this issue today but still cannot reproduce it. Can you try to work out a minimal repro, or provide a full set of flags you are using?

On the other hand, were you building incrementally or cleanly?

@nya3jp
Copy link

nya3jp commented Apr 3, 2024

ChromeOS Bazel build might be seeing this issue. I recently added a postprocessor that consumes BEP JSON and archives relevant log files generated in bazel-out into a tarball, and observed that it sometimes fails to find output files mentioned in BEP.

In our case, skymeld is enabled, remote caching is enabled, we always start build from a clean output directory.

@lberki
Copy link
Contributor

lberki commented Apr 4, 2024

@nya3jp does this always happen for *_host_deps.profile.json and *_host_deps.json? If so, is there something special about how these files are generated? Maybe from an aspect, a top-level aspect, or some other commonality?

@ismell
Copy link

ismell commented Apr 4, 2024

The *_host_deps.profile.json and *_host_deps.json are standard output files from the build_packages action. What is special is we are using our collect_logs aspect and attaching it to the top level targets via the --aspects flag.

@lberki
Copy link
Contributor

lberki commented Apr 4, 2024

That might actually be relevant! The code path to fetch the outputs of aspects is slightly different depending on the value of --experimental_merged_skyframe_analysis_execution.

@joeleba mind taking a look at this to see if anything stands out? In particular, I'm worried that ConcurrentPathTrie might have a race condition somewhere.

Alternatively, I could imagine that if the analysis of the aspect that promotes an artifact to a top-level artifact finishes after the artifact is built, the artifact will not get downloaded.

@lberki
Copy link
Contributor

lberki commented Apr 5, 2024

The second alternative it is. Here is a reproduction. As expected, it requires --experimental_merged_skyframe_analysis_execution and --remote_download_outputs=toplevel .

cat > BUILD <<'EOF'
filegroup(name = "top", srcs = [":actual", "//slow"])
genrule(name = "proxy",  srcs = [":actual"], outs = ["proxy_file"], cmd = "cp $< $@")
genrule(name = "actual", srcs = [], outs = ["actual_file"], cmd = "echo ACTUAL > $@")
EOF

mkdir -p slow
mkfifo slow/BUILD

bazel build :proxy --disk_cache=/tmp/dc  # Prime the disk cache
bazel clean  # Clean the output tree so that the disk cache is hit again and actual_file is deleted
bazel build :top :proxy --disk_cache=/tmp/dc &
# Wait until :proxy builds; most principled is reading the BEP
# (I just used manual waiting for a few seconds, which is too brittle for an actual test)
echo "filegroup(name = 'slow', visibility = ['//visibility:public'])" > slow/BUILD  # Unblock analyzing :top
wait %1  # Wait for the second Bazel invocation

# Behold the absence of bazel-bin/actual_file

@joeleba
Copy link
Member

joeleba commented Apr 5, 2024

From tracing the code with the debugger, it seems like your theory is correct, Lukacs. actual_file was first generated via :proxy and then only recognized later as a top level output by :top.

Each time a top level target is analyzed, we add its output artifacts to a Trie of pathsToDownload. This Trie is later on used to decide whether to download an output artifact or not, upon the completion of an action.

Naive question: is it feasible to just attempt to download every artifact in this Trie at the end of the build? That would take away the importance of when an artifact is registered as "top-level".

@lberki
Copy link
Contributor

lberki commented Apr 5, 2024

That'll be for @coeuvre to decide -- that's definitely the simplest solution although in incurs a performance hit, since the download would start potentially much later than the moment Bazel realizes that a file needs to be downloaded.

@tjgq
Copy link
Contributor

tjgq commented Apr 5, 2024

I think the idea here is that we still download action outputs during action execution if we can, but we need a final pass at the end for anything that was only marked for download after the respective action finished.

We could also consider storing top-level outputs that came from an aspect separately, so we only have to iterate those at the end, instead of the entire pathsToDownload (I'm not sure if that matters much; we can probably start without it).

@coeuvre
Copy link
Member

coeuvre commented Apr 5, 2024

Naive question: is it feasible to just attempt to download every artifact in this Trie at the end of the build? That would take away the importance of when an artifact is registered as "top-level".

We tried a similar idea before and encountered #17798.

@coeuvre
Copy link
Member

coeuvre commented Apr 5, 2024

Another idea is: attempt downloading toplevel outputs immediately when we receive analyzed event from skymeld.

No, this could still trigger #17798.

@Wyverald
Copy link
Member

Wyverald commented May 6, 2024

Are we still targeting this for 7.2.0? (first RC next Monday, the 13th)

@coeuvre
Copy link
Member

coeuvre commented May 7, 2024

Yes, I am working on the fix now.

coeuvre added a commit to coeuvre/bazel that referenced this issue May 10, 2024
…ent.

Fixes bazelbuild#20737.

PiperOrigin-RevId: 632430888
Change-Id: I1f2e1034c73162e925e25eb57b31d6a32b30fad2
github-merge-queue bot pushed a commit that referenced this issue May 10, 2024
…etion ev… (#22318)

…ent.

Fixes #20737.

PiperOrigin-RevId: 632430888
Change-Id: I1f2e1034c73162e925e25eb57b31d6a32b30fad2

Closes #20816.
Kila2 pushed a commit to Kila2/bazel that referenced this issue May 13, 2024
…ent.

Fixes bazelbuild#20737.

PiperOrigin-RevId: 632430888
Change-Id: I1f2e1034c73162e925e25eb57b31d6a32b30fad2
@iancha1992
Copy link
Member

A fix for this issue has been included in Bazel 7.2.0 RC1. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.2.0rc1. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug untriaged
Projects
None yet
Development

No branches or pull requests