Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bazel does not re-evaluate remote actions, even after shutdown #17366

Closed
ozio85 opened this issue Jan 31, 2023 · 14 comments
Closed

Bazel does not re-evaluate remote actions, even after shutdown #17366

ozio85 opened this issue Jan 31, 2023 · 14 comments
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@ozio85
Copy link

ozio85 commented Jan 31, 2023

Description of the bug:

Currently Bazel does not re-evaluate remote actions towards the remote execution service, even after a shutdown is performed.

This causes errors when:

  1. The remote cache rotates
  2. The remote cache is completely wiped.

This leads to a:
(Exit 34): /home/path/to/file.h (No such file or directory)
(BulkTransferException)

And before this could be reparied by rerunning (since this error causes bazel to shutdown). But the behavior now is even after restart, Bazel still remembers the state, and the BulkTransferException remains until a bazel clean is performed

I am using:
--remote_download_toplevel

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

bazel build //something --experimental_remote_grpc_log=first.run.log

bazel shutdown

bazel build //something --experimental_remote_grpc_log=second.run.log

I expect bazel to reevaluate the actions towards the remote server, to see if any needs a re-run.

Which operating system are you running Bazel on?

linux

What is the output of bazel info release?

7.0.0-pre.20221111.3

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

This is the same issue as #13882, however in 7.0.0, it is enabled by default
How is this ever going to work?

Any other information, logs, or outputs that you want to share?

No response

@ozio85
Copy link
Author

ozio85 commented Jan 31, 2023

It seems to be caused by the change of the flag --experimental_action_cache_store_output_metadata, to be switched to default True in remote_download_toplevel and remote_download_minimal. However, i do not see how this ever is going to work? Is the cache never rotated at Google?

@moroten
Copy link
Contributor

moroten commented Jan 31, 2023

I think the issue is #16720. Setting --experimental_action_cache_store_output_metadata=false together with --remote_download_minimal makes it work again.

That PR was merged as commit 25558ada3fb377cfc2c965d3a93250ca28ce0fc1 which is part of 7.0.0-pre.20221111.3. Running bazel shutdown && bazel test --remote_download_toplevel --remote_executor=... @abseil-hello//:hello_test multiple times reports the following:

Bazel 7.0.0-pre.20221102.3
INFO: 45 processes: 44 remote cache hit, 1 internal.

Bazel 7.0.0-pre.20221111.3
INFO: 1 process: 1 internal.

Bazel 7.0.0-pre.20221111.3 with --experimental_action_cache_store_output_metadata=false
INFO: 45 processes: 44 remote cache hit, 1 internal.

I think --experimental_action_cache_store_output_metadata needs to be kept to false unless the lease service is in place.

@brentleyjones
Copy link
Contributor

@coeuvre

@moroten
Copy link
Contributor

moroten commented Jan 31, 2023

By the way, this problem is also part of Bazel 6.0.0.

@coeuvre
Copy link
Member

coeuvre commented Jan 31, 2023

This is WAI. The purpose of flag --experimental_action_cache_store_output_metadata is to avoid re-evaluate the actions if Bazel doesn't need to (because of local action cache).

@coeuvre
Copy link
Member

coeuvre commented Jan 31, 2023

And the support for remote cache eviction is WIP in #16660.

@moroten
Copy link
Contributor

moroten commented Jan 31, 2023

Was it intended to bring this in as part of 6.0.0 when #16660 is not there? I basically makes --remote_download_minimal and --remote_download_toplevel unusable.

@ozio85
Copy link
Author

ozio85 commented Jan 31, 2023

So how do I solve the problem at hand:

  1. An item is dropped from the remote cache;
  2. My local state still does not re-evalaute the action
  3. Bazel fails with BulkTransferException
  4. I try to shutdown reboot, but Bazel keeps on failing.

I cannot repair this in any way, except with a bazel clean

Is this really intended?

@coeuvre
Copy link
Member

coeuvre commented Jan 31, 2023

Without --experimental_action_cache_store_output_metadata, if remote cache evicted blobs, the build fails as well. To continue the build, a manually bazel shutdown is required.

With --experimental_action_cache_store_output_metadata, the downside is a bazel clean is required to continue the build if remote cache evicted. However, the benefit is to save lots of remote cache checks for incremental build (44 remote cache hit, 1 internal vs 1 internal in your example).

Without --experimental_action_cache_store_output_metadata, the cost for execution phrase for each incremental build is almost identical to build after bazel clean.

I don't see why bazel clean is not suitable to replace bazel shutdown as a workaround.

@ozio85
Copy link
Author

ozio85 commented Jan 31, 2023

Mm I like your intention (since the action phase is the most time consuming right now), but I feel this is not the complete solution.

Before I could restart the build when error 34 was detected, and it only occurred if the Bazel server was left running, so local users were seldom affected by it.

Now it is just a matter of time before a bazel clean is required.

Do you recommend that all server jobs detect the errorcode 34, and in that case perform a bazel clean ?

@coeuvre
Copy link
Member

coeuvre commented Jan 31, 2023

If remote cache eviction happens too frequently and you don't want to do a bazel clean, disabling --experimental_action_cache_store_output_metadata until lease service implemented is probably the best option for now. But for most cases, I believe --experimental_action_cache_store_output_metadata should bring more values.

@ozio85
Copy link
Author

ozio85 commented Jan 31, 2023

@torgil what do you think ?

@sgowroji sgowroji added type: bug team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Jan 31, 2023
@ozio85
Copy link
Author

ozio85 commented Jan 31, 2023

I also think this somewhat contradicts this statement:
Bazel's design is such that these problems are fixable and these bugs are a high priority to be fixed. If you ever find an incorrect incremental build, file a bug report, and report bugs in the tools rather than using clean.

@torgil
Copy link
Contributor

torgil commented Feb 1, 2023

@torgil what do you think ?

I think this needs a bazel binary test-case. I'm trying to do that on relevant issues/patches just to check what capabilities a random bazel binary has. Do we have a good test-environment for bazel remote-execution / remote-cache?

coeuvre added a commit to coeuvre/bazel that referenced this issue Feb 20, 2023
Currently, when building without the bytes, if Bazel failed to download blobs from CAS when fetching them as inputs to local actions, Bazel fails the build with message like `... --remote_download_outputs=minimal does not work if your remote cache evicts files during builds.` and this message keep showing up until a manually `bazel clean`.

This PR fixes that by cleaning up stale state in skyframe and action cache upon remote cache eviction so that a following build can continue without `bazel shutdown` or `bazel clean`.

Fixes bazelbuild#17366.
Part of bazelbuild#16660.

Closes bazelbuild#17462.

PiperOrigin-RevId: 510952745
Change-Id: I4fc59a21195565c68375a19ead76738d2208c4ac
keertk pushed a commit that referenced this issue Feb 20, 2023
* Cleanup stale state when remote cache evicted

Currently, when building without the bytes, if Bazel failed to download blobs from CAS when fetching them as inputs to local actions, Bazel fails the build with message like `... --remote_download_outputs=minimal does not work if your remote cache evicts files during builds.` and this message keep showing up until a manually `bazel clean`.

This PR fixes that by cleaning up stale state in skyframe and action cache upon remote cache eviction so that a following build can continue without `bazel shutdown` or `bazel clean`.

Fixes #17366.
Part of #16660.

Closes #17462.

PiperOrigin-RevId: 510952745
Change-Id: I4fc59a21195565c68375a19ead76738d2208c4ac

* Add integration tests for incremental build for Build without the Bytes.

PiperOrigin-RevId: 487193345
Change-Id: Id10c9aebad630928a9a33fa22824e4d78041a4d7
copybara-service bot pushed a commit that referenced this issue Feb 23, 2023
Related to #17366.

PiperOrigin-RevId: 511849585
Change-Id: I1c4acd469280628b15441a8e7a97ff572c42c1ff
coeuvre pushed a commit to coeuvre/bazel that referenced this issue Feb 27, 2023
Related to bazelbuild#17366.

PiperOrigin-RevId: 511849585
Change-Id: I1c4acd469280628b15441a8e7a97ff572c42c1ff
ShreeM01 pushed a commit that referenced this issue Feb 27, 2023
Related to #17366.

PiperOrigin-RevId: 511849585
Change-Id: I1c4acd469280628b15441a8e7a97ff572c42c1ff

Co-authored-by: Googler <tjgq@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants