Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Action conflicts due to differing affected by starlark transition value #14239

Closed
fmeum opened this issue Nov 6, 2021 · 14 comments
Closed

Action conflicts due to differing affected by starlark transition value #14239

fmeum opened this issue Nov 6, 2021 · 14 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) team-Configurability platforms, toolchains, cquery, select(), config transitions type: bug

Comments

@fmeum
Copy link
Collaborator

fmeum commented Nov 6, 2021

Description of the problem / feature request:

Since 711c44e, transitions that lead to configs with differing affected by starlark transition value but otherwise identical configs lead to action conflicts.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

  1. Check out https://github.com/bazelbuild/rules_go/tree/v0.29.0.
  2. Run bazel build //... with Bazel at 2255ce4, the parent of the breaking commit.
  3. Observe that the build passes.
  4. Run bazel build //... with Bazel at 711c44e.
  5. Observe that there are action conflicts such as:
ERROR: file 'external/com_google_protobuf/_objs/protoc_lib/java_primitive_field.o' is generated by these conflicting actions:
Label: @com_google_protobuf//:protoc_lib
RuleClass: cc_library rule
Configuration: 00a60040d64dca849a5217221376411c93467de0d0a93ae55d94f0b96a6d631c, 4fe5f6ed614cebd8a30e459cb1e450bd49548bab2576c90c2155c97782853372
Mnemonic: CppCompile
Action key: 5d046af47458a54c0933bf1911899f17c727e1eec656b0b9728c7f3fbc5b0a22
Progress message: Compiling src/google/protobuf/compiler/java/java_primitive_field.cc
PrimaryInput: File:[/home/fhenneke/.cache/bazel/_bazel_fhenneke/da2ee2360ba14ac857bdb47dfbba9b7d/external/com_google_protobuf[source]]src/google/protobuf/compiler/java/java_primitive_field.cc
PrimaryOutput: File:[[<execution_root>]bazel-out/k8-opt-exec-2B5CBBC6-ST-10e8eac0bed8/bin]external/com_google_protobuf/_objs/protoc_lib/java_primitive_field.o
Owner information: ConfiguredTargetKey{label=@com_google_protobuf//:protoc_lib, config=BuildConfigurationKey[00a60040d64dca849a5217221376411c93467de0d0a93ae55d94f0b96a6d631c]}, ConfiguredTargetKey{label=@com_google_protobuf//:protoc_lib, config=BuildConfigurationKey[4fe5f6ed614cebd8a30e459cb1e450bd49548bab2576c90c2155c97782853372]}
MandatoryInputs: are equal
Outputs: are equal

The config diff:

$ bazel config 00a60040d64dca849a5217221376411c93467de0d0a93ae55d94f0b96a6d631c 4fe5f6ed614cebd8a30e459cb1e450bd49548bab2576c90c2155c97782853372
INFO: Invocation ID: eb06f0c4-ddc7-4bc6-8cc6-e8fd2e7e0e4e
INFO: Displaying diff between configs 00a60040d64dca849a5217221376411c93467de0d0a93ae55d94f0b96a6d631c and 4fe5f6ed614cebd8a30e459cb1e450bd49548bab2576c90c2155c97782853372
Displaying diff between configs 00a60040d64dca849a5217221376411c93467de0d0a93ae55d94f0b96a6d631c and 4fe5f6ed614cebd8a30e459cb1e450bd49548bab2576c90c2155c97782853372
FragmentOptions com.google.devtools.build.lib.analysis.config.CoreOptions {
  affected by starlark transition: [//go/private:bootstrap_nogo], [//go/config:pure, //go/private:bootstrap_nogo]
}

What operating system are you running Bazel on?

Linux

What's the output of bazel info release?

development version

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

From 2255ce4 resp. 711c44e

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

https://github.com/bazelbuild/bazel
fad4933ea3c7cd757f32b0760ac268788b3b1d0a
711c44ea79db14406972cc2204a43c589a2debce

Have you found anything relevant by searching the web?

Related to the discussion at #14023.

Any other information, logs, or outputs that you want to share?

@sdtwigg @gregestren

@gregestren gregestren added team-Configurability platforms, toolchains, cquery, select(), config transitions type: bug P2 We'll consider working on this in future. (Assignee optional) labels Nov 8, 2021
@gregestren
Copy link
Contributor

Huh. I hope this is something small. I was thinking 711c44e would be a pure improvement.

@gregestren gregestren added P1 I'll work on this now. (Assignee required) and removed P2 We'll consider working on this in future. (Assignee optional) labels Nov 8, 2021
@sdtwigg
Copy link
Contributor

sdtwigg commented Nov 8, 2021

This is strange, the ST-hashes should be different if affected-by-Starlark-transition is different

@sdtwigg
Copy link
Contributor

sdtwigg commented Nov 8, 2021

I am working on this but out of curiosity, does anyone who encountered this have a smallest possible test case? The other example I have been debugging with involve cc_proto_library computation, which is also somewhat large.

@sdtwigg
Copy link
Contributor

sdtwigg commented Nov 8, 2021

New theory: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/analysis/starlark/FunctionTransitionUtil.java#L443 this null check is bad. (Short story: just because the Starlark value is at its default does not mean it should be excluded from ST hash addition; especially since the commandline may have explicitly set the Starlark value.)

I have a change that adjusts the hashing logic to do something more correct and this is seemingly fixing the internal test case I have. Will try to export something for you to test against your cases as well.

I do not have an ironclad explanation as to why this bug only surfaced now. My best guess is that the previous bug with transitionDirectoryNameFragment not being updated properly was actually masking this.

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 8, 2021

Good catch! Removing the check makes all action conflicts go away.

I do think that the check is bad for a different reason though: Omitting a variable foo from the hash if it is null still leads to a hash value distinct from those for all other possible values of foo, so the command-line value of foo shouldn't matter here. But the hash also serves the purpose of encoding the set affectedByStarlarkTransition, which it fails to do with the check as keys might be missing, which exactly leads to the action conflicts we were seeing.

@sdtwigg
Copy link
Contributor

sdtwigg commented Nov 8, 2021

Yes, in practice when the value is null I am going to have it print key@null (just in case a user has a String StarlarkOption with non-null value "null" resulting in key=null).

@ulrfa
Copy link
Contributor

ulrfa commented Nov 9, 2021

The intention with the check for null was to:

  • Prevent affecting unrelated parts of the dependency graph.
  • Preserve the existing behavior where only starlark options in toOptions were included: toOptions.getStarlarkOptions().forEach((opt, value) -> toHash.put(opt.toString(), value));

Example:

First building a library, with dependency graph A-B-C, and populating remote caches:

  A
 / \
B   C

Then building an application using the library A-B-C, but also E-F with a local transition for a starlark option unrelated to A-B-C.

              D   
             / \
            E   A
Transition-/   / \
          F   B   C

Without the check for null, I suspect there would be no remote cache hits for A-B-C when building D, due to different ST hash. Right?

Does that make sense? I have only experience of transitions near the top of the build, not deep in the dependency graph.

@gregestren
Copy link
Contributor

@gregestren
Copy link
Contributor

Thanks for the explanation. I'd expect remote caching to continue to work in that case.

The logic is subtle. Look at:

for (String optionName : buildConfigOptions.affectedByStarlarkTransition) {

Object value = toOptions.getStarlarkOptions().get(Label.parseAbsoluteUnchecked(optionName));
if (value != null) {
toHash.put(optionName, value);
}

When evaluating A, buildConfigOptions.affectedByStarlarkTransition contains the cumulative list of Starlark and native flags that were changed by any transition from D down to A. This is documented at:

/**
* This internal option is a *set* of names of options that have been changed by starlark
* transitions at any point in the build at the time of accessing. It contains both native and
* starlark options in label form. e.g. "//command_line_option:cpu" for native options and
* "//myapp:foo" for starlark options. This is used to regenerate {@code
* transitionDirectoryNameFragment} after each starlark transition.
*/
@Option(
name = "affected by starlark transition",

The flag is updated at

updateAffectedByStarlarkTransition(toOptions.get(CoreOptions.class), convertedNewValues);

which only triggers on a transition. There are no transitions from D to A, so in the case of A this would be empty and we'd have no hash at all.

In the case of F it's not empty. So when evaluating F, the loop will trigger and read value from toOptions.getStarlarkOptions(). toOptions.getStarlarkOptions() contains the cumulative <name, value> pairs of Starlark options changed anywhere fro D to F. This is set at

toOptions =
BuildOptions.builder()
.merge(toOptions == null ? fromOptions.clone() : toOptions)
.addStarlarkOptions(changedStarlarkOptions)
.

Here's the part where I confused myself. :/ There are two reasons value could be null. One is that it's not in the toOptions.getStarlarkOptions() map. But shouldn't it always be in the map, given how both toOptions.getStarlarkOptions() and buildConfigOptions.affectedByStarlarkTransition() are computed as described above? The other possibility is that it is in the map but legitimately has a null value - I don't know how it's possible to set a Starlark option to a null value, so I'm fuzzy on that.

So that's a mystery to me.

But to your point, whatever's going on it should all be a pure function of the dependency path from the top-level to the target in question. So since the path from D to A doesn't have a transition, that should have no bearing on the path from D to F. They should each be computed completely independently of each other.

Does that all make sense? Anyone see holes in my thinking?

@gregestren
Copy link
Contributor

Here's the part where I confused myself. :/ There are two reasons value could be null. One is that it's not in the toOptions.getStarlarkOptions() map. But shouldn't it always be in the map, given how both toOptions.getStarlarkOptions() and buildConfigOptions.affectedByStarlarkTransition() are computed as described above?

So that's a mystery to me.

@sdtwigg helpfully pointed out that transitions include a post-validation step that removes flags from toOptions.getStarlarkOptions() if their new values are the same as their flag default values:

cleanedOptions.removeStarlarkOption(rule.getLabel());

As far as I can see no such removal happens for affectedByStarlarkTransition(). Which would indeed make it possible that the loop over buildConfigOptions.affectedByStarlarkTransition() includes flags that don't appear in toOptions.getStarlarkOptions().

@gregestren
Copy link
Contributor

One more comment:

If the configuration at the top-level (command line) has all flags at their defaults, then transition A sets //foo to <not its default>, then transition B sets //foo back to its default, this means the configuration after B is the same as the top-level.

In that case you'd want both configurations to not have an ST-<hash> at all. I can see that the null check prevents adding these settings to the hash. But even if the hash is empty (which the null check guarantees), you'd still get an ST-<hash> after B due to

That's an inefficiency. But that's an existing inefficiency. @sdtwigg 's ongoing work should remove that too.

fmeum pushed a commit to fmeum/bazel that referenced this issue Nov 9, 2021
…lculating ST-hash

Previously, the hash calculation was skipping including StarlarkOptions that happened to be at their default values.

This is wrong since those values may still be in "affected by Starlark transition" (because either the commandline set them and the Starlark transition reset them to their Starlark defaults thus still requiring a hash change OR the commandline did not set them but a series of Starlark transitions did an default->B->default anyways causing the Starlark option to still be 'stuck' in "affected by Starlark transition").

Resolves bazelbuild#14239

PiperOrigin-RevId: 408701552
@ulrfa
Copy link
Contributor

ulrfa commented Nov 10, 2021

Thanks for the explanation!

I currently don't have transitions setting options back to their defaults, but I'm looking forward to @sdtwigg's optimizations in that area.

When evaluating A, buildConfigOptions.affectedByStarlarkTransition contains the cumulative list of Starlark and native flags that were changed by any transition from D down to A.

I got the impression that buildConfigOptions.affectedByStarlarkTransition is global for the whole build and can contain option names from transitions in E-F also when processing unrelated transitions in A-B-C. And that the null check prevented those names from E-F to affect ST hash in A-B-C. That is my concern with removing the null check. But perhaps I misunderstood or I'm missing something?

There are no transitions from D to A, so in the case of A this would be empty and we'd have no hash at all.

True, but let’s adapt the example and imagine there are also other unrelated transitions in A-B-C so that they will have hashes.

@gregestren
Copy link
Contributor

buildConfigOptions.affectedByStarlarkTransition isn't global, so no worries there. The general principle is anything you set on node B propagates down B's subgraph, but has no effect on subgraphs B isn't part of.

@ulrfa
Copy link
Contributor

ulrfa commented Nov 11, 2021

I see, thanks @gregestren!

Wyverald pushed a commit that referenced this issue Nov 11, 2021
* Move transitionDirectoryNameFragment calculation to BuildConfigurationValue

As per discussion in b/203470434, transitionDirectoryNameFragment should completely depend on the current values of the (rest of) the BuildOptions class. Thus, it is far better to have this always computed from BuildOptions when building a BuildConfigurationValue  than rely on users keeping it consistent. (Other results of BuildConfigurationValue itself are themselves wholly computed from BuildOptions so placement there is a natural fit.)

This naturally fixes the exec transition forgetting to update transitionDirectoryNameFragment.

This fixes and subsumes #13464 and #13915, respectively. This is related to #14023

PiperOrigin-RevId: 407913175

* Properly account for StarlarkOptions at their default (=null) when calculating ST-hash

Previously, the hash calculation was skipping including StarlarkOptions that happened to be at their default values.

This is wrong since those values may still be in "affected by Starlark transition" (because either the commandline set them and the Starlark transition reset them to their Starlark defaults thus still requiring a hash change OR the commandline did not set them but a series of Starlark transitions did an default->B->default anyways causing the Starlark option to still be 'stuck' in "affected by Starlark transition").

Resolves #14239

PiperOrigin-RevId: 408701552

Co-authored-by: twigg <twigg@google.com>
fweikert pushed a commit that referenced this issue Nov 18, 2021
…lculating ST-hash

Previously, the hash calculation was skipping including StarlarkOptions that happened to be at their default values.

This is wrong since those values may still be in "affected by Starlark transition" (because either the commandline set them and the Starlark transition reset them to their Starlark defaults thus still requiring a hash change OR the commandline did not set them but a series of Starlark transitions did an default->B->default anyways causing the Starlark option to still be 'stuck' in "affected by Starlark transition").

Resolves #14239

PiperOrigin-RevId: 408701552
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 I'll work on this now. (Assignee required) team-Configurability platforms, toolchains, cquery, select(), config transitions type: bug
Projects
None yet
Development

No branches or pull requests

4 participants