Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow runfiles to be easily/efficiently passed as action inputs #15486

Open
rickeylev opened this issue May 12, 2022 · 18 comments
Open

Allow runfiles to be easily/efficiently passed as action inputs #15486

rickeylev opened this issue May 12, 2022 · 18 comments
Labels
P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) team-Rules-API API for writing rules/aspects: providers, runfiles, actions, artifacts type: feature request

Comments

@rickeylev
Copy link
Contributor

Description of the feature request:

If you have an action that needs to take a collection of runfiles as input, then it's very hard, error prone, and inefficient to correctly do this.

To correctly do this, the inputs passed to the action must include:

  • runfiles.files.
  • The Files within runfiles.symlinks. Getting these requires flattening the depset during analysis.
  • The Files within runfiles.root_symlinks. Also requires flattening the depset during analysis.
  • The symlink path strings of runfiles.symlinks. These either must be passed as args to the action, or a second file must be created to store their values and then that file must be passed to the action.
  • The symlink path strings of runfiles.root_symlinks. Same problem as above.
  • The path strings of runfiles.empty_filenames. Accessing this attribute triggers execution of the underlying RunfilesSupplier, which is normally delayed until execution time. Also same problem as with the symlink paths.

The above is possible using pure-starlark, but is non-ideal because it has to flatten depsets (extract all the files and pass them as inputs; put the symlink paths and empty_filenames paths into Args objects).

Normally the answer to "i need runfiles passed to an action" is to use tools and pass FilesToRunProvider. However, that isn't always possible because:

  1. FilesToRun can't be constructed by Starlark
  2. Getting a reference to FilesToRun requires a target, i.e. a macro has to wire together multiple targets
  3. (2) isn't always possible. Some examples: (a) if you want to properly implement --stamp behavior, embed build information, and trigger re-generation correctly, a target must pass all its runfiles et al to the action that generates that information; (b) for things like "build a zip file of myself" or some other packaging-esq type of thing.
  4. FilesToRun is more expensive: it means "materialize this set of paths when the action is run", when the minimum necessary is "make sure any changes to the runfiles re-trigger the action execution" and/or "I just need the underlying inputs available and their symlink path-mapping" (e.g. a manifest)

From looking around, I think Java-native rules basically deal with this problem by using SourceManifestAction or middleman files[1]. The both work about the same: create a file and write the runfiles as a manifest, then have that file be an input to another action. All that work gets deferred until execution, time, though, and all analysis phase pays is passing the runfiles and registering an action.

So, a couple half-baked ideas:

  1. Add a runfiles_inputs arg to ctx.actions.run. This would accept a list of runfiles objects. Like inputs, changes to them would re-trigger action execution. Whether they are materialized, and how they are, I don't know.
  2. Expose some SourceManifestAction-equivalent, i.e. a way to correctly and efficiently transform a runfiles object into a file that can be passed to inputs and and Args objects.

(2) Seems better overall as it's more flexible. If you have a file, you can then pass it to inputs or add it to args.

[1] Small aside: middleman files don't seem to play well with Starlark implemented rules. From what I could tell, all Starlark rules call some RunfilesSupport helper logic during their Java-phase, which registers various actions and outputs. So interacting with runfiles support in the middle of the Starlark phase of a rule tends to create an action/output conflict with what happens in the Java phase.

What underlying problem are you trying to solve with this feature?

Making it easy to correctly and efficiently pass runfiles information to actions.

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

google build

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

google build

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

n/a

Have you found anything relevant by searching the web?

#15164 -- "i want to see the runfiles passed to an action"

Any other information, logs, or outputs that you want to share?

No response

@brandjon brandjon added team-Rules-API API for writing rules/aspects: providers, runfiles, actions, artifacts and removed team-Build-Language labels Nov 5, 2022
@comius comius added P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) and removed untriaged labels Sep 14, 2023
@comius
Copy link
Contributor

comius commented Sep 14, 2023

Either P2 or P4. It's a pain point for existing rules, that used this natively. I'm split between making things easier, but also maintaining a new thing that was so far not possible in Starlark and rules implemented from scratch in Starlark didn't need that.

@dgoldstein0
Copy link

I am a little confused on the expected semantics, and also not entirely clear on the possibilities with FilesToRunProvider though #7187 makes it sound like the latter isn't really an option for starlark-written rules anyway. Anyhow I have a use case to throw into the mix here: rules for javascript bundlers (think webpack, rollup, esbuild, rspack, turbopack, etc.). JS bundlers are typically a lot more like linkers from other languages - take in lots of files, which could be compiled js (e.g. from typescript, jsx, etc.) or source files, and transform into either a single output or perhaps a directory (when code splitting is used to output multiple files); however unlike e.g. c++ linking, the filesystem layout of the inputs matters a lot. One of the big challenges of this use case today is that JS bundlers typically want to believe that all inputs - generated or not - are part of the same source tree - or else relative imports in them tend to not resolve correctly. Runfiles seem to be the right abstraction for this - they let us hide the difference between the source and generated files - making the File.short_path visible to the bundler rather than the full File.path.

The problem: the rules I've written for js bundlers currently are constructing a binary of the bundler plus all the inputs, and then executing that. which is inefficient - every input change results in rebuilding this binary before doing the actual build - and also makes bazel think that all of our input code is actually part of the build tool (so bzl query --notool_deps gives the wrong answers). If there were some way to pass runfiles objects as inputs to the action, I suspect these rules would be more efficient? But at the very least, it'd let bazel query understand these are inputs and not tool dependencies. And it's probably a blocker for having a useful implementation of persistent workers for these tools. An alternative is copying all the source input files, but copying is noticeably slower than constructing symlinks.

@rickeylev
Copy link
Contributor Author

This topic came up again and I came up with a more concrete API proposal:

runfiles_input = ctx.runfiles_to_action_input(runfiles)
ctx.actions.run(inputs = [runfiles_input], ...)

The runfiles_to_action_input method converts the runfiles to some object that represents all the underlying File objects of the passed-in runfiles. Perhaps a special proxy object, or a custom File implementation based upon e.g. Middleman. Whatever it is, it efficiently ensures the underlying File objects within runfiles are made available as inputs to the action.

I think this is about the most minimal API necessary.

To do something useful, e.g. put all runfiles into a zip, you have to create a manifest file manually (there are efficient user space ways to do this using map_each).

A couple other ideas that just occurred to me:

manifest, runfile_input = ctx.blabla(runfiles) -- generate a manifest for you

Or, model it as a TreeArtifact. Something like:

runfiles_root = ctx.declare_directory("foo_runfiles_root")
ctx.runfiles_to_directory(runfiles, output=runfiles_root)
ctx.actions.run(inputs=[runfiles_root], arguments=[runfiles_root.path])

@dgoldstein0
Copy link

I like the tree artifact idea, as that answers the "how do we translate the runfiles to an equivalent directory structure on disk". Though I wonder how performant that'll be, it'll definitely solve the --notool_deps problem with the "build and then execute a binary" strategy that makes the inputs look like part of the build tools.

That said it's also plausible that we have two separate use cases here - possibly not all actions that want to consume runfiles need the runfiles organization of the files materialized in the filesystem? So we may want two different apis, rather than one.

For the case of not materializing the filesystem directly, one option two new methods:

  1. A new function runfiles.get_all_files() that would return all File objects inside the runfiles regardless of which of the fields of runfiles (files, symlinks, root_symlinks, or empty_files) each file comes from. (I suppose this isn't strictly necessary, but it's a pretty substantial simplification)
  2. Args.add_runfiles() could give more flexibility in how runfiles are passed.

E.g. something like

def _runfile_to_arg(runfilePath, file):
  return "--path={}:{}".format(runfilePath, file.path)
...
args = ctx.actions.args()
args.add_runfiles(runfiles, format_each=_runfile_to_arg)
ctx.actions.run(inputs=[runfiles.get_all_files()], args=[args])

I think this most clearly competes to your manifest idea. Here you can put the manifest information into your Args object, whereas there, the manifest would presumably have a predetermined format and just be another file fed to the action.

@rickeylev
Copy link
Contributor Author

Yeah, given that runfiles are, essentially, representing a directory, using directory artifacts for them feels right.

re: runfiles.get_all_files: Yeah, this would simplify. The trouble is that, under the hood, symlinks/root_symlinks are depsets of SymlinkEntry objects (basically a (path, File) pair). So to get just the File objects the depsets have to be flattened. Or flattening has to be deferred, somehow (which puts us back to "special file to represent runfiles")

re: runfiles.add_runfiles: Oh interesting idea. You can already use (add_all + map_each) to convert a runfiles object to e.g. --path=... args. But perhaps a separate function would be more efficient?

@rickeylev
Copy link
Contributor Author

Thought of an alternative to/variation of runfiles.get_all_files. So under the hood, the problem is there's a depset of SymlinkEntry objects, which aren't File objects, but are wrappers around File objects.

So just have an adapter that unwraps them. e.g. runfiles.get_all_files() returns:

class SymlinkArtifactNestedSet implements NestedSet<Artifact> {
   constructor(NestedSet<SymlinkEntry> symlinks) { this.symlinks = symlinks }
   Artifact getNext() {
     symlinks.getNext().getArtifact();
   }
}

I don't know the exact NestedSet API, but I hope the gist is clear.

@dgoldstein0
Copy link

Huh, those SymlinkEntry don't seem to be exposed to starlark yet, at least not that I could find. Are there plans for that?

github-merge-queue bot pushed a commit to bazelbuild/rules_rust that referenced this issue Oct 3, 2024
…2891)

After #2887 I started
finding similar linker errors in environments that supported runfiles
and symlinks but have yet to see this issue in environments without
them. I believe my understanding of how the
[DefaultInfo.default_runfiles](https://bazel.build/rules/lib/providers/DefaultInfo#default_runfiles)
objects worked and in the previous change had attempted to rely on it
for a runfiles directory. However, there is no guarantee a directory
will be rendered and no way to force it to be rendered for actions
either (bazelbuild/bazel#15486). The
consequence is that creating a custom runfiles directory is no longer a
contingency and explicitly creating it is the correct thing to do to
ensure `CARGO_MANIFEST_DIR` linker inputs are always available to the
consuming actions.

With the propagation of `CARGO_MANIFEST_DIR` (now an action output) into
`Rustc` actions, a new flag is also introduced to prune unnecessary
files out of the directory to reduce bloat and potential invalidation in
downstream actions,
`--@rules_rust//cargo/settings:cargo_manifest_dir_retain_list`. This
flag accepts a list of path suffixes used to determine what (if
anything) in `CARGO_MANIFEST_DIR` should be retained and passed to
downstream actions.

Note that the failure mode this addresses is hard to observe as it
requires a crate which defines a linker path to something in
`CARGO_MANIFEST_DIR` and then for a clean build to be done with a change
to a dependent of said crate. If the runfiles directory for the build
script is never rendered then there will be no file to link into the
crate.
github-merge-queue bot pushed a commit to bazelbuild/rules_rust that referenced this issue Oct 3, 2024
…2891)

After #2887 I started
finding similar linker errors in environments that supported runfiles
and symlinks but have yet to see this issue in environments without
them. I believe my understanding of how the
[DefaultInfo.default_runfiles](https://bazel.build/rules/lib/providers/DefaultInfo#default_runfiles)
objects worked and in the previous change had attempted to rely on it
for a runfiles directory. However, there is no guarantee a directory
will be rendered and no way to force it to be rendered for actions
either (bazelbuild/bazel#15486). The
consequence is that creating a custom runfiles directory is no longer a
contingency and explicitly creating it is the correct thing to do to
ensure `CARGO_MANIFEST_DIR` linker inputs are always available to the
consuming actions.

With the propagation of `CARGO_MANIFEST_DIR` (now an action output) into
`Rustc` actions, a new flag is also introduced to prune unnecessary
files out of the directory to reduce bloat and potential invalidation in
downstream actions,
`--@rules_rust//cargo/settings:cargo_manifest_dir_filename_suffixes_to_retain`.
This flag accepts a list of path suffixes used to determine what (if
anything) in `CARGO_MANIFEST_DIR` should be retained and passed to
downstream actions.

Note that the failure mode this addresses is hard to observe as it
requires a crate which defines a linker path to something in
`CARGO_MANIFEST_DIR` and then for a clean build to be done with a change
to a dependent of said crate. If the runfiles directory for the build
script is never rendered then there will be no file to link into the
crate.

---------

Co-authored-by: Daniel Wagner-Hall <dawagner@gmail.com>
github-merge-queue bot pushed a commit to bazelbuild/rules_rust that referenced this issue Oct 3, 2024
…2891)

After #2887 I started
finding similar linker errors in environments that supported runfiles
and symlinks but have yet to see this issue in environments without
them. I believe my understanding of how the
[DefaultInfo.default_runfiles](https://bazel.build/rules/lib/providers/DefaultInfo#default_runfiles)
objects worked and in the previous change had attempted to rely on it
for a runfiles directory. However, there is no guarantee a directory
will be rendered and no way to force it to be rendered for actions
either (bazelbuild/bazel#15486). The
consequence is that creating a custom runfiles directory is no longer a
contingency and explicitly creating it is the correct thing to do to
ensure `CARGO_MANIFEST_DIR` linker inputs are always available to the
consuming actions.

With the propagation of `CARGO_MANIFEST_DIR` (now an action output) into
`Rustc` actions, a new flag is also introduced to prune unnecessary
files out of the directory to reduce bloat and potential invalidation in
downstream actions,
`--@rules_rust//cargo/settings:cargo_manifest_dir_filename_suffixes_to_retain`.
This flag accepts a list of path suffixes used to determine what (if
anything) in `CARGO_MANIFEST_DIR` should be retained and passed to
downstream actions.

Note that the failure mode this addresses is hard to observe as it
requires a crate which defines a linker path to something in
`CARGO_MANIFEST_DIR` and then for a clean build to be done with a change
to a dependent of said crate. If the runfiles directory for the build
script is never rendered then there will be no file to link into the
crate.

---------

Co-authored-by: Daniel Wagner-Hall <dawagner@gmail.com>
@rickeylev
Copy link
Contributor Author

cc @lberki A function to convert a runfiles object to a file that can then be passed as inputs to an action would be very useful for this issue.

@lberki
Copy link
Contributor

lberki commented Oct 12, 2024

With 78971a9, creating a Starlark function that takes a Runfiles object and returns a File that is a runfiles tree containing those runfiles is very possible. In fact, I have a draft change that does that in support of some intricate Google-specific Python functionality and it works well.

@comius @c-mita WDYT about making that an official part of the API? (I'm asking Charles because I think he'd have some use for this in deleting some Google-specific warts and in Starlarkifying Google-internal sh_* rules)

I'm thinking of something like

ctx.actions.create_runfiles_tree(path, sibling, runfiles)

or splitting it up into the usual declare_* / create the action parts that we do for other kinds of files (it's superfluous here because there isn't any other way to create a runfiles tree other than from a Runfiles object but then it'd be symmetric)

The complication is that these artifacts would be somewhat special, for example, building them wouldn't cause them to show up in the local file system, only on the inputs of the actions that depend on them (long story). That's eventually fixable, but it'd definitely be a wart.

@c-mita
Copy link
Member

c-mita commented Oct 22, 2024

I'd like it much more if it was just like a tree artifact. :D

But I can imagine a declare_runfiles_tree and create_runfiles_tree pair, with the hope that one day the former becomes synonymous with declare_directory.

I'd prefer experimenting a bit with it behind the "internal API allowlisting" we've done for other things before making it "official" though; but I do have a use case or two in mind.

@lberki
Copy link
Contributor

lberki commented Oct 22, 2024

It can't be a tree artifact because runfiles trees are materially different: tree artifacts have all the files within them and runfiles trees are only symlinks to other files that need to be "dragged along" into the inputs of the action. It's not inconceivable that they'll one day be unified, but I'm afraid I can't commit to that.

re: API, fine, you get a declare_* / create_* pair, provided that you pinkie promise that you'll use it them. Deal?

@comius
Copy link
Contributor

comius commented Oct 22, 2024

Make it an allowlisted api. I think that runfiles need and overhaul, however I don’t have yet a good overview of the area.

There were some nice suggestions @kmoffett shared with me over a chat. I’m not sure if they are materialized anywhere, but the essence of the idea is to split DefaultInfo into FilesInfo and RunfilesInfo. Separation should cause less confusion for the users. And it could allow more natural handling of runfiles within a rule.

@lberki
Copy link
Contributor

lberki commented Oct 22, 2024

I won't take sides on what one should do with DefaultInfo (it's mostly orthogonal to whether the pertinent actions should be createable from Starlark), but what I'm hearing from both @c-mita and @comius is that an API to create abitrary symlink trees from Starlark would be useful.

/me rolls up his sleeves...

@c-mita
Copy link
Member

c-mita commented Oct 22, 2024

re: API, fine, you get a declare_* / create_* pair, provided that you pinkie promise that you'll use it them. Deal?

I missed that the comment about redundancy. Would the declare really be a no-op?

an API to create abitrary symlink trees from Starlark would be useful

Insert comment that rhymes with "let me kill SpecialArtifact.FILESET"

@lberki
Copy link
Contributor

lberki commented Oct 22, 2024

Do I understand correctly that arbitrary symlink trees are enough to kill SpecialArtifact.FILESET i.e. you don't need the tree artifact / runfiles tree hybrid functionality that you mentioned above?

@c-mita
Copy link
Member

c-mita commented Oct 28, 2024

Do I understand correctly that arbitrary symlink trees are enough to kill SpecialArtifact.FILESET i.e. you don't need the tree artifact / runfiles tree hybrid functionality that you mentioned above?

How SpecialArtifact.FILESET affects things versus SpecialArtifact.TREE and SpecialArtifact.RUNFILES is a question I cannot answer. But I do feel that "FILESET" should not be a special artifact type in Bazel.

For my goals, I need a way of producing a symlink tree as an "artifact" that can be transparently returned by a rule or passed to other actions. In my mind this looks as much like a tree artifact as possible (notably, "runfiles" do not look like a directory, but "fileset" artifacts do https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/actions/Artifact.java;l=682).

Maybe this is just a naming issue and it should be SpecialArtifact.SYMLINK_TREE? But then why is runfiles different?

@lberki
Copy link
Contributor

lberki commented Oct 28, 2024

My best educated guess is that SpecialArtifactType.FILESET and SpecialArtifactType.RUNFILES are mostly different now due to their history: there were implementations of two different legacy features inherited from the predecessor of Blaze (the maiden name of Bazel).

The difference used to be meaningful when Filesets were much more complicated than today, but with all your cleanups, I think they are semantically pretty close to each other now, except that SpecialArtifactType.RUNFILES are still special-cased in our source code (look at the call sites for ActionAnalysisMetadata.getActionType() to see where), but that's mostly for historical reasons and I'm busy eliminating that distinction.

Let's do this: I drive the cleanup to the end that removes all the special-casing of SpecialArtifactType.RUNFILES. It should take like two more changes, plus some waiting to see if the first one I'm submitting today causes any trouble. Once that is done, you can implement the Starlark interface to turn a Runfiles object into an artifact that is the materialized version of that runfiles tree. Then that can in turn be hopefully used to implement Filesets.

Deal?

@lberki
Copy link
Contributor

lberki commented Oct 28, 2024

Ideally, one would not make the distinction between tree artifacts and symlink trees either and instead be able to handle both symlinks and files in TreeArtifacts, but it's probably not the best decision to spend our energies on that right now because the unification with runfiles trees is enough for Filesets.

(I also have already done enough cleanups recently to not have the appetite for much more, but that's my personal issue)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) team-Rules-API API for writing rules/aspects: providers, runfiles, actions, artifacts type: feature request
Projects
None yet
Development

No branches or pull requests

7 participants