Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Watch arbitrary file in repo rules #21230

Closed
wants to merge 1 commit into from
Closed

Watch arbitrary file in repo rules #21230

wants to merge 1 commit into from

Conversation

Wyverald
Copy link
Member

@Wyverald Wyverald commented Feb 7, 2024

  • Starlark API changes
    • New method rctx.watch() that watches a path. As of right now, said path must exist and be a regular file, but that will change in follow-ups.
    • Existing method rctx.read() gets a new optional boolean parameter watch that defaults to True. This causes the file read to be watched as well by default, even if it's not addressed by a label.
    • Both changes apply to mctx as well.
  • Marker file format changes
    • FILE: marker file entries can now be FILE: followed by either an absolute path (starts with a singular '/'), or a label-like ... thing (which is a label but has a slash in place of a colon). This is because we might be watching something inside a repo that is not under a package.
    • Same change will happen to the accumulatedFileDigests sections of the module lockfile.
  • Code changes
    • RepoRecordedInput.File subclasses are renamed to FileOutsideWorkspace (absolute paths) and FileInsideWorkspace (label-like things) instead to better reflect their meaning.
    • Unified the file watching/digest checking logic in SingleExtensionEvalFunction with the RepoRecordedInput stuff. Follow-ups will unify more.
    • Moved RepoRecordedInput to its own BUILD target as it's being used outside the repo fetching machinery (namely, module extensions)
    • Since FileOutsideWorkspace has to be an absolute path (never relative), we don't need a "base directory" for the RepoRecordedInput parser. What we really need is just the FileSystem object, which we can get from the PathPackageLocator available in Skyframe. Refactored code to reflect that.

Work towards #20952.

@Wyverald Wyverald force-pushed the wyv-watch-file branch 4 times, most recently from b4272a0 to 5f88d98 Compare February 8, 2024 00:18
@Wyverald Wyverald marked this pull request as ready for review February 8, 2024 00:28
@github-actions github-actions bot added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. awaiting-review PR is awaiting review from an assigned reviewer labels Feb 8, 2024
@Wyverald Wyverald changed the title [DRAFT] Watch arbitrary file in repo rules Watch arbitrary file in repo rules Feb 8, 2024
@Wyverald Wyverald requested review from fmeum and removed request for ted-xie and ahumesky February 8, 2024 00:28
@lberki
Copy link
Contributor

lberki commented Feb 8, 2024

cc @ismell

It's a long change so before poring over it, I have some high-level questions:

  • Have you run some benchmarks? Before committing to this approach, it would be nice to know how well it scales. IIRC the target is 4K directories and 10K files in a repository, which may or may not need parallelization. AFAIU this approach pretty much excludes parallel readdir() and stat() calls, so before we commit to this, I'd rather have some benchmark numbers to make sure it's viable. If not, I guess adding rctx.watch_recursive() is a viable approach?
  • Why not make FILE: a distinguished union, i.e. instead of looking at whether it's a label, a path, or a "label-like" thing, split it into LABEL:, ABSOLUTE_FILE, etc. That seems to make the code simpler.
  • IIRC when creating a FileValue for symlnks that lead to outside of the repositories known to Bazel, we use a RootedPath with / as root (see FileFunction.toRootedPath()). It seems like this approach would simplify FileOutsideWorkspace and be consistent with FileFunction.

@Wyverald
Copy link
Member Author

Wyverald commented Feb 8, 2024

Have you run some benchmarks?

No I haven't -- and I'm not sure what to benchmark against, what corpus to use, etc. If you have any pointers, please let me know!

Before committing to this approach, it would be nice to know how well it scales. IIRC the target is 4K directories and 10K files in a repository, which may or may not need parallelization. AFAIU this approach pretty much excludes parallel readdir() and stat() calls, so before we commit to this, I'd rather have some benchmark numbers to make sure it's viable. If not, I guess adding rctx.watch_recursive() is a viable approach?

Yeah this is just the first in a series of PRs. I plan to add watching to path.readdir(), and add an explicit rctx.watch_dir(recursive: bool) as well.

Why not make FILE: a distinguished union, i.e. instead of looking at whether it's a label, a path, or a "label-like" thing, split it into LABEL:, ABSOLUTE_FILE, etc. That seems to make the code simpler.

Hmm. It might do, it might not. The reason is not really important but for eg. the module extension part we'd need multiple dicts instead of just the one.

IIRC when creating a FileValue for symlnks that lead to outside of the repositories known to Bazel, we use a RootedPath with / as root (see FileFunction.toRootedPath()). It seems like this approach would simplify FileOutsideWorkspace and be consistent with FileFunction.

FileOutsideWorkspace already does use a RootedPath with the filesystem as the root. Or did I misunderstand you?

@fmeum
Copy link
Collaborator

fmeum commented Feb 8, 2024

An interesting benchmark that doubles as a pretty useful feature could be to add watching of built-in include directories to the auto-configured C++ toolchain (to remove the need to use --configure).

Copy link
Contributor

@lberki lberki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My brain seems to be on leave but I tried my best.

@@ -444,6 +445,20 @@ public Location read(JsonReader jsonReader) throws IOException {
}
}

private static final TypeAdapter<RepoRecordedInput.File> REPO_RECORDED_INPUT_FILE_TYPE_ADAPTER =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same vague dissatisfaction with this as with time stamps in lockfiles: this encodes non-hermetic information in lockfiles, which are supposed to be checked in. Although in this case, one could say that that could be avoid by carefully not depending on absolute paths...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have two mechanisms for non-hermetic operations, repository rules and module extensions, we should think about where we want to draw the boundary between them. If module extensions are meant to be optimized for "reproducibility in the face of non-hermeticity", i.e., locking down the result of potentially non-hermetic operations in a way that is portable across machines via the lockfile, then we may want to purposefully limit their capabilities.

How about adding the watch capability to repo rules only for now? We can always gather some data from user feedback before we extend it to module extensions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good points, and I'm not totally sure what's the best way forward here. It's awkward to add watch only to repo rules because rctx.read(watch=True) is in StarlarkBaseExternalContext which is shared by both contexts. We could alternatively just filter out any "absolute file" watches when we write the lockfile. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very knowledgeable about the philosophical difference between module extensions and repositories, but at least as far as the implementation goes, it'd be pretty easy to add an "access to absolute paths is allowed" flag to StarlarkBaseExternalContext, which would be set to "true" for repositories and "false" for module extensions.

Between filtering out absolute file watches and erroring out when they are attempted, I'd rather we do the latter because if we ever decide that this is a useful thing, it's much more easy to turn an error into a non-error than to subtly change the behavior of code that Bazel already accepts.

Copy link
Member Author

@Wyverald Wyverald Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the situation is a bit more nuanced -- I'm adding a boolean parameter watch to rctx.read which defaults to True. That is, any file read by rctx.read will be watched, be it a label-addressed file or an absolute path file.

Now we can't really just make it an error to attempt to watch absolute path files in module extensions, because it is legal to read an absolute path file in module extensions today. And yes, we could make watch default to True for repo rules and False for module extensions, but I'm not sure that's better (as we would by default not watch even files inside the workspace).

Copy link
Member Author

@Wyverald Wyverald Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my newest idea: make rctx.read(watch) a tristate: true - always watch (same as today's rctx.read(...); rctx.watch(...);); false - never watch; auto - watch if appropriate ("appropriate" = it's a file inside the workspace OR we're in a repo rule context). Defaults to 'auto'. WDYT? (EDIT: implemented this, see code for concept.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about watch() defaulting to false everywhere, at least for the time being? That would mean that there is no behavior change and it would also resolve the module extension / repository conundrum.

re: your tristate approach, I'm neutral; I think it's a bit of DWIM, but only a bit so I could live with it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to enable watch by default because the only change in behavior is that we refetch a bit more often, which is really an improvement in correctness. It also reduces surprises -- people have been asking why label-addressed files are watched but others aren't.

@ismell
Copy link

ismell commented Feb 9, 2024

Have you run some benchmarks?

No I haven't -- and I'm not sure what to benchmark against, what corpus to use, etc. If you have any pointers, please let me know!

I'm hoping to use the watch_dir(recursive=True) API to eliminate the portage_digest repository rule. I envision the portage repository rule invoking the generate-repo command and returning a list of watch directories that bazel should watch. If any files are either modified, added or deleted in one of those paths the repository rule will need to re-run.

So if you wanted a test, you can checkout the chromiumos repository. I would recommend cloning the internal manifest. You can then try "watching" the following directories:

  • src/private-overlays
  • src/overlays
  • src/third_party/{portage-stable,toolchains-overlay,chromiumos-overlay}

That would be the worst case. I'm hoping we can return the specific directories ("overlays") we looked at when generating the repository so we can narrow the scope.

We should probably ignore the .git repository. So maybe we need an exclude parameter to watch_dir?

@Wyverald Wyverald force-pushed the wyv-watch-file branch 3 times, most recently from 2f16de2 to 3cc0dcc Compare February 13, 2024 03:45
FileValue fileValue =
(FileValue) env.getValueOrThrow(FileValue.key(pair.second), IOException.class);
if (fileValue == null) {
throw new NeedsSkyframeRestartException();
Copy link
Contributor

@lberki lberki Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that if you watch N files previously unseen, it's O(N^2) time and re-doing the rest of the work the repository does N times? You could get around this by just recording the watch request, then requesting all FileValues at the end in one single getValues() call.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only O(N^2) if we actually do restart skyframe; we're now using loom by default so it's actually linear. Given Loom will be the way forward, I don't want to spend more engineering time for the no-Loom case (i.e. the one-single-getValues()-call-at-the-end thing); in fact I want to remove more workarounds for repo restarts (e.g. enforceLabelAttributes)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to say that there is still one little corner case where we can't ignore JVMs that don't support virtual threads, but it turns out that that use case went away a few hours ago, so I'm fine with this compromise.

cc @zhengwei143

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it turns out that that use case went away a few hours ago

just out of curiosity, what is that use case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The builds of the Android kernel. I got word overnight that they finished their upgrade to JVM 21.

@lberki
Copy link
Contributor

lberki commented Feb 13, 2024

The only hard blocker I see is the O(n^2) behavior I mentioned above. I'd prefer watching to default to false so as not to rock the boat in 7.x, but I'd accept a decision to the contrary.

@Wyverald Wyverald force-pushed the wyv-watch-file branch 3 times, most recently from d0bbb74 to b6a4f7c Compare February 13, 2024 22:42
* Starlark API changes
  * New method `rctx.watch()` that watches a path. As of right now, said path must exist and be a regular file, but that will change in follow-ups.
  * Existing method `rctx.read()` gets a new optional string parameter `watch` that defaults to `auto`. This causes the file read to be watched as well by default, even if it's not addressed by a label. See `rctx.read()` docs for the semantics of `watch`.
  * Both changes apply to `mctx` as well, except that no files outside the current Bazel workspace may be watched for `mctx`.
* Marker file format changes
  * `FILE:` marker file entries can now be `FILE:` followed by either an absolute path (starts with a singular '/'), or a label-like ... thing (which is a label but has a slash in place of a colon). This is because we might be watching something inside a repo that is _not_ under a package.
  * Same change will happen to the `accumulatedFileDigests` sections of the module lockfile.
* Code changes
  * `RepoRecordedInput.File` subclasses are renamed to `FileOutsideWorkspace` (absolute paths) and `FileInsideWorkspace` (label-like things) instead to better reflect their meaning.
  * Unified the file watching/digest checking logic in SingleExtensionEvalFunction with the RepoRecordedInput stuff. Follow-ups will unify more.
  * Moved RepoRecordedInput to its own BUILD target as it's being used outside the repo fetching machinery (namely, module extensions)
  * Since `FileOutsideWorkspace` has to be an absolute path (never relative), we don't need a "base directory" for the RepoRecordedInput parser. What we really need is just the FileSystem object, which we can get from the PathPackageLocator available in Skyframe. Refactored code to reflect that.

RELNOTES: Added a new method, `repository_ctx.watch()`, which asks Bazel to watch for changes to an arbitrary file. When said file changes, Bazel will refetch the repo.

Work towards #20952.
@github-actions github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label Feb 14, 2024
@Wyverald Wyverald deleted the wyv-watch-file branch February 16, 2024 03:48
Wyverald added a commit that referenced this pull request Feb 20, 2024
* Starlark API changes
  * New method `rctx.watch()` that watches a path. As of right now, said path must exist and be a regular file, but that will change in follow-ups.
  * Existing method `rctx.read()` gets a new optional string parameter `watch` that defaults to `auto`. This causes the file read to be watched as well by default, even if it's not addressed by a label. See `rctx.read()` docs for the semantics of `watch`.
  * Both changes apply to `mctx` as well, except that no files outside the current Bazel workspace may be watched for `mctx`.
* Marker file format changes
  * `FILE:` marker file entries can now be `FILE:` followed by either an absolute path (starts with a singular '/'), or a label-like ... thing (which is a label but has a slash in place of a colon). This is because we might be watching something inside a repo that is _not_ under a package.
  * Same change will happen to the `accumulatedFileDigests` sections of the module lockfile.
* Code changes
  * `RepoRecordedInput.File` subclasses are renamed to `FileOutsideWorkspace` (absolute paths) and `FileInsideWorkspace` (label-like things) instead to better reflect their meaning.
  * Unified the file watching/digest checking logic in SingleExtensionEvalFunction with the RepoRecordedInput stuff. Follow-ups will unify more.
  * Moved RepoRecordedInput to its own BUILD target as it's being used outside the repo fetching machinery (namely, module extensions)
  * Since `FileOutsideWorkspace` has to be an absolute path (never relative), we don't need a "base directory" for the RepoRecordedInput parser. What we really need is just the FileSystem object, which we can get from the PathPackageLocator available in Skyframe. Refactored code to reflect that.

Work towards #20952.

Closes #21230.

RELNOTES: Added a new method, `repository_ctx.watch()`, which asks Bazel to watch for changes to an arbitrary file. When said file changes, Bazel will refetch the repo.
PiperOrigin-RevId: 607097422
Change-Id: Ib96a49461deddd7f4c786bd1c227de18b2dd71d7
Wyverald added a commit that referenced this pull request Feb 20, 2024
* Starlark API changes
  * New method `rctx.watch()` that watches a path. As of right now, said path must exist and be a regular file, but that will change in follow-ups.
  * Existing method `rctx.read()` gets a new optional string parameter `watch` that defaults to `auto`. This causes the file read to be watched as well by default, even if it's not addressed by a label. See `rctx.read()` docs for the semantics of `watch`.
  * Both changes apply to `mctx` as well, except that no files outside the current Bazel workspace may be watched for `mctx`.
* Marker file format changes
  * `FILE:` marker file entries can now be `FILE:` followed by either an absolute path (starts with a singular '/'), or a label-like ... thing (which is a label but has a slash in place of a colon). This is because we might be watching something inside a repo that is _not_ under a package.
  * Same change will happen to the `accumulatedFileDigests` sections of the module lockfile.
* Code changes
  * `RepoRecordedInput.File` subclasses are renamed to `FileOutsideWorkspace` (absolute paths) and `FileInsideWorkspace` (label-like things) instead to better reflect their meaning.
  * Unified the file watching/digest checking logic in SingleExtensionEvalFunction with the RepoRecordedInput stuff. Follow-ups will unify more.
  * Moved RepoRecordedInput to its own BUILD target as it's being used outside the repo fetching machinery (namely, module extensions)
  * Since `FileOutsideWorkspace` has to be an absolute path (never relative), we don't need a "base directory" for the RepoRecordedInput parser. What we really need is just the FileSystem object, which we can get from the PathPackageLocator available in Skyframe. Refactored code to reflect that.

Work towards #20952.

Closes #21230.

RELNOTES: Added a new method, `repository_ctx.watch()`, which asks Bazel to watch for changes to an arbitrary file. When said file changes, Bazel will refetch the repo.
PiperOrigin-RevId: 607097422
Change-Id: Ib96a49461deddd7f4c786bd1c227de18b2dd71d7
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants