
Port to tokio 0.2, and to stdlib futures for fs and task_executor #9071

Merged · 7 commits · Mar 23, 2020

Conversation

@stuhood (Member) commented Feb 5, 2020

Problem

We're on an older version of tokio, which doesn't smoothly support usage of async/await.

Solution

Switch to tokio 0.2, which supports directly spawning and awaiting (via its macros) stdlib futures, which is an important step toward being able to utilize async/await more broadly. Additionally, port the fs and task_executor crates to stdlib futures.

Finally, transitively fixup for the new APIs: in particular, since both task_executor and tokio now consume stdlib futures to spawn tasks, we switch all relevant tests and main methods to use the tokio::main and tokio::test macros, which annotate async methods and spawn a runtime to allow for awaiting futures inline.

Result

Progress toward more usage of async/await!

@stuhood force-pushed the stuhood/fs-async-await branch 2 times, most recently from d86f592 to 5c53e80 on February 5, 2020 06:47
@illicitonion (Contributor) left a comment

Very exciting!!


scheduler.core.executor.spawn_on_io_pool(work_future).wait()
})
.wait()
Contributor:

Why is this changing from blocking to not? Is it because the blocking-ness is pushed down onto the actual-blocking parts? IIRC last time I tried to do that, there was a massive perf regression, which is why this is how it was... 🤞

@stuhood (Member, Author):

This callsite was redundant: wait() on a 0.1 future is itself a blocking method that uses the current thread as the executor for the future. It's gone in stdlib futures in favor of explicit calls to ${MyRuntime}.block_on(..).

@stuhood (Member, Author):

Oooor I haven't run tests that have covered this yet, and it will complain about one of the nested spawn calls not occurring in the context of a Runtime! We shall see, heh. The patch currently runs ./pants list successfully, but some of the rust tests haven't been ported, and there are likely to be a few "no Runtime" or "too many Runtimes" type errors at various callsites.

}))
self
.executor
.spawn_blocking(move || vfs.scandir_sync(&dir_relative_to_root))
Contributor:

It will be interesting to see if this regresses perf in the way that when I tried to use the blocking stuff last time it did... 🤞

@stuhood (Member, Author):

spawn_blocking is a different API, afaict. It's basically: "please literally dedicate me a thread". So it should be a 1:1 replacement for the CpuPool we used to have here.

The tokio-fs blocking APIs are definitely still a thing, but having removed one moving piece is nice.

format!("Failed to read file {:?}: {}", path_abs, e),
)
let executor = self.executor.clone();
Box::pin(async move {
Contributor:

I guess if it's how it is, it's how it is, but I'm kind of surprised at the amount of boilerplate to run a closure asynchronously - Box::pin(async move { spawn_blocking( move || { seems like more boilerplate than I'd expect :(

@stuhood (Member, Author) commented Feb 5, 2020:

This is the boilerplate required to "box" an async block (note: not a closure). The .boxed() combinator was reintroduced, so I think that this could maybe also be written:

```rust
let f = async move { .. };
f.boxed()
```

...but chaining a combinator off a block requires a local variable... etc.

See the note in ASYNC.md (and all of the callsites in the patch that reference it, this one included): the explicit use of BoxFuture is a stopgap to smooth the futures 0.1/stdlib futures boundary. An async fn can take references without cloning them, but that requires the references to stay alive on the caller's stack, which is not the case with 0.1 futures. Once callers have been ported, we should be able to remove almost all explicit usage of BoxFuture.

@@ -788,25 +808,35 @@ impl PosixFS {
}
}

- pub fn stat_sync(&self, relative_path: PathBuf) -> Result<Stat, io::Error> {
+ pub fn stat_sync(&self, relative_path: PathBuf) -> Result<Option<Stat>, io::Error> {
Contributor:

This seems like a surprising API change - what about asyncifying things makes this easier to couple in than to do separately?

@stuhood (Member, Author) commented Feb 5, 2020:

Very thorough review! Heh.

I think that before I committed to async-await-ifying the calling methods, this simplified the combinators. Could undo it. But it also avoided leaking implementation details (avoiding having callers interpret errors where possible is good, imo).

src/rust/engine/task_executor/src/lib.rs (outdated; resolved)
@stuhood stuhood marked this pull request as ready for review February 11, 2020 07:01
@pierrechevalier83 (Contributor) left a comment

I love seeing these changes moving us ever closer to a desirable future 😄
I'd like an answer to my comment in collect_from before approving in case it's unintentional.

}

///
/// Recursively expands PathGlobs into PathStats while applying excludes.
///
fn expand(&self, path_globs: PathGlobs) -> BoxFuture<Vec<PathStat>, E> {
GlobMatchingImplementation::expand(self, path_globs)
/// TODO: See the note on references in ASYNC.md.
Contributor:

Maybe add a summary to let the reader decide whether it's worth checking out ASYNC.md or not.
Something like this:

TODO: Turn into an async fn once all call sites are ready. See note on references in ASYNC.md for more details.

Also applies to other places in the commit.


self
Contributor:

The diff is a bit hard to read, but it's nice to see that async/await is already used to make the code incrementally a tiny bit more readable than prior to the change (breaking massive chains into multiple statements with `.await?` and dropping some levels of nesting). It's a taste of more refactorings that will be possible with this. 😄 Same comment for a few diffs in the rest of the commit.

) -> futures::future::BoxFuture<Result<ChildResults, String>> {
let mut stdout = BytesMut::with_capacity(8192);
let mut stderr = BytesMut::with_capacity(8192);
let mut exit_code = 1;
Contributor:

Are you changing the default exit code from 0 to 1 here? Is it on purpose? If so, I would probably make it its own commit to explain the intent in the commit message.

@stuhood (Member, Author):

Yes, this changes the default... good eye. While converting this, I noticed that if for whatever reason we exited this loop without ever receiving an exit code from the process, we would consider that to be a success. That felt error prone, but not worth a separate PR.

@pierrechevalier83 (Contributor) commented Feb 13, 2020:

> not worth a separate PR

In general, when doing a large refactoring like this, I would bias for doing unrelated cleanups at least in their own commits (but maybe in their own PRs) so they're more visible to reviewers. If this change causes any unforeseen regression, this is more likely to cause it than the rest of the PR.
Anyway, I'm happy with your reasoning that it's better to fail by default 😄


@pierrechevalier83 (Contributor) left a comment

Looks great to me 😍

stuhood added a commit that referenced this pull request Feb 13, 2020
### Problem

It has been a long time since we've updated some of our dependencies! Additionally, during the course of #9071, I encountered the need to run `cargo update`, but didn't want to land that update at the same time as the change to a new runtime.

### Solution

Run `cargo update`, and fix one resulting library incompatibility.

### Result

Modern(-ish) rust deps: `cargo update` will only update within the ranges specified in `Cargo.toml`.
@stuhood stuhood merged commit 20e9db1 into pantsbuild:master Mar 23, 2020
@stuhood stuhood deleted the stuhood/fs-async-await branch March 23, 2020 02:25
@stuhood stuhood mentioned this pull request Apr 4, 2020
stuhood added a commit that referenced this pull request Apr 5, 2020
### Problem

#9395 occurs within our `grpcio` dependency, which is quite stale. Although #9395 is more likely to be related to changes in our executor (#9071) or to our transitive rust dependencies (#9122), getting on a more recent version of the `grpcio` crate _might_ resolve the issue, or make it easier to report an issue.

### Solution

Bump to `0.5.1` with one patch (tikv/grpc-rs#457) pulled forward from our previous fork. 

[ci skip-jvm-tests]  # No JVM changes made.
hrfuller pushed a commit to twitter/pants that referenced this pull request May 11, 2020
…ntsbuild#9071)

hrfuller pushed a commit that referenced this pull request May 15, 2020
* Port to tokio 0.2, and to stdlib futures for fs and task_executor (#9071)


* Add notify fs watcher to engine. (#9318)

* Use the notify crate to implement an `InvalidationWatcher` for Graph operations.

* Make watch async, and watcher log to pantsd.log.
Relativize paths returned from notify to the build_root.
Refactor invalidate method to be an associated method on the
InvalidationWatcher.

* Respond to feedback.
* Use spawn on io pool instead of custom future impl
* Write python fs tests
* Relativize paths to invalidate to build root
* invalidate nodes with parent paths.
* Comments

* Add rust tests.
Make some things public so we can use them in tests.
Use canonical path to build root for relativizing changed paths.

* Refactor Python tests.
Return watch errors as core::Failure all the way to user.
Move task executor onto invalidation watcher.
Move test_support trait impl into test_support mod.

* use futures lock on watcher

* Platform specific watching behavior. On Darwin recursively watch the
build root at startup. On Linux watch individual directory roots.

Co-authored-by: Stu Hood <stuhood@gmail.com>

* Ignore notify events for pants_ignore patterns. (#9406)

* Create a git ignorer on the context object. Adjust all call sites which
create a posix fs to pass in an ignorer.

* Ignore fsevents from files that match pants_ignore patterns.

* Always pass is_dir = false to ignorer to avoid stat-ing every path the
event watch thread sees.

* Add a feature gate to disable the engine fs watcher introduced in #9318 (#9416)

* Add a feature gate to disable the engine fs watcher introduced in #9318
by default, to mitigate issues seen in #9415 until a fix is in place.

* Don't rerun uncacheable nodes if they are dirtied while running. (#9452)

* Don't rerun uncacheable nodes if they are dirtied while running.
- Retry dependencies of uncacheable nodes a few times to get a result
  until we are exhausted from trying too many times.
- Bubble uncacheable node retry errors up to the user, telling them that
  things were changing too much.
- Don't propagate dirtiness past uncacheable nodes when invalidating
  from changed roots. Otherwise dirty dependents of uncacheable nodes
  will need to re-run.

* enable the engine fs watcher by default, now that it won't cause issues.
Remove execution option override from tests.

* use reference to self in stop_walk_predicate closure

* invalidate often enough that test doesn't flake

* Add a flag to prevent the FsEventService and watchman from starting (#9487)

* add --watchman-enable flag
* disable watchman when flag is false
* Don't wait for the initial watchman event if we aren't using watchman.
* check invalidation watcher liveness from scheduler service

* Extract a `watch` crate. (#9635)

The `watch` module directly accesses the `engine` crate's `Graph`, which makes it more challenging to test.

Extract a `watch` crate which is used via an `trait Invalidatable` which is implemented for the engine's `Graph`, as well as independently in tests.

[ci skip-jvm-tests]

* Simplify Scheduler::execute and unify Graph retry (#9674)

Both the `Graph` and the `Scheduler` implemented retry for requested Nodes, but the `Scheduler` implementation was pre-async-await and much more complicated.

Unify the retry implementations into `Graph::get` (either roots or uncacheable nodes are retried), and simplify the `Scheduler`'s loop down to:
```
let maybe_display_handle = Self::maybe_display_initialize(&session);
let result = loop {
  if let Ok(res) = receiver.recv_timeout(refresh_interval) {
    break Ok(res);
  } else if let Err(e) = Self::maybe_display_render(&session, &mut tasks) {
    break Err(e);
  }
};
Self::maybe_display_teardown(session, maybe_display_handle);
result
```

A single, more modern retry implementation (thanks @hrfuller!), and a cleaner `Scheduler::execute` loop.

* Move file invalidation handling to rust (#9636)

A few different kinds of file watching span the boundary between the `SchedulerService` and `FSEventService`:
1. pantsd invalidation globs - how `pantsd` detects that its implementing code or config has changed
2. pidfile - watches `pantsd`'s pidfile to ensure that the daemon exits if it loses exclusivity
3. graph invalidation - any files changing in the workspace should invalidate the engine's `Graph`
4. `--loop` - implemented directly in the `SchedulerService`

Because of the semi-cyclic nature of the relationship between the `SchedulerService` and `FSEventService`, it's challenging to understand the interplay of these use cases. And, unsurprisingly, that led to the `notify` crate implementation only satisfying one of them.

The fundamental change in this PR is to add support for two new parameters to engine executions which are implemented by the `Graph`:
* `poll: bool` - When `poll` is enabled, a `product_request` will wait for the requested Nodes to have changed from their last-observed values before returning. When a poll waits, an optional `poll_delay` is applied before it returns to "debounce" polls.
* `timeout: Optional[Duration]` - When a `timeout` is set, a `product_request` will wait up to the given duration for the requested Node(s) to complete (including any time `poll`ing).

These features are then used by:
* `--loop`: uses `poll` (with a `poll_delay`, but without a `timeout`) to immediately re-run a `Goal` when its inputs have changed.
* invalidation globs and pidfile watching: use `poll` (with no `poll_delay`) and `timeout` to block their `SchedulerService` thread and watch for changes to those files.

The `FSEventService` and `SchedulerService` are decoupled, and each now interacts only with the `Scheduler`: `FSEventService` to push `watchman` events to the `Graph`, and the `SchedulerService` to pull invalidation information from the `Graph`.

Because all events now flow through the `Graph`, the `notify` crate has reached feature parity with `watchman`.

In followup changes we can remove the experimental flag, disable `watchman` (and thus the `FSEventService`) by default, and remove the dependency between `--loop` and `pantsd`.

* pin tokio at exactly 0.2.13

* fix lint issues

* fix mypy typing issues

* Move away from the debounced notify watcher #9754

* Remove test that has raced file invalidation ever since the notify backend was added, but which will now fairly consistently lose that race.
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]

* As explained in the comment: we can no longer create duplicate parallel BUILD files and hope that pants does not notice them before we scan the directory again!
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]

Co-authored-by: Stu Hood <stuhood@gmail.com>
Co-authored-by: Stu Hood <stuhood@twitter.com>