Pick notify watcher series #9741
Conversation
…ntsbuild#9071) We're on an older version of tokio, which doesn't smoothly support usage of async/await. Switch to tokio 0.2, which supports directly spawning and awaiting (via its macros) stdlib futures, which is an important step toward being able to utilize async/await more broadly. Additionally, port the `fs` and `task_executor` crates to stdlib futures. Finally, transitively fix up for the new APIs: in particular, since both `task_executor` and `tokio` now consume stdlib futures to spawn tasks, we switch all relevant tests and main methods to use the `tokio::main` and `tokio::test` macros, which annotate async methods and spawn a runtime to allow for `await`ing futures inline. Progress toward more usage of async/await!
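For readers unfamiliar with the tokio 0.2 macros this commit refers to, a minimal sketch (not code from this PR; requires tokio 0.2 with the `macros` feature) of the pattern: `#[tokio::test]` annotates an async test and spawns a runtime so the body can `await` stdlib futures directly.

```
#[cfg(test)]
mod tests {
  #[tokio::test]
  async fn spawn_and_await() {
    // `tokio::spawn` accepts a stdlib future and returns a `JoinHandle`,
    // which is itself a future that can be awaited inline.
    let handle = tokio::spawn(async { 40 + 2 });
    assert_eq!(42, handle.await.unwrap());
  }
}
```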
* Use the notify crate to implement an `InvalidationWatcher` for Graph operations.
* Make watch async, and have the watcher log to pantsd.log. Relativize paths returned from notify to the build_root. Refactor the invalidate method to be an associated method on the `InvalidationWatcher`.
* Respond to feedback.
* Use spawn on the io pool instead of a custom future impl.
* Write python fs tests.
* Relativize paths to invalidate to the build root.
* Invalidate nodes with parent paths.
* Comments.
* Add rust tests. Make some things public so we can use them in tests. Use the canonical path to the build root for relativizing changed paths.
* Refactor Python tests. Return watch errors as `core::Failure` all the way to the user. Move the task executor onto the invalidation watcher. Move the test_support trait impl into a test_support mod.
* Use a futures lock on the watcher.
* Platform-specific watching behavior: on Darwin, recursively watch the build root at startup; on Linux, watch individual directory roots.

Co-authored-by: Stu Hood <stuhood@gmail.com>
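For context, a minimal standalone sketch of the notify 4.x API this series builds on (the path, debounce interval, and event handling are illustrative; the real `InvalidationWatcher` additionally filters events through the ignorer and an executor):

```
use std::sync::mpsc::channel;
use std::time::Duration;

use notify::{watcher, RecursiveMode, Watcher};

fn main() -> notify::Result<()> {
  let (tx, rx) = channel();
  // A debounced watcher: raw fs events are coalesced over the interval.
  let mut w = watcher(tx, Duration::from_millis(50))?;
  // Per the commit: on Darwin the build root is watched recursively at
  // startup, while on Linux individual directory roots are watched.
  w.watch("/path/to/build_root", RecursiveMode::Recursive)?;
  for event in rx {
    // Events carry absolute paths; the watcher relativizes them to the
    // build root before invalidating Graph nodes.
    println!("{:?}", event);
  }
  Ok(())
}
```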
* Create a git ignorer on the context object. Adjust all call sites which create a posix fs to pass in an ignorer.
* Ignore fsevents from files that match pants_ignore patterns.
* Always pass is_dir = false to the ignorer to avoid stat-ing every path the event watch thread sees.
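The `is_dir = false` trade-off can be sketched with the `ignore` crate's gitignore matcher (patterns and paths are illustrative, and it is an assumption that `GitignoreStyleExcludes` wraps exactly this API): matching without knowing whether the path is a directory avoids a stat per event, at the cost of directory-only patterns (those ending in `/`) possibly not matching.

```
use ignore::gitignore::GitignoreBuilder;

fn main() {
  let mut builder = GitignoreBuilder::new("/repo");
  // Illustrative pants_ignore-style patterns.
  builder.add_line(None, "*.pyc").unwrap();
  builder.add_line(None, ".pids/").unwrap();
  let ignorer = builder.build().unwrap();

  // is_dir = false: no stat is needed for the flood of paths the event
  // watch thread sees, but directory-only patterns like ".pids/" may miss.
  let m = ignorer.matched("/repo/foo.pyc", /* is_dir */ false);
  assert!(m.is_ignore());
}
```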
…tsbuild#9318 (pantsbuild#9416)
* Add a feature gate to disable the engine fs watcher introduced in pantsbuild#9318 by default, to mitigate issues seen in pantsbuild#9415 until a fix is in place.
…sbuild#9452)
* Don't rerun uncacheable nodes if they are dirtied while running:
  - Retry dependencies of uncacheable nodes a few times to get a result, until we are exhausted from trying too many times.
  - Bubble uncacheable node retry errors up to the user, telling them things were changing too much.
  - Don't propagate dirtiness past uncacheable nodes when invalidating from changed roots. Otherwise dirty dependents of uncacheable nodes will need to re-run.
* Enable the engine fs watcher by default, now that it won't cause issues. Remove the execution option override from tests.
* Use a reference to self in the stop_walk_predicate closure.
* Invalidate often enough that the test doesn't flake.
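The bounded-retry idea reads roughly like this self-contained model (all names hypothetical, not the engine's actual code): retry a dependency that was dirtied mid-run a fixed number of times, then surface an error instead of looping forever.

```
struct Value(usize);
struct Invalidated; // the node was dirtied while this attempt was running

fn try_get() -> Result<Value, Invalidated> {
  // Stand-in for requesting a dependency of an uncacheable node.
  Err(Invalidated)
}

fn get_with_retries(attempts: usize) -> Result<Value, String> {
  for _ in 0..attempts {
    match try_get() {
      Ok(v) => return Ok(v),
      // Dirtied while running: retry rather than returning a stale result.
      Err(Invalidated) => continue,
    }
  }
  // Exhausted: bubble the error up so the user sees that the graph was
  // changing too much, instead of hanging or silently rerunning forever.
  Err(format!("graph changed too much after {} attempts", attempts))
}
```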
…antsbuild#9487)
* Add a --watchman-enable flag.
* Disable watchman when the flag is false.
* Don't wait for the initial watchman event if we aren't using watchman.
* Check invalidation watcher liveness from the scheduler service.
The `watch` module directly accesses the `engine` crate's `Graph`, which makes it more challenging to test. Extract a `watch` crate which is used via a `trait Invalidatable`, which is implemented for the engine's `Graph` as well as independently in tests. [ci skip-jvm-tests]
Both the `Graph` and the `Scheduler` implemented retry for requested Nodes, but the `Scheduler` implementation was pre-async-await and much more complicated. Unify the retry implementations into `Graph::get` (either roots or uncacheable nodes are retried), and simplify the `Scheduler`'s loop down to:

```
let maybe_display_handle = Self::maybe_display_initialize(&session);
let result = loop {
  if let Ok(res) = receiver.recv_timeout(refresh_interval) {
    break Ok(res);
  } else if let Err(e) = Self::maybe_display_render(&session, &mut tasks) {
    break Err(e);
  }
};
Self::maybe_display_teardown(session, maybe_display_handle);
result
```

A single, more modern retry implementation (thanks @hrfuller!), and a cleaner `Scheduler::execute` loop.
A few different kinds of file watching span the boundary between the `SchedulerService` and `FSEventService`:
1. pantsd invalidation globs - how `pantsd` detects that its implementing code or config has changed
2. pidfile - watches `pantsd`'s pidfile to ensure that the daemon exits if it loses exclusivity
3. graph invalidation - any files changing in the workspace should invalidate the engine's `Graph`
4. `--loop` - implemented directly in the `SchedulerService`

Because of the semi-cyclic nature of the relationship between the `SchedulerService` and `FSEventService`, it's challenging to understand the interplay of these usecases. And, unsurprisingly, that led to the `notify` crate implementation only satisfying one of them.

The fundamental change in this PR is to add support for two new parameters to engine executions, which are implemented by the `Graph`:
* `poll: bool` - When `poll` is enabled, a `product_request` will wait for the requested Nodes to have changed from their last-observed values before returning. When a poll waits, an optional `poll_delay` is applied before it returns to "debounce" polls.
* `timeout: Optional[Duration]` - When a `timeout` is set, a `product_request` will wait up to the given duration for the requested Node(s) to complete (including any time spent `poll`ing).

These features are then used by:
* `--loop`: uses `poll` (with a `poll_delay`, but without a `timeout`) to immediately re-run a `Goal` when its inputs have changed.
* invalidation globs and pidfile watching: use `poll` (with no `poll_delay`) and `timeout` to block their `SchedulerService` thread and watch for changes to those files.

The `FSEventService` and `SchedulerService` are decoupled, and each now interacts only with the `Scheduler`: the `FSEventService` pushes `watchman` events to the `Graph`, and the `SchedulerService` pulls invalidation information from the `Graph`. Because all events now flow through the `Graph`, the `notify` crate has reached feature parity with `watchman`. In followup changes we can remove the experimental flag, disable `watchman` (and thus the `FSEventService`) by default, and remove the dependency between `--loop` and `pantsd`.
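As a mental model of the `poll` semantics (a self-contained sketch, not the engine's actual `Graph::poll`): each Node carries a generation, and a poll with a last-seen generation parks until the generation differs, then applies the optional `poll_delay` debounce.

```
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;
use tokio::time::delay_for;

// A generation counter that is bumped whenever the node's output changes.
struct Node {
  generation: AtomicU64,
}

// Complete only once the generation differs from the caller's last-seen
// value, then debounce with the optional poll_delay before returning.
async fn poll(node: &Node, last_seen: Option<u64>, poll_delay: Option<Duration>) -> u64 {
  loop {
    let current = node.generation.load(Ordering::SeqCst);
    match last_seen {
      Some(last) if last == current => {
        // Unchanged: the real Graph parks on a notification channel here;
        // a short sleep keeps this model small.
        delay_for(Duration::from_millis(10)).await;
      }
      _ => {
        if let Some(delay) = poll_delay {
          delay_for(delay).await; // debounce repeated polls
        }
        return current;
      }
    }
  }
}
```

A `timeout` then composes naturally on the caller's side, e.g. `tokio::time::timeout(duration, poll(&node, Some(last), None)).await`, which is the shape the tests further down in this thread use.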
Review methodology:
I've reviewed the code that is changed in the Daemon and the new `watch` crate. The daemon code looks good, I'm fairly confident in it, but another pair of eyes on the `watch` crate would be great.
Thanks for picking this, good work!
@@ -413,6 +414,46 @@ def _write_named_sockets(self, socket_map):
    for socket_name, socket_info in socket_map.items():
      self.write_named_socket(socket_name, socket_info)

  def _initialize_pid(self):
I'm so so happy for this change, this will bring so much stability :D
Yeah I think these changes are really great.
…ckend was added, but which will now fairly consistently lose that race. # Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests]
…er/pants into hfuller/pick-notify-watcher-series
WIP. Sharing comments I have so far
@@ -29,10 +29,12 @@ use hashing;

use petgraph;

mod entry;
// make the entry module public for testing purposes. We use it to contruct mock
typo: contruct -> construct
@@ -29,10 +29,12 @@ use hashing;

use petgraph;

mod entry;
// make the entry module public for testing purposes. We use it to contruct mock
As a general guideline, I'm a bit opposed to changing an API for the purpose of testing. An API should expose what logically makes sense for a user, and test code should emulate a user. That being said, there are practicalities which do need some implementation details leaking, and mocking is one of them; so I'm willing to accept it as a compromise if the tests enabled by this carry their weight. Will comment on the test site if I don't think they do. If I add no comment, consider this a general rambly discussion point and feel free to ignore it 😄
src/rust/engine/graph/src/lib.rs
Outdated
// to use the extra test functions, and they should only be imported into
// test modules.
pub mod test_support {
  use super::{EntryId, EntryState, Graph, Node};
Actually I like this approach much more 😄: you force the user to declare they want this functionality for tests. Do you think it would be possible to move the `pub mod entry` to inside this module and maybe, instead of going from `pub(crate)` to `pub`, go to `pub(in test_support)`?
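For readers skimming the thread, the pattern under discussion looks roughly like this self-contained sketch (names invented): test-only helpers live in an explicitly named child module, which can still see the parent's private items, so importing `test_support` signals intent at the use site.

```
mod graph {
  pub struct Graph {
    nodes: Vec<u32>, // private: not part of the public API
  }

  impl Graph {
    pub fn new() -> Graph {
      Graph { nodes: vec![] }
    }
  }

  // Test-only surface: callers opt in by importing graph::test_support.
  pub mod test_support {
    use super::Graph;

    pub fn node_count(g: &Graph) -> usize {
      // A child module may access its parent's private fields.
      g.nodes.len()
    }
  }
}

fn main() {
  assert_eq!(0, graph::test_support::node_count(&graph::Graph::new()));
}
```

(For reference, `pub(in <path>)` restricts visibility to an ancestor module path, so it would be spelled e.g. `pub(in crate::graph)`.)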
let http_client = reqwest::Client::new();
let rule_graph = RuleGraph::new(tasks.as_map(), root_subject_types);

Ok(Core {
  graph: Graph::new(),
  graph: graph,
I'm kind of being clippy, but can't it be omitted here?
src/rust/engine/src/watch.rs
Outdated
///
/// TODO: Need the above polling
///
/// TODO: To simplify testing the InvalidationWatcher we could create a trait which
This indirectly answers my question on exposing parts of the graph entry publicly. When we drop support for watchman, it should be possible to tighten the visibility of these functions and types too. It may be worth mentioning that in this comment, so that when this TODO gets enacted, the implementer remembers to tighten these APIs again.
src/rust/engine/src/watch.rs
Outdated
@@ -49,6 +50,7 @@ impl InvalidationWatcher {
  executor: Executor,
  build_root: PathBuf,
  ignorer: Arc<GitignoreStyleExcludes>,
  enabled: bool,
It feels to me that creating a disabled `InvalidationWatcher` is a weird pattern. Would it be possible to replace this logic with the caller holding an `Option<InvalidationWatcher>` and avoid the need for any change to this class?
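The suggestion, sketched with the watcher type stubbed out and a hypothetical `watch` method (this is the reviewer's proposed shape, not the PR's code):

```
use std::path::Path;

struct InvalidationWatcher; // stand-in for the engine's watcher

impl InvalidationWatcher {
  fn watch(&self, _path: &Path) { /* hypothetical */ }
}

struct Core {
  // None when the experimental watcher is disabled, instead of constructing
  // an inert watcher that carries an `enabled: bool`.
  watcher: Option<InvalidationWatcher>,
}

impl Core {
  fn maybe_watch(&self, path: &Path) {
    if let Some(w) = &self.watcher {
      w.watch(path);
    }
  }
}

fn main() {
  let core = Core { watcher: None }; // watcher disabled: calls are no-ops
  core.maybe_watch(Path::new("src"));
}
```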
@@ -287,12 +291,18 @@ impl<N: Node> InnerGraph<N> {
/// The Walk will iterate over all nodes that descend from the roots in the direction of
Probably worth updating the doc comment to add: "until meeting a node that matches the `stop_walking_predicate`".
type Item = EntryId;

fn next(&mut self) -> Option<Self::Item> {
  while let Some(id) = self.deque.pop_front() {
    if !self.walked.insert(id) {
      // Visit this node and it neighbors if this node has not yet be visited and we aren't
typo: be -> been
// Visit this node and it neighbors if this node has not yet be visited and we aren't
// stopping our walk at this node, based on if it satifies the stop_walking_predicate.
// This mechanism gives us a way to selectively dirty parts of the graph respecting node boundaries
// like uncacheable nodes, which sholdn't be dirtied.
typo: sholdn't -> shouldn't
…el BUILD files and hope that pants does not notice them before we scan the directory again! # Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests]
Ok @wisechengyi This is as green as it will get!
I've left comments as I reviewed the code, but no blocker. Happy with this change. It's quite the heavy lifting that you and Stu did there! Thanks both for your work on this 😄
if generation == last_seen_generation && result.poll_should_wait(context) {
  // The Node is currently clean with the observed generation: add a poller on the
  // Completed node that will be notified when it is dirtied or dropped. If the Node moves
  // to another state, the received will be notified that the sender was dropped, and it
typo: received -> receiver
// But polling with the previous token should wait, since nothing has changed.
let request = graph.poll(TNode::new(2), Some(token2), None, &context);
match timeout(Duration::from_millis(1000), request).await {
Am I right in interpreting this as: this test will actually wait a full second before convincing itself that the poll timed out? Does it need to be that high? The drawback is that this test won't be responsive. What poses a lower bound? I'm guessing the polling interval of 100 ms does pose one, and we want to allow a few polling intervals. Would it even be worth making that configurable and using a low value in the context of tests?
// Polling with the previous token (in the same session) should wait, since nothing has changed.
let request = graph.poll(TNode::new(2), Some(token1), None, &context);
match timeout(Duration::from_millis(1000), request).await {
Same comment on full second wait. These add up pretty fast.
@@ -23,7 +23,7 @@ use tokio::task_local;
use ui::EngineDisplay;
use uuid::Uuid;

const TIME_FORMAT_STR: &str = "%H:%M:%S";
const TIME_FORMAT_STR: &str = "%H:%M:%S:%3f";
Could you add a comment to explain what this format means? I don't know it by heart and it may help other readers too.
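Assuming this is chrono's strftime syntax (which the `%3f` specifier suggests), the requested comment might read:

```
// %H:%M:%S is hour:minute:second; %3f is the fractional second truncated to
// three digits, so timestamps render like "14:03:07:123" (ms after a colon).
const TIME_FORMAT_STR: &str = "%H:%M:%S:%3f";
```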
Problem
To release notify watcher changes we need to pick the notify watcher series from upstream.
Solution
Cherry-pick the notify watcher change series.
Result
Running `./pants --v2-ui --v2 --no-v1 --experimental-fs-watcher --no-watchman-enable --enable-pantsd fmt2 src/python/pants/::` works.