Add serverset #6921
Conversation
This is large: will review before end of day. One quick comment.
}

#[derive(Clone, Copy, Debug)]
pub struct BackoffConfig {
Could/should we use https://docs.rs/tokio-retry/0.2.0/tokio_retry/ here?
There's a comment about avoiding futures here because of fork safety; when we're out of our fork world, I will gladly rewrite this to use tokio_retry :)
This looks good to me. I won't approve because I don't have much experience with multithreading, so I can't be confident that this is correct. But it looks nice!
src/rust/engine/serverset/src/lib.rs
Outdated
if let Some(ref mut unhealthy_info) = *unhealthy_info {
  unhealthy_info.unhealthy_since = Instant::now();
  // failure_backoff_ratio's numer and denom both fit in u8s, so hopefully this won't
  // overflow of lose too much precision...
s/of/or/
Done
src/rust/engine/serverset/src/lib.rs
Outdated
// overflow of lose too much precision...
unhealthy_info.next_attempt_after *= *self.inner.failure_backoff_ratio.numer();
unhealthy_info.next_attempt_after /= *self.inner.failure_backoff_ratio.denom();
unhealthy_info.next_attempt_after = std::cmp::min(
This series of assignments seems odd to me; why not calculate the correct value and then assign it once? As far as I can tell we won't have any write race conditions because we lock unhealthy_info, we won't have read RCs because self.inner is atomic, and we won't have read-after-write hazards on self.inner.failure_backoff_ratio because it doesn't change except at initialization. And even if it did change, the correct behavior would presumably be to make a thread-local copy first to read from.
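For illustration, a minimal standalone sketch of the compute-once-then-assign shape being suggested here; the function and parameter names are made up for this sketch, not the ones in the PR:

use std::time::Duration;

// Scale the backoff by a numer/denom ratio, clamp it, and assign the result once.
fn back_off(next_attempt_after: &mut Duration, numer: u32, denom: u32, max_lame: Duration) {
  let scaled = (*next_attempt_after * numer) / denom;
  *next_attempt_after = std::cmp::min(scaled, max_lame);
}

fn main() {
  let mut next = Duration::from_millis(100);
  back_off(&mut next, 3, 2, Duration::from_secs(10));
  assert_eq!(next, Duration::from_millis(150));
}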
Done
src/rust/engine/serverset/src/lib.rs
Outdated
// failure_backoff_ratio's numer and denom both fit in u8s, so hopefully this won't
// overflow of lose too much precision...
unhealthy_info.next_attempt_after *= *self.inner.failure_backoff_ratio.denom();
unhealthy_info.next_attempt_after /= *self.inner.failure_backoff_ratio.numer();
Same as above
Done
src/rust/engine/serverset/src/lib.rs
Outdated
  break (i, server);
};

let serverset: Serverset<T> = (*self).clone();
How expensive is this clone? It looks like not very much, because it's essentially a copy of an Arc, but just checking.
Yeah, just an Arc, super cheap :)
src/rust/engine/serverset/src/lib.rs
Outdated
  assert_eq!(expect, saw);
}

fn mark(s: &Serverset<&'static str>, health: Health) {
Stringly typed servers, I like it! (I really do, it's a cute way to test)
src/rust/engine/serverset/src/lib.rs
Outdated
/// last known status.
///
/// If all resources are unhealthy, this function will block the calling thread until the backoff
/// period has completed. We'd probably prefer to use some Future-based scheduling, but that
Maybe add a TODO for the Future? (a little, unfunny pun intended)
Done
#[derive(Debug)]
struct Backend<T> {
  server: T,
  unhealthy_info: Arc<Mutex<Option<UnhealthyInfo>>>,
Would the interface improve if we made this an Option<Mutex<Arc<>>>? Here is a solution for the Scoping problem.
The Option is specifically the thing we may set/unset under the lock, so it needs to be this way around :)
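For context, a small standalone sketch of that set/unset-under-the-lock pattern. It uses std::sync::Mutex and reduces UnhealthyInfo to a single field for brevity; the PR's actual Mutex type and struct layout may differ:

use std::sync::{Arc, Mutex};
use std::time::Instant;

#[derive(Debug)]
struct UnhealthyInfo {
  unhealthy_since: Instant,
}

fn main() {
  // The Option lives *inside* the Mutex, so flipping a server between healthy
  // (None) and unhealthy (Some) always happens while holding the lock.
  let info: Arc<Mutex<Option<UnhealthyInfo>>> = Arc::new(Mutex::new(None));

  // Mark unhealthy: set the Option under the lock.
  *info.lock().unwrap() = Some(UnhealthyInfo { unhealthy_since: Instant::now() });

  // Mark healthy again: clear it, still under the lock.
  *info.lock().unwrap() = None;

  assert!(info.lock().unwrap().is_none());
}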
pub fn next(&self) -> (T, Box<Fn(Health) + Send + Sync>) {
  let (i, server) = loop {
    let i = self.inner.next.fetch_add(1, Ordering::Relaxed) % self.inner.servers.len();
    let server = &self.inner.servers[i];
I assume that inner.servers only gets mutated at creation, and that the field being private guarantees it doesn't get mutated elsewhere. If that is the case, this is okay. Otherwise, there could be a disconnect between i and the actual vector.
Still, maybe worth throwing in an RwLock, if reading has little overhead.
In fact, because we Arc up the Inner, the Vec cannot ever be mutated: there's no way for anyone to get a mutable reference to it (unless they are the only holder of the Arc, in which case they would need to unwrap it, which proves that no one else could have any dangling callbacks) :) Yay rust!
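A tiny standalone sketch of that property, using an illustrative Inner rather than the PR's: with more than one Arc handle alive, no one can obtain a mutable reference, and the sole remaining owner has to unwrap the Arc before mutating.

use std::sync::Arc;

struct Inner {
  servers: Vec<&'static str>,
}

fn main() {
  let mut inner = Arc::new(Inner { servers: vec!["a", "b"] });
  let other_handle = Arc::clone(&inner);

  // While a second handle exists, Arc::get_mut refuses to hand out &mut Inner,
  // so indices into `servers` can never be invalidated by a concurrent mutation.
  assert!(Arc::get_mut(&mut inner).is_none());
  assert_eq!(other_handle.servers.len(), 2);

  // Only once every other handle is gone can the sole owner unwrap and mutate.
  drop(other_handle);
  let mut sole = Arc::try_unwrap(inner).ok().expect("only one handle left");
  sole.servers.push("c");
  assert_eq!(sole.servers.len(), 3);
}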
See the note about busywaiting. But I like that this is separated out as an independent crate!
src/rust/engine/serverset/src/lib.rs
Outdated
  max_lame,
} = backoff_config;

if backoff_ratio < 1.0 {
Is it worth giving BackoffConfig its own fn new to perform these calculations/validation, and then having this method be infallible? That would also make it natural to give Inner an instance of BackoffConfig rather than namespacing the fields with failure_.
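A rough standalone sketch of what such a validating constructor could look like; the field names are guessed from the diff and the error type is simplified to a String:

use std::time::Duration;

#[derive(Clone, Copy, Debug)]
pub struct BackoffConfig {
  pub initial_lame: Duration,
  pub backoff_ratio: f64,
  pub max_lame: Duration,
}

impl BackoffConfig {
  // Validate once at construction so the Serverset constructor can stay infallible.
  pub fn new(initial_lame: Duration, backoff_ratio: f64, max_lame: Duration) -> Result<BackoffConfig, String> {
    if backoff_ratio < 1.0 {
      return Err(format!("backoff_ratio must be >= 1.0, got {}", backoff_ratio));
    }
    Ok(BackoffConfig { initial_lame, backoff_ratio, max_lame })
  }
}

fn main() {
  assert!(BackoffConfig::new(Duration::from_millis(10), 0.5, Duration::from_secs(5)).is_err());
  assert!(BackoffConfig::new(Duration::from_millis(10), 2.0, Duration::from_secs(5)).is_ok());
}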
Done
src/rust/engine/serverset/src/lib.rs
Outdated
/// unhealthy.
///
/// The inverse is used when easing back in after health recovery.
pub backoff_ratio: f64,
Given that we think we'll be switching to a library for the backoff implementation, using ratios here seems like it might be overkill. Would assuming a 2x ratio (or having a configurable integer ratio) be reasonable until then?
After futzing with Futures, I'm pretty sure we don't want to switch to a library for the backoff implementation. None of the existing ones I could find support ease in as well as backoff, and the code is now pretty well encapsulated. I'd be happy to split off a separate backoff crate, if you'd like.
src/rust/engine/serverset/src/lib.rs
Outdated
/// would require this type to be Resettable because of our fork model, which would be very
/// complex.
///
/// TODO: Switch to use tokio_retry when we don't need to worry about forking without execing.
On this topic: we already guarantee that all threads/pools/connections/etc are shut down or dropped before forking... I believe that because tokio_retry does not create threads (instead, it uses the Timer facility provided per-thread by tokio_runtime), it should be safe to use tokio_retry. One thing we'd need to ensure though is that we're interacting with the Timer from within the runtime's threads (ie, not on the io-pool... we should probably kill the io-pool).
If you think otherwise though, #6818 will need an update.
Updated to use futures :)
src/rust/engine/serverset/src/lib.rs
Outdated
let serverset: Serverset<T> = (*self).clone();

let callback = Box::new(move |health: Health| serverset.callback(i, health));
Given that we have to trust the caller to call the callback anyway, maybe giving them an opaque token to explicitly call serverset.callback with would be a bit more efficient (would avoid the allocation of the callback function).
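A standalone sketch of the token-returning shape being suggested; all of the names here and the trivial always-pick-server-0 selection are illustrative only, not the PR's actual code:

pub enum Health {
  Healthy,
  Unhealthy,
}

pub struct CallbackToken {
  index: usize,
}

pub struct Serverset<T> {
  servers: Vec<T>,
}

impl<T: Clone> Serverset<T> {
  // Hand back the server plus an index-carrying token instead of a boxed closure.
  pub fn next(&self) -> (T, CallbackToken) {
    (self.servers[0].clone(), CallbackToken { index: 0 })
  }

  // The caller passes the token back when reporting the outcome.
  pub fn callback(&self, token: CallbackToken, health: Health) {
    // Health bookkeeping for self.servers[token.index] would go here.
    let _ = (token.index, health);
  }
}

fn main() {
  let s = Serverset { servers: vec!["localhost:50051"] };
  let (server, token) = s.next();
  println!("picked {}", server);
  s.callback(token, Health::Healthy);
}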
Done
src/rust/engine/serverset/src/lib.rs
Outdated
}

fn multiply(duration: Duration, fraction: num_rational::Ratio<u32>) -> Duration {
  (duration * *fraction.numer()) / *fraction.denom()
Dividing first might be a better defense against overflow? Worse for precision though.
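To make the trade-off concrete, a small standalone sketch comparing the two orderings with a plain integer numerator/denominator rather than the PR's num_rational::Ratio:

use std::time::Duration;

// Multiply first: best precision, but the intermediate product can overflow sooner.
fn mul_first(d: Duration, numer: u32, denom: u32) -> Duration {
  (d * numer) / denom
}

// Divide first: more headroom before overflow, but the division truncates earlier.
fn div_first(d: Duration, numer: u32, denom: u32) -> Duration {
  (d / denom) * numer
}

fn main() {
  let d = Duration::from_secs(3600);
  // For values that divide evenly the orderings agree; they only diverge by
  // nanosecond-level truncation, or when the multiply-first product overflows.
  assert_eq!(mul_first(d, 2, 3), div_first(d, 2, 3));
}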
Switched to just use floats everywhere. Which makes me pretty sad, but added a link to rust-lang/rust#54361
I assumed that the reason for the ratio was to allow for test determinism? Or are things "close enough" with floats for that not to matter?
let i = self.inner.next.fetch_add(1, Ordering::Relaxed) % self.inner.servers.len();
let server = &self.inner.servers[i];
let unhealthy_info = server.unhealthy_info.lock();
if let Some(ref unhealthy_info) = *unhealthy_info {
If all hosts are unhealthy, I think that this is going to busywait. In order to not busywait, you'd need to make the result of this method optional. And if the result were optional, your caller would still probably want to lean on a Timer.
So... I feel like incorporating timer usage (either via tokio_retry or directly via tokio_timer) is likely more ergonomic. But giving this a Result<_, Duration> to retry at would be an alternative.
Switched to use a timer.
Specifically not using tokio_retry here because the actual retries should be driven by the caller; what we want is to lame underlying servers here (with their own exponential backoff), and have the caller immediately retry without backoff.
Something we could do is have each server return a Future for when it's going to be usable (weighting between multiple healthy servers, and having unhealthy servers give a (jittered) delay until they're healthy), and have a select around them, which we give to the caller, but that feels like a lot more stuff being scheduled and dropped. Maybe I'm overestimating how expensive kicking off and selecting between futures is, though...
Either explicitly passing in a futures_timer::Timer or switching to tokio_timer will likely be necessary: otherwise, looks good.
src/rust/engine/serverset/src/lib.rs
Outdated
let (index, instant) = earliest_future.unwrap();
let server = self.inner.servers[index].server.clone();
// Note that Delay::new_at(time in the past) gets immediately scheduled.
Delay::new_at(instant)
So, futures_timer does have a dedicated thread, which means it isn't fork safe out of the box. To use it, you'd need to construct and provide the Timer reference manually (as described there), and ensure that it is shut down before forking.
An advantage to using tokio_timer (I think) is that it will "magically" use thread-local references to find the runtime it is running on, so you wouldn't need to pass in an explicit Runtime or Timer instance. The downside, of course, is that that is magical... you could fall into the trap I mentioned above with io-pool and see a panic due to it not being explicit.
src/rust/engine/serverset/src/lib.rs
Outdated
    .to_boxed()
}

pub fn callback(&self, callback_token: CallbackToken, health: Health) {
nit: Now that these are actually exposed, giving them purposeful names would be good... report_health and HealthReportToken or something maybe?
Oh, could also mark the token "#[must_use]".
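A small standalone sketch of how a #[must_use] token behaves; the type name and the stub next_token function are illustrative only:

#[must_use = "report the server's health back before dropping this token"]
pub struct HealthReportToken(usize);

fn next_token() -> HealthReportToken {
  HealthReportToken(0)
}

fn main() {
  // rustc warns here: unused `HealthReportToken` that must be used.
  next_token();
  // Binding the token (and eventually passing it back to report_health) silences the warning.
  let _token = next_token();
}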
This round-robins between a number of backends, skipping any which are
observed to be bad, with exponential backoff and ease-in.
This will be used in remote process execution for CAS operations.