Less-exponential alternative to internal_service_policy()? #1270

Closed
jgallagher opened this issue Jun 24, 2022 · 5 comments · Fixed by #1996

@jgallagher
Contributor

internal_service_policy() is used throughout omicron (~20 call sites at the time of writing). It serves as a "retry this forever with exponential backoff" policy, but there are cases where either exponential backoff is unnecessary or the maximum backoff allowed by internal_service_policy() (one hour) is much too long. #996 describes one such case, with discussion of alternatives (some of which may be specific to disk creation). RSS also has several uses of this policy, which seems unlikely to be correct; e.g., #1251 uses it to wait for maghemite to report its awareness of other sleds in the rack, but that wait happens while an operator is sitting there waiting for RSS to make progress, which definitely shouldn't allow an hour between retries. Should we have at least one other retry policy available with a much shorter max interval?
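
For reference, a policy with that behavior built on the exponential backoff crate looks roughly like the sketch below. Only the one-hour cap and the retry-forever behavior come from the description above; the initial interval and multiplier are placeholder assumptions, not necessarily omicron's actual values.

```rust
use std::time::Duration;
use backoff::{ExponentialBackoff, ExponentialBackoffBuilder};

/// Rough sketch of a "retry forever with exponential backoff" policy.
/// The one-hour cap and the unbounded elapsed time match the behavior
/// described above; the initial interval and multiplier are placeholders.
fn internal_service_policy_sketch() -> ExponentialBackoff {
    ExponentialBackoffBuilder::new()
        .with_initial_interval(Duration::from_millis(250)) // assumed
        .with_multiplier(2.0)                               // assumed
        .with_max_interval(Duration::from_secs(60 * 60))    // up to an hour between retries
        .with_max_elapsed_time(None)                        // never give up
        .build()
}
```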

@smklein
Collaborator

smklein commented Jun 24, 2022

The exponential backoff crate we're using allows setting a different "max_interval", which I started to use in #1216.

In some spots I have also just used a constant sleep between retries, but that has its own undesirable properties.
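
To make that concrete (the specific numbers here are illustrative, not the values used in #1216), a shorter-fuse policy only needs a different max_interval passed to the builder:

```rust
use std::time::Duration;
use backoff::{ExponentialBackoff, ExponentialBackoffBuilder};

/// Illustrative variant: still exponential, still retries forever, but
/// never waits more than a few seconds between attempts.
fn short_interval_policy() -> ExponentialBackoff {
    ExponentialBackoffBuilder::new()
        .with_initial_interval(Duration::from_millis(100)) // assumed
        .with_max_interval(Duration::from_secs(5))          // seconds, not an hour
        .with_max_elapsed_time(None)                        // still retry forever
        .build()
}
```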

@jgallagher
Contributor Author

Oh nice, that looks great. 👍

@davepacheco
Collaborator

Yeah, internal_service_policy() is intended for establishing TCP connections or making requests to other internal services, where it's very possible that the attempt itself is contributing to an overload condition that prevents it from succeeding. Exponential backoff is more dubious when the remote service is working and giving a clear response that doesn't suggest overload.

We should think carefully about this. These policies can make or break the system when it's under moderate stress. I think it's dangerous to start tweaking the parameters in various specific code paths. I think there's a lot of value in a small number of canned policies intended for specific use cases. (That's not to say everything should use the same one. I'd definitely use a different one for services on the WAN, for example.)

In the disk creation case, it really feels like we want to use a long poll or an explicit notification in the other direction. Retrying with a short poll creates more work (which can also create overload) and more latency to discover when it's done. I don't know what advantage it has. Can we do something similar (long poll or explicit notification) for the RSS case?

@jgallagher
Contributor Author

> Can we do something similar (long poll or explicit notification) for the RSS case?

I think there are several RSS and bootstrap-agent cases, and the answer may be different for each. In #1251 we're waiting until our local maghemite service is aware of N peers; that wait happens in a retry loop that handles both

a) our maghemite service is still starting up, so we can't even connect to it yet
b) our maghemite service is running but hasn't discovered the expected number of peers yet

This happens in two places:

  1. RSS has been told to initialize N sleds and we're waiting for our maghemite to know them all
  2. Bootstrap agent knows it's a member of a trust quorum that requires at least N sleds to unlock, and we're waiting for our maghemite to know at least N

In both cases, absent errors that prevent mg-ddm from starting entirely (at which point the sled is crippled and won't be able to route to others), (a) is very short, and retrying with a short poll seems okay?

I think (b) is different in the two cases. In RSS's case, presumably if we've already shown an operator "this is your rack inventory" and they've said "go", we've already established the set of sleds. But maybe that's only over the management network, and we could still be waiting for the bootstrap network? Even if that's the case, the rack is not doing any work because it's in the initialization process, so short polling is probably okay.

In the bootstrap agent case, (b) could take much longer (e.g., if there aren't enough sleds powered on to establish a trust quorum and we're waiting for more to come online), but I don't think short polling really creates much more work; all of these requests to maghemite are localhost connections that only ask for its current state and don't create any more off-sled traffic than it's already generating. The current sled can't do any useful work until it unlocks, so short polling is probably okay here too.
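
As a concrete (and entirely hypothetical) sketch of that loop's shape, built on the same backoff crate: both the connection failure in (a) and the "not enough peers yet" condition in (b) can be surfaced as transient errors, so a single short-interval policy drives the whole wait. The peer_count() helper and the numbers are made up for illustration, and the real code is async rather than blocking.

```rust
use std::time::Duration;
use backoff::{retry, Error as BackoffError, ExponentialBackoffBuilder};

/// Hypothetical stand-in for a query to the local maghemite service,
/// returning the number of peers it currently knows about, or an error
/// if we can't reach it yet (case (a)).
fn peer_count() -> Result<usize, String> {
    // Placeholder: the real code would ask maghemite over localhost.
    Err("maghemite not reachable yet".to_string())
}

/// Wait with a short poll until maghemite reports at least `wanted` peers.
/// Both "can't connect yet" and "not enough peers yet" are treated as
/// transient, so one short-interval policy covers cases (a) and (b).
fn wait_for_peers(wanted: usize) -> Result<usize, BackoffError<String>> {
    let policy = ExponentialBackoffBuilder::new()
        .with_initial_interval(Duration::from_millis(250)) // assumed
        .with_max_interval(Duration::from_secs(5))          // short cap
        .with_max_elapsed_time(None)                        // wait as long as it takes
        .build();

    retry(policy, || {
        // case (a): service not reachable yet
        let n = peer_count().map_err(BackoffError::transient)?;
        if n >= wanted {
            Ok(n)
        } else {
            // case (b): running, but not enough peers discovered yet
            Err(BackoffError::transient(format!(
                "only {n} of {wanted} peers known so far"
            )))
        }
    })
}
```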

@jgallagher
Contributor Author

jgallagher commented Jun 24, 2022

(All that said, I completely agree with you about analyzing each of these cases - the previous comment covers the two cases I was thinking of when I left a note about this on #1251, but there are still ~20 other uses of retry policies that need similar consideration. I don't know whether that analysis has already happened or is captured anywhere, though.)
