Less-exponential alternative to internal_service_policy()? #1270

Closed
jgallagher opened this issue Jun 24, 2022 · 5 comments · Fixed by #1996

@jgallagher
Contributor

internal_service_policy() is used throughout omicron (~20 call sites at the time of writing). It serves as a "retry this forever with exponential backoff" policy, but there are cases where either exponential backoff is unnecessary or the maximum backoff allowed by internal_service_policy() (one hour) is much too long. #996 describes one such case, with discussion of alternatives (some of which may be specific to disk creation). RSS also has several uses of this policy, which seems unlikely to be correct; e.g., #1251 uses it to wait for maghemite to report its awareness of other sleds in the rack, but that wait happens while an operator is sitting there waiting for RSS to make progress, which definitely shouldn't allow an hour between retries. Should we have at least one other retry policy available with a much shorter max interval?
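
For reference, a policy with that behavior built on the exponential backoff crate looks roughly like the sketch below. Only the one-hour cap and the retry-forever behavior come from the description above; the initial interval and multiplier are placeholder assumptions, not necessarily omicron's actual values.

```rust
use std::time::Duration;
use backoff::{ExponentialBackoff, ExponentialBackoffBuilder};

/// Rough sketch of a "retry forever with exponential backoff" policy.
/// The one-hour cap and the unbounded elapsed time match the behavior
/// described above; the initial interval and multiplier are placeholders.
fn internal_service_policy_sketch() -> ExponentialBackoff {
    ExponentialBackoffBuilder::new()
        .with_initial_interval(Duration::from_millis(250)) // assumed
        .with_multiplier(2.0)                               // assumed
        .with_max_interval(Duration::from_secs(60 * 60))    // up to an hour between retries
        .with_max_elapsed_time(None)                        // never give up
        .build()
}
```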

@smklein
Collaborator

smklein commented Jun 24, 2022

The exponential backoff crate we're using allows setting a different "max_interval", which I started to use in #1216.

In some spots I have also just used a constant sleep between retries, but that has its own undesirable properties.
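
To make that concrete (the specific numbers here are illustrative, not the values used in #1216), a shorter-fuse policy only needs a different max_interval passed to the builder:

```rust
use std::time::Duration;
use backoff::{ExponentialBackoff, ExponentialBackoffBuilder};

/// Illustrative variant: still exponential, still retries forever, but
/// never waits more than a few seconds between attempts.
fn short_interval_policy() -> ExponentialBackoff {
    ExponentialBackoffBuilder::new()
        .with_initial_interval(Duration::from_millis(100)) // assumed
        .with_max_interval(Duration::from_secs(5))          // seconds, not an hour
        .with_max_elapsed_time(None)                        // still retry forever
        .build()
}
```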

@jgallagher
Contributor Author

Oh nice, that looks great. 👍

@davepacheco
Collaborator

Yeah, internal_service_policy() is intended for establishing TCP connections or making requests to other internal services, where it's very possible that the attempt itself is contributing to an overload condition that prevents it from succeeding. Exponential backoff is more dubious when the remote service is working and giving a clear response that doesn't suggest overload.

We should think carefully about this. These policies can make or break the system when it's under moderate stress. I think it's dangerous to start tweaking the parameters in various specific code paths. I think there's a lot of value in a small number of canned policies intended for specific use cases. (That's not to say everything should use the same one. I'd definitely use a different one for services on the WAN, for example.)

In the disk creation case, it really feels like we want to use a long poll or an explicit notification in the other direction. Retrying with a short poll creates more work (which can also create overload) and more latency to discover when it's done. I don't know what advantage it has. Can we do something similar (long poll or explicit notification) for the RSS case?

@jgallagher
Contributor Author

> Can we do something similar (long poll or explicit notification) for the RSS case?

I think there are several RSS and bootstrap-agent cases, and the answer may be different for each. In #1251 we're waiting until our local maghemite service is aware of N peers; that wait happens in a retry loop that handles both

a) our maghemite service is still starting up, so we can't even connect to it yet
b) our maghemite service is running but hasn't discovered the expected number of peers yet

This happens in two places:

  1. RSS has been told to initialize N sleds and we're waiting for our maghemite to know them all
  2. Bootstrap agent knows it's a member of a trust quorum that requires at least N sleds to unlock, and we're waiting for our maghemite to know at least N

In both cases, absent errors that prevent mg-ddm from starting entirely (at which point the sled is crippled and won't be able to route to others), (a) is very short, and retrying with a short poll seems okay?

I think (b) is different in the two cases. In RSS's case, presumably if we've already shown an operator "this is your rack inventory" and they've said "go", we've already established the set of sleds. But maybe that's only over the management network, and we could still be waiting for the bootstrap network? Even if that's the case, the rack is not doing any work because it's in the initialization process, so short polling is probably okay.

In the bootstrap agent case, (b) could take much longer (e.g., if there aren't enough sleds powered on to establish a trust quorum and we're waiting for more to come online), but I don't think short polling really creates much more work; all of these requests to maghemite are localhost connections that only ask for its current state and don't create any more off-sled traffic than it's already generating. The current sled can't do any useful work until it unlocks, so short polling is probably okay here too.
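
As a concrete (and entirely hypothetical) sketch of that loop's shape, built on the same backoff crate: both the connection failure in (a) and the "not enough peers yet" condition in (b) can be surfaced as transient errors, so a single short-interval policy drives the whole wait. The peer_count() helper and the numbers are made up for illustration, and the real code is async rather than blocking.

```rust
use std::time::Duration;
use backoff::{retry, Error as BackoffError, ExponentialBackoffBuilder};

/// Hypothetical stand-in for a query to the local maghemite service,
/// returning the number of peers it currently knows about, or an error
/// if we can't reach it yet (case (a)).
fn peer_count() -> Result<usize, String> {
    // Placeholder: the real code would ask maghemite over localhost.
    Err("maghemite not reachable yet".to_string())
}

/// Wait with a short poll until maghemite reports at least `wanted` peers.
/// Both "can't connect yet" and "not enough peers yet" are treated as
/// transient, so one short-interval policy covers cases (a) and (b).
fn wait_for_peers(wanted: usize) -> Result<usize, BackoffError<String>> {
    let policy = ExponentialBackoffBuilder::new()
        .with_initial_interval(Duration::from_millis(250)) // assumed
        .with_max_interval(Duration::from_secs(5))          // short cap
        .with_max_elapsed_time(None)                        // wait as long as it takes
        .build();

    retry(policy, || {
        // case (a): service not reachable yet
        let n = peer_count().map_err(BackoffError::transient)?;
        if n >= wanted {
            Ok(n)
        } else {
            // case (b): running, but not enough peers discovered yet
            Err(BackoffError::transient(format!(
                "only {n} of {wanted} peers known so far"
            )))
        }
    })
}
```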

@jgallagher
Contributor Author

jgallagher commented Jun 24, 2022

(All that said, I completely agree with you about analyzing each of these cases - the previous comment covers the two cases I was thinking of when I left a note about this on #1251, but there are still ~20 other uses of retry policies that need similar consideration. I don't know whether that analysis has already happened or is captured anywhere, though.)
