Less-exponential alternative to internal_service_policy()? #1270
Comments
The exponential backoff crate we're using allows setting a different "max_interval", which I started to use in #1216. In some spots I have also just used a constant sleep, but that has other undesirable properties.
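For concreteness, a minimal sketch of what lowering the crate's max_interval might look like; the field names match the backoff crate's ExponentialBackoff struct, but the function name and the specific values here are illustrative assumptions, not what #1216 actually uses:

```rust
use std::time::Duration;
use backoff::ExponentialBackoff;

// Illustrative only: a "retry forever" policy that still backs off
// exponentially, but never waits more than 15 seconds between attempts.
fn short_interval_policy() -> ExponentialBackoff {
    ExponentialBackoff {
        current_interval: Duration::from_millis(250),
        initial_interval: Duration::from_millis(250),
        max_interval: Duration::from_secs(15),
        // None means the policy never gives up.
        max_elapsed_time: None,
        ..ExponentialBackoff::default()
    }
}
```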
Oh nice, that looks great. 👍
Yeah, we should think carefully about this. These policies can make or break the system when it's under moderate stress, so I think it's dangerous to start tweaking the parameters in various specific code paths. I think there's a lot of value in a small number of canned policies intended for specific use cases. (That's not to say everything should use the same one; I'd definitely use a different one for services on the WAN, for example.)

In the disk creation case, it really feels like we want to use a long poll or an explicit notification in the other direction. Retrying with a short poll creates more work (which can itself create overload) and more latency before we discover when it's done; I don't know what advantage it has. Can we do something similar (long poll or explicit notification) for the RSS case?
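To illustrate the "explicit notification in the other direction" alternative being suggested, here is a minimal sketch using tokio's Notify. The overall shape is hypothetical (omicron's disk-creation path is not claimed to look like this); it only shows the polling-free pattern:

```rust
use std::sync::Arc;
use tokio::sync::Notify;

#[tokio::main]
async fn main() {
    // The side doing the work notifies when it finishes; the waiting side
    // blocks on the notification instead of re-polling on a timer.
    let done = Arc::new(Notify::new());
    let done_tx = Arc::clone(&done);

    tokio::spawn(async move {
        // ... long-running work happens here (e.g. finishing a disk create) ...
        done_tx.notify_one();
    });

    // No retry policy to tune and no wasted polls: this wakes up exactly
    // when the other side says the work is complete.
    done.notified().await;
    println!("operation complete");
}
```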
I think there are several RSS and bootstrap-agent cases, and the answer may be different for different cases. In #1251 we're waiting until our local maghemite service is aware of N peers, which is in a retry loop that handles both:

a) our maghemite service is still starting up, so we can't even connect to it yet
b) maghemite is up, but it hasn't yet seen the expected number of peers (e.g., because other sleds aren't up or reachable on the bootstrap network yet)

This happens in two places: RSS (waiting to see the expected set of sleds before rack setup proceeds) and the bootstrap agent (waiting to see enough peers to unlock via trust quorum).
In both cases, absent errors preventing mg-ddm from starting entirely (at which point the sled is crippled and won't be able to route to others), (a) is very short, and retrying with a short poll seems okay.

I think (b) is different in the two cases. In RSS's case, presumably if we've already shown an operator "this is your rack inventory" and they've said "go", we've already established the set of sleds. But maybe that's only over the management network, and we could still be waiting for the bootstrap network? Even if that's the case, the rack is not doing any other work because it's still being initialized, so short polling is probably okay.

In the bootstrap agent case, (b) could take much longer (e.g., if there aren't enough sleds powered on to establish a trust quorum and we're waiting for more to come online), but I don't think short polling creates much more work; all these requests to maghemite are localhost connections that only ask for its current state and don't create any more off-sled traffic than it's already generating. The current sled can't do any useful work until it unlocks, so short polling is probably okay here too.
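For readers following along, a minimal sketch of the kind of short, constant-interval localhost poll described above. The wait_for_peers and ddm_peer_count names, the error type, and the one-second interval are hypothetical placeholders, not maghemite's or omicron's actual API:

```rust
use std::future::Future;
use std::time::Duration;

// Poll a local "how many peers does mg-ddm see?" query until it reports at
// least `expected` peers. `ddm_peer_count` is a stand-in for the real client
// call to the local maghemite service.
async fn wait_for_peers<F, Fut>(expected: usize, ddm_peer_count: F)
where
    F: Fn() -> Fut,
    Fut: Future<Output = Result<usize, String>>,
{
    loop {
        match ddm_peer_count().await {
            // Done: maghemite is up and sees enough peers.
            Ok(n) if n >= expected => return,
            // Case (b): maghemite is up but hasn't seen enough peers yet.
            Ok(_) => {}
            // Case (a): maghemite may still be starting, so even connecting
            // can fail; just try again shortly.
            Err(_) => {}
        }
        // Cheap localhost request, so a short constant interval mostly costs
        // latency rather than load.
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}
```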
(All that said, I completely agree with you about analyzing each of these cases - the previous comment covers the two cases I was thinking of when I left a note about this on #1251, but there are still ~20 other uses of retry policies that need similar consideration. I don't know if that's already happened or is captured anywhere, though.)
internal_service_policy() is used throughout omicron (at the time of creating this issue, ~20 times). It serves as a "retry this forever with exponential backoff" policy, but there are cases where either exponential backoff is not necessary, or the max backoff time allowed by internal_service_policy() (1 hour) is much too long. #996 describes one such case with discussion about alternatives (some of which may be specific to disk creation). RSS also has several uses of this policy, which seems unlikely to be correct; e.g., #1251 uses this policy to wait for maghemite to report its awareness of other sleds in the rack, but that ultimately happens while an operator is sitting waiting for RSS to make progress, which definitely shouldn't allow for an hour between retries.

Should we have at least one other retry policy available that has a much shorter max interval?
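To make the scale of the problem concrete, here is a small sketch that prints the retry schedule for a policy whose max_interval is one hour. The initial interval and multiplier are assumptions for illustration, not necessarily what internal_service_policy() actually configures:

```rust
use std::time::Duration;
use backoff::backoff::Backoff;
use backoff::ExponentialBackoff;

fn main() {
    // Assumed parameters for illustration; the real internal_service_policy()
    // values may differ, but the shape of the problem is the same.
    let mut policy = ExponentialBackoff {
        current_interval: Duration::from_millis(250),
        initial_interval: Duration::from_millis(250),
        multiplier: 2.0,
        max_interval: Duration::from_secs(60 * 60), // the 1-hour cap at issue
        max_elapsed_time: None,                     // retry forever
        ..ExponentialBackoff::default()
    };

    // Prints roughly 250ms, 500ms, 1s, 2s, ... (with jitter); by around the
    // 15th attempt the delay reaches the one-hour cap, which is far too long
    // for an operator sitting at RSS waiting for progress.
    for attempt in 1..=16 {
        let delay = policy.next_backoff().expect("max_elapsed_time is None");
        println!("attempt {attempt}: next wait {delay:?}");
    }
}
```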