Make timeouts configurable #3782
Conversation
Looks good to me. I think I am OK with the annotation you propose if we assume that these timeout tweaks are meant as an escape hatch for exceptional cases. Are you planning to add docs in a separate PR?
@@ -192,6 +193,11 @@ func Command() *cobra.Command {
		"",
		"Set the IP family to use. Possible values: IPv4, IPv6, \"\" (= auto-detect)",
	)
	cmd.Flags().Duration(
		operator.KubeClientTimeout,
		60*time.Second,
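For reference, a minimal sketch of how such a duration flag is typically registered with cobra and read back through viper; everything outside the diff above (the local constant, the usage string, the binding in main) is an assumption for illustration, not the operator's actual wiring:

```go
package main

import (
	"fmt"
	"time"

	"github.com/spf13/cobra"
	"github.com/spf13/viper"
)

// KubeClientTimeout is assumed to mirror operator.KubeClientTimeout ("kube-client-timeout").
const KubeClientTimeout = "kube-client-timeout"

func main() {
	cmd := &cobra.Command{
		Use: "operator",
		Run: func(_ *cobra.Command, _ []string) {
			// Read the effective value (flag, env var, or config file) through viper.
			fmt.Println("kube client timeout:", viper.GetDuration(KubeClientTimeout))
		},
	}

	// Register the flag with a 60s default, as in the diff above.
	cmd.Flags().Duration(
		KubeClientTimeout,
		60*time.Second,
		"Timeout for requests made by the Kubernetes API client",
	)

	// Let viper pick up the flag value so GetDuration sees it.
	_ = viper.BindPFlags(cmd.Flags())

	_ = cmd.Execute()
}
```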
Am I right that this default of 60 seconds changes the current behaviour for some usages (discovery has 32s apparently)? Don't think it is a big issue, though.
Going by the comment in DiscoveryClient, it looks like the 32s timeout is an arbitrary value used to distinguish that particular instance from among many different timeouts. It is only used when the REST client doesn't have its own timeout set (which we are setting now with this change). I don't think it will have a serious effect, except maybe taking twice as long to detect a problem with discovery.
Of course, there may be other unforeseen side effects from changing the global timeout. I am not too worried about that because we just need to set --kube-client-timeout=0 to get back to the old behaviour.
"github.com/elastic/cloud-on-k8s/pkg/controller/common/version" | ||
"github.com/elastic/cloud-on-k8s/pkg/utils/net" | ||
) | ||
|
||
const ( | ||
// DefaultVotingConfigExclusionsTimeout is the default timeout for setting voting exclusions. | ||
DefaultVotingConfigExclusionsTimeout = "30s" |
Somewhat unrelated to this PR: I could not remember what that timeout was for, so I double-checked the code.
It is used as a query param in the voting config exclusions request: https://www.elastic.co/guide/en/elasticsearch/reference/master/voting-config-exclusions.html#voting-config-exclusions-api-query-params. If the timeout is reached, the request fails and returns an error.
Other API calls, such as GET _cluster/settings (https://www.elastic.co/guide/en/elasticsearch/reference/master/cluster-get-settings.html), also have this timeout query param, but we don't use it (we should!).
For the sake of simplicity (one less thing to configure), I feel like we could get rid of that DefaultVotingConfigExclusionsTimeout and use the defaultClientTimeout instead (which moves the timeout from the default 30sec to 3min). Since we'd also have a 3min timeout on the http call, we may reach it before Elasticsearch reaches its own 3min timeout to set the voting config exclusions. That's not a big deal, I think: this request is idempotent and we do it at every reconciliation anyway.
To summarize:
- keep a single Elasticsearch http timeout for all api calls
- ensure we set the ?timeout= query param to that timeout for URLs that support it
@elastic/cloud-k8s any thoughts?
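To make the ?timeout= idea concrete, here is a rough sketch of deriving the query param from a single client-wide timeout; the helper and the URL are hypothetical illustrations, not the existing client code:

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

// withESTimeout is a hypothetical helper that appends the Elasticsearch ?timeout=
// query param to URLs that support it, derived from the single client-wide timeout
// instead of a dedicated voting-config-exclusions value.
func withESTimeout(rawURL string, timeout time.Duration) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	q := u.Query()
	// Elasticsearch expects a value with a single unit (e.g. "180s"),
	// so express the Go duration in whole seconds.
	q.Set("timeout", fmt.Sprintf("%ds", int(timeout/time.Second)))
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	u, err := withESTimeout("https://es.example.com:9200/_cluster/voting_config_exclusions?node_names=node-1", 3*time.Minute)
	if err != nil {
		panic(err)
	}
	fmt.Println(u) // ...voting_config_exclusions?node_names=node-1&timeout=180s
}
```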
I am in favour of your proposal. I have to admit I did not question the need for this timeout in the first place.
Currently we don't use the timeout override in the API call at all (we just fall back to the Elasticsearch default), so I have opted to remove that parameter completely from the function for now. If we find a reason to use an explicit timeout at a later date, we can easily add it back in.
LGTM
cfg.Timeout = viper.GetDuration(operator.KubeClientTimeout)

// set the timeout for Elasticsearch requests
esclient.DefaultESClientTimeout = viper.GetDuration(operator.ElasticsearchClientTimeout)
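A rough end-to-end sketch of the wiring shown in this hunk, assuming a package-level Elasticsearch default in the spirit of esclient.DefaultESClientTimeout; the variable, host, and viper keys below are stand-ins rather than the actual operator code:

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/spf13/viper"
	"k8s.io/client-go/rest"
)

// defaultESClientTimeout stands in for esclient.DefaultESClientTimeout:
// a package-level default consulted whenever a new Elasticsearch HTTP client is built.
var defaultESClientTimeout = 3 * time.Minute

func newESHTTPClient() *http.Client {
	return &http.Client{Timeout: defaultESClientTimeout}
}

func main() {
	// Defaults stand in for the flag values parsed at startup.
	viper.SetDefault("kube-client-timeout", 60*time.Second)
	viper.SetDefault("elasticsearch-client-timeout", 3*time.Minute)

	// Kubernetes side: a zero value means no client-side timeout,
	// which is why --kube-client-timeout=0 restores the old behaviour.
	cfg := &rest.Config{Host: "https://kubernetes.default.svc"}
	cfg.Timeout = viper.GetDuration("kube-client-timeout")

	// Elasticsearch side: overwrite the package-level default once at startup.
	defaultESClientTimeout = viper.GetDuration("elasticsearch-client-timeout")

	fmt.Println(cfg.Timeout, newESHTTPClient().Timeout)
}
```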
For consistency, could the default observer timeout & interval also be made configurable through flags?
I feel like that'll just clutter the already long list of configuration options. Under normal circumstances, observer timeout and frequency are not things that users need to be concerned with. They only need to change those settings in exceptional circumstances, so leaving those to annotations feels right to me. WDYT?
I think I'm fine with that.
My gut feeling says the observer timeout should match the es client timeout though (whose value can be specified in the annotation). So basically the observer timeout setting could disappear, I think?
We can open a different issue to better refine those.
I tried to stick to what we already had in the code without changing the behaviour too much. The observer is sort of a health check if you squint hard enough 😄 So, it's probably fine to be more aggressive than normal. Of course, we can discuss it later in a different issue if you feel differently.
Overall LGTM, but I'm not sure about the logic that ignores the observer timeout override when it is larger than the client timeout used for reconciliations.
// Cap the observer request timeout at the Elasticsearch client timeout: the observer
// reuses the already constructed client, whose shorter timeout would apply anyway.
if settings.RequestTimeout > clientTimeout {
	log.Info("Ignoring observer request timeout annotation because it is larger than the client request timeout", "namespace", cluster.Namespace, "es_name", cluster.Name)
	settings.RequestTimeout = clientTimeout
}
I would not expect this. If I set a specific observer timeout I want it to be used whatever the reconciliation es client timeout is?
Observers get an already constructed Elasticsearch client passed to them -- which makes it impossible to change the timeout in the Observer. Of course, we can create two different constructor functions for the Elasticsearch client to get over that problem. But, I didn't think that was necessary because what is a legitimate reason for having an Observer request timeout that is longer than the standard Elasticsearch client timeout? They are fetching the health information from the cluster and if that doesn't return in a sensible amount of time, then the chances of other API calls succeeding in a shorter time are pretty low as well.
Ah - looking at the code I understand technically that's because we use a timeout in the http client, but also an additional context for the observer http calls?
Having timeouts done one way in the reconciliation (http client timeout) vs. another way in the observer (through a context) has a bit of a smell to me.
Also while I'm not opposed to the argument that you generally want the observer timeout to be shorter than the reconciliation es timeout, it feels a bit like we're imposing a constraint due to the way the code is organized internally.
How about we slightly refactor the observer code to set the timeout on the http client instead of using a context? If needed, I think we could add an additional SetTimeout(timeout time.Duration) function to the Elasticsearch Client interface so we can still pass the esClient around and override its timeout at the observer level. But it looks like we could also pass the right timeout at observer es client creation time?
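A hedged sketch of what that SetTimeout option could look like; the Client interface and esClient struct here are simplified stand-ins, not the operator's real Elasticsearch client:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Client is a simplified stand-in for the operator's Elasticsearch client interface.
type Client interface {
	SetTimeout(timeout time.Duration)
}

// esClient wraps an http.Client, so the timeout lives on the HTTP client
// rather than on a per-request context.
type esClient struct {
	httpClient *http.Client
}

func (c *esClient) SetTimeout(timeout time.Duration) {
	c.httpClient.Timeout = timeout
}

func main() {
	var c Client = &esClient{httpClient: &http.Client{Timeout: 3 * time.Minute}}

	// The observer could reuse the shared client type and only override the timeout;
	// in practice it would likely get its own instance so reconciliation calls keep theirs.
	c.SetTimeout(10 * time.Second)
	fmt.Println("observer timeout applied")
}
```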
To summarise an offline conversation I had with @sebgl: we both agreed that the special timeout for Observer requests doesn't make much sense and could just be removed.
LGTM
Makes timeouts configurable.

Elasticsearch timeouts are configurable for each cluster by annotating the associated Elasticsearch resource:
- eck.k8s.elastic.co/es-client-timeout: Elasticsearch client request timeout.
- eck.k8s.elastic.co/es-voting-config-exclusion-timeout: Timeout for voting config exclusion API calls.
- eck.k8s.elastic.co/es-observer-interval: Elasticsearch observation interval.
- eck.k8s.elastic.co/es-observer-request-timeout: Elasticsearch observer request timeout.

Kubernetes client timeout is configurable by setting the kube-client-timeout flag. I have chosen to make this a global timeout (it affects controller-runtime as well) because the WithTimeout and WithContext methods provided by the wrapped Client API are not used anywhere to set custom request timeouts. I think the whole Client abstraction can be removed from the codebase and replaced with the default controller-runtime Client, but I did not do that in this PR because it would make the diff too large.

Fixes #684
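For illustration, a small sketch of how an annotation such as eck.k8s.elastic.co/es-client-timeout could be parsed into a time.Duration with a fallback to the operator-wide default; the helper is hypothetical, not the PR's actual implementation:

```go
package main

import (
	"fmt"
	"time"
)

const esClientTimeoutAnnotation = "eck.k8s.elastic.co/es-client-timeout"

// timeoutFromAnnotations is a hypothetical helper: it returns the duration from the
// annotation when present and valid, and falls back to the operator-wide default otherwise.
func timeoutFromAnnotations(annotations map[string]string, defaultTimeout time.Duration) time.Duration {
	raw, ok := annotations[esClientTimeoutAnnotation]
	if !ok {
		return defaultTimeout
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return defaultTimeout
	}
	return d
}

func main() {
	annotations := map[string]string{esClientTimeoutAnnotation: "2m"}
	fmt.Println(timeoutFromAnnotations(annotations, 3*time.Minute)) // 2m0s from the annotation
	fmt.Println(timeoutFromAnnotations(nil, 3*time.Minute))         // 3m0s default
}
```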