Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make timeouts configurable #3782

Merged

Conversation

charith-elastic
Copy link
Contributor

@charith-elastic charith-elastic commented Sep 24, 2020

Makes timeouts configurable.

Elasticsearch timeouts are configurable for each cluster by annotating the associated Elasticsearch resource.

  • eck.k8s.elastic.co/es-client-timeout: Elasticsearch client request timeout.
  • eck.k8s.elastic.co/es-voting-config-exclusion-timeout: Timeout for voting config exclusion API calls.
  • eck.k8s.elastic.co/es-observer-interval: Elasticsearch observation interval.
  • eck.k8s.elastic.co/es-observer-request-timeout: Elasticsearch observer request timeout.

Kubernetes client timeout is configurable by setting the kube-client-timeout flag. I have chosen to make this a global timeout (affects controller-runtime as well) because the WithTimeout and WithContext methods provided by the wrapped Client API are not used anywhere to set custom request timeouts. I think that the whole Client abstraction can be removed from the codebase and replaced with the default controller-runtime Client, but, I did not do that in this PR because it would make the diff too large.

Fixes #684

@charith-elastic charith-elastic added >feature Adds or discusses adding a feature to the product v1.3.0 labels Sep 24, 2020
Copy link
Collaborator

@pebrc pebrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I think I am OK with the annotation you propose if we assume that these timeout tweaks are meant as an escape hatch for exceptional cases. Are you planning to add docs in a separate PR?

cmd/manager/main.go Outdated Show resolved Hide resolved
@@ -192,6 +193,11 @@ func Command() *cobra.Command {
"",
"Set the IP family to use. Possible values: IPv4, IPv6, \"\" (= auto-detect) ",
)
cmd.Flags().Duration(
operator.KubeClientTimeout,
60*time.Second,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right that this default of 60 seconds changes the current behaviour for some usages (discovery has 32s apparently)? Don't think it is a big issue, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going by the comment in DiscoveryClient it looks like the 32s timeout is an arbitrary value used to distinguish that particular instance from among many different timeouts. It is only used when the REST client doesn't have its own timeout set (which we are setting now with this change). I don't think it will have a serious effect, except maybe taking twice as long to detect a problem with discovery.

Of course, there may be other unforeseen side effects from changing the global timeout. I am not too worried about that because we just need to set --kube-client-timeout=0 to get back to the old behaviour.

pkg/controller/common/annotation/timeout.go Outdated Show resolved Hide resolved
pkg/controller/elasticsearch/client/client.go Outdated Show resolved Hide resolved
"github.com/elastic/cloud-on-k8s/pkg/controller/common/version"
"github.com/elastic/cloud-on-k8s/pkg/utils/net"
)

const (
// DefaultVotingConfigExclusionsTimeout is the default timeout for setting voting exclusions.
DefaultVotingConfigExclusionsTimeout = "30s"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat unrelated to that PR: I could not remember what that timeout was for so I double-checked the code.
It is used as a query param in the voting config exclusions request: https://www.elastic.co/guide/en/elasticsearch/reference/master/voting-config-exclusions.html#voting-config-exclusions-api-query-params.
If the timeout is reached, the request fails and returns an error.

Other API calls, such as GET _cluster/settings (https://www.elastic.co/guide/en/elasticsearch/reference/master/cluster-get-settings.html) also have this timeout query param, but we don't use it (we should!).

For the sake of simplicity (one less thing to configure), I feel like we could get rid of that DefaultVotingConfigExclusionsTimeout and use the defaultClientTimeout instead (which moves the timeout to 3min instead of the default 30sec). Since we'd also have a 3min timeout to the http call, we may reach it before Elasticsearch reaches its own 3min timeout to set the voting config exclusions. Which is not a big deal, I think: this request is idempotent and we do it at every reconciliation anyway.

To summarize:

  • keep a single Elasticsearch http timeout for all api calls
  • ensure we set the ?timeout= query param to that timeout for URLs that support it

@elastic/cloud-k8s any thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am in favour of your proposal. I have to admit I did not question the need for this timeout in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we don't use the timeout override in the API call at all (we just default to the Elasticsearch default) so I have opted to remove that parameter completely from the function for now. If we find a reason to use an explicit timeout at a later date, we can easily add that back in.

cmd/manager/main.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@pebrc pebrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

pkg/controller/elasticsearch/client/client_test.go Outdated Show resolved Hide resolved
deploy/eck-operator/values.yaml Show resolved Hide resolved
cfg.Timeout = viper.GetDuration(operator.KubeClientTimeout)

// set the timeout for Elasticsearch requests
esclient.DefaultESClientTimeout = viper.GetDuration(operator.ElasticsearchClientTimeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency the default observer timeout & interval could also be configurable through flags?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like that'll just clutter the already long list of configuration options. Under normal circumstances, observer timeout and frequency are not things that users need to be concerned with. They only need to change those settings in exceptional circumstances, so leaving those to annotations feels right to me. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm fine with that.
My gut feeling says the observer timeout should match the es client timeout though (whose value can be specified in the annotation). So basically the observer timeout setting could disappear, I think?
We can open a different issue to better refine those.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to stick to what we already had in the code without changing the behaviour too much. The observer is sort of a health check if you squint hard enough 😄 So, it's probably fine to be more aggressive than normal. Of course, we can discuss it later in a different issue if you feel differently.

Copy link
Contributor

@sebgl sebgl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, but I'm not sure about the logic for not overriding the observer timeout if larger than the client timeout used for reconciliations.

pkg/controller/elasticsearch/observer/manager.go Outdated Show resolved Hide resolved
pkg/controller/common/annotation/timeout.go Outdated Show resolved Hide resolved
if settings.RequestTimeout > clientTimeout {
log.Info("Ignoring observer request timeout annotation because it is larger than the client request timeout", "namespace", cluster.Namespace, "es_name", cluster.Name)
settings.RequestTimeout = clientTimeout
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not expect this. If I set a specific observer timeout I want it to be used whatever the reconciliation es client timeout is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observers get an already constructed Elasticsearch client passed to them -- which makes it impossible to change the timeout in the Observer. Of course, we can create two different constructor functions for the Elasticsearch client to get over that problem. But, I didn't think that was necessary because what is a legitimate reason for having an Observer request timeout that is longer than the standard Elasticsearch client timeout? They are fetching the health information from the cluster and if that doesn't return in a sensible amount of time, then the chances of other API calls succeeding in a shorter time are pretty low as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - looking at the code I understand technically that's because we use a timeout in the http client, but also an additional context for the observer http calls?
Having timeouts done one way in the reconciliation (http client timeout) vs. another way in the observer (through a context) has a bit of a smell to me.
Also while I'm not opposed to the argument that you generally want the observer timeout to be shorter than the reconciliation es timeout, it feels a bit like we're imposing a constraint due to the way the code is organized internally.
How about we refactor slightly the observer code to set the timeout on the http client instead of using a context? If needed I think we could add an additional SetTimeout(timeout time.Duration) function to the Elasticsearch Client interface so we can still pass the esClient around, and override its timeout at the observer level? But it looks like we could also pass the right timeout at the observer es client creation time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To summarise an offline conversation I had with @sebgl: we both agreed that the special timeout for Observer requests doesn't make much sense and could just be removed.

Copy link
Contributor

@sebgl sebgl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@charith-elastic charith-elastic merged commit 065cafe into elastic:master Oct 6, 2020
@charith-elastic charith-elastic deleted the feature/timeout-config branch October 6, 2020 15:27
@david-kow david-kow changed the title Configurable timeouts Make timeouts configurable Oct 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature Adds or discusses adding a feature to the product v1.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make timeouts configurable
4 participants