Make timeouts configurable #3782
Conversation
Looks good to me. I think I am OK with the annotation you propose if we assume that these timeout tweaks are meant as an escape hatch for exceptional cases. Are you planning to add docs in a separate PR?
@@ -192,6 +193,11 @@ func Command() *cobra.Command {
		"",
		"Set the IP family to use. Possible values: IPv4, IPv6, \"\" (= auto-detect)",
	)
	cmd.Flags().Duration(
		operator.KubeClientTimeout,
		60*time.Second,
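For reference, a minimal sketch of how such a duration flag is typically registered with cobra and read back through viper; everything outside the diff above (the local constant, the usage string, the binding in main) is an assumption for illustration, not the operator's actual wiring:

```go
package main

import (
	"fmt"
	"time"

	"github.com/spf13/cobra"
	"github.com/spf13/viper"
)

// KubeClientTimeout is assumed to mirror operator.KubeClientTimeout ("kube-client-timeout").
const KubeClientTimeout = "kube-client-timeout"

func main() {
	cmd := &cobra.Command{
		Use: "operator",
		Run: func(_ *cobra.Command, _ []string) {
			// Read the effective value (flag, env var, or config file) through viper.
			fmt.Println("kube client timeout:", viper.GetDuration(KubeClientTimeout))
		},
	}

	// Register the flag with a 60s default, as in the diff above.
	cmd.Flags().Duration(
		KubeClientTimeout,
		60*time.Second,
		"Timeout for requests made by the Kubernetes API client",
	)

	// Let viper pick up the flag value so GetDuration sees it.
	_ = viper.BindPFlags(cmd.Flags())

	_ = cmd.Execute()
}
```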
Am I right that this default of 60 seconds changes the current behaviour for some usages (discovery has 32s apparently)? Don't think it is a big issue, though.
Going by the comment in DiscoveryClient, it looks like the 32s timeout is an arbitrary value used to distinguish that particular instance from among many different timeouts. It is only used when the REST client doesn't have its own timeout set (which we are setting now with this change). I don't think it will have a serious effect, except maybe taking twice as long to detect a problem with discovery.
Of course, there may be other unforeseen side effects from changing the global timeout. I am not too worried about that because we just need to set --kube-client-timeout=0 to get back to the old behaviour.
"github.com/elastic/cloud-on-k8s/pkg/controller/common/version" | ||
"github.com/elastic/cloud-on-k8s/pkg/utils/net" | ||
) | ||
|
||
const ( | ||
// DefaultVotingConfigExclusionsTimeout is the default timeout for setting voting exclusions. | ||
DefaultVotingConfigExclusionsTimeout = "30s" |
Somewhat unrelated to this PR: I could not remember what that timeout was for, so I double-checked the code.
It is used as a query param in the voting config exclusions request: https://www.elastic.co/guide/en/elasticsearch/reference/master/voting-config-exclusions.html#voting-config-exclusions-api-query-params. If the timeout is reached, the request fails and returns an error.
Other API calls, such as GET _cluster/settings (https://www.elastic.co/guide/en/elasticsearch/reference/master/cluster-get-settings.html), also have this timeout query param, but we don't use it (we should!).
For the sake of simplicity (one less thing to configure), I feel like we could get rid of that DefaultVotingConfigExclusionsTimeout and use the defaultClientTimeout instead (which moves the timeout from the default 30sec to 3min). Since we'd also have a 3min timeout on the http call, we may reach it before Elasticsearch reaches its own 3min timeout to set the voting config exclusions. That's not a big deal, I think: this request is idempotent and we do it at every reconciliation anyway.
To summarize:
- keep a single Elasticsearch http timeout for all api calls
- ensure we set the ?timeout= query param to that timeout for URLs that support it
@elastic/cloud-k8s any thoughts?
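To make the ?timeout= idea concrete, here is a rough sketch of deriving the query param from a single client-wide timeout; the helper and the URL are hypothetical illustrations, not the existing client code:

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

// withESTimeout is a hypothetical helper that appends the Elasticsearch ?timeout=
// query param to URLs that support it, derived from the single client-wide timeout
// instead of a dedicated voting-config-exclusions value.
func withESTimeout(rawURL string, timeout time.Duration) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	q := u.Query()
	// Elasticsearch expects a value with a single unit (e.g. "180s"),
	// so express the Go duration in whole seconds.
	q.Set("timeout", fmt.Sprintf("%ds", int(timeout/time.Second)))
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	u, err := withESTimeout("https://es.example.com:9200/_cluster/voting_config_exclusions?node_names=node-1", 3*time.Minute)
	if err != nil {
		panic(err)
	}
	fmt.Println(u) // ...voting_config_exclusions?node_names=node-1&timeout=180s
}
```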
I am in favour of your proposal. I have to admit I did not question the need for this timeout in the first place.
Currently we don't use the timeout override in the API call at all (we just fall back to the Elasticsearch default), so I have opted to remove that parameter completely from the function for now. If we find a reason to use an explicit timeout at a later date, we can easily add it back in.
LGTM
cfg.Timeout = viper.GetDuration(operator.KubeClientTimeout)

// set the timeout for Elasticsearch requests
esclient.DefaultESClientTimeout = viper.GetDuration(operator.ElasticsearchClientTimeout)
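A rough end-to-end sketch of the wiring shown in this hunk, assuming a package-level Elasticsearch default in the spirit of esclient.DefaultESClientTimeout; the variable, host, and viper keys below are stand-ins rather than the actual operator code:

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/spf13/viper"
	"k8s.io/client-go/rest"
)

// defaultESClientTimeout stands in for esclient.DefaultESClientTimeout:
// a package-level default consulted whenever a new Elasticsearch HTTP client is built.
var defaultESClientTimeout = 3 * time.Minute

func newESHTTPClient() *http.Client {
	return &http.Client{Timeout: defaultESClientTimeout}
}

func main() {
	// Defaults stand in for the flag values parsed at startup.
	viper.SetDefault("kube-client-timeout", 60*time.Second)
	viper.SetDefault("elasticsearch-client-timeout", 3*time.Minute)

	// Kubernetes side: a zero value means no client-side timeout,
	// which is why --kube-client-timeout=0 restores the old behaviour.
	cfg := &rest.Config{Host: "https://kubernetes.default.svc"}
	cfg.Timeout = viper.GetDuration("kube-client-timeout")

	// Elasticsearch side: overwrite the package-level default once at startup.
	defaultESClientTimeout = viper.GetDuration("elasticsearch-client-timeout")

	fmt.Println(cfg.Timeout, newESHTTPClient().Timeout)
}
```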
For consistency, could the default observer timeout & interval also be made configurable through flags?
I feel like that'll just clutter the already long list of configuration options. Under normal circumstances, observer timeout and frequency are not things that users need to be concerned with. They only need to change those settings in exceptional circumstances, so leaving those to annotations feels right to me. WDYT?
I think I'm fine with that.
My gut feeling says the observer timeout should match the es client timeout though (whose value can be specified in the annotation). So basically the observer timeout setting could disappear, I think?
We can open a different issue to better refine those.
I tried to stick to what we already had in the code without changing the behaviour too much. The observer is sort of a health check if you squint hard enough 😄 So, it's probably fine to be more aggressive than normal. Of course, we can discuss it later in a different issue if you feel differently.
Overall LGTM, but I'm not sure about the logic that ignores the observer timeout override when it is larger than the client timeout used for reconciliations.
// Cap the observer request timeout at the Elasticsearch client timeout: the observer
// reuses the already constructed client, whose shorter timeout would apply anyway.
if settings.RequestTimeout > clientTimeout {
	log.Info("Ignoring observer request timeout annotation because it is larger than the client request timeout", "namespace", cluster.Namespace, "es_name", cluster.Name)
	settings.RequestTimeout = clientTimeout
}
I would not expect this. If I set a specific observer timeout I want it to be used whatever the reconciliation es client timeout is?
Observers get an already constructed Elasticsearch client passed to them -- which makes it impossible to change the timeout in the Observer. Of course, we can create two different constructor functions for the Elasticsearch client to get over that problem. But, I didn't think that was necessary because what is a legitimate reason for having an Observer request timeout that is longer than the standard Elasticsearch client timeout? They are fetching the health information from the cluster and if that doesn't return in a sensible amount of time, then the chances of other API calls succeeding in a shorter time are pretty low as well.
Ah - looking at the code I understand technically that's because we use a timeout in the http client, but also an additional context for the observer http calls?
Having timeouts done one way in the reconciliation (http client timeout) vs. another way in the observer (through a context) has a bit of a smell to me.
Also while I'm not opposed to the argument that you generally want the observer timeout to be shorter than the reconciliation es timeout, it feels a bit like we're imposing a constraint due to the way the code is organized internally.
How about we slightly refactor the observer code to set the timeout on the http client instead of using a context? If needed, I think we could add an additional SetTimeout(timeout time.Duration) function to the Elasticsearch Client interface so we can still pass the esClient around and override its timeout at the observer level. But it looks like we could also pass the right timeout at observer es client creation time?
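A hedged sketch of what that SetTimeout option could look like; the Client interface and esClient struct here are simplified stand-ins, not the operator's real Elasticsearch client:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Client is a simplified stand-in for the operator's Elasticsearch client interface.
type Client interface {
	SetTimeout(timeout time.Duration)
}

// esClient wraps an http.Client, so the timeout lives on the HTTP client
// rather than on a per-request context.
type esClient struct {
	httpClient *http.Client
}

func (c *esClient) SetTimeout(timeout time.Duration) {
	c.httpClient.Timeout = timeout
}

func main() {
	var c Client = &esClient{httpClient: &http.Client{Timeout: 3 * time.Minute}}

	// The observer could reuse the shared client type and only override the timeout;
	// in practice it would likely get its own instance so reconciliation calls keep theirs.
	c.SetTimeout(10 * time.Second)
	fmt.Println("observer timeout applied")
}
```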
To summarise an offline conversation I had with @sebgl: we both agreed that the special timeout for Observer requests doesn't make much sense and could just be removed.
LGTM
Makes timeouts configurable.

Elasticsearch timeouts are configurable for each cluster by annotating the associated Elasticsearch resource:
- eck.k8s.elastic.co/es-client-timeout: Elasticsearch client request timeout.
- eck.k8s.elastic.co/es-voting-config-exclusion-timeout: Timeout for voting config exclusion API calls.
- eck.k8s.elastic.co/es-observer-interval: Elasticsearch observation interval.
- eck.k8s.elastic.co/es-observer-request-timeout: Elasticsearch observer request timeout.

Kubernetes client timeout is configurable by setting the kube-client-timeout flag. I have chosen to make this a global timeout (it affects controller-runtime as well) because the WithTimeout and WithContext methods provided by the wrapped Client API are not used anywhere to set custom request timeouts. I think the whole Client abstraction can be removed from the codebase and replaced with the default controller-runtime Client, but I did not do that in this PR because it would make the diff too large.

Fixes #684
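For illustration, a small sketch of how an annotation such as eck.k8s.elastic.co/es-client-timeout could be parsed into a time.Duration with a fallback to the operator-wide default; the helper is hypothetical, not the PR's actual implementation:

```go
package main

import (
	"fmt"
	"time"
)

const esClientTimeoutAnnotation = "eck.k8s.elastic.co/es-client-timeout"

// timeoutFromAnnotations is a hypothetical helper: it returns the duration from the
// annotation when present and valid, and falls back to the operator-wide default otherwise.
func timeoutFromAnnotations(annotations map[string]string, defaultTimeout time.Duration) time.Duration {
	raw, ok := annotations[esClientTimeoutAnnotation]
	if !ok {
		return defaultTimeout
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return defaultTimeout
	}
	return d
}

func main() {
	annotations := map[string]string{esClientTimeoutAnnotation: "2m"}
	fmt.Println(timeoutFromAnnotations(annotations, 3*time.Minute)) // 2m0s from the annotation
	fmt.Println(timeoutFromAnnotations(nil, 3*time.Minute))         // 3m0s default
}
```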