
Shorten the reconciliation loop duration if Elasticsearch is down #4938

Merged

Conversation

@thbkrkr thbkrkr commented Oct 11, 2021

This commit shortens the reconciliation loop to 30sec instead of 1min30sec when Elasticsearch is down, and immediately updates the Elasticsearch status instead of waiting for another reconciliation loop.

The first call to the ES API is now the license request (extracted from the license reconciliation), and its result is used to update the esReachable bool. The steps that require ES to be up (remote cluster settings, updating the annotation with the cluster UUID) are then skipped based on esReachable. The Elasticsearch status is updated with the last observed state as soon as a call to ES returns an error.
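A minimal Go sketch of this flow (the esClient and reconcile names below are illustrative only, not the operator's actual types):

```go
package main

import (
	"errors"
	"fmt"
)

// esClient stands in for the Elasticsearch API client used during
// reconciliation. Hypothetical type, for illustration only.
type esClient struct{ up bool }

// GetLicense is the first ES call of the loop; its error doubles as a
// reachability probe.
func (c *esClient) GetLicense() error {
	if !c.up {
		return errors.New("elasticsearch unreachable")
	}
	return nil
}

// reconcile skips the ES-dependent steps when the license call fails,
// and updates the status immediately instead of waiting for the next loop.
func reconcile(c *esClient) []string {
	var steps []string
	esReachable := c.GetLicense() == nil
	if esReachable {
		steps = append(steps,
			"reconcile remote cluster settings",
			"update cluster UUID annotation")
	} else {
		steps = append(steps, "update status with last observed state")
	}
	return steps
}

func main() {
	fmt.Println(reconcile(&esClient{up: false}))
	fmt.Println(reconcile(&esClient{up: true}))
}
```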

observedState is changed so that the last observer state can be retrieved dynamically, just before updating the status. Otherwise we have to wait for another reconciliation: at the beginning of an ES reconciliation that fails because ES is down, the last observer state is very often still green, because it was retrieved between 0 and 10 seconds ago.

Example timeline:

  • delete pod (that will take ES down)
  • es is stopping
  • es is stopped
  • pod is deleted
  • kick off an ES reconciliation
  • lastState is green (this info is between 0 and 10 seconds old)
  • reconcile the cluster license, but ES is down
  • update state with the last state which was green!
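The stale-state problem in the timeline above comes from capturing the observer state once at the start of the loop. A small Go sketch of the fix, with illustrative names (not the operator's actual observer API): keep a function instead of a snapshot, and evaluate it just before writing the status.

```go
package main

import "fmt"

// observer holds the most recently polled cluster health.
// Hypothetical type, for illustration only.
type observer struct{ lastHealth string }

// ObservedState returns the latest polled value.
func (o *observer) ObservedState() string { return o.lastHealth }

func main() {
	obs := &observer{lastHealth: "green"}

	// Keep a method value (a func), not a copy of the state.
	observedState := obs.ObservedState

	// ...reconciliation fails because ES is down; meanwhile the
	// observer polls again and records the real health...
	obs.lastHealth = "unknown"

	// The status update now sees the fresh value, not the stale green.
	fmt.Println(observedState())
}
```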

For testing, deploy an ES with a strict pod anti-affinity rule:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: bimbadaboum
spec:
  version: 7.15.0
  nodeSets:
  - name: default
    count: 3
    podTemplate:
      spec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  elasticsearch.k8s.elastic.co/cluster-name: bimbadaboum
              topologyKey: kubernetes.io/hostname
# list pods with k8s node names
> k get pods -o json | jq '.items[] | {n:.metadata.name,node:.spec.nodeName}' -c

# cordon 2 k8s nodes and delete the associated pods
> k cordon gke-abc-dev-c0-default-pool-1
> k delete po bimbadaboum-es-default-1
> k cordon gke-abc-dev-c0-default-pool-2
> k delete po bimbadaboum-es-default-0
# monitor the ES health state update
> while true; do echo -n "$(date) | status= "; kubectl get es bimbadaboum -o json | jq '.status' -c; sleep 5; done
lun. 11 oct. 2021 13:11:48 CEST | status= {"availableNodes":2,"health":"green","phase":"Ready","version":"7.15.0"}
lun. 11 oct. 2021 13:11:53 CEST | status= {"availableNodes":2,"health":"green","phase":"Ready","version":"7.15.0"}
lun. 11 oct. 2021 13:11:59 CEST | status= {"availableNodes":1,"health":"green","phase":"Ready","version":"7.15.0"} <<<< delete pod => -1 node
lun. 11 oct. 2021 13:12:04 CEST | status= {"availableNodes":1,"health":"green","phase":"Ready","version":"7.15.0"}
...
lun. 11 oct. 2021 13:13:09 CEST | status= {"availableNodes":1,"health":"green","phase":"Ready","version":"7.15.0"}
lun. 11 oct. 2021 13:13:15 CEST | status= {"availableNodes":1,"health":"green","phase":"Ready","version":"7.15.0"}
lun. 11 oct. 2021 13:13:20 CEST | status= {"availableNodes":1,"health":"unknown","phase":"Ready","version":"7.15.0"} <<<< health updated
lun. 11 oct. 2021 13:13:26 CEST | status= {"availableNodes":1,"health":"unknown","phase":"Ready","version":"7.15.0"}
lun. 11 oct. 2021 13:13:31 CEST | status= {"availableNodes":1,"health":"unknown","phase":"Ready","version":"7.15.0"}

Before this change, it took ~3min50sec (including 50sec for the ES shutdown + 2 * (3 * 30)sec of reconciliation loops).
With this change, it takes ~1min20sec (including 50sec for the ES shutdown + 1 * (1 * 30)sec).

Resolves #3496.
Resolves #2939.

@thbkrkr thbkrkr added the >enhancement Enhancement of existing functionality label Oct 11, 2021
pebrc
pebrc previously approved these changes Oct 12, 2021
pkg/controller/elasticsearch/driver/driver.go
@pebrc pebrc dismissed their stale review October 12, 2021 08:45

I think I missed a case where this is a problem, e.g. if 2 out of 3 nodes are boot looping

pebrc
pebrc previously approved these changes Oct 15, 2021
@pebrc pebrc left a comment

LGTM

@thbkrkr thbkrkr force-pushed the short-circuit-reconciliation-if-es-is-down branch from f4d32cb to 397313a on October 19, 2021 09:45
@thbkrkr thbkrkr requested a review from pebrc October 19, 2021 09:55
@thbkrkr thbkrkr force-pushed the short-circuit-reconciliation-if-es-is-down branch from 397313a to f47c3a2 on October 19, 2021 10:06

thbkrkr commented Oct 19, 2021

I think the last version addresses all the feedback:

  • the reconciliation loop is not stopped if a call to ES fails; we just skip the steps that require ES to be up, depending on the esReachable bool
  • the first interaction with ES is now the license request (extracted from the license reconciliation), and it is used to update esReachable

@thbkrkr thbkrkr dismissed pebrc’s stale review October 19, 2021 10:22

too much change since approval

@pebrc pebrc left a comment

LGTM

pkg/controller/elasticsearch/license/reconcile.go
Co-authored-by: Peter Brachwitz <peter.brachwitz@gmail.com>
@thbkrkr thbkrkr added the v1.9.0 label Oct 19, 2021
@thbkrkr thbkrkr merged commit 8b3c345 into elastic:master Oct 19, 2021
@thbkrkr thbkrkr changed the title Short-circuit reconciliation if Elasticsearch is down Shorten the reconciliation loop duration if Elasticsearch is down Oct 19, 2021
@thbkrkr thbkrkr deleted the short-circuit-reconciliation-if-es-is-down branch November 3, 2021 15:15
fantapsody pushed a commit to fantapsody/cloud-on-k8s that referenced this pull request Feb 7, 2023
…astic#4938)

Successfully merging this pull request may close these issues:

  • Short-circuit reconciliation if ES is down
  • Green health is reported while it should be unknown