Green health is reported while it should be unknown #2939

Closed
barkbay opened this issue Apr 23, 2020 · 2 comments · Fixed by #4938
Labels
>bug Something isn't working

Comments


barkbay commented Apr 23, 2020

I got into a situation where the reported status was always green:

NAME                                              HEALTH   NODES   VERSION   PHASE             AGE
elasticsearch.elasticsearch.k8s.elastic.co/es-0   green    5       7.6.2     ApplyingChanges   7m42s

Meanwhile my cluster was broken, with only 1 master available out of 3, which I guess means the status should have been unknown:

# curl -k -v -u elastic-internal-probe:redacted https://127.0.0.1:9200/_cluster/health
< HTTP/1.1 503 Service Unavailable
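
For reference, the same comparison can be made from outside the pod. This is only a sketch, not part of the original report: it assumes the default ECK naming conventions for a cluster named es-0 (the es-0-es-http service and the es-0-es-elastic-user secret) and uses the regular elastic user instead of the internal probe user.

# What the operator reports in the resource status:
kubectl get elasticsearch es-0

# What the cluster itself answers (adjust the names for your resource):
PASSWORD=$(kubectl get secret es-0-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
kubectl port-forward service/es-0-es-http 9200 &
curl -k -u "elastic:$PASSWORD" "https://localhost:9200/_cluster/health?pretty"

With no elected master the last call returns HTTP 503, while the resource status above still shows green.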
barkbay added the >bug label on Apr 23, 2020
anyasabo self-assigned this on May 15, 2020

anyasabo commented Jun 9, 2020

I can reproduce this. In a 3-node GKE cluster with an ES cluster of 3 masters, I set hard anti-affinity rules on the masters and let it build. I then tainted two of the nodes and killed two of the pods. The ES resource stayed green/Ready afterwards, even though the operator was still reconciling it.
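
Roughly, the reproduction boils down to something like this (node, pod, and node set names below are placeholders, not the exact ones from my cluster):

# Taint two of the three Kubernetes nodes so the evicted masters cannot be rescheduled
kubectl taint nodes gke-node-2 repro=true:NoSchedule
kubectl taint nodes gke-node-3 repro=true:NoSchedule

# Kill two of the three master pods; hard anti-affinity plus the taints keep them pending
kubectl delete pod elasticsearch-aff-es-default-1 elasticsearch-aff-es-default-2

# Watch the Elasticsearch resource: health stays green although the single remaining master has no quorum
kubectl get elasticsearch elasticsearch-aff -w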

{"log.level":"info","@timestamp":"2020-06-09T19:36:16.712Z","log.logger":"generic-reconciler","message":"Recoverable error during step, continuing","service.version":"1.2.0-2bc08ef2","service.type":"eck","ecs.version":"1.4.0","step":"reconcile-cluster-license","error":"failed to revert to basic: 503 Service Unavailable: unknown","errorVerbose":"503 Service Unavailable

{"log.level":"error","@timestamp":"2020-06-09T19:35:16.311Z","log.logger":"driver","message":"Could not update remote clusters in Elasticsearch settings","service.version":"1.2.0-2bc08ef2","service.type":"eck","ecs.version":"1.4.0","namespace":"default","es_name":"elasticsearch-aff","error":"503 Service Unavailable: ","error.stack_trace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128\ngh.neting.cc/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/driver.go:228

{"log.level":"info","@timestamp":"2020-06-09T19:34:46.294Z","log.logger":"generic-reconciler","message":"Recoverable error during step, continuing","service.version":"1.2.0-2bc08ef2","service.type":"eck","ecs.version":"1.4.0","step":"reconcile-cluster-license","error":"failed to revert to basic: 503 Service Unavailable: unknown",

Taking a closer look to see how we should approach it.


anyasabo commented Nov 3, 2020

Unassigning myself. I think this is worth double-checking once the refactoring for #3496 is complete, as that will give much more reward for the effort.
