API failure prevents cluster upgrade #1847

Closed
barkbay opened this issue Oct 2, 2019 · 3 comments · Fixed by #1888 or #2022

barkbay (Contributor) commented Oct 2, 2019

This is a follow-up to #1827:

apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 7.3.0
  nodes:
  - name: default
    podTemplate:
      spec:
        nodeSelector:
          diskType: mybad  # matches no node: this Pod stays Pending forever
    nodeCount: 1
  - name: default-2
    nodeCount: 1

With the manifest above, ES is considered "reachable" (i.e. `esReachable` is true) because there is at least one node in the service.
The problem is that the subsequent calls to the ES API fail and never give the deployment a chance to be updated.
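
For illustration, a minimal sketch of this kind of endpoint-based reachability check (the function name and shape are mine, not the operator's actual code): with the manifest above, the single Ready Pod of default-2 is enough to make it return true, even though the cluster as a whole cannot answer any request.

package sketch

import corev1 "k8s.io/api/core/v1"

// isESReachable mirrors the coarse check described above: ES counts as
// "reachable" as soon as the Service has at least one ready endpoint
// address, regardless of whether that lone node can form a quorum.
func isESReachable(endpoints corev1.Endpoints) bool {
	for _, subset := range endpoints.Subsets {
		if len(subset.Addresses) > 0 {
			return true
		}
	}
	return false
}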

barkbay added >bug Something isn't working and v1.0.0-beta1 labels Oct 2, 2019
sebgl (Contributor) commented Oct 2, 2019

Copy-pasting my comment from #1827 (comment):

I'm starting to think the check could be a bit more relaxed and consider Pods from a given StatefulSet instead of all Pods.
For example, if all Pods of StatefulSet A are Pending or bootlooping, then force-upgrade them (bypassing any ES request or safety check), even though Pods from StatefulSet B may be running fine.
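
A rough sketch of what that relaxed, per-StatefulSet check could look like (all names are illustrative, not the actual implementation):

package sketch

import corev1 "k8s.io/api/core/v1"

// shouldForceUpgrade returns true when every Pod of a single StatefulSet
// is either Pending or bootlooping, no matter how Pods of the other
// StatefulSets are doing.
func shouldForceUpgrade(ssetPods []corev1.Pod) bool {
	if len(ssetPods) == 0 {
		return false
	}
	for _, pod := range ssetPods {
		if pod.Status.Phase == corev1.PodPending {
			continue // e.g. unschedulable because of the bad nodeSelector
		}
		if isBootlooping(pod) {
			continue
		}
		// At least one Pod of this StatefulSet is fine: no force upgrade.
		return false
	}
	return true
}

// isBootlooping is a naive heuristic: any container sitting in
// CrashLoopBackOff counts as bootlooping.
func isBootlooping(pod corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}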

barkbay (Contributor, Author) commented Oct 7, 2019

I still have the issue on master:

NAME                                                 READY   AGE     CONTAINERS      IMAGES
statefulset.apps/elasticsearch-sample-es-default     0/1     3m30s   elasticsearch   docker.elastic.co/elasticsearch/elasticsearch:7.3.0
statefulset.apps/elasticsearch-sample-es-default-2   1/1     3m29s   elasticsearch   docker.elastic.co/elasticsearch/elasticsearch:7.3.0


NAME                                      READY   STATUS    RESTARTS   AGE     IP           NODE                                                 NOMINATED NODE   READINESS GATES
pod/elasticsearch-sample-es-default-0     0/1     Pending   0          3m29s   <none>       <none>                                               <none>           <none>
pod/elasticsearch-sample-es-default-2-0   1/1     Running   0

2019-10-07T12:58:58.868+0200	ERROR	controller-runtime.controller	Reconciler error	{"ver": "0.10.0-SNAPSHOT-00000000", "controller": "elasticsearch-controller", "request": "default/elasticsearch-sample", "error": "unable to delete /_cluster/voting_config_exclusions: 503 Service Unavailable: ", "errorCauses": [{"error": "unable to delete /_cluster/voting_config_exclusions: 503 Service Unavailable: unknown", "errorVerbose": "503 Service Unavailable: unknown\nunable to delete /_cluster/voting_config_exclusions\ngh.neting.cc/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*clientV7).DeleteVotingConfigExclusions\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/v7.go:53\ngh.neting.cc/elastic/cloud-on-k8s/pkg/controller/elasticsearch/version/zen2.ClearVotingConfigExclusions\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/version/zen2/voting_exclusions.go:78\ngh.neting.cc/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).reconcileNodeSpecs\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/nodes.go:92\ngh.neting.cc/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).Reconcile\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/driver.go:234\ngh.neting.cc/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).internalReconcile\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:277\ngh.neting.cc/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:219\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/Users/michael/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/Users/michael/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/Users/michael/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/Users/michael/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/Users/michael/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/Users/michael/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"}]}
github.com/go-logr/zapr.(*zapLogger).Error
	/Users/michael/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/Users/michael/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/Users/michael/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/Users/michael/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/Users/michael/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/Users/michael/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/Users/michael/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88
2019-10-07T12:58:59.874+0200	INFO	elasticsearch-controller	Starting reconciliation run	{"ver": "0.10.0-SNAPSHOT-00000000", "iteration": 18, "namespace": "default", "name": "elasticsearch-sample"}
2019-10-07T12:58:59.875+0200	INFO	transport	Skipping pod because it has no IP yet	{"ver": "0.10.0-SNAPSHOT-00000000", "namespace": "default", "pod_name": "elasticsearch-sample-es-default-0"}
2019-10-07T12:59:00.403+0200	INFO	zen2	Ensuring no voting exclusions are set	{"ver": "0.10.0-SNAPSHOT-00000000", "namespace": "default", "es_name": "elasticsearch-sample"}

barkbay reopened this Oct 7, 2019
sebgl (Contributor) commented Oct 7, 2019

Indeed, you're right @barkbay.
Copy-paste from #1888 (comment):

The current code only fixes cases where:

  • es is not reachable (entire cluster broken)
  • es appears and is reachable (half the cluster may be broken but ES still responds to requests)

but not:

  • es appears but is not reachable (entire cluster broken even though some nodes are alive)
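
One way to tell the second and third cases apart (again a sketch under the assumption that a cheap HTTP request against the cluster is acceptable; not the operator's code) is to stop trusting endpoint counts and issue an actual request, treating a 5xx like the 503 in the logs above as "not reachable":

package sketch

import (
	"fmt"
	"net/http"
	"time"
)

// esResponds only counts ES as reachable if a real request succeeds.
// A node answering 503, as in the logs above, means "some node is alive
// but the cluster cannot serve": the third case in the list.
func esResponds(esURL string) (bool, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(esURL + "/")
	if err != nil {
		return false, err // first case: nothing answers at all
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 500 {
		return false, fmt.Errorf("ES answered %d", resp.StatusCode)
	}
	return resp.StatusCode < 300, nil
}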

sebgl added a commit to sebgl/cloud-on-k8s that referenced this issue Oct 18, 2019
There are cases where Elasticsearch is reachable
(some Pods are Ready), but cannot respond to any requests.
For example, if only 1 of 2 master nodes is available. See
elastic#1847. In such a case,
the bootlooping/pending 2nd master node will stay stuck forever, since we
will never reach the force upgrade part of the reconciliation.

This commit fixes it by running force upgrades (if required) right after
the upscale/spec change phase. This force upgrade phase becomes the new
"Step 2". The following steps (downscale and regular upgrade) require the
Elasticsearch cluster to be reachable.

Because this forced rolling upgrade deletes some Pods and sets some
expectations, I chose to requeue immediately if it was attempted. This
way we don't continue the reconciliation based on a transient state
that would require re-checking expectations. The next reconciliation
can be a "regular" one.

I think this also simplifies the general logic a bit: we first do
everything that does not require the ES API (steps 1 and 2), then move
on with downscales and standard rolling upgrades if ES is reachable
(steps 3 and 4), instead of passing an `esReachable` bool around.
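
The reordering the commit describes, condensed into a sketch (the step helpers are stubs standing in for the real phases; all names are illustrative, not the actual driver code):

package sketch

type results struct{ requeue bool }

func reconcileNodeSpecs() (results, error) {
	// Step 1: upscale and propagate spec changes; no ES API needed.
	if err := handleUpscaleAndSpecChanges(); err != nil {
		return results{}, err
	}
	// Step 2 (new): force-upgrade Pods stuck Pending/bootlooping,
	// still without touching the ES API.
	attempted, err := maybeForceUpgrade()
	if err != nil {
		return results{}, err
	}
	if attempted {
		// Pods were deleted and expectations were set: requeue so the
		// next reconciliation starts from a fresh, observed state.
		return results{requeue: true}, nil
	}
	// Steps 3 and 4: downscale and regular rolling upgrades both talk
	// to the ES API, so only proceed if ES actually responds.
	if !esReachable() {
		return results{requeue: true}, nil
	}
	if err := handleDownscale(); err != nil {
		return results{}, err
	}
	return results{}, handleRollingUpgrades()
}

// Stubs standing in for the real reconciliation phases:
func handleUpscaleAndSpecChanges() error { return nil }
func maybeForceUpgrade() (bool, error)   { return false, nil }
func esReachable() bool                  { return true }
func handleDownscale() error             { return nil }
func handleRollingUpgrades() error       { return nil }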
sebgl added a commit that referenced this issue Oct 24, 2019
* Perform forced rolling upgrade even if ES is reachable

* Modify e2e test to cover the es reachable case

* Improve comment