API failure prevents cluster upgrade #1847

Closed
barkbay opened this issue Oct 2, 2019 · 3 comments · Fixed by #1888 or #2022

barkbay (Contributor) commented Oct 2, 2019

This is a follow-up to #1827:

apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 7.3.0
  nodes:
  - name: default
    podTemplate:
      spec:
        nodeSelector:
          diskType: mybad  # matches no node: this Pod stays Pending forever
    nodeCount: 1
  - name: default-2
    nodeCount: 1

With the manifest above, ES is considered "reachable" (i.e. `esReachable` is true) because there is at least one node in the service.
The problem is that the subsequent calls to the ES API fail and never give the deployment a chance to be updated.
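
For illustration, a minimal sketch of this kind of endpoint-based reachability check (the function name and shape are mine, not the operator's actual code): with the manifest above, the single Ready Pod of default-2 is enough to make it return true, even though the cluster as a whole cannot answer any request.

package sketch

import corev1 "k8s.io/api/core/v1"

// isESReachable mirrors the coarse check described above: ES counts as
// "reachable" as soon as the Service has at least one ready endpoint
// address, regardless of whether that lone node can form a quorum.
func isESReachable(endpoints corev1.Endpoints) bool {
	for _, subset := range endpoints.Subsets {
		if len(subset.Addresses) > 0 {
			return true
		}
	}
	return false
}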

barkbay added >bug Something isn't working and v1.0.0-beta1 labels Oct 2, 2019
sebgl (Contributor) commented Oct 2, 2019

Copy-pasting my comment from #1827 (comment):

I'm starting to think the check could be a bit more relaxed and consider Pods from a given StatefulSet instead of all Pods.
For example, if all Pods of StatefulSet A are Pending or bootlooping, then force-upgrade them (bypassing any ES request or safety check), even though Pods from StatefulSet B may be running fine.
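
A rough sketch of what that relaxed, per-StatefulSet check could look like (all names are illustrative, not the actual implementation):

package sketch

import corev1 "k8s.io/api/core/v1"

// shouldForceUpgrade returns true when every Pod of a single StatefulSet
// is either Pending or bootlooping, no matter how Pods of the other
// StatefulSets are doing.
func shouldForceUpgrade(ssetPods []corev1.Pod) bool {
	if len(ssetPods) == 0 {
		return false
	}
	for _, pod := range ssetPods {
		if pod.Status.Phase == corev1.PodPending {
			continue // e.g. unschedulable because of the bad nodeSelector
		}
		if isBootlooping(pod) {
			continue
		}
		// At least one Pod of this StatefulSet is fine: no force upgrade.
		return false
	}
	return true
}

// isBootlooping is a naive heuristic: any container sitting in
// CrashLoopBackOff counts as bootlooping.
func isBootlooping(pod corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}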

barkbay (Contributor, Author) commented Oct 7, 2019

I still have the issue on master:

NAME                                                 READY   AGE     CONTAINERS      IMAGES
statefulset.apps/elasticsearch-sample-es-default     0/1     3m30s   elasticsearch   docker.elastic.co/elasticsearch/elasticsearch:7.3.0
statefulset.apps/elasticsearch-sample-es-default-2   1/1     3m29s   elasticsearch   docker.elastic.co/elasticsearch/elasticsearch:7.3.0


NAME                                      READY   STATUS    RESTARTS   AGE     IP           NODE                                                 NOMINATED NODE   READINESS GATES
pod/elasticsearch-sample-es-default-0     0/1     Pending   0          3m29s   <none>       <none>                                               <none>           <none>
pod/elasticsearch-sample-es-default-2-0   1/1     Running   0

2019-10-07T12:58:58.868+0200	ERROR	controller-runtime.controller	Reconciler error	{"ver": "0.10.0-SNAPSHOT-00000000", "controller": "elasticsearch-controller", "request": "default/elasticsearch-sample", "error": "unable to delete /_cluster/voting_config_exclusions: 503 Service Unavailable: ", "errorCauses": [{"error": "unable to delete /_cluster/voting_config_exclusions: 503 Service Unavailable: unknown", "errorVerbose": "503 Service Unavailable: unknown\nunable to delete /_cluster/voting_config_exclusions\ngh.neting.cc/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*clientV7).DeleteVotingConfigExclusions\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/v7.go:53\ngh.neting.cc/elastic/cloud-on-k8s/pkg/controller/elasticsearch/version/zen2.ClearVotingConfigExclusions\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/version/zen2/voting_exclusions.go:78\ngh.neting.cc/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).reconcileNodeSpecs\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/nodes.go:92\ngh.neting.cc/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).Reconcile\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/driver.go:234\ngh.neting.cc/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).internalReconcile\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:277\ngh.neting.cc/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:219\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/Users/michael/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/Users/michael/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/Users/michael/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/Users/michael/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/Users/michael/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/Users/michael/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"}]}
github.com/go-logr/zapr.(*zapLogger).Error
	/Users/michael/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/Users/michael/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/Users/michael/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/Users/michael/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/Users/michael/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/Users/michael/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/Users/michael/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88
2019-10-07T12:58:59.874+0200	INFO	elasticsearch-controller	Starting reconciliation run	{"ver": "0.10.0-SNAPSHOT-00000000", "iteration": 18, "namespace": "default", "name": "elasticsearch-sample"}
2019-10-07T12:58:59.875+0200	INFO	transport	Skipping pod because it has no IP yet	{"ver": "0.10.0-SNAPSHOT-00000000", "namespace": "default", "pod_name": "elasticsearch-sample-es-default-0"}
2019-10-07T12:59:00.403+0200	INFO	zen2	Ensuring no voting exclusions are set	{"ver": "0.10.0-SNAPSHOT-00000000", "namespace": "default", "es_name": "elasticsearch-sample"}

barkbay reopened this Oct 7, 2019
sebgl (Contributor) commented Oct 7, 2019

Indeed, you're right @barkbay.
Copy-paste from #1888 (comment):

The current code only fixes cases where:

  • es is not reachable (entire cluster broken)
  • es appears and is reachable (half the cluster may be broken but ES still responds to requests)

but not:

  • es appears but is not reachable (entire cluster broken even though some nodes are alive)
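
One way to tell the second and third cases apart (again a sketch under the assumption that a cheap HTTP request against the cluster is acceptable; not the operator's code) is to stop trusting endpoint counts and issue an actual request, treating a 5xx like the 503 in the logs above as "not reachable":

package sketch

import (
	"fmt"
	"net/http"
	"time"
)

// esResponds only counts ES as reachable if a real request succeeds.
// A node answering 503, as in the logs above, means "some node is alive
// but the cluster cannot serve": the third case in the list.
func esResponds(esURL string) (bool, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(esURL + "/")
	if err != nil {
		return false, err // first case: nothing answers at all
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 500 {
		return false, fmt.Errorf("ES answered %d", resp.StatusCode)
	}
	return resp.StatusCode < 300, nil
}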

sebgl added a commit to sebgl/cloud-on-k8s that referenced this issue Oct 18, 2019
There are cases where Elasticsearch is reachable
(some Pods are Ready), but cannot respond to any requests.
For example, if only 1 of 2 master nodes is available. See
elastic#1847. In such a case,
the bootlooping/pending 2nd master node will stay stuck forever, since we
will never reach the force upgrade part of the reconciliation.

This commit fixes it by running force upgrades (if required) right after
the upscale/spec change phase. This force upgrade phase becomes the new
"Step 2". The following steps (downscale and regular upgrade) require the
Elasticsearch cluster to be reachable.

Because this forced rolling upgrade deletes some Pods and sets some
expectations, I chose to requeue immediately if it was attempted. This
way we don't continue the reconciliation based on a transient state
that would require re-checking expectations. The next reconciliation
can be a "regular" one.

I think this also simplifies the general logic a bit: we first do
everything that does not require the ES API (steps 1 and 2), then move
on with downscales and standard rolling upgrades if ES is reachable
(steps 3 and 4), instead of passing an `esReachable` bool around.
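
The reordering the commit describes, condensed into a sketch (the step helpers are stubs standing in for the real phases; all names are illustrative, not the actual driver code):

package sketch

type results struct{ requeue bool }

func reconcileNodeSpecs() (results, error) {
	// Step 1: upscale and propagate spec changes; no ES API needed.
	if err := handleUpscaleAndSpecChanges(); err != nil {
		return results{}, err
	}
	// Step 2 (new): force-upgrade Pods stuck Pending/bootlooping,
	// still without touching the ES API.
	attempted, err := maybeForceUpgrade()
	if err != nil {
		return results{}, err
	}
	if attempted {
		// Pods were deleted and expectations were set: requeue so the
		// next reconciliation starts from a fresh, observed state.
		return results{requeue: true}, nil
	}
	// Steps 3 and 4: downscale and regular rolling upgrades both talk
	// to the ES API, so only proceed if ES actually responds.
	if !esReachable() {
		return results{requeue: true}, nil
	}
	if err := handleDownscale(); err != nil {
		return results{}, err
	}
	return results{}, handleRollingUpgrades()
}

// Stubs standing in for the real reconciliation phases:
func handleUpscaleAndSpecChanges() error { return nil }
func maybeForceUpgrade() (bool, error)   { return false, nil }
func esReachable() bool                  { return true }
func handleDownscale() error             { return nil }
func handleRollingUpgrades() error       { return nil }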
sebgl added a commit that referenced this issue Oct 24, 2019
* Perform forced rolling upgrade even if ES is reachable

* Modify e2e test to cover the es reachable case

* Improve comment