
Perform forced rolling upgrade even if ES is reachable #2022

Merged
4 commits merged on Oct 24, 2019

Conversation

@sebgl (Contributor) commented Oct 18, 2019

There are cases where Elasticsearch is reachable
(some Pods are Ready) but cannot respond to any request,
for example when only 1 out of 2 master nodes is available (see
#1847). In such a case, the bootlooping/pending second master node
stays stuck forever, since we never reach the force upgrade part of
the reconciliation.

This commit fixes that by running force upgrades (if required) right after
the upscale/spec change phase. This force upgrade phase becomes the new
"Step 2". The following steps (downscale and regular upgrade) require the
Elasticsearch cluster to be reachable.
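
A minimal sketch of the reordered reconciliation, in Go. All names here (`Driver`, `reconcileNodes`, `handleUpscaleAndSpecChanges`, `maybeForceUpgrade`, `reconcileESAPISteps`) are illustrative placeholders, not the operator's actual functions:

```go
package driver

// Driver and its methods are stand-ins for the operator's Elasticsearch
// driver; the functions touched by the PR have different names and signatures.
type Driver struct{}

func (d *Driver) handleUpscaleAndSpecChanges() error         { return nil }
func (d *Driver) maybeForceUpgrade() error                   { return nil }
func (d *Driver) reconcileESAPISteps(esReachable bool) error { return nil }

// reconcileNodes sketches the new ordering: the two phases that do not need
// the Elasticsearch API run first, so a force upgrade can happen even when
// the cluster cannot answer any request.
func (d *Driver) reconcileNodes(esReachable bool) error {
	// Step 1: upscale and propagate spec changes (no ES API calls).
	if err := d.handleUpscaleAndSpecChanges(); err != nil {
		return err
	}
	// Step 2 (new position): force rolling upgrade of stuck Pods, still
	// without touching the ES API.
	if err := d.maybeForceUpgrade(); err != nil {
		return err
	}
	// Steps 3 and 4 (downscale, regular rolling upgrade) need the ES API.
	return d.reconcileESAPISteps(esReachable)
}
```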

Since this force rolling upgrade deletes some Pods and sets some
expectations, I chose to requeue immediately whenever it is attempted. This
way we don't continue the reconciliation based on a transient state
that would require us to re-check expectations. The next reconciliation
can be a "regular" one.
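
A sketch of that requeue behaviour, again with hypothetical names (`Results`, `forceUpgrade`); the real driver uses its own results and expectations types:

```go
package driver

// Results is a stand-in for the operator's reconciliation results aggregator.
type Results struct {
	requeue bool
	err     error
}

// requeueAfterForceUpgrade runs the (hypothetical) forceUpgrade function and
// asks for an immediate requeue when Pods were actually deleted, instead of
// continuing the reconciliation on a transient state whose expectations would
// have to be re-checked.
func requeueAfterForceUpgrade(forceUpgrade func() (attempted bool, err error)) Results {
	attempted, err := forceUpgrade()
	if err != nil {
		return Results{err: err}
	}
	if attempted {
		// Pods were deleted and deletion expectations were set: stop here and
		// let the next, "regular" reconciliation observe a settled state.
		return Results{requeue: true}
	}
	// Nothing was force-upgraded: the caller can continue with the next steps.
	return Results{}
}
```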

I think this also simplifies the general logic a bit: we first do
everything that does not require the ES API (steps 1 and 2), then move
on to downscales and standard rolling upgrades if ES is reachable
(steps 3 and 4), instead of passing an `esReachable` bool around.
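
The gate itself could look roughly like this; `reconcileESAPISteps` is the same hypothetical placeholder as in the sketch above, standing in for the ES-API-dependent part of the reconciliation:

```go
package driver

// reconcileESAPISteps groups the steps that need a responsive Elasticsearch
// cluster behind a single reachability check, rather than threading an
// esReachable flag through each of them. downscale and rollingUpgrade are
// hypothetical stand-ins for steps 3 and 4.
func reconcileESAPISteps(esReachable bool, downscale, rollingUpgrade func() error) error {
	if !esReachable {
		// Nothing below can work without the ES API; try again on a later
		// reconciliation.
		return nil
	}
	// Step 3: downscale.
	if err := downscale(); err != nil {
		return err
	}
	// Step 4: regular rolling upgrade.
	return rollingUpgrade()
}
```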

The PR also modifies the existing force upgrade E2E test so that it covers
the case where Elasticsearch can be reached, but responds with 503.
The new version of the test does not pass with the master branch, but
does pass with this branch.
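
For illustration only, the condition the updated test exercises can be probed with a plain HTTP check against the cluster health endpoint; this is a rough sketch, not the E2E framework code changed in the PR:

```go
package main

import (
	"fmt"
	"net/http"
)

// isReachableButUnavailable distinguishes the two failure modes: a connection
// error means Elasticsearch is not reachable at all, while an HTTP 503 means
// it is reachable but unable to serve requests (for example when only 1 of 2
// master nodes is available). The URL and the absence of authentication are
// simplifications compared to the real E2E environment.
func isReachableButUnavailable(esURL string) (bool, error) {
	resp, err := http.Get(esURL + "/_cluster/health")
	if err != nil {
		return false, err // not reachable at all
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusServiceUnavailable, nil
}

func main() {
	ok, err := isReachableButUnavailable("http://localhost:9200")
	fmt.Println(ok, err)
}
```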

Fixes #1847.

sebgl added 2 commits October 18, 2019 10:05
@sebgl sebgl added >enhancement Enhancement of existing functionality v1.0.0 labels Oct 18, 2019
@pebrc pebrc self-assigned this Oct 24, 2019
@pebrc (Collaborator) left a comment

LGTM!

pkg/controller/elasticsearch/driver/nodes.go (review comment, resolved)
Labels
>enhancement (Enhancement of existing functionality), v1.0.0

Successfully merging this pull request may close these issues: API failure prevents cluster upgrade (#1847).