API failure prevents cluster upgrade #1847
Copy-pasting my comment from #1827 (comment):
I still have the issue on master:
Indeed you're right @barkbay. The current code only fixes cases where:
but not:
* Perform forced rolling upgrade even if ES is reachable

  There are cases where Elasticsearch is reachable (some Pods are Ready) but cannot respond to any request, for example when only 1 of 2 master nodes is available. See #1847. In that case, the bootlooping/pending second master node stays stuck forever, since we never reach the force-upgrade part of the reconciliation.

  This commit fixes it by running force upgrades (if required) right after the upscale/spec-change phase; this force-upgrade phase becomes the new "Step 2". The following steps (downscale and regular upgrade) require the Elasticsearch cluster to be reachable. Because the forced rolling upgrade deletes some Pods and sets some expectations, I chose to requeue immediately whenever it is attempted, so we don't continue the reconciliation based on a transient state that would require re-checking expectations. The next reconciliation can then be a "regular" one.

  I think this also simplifies the general logic a bit: we first do everything that does not require the ES API (steps 1 and 2), then move on with downscales and standard rolling upgrades if ES is reachable (steps 3 and 4), instead of passing an `esReachable` bool around. A sketch of this reordering follows below.

* Modify e2e test to cover the ES-reachable case

* Improve comment
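To make the reordering concrete, here is a minimal Go sketch of the flow the commit message describes. All type and function names (`driver`, `handleUpscaleAndSpecChanges`, `handleForcedUpgrades`, and so on) are hypothetical stand-ins, not the actual ECK driver code.

```go
package main

import "fmt"

// Result mimics a controller-runtime reconcile result (illustrative only).
type Result struct {
	Requeue bool
}

// driver groups the hypothetical per-step handlers; none of these names
// are taken from the real code base.
type driver struct{}

// Step 1: create/update StatefulSets to match the spec (no ES API needed).
func (d *driver) handleUpscaleAndSpecChanges() error { return nil }

// Step 2: delete Pods that must be force-upgraded (no ES API needed).
// Returns true if at least one Pod deletion was attempted.
func (d *driver) handleForcedUpgrades() (bool, error) { return false, nil }

// Steps 3 and 4: both require a working ES API.
func (d *driver) handleDownscale() error       { return nil }
func (d *driver) handleRollingUpgrades() error { return nil }

func (d *driver) isElasticsearchReachable() bool { return true }

// reconcile orders the steps as described above: everything that does not
// need the ES API runs first, then the API-dependent steps.
func (d *driver) reconcile() (Result, error) {
	// Step 1: upscale and spec changes.
	if err := d.handleUpscaleAndSpecChanges(); err != nil {
		return Result{}, err
	}

	// Step 2: forced rolling upgrade, attempted even if ES looks reachable.
	attempted, err := d.handleForcedUpgrades()
	if err != nil {
		return Result{}, err
	}
	if attempted {
		// Pods were deleted and expectations were set: requeue immediately
		// rather than continuing the reconciliation on a transient state.
		return Result{Requeue: true}, nil
	}

	// Steps 3 and 4 need the ES API, so stop here if it is unreachable.
	if !d.isElasticsearchReachable() {
		return Result{Requeue: true}, nil
	}
	if err := d.handleDownscale(); err != nil {
		return Result{}, err
	}
	if err := d.handleRollingUpgrades(); err != nil {
		return Result{}, err
	}
	return Result{}, nil
}

func main() {
	res, err := (&driver{}).reconcile()
	fmt.Printf("result=%+v err=%v\n", res, err)
}
```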
This is a follow-up to #1827.
With the manifest above, ES is considered "reachable" (i.e. `esReachable` is `true`) because there is at least one node behind the service. The problem is that the subsequent API calls fail and never give the deployment a chance to be updated. A sketch of this reachability check, and of why it is too weak, follows below.
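Here is a minimal Go sketch of that condition, assuming reachability is derived from Pod readiness behind the service. The `Pod` type and `esReachable` function are illustrative pseudocode for the check described above, not the actual ECK implementation.

```go
package main

import "fmt"

// Pod is a minimal stand-in for a Kubernetes Pod (illustrative only).
type Pod struct {
	Name  string
	Ready bool
}

// esReachable mirrors the check described above: ES is considered
// reachable as soon as at least one Pod behind the service is Ready.
func esReachable(pods []Pod) bool {
	for _, p := range pods {
		if p.Ready {
			return true
		}
	}
	return false
}

func main() {
	// With only 1 of 2 master nodes Ready, there is no master quorum:
	// esReachable reports true, yet every API call still fails.
	pods := []Pod{
		{Name: "es-master-0", Ready: true},
		{Name: "es-master-1", Ready: false},
	}
	fmt.Println(esReachable(pods)) // true, but the ES API cannot respond
}
```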