Use Elasticsearch readiness port #7847
buildkite test this -f p=gke,E2E_TAGS=es -m s=8.1.3,s=8.13.2
I chose to create a separate file mounted next to the existing script to facilitate upgrades. Specifically, I wanted to avoid a situation where, during the upgrade, we update the scripts ConfigMap, overwriting the old probe script with the new one and thereby breaking the not-yet-upgraded nodes in the cluster.
👍
LGTM!
LGTM
buildkite test this p=gke,s=8.1.3,t=TestVersionUpgradeToLatest8x

Edit: Oops, I didn't see that you've already done it in the long version via #7847 (comment).
"command": [
    "bash",
    "-c",
    "/mnt/elastic-internal/scripts/pre-stop-hook-script.sh"
]
pre 8.2.0
pkg/controller/elasticsearch/nodespec/__snapshots__/podspec_test.snap
"command": [
    "bash",
    "-c",
    "/mnt/elastic-internal/scripts/readiness-port-script.sh"
]
New script post 8.2.0
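The diff above only shows the probe invoking a mounted script. The new script's TCP check against the readiness port could be sketched roughly as follows; the function name, the default port, and the use of bash's /dev/tcp are illustrative assumptions, not taken from this PR:

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the actual ECK readiness-port-script.sh.
# Elasticsearch (8.2.0+) only accepts TCP connections on readiness.port
# once the node is ready and a member of the cluster, so a plain connect
# attempt doubles as a cluster-membership check.

check_readiness_port() {
  local port="${1:-8080}"  # 8080 is an assumed default, not from the PR
  # bash's /dev/tcp pseudo-device attempts a TCP connection; the exit
  # status reflects whether the connection succeeded. timeout(1) bounds
  # how long we wait for an unresponsive node.
  timeout 3 bash -c "exec 3<>/dev/tcp/127.0.0.1/${port}" 2>/dev/null
}
```

A probe script built on this would exit with the status of check_readiness_port, so the kubelet marks the Pod ready only once the node has (re)joined the cluster.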
buildkite test this p=gke,s=8.1.3,t=TestVersionUpgradeToLatest8x

buildkite test this -f p=kind,s=8.1.3 -m t=TestVersionUpgradeSingleToLatest8x,t=TestVersionUpgradeTwoNodesToLatest8x,t=TestExternalESStackMonitoring,t=TestForceUpgradePendingPodsInOneStatefulSet,t=TestKillSingleNodeReusePV,t=TestPodTemplateValidation,t=TestRedClusterCanBeModifiedByDisablingPredicate,t=TestStackConfigPolicy,t=TestVolumeRetention
There is an edge case with single-node clusters that I did not consider:
The problem does not present itself in the same way with multi-node clusters, where at least one node is available at all times, but the edge case applies to those as well if, for example, all nodes are deleted at the same time or due to some other external factor.
buildkite test this -f p=kind,E2E_TAGS=es -m s=8.1.3,s=8.13.2

buildkite test this -f p=kind,E2E_TAGS=es -m s=8.1.3,s=8.13.2
I am thinking about the different options to address the problem:
Note that all of the options only affect the internal service; the external service used by users and clients should still be based on the readiness of the Pods. @barkbay @thbkrkr, given that you have reviewed this PR, I would be curious about your thoughts.
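One mechanism by which such an internal service could keep routing traffic regardless of Pod readiness is the Kubernetes publishNotReadyAddresses field. The fragment below only illustrates that mechanism; the Service name and selector labels are assumptions, and this is not necessarily the option chosen in this PR:

```yaml
# Hypothetical internal Service fragment. With publishNotReadyAddresses
# set, endpoints are published even for Pods failing their readiness
# probe, so the operator can still reach nodes that have not yet
# (re)joined the cluster.
apiVersion: v1
kind: Service
metadata:
  name: my-cluster-es-internal-http
spec:
  publishNotReadyAddresses: true
  selector:
    elasticsearch.k8s.elastic.co/cluster-name: my-cluster
  ports:
    - name: https
      port: 9200
```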
Could we rely on the service unless there is no

buildkite test this -f p=kind,E2E_TAGS=es -m s=8.1.3,s=8.13.2
The test run from the comment above completed successfully in 2h instead of the usual 4h, which is suspicious. I need to take a closer look at whether this is a bug or whether the fact that we optimistically make connections to non-ready pods speeds up the tests this much.
buildkite test this -f p=gke,TESTS_MATCH=* -m s=8.1.3,s=8.13.2

buildkite test this -f p=gke,t=Test -m s=8.1.3,s=8.13.2
It is indeed faster.
LGTM
I tried to compare memory usage against a baseline from the last nightly run and against one of the longer runs triggered from this PR (#7847 (comment)). I did not spot any significant change, but the comparison leaves much to be desired.
Fixes #7841
As of 8.2.0, use the new readiness.port setting to enable a TCP readiness check that is sensitive to the cluster membership of the node. This should improve cluster availability during external disruptions, e.g. due to node upgrades, as the PDB will disallow further upgrades until the most recently upgraded node has rejoined the cluster.

I chose to create a separate file mounted next to the existing script to facilitate upgrades. Specifically, I wanted to avoid a situation where, during the upgrade, we update the scripts ConfigMap, overwriting the old probe script with the new one and thereby breaking the not-yet-upgraded nodes in the cluster.
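For context, on the Elasticsearch side the feature is a single configuration line; the port value below is an illustrative assumption, and the port actually chosen by the operator in this PR may differ:

```yaml
# elasticsearch.yml fragment (Elasticsearch 8.2.0+): open a TCP port
# that only accepts connections once the node is ready and part of the
# cluster. The port number here is an assumption, not from this PR.
readiness.port: 8080
```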
The approach taken here creates some technical debt in the form of the extra script hanging around when it is not needed. But I imagine we can drop it when we stop supporting ES < 8.2.0.
An alternative approach would be to integrate a version check into the existing script, which seemed to me more complicated to reason about.
Marked as a bug because a cluster might be unavailable if master nodes are deleted while the previously deleted ones are not yet back in the cluster, which violates our promise of interruption-free ES operations when running in HA mode with multiple master nodes.