Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestReversalStatefulSetRename is flaky #3844

Closed
barkbay opened this issue Oct 15, 2020 · 3 comments
Closed

TestReversalStatefulSetRename is flaky #3844

barkbay opened this issue Oct 15, 2020 · 3 comments
Labels
>flaky_test >test Related to unit/integration/e2e tests

Comments

@barkbay
Copy link
Contributor

barkbay commented Oct 15, 2020

=== RUN   TestReversalStatefulSetRename/All_expected_Pods_should_eventually_be_ready#01
Retries (30m0s timeout): 
    utils.go:84: 
        	Error Trace:	utils.go:84
        	Error:      	Received unexpected error:
        	            	invalid StatefulSets:
                            expected map[test-sset-rename-reversal-pntt-es-masterdata:1],
                                 got map[test-sset-rename-reversal-pntt-es-masterdata:1 test-sset-rename-reversal-pntt-es-other:1]

The steps of this test are:

  1. Rename the current nodeSet from masterdata to other
  2. Check that the configuration has been updated
  3. Rename other back to masterdata

It means that the nodeSet is renamed for a vey short moment.

When test is failing the resources are:

NAME                                                                        HEALTH   NODES   VERSION   PHASE   AGE
elasticsearch.elasticsearch.k8s.elastic.co/test-sset-rename-reversal-nr6k   green    1       7.9.1     Ready   25m

NAME                                                            READY   AGE   CONTAINERS      IMAGES
statefulset.apps/test-sset-rename-reversal-nr6k-es-masterdata   1/1     25m   elasticsearch   docker.elastic.co/elasticsearch/elasticsearch:7.9.1
statefulset.apps/test-sset-rename-reversal-nr6k-es-other        0/1     23m   elasticsearch   docker.elastic.co/elasticsearch/elasticsearch:7.9.1

NAME                                                 READY   STATUS     RESTARTS   AGE   IP            NODE                                           NOMINATED NODE   READINESS GATES
pod/test-sset-rename-reversal-nr6k-es-masterdata-0   1/1     Running    0          25m   10.0.129.77   gke-michael-dev-2-default-pool-add12a04-223g   <none>           <none>
pod/test-sset-rename-reversal-nr6k-es-other-0        0/1     Init:0/1   0          23m   10.0.128.77   gke-michael-dev-2-default-pool-ce5974a4-4jt0   <none>           <none>

NAME                                                                     TYPE                                  DATA   AGE
[..]
secret/test-sset-rename-reversal-nr6k-es-masterdata-es-transport-certs   Opaque                                3      25m
secret/test-sset-rename-reversal-nr6k-es-other-es-transport-certs        Opaque                                1      23m

test-sset-rename-reversal-nr6k-es-other-0 can't start because the transport certificate is not generated (there's only ca.crt in test-sset-rename-reversal-nr6k-es-other-es-transport-certs), which seems to be expected knowing that the nodeSet other does not exist anymore in the Elasticsearch spec.

The Pod is not deleted because of the reason: Cannot remove the last running master node.
checkDownscaleInvariants is assuming that there is only one master running (which is true, the other one will never start), but hence refuse to delete test-sset-rename-reversal-nr6k-es-other-0 which is also a master according to its labels. (even if it has never really been the case because it never started).

@barkbay barkbay added >flaky_test >test Related to unit/integration/e2e tests labels Oct 15, 2020
@barkbay
Copy link
Contributor Author

barkbay commented Oct 15, 2020

Related to #1986 and #2951 if we decide to delete a master node which never started.

@barkbay
Copy link
Contributor Author

barkbay commented Oct 15, 2020

Two things to be considered in order to solve that issue:

1. Update DownscaleInvariants

For the moment ECK is only accounting for the running masters in order to check if a Pod labeled as a master can be removed. This leads to the Cannot remove the last running master node error:

mastersReady := reconcile.AvailableElasticsearchNodes(label.FilterMasterNodePods(actualPods))
func checkDownscaleInvariants(state downscaleState, statefulSet appsv1.StatefulSet, requestedDeletes int32) (int32, string) {
	if label.IsMasterNodeSet(statefulSet) {
		if state.masterRemovalInProgress {
			return 0, OneMasterAtATimeInvariant
		}
		if state.runningMasters == 1 {
			return 0, AtLeastOneRunningMasterInvariant
		}
		requestedDeletes = 1 // only one removal allowed for masters
	}
	allowedDeletes := state.getMaxNodesToRemove(requestedDeletes)

	if allowedDeletes == 0 {
		return 0, RespectMaxUnavailableInvariant
	}

	return allowedDeletes, ""
}

Do we really need to account for running nodes only ?

2. Exclude master nodes, even if they never joined the cluster

ES < 7.8

There is no reliable way to do that for ES < 7.8. Even if we parse the error from Elasticsearch and ignore it, the Pod might temporarily join the cluster right after the API call.

ES >= 7.8

We should use the new API as described in elastic/elasticsearch#55291

@pebrc @sebgl @anyasabo Thanks for your help 🙇 , feel free to comment if I missed anything.

@thbkrkr
Copy link
Contributor

thbkrkr commented Oct 12, 2021

Closed by #3846 which enabled TestReversalStatefulSetRename again.

@thbkrkr thbkrkr closed this as completed Oct 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>flaky_test >test Related to unit/integration/e2e tests
Projects
None yet
Development

No branches or pull requests

2 participants