Red health during E2E test mutation #2788
I'm not able to reproduce locally. I think there's a real bug hidden here. The test mutates a topology of 3 master+data nodes plus 2 dedicated master nodes down to a single master+data node.
Looking at
it seems we would be running with 0 replicas for this test case, as the target cluster size is a single node. I wonder if a temporary red cluster state is to be expected in this case, as the primary will not be allocated at the moment it relocates from one of the excluded nodes to one of the surviving nodes. I assume the test fails only sometimes because our observation interval is 3 seconds, and with the small number of test documents in the index it might be able to relocate within that 3-second window in most runs.
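To make the timing argument concrete, here is a minimal sketch of polling cluster health on the same 3-second interval. The endpoint, the missing authentication and the field selection are assumptions for illustration, not what the E2E framework actually runs:

```go
// Hedged sketch: poll _cluster/health every 3 seconds and print the reported
// status plus the shard counters that matter during a relocation.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type clusterHealth struct {
	Status             string `json:"status"`
	RelocatingShards   int    `json:"relocating_shards"`
	InitializingShards int    `json:"initializing_shards"`
	UnassignedShards   int    `json:"unassigned_shards"`
}

func main() {
	const esURL = "http://localhost:9200" // placeholder endpoint
	for {
		resp, err := http.Get(esURL + "/_cluster/health")
		if err != nil {
			fmt.Println("request failed:", err)
		} else {
			var h clusterHealth
			if err := json.NewDecoder(resp.Body).Decode(&h); err == nil {
				fmt.Printf("status=%s relocating=%d initializing=%d unassigned=%d\n",
					h.Status, h.RelocatingShards, h.InitializingShards, h.UnassignedShards)
			}
			resp.Body.Close()
		}
		time.Sleep(3 * time.Second)
	}
}
```

With only a handful of test documents, a primary can finish relocating between two such polls, which would explain why the red state is usually missed.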
The test indicates the cluster has been red for 5 minutes once the mutation is over. I think it's not a transient situation.
You are right, I did not think of that. I was thinking of our continuous health check, which interestingly does not fail in this run.
From this failed E2E test:
As we suspected, one of the shards became unassigned, as if it was not migrated correctly.
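For anyone digging into a run like this, a minimal sketch of listing unassigned shards and the reason Elasticsearch reports for them; the endpoint is a placeholder, and `_cluster/allocation/explain` gives more detail for a single shard:

```go
// Hedged sketch: list shards via the _cat/shards API and report any that are
// UNASSIGNED, together with the reason Elasticsearch gives.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type shardRow struct {
	Index            string `json:"index"`
	Shard            string `json:"shard"`
	PriRep           string `json:"prirep"`
	State            string `json:"state"`
	UnassignedReason string `json:"unassigned.reason"`
}

func main() {
	const esURL = "http://localhost:9200" // placeholder endpoint
	resp, err := http.Get(esURL + "/_cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var rows []shardRow
	if err := json.NewDecoder(resp.Body).Decode(&rows); err != nil {
		panic(err)
	}
	for _, r := range rows {
		if r.State == "UNASSIGNED" {
			fmt.Printf("%s shard %s (%s) is unassigned: %s\n", r.Index, r.Shard, r.PriRep, r.UnassignedReason)
		}
	}
}
```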
A few observations:
In build 40
in build 57 we see a pattern that is consistent with the explanation @barkbay has outlined in #2864, which is that the annotation and the reality of the excludes diverge due to failed updates of the annotation caused by conflicts:
If only the first update of the annotation succeeded we might think that …
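To illustrate the conflict mechanics, here is a sketch of an annotation update that retries on resourceVersion conflicts using client-go. The annotation key and the choice of object are made up for the example and this is not the operator's actual code; the point is that without such a retry, a conflicting write leaves the stored annotation behind the operator's in-memory view, which is exactly the divergence described above:

```go
// Hedged sketch, not the operator's code: update an annotation and retry on
// resourceVersion conflicts so a concurrent writer cannot make us lose it.
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// excludeAnnotation is a hypothetical key used only for this example.
const excludeAnnotation = "example.elastic.co/excluded-nodes"

func setExcludedNodes(ctx context.Context, c kubernetes.Interface, ns, podName, excluded string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-read the latest version on every attempt so the update is applied
		// on top of whatever a conflicting writer has already changed.
		pod, err := c.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pod.Annotations == nil {
			pod.Annotations = map[string]string{}
		}
		pod.Annotations[excludeAnnotation] = excluded
		_, err = c.CoreV1().Pods(ns).Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```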
It seems to be excluded during iterations 29, 30, 31 and 32.
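For context, the "reality" side of the excludes lives in the Elasticsearch cluster settings. A minimal sketch of what excluding a node and reading the setting back looks like at the API level; the endpoint and node name are placeholders, and this is not the operator's code path:

```go
// Hedged sketch: exclude a node from shard allocation by name so its shards
// are migrated away, then read the cluster settings back to see the current
// exclusion state.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	const esURL = "http://localhost:9200" // placeholder endpoint

	// Set the allocation exclusion for a node that is about to be removed.
	body := []byte(`{"transient":{"cluster.routing.allocation.exclude._name":"example-es-masterdata-2"}}`)
	req, err := http.NewRequest(http.MethodPut, esURL+"/_cluster/settings", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()

	// Read the settings back: this is the state that can diverge from the
	// operator's annotation if an annotation update fails on a conflict.
	resp, err = http.Get(esURL + "/_cluster/settings")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	settings, _ := io.ReadAll(resp.Body)
	fmt.Println(string(settings))
}
```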
@barkbay You are absolutely right! I only looked at our monitoring cluster and the sad truth seems to be that we did not ingest all operator logs. Luckily we kept the cluster running ...
https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-stack-versions/32/testReport/github/com_elastic_cloud-on-k8s_test_e2e_es/Run_tests_for_different_ELK_stack_versions_in_GKE___7_6_0___TestMutationSecondMasterSetDown_ES_cluster_health_should_eventually_be_green_01/