Red health during E2E test mutation #2788
I'm not able to reproduce locally. I think there's a real bug hidden here. The test mutates a topology of 3 master+data nodes plus 2 dedicated master nodes down to a single master+data node.
Looking at
it seems we would be running with 0 replicas for this test case, as the target cluster size is a single node. I wonder if a temporary red cluster state is to be expected in this case, as the primary will not be allocated at the moment it relocates from one of the excluded nodes to one of the surviving nodes. I assume the test fails only sometimes because our observation interval is 3 seconds, and with the small number of test documents in the index it might be able to relocate within that 3-second window in most runs.
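To make the timing argument concrete, here is a minimal sketch of polling cluster health on the same 3-second interval. The endpoint, the missing authentication and the field selection are assumptions for illustration, not what the E2E framework actually runs:

```go
// Hedged sketch: poll _cluster/health every 3 seconds and print the reported
// status plus the shard counters that matter during a relocation.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type clusterHealth struct {
	Status             string `json:"status"`
	RelocatingShards   int    `json:"relocating_shards"`
	InitializingShards int    `json:"initializing_shards"`
	UnassignedShards   int    `json:"unassigned_shards"`
}

func main() {
	const esURL = "http://localhost:9200" // placeholder endpoint
	for {
		resp, err := http.Get(esURL + "/_cluster/health")
		if err != nil {
			fmt.Println("request failed:", err)
		} else {
			var h clusterHealth
			if err := json.NewDecoder(resp.Body).Decode(&h); err == nil {
				fmt.Printf("status=%s relocating=%d initializing=%d unassigned=%d\n",
					h.Status, h.RelocatingShards, h.InitializingShards, h.UnassignedShards)
			}
			resp.Body.Close()
		}
		time.Sleep(3 * time.Second)
	}
}
```

With only a handful of test documents, a primary can finish relocating between two such polls, which would explain why the red state is usually missed.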
The test indicates the cluster has been red for 5 minutes once the mutation is over. I think it's not a transient situation.
You are right, I did not think of that. I was thinking of our continuous health check, which interestingly does not fail in this run.
From this failed E2E test:
As we suspected, one of the shards became unassigned, as if it was not migrated correctly.
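For anyone digging into a run like this, a minimal sketch of listing unassigned shards and the reason Elasticsearch reports for them; the endpoint is a placeholder, and `_cluster/allocation/explain` gives more detail for a single shard:

```go
// Hedged sketch: list shards via the _cat/shards API and report any that are
// UNASSIGNED, together with the reason Elasticsearch gives.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type shardRow struct {
	Index            string `json:"index"`
	Shard            string `json:"shard"`
	PriRep           string `json:"prirep"`
	State            string `json:"state"`
	UnassignedReason string `json:"unassigned.reason"`
}

func main() {
	const esURL = "http://localhost:9200" // placeholder endpoint
	resp, err := http.Get(esURL + "/_cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var rows []shardRow
	if err := json.NewDecoder(resp.Body).Decode(&rows); err != nil {
		panic(err)
	}
	for _, r := range rows {
		if r.State == "UNASSIGNED" {
			fmt.Printf("%s shard %s (%s) is unassigned: %s\n", r.Index, r.Shard, r.PriRep, r.UnassignedReason)
		}
	}
}
```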
A few observations:
In build 40
in build 57 we see a pattern that is consistent with the explanation @barkbay has outlined in #2864, which is that the annotation and the reality of the excludes diverge due to failed updates of the annotation caused by conflicts:
If only the first update of the annotation succeeded we might think that …
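To illustrate the conflict mechanics, here is a sketch of an annotation update that retries on resourceVersion conflicts using client-go. The annotation key and the choice of object are made up for the example and this is not the operator's actual code; the point is that without such a retry, a conflicting write leaves the stored annotation behind the operator's in-memory view, which is exactly the divergence described above:

```go
// Hedged sketch, not the operator's code: update an annotation and retry on
// resourceVersion conflicts so a concurrent writer cannot make us lose it.
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// excludeAnnotation is a hypothetical key used only for this example.
const excludeAnnotation = "example.elastic.co/excluded-nodes"

func setExcludedNodes(ctx context.Context, c kubernetes.Interface, ns, podName, excluded string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-read the latest version on every attempt so the update is applied
		// on top of whatever a conflicting writer has already changed.
		pod, err := c.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pod.Annotations == nil {
			pod.Annotations = map[string]string{}
		}
		pod.Annotations[excludeAnnotation] = excluded
		_, err = c.CoreV1().Pods(ns).Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```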
It seems to be excluded during iterations 29, 30, 31 and 32.
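For context, the "reality" side of the excludes lives in the Elasticsearch cluster settings. A minimal sketch of what excluding a node and reading the setting back looks like at the API level; the endpoint and node name are placeholders, and this is not the operator's code path:

```go
// Hedged sketch: exclude a node from shard allocation by name so its shards
// are migrated away, then read the cluster settings back to see the current
// exclusion state.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	const esURL = "http://localhost:9200" // placeholder endpoint

	// Set the allocation exclusion for a node that is about to be removed.
	body := []byte(`{"transient":{"cluster.routing.allocation.exclude._name":"example-es-masterdata-2"}}`)
	req, err := http.NewRequest(http.MethodPut, esURL+"/_cluster/settings", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()

	// Read the settings back: this is the state that can diverge from the
	// operator's annotation if an annotation update fails on a conflict.
	resp, err = http.Get(esURL + "/_cluster/settings")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	settings, _ := io.ReadAll(resp.Body)
	fmt.Println(string(settings))
}
```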
@barkbay You are absolutely right! I only looked at our monitoring cluster and the sad truth seems to be that we did not ingest all operator logs. Luckily we kept the cluster running ...
https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-stack-versions/32/testReport/github/com_elastic_cloud-on-k8s_test_e2e_es/Run_tests_for_different_ELK_stack_versions_in_GKE___7_6_0___TestMutationSecondMasterSetDown_ES_cluster_health_should_eventually_be_green_01/