Red health during E2E test mutation #2788

Closed · sebgl opened this issue Mar 31, 2020 · 12 comments
@sebgl (Contributor) commented Mar 31, 2020

https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-stack-versions/32/testReport/github/com_elastic_cloud-on-k8s_test_e2e_es/Run_tests_for_different_ELK_stack_versions_in_GKE___7_6_0___TestMutationSecondMasterSetDown_ES_cluster_health_should_eventually_be_green_01/

=== RUN   TestMutationSecondMasterSetDown/ES_cluster_health_should_eventually_be_green#01
Retries (5m0s timeout): ....................................................................................................
{"log.level":"error","@timestamp":"2020-03-31T02:13:11.533Z","message":"stopping early","service.version":"0.0.0-00000000","service.type":"eck","ecs.version":"1.4.0","error":"test failure","error.stack_trace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128\ngh.neting.cc/elastic/cloud-on-k8s/test/e2e/test.StepList.RunSequential\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:42\ngh.neting.cc/elastic/cloud-on-k8s/test/e2e/test.RunMutations\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/run_mutation.go:37\ngh.neting.cc/elastic/cloud-on-k8s/test/e2e/test.RunMutation\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/run_mutation.go:78\ngh.neting.cc/elastic/cloud-on-k8s/test/e2e/es.RunESMutation\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/es/mutation_test.go:299\ngh.neting.cc/elastic/cloud-on-k8s/test/e2e/es.TestMutationSecondMasterSetDown\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/es/mutation_test.go:168\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:909"}
    --- FAIL: TestMutationSecondMasterSetDown/ES_cluster_health_should_eventually_be_green#01 (300.00s)
        utils.go:84: 
            	Error Trace:	utils.go:84
            	Error:      	Received unexpected error:
            	            	health is red
            	Test:       	TestMutationSecondMasterSetDown/ES_cluster_health_should_eventually_be_green#01
@sebgl (Contributor, Author) commented Apr 3, 2020

I'm not able to reproduce locally.

I think there's a real bug hidden here. The test mutates a cluster of 3 master-data nodes + 2 dedicated master nodes down to a single master-data node.
The output shows the cluster is still red 5 minutes after the mutation is over, so it is not a network issue. Could it indicate a problem with the way we decommission nodes/migrate data?

@pebrc (Collaborator) commented Apr 9, 2020

Apr 9, 2020 @ 00:23:19.845 (test-mutation-2nd-master-set-dtvx-es-masterdata-0)
[elasticsearch.server][INFO] master node changed {previous [], current [{test-mutation-2nd-master-set-dtvx-es-masterdata-0}{yyXrADMUQ2ad46TSRTVctw}{E8RJGRlFT8qUANHGXfFlyg}{10.1.49.47}{10.1.49.47:9300}{dilm}{ml.machine_memory=2147483648, xpack.installed=true, ml.max_open_jobs=20}]}, term: 6, version: 49, reason: Publication{term=6, version=49}

Apr 9, 2020 @ 00:23:19.877 (test-mutation-2nd-master-set-dtvx-es-masterdata-0)
[elasticsearch.server][INFO] scheduling reroute for delayed shards in [21.4s] (1 delayed shards)

Apr 9, 2020 @ 00:23:41.220 (test-mutation-2nd-master-set-dtvx-es-master-1)
[elasticsearch.server][WARN] failed to schedule/execute reroute post unassigned shard

Apr 9, 2020 @ 00:23:41.221 (test-mutation-2nd-master-set-dtvx-es-master-1)
{"type": "server", "timestamp": "2020-04-08T22:23:41,220Z", "level": "WARN", "component": "o.e.c.r.DelayedAllocationService", "cluster.name": "test-mutation-2nd-master-set-dtvx", "node.name": "test-mutation-2nd-master-set-dtvx-es-master-1", "message": "failed to schedule/execute reroute post unassigned shard", "cluster.uuid": "LJmEaA9PSGuLa_3ec_CvJQ", "node.id": "JrvwdXabTnWk98hD2ZjjJw" ,
"stacktrace": ["org.elasticsearch.cluster.NotMasterException: no longer master. source: [delayed_allocation_reroute]"] }

@pebrc (Collaborator) commented Apr 9, 2020

Looking at

if MustNumDataNodes(initial.Elasticsearch) == 1 || MustNumDataNodes(b.Elasticsearch) == 1 {

it seems we would be running with 0 replicas for this test case, as the target cluster size is a single data node. I wonder if temporary red cluster state is to be expected in this case as the primary will not be allocated at the moment it relocates from one of the excluded nodes to one of the surviving nodes.

I assume the test fails only sometimes because our observation interval is 3 seconds, and with the small number of test documents in the index it can relocate within that 3-second window in most runs.
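For context, a minimal sketch of what the quoted condition implies (hypothetical helper, not the actual e2e code): when either side of the mutation has a single data node, the test index has to run with 0 replicas, so every shard is a sole primary copy and any allocation gap shows up directly as red health.

package main

import "fmt"

// numReplicas mirrors the condition quoted above: if either the initial or the
// target cluster has only one data node, the test index cannot have replicas,
// otherwise it could never turn green on the single-node side.
func numReplicas(initialDataNodes, targetDataNodes int) int {
	if initialDataNodes == 1 || targetDataNodes == 1 {
		return 0
	}
	return 1
}

func main() {
	// Topology of this test: 3 master-data nodes mutated down to 1.
	fmt.Println(numReplicas(3, 1)) // prints 0: every shard is a sole primary copy
}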

@sebgl (Contributor, Author) commented Apr 9, 2020

> I wonder if temporary red cluster state is to be expected

The test indicates the cluster has been red for 5 minutes after the mutation is over. I think it's not a transient situation.

@pebrc (Collaborator) commented Apr 9, 2020

> The test indicates the cluster has been red for 5 minutes after the mutation is over. I think it's not a transient situation.

You are right, I did not think of that. I was thinking of our continuous health check, which interestingly does not fail in this run.
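To illustrate the observation-interval point: a rough sketch of a poll-based health watcher in the spirit of the e2e continuous check (the _cluster/health endpoint is the standard Elasticsearch API; the interval, credentials and counting logic here are assumptions, not the actual test framework code). With a 3-second interval, a red state that resolves within one window is simply never observed.

package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type clusterHealth struct {
	Status string `json:"status"` // green, yellow or red
}

func main() {
	client := &http.Client{Transport: &http.Transport{
		// e2e clusters use self-signed certificates
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	redObservations := 0
	for i := 0; i < 100; i++ { // 100 polls at 3s: roughly 5 minutes of observations
		req, _ := http.NewRequest("GET", "https://localhost:9200/_cluster/health", nil)
		req.SetBasicAuth("elastic", "changeme") // placeholder credentials
		if resp, err := client.Do(req); err == nil {
			var h clusterHealth
			_ = json.NewDecoder(resp.Body).Decode(&h)
			resp.Body.Close()
			if h.Status == "red" {
				redObservations++
				fmt.Printf("poll %d: cluster red (%d red observations so far)\n", i, redObservations)
			}
		}
		time.Sleep(3 * time.Second) // anything that turns red and back within 3s goes unnoticed
	}
}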

@sebgl (Contributor, Author) commented Apr 10, 2020

From this failed E2E test:

⟩ curl "https://localhost:9200/_cat/health?v" -k -u elastic:$PASSWORD
epoch      timestamp cluster                           status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1586531442 15:10:42  test-mutation-2nd-master-set-swbk red             1         1      2   2    0    0        1             0                  -                 66.7%
⟩ curl "https://localhost:9200/_cat/shards?v" -k -u elastic:$PASSWORD
index                shard prirep state      docs  store ip         node
data-integrity-check 2     p      STARTED       1  3.4kb 10.64.48.5 test-mutation-2nd-master-set-swbk-es-masterdata-0
data-integrity-check 1     p      STARTED       4 12.9kb 10.64.48.5 test-mutation-2nd-master-set-swbk-es-masterdata-0
data-integrity-check 0     p      UNASSIGNED  
⟩ curl "https://localhost:9200/_cluster/allocation/explain" -k -u elastic:$PASSWORD | jq .
{
  "index": "data-integrity-check",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NODE_LEFT",
    "at": "2020-04-10T14:50:58.183Z",
    "details": "node_left [ZMxWVDiATeimR8jfz7PWpg]",
    "last_allocation_status": "no_valid_shard_copy"
  },
  "can_allocate": "no_valid_shard_copy",
  "allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions": [
    {
      "node_id": "z_Wn8EdwT1KU5C-7AnEEyg",
      "node_name": "test-mutation-2nd-master-set-swbk-es-masterdata-0",
      "transport_address": "10.64.48.5:9300",
      "node_attributes": {
        "ml.machine_memory": "2147483648",
        "xpack.installed": "true",
        "ml.max_open_jobs": "20"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    }
  ]
}

As we suspected, one of the shards became unassigned, as if it was not migrated correctly.

@pebrc (Collaborator) commented Apr 10, 2020

eq -n -l 1000 -q 'e2e_test_id:"e2e/master-40-gke-eck-debug-e2e-40-default-7.6.0" AND kubernetes.statefulset.name:e2e-63pgt-operator AND NOT log.logger:license-controller AND NOT log.logger:remotecluster-remoteca' --json | jq --raw-output '._source | "[\(."@timestamp")] \(.message): \([.iteration, .from, .to,.value, .statefulset_name, .reason] | del(.[]|nulls) | join(" "))"' [1] [2]

[2020-04-10T14:50:25.257Z] Starting reconciliation run:  33
[2020-04-10T14:50:25.257Z] validate create:  
[2020-04-10T14:50:25.273Z] Successfully Reconciled:  
[2020-04-10T14:50:25.746Z] Cannot downscale StatefulSet:  test-mutation-2nd-master-set-swbk-es-masterdata A master node is already in the process of being removed
[2020-04-10T14:50:25.746Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-swbk-es-master-2
[2020-04-10T14:50:26.223Z] Ending reconciliation run:  33
[2020-04-10T14:50:26.223Z] Reconciler error:  
[2020-04-10T14:50:26.223Z] Starting reconciliation run:  34
[2020-04-10T14:50:26.223Z] validate create:  
[2020-04-10T14:50:26.510Z] Cannot downscale StatefulSet:  test-mutation-2nd-master-set-swbk-es-master A master node is already in the process of being removed
[2020-04-10T14:50:26.538Z] Data migration completed successfully, starting node deletion:  test-mutation-2nd-master-set-swbk-es-masterdata
[2020-04-10T14:50:26.538Z] Scaling replicas down:  2 1 test-mutation-2nd-master-set-swbk-es-masterdata
[2020-04-10T14:50:26.538Z] Setting voting config exclusions:  
[2020-04-10T14:50:26.694Z] received request:  
[2020-04-10T14:50:26.694Z] validate update:  
[2020-04-10T14:50:26.694Z] wrote response:  
[2020-04-10T14:50:26.712Z] Successfully Reconciled:  
[2020-04-10T14:50:26.742Z] StatefulSets observedGeneration is not reconciled yet, re-queueing:  
[2020-04-10T14:50:26.742Z] validate create:  
[2020-04-10T14:50:26.742Z] Statefulset not reconciled:  test-mutation-2nd-master-set-swbk-es-masterdata template hash not equal
[2020-04-10T14:50:26.742Z] Ending reconciliation run:  34
[2020-04-10T14:50:26.742Z] Starting reconciliation run:  35
[2020-04-10T14:50:27.466Z] Updating status:  35
[2020-04-10T14:50:27.466Z] Some pods still need to be created/deleted:  test-mutation-2nd-master-set-swbk-es-masterdata
[2020-04-10T14:50:27.514Z] validate create:  
[2020-04-10T14:50:27.514Z] Successfully Reconciled:  
[2020-04-10T14:50:27.514Z] Ending reconciliation run:  35
[2020-04-10T14:50:27.514Z] Starting reconciliation run:  36
[2020-04-10T14:50:27.798Z] Some pods still need to be created/deleted:  test-mutation-2nd-master-set-swbk-es-masterdata
[2020-04-10T14:50:27.798Z] Ending reconciliation run:  36
[2020-04-10T14:50:30.143Z] Retrieving cluster state:  
[2020-04-10T14:50:35.257Z] Starting reconciliation run:  37
[2020-04-10T14:50:35.258Z] validate create:  
[2020-04-10T14:50:35.543Z] Ending reconciliation run:  37
[2020-04-10T14:50:35.543Z] Some pods still need to be created/deleted:  test-mutation-2nd-master-set-swbk-es-masterdata
[2020-04-10T14:50:40.143Z] Retrieving cluster state:  
[2020-04-10T14:50:45.543Z] validate create:  
[2020-04-10T14:50:45.543Z] Starting reconciliation run:  38
[2020-04-10T14:50:45.822Z] Some pods still need to be created/deleted:  test-mutation-2nd-master-set-swbk-es-masterdata
[2020-04-10T14:50:45.822Z] Ending reconciliation run:  38
[2020-04-10T14:50:50.143Z] Retrieving cluster state:  
[2020-04-10T14:50:55.823Z] Starting reconciliation run:  39
[2020-04-10T14:50:55.823Z] validate create:  
[2020-04-10T14:50:56.104Z] Some pods still need to be created/deleted:  test-mutation-2nd-master-set-swbk-es-masterdata
[2020-04-10T14:50:56.104Z] Ending reconciliation run:  39
[2020-04-10T14:50:59.570Z] validate create:  
[2020-04-10T14:50:59.570Z] Starting reconciliation run:  40
[2020-04-10T14:50:59.868Z] validate create:  
[2020-04-10T14:50:59.868Z] Some pods still need to be created/deleted:  test-mutation-2nd-master-set-swbk-es-masterdata
[2020-04-10T14:50:59.868Z] Ending reconciliation run:  40
[2020-04-10T14:50:59.868Z] Starting reconciliation run:  41
[2020-04-10T14:51:00.143Z] Retrieving cluster state:  
[2020-04-10T14:51:00.145Z] Some pods still need to be created/deleted:  test-mutation-2nd-master-set-swbk-es-masterdata
[2020-04-10T14:51:00.145Z] Ending reconciliation run:  41
[2020-04-10T14:51:00.166Z] Starting reconciliation run:  42
[2020-04-10T14:51:00.166Z] validate create:  
[2020-04-10T14:51:00.446Z] Updating status:  42
[2020-04-10T14:51:00.446Z] Some pods still need to be created/deleted:  test-mutation-2nd-master-set-swbk-es-masterdata
[2020-04-10T14:51:00.446Z] Recording event:  
[2020-04-10T14:51:00.446Z] Elasticsearch cluster health degraded:  Unhealthy

A few observations:

  • in iteration 33 we think we want to downscale the master sset and set an allocation filter to move data away from test-mutation-2nd-master-set-swbk-es-master-2 (pointless because it is a master-only node)
  • in iteration 34 we change our mind and decide to downscale the masterdata sset instead. I'm missing the Setting routing allocation excludes message in this iteration, which can only mean that we think either that no node is leaving, or that the leaving node is already excluded when looking at the annotation on the Elasticsearch resource:
    if exclusions == allocationExcludeFromAnnotation(es) {
  • still in iteration 34 we then proceed to downscale the masterdata sset and data loss ensues. Clearly the IsMigratingData check does not catch the fact that a shard is currently relocating towards the node we are about to delete; we also don't re-check whether an allocation exclude is in place for the node we are deleting before deleting it (see the sketch at the end of this comment)
  • the reconciler errors in the logs are always edit conflicts

[1] https://github.com/Conky5/eq
[2] #2862
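A sketch of the stricter check hinted at in the third bullet above (purely illustrative, not the operator's actual IsMigratingData implementation): a node is only safe to delete when no shard copy sits on it and no shard is in flight anywhere in the cluster, since an in-flight copy may be relocating towards the very node we are about to remove.

package main

import "fmt"

// shard is a simplified view of one row of _cat/shards.
type shard struct {
	Index string
	Node  string // node currently holding (or receiving) this copy
	State string // STARTED, RELOCATING, INITIALIZING, UNASSIGNED
}

// isMigratingData reports whether it is unsafe to delete nodeName: either a
// shard copy is still on it, or some copy is not fully STARTED, in which case
// its target could be the node we are about to delete.
func isMigratingData(nodeName string, shards []shard) bool {
	for _, s := range shards {
		if s.Node == nodeName {
			return true
		}
		if s.State != "STARTED" {
			return true
		}
	}
	return false
}

func main() {
	shards := []shard{
		{Index: "data-integrity-check", Node: "masterdata-0", State: "STARTED"},
		{Index: "data-integrity-check", Node: "masterdata-1", State: "RELOCATING"},
	}
	// true: a copy is still relocating, deleting masterdata-1 now risks data loss.
	fmt.Println(isMigratingData("masterdata-1", shards))
}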

@pebrc (Collaborator) commented Apr 11, 2020

In build 40, masterdata-1 is never excluded from allocation, which is more than odd.

[2020-04-10T14:50:14.481Z] Setting routing allocation excludes:  none_excluded
[2020-04-10T14:50:14.963Z] Setting routing allocation excludes:  none_excluded
[2020-04-10T14:50:21.242Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-swbk-es-master-2
[2020-04-10T14:50:25.746Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-swbk-es-master-2
[2020-04-10T14:51:12.135Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-swbk-es-master-2
[2020-04-10T14:51:12.606Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-swbk-es-master-2
[2020-04-10T14:51:46.918Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-swbk-es-master-1
[2020-04-10T14:51:47.400Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-swbk-es-master-1
[2020-04-10T14:52:31.282Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-swbk-es-master-0
[2020-04-10T14:52:31.692Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-swbk-es-master-0
[2020-04-10T14:53:10.600Z] Setting routing allocation excludes:  none_excluded

In build 57 we see a pattern that is consistent with the explanation @barkbay has outlined in #2864: the annotation and the actual excludes in Elasticsearch diverge because updates of the annotation fail due to conflicts:

[2020-04-10T23:16:46.304Z] Setting routing allocation excludes:  none_excluded
[2020-04-10T23:16:46.807Z] Setting routing allocation excludes:  none_excluded
[2020-04-10T23:16:51.633Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-pn6t-es-masterdata-1
[2020-04-10T23:16:53.213Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-pn6t-es-master-2
[2020-04-10T23:17:33.032Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-pn6t-es-master-2
[2020-04-10T23:17:33.424Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-pn6t-es-master-2
[2020-04-10T23:18:13.992Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-pn6t-es-master-1
[2020-04-10T23:18:14.474Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-pn6t-es-master-1
[2020-04-10T23:18:53.045Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-pn6t-es-master-0
[2020-04-10T23:18:53.453Z] Setting routing allocation excludes:  test-mutation-2nd-master-set-pn6t-es-master-0

If only the first update of the annotation succeeded, we might think that masterdata-1 was excluded when in fact it was not anymore. I still don't understand the sequence for build 40.
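A toy reproduction of that hypothesis (hypothetical types, not operator code): if the Elasticsearch settings call succeeds but the follow-up annotation update hits a conflict, a later reconciliation that compares the desired exclusion against the stale annotation skips the API call and leaves the wrong node excluded.

package main

import "fmt"

// cluster models the two places the exclusion lives: the actual Elasticsearch
// cluster setting and the annotation cached on the Elasticsearch resource.
type cluster struct {
	esExclusion string // what Elasticsearch really has
	annotation  string // what the operator last managed to record
}

// setExclusion mimics the "skip the API call if the annotation already
// matches" optimisation from the snippet quoted earlier in this thread.
func (c *cluster) setExclusion(want string, annotationUpdateOK bool) {
	if want == c.annotation {
		fmt.Printf("skip API call: annotation already %q (ES actually excludes %q)\n", want, c.esExclusion)
		return
	}
	c.esExclusion = want // the settings call to Elasticsearch succeeds
	if annotationUpdateOK {
		c.annotation = want
	} // otherwise: update conflict, the annotation stays stale
}

func main() {
	c := &cluster{esExclusion: "none_excluded", annotation: "none_excluded"}
	c.setExclusion("masterdata-1", true) // ES and annotation agree
	c.setExclusion("master-2", false)    // ES overwritten, annotation update conflicts
	c.setExclusion("masterdata-1", true) // skipped: masterdata-1 looks excluded but is not
	fmt.Printf("final state: ES excludes %q, annotation says %q\n", c.esExclusion, c.annotation)
}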

@barkbay (Contributor) commented Apr 12, 2020

> In build 40, masterdata-1 is never excluded from allocation, which is more than odd.

It seems to be excluded during iterations 29 and 31:

Iteration 29:

{"log.level":"info","@timestamp":"2020-04-10T14:50:20.144Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","iteration":29,"namespace":"e2e-63pgt-mercury","es_name":"test-mutation-2nd-master-set-swbk"}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:20.441Z","log.logger":"driver","message":"Cannot downscale StatefulSet","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-63pgt-mercury","statefulset_name":"test-mutation-2nd-master-set-swbk-es-master","reason":"A master node is already in the process of being removed"}
{"log.level":"info","@timestamp":"2020-04-10T14:50:20.441Z","log.logger":"migrate-data","message":"Setting routing allocation excludes","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-63pgt-mercury","es_name":"test-mutation-2nd-master-set-swbk","value":"test-mutation-2nd-master-set-swbk-es-masterdata-1"}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:20.911Z","log.logger":"driver","message":"Data migration not over yet, skipping node deletion","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-63pgt-mercury","statefulset_name":"test-mutation-2nd-master-set-swbk-es-masterdata","node":"test-mutation-2nd-master-set-swbk-es-masterdata-1"}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:20.936Z","log.logger":"driver","message":"Statefulset not reconciled","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","statefulset_name":"test-mutation-2nd-master-set-swbk-es-masterdata","reason":"template hash not equal"}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:20.937Z","log.logger":"elasticsearch-controller","message":"Recording event","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","event":{"EventType":"Normal","Reason":"Delayed","Message":"Requested topology change delayed by data migration. Ensure index replica settings allow node removal."}}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:20.937Z","log.logger":"elasticsearch-controller","message":"Updating status","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","iteration":29,"namespace":"e2e-63pgt-mercury","es_name":"test-mutation-2nd-master-set-swbk","status":{"availableNodes":5,"health":"green","phase":"MigratingData"}}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:20.937Z","log.logger":"controller-runtime.manager.events","message":"Normal","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","object":{"kind":"Elasticsearch","namespace":"e2e-63pgt-mercury","name":"test-mutation-2nd-master-set-swbk","uid":"7a3e7216-7b3a-11ea-aac4-42010a9a0077","apiVersion":"elasticsearch.k8s.elastic.co/v1","resourceVersion":"4705"},"reason":"Delayed","message":"Requested topology change delayed by data migration. Ensure index replica settings allow node removal."}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:20.945Z","log.logger":"elasticsearch-controller","message":"Conflict while updating status","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-63pgt-mercury","es_name":"test-mutation-2nd-master-set-swbk"}
{"log.level":"info","@timestamp":"2020-04-10T14:50:20.945Z","log.logger":"elasticsearch-controller","message":"Ending reconciliation run","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","iteration":29,"namespace":"e2e-63pgt-mercury","es_name":"test-mutation-2nd-master-set-swbk","took":0.801817147}

Iteration 30:

{"log.level":"info","@timestamp":"2020-04-10T14:50:20.946Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","iteration":30,"namespace":"e2e-63pgt-mercury","es_name":"test-mutation-2nd-master-set-swbk"}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:20.946Z","log.logger":"es-validation","message":"validate create","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","name":"test-mutation-2nd-master-set-swbk"}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:21.241Z","log.logger":"driver","message":"Cannot downscale StatefulSet","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-63pgt-mercury","statefulset_name":"test-mutation-2nd-master-set-swbk-es-masterdata","reason":"A master node is already in the process of being removed"}
{"log.level":"info","@timestamp":"2020-04-10T14:50:21.241Z","log.logger":"migrate-data","message":"Setting routing allocation excludes","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-63pgt-mercury","es_name":"test-mutation-2nd-master-set-swbk","value":"test-mutation-2nd-master-set-swbk-es-master-2"}
{"log.level":"info","@timestamp":"2020-04-10T14:50:21.540Z","log.logger":"driver","message":"Data migration completed successfully, starting node deletion","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-63pgt-mercury","statefulset_name":"test-mutation-2nd-master-set-swbk-es-master","node":"test-mutation-2nd-master-set-swbk-es-master-2"}
{"log.level":"info","@timestamp":"2020-04-10T14:50:21.540Z","log.logger":"driver","message":"Scaling replicas down","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-63pgt-mercury","statefulset_name":"test-mutation-2nd-master-set-swbk-es-master","from":3,"to":2}
{"log.level":"info","@timestamp":"2020-04-10T14:50:21.540Z","log.logger":"zen2","message":"Setting voting config exclusions","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-63pgt-mercury","nodes":["test-mutation-2nd-master-set-swbk-es-master-2"]}
{"log.level":"info","@timestamp":"2020-04-10T14:50:22.481Z","log.logger":"elasticsearch-controller","message":"Ending reconciliation run","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","iteration":30,"namespace":"e2e-63pgt-mercury","es_name":"test-mutation-2nd-master-set-swbk","took":1.53515549}
{"log.level":"error","@timestamp":"2020-04-10T14:50:22.481Z","log.logger":"controller-runtime.controller","message":"Reconciler error","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","controller":"elasticsearch-controller","request":"e2e-63pgt-mercury/test-mutation-2nd-master-set-swbk","error":"Operation cannot be fulfilled on elasticsearches.elasticsearch.k8s.elastic.co \"test-mutation-2nd-master-set-swbk\": the object has been modified; please apply your changes to the latest version and try again","errorCauses":[{"error":"Operation cannot be fulfilled on elasticsearches.elasticsearch.k8s.elastic.co \"test-mutation-2nd-master-set-swbk\": the object has been modified; please apply your changes to the latest version and try again"}],"error.stack_trace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:88"}

Iteration 31:

{"log.level":"info","@timestamp":"2020-04-10T14:50:22.481Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","iteration":31,"namespace":"e2e-63pgt-mercury","es_name":"test-mutation-2nd-master-set-swbk"}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:23.036Z","log.logger":"driver","message":"Cannot downscale StatefulSet","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-63pgt-mercury","statefulset_name":"test-mutation-2nd-master-set-swbk-es-master","reason":"A master node is already in the process of being removed"}
{"log.level":"info","@timestamp":"2020-04-10T14:50:23.036Z","log.logger":"migrate-data","message":"Setting routing allocation excludes","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-63pgt-mercury","es_name":"test-mutation-2nd-master-set-swbk","value":"test-mutation-2nd-master-set-swbk-es-masterdata-1"}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:23.264Z","log.logger":"controller-runtime.controller","message":"Successfully Reconciled","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","controller":"remoteca-controller","request":"e2e-63pgt-mercury/test-mutation-2nd-master-set-swbk"}
{"log.level":"info","@timestamp":"2020-04-10T14:50:23.290Z","log.logger":"driver","message":"Data migration completed successfully, starting node deletion","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-63pgt-mercury","statefulset_name":"test-mutation-2nd-master-set-swbk-es-masterdata","node":"test-mutation-2nd-master-set-swbk-es-masterdata-1"}
{"log.level":"info","@timestamp":"2020-04-10T14:50:23.290Z","log.logger":"driver","message":"Scaling replicas down","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-63pgt-mercury","statefulset_name":"test-mutation-2nd-master-set-swbk-es-masterdata","from":2,"to":1}
{"log.level":"info","@timestamp":"2020-04-10T14:50:23.290Z","log.logger":"zen2","message":"Setting voting config exclusions","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-63pgt-mercury","nodes":["test-mutation-2nd-master-set-swbk-es-masterdata-1"]}
{"log.level":"info","@timestamp":"2020-04-10T14:50:23.970Z","log.logger":"elasticsearch-controller","message":"Ending reconciliation run","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","iteration":31,"namespace":"e2e-63pgt-mercury","es_name":"test-mutation-2nd-master-set-swbk","took":1.488743571}
{"log.level":"error","@timestamp":"2020-04-10T14:50:23.970Z","log.logger":"controller-runtime.controller","message":"Reconciler error","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","controller":"elasticsearch-controller","request":"e2e-63pgt-mercury/test-mutation-2nd-master-set-swbk","error":"Operation cannot be fulfilled on elasticsearches.elasticsearch.k8s.elastic.co \"test-mutation-2nd-master-set-swbk\": the object has been modified; please apply your changes to the latest version and try again","errorCauses":[{"error":"Operation cannot be fulfilled on elasticsearches.elasticsearch.k8s.elastic.co \"test-mutation-2nd-master-set-swbk\": the object has been modified; please apply your changes to the latest version and try again"}],"error.stack_trace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:88"}

Iteration 32:

{"log.level":"info","@timestamp":"2020-04-10T14:50:23.970Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","iteration":32,"namespace":"e2e-63pgt-mercury","es_name":"test-mutation-2nd-master-set-swbk"}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:23.970Z","log.logger":"es-validation","message":"validate create","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","name":"test-mutation-2nd-master-set-swbk"}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:25.137Z","log.logger":"driver","message":"Cannot downscale StatefulSet","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-63pgt-mercury","statefulset_name":"test-mutation-2nd-master-set-swbk-es-master","reason":"A master node is already in the process of being removed"}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:25.221Z","log.logger":"driver","message":"Data migration not over yet, skipping node deletion","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-63pgt-mercury","statefulset_name":"test-mutation-2nd-master-set-swbk-es-masterdata","node":"test-mutation-2nd-master-set-swbk-es-masterdata-1"}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:25.243Z","log.logger":"driver","message":"Statefulset not reconciled","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","statefulset_name":"test-mutation-2nd-master-set-swbk-es-masterdata","reason":"template hash not equal"}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:25.243Z","log.logger":"elasticsearch-controller","message":"Recording event","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","event":{"EventType":"Normal","Reason":"Delayed","Message":"Requested topology change delayed by data migration. Ensure index replica settings allow node removal."}}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:25.243Z","log.logger":"elasticsearch-controller","message":"Updating status","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","iteration":32,"namespace":"e2e-63pgt-mercury","es_name":"test-mutation-2nd-master-set-swbk","status":{"availableNodes":5,"health":"green","phase":"MigratingData"}}
{"log.level":"debug","@timestamp":"2020-04-10T14:50:25.243Z","log.logger":"controller-runtime.manager.events","message":"Normal","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","object":{"kind":"Elasticsearch","namespace":"e2e-63pgt-mercury","name":"test-mutation-2nd-master-set-swbk","uid":"7a3e7216-7b3a-11ea-aac4-42010a9a0077","apiVersion":"elasticsearch.k8s.elastic.co/v1","resourceVersion":"4722"},"reason":"Delayed","message":"Requested topology change delayed by data migration. Ensure index replica settings allow node removal."}
{"log.level":"info","@timestamp":"2020-04-10T14:50:25.257Z","log.logger":"elasticsearch-controller","message":"Ending reconciliation run","service.version":"1.1.0-d85228ae","service.type":"eck","ecs.version":"1.4.0","iteration":32,"namespace":"e2e-63pgt-mercury","es_name":"test-mutation-2nd-master-set-swbk","took":1.286797181} 

@pebrc (Collaborator) commented Apr 12, 2020

@barkbay You are absolutely right! I only looked at our monitoring cluster and the sad truth seems to be that we did not ingest all operator logs. Luckily we kept the cluster running ...

@pebrc (Collaborator) commented Apr 20, 2020

I am closing this, as #2880, #2882 and #2891 should fix the underlying issue behind the test failures observed in this issue.

pebrc closed this as completed on Apr 20, 2020