TestMutationHTTPToHTTPS is flaky #3723

Closed
pebrc opened this issue Sep 7, 2020 · 14 comments · Fixed by #3726
Labels
>bug Something isn't working

Comments

pebrc (Collaborator) commented Sep 7, 2020

https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-master/575/consoleFull

=== RUN   TestMutationHTTPSToHTTP/All_expected_Pods_should_eventually_be_ready#01
Retries (30m0s timeout): ...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
    utils.go:84: 
        	Error Trace:	utils.go:84
        	Error:      	Received unexpected error:
        	            	pod test-mutation-https-to-http-vslx-es-masterdata-0 was not upgraded (yet?) to match the expected specification
        	Test:       	TestMutationHTTPSToHTTP/All_expected_Pods_should_eventually_be_ready#01
{"log.level":"error","@timestamp":"2020-09-07T12:03:12.939Z","message":"stopping early","service.version":"0.0.0-SNAPSHOT+00000000","service.type":"eck","ecs.version":"1.4.0","error":"test failure","error.stack_trace":"github.com/elastic/cloud-on-k8s/test/e2e/test.StepList.RunSequential\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:43\ngithub.com/elastic/cloud-on-k8s/test/e2e/test.RunMutations\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/run_mutation.go:38\ngithub.com/elastic/cloud-on-k8s/test/e2e/test.RunMutation\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/run_mutation.go:80\ngithub.com/elastic/cloud-on-k8s/test/e2e/es.RunESMutation\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/es/mutation_test.go:299\ngithub.com/elastic/cloud-on-k8s/test/e2e/es.TestMutationHTTPSToHTTP\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/es/mutation_test.go:48\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1108"}
    --- FAIL: TestMutationHTTPSToHTTP/All_expected_Pods_should_eventually_be_ready#01 (1800.00s)

Looks like we have the first case of DNS-related issues at hand, after changing to a DNS-based publish_host.

2020-09-07T11:33:54,670Z publish_address {test-mutation-https-to-http-vslx-es-masterdata-2.test-mutation-https-to-http-vslx-es-masterdata/10.115.49.72:9300}, bound_addresses {0.0.0.0:9300}

It seems Elasticsearch is coming up after the update with its hostname resolving to the old IP as the publish_host, presumably because the old value was still cached in k8s DNS?

It then seems to try to connect to itself (all logs are from test-mutation-https-to-http-vslx-es-masterdata-2) and keeps failing for 30 minutes.

"Caused by: java.io.IOException: No route to host: test-mutation-https-to-http-vslx-es-masterdata-2.test-mutation-https-to-http-vslx-es-masterdata/10.115.49.72:9300",

on test-mutation-https-to-http-vslx-es-masterdata-2. New IP is 10.115.49.73

pebrc added the >bug Something isn't working label on Sep 7, 2020
pebrc (Collaborator, Author) commented Sep 7, 2020

Working theory: I am assuming that the DNS lookup for the node's own transport address is done only once, when the node starts up; once it has the incorrect IP in that BoundTransportAddress, it will never update it.

pebrc (Collaborator, Author) commented Sep 7, 2020

I was able to reproduce the exact same symptoms by manipulating an Elasticsearch pod to start with the publish_host set to its "old" IP address. Of course this experiment lacks the indirection of the DNS hostname and the theoretical possibility of looking up the hostname again after some time. But I guess the test failure indicates that DNS lookup does not happen even after waiting for 30 mins.

pebrc self-assigned this on Sep 7, 2020
pebrc (Collaborator, Author) commented Sep 7, 2020

Potential solutions:

  1. We now have a flag in the operator that indicates IPv6 or IPv4, which means we can go back to an IP-based publish host that is fed from the Downward API and thus always correct. Thanks to the flag we can bracket the address as needed in the config template when running IPv6 (my favourite solution; see the sketch after this list).
  2. We could leave things as they are, try to detect the condition where the pod cannot start up due to an incorrect publish address, and simply delete the pod once more. This would likely lead to a correct resolution of the hostname, but there is no guarantee, so it could take a couple of cycles until it comes up (sounds messy to me. Also, how do we detect this condition? Parsing logs?)
  3. We post-process the elasticsearch.yml file in the InitContainer and only move the corrected version to its final destination. (This just pushes the same logic from 1. into the Pod and would mean replicating IP-family detection in a shell script once more. I don't see an advantage over 1.)
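
A minimal sketch of what option 1 could look like. This is illustrative only, not the actual operator code: the ipv6 flag and the podIPEnvVar/publishHost names are assumptions; EnvPodIP and the ${...} substitution are taken from the existing config template quoted further below.

// Option 1 sketch: inject the pod IP via the Downward API and bracket it
// for IPv6 when templating network.publish_host (illustrative only).
package esconfig

import corev1 "k8s.io/api/core/v1"

const EnvPodIP = "POD_IP" // assumed name of the injected env var

// podIPEnvVar exposes the pod's IP to the Elasticsearch container.
func podIPEnvVar() corev1.EnvVar {
	return corev1.EnvVar{
		Name: EnvPodIP,
		ValueFrom: &corev1.EnvVarSource{
			FieldRef: &corev1.ObjectFieldSelector{FieldPath: "status.podIP"},
		},
	}
}

// publishHost returns the value templated into network.publish_host;
// an IPv6 literal needs to be wrapped in square brackets.
func publishHost(ipv6 bool) string {
	if ipv6 {
		return "[${" + EnvPodIP + "}]"
	}
	return "${" + EnvPodIP + "}"
}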

barkbay (Contributor) commented Sep 8, 2020

👍 for the first solution; the second one seems very "hacky" and the last one a bit more involved.

A slightly different version of the solution is to "compose" the env variable depending on the context, but I don't think it brings any additional benefit (see the sketch after the snippet below):

  1. Detect if IPv6 is in use (by a flag, or automatically according to the IP of the operator's Pod?)
  2. Declare an env variable with the Pod IP
  3. Create a new env variable according to the IPv6/IPv4 context:
  • If IPv6:
          env:
          - name: TMP_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          - name: POD_IP
            value: "[$(TMP_POD_IP)]"
  • If IPv4:
          env:
          - name: POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
  4. Use POD_IP as before:
                 ...
		// derive IP dynamically from the pod IP, injected as env var
		esv1.NetworkPublishHost: "${" + EnvPodIP + "}",
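
For reference, a rough sketch of this composition (podIPEnvVars is a hypothetical helper, not operator code); it relies on Kubernetes expanding $(TMP_POD_IP) in the value of an env variable declared later in the list:

// Env-var composition sketch: POD_IP carries the bracketed form for IPv6
// and the plain Downward API value for IPv4 (illustrative only).
package esconfig

import corev1 "k8s.io/api/core/v1"

func podIPEnvVars(ipv6 bool) []corev1.EnvVar {
	fromPodIP := &corev1.EnvVarSource{
		FieldRef: &corev1.ObjectFieldSelector{FieldPath: "status.podIP"},
	}
	if !ipv6 {
		return []corev1.EnvVar{{Name: "POD_IP", ValueFrom: fromPodIP}}
	}
	return []corev1.EnvVar{
		{Name: "TMP_POD_IP", ValueFrom: fromPodIP},
		{Name: "POD_IP", Value: "[$(TMP_POD_IP)]"},
	}
}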

sebgl (Contributor) commented Sep 8, 2020

the test failure indicates that DNS lookup does not happen even after waiting for 30 mins

This sounds like either:

  • a bug in Elasticsearch
  • a setting in Elasticsearch or in the JVM that we overlooked

In the meantime, +1 on going back to an IP-based publish host, but I think this deserves some investigation.

pebrc (Collaborator, Author) commented Sep 8, 2020

the test failure indicates that DNS lookup does not happen even after waiting for 30 mins

This sounds like either:

  • a bug in Elasticsearch
  • a setting in Elasticsearch or in the JVM that we overlooked

In the meantime, +1 on going back to an IP-based publish host, but I think this deserves some investigation.

Agreed, this definitely needs some further investigation. I am preparing a PR that moves us back to an IP-based publish host.

I wonder if the node's own publish_host is a bit of a special case, as it is probably fair to assume that its own IP address should be routable at startup.

pebrc (Collaborator, Author) commented Sep 8, 2020

I got some feedback from the Elasticsearch team that this issue might be related to elastic/elasticsearch#49795.

The symptoms there are slightly different, but I think that might be because 127.0.1.1 does not accept connections while still existing in principle.

charith-elastic (Contributor) commented Sep 8, 2020

Drive-by comment: would it help to use the full DNS name <pod>.<service>.svc.cluster.local because that's apparently added to the /etc/hosts by Kubernetes and would short-circuit the DNS lookup? [Unverified]

pebrc (Collaborator, Author) commented Sep 8, 2020

Drive-by comment: would it help to use the full DNS name <pod>.<service>.svc.cluster.local because that's apparently added to the /etc/hosts by Kubernetes and would short-circuit the DNS lookup? [Unverified]

You are right:

127.0.0.1	localhost
::1	localhost ip6-localhost ip6-loopback
fe00::0	ip6-localnet
fe00::0	ip6-mcastprefix
fe00::1	ip6-allnodes
fe00::2	ip6-allrouters
10.73.50.53	elasticsearch-sample-es-default-0.elasticsearch-sample-es-default.default.svc.cluster.local	elasticsearch-sample-es-default-0

That is a good idea. We would need to parameterize the cluster domain as well, as we cannot assume it is always cluster.local, I believe.
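
A hedged sketch of that, with the cluster domain as a parameter (podFQDN is a hypothetical helper, not operator code):

// Build the pod's FQDN with a configurable cluster domain; matches the
// /etc/hosts entry above when the domain is "cluster.local".
package esconfig

import "fmt"

func podFQDN(podName, headlessService, namespace, clusterDomain string) string {
	if clusterDomain == "" {
		clusterDomain = "cluster.local" // default only, cannot be assumed in general
	}
	return fmt.Sprintf("%s.%s.%s.svc.%s", podName, headlessService, namespace, clusterDomain)
}

For the entry above, podFQDN("elasticsearch-sample-es-default-0", "elasticsearch-sample-es-default", "default", "") yields elasticsearch-sample-es-default-0.elasticsearch-sample-es-default.default.svc.cluster.local.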

barkbay (Contributor) commented Sep 8, 2020

Drive-by comment: would it help to use the full DNS name <pod>.<service>.svc.cluster.local because that's apparently added to the /etc/hosts by Kubernetes and would short-circuit the DNS lookup? [Unverified]

The domain name for a cluster is not guaranteed to be cluster.local; it can be changed by the cluster admin.

pebrc (Collaborator, Author) commented Sep 8, 2020

We could do something like net.LookupCNAME("kubernetes.default.svc") to find out the cluster domain and then follow @charith-elastic's proposal. I would be curious what your opinions are regarding reverting to the IP vs. this approach.
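
Roughly something like the following sketch; it assumes the operator runs in-cluster and that the canonical name returned for kubernetes.default.svc ends in the cluster domain, which is exactly the assumption questioned further down:

// Derive the cluster domain from the canonical name of the API server
// service (sketch only; see the caveats discussed below).
package clusterdomain

import (
	"fmt"
	"net"
	"strings"
)

func detectClusterDomain() (string, error) {
	cname, err := net.LookupCNAME("kubernetes.default.svc")
	if err != nil {
		return "", err
	}
	// e.g. "kubernetes.default.svc.cluster.local." -> "cluster.local"
	const prefix = "kubernetes.default.svc."
	if !strings.HasPrefix(cname, prefix) {
		return "", fmt.Errorf("unexpected canonical name %q", cname)
	}
	return strings.TrimSuffix(strings.TrimPrefix(cname, prefix), "."), nil
}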

charith-elastic (Contributor) commented:

hostname -f outputs the FQDN of the pod. I found this library that claims to emulate how hostname -f works. If it seems reliable enough, it could be an option as well.

After looking at the options available in the pod spec for customizing various aspects of the host name, I am now having serious doubts about how to actually determine the correct host name for a pod. 🤷‍♂️

barkbay (Contributor) commented Sep 9, 2020

We could do something like net.LookupCNAME("kubernetes.default.svc") to find out the cluster domain and then follow @charith-elastic's proposal. I would be curious what your opinions are regarding reverting to the IP vs. this approach.

Looking at the godoc:

// A canonical name is the final name after following zero
// or more CNAME records.
// LookupCNAME does not return an error if host does not
// contain DNS "CNAME" records, as long as host resolves to
// address records.

Just trying to understand in which cases it will not work: it means we are assuming here that kubernetes.default.svc is always a CNAME with kubernetes.default.svc.cluster.local as the A record? So if kubernetes.default.svc.cluster.local is itself a CNAME, it might not work?

barkbay (Contributor) commented Sep 10, 2020

I managed to reproduce and dump DNS requests:

Context:

  • Two DNS servers running at 10.0.129.3 and 10.0.128.7
  • ECK is running at 10.0.129.7
  • test-mutation-http-to-https-9mrj-es-masterdata-1 is recreated (i.e. Pod deletion for upgrade), IP moved from 10.0.129.8 to 10.0.129.9

While restarting, test-mutation-http-to-https-9mrj-es-masterdata-1 resolves its own hostname to the wrong IP:

{
	"type": "server",
	"timestamp": "2020-09-10T10:52:58,652Z",
	"level": "INFO",
	"component": "o.e.t.TransportService",
	"cluster.name": "test-mutation-http-to-https-9mrj",
	"node.name": "test-mutation-http-to-https-9mrj-es-masterdata-1",
	"message": "publish_address {test-mutation-http-to-https-9mrj-es-masterdata-1.test-mutation-http-to-https-9mrj-es-masterdata/10.0.129.8:9300}, bound_addresses {0.0.0.0:9300}"
}

Correct IP is 10.0.129.9:

pod/test-mutation-http-to-https-9mrj-es-masterdata-1   1/1     Running   0          93m   10.0.129.9   gke-michael-dev-2-default-pool-74955f93-m3jc   <none>           <none>

DNS requests show the wrong answer comes from one of the DNS servers:

[screenshot: dump of DNS requests showing the stale answer returned by one of the DNS servers]
