TestMutationHTTPToHTTPS is flaky #3723

Closed
pebrc opened this issue Sep 7, 2020 · 14 comments · Fixed by #3726
Labels
>bug Something isn't working

Comments

pebrc (Collaborator) commented Sep 7, 2020

https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-master/575/consoleFull

=== RUN   TestMutationHTTPSToHTTP/All_expected_Pods_should_eventually_be_ready#01
Retries (30m0s timeout): ...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
    utils.go:84: 
        	Error Trace:	utils.go:84
        	Error:      	Received unexpected error:
        	            	pod test-mutation-https-to-http-vslx-es-masterdata-0 was not upgraded (yet?) to match the expected specification
        	Test:       	TestMutationHTTPSToHTTP/All_expected_Pods_should_eventually_be_ready#01
{"log.level":"error","@timestamp":"2020-09-07T12:03:12.939Z","message":"stopping early","service.version":"0.0.0-SNAPSHOT+00000000","service.type":"eck","ecs.version":"1.4.0","error":"test failure","error.stack_trace":"github.com/elastic/cloud-on-k8s/test/e2e/test.StepList.RunSequential\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:43\ngithub.com/elastic/cloud-on-k8s/test/e2e/test.RunMutations\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/run_mutation.go:38\ngithub.com/elastic/cloud-on-k8s/test/e2e/test.RunMutation\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/run_mutation.go:80\ngithub.com/elastic/cloud-on-k8s/test/e2e/es.RunESMutation\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/es/mutation_test.go:299\ngithub.com/elastic/cloud-on-k8s/test/e2e/es.TestMutationHTTPSToHTTP\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/es/mutation_test.go:48\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1108"}
    --- FAIL: TestMutationHTTPSToHTTP/All_expected_Pods_should_eventually_be_ready#01 (1800.00s)

Looks like we have the first case of DNS-related issues at hand, after changing to a DNS-based publish_host.

2020-09-07T11:33:54,670Z publish_address {test-mutation-https-to-http-vslx-es-masterdata-2.test-mutation-https-to-http-vslx-es-masterdata/10.115.49.72:9300}, bound_addresses {0.0.0.0:9300}

It seems Elasticsearch is coming up after the update with its hostname resolving to the old IP as the publish_host, presumably because the old value was still cached in k8s DNS?

It then seems to try to connect to itself (all logs are from test-mutation-https-to-http-vslx-es-masterdata-2) and keeps failing for 30 minutes.

"Caused by: java.io.IOException: No route to host: test-mutation-https-to-http-vslx-es-masterdata-2.test-mutation-https-to-http-vslx-es-masterdata/10.115.49.72:9300",

on test-mutation-https-to-http-vslx-es-masterdata-2. New IP is 10.115.49.73

pebrc added the >bug Something isn't working label on Sep 7, 2020
pebrc (Collaborator, Author) commented Sep 7, 2020

Working theory: I am assuming that the DNS lookup for the node's own transport address is done only once, when the node starts up; once it has the incorrect IP in that BoundTransportAddress, it will never update it.

pebrc (Collaborator, Author) commented Sep 7, 2020

I was able to reproduce the exact same symptoms by manipulating an Elasticsearch pod to start with the publish_host set to its "old" IP address. Of course this experiment lacks the indirection of the DNS hostname and the theoretical possibility of looking up the hostname again after some time. But I guess the test failure indicates that DNS lookup does not happen even after waiting for 30 mins.

pebrc self-assigned this on Sep 7, 2020
pebrc (Collaborator, Author) commented Sep 7, 2020

Potential solutions:

  1. We now have a flag in the operator that indicates IPv6 or IPv4, which means we can go back to an IP-based publish host that is fed from the Downward API and thus always correct. Thanks to the flag we can bracket the address as needed in the config template when running IPv6 (my favourite solution; see the sketch after this list).
  2. We could leave things as they are, try to detect the condition where the pod cannot start up due to an incorrect publish address, and simply delete the pod once more. This would likely lead to a correct resolution of the hostname, but there is no guarantee, so it could take a couple of cycles until it comes up (sounds messy to me. Also, how do we detect this condition? Parsing logs?)
  3. We post-process the elasticsearch.yml file in the InitContainer and only move the corrected version to its final destination. (This just pushes the same logic from 1. into the Pod and would mean replicating IP-family detection in a shell script once more. I don't see an advantage over 1.)
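
A minimal sketch of what option 1 could look like. This is illustrative only, not the actual operator code: the ipv6 flag and the podIPEnvVar/publishHost names are assumptions; EnvPodIP and the ${...} substitution are taken from the existing config template quoted further below.

// Option 1 sketch: inject the pod IP via the Downward API and bracket it
// for IPv6 when templating network.publish_host (illustrative only).
package esconfig

import corev1 "k8s.io/api/core/v1"

const EnvPodIP = "POD_IP" // assumed name of the injected env var

// podIPEnvVar exposes the pod's IP to the Elasticsearch container.
func podIPEnvVar() corev1.EnvVar {
	return corev1.EnvVar{
		Name: EnvPodIP,
		ValueFrom: &corev1.EnvVarSource{
			FieldRef: &corev1.ObjectFieldSelector{FieldPath: "status.podIP"},
		},
	}
}

// publishHost returns the value templated into network.publish_host;
// an IPv6 literal needs to be wrapped in square brackets.
func publishHost(ipv6 bool) string {
	if ipv6 {
		return "[${" + EnvPodIP + "}]"
	}
	return "${" + EnvPodIP + "}"
}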

barkbay (Contributor) commented Sep 8, 2020

👍 for the first solution; the second one seems very "hacky" and the last one a bit more involved.

A slightly different version of the solution is to "compose" the env variable depending on the context, but I don't think it brings any additional benefit (see the sketch after the snippet below):

  1. Detect if IPv6 is in use (by a flag, or automatically according to the IP of the operator's Pod?)
  2. Declare an env variable with the Pod IP
  3. Create a new env variable according to the IPv6/IPv4 context:
  • If IPv6:
          env:
          - name: TMP_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          - name: POD_IP
            value: "[$(TMP_POD_IP)]"
  • If IPv4:
          env:
          - name: POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
  4. Use POD_IP as before:
                 ...
		// derive IP dynamically from the pod IP, injected as env var
		esv1.NetworkPublishHost: "${" + EnvPodIP + "}",
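
For reference, a rough sketch of this composition (podIPEnvVars is a hypothetical helper, not operator code); it relies on Kubernetes expanding $(TMP_POD_IP) in the value of an env variable declared later in the list:

// Env-var composition sketch: POD_IP carries the bracketed form for IPv6
// and the plain Downward API value for IPv4 (illustrative only).
package esconfig

import corev1 "k8s.io/api/core/v1"

func podIPEnvVars(ipv6 bool) []corev1.EnvVar {
	fromPodIP := &corev1.EnvVarSource{
		FieldRef: &corev1.ObjectFieldSelector{FieldPath: "status.podIP"},
	}
	if !ipv6 {
		return []corev1.EnvVar{{Name: "POD_IP", ValueFrom: fromPodIP}}
	}
	return []corev1.EnvVar{
		{Name: "TMP_POD_IP", ValueFrom: fromPodIP},
		{Name: "POD_IP", Value: "[$(TMP_POD_IP)]"},
	}
}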

sebgl (Contributor) commented Sep 8, 2020

the test failure indicates that DNS lookup does not happen even after waiting for 30 mins

This sounds like either:

  • a bug in Elasticsearch
  • a setting in Elasticsearch or in the JVM that we overlooked

In the meantime, +1 on going back to an IP-based publish host, but I think this deserves some investigation.

pebrc (Collaborator, Author) commented Sep 8, 2020

the test failure indicates that DNS lookup does not happen even after waiting for 30 mins

This sounds like either:

  • a bug in Elasticsearch
  • a setting in Elasticsearch or in the JVM that we overlooked

In the meantime, +1 on going back to an IP-based publish host, but I think this deserves some investigation.

Agreed, this definitely needs some further investigation. I am preparing a PR that moves us back to an IP-based publish host.

I wonder if the node's own publish_host is a bit of a special case, as it is probably fair to assume that its own IP address should be routable at startup.

pebrc (Collaborator, Author) commented Sep 8, 2020

I got some feedback from the Elasticsearch team that this issue might be related to elastic/elasticsearch#49795.

The symptoms there are slightly different, but I think that might be because 127.0.1.1 does not accept connections while still existing in principle.

charith-elastic (Contributor) commented Sep 8, 2020

Drive-by comment: would it help to use the full DNS name <pod>.<service>.svc.cluster.local because that's apparently added to the /etc/hosts by Kubernetes and would short-circuit the DNS lookup? [Unverified]

pebrc (Collaborator, Author) commented Sep 8, 2020

Drive-by comment: would it help to use the full DNS name <pod>.<service>.svc.cluster.local because that's apparently added to the /etc/hosts by Kubernetes and would short-circuit the DNS lookup? [Unverified]

You are right:

127.0.0.1	localhost
::1	localhost ip6-localhost ip6-loopback
fe00::0	ip6-localnet
fe00::0	ip6-mcastprefix
fe00::1	ip6-allnodes
fe00::2	ip6-allrouters
10.73.50.53	elasticsearch-sample-es-default-0.elasticsearch-sample-es-default.default.svc.cluster.local	elasticsearch-sample-es-default-0

That is a good idea. We would need to parameterize the cluster domain as well, as we cannot assume it is always cluster.local, I believe.
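
A hedged sketch of that, with the cluster domain as a parameter (podFQDN is a hypothetical helper, not operator code):

// Build the pod's FQDN with a configurable cluster domain; matches the
// /etc/hosts entry above when the domain is "cluster.local".
package esconfig

import "fmt"

func podFQDN(podName, headlessService, namespace, clusterDomain string) string {
	if clusterDomain == "" {
		clusterDomain = "cluster.local" // default only, cannot be assumed in general
	}
	return fmt.Sprintf("%s.%s.%s.svc.%s", podName, headlessService, namespace, clusterDomain)
}

For the entry above, podFQDN("elasticsearch-sample-es-default-0", "elasticsearch-sample-es-default", "default", "") yields elasticsearch-sample-es-default-0.elasticsearch-sample-es-default.default.svc.cluster.local.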

barkbay (Contributor) commented Sep 8, 2020

Drive-by comment: would it help to use the full DNS name <pod>.<service>.svc.cluster.local because that's apparently added to the /etc/hosts by Kubernetes and would short-circuit the DNS lookup? [Unverified]

The domain name for a cluster is not guaranteed to be cluster.local; it can be changed by the cluster admin.

pebrc (Collaborator, Author) commented Sep 8, 2020

We could do something like net.LookupCNAME("kubernetes.default.svc") to find out the cluster domain and then follow @charith-elastic's proposal. I would be curious what your opinions are regarding reverting to the IP vs. this approach.
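
Roughly something like the following sketch; it assumes the operator runs in-cluster and that the canonical name returned for kubernetes.default.svc ends in the cluster domain, which is exactly the assumption questioned further down:

// Derive the cluster domain from the canonical name of the API server
// service (sketch only; see the caveats discussed below).
package clusterdomain

import (
	"fmt"
	"net"
	"strings"
)

func detectClusterDomain() (string, error) {
	cname, err := net.LookupCNAME("kubernetes.default.svc")
	if err != nil {
		return "", err
	}
	// e.g. "kubernetes.default.svc.cluster.local." -> "cluster.local"
	const prefix = "kubernetes.default.svc."
	if !strings.HasPrefix(cname, prefix) {
		return "", fmt.Errorf("unexpected canonical name %q", cname)
	}
	return strings.TrimSuffix(strings.TrimPrefix(cname, prefix), "."), nil
}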

charith-elastic (Contributor) commented:

hostname -f outputs the FQDN of the pod. I found this library that claims to emulate how hostname -f works. If it seems reliable enough, it could be an option as well.

After looking at the options available in the pod spec for customizing various aspects of the host name, I am now having serious doubts about how to actually determine the correct host name for a pod. 🤷‍♂️

barkbay (Contributor) commented Sep 9, 2020

We could do something like net.LookupCNAME("kubernetes.default.svc") to find out the cluster domain and then follow @charith-elastic's proposal. I would be curious what your opinions are regarding reverting to the IP vs. this approach.

Looking at the godoc:

// A canonical name is the final name after following zero
// or more CNAME records.
// LookupCNAME does not return an error if host does not
// contain DNS "CNAME" records, as long as host resolves to
// address records.

Just trying to understand in which cases it will not work: it means we are assuming here that kubernetes.default.svc is always a CNAME with kubernetes.default.svc.cluster.local as the A record? So if kubernetes.default.svc.cluster.local is itself a CNAME, it might not work?

barkbay (Contributor) commented Sep 10, 2020

I managed to reproduce and dump DNS requests:

Context:

  • Two DNS servers running at 10.0.129.3 and 10.0.128.7
  • ECK is running at 10.0.129.7
  • test-mutation-http-to-https-9mrj-es-masterdata-1 is recreated (i.e. Pod deletion for upgrade), IP moved from 10.0.129.8 to 10.0.129.9

While restarting, test-mutation-http-to-https-9mrj-es-masterdata-1 resolves its own hostname to the wrong IP:

{
	"type": "server",
	"timestamp": "2020-09-10T10:52:58,652Z",
	"level": "INFO",
	"component": "o.e.t.TransportService",
	"cluster.name": "test-mutation-http-to-https-9mrj",
	"node.name": "test-mutation-http-to-https-9mrj-es-masterdata-1",
	"message": "publish_address {test-mutation-http-to-https-9mrj-es-masterdata-1.test-mutation-http-to-https-9mrj-es-masterdata/10.0.129.8:9300}, bound_addresses {0.0.0.0:9300}"
}

Correct IP is 10.0.129.9:

pod/test-mutation-http-to-https-9mrj-es-masterdata-1   1/1     Running   0          93m   10.0.129.9   gke-michael-dev-2-default-pool-74955f93-m3jc   <none>           <none>

DNS requests show the wrong answer comes from one of the DNS servers:

[screenshot: dump of DNS requests showing the stale answer returned by one of the DNS servers]
