TestMutationHTTPToHTTPS is flaky #3723
Comments
Working theory: I am assuming that the DNS lookup for the node's own transport is done only once when the node starts up, once it has the incorrect IP in that |
I was able to reproduce the exact same symptoms by manipulating an Elasticsearch pod to start with the publish_host set to its "old" IP address. Of course this experiment lacks the indirection of the DNS hostname and the theoretical possibility of looking up the hostname again after some time. But I guess the test failure indicates that DNS lookup does not happen even after waiting for 30 mins. |
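A minimal sketch of how the working theory could be checked from inside a pod: resolve the node's own hostname and see whether any returned address matches the pod's current IP. The function name `publishHostIsStale`, the `resolver` type, and the IP addresses below are hypothetical and only serve the illustration; in a real check `net.LookupHost` could be passed as the resolver.

```go
package main

import (
	"fmt"
	"net"
)

// resolver abstracts the DNS lookup so the check can run without a real DNS server.
// net.LookupHost has a compatible signature.
type resolver func(host string) ([]string, error)

// publishHostIsStale reports whether none of the addresses that host resolves to
// matches the pod's current IP (hypothetical helper, not part of ECK).
func publishHostIsStale(resolve resolver, host, podIP string) (bool, error) {
	want := net.ParseIP(podIP)
	if want == nil {
		return false, fmt.Errorf("invalid pod IP %q", podIP)
	}
	addrs, err := resolve(host)
	if err != nil {
		return false, fmt.Errorf("resolving %q: %w", host, err)
	}
	for _, a := range addrs {
		if ip := net.ParseIP(a); ip != nil && ip.Equal(want) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// Fake resolver that keeps answering with the "old" IP (made-up address).
	fake := func(string) ([]string, error) { return []string{"192.0.2.10"}, nil }
	stale, err := publishHostIsStale(fake, "example-es-masterdata-1.example-es-masterdata", "192.0.2.20")
	fmt.Println(stale, err)
}
```

Injecting the resolver keeps the sketch deterministic; it says nothing about how long a stale record might survive in the cluster DNS.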
Potential solutions:
|
👍 for the first solution, the second one seems very "hacky" and the last one a bit more involved. A slightly different version of the solution is to "compose" the env variable depending on the context, but I don't think it brings any additional benefit:
env:
  - name: TMP_POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  - name: POD_IP
    value: "[$(TMP_POD_IP)]"

env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
...
// derive IP dynamically from the pod IP, injected as env var
esv1.NetworkPublishHost: "${" + EnvPodIP + "}", |
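For reference, a self-contained sketch of what that Go fragment evaluates to. It assumes `EnvPodIP` is `"POD_IP"` (matching the env var in the YAML) and uses a local constant in place of `esv1.NetworkPublishHost`, assumed here to be the Elasticsearch setting name `network.publish_host`. Elasticsearch substitutes `${...}` placeholders in its configuration with environment variable values at startup, so the node would publish whatever IP the pod has when it starts.

```go
package main

import "fmt"

const (
	// EnvPodIP is assumed to match the env var name used in the pod spec above.
	EnvPodIP = "POD_IP"
	// networkPublishHost stands in for esv1.NetworkPublishHost (assumed value).
	networkPublishHost = "network.publish_host"
)

// publishHostSettings returns the config entry that makes Elasticsearch read
// its publish host from the pod IP env var via ${...} substitution.
func publishHostSettings() map[string]string {
	return map[string]string{
		networkPublishHost: "${" + EnvPodIP + "}",
	}
}

func main() {
	for k, v := range publishHostSettings() {
		fmt.Printf("%s: %s\n", k, v)
	}
}
```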
This sounds like either:
In the meantime, +1 for going back to an IP-based publish host, but I think this deserves some investigation. |
Agreed, this definitely needs some further investigation. I am preparing a PR that moves us back to an IP-based publish host. I wonder if the own |
I got some feedback from the Elasticsearch team that this issue might be related to elastic/elasticsearch#49795. The symptoms are slightly different but I think this might be due to |
Drive-by comment: would it help to use the full DNS name |
You are right:
That is a good idea. We would need to parameterize the cluster domain as well, since we cannot assume it is always |
The domain name for a cluster is not guaranteed to be |
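A rough sketch of what composing the full DNS name with a parameterized cluster domain could look like. It assumes the usual Kubernetes naming scheme for pods behind a headless service, `<pod>.<service>.<namespace>.svc.<cluster-domain>`; the helper name `podFQDN`, the namespace, and the example domain are illustrative only, and the cluster domain would have to come from operator configuration.

```go
package main

import (
	"fmt"
	"strings"
)

// podFQDN builds the fully qualified DNS name of a pod behind a headless
// service. clusterDomain is a parameter because it differs between clusters
// (hypothetical helper, not part of ECK).
func podFQDN(podName, serviceName, namespace, clusterDomain string) string {
	return strings.Join([]string{podName, serviceName, namespace, "svc", clusterDomain}, ".")
}

func main() {
	// "default" and "cluster.local" are example values, not taken from the failing test.
	fmt.Println(podFQDN(
		"test-mutation-http-to-https-9mrj-es-masterdata-1",
		"test-mutation-http-to-https-9mrj-es-masterdata",
		"default",
		"cluster.local",
	))
}
```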
We could do something |
After looking at the options available in the pod spec for customizing various aspects of the host name, I am now having serious doubts about how to actually determine the correct host name for a pod. 🤷‍♂️ |
Looking at the godoc:
Just trying to understand in which cases it will not work: It means that we are assuming here that |
I managed to reproduce and dump DNS requests: Context :
While restarting
{
  "type": "server",
  "timestamp": "2020-09-10T10:52:58,652Z",
  "level": "INFO",
  "component": "o.e.t.TransportService",
  "cluster.name": "test-mutation-http-to-https-9mrj",
  "node.name": "test-mutation-http-to-https-9mrj-es-masterdata-1",
  "message": "publish_address {test-mutation-http-to-https-9mrj-es-masterdata-1.test-mutation-http-to-https-9mrj-es-masterdata/10.0.129.8:9300}, bound_addresses {0.0.0.0:9300}"
}
Correct IP is
DNS requests show the wrong answer comes from one of the DNS servers: |
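To compare the published address in a log line like the one above with the pod's actual IP, the `publish_address` part of the message can be pulled apart. This is a sketch only: `parsePublishAddress` is a hypothetical helper, and the pattern assumes the `{host/ip:port}` form with an IPv4 address seen in this log; it would not match an IPv6 address or a message where only an IP is logged.

```go
package main

import (
	"fmt"
	"regexp"
)

// publishAddrRe matches "publish_address {host/ip:port}" (IPv4 only).
var publishAddrRe = regexp.MustCompile(`publish_address \{([^/}]+)/([^:}]+):(\d+)\}`)

// publishAddress holds the pieces of a published transport address.
type publishAddress struct {
	Host, IP, Port string
}

// parsePublishAddress extracts host, IP, and port from a TransportService
// log message; ok is false if the message does not match the assumed form.
func parsePublishAddress(msg string) (addr publishAddress, ok bool) {
	m := publishAddrRe.FindStringSubmatch(msg)
	if m == nil {
		return publishAddress{}, false
	}
	return publishAddress{Host: m[1], IP: m[2], Port: m[3]}, true
}

func main() {
	msg := "publish_address {test-mutation-http-to-https-9mrj-es-masterdata-1.test-mutation-http-to-https-9mrj-es-masterdata/10.0.129.8:9300}, bound_addresses {0.0.0.0:9300}"
	if addr, ok := parsePublishAddress(msg); ok {
		fmt.Printf("host=%s ip=%s port=%s\n", addr.Host, addr.IP, addr.Port)
	}
}
```

The extracted IP can then be compared with the value of `status.podIP` for the restarted pod.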
https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-master/575/consoleFull
Looks like we have the first case of DNS related issues at hand, after changing to DNS based publish_host.
It seems Elasticsearch is coming up after the update with its hostname resolving to the old IP as the publish_host, presumably because the old value was still cached in k8s DNS?
It then seems to try to connect to itself (all logs are from test-mutation-https-to-http-vslx-es-masterdata-2) and keeps failing for 30 minutes.
on test-mutation-https-to-http-vslx-es-masterdata-2. New IP is 10.115.49.73