urlsEqual might wrongly skip resolving DNS #15062
Thanks Ben, I think we need to discuss what behavior we want to fix. Clearly this is a regression in 3.5 (and 3.4?), but implicitly blocking the bootstrap config on resolving a DNS entry in an equality check is also not a great idea. I would rather have the config.go explicitly look up the important URLs and fail with a meaningful error message when they can't be resolved (akin to the init container the hypershift team added in openshift/hypershift#1985). Then validate whether the URLs and their IP configs are consistent and equal. Potentially with a cmd flag to turn off that check. #7798 does have some merit after all, there are implications on cluster membership as well.
In case I misunderstood anything, could you clarify this? How did ...
I only have fragments of the whole issue myself, I'll come back with a reproducer in a test case once I receive more information later today. As far as I understood, the issue stems from the fact that the URLs somehow match this logic: My assumption was that the wildcard is ignored, thus matching only on the host, which is equal, which in turn will not resolve the DNS anymore. That previously would've blocked for a while to resolve, thus "preventing" etcd from starting.
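To make that assumption concrete, here is a minimal Go sketch of the kind of comparison being described; this is not etcd's actual urlsEqual code, and the URLs in main are made up. The point is that string-equal hosts short-circuit the check, so DNS is never consulted on that path:

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// naiveURLsEqual is a sketch, not etcd's actual urlsEqual: scheme- and
// host-equal strings short-circuit the comparison, so DNS is never consulted
// on that path; resolution only happens when the host strings differ.
func naiveURLsEqual(a, b string) (bool, error) {
	ua, err := url.Parse(a)
	if err != nil {
		return false, err
	}
	ub, err := url.Parse(b)
	if err != nil {
		return false, err
	}
	// Fast path: identical scheme and host, no DNS lookup performed.
	if ua.Scheme == ub.Scheme && ua.Host == ub.Host {
		return true, nil
	}
	// Slow path: resolve both hosts and compare the first resolved address
	// (crude, but enough for the sketch).
	ipsA, err := net.LookupHost(ua.Hostname())
	if err != nil {
		return false, err
	}
	ipsB, err := net.LookupHost(ub.Hostname())
	if err != nil {
		return false, err
	}
	return len(ipsA) > 0 && len(ipsB) > 0 && ipsA[0] == ipsB[0], nil
}

func main() {
	// Made-up URLs: the strings match, so the fast path returns true
	// without ever waiting for the name to become resolvable.
	eq, err := naiveURLsEqual("https://etcd-0.etcd.svc:2380", "https://etcd-0.etcd.svc:2380")
	fmt.Println(eq, err)
}
```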
It doesn't seem related to #13224. The function ... Anyway, please clarify the issue and provide more detailed info first. FYI. https://github.com/ahrtr/etcd-issues/blob/master/docs/troubleshooting/sanity_check_and_collect_basic_info.md
The logic is not the issue; it's the ordering that breaks an implicit assumption users had about the startup of etcd.
Here's how the team runs etcd inside k8s:
If you step through it with a debugger, you notice the path in config.go trying to validate the peer URLs, at config.go line 255 (commit 6200b22).
In the above case, the URLs that are compared are equal, which is correct; previously the method blocked on DNS resolution at this point. That then causes errors in listener_tls.go that look like this:
Which led us to believe it was an issue with the wildcard matching - apologies for the red herring. It's merely a difference in waiting for DNS resolution.
Here's my attempt to make this a bit more explicit during bootstrap: #15064. The startup would then fail with:
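For illustration, the shape of such an explicit bootstrap-time check could look roughly like this. It is only a sketch of the idea, not the actual change in #15064; the package, function name, and signature are made up:

```go
package bootstrapcheck

import (
	"context"
	"fmt"
	"net"
	"net/url"
)

// resolveOrFail resolves the host of every configured URL up front and fails
// with a meaningful error instead of silently skipping resolution.
// Name and signature are illustrative, not etcd's real API.
func resolveOrFail(ctx context.Context, rawURLs []string) error {
	for _, raw := range rawURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return fmt.Errorf("invalid URL %q: %w", raw, err)
		}
		if _, err := net.DefaultResolver.LookupHost(ctx, u.Hostname()); err != nil {
			return fmt.Errorf("cannot resolve host %q from URL %q: %w", u.Hostname(), raw, err)
		}
	}
	return nil
}
```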
Have you confirmed that #15064 can resolve your issue? Based on the error message above in #15062 (comment), the etcd server rejected the client connection because the client hostname does not match the DNSName in the certificate. FYI. https://etcd.io/docs/v3.5/op-guide/security/#notes-for-tls-authentication Have you changed anything recently (e.g. client certificate)? If not, does it mean that you probably just need to wait some time for the reverse lookup to work?
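For readers following along, the kind of check that produces this error can be pictured roughly as follows. This is a simplified standard-library sketch, not the actual code in listener_tls.go:

```go
package tlscheck

import (
	"crypto/x509"
	"fmt"
	"net"
)

// hostnameAllowed is a simplified sketch of a hostname-based client check:
// reverse-resolve the client's IP and accept it only if one of the resolved
// names is covered by the certificate's DNS SANs. Illustrative only; etcd's
// actual logic lives in listener_tls.go.
func hostnameAllowed(cert *x509.Certificate, clientIP string) error {
	names, err := net.LookupAddr(clientIP)
	if err != nil || len(names) == 0 {
		return fmt.Errorf("reverse lookup of %s failed: %v", clientIP, err)
	}
	for _, name := range names {
		if cert.VerifyHostname(name) == nil {
			return nil
		}
	}
	return fmt.Errorf("client %s (%v) does not match any of DNSNames %v",
		clientIP, names, cert.DNSNames)
}
```

If the reverse lookup has not yet caught up with the new pod, a check of this shape keeps rejecting the connection with this kind of message, which would match the timing behavior described in the comments below.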
Did you see this error only for a short period after etcd starts, or all the time?
@csrwng can you quickly pitch in? You validated it yesterday with this PR with success.
@ahrtr The error occurs continuously for about 5 minutes and then eventually resolves itself. This looks to be a timing issue. If a new cluster member starts attempting to contact its peers without waiting ~2-3 secs for its name to be resolvable, then peers will start reporting the error above and will not accept traffic from the new member until after 5 minutes have passed. Having the check that ensures the name is resolvable before starting the member prevents the communication errors from the peer. I have confirmed that this change does fix the issue.
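The pre-start resolvability check being described follows a common wait-and-retry pattern. A rough sketch is below; the hostname and timeout are hypothetical, and this is not the actual init-container or #15064 code:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"time"
)

// waitForDNS retries resolving the member's own advertised hostname until it
// succeeds or the deadline expires. Hostname and timeout are illustrative.
func waitForDNS(host string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	for {
		if _, err := net.DefaultResolver.LookupHost(ctx, host); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("%s still not resolvable after %s", host, timeout)
		case <-time.After(500 * time.Millisecond):
		}
	}
}

func main() {
	// Hypothetical headless-service name; replace with the member's own name.
	if err := waitForDNS("etcd-0.etcd.default.svc.cluster.local", 30*time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```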
Do you mean etcd keeps being restarted (failing due to the error "...")? It looks like a DNS issue to me. Adding an init-container to make sure the service (e.g. ...) is resolvable... Just to double confirm... you only upgraded to etcd 3.5.6 and did not change anything else, then you ran into this issue? Have you changed anything else (e.g. client or peer certificate, or DNS settings)?
While I agree it's a DNS issue, the etcd behavior has changed with the patch release due to the backport. I'm not sure anyone else is relying on that, but it would still be good if the 3.5 releases could behave the same way.
No, the error is reported by an established cluster member as a new member tries to join. And yes, the error occurs for about 5 minutes, even though the DNS name was unresolvable for only about 2-3 seconds.
Yes, I created a couple of diagrams to better explain what is happening:
Thanks @csrwng for the feedback. But I am confused.
The two biggest concerns are:
In order to avoid too much back and forth, could you please clarify the issue first, and follow collect-basic-info to collect & provide the required info?
Sorry, that is my bad; I used the wrong error message in my diagram. It is "does not match any of DNSNames".
This is a cluster defined by a statefulset. All members start at roughly the same time. There is no local WAL file when members start, but how does it know if it's a new cluster or not?
The DNS name used for the member is that of a headless service associated with the statefulset. I imagine that after the pod starts, it takes some time for CoreDNS to make that pod's IP resolvable via the service's name.
Very good question. This is something happening in etcd code. It seems that if a lookup fails, the result is cached for that amount of time, but I don't know exactly how.
Will do.
It depends on the CLI flag ... If it has local WAL files, you should see the log at etcd/server/etcdserver/raft.go, lines 494 to 497 (commit 816c2e2).
If it doesn't have a local WAL file, you should see the log at etcd/server/etcdserver/raft.go, lines 529 to 534 (commit 816c2e2).
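In other words, the presence of WAL files is what distinguishes restarting an existing member from bootstrapping a fresh one. A rough sketch of that decision is below; etcd uses its own helpers in the files referenced above rather than this exact code, and the path is hypothetical:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// haveWAL reports whether the member already has write-ahead-log files on
// disk; if it does, etcd restarts the existing member, otherwise it
// bootstraps according to the configured cluster state.
func haveWAL(walDir string) bool {
	matches, err := filepath.Glob(filepath.Join(walDir, "*.wal"))
	return err == nil && len(matches) > 0
}

func main() {
	// Hypothetical data directory layout.
	fmt.Println(haveWAL("/var/lib/etcd/member/wal"))
}
```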
Any update on this?
I've had to turn my attention to more urgent issues. I will try to collect this info by end of next week.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
What happened?
See #13224 (comment)
The correct logic should be: