urlsEqual might wrongly skip resolving DNS #15062

Open
ahrtr opened this issue Jan 5, 2023 · 20 comments

Comments

@ahrtr
Member

ahrtr commented Jan 5, 2023

What happened?

See #13224 (comment)

The correct logic should be:

  1. If the URL doesn't contain a wildcard, keep the current behavior; otherwise follow netutil: add url comparison without resolver to URLStringsEqual #13224 (comment)
@ahrtr ahrtr added the type/bug label Jan 5, 2023
@ahrtr ahrtr changed the title urlsEqual might wrong skip resolving DNS urlsEqual might wrongly skip resolving DNS Jan 5, 2023
@tjungblu
Contributor

tjungblu commented Jan 5, 2023

Thanks Ben, I think we need to discuss what behavior we want to fix. Clearly this is a regression in 3.5 (and 3.4?), but implicitly blocking the bootstrap config on resolving a DNS entry in an equality check is also not a great idea.

I would rather have config.go explicitly look up the important URLs and fail with a meaningful error message when they can't be resolved (akin to the init container the hypershift team added in openshift/hypershift#1985). Then validate whether the URLs and their IP configs are consistent and equal. Potentially with a cmd flag to turn off that check.

#7798 does have some merit after all, there are implications on cluster membership as well.

@ahrtr
Member Author

ahrtr commented Jan 5, 2023

we've seen troubles with wildcard dns names matching the server name early

In case I misunderstood anything, could you clarify this? How did the wildcard DNS names end up matching the server name early? Do you have detailed steps to reproduce this issue?

@tjungblu
Contributor

tjungblu commented Jan 5, 2023

I only have fragments of the whole issue myself; I'll come back with a reproducer in a test case once I receive more information later today.

As far as I understood, the issue stems from the fact that the URLs somehow match this logic:
https://github.com/etcd-io/etcd/pull/14573/files#diff-669913e187d4dc726dcb9e01a5666ae0562549545138d25d2c8fc2ceb3e23d30R151-R165

My assumption was that the wildcard is ignored, so the comparison matches only on the host (which is equal), which in turn no longer resolves the DNS. Previously that resolution would have blocked for a while, thus "preventing" etcd from starting.

@ahrtr
Member Author

ahrtr commented Jan 5, 2023

It doesn't seem related to #13224. The function urlsEqual just compares two slices of URLs, and skips resolving DNS if they are deeply equal. The logic seems simple and correct to me.
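For reference, a minimal Go sketch of that comparison idea (illustrative only, not the actual netutil code): if the two sorted URL slices are already deeply equal, DNS resolution is skipped entirely; only unequal slices would fall through to the blocking resolver step, which is omitted here.

package netutilsketch

import (
	"net/url"
	"reflect"
	"sort"
)

// urlsEqualSketch is a hypothetical simplification: sort both slices and,
// if they are deeply equal, report equality without resolving DNS.
func urlsEqualSketch(a, b []url.URL) bool {
	if len(a) != len(b) {
		return false
	}
	sortURLs(a)
	sortURLs(b)
	if reflect.DeepEqual(a, b) {
		return true // equal without touching the resolver
	}
	// The real function would resolve host names to IPs here and compare again.
	return false
}

func sortURLs(us []url.URL) {
	sort.Slice(us, func(i, j int) bool { return us[i].String() < us[j].String() })
}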

Anyway, please clarify the issue and provide more detailed info first. FYI: https://github.com/ahrtr/etcd-issues/blob/master/docs/troubleshooting/sanity_check_and_collect_basic_info.md

@tjungblu
Contributor

tjungblu commented Jan 5, 2023

The logic seems simple and correct to me.

The logic is not the issue; it's the ordering that breaks an implicit assumption users had about the startup of etcd.

@tjungblu
Contributor

tjungblu commented Jan 5, 2023

Here's how the team runs etcd inside k8s:

 /usr/bin/etcd
        --data-dir=/var/lib/data
        --name=etcd-0
        --initial-advertise-peer-urls=https://etcd-0.etcd-discovery.clusters-cewong-guest.svc:2380
        --listen-peer-urls=https://10.137.0.28:2380
        --listen-client-urls=https://10.137.0.28:2379,https://localhost:2379
        --advertise-client-urls=https://etcd-0.etcd-client.clusters-cewong-guest.svc:2379
        --listen-metrics-urls=https://0.0.0.0:2382
        --initial-cluster-token=etcd-cluster
        --initial-cluster=etcd-0=https://etcd-0.etcd-discovery.clusters-cewong-guest.svc:2380,etcd-1=https://etcd-1.etcd-discovery.clusters-cewong-guest.svc:2380,etcd-2=https://etcd-2.etcd-discovery.clusters-cewong-guest.svc:2380
        --initial-cluster-state=new
        --quota-backend-bytes=4294967296
        --snapshot-count=10000
        --peer-client-cert-auth=true
        --peer-cert-file=/etc/etcd/tls/peer/peer.crt
        --peer-key-file=/etc/etcd/tls/peer/peer.key
        --peer-trusted-ca-file=/etc/etcd/tls/etcd-ca/ca.crt
        --client-cert-auth=true
        --cert-file=/etc/etcd/tls/server/server.crt
        --key-file=/etc/etcd/tls/server/server.key
        --trusted-ca-file=/etc/etcd/tls/etcd-ca/ca.crt

If you step through it with a debugger, you notice the path in config.go trying to validate the peer URLs:

func (c *ServerConfig) advertiseMatchesCluster() error {

In the above case, the URLs that are compared are both equal to:

https://etcd-0.etcd-client.clusters-cewong-guest.svc:2379

which is correct. Previously the method netutil.URLStringsEqual would block until the IP resolved; since 3.5.6 it just continues to start the server.

That then causes errors in listener_tls.go that look like this:

{"level":"warn","ts":"2022-12-13T11:29:51.254Z","caller":"embed/config_logging.go:160","msg":"rejected connection",
"remote-addr":"10.132.1.27:42760",
"server-name":"etcd-0.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc",
"ip-addresses":[],
"dns-names":[
	"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc",
	"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc.cluster.local",
	"127.0.0.1",
	"::1"],
	
"error":"tls: \"10.132.1.27\" does not match any of DNSNames [\"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc\" \"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc.cluster.local\" \"127.0.0.1\" \"::1\"]"}

This led us to believe it was an issue with the wildcard matching; apologies for the red herring. It's merely a difference in waiting for DNS resolution.
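To illustrate the rejection itself, here is a standalone sketch using Go's crypto/x509 (not etcd's own listener_tls check): when the peer connects by IP and no reverse-resolved name is available, the bare IP is matched against the certificate SANs, which here only contain wildcard DNS names, so verification fails much like in the log above.

package main

import (
	"crypto/x509"
	"fmt"
)

func main() {
	// A certificate carrying only the wildcard DNS SANs from the log above.
	cert := &x509.Certificate{
		DNSNames: []string{
			"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc",
			"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc.cluster.local",
		},
	}
	// Checking the raw client IP fails because the certificate has no IP SANs.
	fmt.Println(cert.VerifyHostname("10.132.1.27"))
}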

@tjungblu
Contributor

tjungblu commented Jan 5, 2023

Here's my attempt to make this a bit more explicit during bootstrap: #15064

The startup would then fail with:

{"level":"warn","ts":"2023-01-05T16:33:32.901+0100","caller":"netutil/netutil.go:121","msg":"failed to resolve URL Host","url":"https://etcd-0.etcd-discovery.clusters-cewong-guest.svc:2380","host":"etcd-0.etcd-discovery.clusters-cewong-guest.svc:2380","retry-interval":"1s","error":"lookup etcd-0.etcd-discovery.clusters-cewong-guest.svc: no such host"}

@ahrtr
Member Author

ahrtr commented Jan 6, 2023

Have you confirmed that #15064 can resolve your issue?

Based on the error message above in #15062 (comment), the etcd server rejected the client connection because the client hostname doesn't match any DNSName in the certificate. FYI: https://etcd.io/docs/v3.5/op-guide/security/#notes-for-tls-authentication

Have you changed anything recently (e.g. the client certificate)? If not, does it probably mean that you just need to wait some time for the reverse lookup to work?

@ahrtr
Member Author

ahrtr commented Jan 6, 2023

"error":"tls: \"10.132.1.27\" does not match any of DNSNames [\"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc\" \"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc.cluster.local\" \"127.0.0.1\" \"::1\"]"}

Did you see this error only for a short period after etcd started, or all the time?

@tjungblu
Contributor

tjungblu commented Jan 6, 2023

@csrwng can you quickly pitch in? You validated it yesterday with this PR successfully.

@csrwng

csrwng commented Jan 10, 2023

@ahrtr The error occurs continuously for about 5 minutes and then eventually resolves itself. This looks to be a timing issue. If a new cluster member starts attempting to contact its peers without waiting ~2-3 secs for its name to be resolvable, then peers will start reporting the error above and will not accept traffic from the new member until after 5 minutes have passed. Having the check that ensures the name is resolvable before starting the member prevents the communication errors from the peer.

I have confirmed that this change does fix the issue.

@ahrtr
Member Author

ahrtr commented Jan 10, 2023

The error occurs continuously for about 5 minutes and then eventually resolves itself. This looks to be a timing issue.

Do you mean etcd keeps being restarted (failing due to the error "does not match any of DNSNames", getting started again soon, then failing again...) for about 5 minutes? Have you figured out the root cause of why the error occurs continuously for about 5 minutes?

It looks like a DNS issue to me. Adding an init-container to make sure the service (e.g. etcd-0.etcd-discovery.clusters-cewong-guest.svc) is resolvable is a valid workaround to me, but I'd suggest figuring out the root cause of why the service isn't resolvable during the first ~5 minutes.

Just to double confirm... you only upgraded to etcd 3.5.6 and did not change anything else, and then you ran into this issue? Have you changed anything else (e.g. the client or peer certificate, or DNS settings)?

@tjungblu
Contributor

It looks like a DNS issue to me.

While I agree it's a DNS issue, the etcd behavior has changed with the patch release due to the backport. I'm not sure anyone else is relying on that, but it would still be good if the 3.5 releases could behave the same way.

@csrwng

csrwng commented Jan 11, 2023

Do you mean etcd keeps being restarted (failing due to the error "does not match any of DNSNames", and gets started again soon; then fails again...) for about 5 minutes?

No, the error is reported by an established cluster member as a new member tries to join. And yes, the error occurs for about 5 minutes, even though the DNS name was unresolvable for only about 2-3 secs.

Just to double confirm... you only upgraded to the etcd 3.5.6, and did not change anything else, then you ran into this issue?

Yes

I created a couple of diagrams to better explain what is happening:

[diagram 1: Screenshot 2023-01-11 at 1:41:02 PM]

[diagram 2: Screenshot 2023-01-11 at 1:36:19 PM]

@ahrtr
Member Author

ahrtr commented Jan 12, 2023

Thanks @csrwng for the feedback. But I am confused.

The two biggest concerns are:

  1. Why is the hostname unresolvable for about 2~3 seconds?
  2. If the hostname is only unresolvable for 2~3 seconds, why does it take about 5 minutes for the new member to join the cluster?

In order to avoid too much back and forth, could you please clarify the issue first, and follow collect-basic-info to collect & provide the required info?

@csrwng

csrwng commented Jan 12, 2023

Previously @tjungblu provided error message "does not match any of DNSNames", but you provided error message "failed to resolve URL Host". Are you talking about two different issues?

Sorry, that is my bad; I used the wrong error message in my diagram. It is "does not match any of DNSNames".

Your case is related to a member joining an existing cluster, how can the PR fix your issue?

This is a cluster defined by a statefulset. All members start at roughly the same time. There is no local WAL file when members start, but how does it know whether it's a new cluster or not?

Why is the hostname unresolvable for about 2~3 seconds?

The DNS name used for the member is that of a headless service associated with the statefulset. I imagine that after the pod starts, it takes some time for CoreDNS to make that pod's IP resolvable via the service's name.

If the hostname is only unresolvable for 2~3 seconds, why does it take about 5 minutes for the new member to join the cluster?

Very good question. This is something happening in etcd code. It seems that if a lookup fails, the result is cached for that amount of time, but I don't know exactly how.

could you please clarify the issue first, and follow [collect-basic-info](https://github.com/ahrtr/etcd-issues/blob/master/docs/troubleshooting/sanity_check_and_collect_basic_info.md#collect-basic-info) to collect & provide the required info?

Will do

@ahrtr
Member Author

ahrtr commented Jan 13, 2023

but how does it know whether it's a new cluster or not?

It depends on the CLI flag --initial-cluster-state. But if it has local WAL files, then the flag is ignored.

If it doesn't have local WAL files, you should see the log below:

cfg.Logger.Info(
"starting local member",
zap.String("local-member-id", id.String()),
zap.String("cluster-id", cl.ID().String()),
)

If it has local WAL files, you should see the log below:

cfg.Logger.Info(
"restarting local member",
zap.String("cluster-id", cid.String()),
zap.String("local-member-id", id.String()),
zap.Uint64("commit-index", st.Commit),
)
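To make the decision above explicit, here is a hypothetical sketch (the function name and return strings are mine, not etcd's actual bootstrap code) of how the presence of WAL files and the --initial-cluster-state flag interact:

package main

import "fmt"

// clusterStartMode is an illustrative helper, not etcd code.
func clusterStartMode(haveWAL bool, initialClusterState string) string {
	switch {
	case haveWAL:
		return "restart local member from WAL (--initial-cluster-state is ignored)"
	case initialClusterState == "new":
		return "start local member, bootstrapping a new cluster"
	default:
		return "start local member, joining an existing cluster"
	}
}

func main() {
	fmt.Println(clusterStartMode(false, "new")) // start local member, bootstrapping a new cluster
}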

@ahrtr
Member Author

ahrtr commented Mar 6, 2023

The DNS name used for the member is that of a headless service associated with the statefulset. I imagine that after the pod starts, it takes some time for CoreDNS to make that pod's IP resolvable via the service's name.

If the hostname is only unresolvable for 2~3 seconds, why does it take about 5 minutes for the new member to join the cluster?

Very good question. This is something happening in etcd code. It seems that if a lookup fails, the result is cached for that amount of time, but I don't know exactly how.

could you please clarify the issue first, and follow [collect-basic-info](https://github.com/ahrtr/etcd-issues/blob/master/docs/troubleshooting/sanity_check_and_collect_basic_info.md#collect-basic-info) to collect & provide the required info?

Will do

any update on this?

@csrwng

csrwng commented Mar 7, 2023

I've had to turn my attention to more urgent issues. I will try to collect this info by end of next week.

@stale

stale bot commented Aug 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 12, 2023