urlsEqual might wrongly skip resolving DNS #15062

Open
ahrtr opened this issue Jan 5, 2023 · 20 comments

Comments

@ahrtr
Member

ahrtr commented Jan 5, 2023

What happened?

See #13224 (comment)

The correct logic should be:

  1. If the URL doesn't contain a wildcard, keep the current behavior; otherwise follow netutil: add url comparison without resolver to URLStringsEqual #13224 (comment)
@ahrtr ahrtr added the type/bug label Jan 5, 2023
@ahrtr ahrtr changed the title urlsEqual might wrong skip resolving DNS urlsEqual might wrongly skip resolving DNS Jan 5, 2023
@tjungblu
Contributor

tjungblu commented Jan 5, 2023

Thanks Ben, I think we need to discuss what behavior we want to fix. Clearly this is a regression in 3.5 (and 3.4?), but implicitly blocking the bootstrap config on resolving a DNS entry in an equality check is also not a great idea.

I would rather have config.go explicitly look up the important URLs and fail with a meaningful error message when they can't be resolved (akin to the init container the hypershift team added in openshift/hypershift#1985). Then validate whether the URLs and their IP configs are consistent and equal. Potentially with a cmd flag to turn off that check.

#7798 does have some merit after all, there are implications on cluster membership as well.

@ahrtr
Member Author

ahrtr commented Jan 5, 2023

we've seen troubles with wildcard dns names matching the server name early

In case I misunderstood anything, could you clarify this? How did the wildcard DNS names end up matching the server name early? Do you have detailed steps to reproduce this issue?

@tjungblu
Contributor

tjungblu commented Jan 5, 2023

I only have fragments of the whole issue myself; I'll come back with a reproducer in a test case once I receive more information later today.

As far as I understood, the issue stems from the fact that the URLs somehow match this logic:
https://github.com/etcd-io/etcd/pull/14573/files#diff-669913e187d4dc726dcb9e01a5666ae0562549545138d25d2c8fc2ceb3e23d30R151-R165

My assumption was that the wildcard is ignored, so the comparison matches only on the host (which is equal), which in turn no longer resolves the DNS. Previously that resolution would have blocked for a while, thus "preventing" etcd from starting.

@ahrtr
Member Author

ahrtr commented Jan 5, 2023

It doesn't seem related to #13224. The function urlsEqual just compares two slices of URLs, and skips resolving DNS if they are deeply equal. The logic seems simple and correct to me.
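For reference, a minimal Go sketch of that comparison idea (illustrative only, not the actual netutil code): if the two sorted URL slices are already deeply equal, DNS resolution is skipped entirely; only unequal slices would fall through to the blocking resolver step, which is omitted here.

package netutilsketch

import (
	"net/url"
	"reflect"
	"sort"
)

// urlsEqualSketch is a hypothetical simplification: sort both slices and,
// if they are deeply equal, report equality without resolving DNS.
func urlsEqualSketch(a, b []url.URL) bool {
	if len(a) != len(b) {
		return false
	}
	sortURLs(a)
	sortURLs(b)
	if reflect.DeepEqual(a, b) {
		return true // equal without touching the resolver
	}
	// The real function would resolve host names to IPs here and compare again.
	return false
}

func sortURLs(us []url.URL) {
	sort.Slice(us, func(i, j int) bool { return us[i].String() < us[j].String() })
}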

Anyway, please clarify the issue and provide more detailed info first. FYI: https://github.com/ahrtr/etcd-issues/blob/master/docs/troubleshooting/sanity_check_and_collect_basic_info.md

@tjungblu
Contributor

tjungblu commented Jan 5, 2023

The logic seems simple and correct to me.

The logic is not the issue; it's the ordering that breaks an implicit assumption users had about the startup of etcd.

@tjungblu
Contributor

tjungblu commented Jan 5, 2023

Here's how the team runs etcd inside k8s:

 /usr/bin/etcd
        --data-dir=/var/lib/data
        --name=etcd-0
        --initial-advertise-peer-urls=https://etcd-0.etcd-discovery.clusters-cewong-guest.svc:2380
        --listen-peer-urls=https://10.137.0.28:2380
        --listen-client-urls=https://10.137.0.28:2379,https://localhost:2379
        --advertise-client-urls=https://etcd-0.etcd-client.clusters-cewong-guest.svc:2379
        --listen-metrics-urls=https://0.0.0.0:2382
        --initial-cluster-token=etcd-cluster
        --initial-cluster=etcd-0=https://etcd-0.etcd-discovery.clusters-cewong-guest.svc:2380,etcd-1=https://etcd-1.etcd-discovery.clusters-cewong-guest.svc:2380,etcd-2=https://etcd-2.etcd-discovery.clusters-cewong-guest.svc:2380
        --initial-cluster-state=new
        --quota-backend-bytes=4294967296
        --snapshot-count=10000
        --peer-client-cert-auth=true
        --peer-cert-file=/etc/etcd/tls/peer/peer.crt
        --peer-key-file=/etc/etcd/tls/peer/peer.key
        --peer-trusted-ca-file=/etc/etcd/tls/etcd-ca/ca.crt
        --client-cert-auth=true
        --cert-file=/etc/etcd/tls/server/server.crt
        --key-file=/etc/etcd/tls/server/server.key
        --trusted-ca-file=/etc/etcd/tls/etcd-ca/ca.crt

If you step through it with a debugger, you notice the path in config.go trying to validate the peer URLs:

func (c *ServerConfig) advertiseMatchesCluster() error {

In the above case, the URLs that are compared are both equal to:

https://etcd-0.etcd-client.clusters-cewong-guest.svc:2379

which is correct. Previously the method netutil.URLStringsEqual would block until the IP resolved; since 3.5.6 it just continues to start the server.

That then causes errors in listener_tls.go that look like this:

{"level":"warn","ts":"2022-12-13T11:29:51.254Z","caller":"embed/config_logging.go:160","msg":"rejected connection",
"remote-addr":"10.132.1.27:42760",
"server-name":"etcd-0.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc",
"ip-addresses":[],
"dns-names":[
	"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc",
	"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc.cluster.local",
	"127.0.0.1",
	"::1"],
	
"error":"tls: \"10.132.1.27\" does not match any of DNSNames [\"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc\" \"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc.cluster.local\" \"127.0.0.1\" \"::1\"]"}

This led us to believe it was an issue with the wildcard matching; apologies for the red herring. It's merely a difference in waiting for DNS resolution.
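To illustrate the rejection itself, here is a standalone sketch using Go's crypto/x509 (not etcd's own listener_tls check): when the peer connects by IP and no reverse-resolved name is available, the bare IP is matched against the certificate SANs, which here only contain wildcard DNS names, so verification fails much like in the log above.

package main

import (
	"crypto/x509"
	"fmt"
)

func main() {
	// A certificate carrying only the wildcard DNS SANs from the log above.
	cert := &x509.Certificate{
		DNSNames: []string{
			"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc",
			"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc.cluster.local",
		},
	}
	// Checking the raw client IP fails because the certificate has no IP SANs.
	fmt.Println(cert.VerifyHostname("10.132.1.27"))
}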

@tjungblu
Contributor

tjungblu commented Jan 5, 2023

Here's my attempt to make this a bit more explicit during bootstrap: #15064

The startup would then fail with:

{"level":"warn","ts":"2023-01-05T16:33:32.901+0100","caller":"netutil/netutil.go:121","msg":"failed to resolve URL Host","url":"https://etcd-0.etcd-discovery.clusters-cewong-guest.svc:2380","host":"etcd-0.etcd-discovery.clusters-cewong-guest.svc:2380","retry-interval":"1s","error":"lookup etcd-0.etcd-discovery.clusters-cewong-guest.svc: no such host"}

@ahrtr
Member Author

ahrtr commented Jan 6, 2023

Have you confirmed that #15064 can resolve your issue?

Based on the error message above in #15062 (comment), the etcd server rejected the client connection because the client hostname doesn't match any DNSName in the certificate. FYI: https://etcd.io/docs/v3.5/op-guide/security/#notes-for-tls-authentication

Have you changed anything recently (e.g. the client certificate)? If not, does it probably mean that you just need to wait some time for the reverse lookup to work?

@ahrtr
Member Author

ahrtr commented Jan 6, 2023

"error":"tls: \"10.132.1.27\" does not match any of DNSNames [\"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc\" \"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc.cluster.local\" \"127.0.0.1\" \"::1\"]"}

Did you see this error only for a short period after etcd started, or all the time?

@tjungblu
Contributor

tjungblu commented Jan 6, 2023

@csrwng can you quickly pitch in? You validated it yesterday with this PR successfully.

@csrwng

csrwng commented Jan 10, 2023

@ahrtr The error occurs continuously for about 5 minutes and then eventually resolves itself. This looks to be a timing issue. If a new cluster member starts attempting to contact its peers without waiting ~2-3 secs for its name to be resolvable, then peers will start reporting the error above and will not accept traffic from the new member until after 5 minutes have passed. Having the check that ensures the name is resolvable before starting the member prevents the communication errors from the peer.

I have confirmed that this change does fix the issue.

@ahrtr
Member Author

ahrtr commented Jan 10, 2023

The error occurs continuously for about 5 minutes and then eventually resolves itself. This looks to be a timing issue.

Do you mean etcd keeps being restarted (failing due to the error "does not match any of DNSNames", getting started again soon, then failing again...) for about 5 minutes? Have you figured out the root cause of why the error occurs continuously for about 5 minutes?

It looks like a DNS issue to me. Adding an init-container to make sure the service (e.g. etcd-0.etcd-discovery.clusters-cewong-guest.svc) is resolvable is a valid workaround to me, but I'd suggest figuring out the root cause of why the service isn't resolvable during the first ~5 minutes.

Just to double confirm... you only upgraded to etcd 3.5.6 and did not change anything else, and then you ran into this issue? Have you changed anything else (e.g. the client or peer certificate, or DNS settings)?

@tjungblu
Contributor

It looks like a DNS issue to me.

While I agree it's a DNS issue, the etcd behavior has changed with the patch release due to the backport. I'm not sure anyone else is relying on that, but it would still be good if the 3.5 releases could behave the same way.

@csrwng

csrwng commented Jan 11, 2023

Do you mean etcd keeps being restarted (failing due to the error "does not match any of DNSNames", and gets started again soon; then fails again...) for about 5 minutes?

No, the error is reported by an established cluster member as a new member tries to join. And yes, the error occurs for about 5 minutes, even though the DNS name was unresolvable for only about 2-3 secs.

Just to double confirm... you only upgraded to the etcd 3.5.6, and did not change anything else, then you ran into this issue?

Yes

I created a couple of diagrams to better explain what is happening:

[diagram 1: Screenshot 2023-01-11 at 1:41:02 PM]

[diagram 2: Screenshot 2023-01-11 at 1:36:19 PM]

@ahrtr
Member Author

ahrtr commented Jan 12, 2023

Thanks @csrwng for the feedback. But I am confused.

The two biggest concerns are:

  1. Why is the hostname unresolvable for about 2~3 seconds?
  2. If the hostname is only unresolvable for 2~3 seconds, why does it take about 5 minutes for the new member to join the cluster?

In order to avoid too much back and forth, could you please clarify the issue first, and follow collect-basic-info to collect & provide the required info?

@csrwng

csrwng commented Jan 12, 2023

Previously @tjungblu provided error message "does not match any of DNSNames", but you provided error message "failed to resolve URL Host". Are you talking about two different issues?

Sorry, that is my bad; I used the wrong error message in my diagram. It is "does not match any of DNSNames".

Your case is related to a member joining an existing cluster, how can the PR fix your issue?

This is a cluster defined by a statefulset. All members start at roughly the same time. There is no local WAL file when members start, but how does it know whether it's a new cluster or not?

Why is the hostname unresolvable for about 2~3 seconds?

The DNS name used for the member is that of a headless service associated with the statefulset. I imagine that after the pod starts, it takes some time for CoreDNS to make that pod's IP resolvable via the service's name.

If the hostname is only unresolvable for 2~3 seconds, why does it take about 5 minutes for the new member to join the cluster?

Very good question. This is something happening in etcd code. It seems that if a lookup fails, the result is cached for that amount of time, but I don't know exactly how.

could you please clarify the issue first, and follow [collect-basic-info](https://github.com/ahrtr/etcd-issues/blob/master/docs/troubleshooting/sanity_check_and_collect_basic_info.md#collect-basic-info) to collect & provide the required info?

Will do

@ahrtr
Member Author

ahrtr commented Jan 13, 2023

but how does it know whether it's a new cluster or not?

It depends on the CLI flag --initial-cluster-state. But if it has local WAL files, then the flag is ignored.

If it doesn't have local WAL files, you should see the log below:

cfg.Logger.Info(
"starting local member",
zap.String("local-member-id", id.String()),
zap.String("cluster-id", cl.ID().String()),
)

If it has local WAL files, you should see the log below:

cfg.Logger.Info(
"restarting local member",
zap.String("cluster-id", cid.String()),
zap.String("local-member-id", id.String()),
zap.Uint64("commit-index", st.Commit),
)
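To make the decision above explicit, here is a hypothetical sketch (the function name and return strings are mine, not etcd's actual bootstrap code) of how the presence of WAL files and the --initial-cluster-state flag interact:

package main

import "fmt"

// clusterStartMode is an illustrative helper, not etcd code.
func clusterStartMode(haveWAL bool, initialClusterState string) string {
	switch {
	case haveWAL:
		return "restart local member from WAL (--initial-cluster-state is ignored)"
	case initialClusterState == "new":
		return "start local member, bootstrapping a new cluster"
	default:
		return "start local member, joining an existing cluster"
	}
}

func main() {
	fmt.Println(clusterStartMode(false, "new")) // start local member, bootstrapping a new cluster
}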

@ahrtr
Member Author

ahrtr commented Mar 6, 2023

The DNS name used for the member is that of a headless service associated with the statefulset. I imagine that after the pod starts, it takes some time for CoreDNS to make that pod's IP resolvable via the service's name.

If the hostname is only unresolvable for 2~3 seconds, why does it take about 5 minutes for the new member to join the cluster?

Very good question. This is something happening in etcd code. It seems that if a lookup fails, the result is cached for that amount of time, but I don't know exactly how.

could you please clarify the issue first, and follow [collect-basic-info](https://github.com/ahrtr/etcd-issues/blob/master/docs/troubleshooting/sanity_check_and_collect_basic_info.md#collect-basic-info) to collect & provide the required info?

Will do

any update on this?

@csrwng

csrwng commented Mar 7, 2023

I've had to turn my attention to more urgent issues. I will try to collect this info by end of next week.

@stale

stale bot commented Aug 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 12, 2023