Regression: CoreDNS in version 1.8.0 fails to resolve domain (works in 1.6.9) #4735

Closed
lhotrifork opened this issue Jul 6, 2021 · 6 comments
@lhotrifork

What happened:
When using k3s with CoreDNS 1.8.0, lookups for a specific domain (my corporate HTTP proxy) return different 'rflags' than they do with CoreDNS 1.6.9.
In practice this means that OpenID Connect using OpenResty as a reverse-proxy fails. Downgrading to 1.6.9 fixes the issue. CoreDNS in version 1.7.X has the same issue.
I cannot easily upgrade to 1.8.3 as it depends on endpointSlices.

I've narrowed it down to the following difference (which may be a wild goose-chase).
For 1.6.9 CoreDNS logs the following way for the domain:

[INFO] 10.42.0.220:44598 - 59764 "A IN proxy.mydomain.com. udp 32 false 512" NOERROR qr,aa,rd,ra 336 0.000129347s

For 1.8.0 it logs:

[INFO] 10.42.0.228:33081 - 63992 "A IN proxy.mydomain.com. udp 32 false 512" NOERROR qr,rd,ra,ad 336 0.000702693s

The difference seems to be in the 'rflags'.
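
To make the comparison mechanical, here is a minimal Python sketch that parses the two log lines above and reports which response flags differ. The regular expression assumes the field order visible in these entries (client, query ID, quoted query with size, DO bit and buffer size, then rcode, rflags, response size and duration). The names `parse` and `flag_diff` are hypothetical helpers and not part of CoreDNS.

```python
import re

# Field order assumed from the log entries quoted in this issue.
LOG_RE = re.compile(
    r'\[INFO\] (?P<remote>\S+):(?P<port>\d+) - (?P<id>\d+) '
    r'"(?P<qtype>\S+) (?P<qclass>\S+) (?P<name>\S+) (?P<proto>\S+) '
    r'(?P<size>\d+) (?P<do>true|false) (?P<bufsize>\d+)" '
    r'(?P<rcode>\S+) (?P<rflags>\S+) (?P<rsize>\d+) (?P<duration>\S+)'
)


def parse(line):
    """Return the fields of one log line as a dict (hypothetical helper)."""
    match = LOG_RE.search(line)
    if match is None:
        raise ValueError("unrecognised log line: " + line)
    record = match.groupdict()
    record["do"] = record["do"] == "true"             # DO bit of the request
    record["rflags"] = set(record["rflags"].split(","))  # response header flags
    return record


def flag_diff(old_line, new_line):
    """Return (flags only in old, flags only in new), each sorted."""
    old = parse(old_line)["rflags"]
    new = parse(new_line)["rflags"]
    return sorted(old - new), sorted(new - old)


OLD = ('[INFO] 10.42.0.220:44598 - 59764 "A IN proxy.mydomain.com. udp 32 '
       'false 512" NOERROR qr,aa,rd,ra 336 0.000129347s')
NEW = ('[INFO] 10.42.0.228:33081 - 63992 "A IN proxy.mydomain.com. udp 32 '
       'false 512" NOERROR qr,rd,ra,ad 336 0.000702693s')

removed, added = flag_diff(OLD, NEW)
print("only in 1.6.9:", removed)
print("only in 1.8.0:", added)
```

For these two entries the only change is that `aa` is dropped and `ad` appears; rcode and response size are identical.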

Anything else we need to know?:
As this is my proxy server, a lot of requests also go through it, and for those requests there's also a difference between 1.6.9 and 1.8.0.
1.6.9:

[INFO] 10.42.0.213:36482 - 19579 "A IN proxy.mydomain.com.flux-system.svc.cluster.local. udp 62 false 512" NXDOMAIN qr,aa,rd 155 0.000166089s
[INFO] 10.42.0.213:35226 - 61848 "AAAA IN proxy.mydomain.com.flux-system.svc.cluster.local. udp 62 false 512" NXDOMAIN qr,aa,rd 155 0.003467135s
[INFO] 10.42.0.213:37512 - 34923 "A IN proxy.mydomain.com.svc.cluster.local. udp 50 false 512" NXDOMAIN qr,aa,rd 143 0.000105847s
[INFO] 10.42.0.213:59737 - 16377 "AAAA IN proxy.mydomain.com.svc.cluster.local. udp 50 false 512" NXDOMAIN qr,aa,rd 143 0.000061375s
[INFO] 10.42.0.213:55805 - 65513 "AAAA IN proxy.mydomain.com.cluster.local. udp 46 false 512" NXDOMAIN qr,aa,rd 139 0.000271467s
[INFO] 10.42.0.213:44383 - 8234 "A IN proxy.mydomain.com.cluster.local. udp 46 false 512" NXDOMAIN qr,aa,rd 139 0.000194574s
[INFO] 10.42.0.213:35004 - 18276 "A IN proxy.mydomain.com.my.subdomain.mydomain.com. udp 53 false 512" NXDOMAIN qr,rd,ra 141 0.000447348s

1.8.0:

[INFO] 10.42.0.251:41582 - 30418 "AAAA IN proxy.mydomain.com. udp 32 false 512" NOERROR qr,aa,rd,ra 348 0.000044463s
[INFO] 10.42.0.251:51639 - 8933 "A IN proxy.mydomain.com.kube-system.svc.cluster.local. udp 62 false 512" NXDOMAIN qr,aa,rd 155 0.000155996s
[INFO] 10.42.0.251:46513 - 14889 "A IN proxy.mydomain.com.svc.cluster.local. udp 50 false 512" NXDOMAIN qr,aa,rd 143 0.00009106s
[INFO] 10.42.0.251:56726 - 44362 "A IN proxy.mydomain.com.cluster.local. udp 46 false 512" NXDOMAIN qr,aa,rd 139 0.000055118s
[INFO] 10.42.0.251:52852 - 37567 "A IN proxy.mydomain.com.my.subdomain.mydomain.com. udp 53 false 512" NXDOMAIN qr,aa,rd,ra 141 0.000026808s
[INFO] 10.42.0.251:55715 - 2845 "A IN proxy.mydomain.com.netic.dk. udp 41 false 512" NXDOMAIN qr,aa,rd,ra 117 0.000024033s
[INFO] 10.42.0.251:54047 - 17602 "A IN proxy.mydomain.com. udp 32 false 512" NOERROR qr,aa,rd,ra 336 0.00005718s
[INFO] 10.42.0.251:41358 - 4578 "AAAA IN proxy.mydomain.com. udp 32 false 512" NOERROR qr,aa,rd,ra 348 0.000046303s

Environment:

  • the version of CoreDNS:
    1.8.0
  • Corefile:
    .:53 {
        log
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        hosts /etc/coredns/NodeHosts {
          ttl 60
          reload 15s
          fallthrough
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
  • OS (e.g: cat /etc/os-release):
    CoreDNS is running in k3s on Ubuntu.
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.4 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
$ k3s -v
k3s version v1.20.4+k3s1 (838a906a)
go version go1.15.8
  • Others:
    /etc/resolv.conf on the host:
nameserver 127.0.0.53
options edns0
search my.subdomain.mydomain.com mydomain.com
@chrisohaver
Member

Since 1.7.1, #4085 makes cache default all queries to use DNSSEC (an optimization). This doesn't exactly line up with your observation that 1.7.X is affected, but if 1.7.0 is not affected, it could explain one of the flag differences you see: the presence of the AD flag instead of AA. However, cache won't reply with an AD flag unless the request was actually DNSSEC, and in this case, per the log entry, the request was not DNSSEC, so something is off. Anyway, perhaps this confuses OpenID Connect/OpenResty.

Clearly disabling cache in production is not an option. But to experiment, you could temporarily disable cache to see if the issue goes away, just to narrow down things.

@lhotrifork
Author

lhotrifork commented Jul 8, 2021

Hi,

Thanks for your reply!
Disabling the cache seems to fix the problem. I also checked CoreDNS version 1.7.0 - it doesn't have the problem as you suspected. 1.7.1 does have the problem. So #4085 seems to be the culprit.

The log for 1.8.0 with caching disabled contains the following entries (filtered for the correct domain only):

[INFO] 10.42.0.213:39144 - 15931 "A IN proxy.mydomain.com. udp 32 false 512" NOERROR qr,rd,ra 336 0.000520191s
[INFO] 10.42.0.213:38318 - 28807 "AAAA IN proxy.mydomain.com. udp 32 false 512" NOERROR qr,rd,ra 348 0.000592308s
[INFO] 10.42.0.213:56961 - 38980 "AAAA IN proxy.mydomain.com. udp 32 false 512" NOERROR qr,rd,ra 348 0.000510597s
[INFO] 10.42.0.213:45560 - 45147 "A IN proxy.mydomain.com. udp 32 false 512" NOERROR qr,rd,ra 336 0.000580388s
[INFO] 10.42.0.5:59891 - 49319 "A IN proxy.mydomain.com. udp 32 false 512" NOERROR qr,rd,ra 336 0.000532207s
[INFO] 10.42.0.6:54442 - 31949 "A IN proxy.mydomain.com. udp 32 false 512" NOERROR qr,rd,ra 336 0.000440802s

If I run dig on the host (outside k8s), I get the following response:

$ dig proxy.mydomain.com.

; <<>> DiG 9.11.3-1ubuntu1.15-Ubuntu <<>> proxy.mydomain.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39163
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;proxy.mydomain.com.                        IN      A

;; ANSWER SECTION:
proxy.mydomain.com.         17112   IN      A       {IP-ADDR}

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Thu Jul 08 10:29:17 CEST 2021
;; MSG SIZE  rcvd: 59

@lhotrifork
Author

I saw that there was some work done related to caching of DNSSEC in #4148.
I just tested it with 1.8.3 (k3s hasn't built a 1.8.4 docker image yet) and it still has the problem.

Maybe the query is returning something that triggers the isDNSSEC function to return true? See https://github.com/coredns/coredns/blob/master/plugin/cache/dnssec.go#L8

@chrisohaver
Member

isDNSSEC is used when filtering the RR sets, and the log doesn't show the length of the response sections, so it's hard to tell if RR set filtering is happening as expected from the log entry ... however the log does show the header flags ...

Looking at the code, it looks like the only way cache will write the ad flag to a client is if both the cached response has the ad flag and the client says DNSSEC is OK (DO bit set). IOW, for a response from cache that has ad set, both of those must be true. So it would seem that ...

[INFO] 10.42.0.228:33081 - 63992 "A IN proxy.mydomain.com. udp 32 false 512" NOERROR qr,rd,ra,ad 336 0.000702693s

... is impossible, because false shows that the DO bit is not set in the client request, and qr,rd,ra,ad shows that the response included the ad flag.

Yet it happened somehow.
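
The check described above (ad in the response should imply DO in the request) can be applied to a whole log file. The following is a small Python sketch under the assumption that the DO bit and the response flags sit where they do in the entries quoted in this thread; `ad_without_do` is a hypothetical helper name.

```python
import re

# Captures the request's DO bit and the response flags from the tail of an entry:
#   ... <do> <bufsize>" <rcode> <rflags> <rsize> <duration>
TAIL_RE = re.compile(r'(true|false) \d+" \S+ (\S+) \d+ \S+$')


def ad_without_do(line):
    """True if the response carries 'ad' although the request had DO unset."""
    match = TAIL_RE.search(line.strip())
    if match is None:
        return False  # not a query log entry in the assumed format
    do_bit = match.group(1) == "true"
    flags = match.group(2).split(",")
    return "ad" in flags and not do_bit


SUSPECT = ('[INFO] 10.42.0.228:33081 - 63992 "A IN proxy.mydomain.com. udp 32 '
           'false 512" NOERROR qr,rd,ra,ad 336 0.000702693s')
NORMAL = ('[INFO] 10.42.0.220:44598 - 59764 "A IN proxy.mydomain.com. udp 32 '
          'false 512" NOERROR qr,aa,rd,ra 336 0.000129347s')

for entry in (SUSPECT, NORMAL):
    print(ad_without_do(entry), entry)
```

Running it over the full CoreDNS log would list every entry that violates the expectation, which helps show whether the problem is limited to one domain.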

@chrisohaver
Member

Yet it happened somehow.

IIUC, for a cache miss, this can occur. The flag was only unset on cache hits. I have opened #4736 to fix it.

I don't know if #4736 will fix this issue however.
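
As an illustration of the hit/miss asymmetry described above, here is a toy Python model. It is not CoreDNS code: `Response`, `reply_from_cache`, `reply_on_miss` and the `clear_ad` switch are hypothetical names, and the `clear_ad=True` branch is only one plausible reading of what a fix would do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    ad: bool  # whether the AD flag is set in the response header


def reply_from_cache(cached, client_do):
    """Cache hit: ad is passed on only if it is cached AND the client set DO."""
    return Response(ad=cached.ad and client_do)


def reply_on_miss(upstream, client_do, clear_ad):
    """Cache miss: the upstream answer is relayed to the client.

    clear_ad=False models the behaviour described in this thread, where the
    flag was only unset on cache hits. clear_ad=True models applying the same
    rule on a miss (an assumption about the intended fix).
    """
    if clear_ad:
        return Response(ad=upstream.ad and client_do)
    return Response(ad=upstream.ad)


# Assumed scenario: the upstream query was sent with DO set by cache, so the
# upstream answer has ad set even though the client did not set DO itself.
upstream = Response(ad=True)

print(reply_on_miss(upstream, client_do=False, clear_ad=False))
print(reply_on_miss(upstream, client_do=False, clear_ad=True))
print(reply_from_cache(upstream, client_do=False))
```

In this model only the miss path with `clear_ad=False` returns ad to a client that did not set DO, which matches the "impossible" log entry.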

@chrisohaver
Member

Please re-open if #4736 does not resolve this issue.
