vL3 DNS doesn't work in an interdomain context #1414

Closed
fr-Pursuit opened this issue Jan 26, 2023 · 18 comments

@fr-Pursuit

Expected Behavior

After deploying the interdomain vL3 example, the vL3 DNS feature should work as described in its associated example.

Specifically, after modifying NSM_DNS_TEMPLATES to avoid the {{ .NetworkService }} property (which contains the illegal character @ in the case of an interdomain vL3 service), the vL3 NSE should respond to DNS queries with the vL3 IP of the queried pod.

For instance, in the context of the interdomain vL3 example, and after setting NSM_DNS_TEMPLATES to {{ index .Labels \"podName\" }}.my-interdomain-vl3-network for both vL3 NSEs, the NSEs should respond to DNS queries for alpine.my-interdomain-vl3-network with the IP address of the alpine pod connected to the my-interdomain-vl3-network vL3 network. Since two alpine pods are deployed in this example (one in each cluster), whether the NSE should return the vL3 IP of the local alpine pod, the remote one, or both remains to be defined.
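
To illustrate why {{ .NetworkService }} is a problem here, the following minimal Python sketch checks each label of a candidate name against the conventional hostname rule (letters, digits and hyphens, 1 to 63 characters, no leading or trailing hyphen). The service name my-interdomain-vl3-network@my.cluster3 is an assumed example of an interdomain name, and is_valid_hostname is a hypothetical helper, not part of NSM.

```python
import re

# One hostname label: 1-63 chars, letters/digits/hyphens, no leading/trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def is_valid_hostname(name: str) -> bool:
    """Return True if every dot-separated label follows the hostname rule."""
    labels = name.rstrip(".").split(".")
    return all(_LABEL.match(label) is not None for label in labels)


# Assumed interdomain service name; the '@' separates service and domain.
network_service = "my-interdomain-vl3-network@my.cluster3"

print(is_valid_hostname(f"alpine.{network_service}."))          # False
print(is_valid_hostname("alpine.my-interdomain-vl3-network."))  # True
```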

Current Behavior

In the context described above, neither the alpine pods nor NSE pods are able to resolve the alpine.my-interdomain-vl3-network domain name.

Failure Information (for bugs)

Running nslookup alpine.my-interdomain-vl3-network. in either an alpine pod or a vL3 NSE pod results in a SERVFAIL error.

Steps to Reproduce

  1. Deploy the interdomain vL3 example while modifying the NSE deployment to use the {{ index .Labels \"podName\" }}.my-interdomain-vl3-network value for the NSM_DNS_TEMPLATES environment variable
  2. Run apt update && apt install dnsutils && nslookup alpine.my-interdomain-vl3-network. 127.0.0.1 in the NSE pod
  3. Observe that the DNS query is failing
  4. Check the pod's logs for errors

Context

The interdomain vL3 example was deployed on 3 VMs, each running a local kind cluster. The nodes of each kind cluster have InternalIPs in different ranges, and multicluster communication between nodes is possible using their InternalIPs. However, the nodes do not have ExternalIPs.

MetalLB is also deployed in each cluster to provide ExternalIPs to k8s services. These IPs are routable between the three VMs.

I used Kind 0.17.0.

Failure Logs

When a DNS query about the alpine pod is sent to the vL3 NSE, the following error appears in the pod's logs:

Jan 26 12:10:22.802 [TRAC] [id:55094] [type:dnsServer] (1) ⎆ sdk/pkg/tools/dnsutils/dnsconfigs/dnsConfigsHandler.ServeDNS()
Jan 26 12:10:22.803 [TRAC] [id:55094] [type:dnsServer] (2)  ⎆ sdk/pkg/tools/dnsutils/noloop/noloopDNSHandler.ServeDNS()
Jan 26 12:10:22.803 [TRAC] [id:55094] [type:dnsServer] (3)   ⎆ sdk/pkg/tools/dnsutils/norecursion/norecursionDNSHandler.ServeDNS()
Jan 26 12:10:22.803 [TRAC] [id:55094] [type:dnsServer] (4)    ⎆ sdk/pkg/tools/dnsutils/memory/memoryHandler.ServeDNS()
Jan 26 12:10:22.803 [TRAC] [id:55094] [type:dnsServer] (5)     ⎆ sdk/pkg/tools/dnsutils/fanout/fanoutHandler.ServeDNS()
Jan 26 12:10:22.859 [TRAC] [id:55094] [type:dnsServer] (1) ⎆ sdk/pkg/tools/dnsutils/dnsconfigs/dnsConfigsHandler.ServeDNS()
Jan 26 12:10:22.859 [TRAC] [id:55094] [type:dnsServer] (2)  ⎆ sdk/pkg/tools/dnsutils/noloop/noloopDNSHandler.ServeDNS()
Jan 26 12:10:22.859 [ERRO] [id:55094] [noloopDNSHandler:ServeDNS] [type:dnsServer] (2.1)    loop is not allowed: query: ;; opcode: QUERY, status: NOERROR, id: 55094;   ;; flags:; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0;    ;      ;; QUESTION SECTION:;    ;alpine.my-interdomain-vl3-network.     IN       A;     ;

The complete logs of the vL3 NSE pod are available here.
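
The trace above shows the same query ID (55094) entering the handler chain a second time and being rejected by the noloop handler. The Python sketch below is a simplified model of that kind of loop guard, not the SDK's implementation: NoLoopServer is a hypothetical class, and the idea that two servers forward to each other is an assumption made only to produce a loop.

```python
SERVFAIL = 2  # DNS response code for "server failure"


class NoLoopServer:
    """Toy resolver that forwards unknown names to an upstream server."""

    def __init__(self, records):
        self.records = records   # name -> IP address
        self.upstream = None     # another NoLoopServer, or None
        self.in_flight = set()   # query IDs currently being handled

    def serve(self, query_id, name):
        # Seeing an ID that is still in flight means the query came back to us.
        if query_id in self.in_flight:
            return SERVFAIL, None
        self.in_flight.add(query_id)
        try:
            if name in self.records:
                return 0, self.records[name]
            if self.upstream is None:
                return SERVFAIL, None
            return self.upstream.serve(query_id, name)
        finally:
            self.in_flight.discard(query_id)


# Assumption: two servers that forward unknown names to each other.
a = NoLoopServer({})
b = NoLoopServer({})
a.upstream, b.upstream = b, a

print(a.serve(55094, "alpine.my-interdomain-vl3-network."))  # (2, None)
```

In this model the query fails with SERVFAIL as soon as it returns to a server that is still handling it, which matches the symptom seen by nslookup.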

@denis-tingaikin denis-tingaikin added the bug Something isn't working label Jan 26, 2023
@denis-tingaikin denis-tingaikin added this to the v1.8.0 milestone Jan 26, 2023
@glazychev-art
Contributor

Hi @fr-Pursuit ,

Could you please recheck your problem with {{ index .Labels \"podName\" }}.my-interdomain-vl3-network. (note the added dot at the end)?
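
A minimal Python sketch of why the trailing dot can matter, assuming the NSE stores each record under the rendered template and matches query names exactly. That matching behaviour is an assumption about the implementation; render_template and lookup are hypothetical helpers, and the IP address is only illustrative.

```python
def render_template(pod_name: str, zone: str) -> str:
    """Hypothetical stand-in for rendering NSM_DNS_TEMPLATES for one pod."""
    return f"{pod_name}.{zone}"


def lookup(records: dict, query_name: str):
    """Exact-match lookup; query names on the wire are fully qualified."""
    return records.get(query_name)


query = "alpine.my-interdomain-vl3-network."  # fully qualified, as sent by nslookup

# Template without the trailing dot: the stored key never matches the query.
records = {render_template("alpine", "my-interdomain-vl3-network"): "172.16.0.3"}
print(lookup(records, query))  # None

# Template with the trailing dot: the stored key is the FQDN and matches.
records = {render_template("alpine", "my-interdomain-vl3-network."): "172.16.0.3"}
print(lookup(records, query))  # 172.16.0.3
```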

@fr-Pursuit
Author

fr-Pursuit commented Feb 6, 2023

Hi,

With the added dot at the end of the template, the DNS resolution partially works.
If I have a pod named yelb-ui connected to the vL3 network, nslookup yelb-ui.my-interdomain-vl3-network fails after initially returning the expected IP:

# nslookup yelb-ui.my-interdomain-vl3-network. && echo success
Server:         127.0.0.1
Address:        127.0.0.1#53

Name:   yelb-ui.my-interdomain-vl3-network
Address: 172.16.0.3
** server can't find yelb-ui.my-interdomain-vl3-network: SERVFAIL

The same error also appears in the logs of the vL3 NSE that is in the same cluster as the yelb-ui pod:

Feb  6 23:18:25.908 [TRAC] [id:29292] [type:dnsServer] (1) ⎆ sdk/pkg/tools/dnsutils/dnsconfigs/dnsConfigsHandler.ServeDNS()
Feb  6 23:18:25.908 [TRAC] [id:29292] [type:dnsServer] (1.1)   message-request={"Id":29292,"Response":false,"Opcode":0,"Authoritative":false,"Truncated":false,"RecursionDesired":true,"RecursionAvailable":false,"Zero":false,"AuthenticatedData":false,"CheckingDisabled":false,"Rcode":0,"Question":[{"Name":"yelb-ui.my-interdomain-vl3-network.my-interdomain-vl3-network.","Qtype":28,"Qclass":1}],"Answer":null,"Ns":null,"Extra":null}
Feb  6 23:18:25.908 [DEBU] [id:29292] [type:dnsServer] (1.2)   passed clientURLs: [{udp   172.16.1.1   false   }]
Feb  6 23:18:25.908 [DEBU] [id:29292] [type:dnsServer] (1.3)   passed SearchDomains: []
Feb  6 23:18:25.908 [TRAC] [id:29292] [type:dnsServer] (2)  ⎆ sdk/pkg/tools/dnsutils/noloop/noloopDNSHandler.ServeDNS()
Feb  6 23:18:25.908 [TRAC] [id:29292] [type:dnsServer] (3)   ⎆ sdk/pkg/tools/dnsutils/norecursion/norecursionDNSHandler.ServeDNS()
Feb  6 23:18:25.908 [TRAC] [id:29292] [type:dnsServer] (3.1)     message-request-diff=[{"type":"update","path":["MsgHdr","RecursionDesired"],"from":true,"to":false}]
Feb  6 23:18:25.908 [TRAC] [id:29292] [type:dnsServer] (4)    ⎆ sdk/pkg/tools/dnsutils/memory/memoryHandler.ServeDNS()
Feb  6 23:18:25.908 [TRAC] [id:29292] [type:dnsServer] (5)     ⎆ sdk/pkg/tools/dnsutils/fanout/fanoutHandler.ServeDNS()
Feb  6 23:18:25.911 [TRAC] [id:29292] [type:dnsServer] (1) ⎆ sdk/pkg/tools/dnsutils/dnsconfigs/dnsConfigsHandler.ServeDNS()
Feb  6 23:18:25.911 [TRAC] [id:29292] [type:dnsServer] (1.1)   message-request={"Id":29292,"Response":false,"Opcode":0,"Authoritative":false,"Truncated":false,"RecursionDesired":false,"RecursionAvailable":false,"Zero":false,"AuthenticatedData":false,"CheckingDisabled":false,"Rcode":0,"Question":[{"Name":"yelb-ui.my-interdomain-vl3-network.my-interdomain-vl3-network.","Qtype":28,"Qclass":1}],"Answer":null,"Ns":null,"Extra":null}
Feb  6 23:18:25.911 [DEBU] [id:29292] [type:dnsServer] (1.2)   passed clientURLs: [{udp   172.16.1.1   false   }]
Feb  6 23:18:25.911 [DEBU] [id:29292] [type:dnsServer] (1.3)   passed SearchDomains: []
Feb  6 23:18:25.911 [TRAC] [id:29292] [type:dnsServer] (2)  ⎆ sdk/pkg/tools/dnsutils/noloop/noloopDNSHandler.ServeDNS()
Feb  6 23:18:25.911 [ERRO] [id:29292] [noloopDNSHandler:ServeDNS] [type:dnsServer] (2.1)    loop is not allowed: query: ;; opcode: QUERY, status: NOERROR, id: 29292;   ;; flags:; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0;    ;      ;; QUESTION SECTION:;    ;yelb-ui.my-interdomain-vl3-network.my-interdomain-vl3-network. IN       AAAA;  ;
Feb  6 23:18:25.911 [TRAC] [id:29292] [type:dnsServer] (2.2)    message-response={"Id":29292,"Response":true,"Opcode":0,"Authoritative":false,"Truncated":false,"RecursionDesired":false,"RecursionAvailable":false,"Zero":false,"AuthenticatedData":false,"CheckingDisabled":false,"Rcode":2,"Question":[{"Name":"yelb-ui.my-interdomain-vl3-network.my-interdomain-vl3-network.","Qtype":28,"Qclass":1}],"Answer":[],"Ns":[],"Extra":[]}
Feb  6 23:18:25.913 [TRAC] [id:29292] [type:dnsServer] (5.1)       message-response={"Id":29292,"Response":true,"Opcode":0,"Authoritative":false,"Truncated":false,"RecursionDesired":false,"RecursionAvailable":false,"Zero":false,"AuthenticatedData":false,"CheckingDisabled":false,"Rcode":2,"Question":[{"Name":"yelb-ui.my-interdomain-vl3-network.my-interdomain-vl3-network.","Qtype":28,"Qclass":1}],"Answer":[],"Ns":[],"Extra":[]}

In practice, this error seems to considerably increase the time it takes for a program to perform a DNS resolution in the my-interdomain-vl3-network zone. For example, every time I use the ping yelb-ui.my-interdomain-vl3-network command, I have to wait a significant while before the actual pings get exchanged (whereas if I use the IP, the pings are exchanged almost immediately).

PS: I changed the template to use a custom domain label instead of the default podName label to generate the domain name, as the yelb app I'm using for my tests generates pods with random names.

@fr-Pursuit
Author

Actually, the slow ping initialization seems to be caused by slow, failing reverse DNS queries: the command temporarily freezes after the IP associated with the domain is shown, but before the first response appears.

In the meantime, the following appears in the remote NSE's logs:

Feb  7 12:06:14.957 [TRAC] [id:26676] [type:dnsServer] (1) ⎆ sdk/pkg/tools/dnsutils/dnsconfigs/dnsConfigsHandler.ServeDNS()
Feb  7 12:06:14.957 [TRAC] [id:26676] [type:dnsServer] (1.1)   message-request={"Id":26676,"Response":false,"Opcode":0,"Authoritative":false,"Truncated":false,"RecursionDesired":false,"RecursionAvailable":false,"Zero":false,"AuthenticatedData":false,"CheckingDisabled":false,"Rcode":0,"Question":[{"Name":"3.0.16.172.in-addr.arpa.","Qtype":12,"Qclass":1}],"Answer":null,"Ns":null,"Extra":null}
Feb  7 12:06:14.958 [DEBU] [id:26676] [type:dnsServer] (1.2)   passed clientURLs: [{udp   172.16.0.1   false   }]
Feb  7 12:06:14.958 [DEBU] [id:26676] [type:dnsServer] (1.3)   passed SearchDomains: []
Feb  7 12:06:14.958 [TRAC] [id:26676] [type:dnsServer] (2)  ⎆ sdk/pkg/tools/dnsutils/noloop/noloopDNSHandler.ServeDNS()
Feb  7 12:06:14.958 [ERRO] [id:26676] [noloopDNSHandler:ServeDNS] [type:dnsServer] (2.1)    loop is not allowed: query: ;; opcode: QUERY, status: NOERROR, id: 26676;   ;; flags:; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0;    ;      ;; QUESTION SECTION:;    ;3.0.16.172.in-addr.arpa.       IN       PTR;   ;
Feb  7 12:06:14.958 [TRAC] [id:26676] [type:dnsServer] (2.2)    message-response={"Id":26676,"Response":true,"Opcode":0,"Authoritative":false,"Truncated":false,"RecursionDesired":false,"RecursionAvailable":false,"Zero":false,"AuthenticatedData":false,"CheckingDisabled":false,"Rcode":2,"Question":[{"Name":"3.0.16.172.in-addr.arpa.","Qtype":12,"Qclass":1}],"Answer":[],"Ns":[],"Extra":[]}
Feb  7 12:06:15.011 [TRAC] [id:26676] [type:dnsServer] (5.1)       message-response={"Id":26676,"Response":true,"Opcode":0,"Authoritative":false,"Truncated":false,"RecursionDesired":false,"RecursionAvailable":false,"Zero":false,"AuthenticatedData":false,"CheckingDisabled":false,"Rcode":2,"Question":[{"Name":"3.0.16.172.in-addr.arpa.","Qtype":12,"Qclass":1}],"Answer":[],"Ns":[],"Extra":[]}

The same error appears several times in the logs, as the server seems to handle one PTR request per entry in the pod's DNS search path.
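
For reference, the PTR query name seen in the log can be derived from the IP address with Python's standard ipaddress module. This is only a sketch of how the name is formed; ptr_name is a hypothetical helper.

```python
import ipaddress


def ptr_name(ip: str) -> str:
    """Return the fully qualified reverse-lookup name for an IP address."""
    return ipaddress.ip_address(ip).reverse_pointer + "."


print(ptr_name("172.16.0.3"))  # 3.0.16.172.in-addr.arpa.
```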

I've also noticed major slowdowns in the yelb app, but these don't seem to be caused by NSM: in order for the app to work regardless of how it is deployed, I added the my-interdomain-vl3-network domain in the pod's DNS search path. But since it was added at the end of the search path, the pod has to wait for several failing DNS queries to finish before sending the right query.
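
The effect of the search path order can be sketched in Python as follows. This is a simplified model of a stub resolver, not the exact algorithm of any particular libc: candidate_names is a hypothetical helper, ndots=5 is the usual Kubernetes default, and the first three search domains are taken from another pod's log in this thread purely for illustration.

```python
def candidate_names(name, search, ndots=5):
    """Return the names a resolver would try, in order (simplified model)."""
    if name.endswith("."):
        return [name]                      # absolute name: no search path expansion
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded       # enough dots: try the name as-is first
    return expanded + [absolute]           # otherwise walk the search path first


search = [
    "ns-floating-vl3-basic.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "my-interdomain-vl3-network",          # appended last, as described above
]

for candidate in candidate_names("yelb-db", search):
    print(candidate)
```

With this ordering, three candidates have to fail before yelb-db.my-interdomain-vl3-network. is tried, whereas an absolute name with a trailing dot produces a single query.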

@glazychev-art
Contributor

@fr-Pursuit
nslookup makes both ipv4 and ipv6 requests by default.

So, we can check:
nslookup -debug yelb-ui.my-interdomain-vl3-network.
or
nslookup -type=a -debug yelb-ui.my-interdomain-vl3-network.

For ping, force ipv4:
ping yelb-ui.my-interdomain-vl3-network. -4
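
To see which record type a given log entry refers to, the numeric Qtype and Rcode fields in the message-request and message-response payloads can be decoded. The Python sketch below uses the standard DNS code points (1 = A, 12 = PTR, 28 = AAAA, Rcode 2 = SERVFAIL); summarize is a hypothetical helper, and the payload is a trimmed copy of one from the NSE log above.

```python
import json

QTYPES = {1: "A", 12: "PTR", 28: "AAAA"}
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}


def summarize(message_json: str) -> str:
    """Turn a logged DNS message into 'name TYPE -> RCODE'."""
    msg = json.loads(message_json)
    question = msg["Question"][0]
    qtype = QTYPES.get(question["Qtype"], str(question["Qtype"]))
    rcode = RCODES.get(msg["Rcode"], str(msg["Rcode"]))
    return f'{question["Name"]} {qtype} -> {rcode}'


# Trimmed message-response payload (fields not needed here were dropped).
line = (
    '{"Id":29292,"Response":true,"Rcode":2,'
    '"Question":[{"Name":"yelb-ui.my-interdomain-vl3-network.my-interdomain-vl3-network.",'
    '"Qtype":28,"Qclass":1}],"Answer":[]}'
)
print(summarize(line))
```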

@fr-Pursuit
Author

@glazychev-art Indeed! The nslookup failure came from an AAAA request. Here's the output of the command you sent:

# nslookup -debug yelb-ui.my-interdomain-vl3-network.
Server:         127.0.0.1
Address:        127.0.0.1#53

------------
    QUESTIONS:
        yelb-ui.my-interdomain-vl3-network, type = A, class = IN
    ANSWERS:
    ->  yelb-ui.my-interdomain-vl3-network
        internet address = 172.16.0.3
        ttl = 283
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
Name:   yelb-ui.my-interdomain-vl3-network
Address: 172.16.0.3
------------
    QUESTIONS:
        yelb-ui.my-interdomain-vl3-network, type = AAAA, class = IN
    ANSWERS:
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
** server can't find yelb-ui.my-interdomain-vl3-network: SERVFAIL

I had already tried forcing IPv4 with ping, but it didn't change anything. That isn't surprising, since from what I saw the freeze came from the reverse DNS queries.

@glazychev-art
Contributor

We can also check how long each request takes with nslookup.
You need to check the man page for your version of nslookup (for mine, the -debug flag is enough).

Where do PTR requests come from? Any ideas?

@fr-Pursuit
Author

I didn't see how nslookup could directly tell how long each request took, but using the time command, I got this:

root@yelb-appserver-55688766-pm8rf:/# time nslookup -debug yelb-ui.my-interdomain-vl3-network.
Server:         127.0.0.1
Address:        127.0.0.1#53

------------
    QUESTIONS:
        yelb-ui.my-interdomain-vl3-network, type = A, class = IN
    ANSWERS:
    ->  yelb-ui.my-interdomain-vl3-network
        internet address = 172.16.0.3
        ttl = 3511
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
Name:   yelb-ui.my-interdomain-vl3-network
Address: 172.16.0.3
------------
    QUESTIONS:
        yelb-ui.my-interdomain-vl3-network, type = AAAA, class = IN
    ANSWERS:
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
** server can't find yelb-ui.my-interdomain-vl3-network: SERVFAIL


real    0m0.905s
user    0m0.014s
sys     0m0.000s

Each request took roughly the same time, so I'd estimate around 400 to 500 ms per request.
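
As an alternative to dividing the total wall-clock time, each query can be timed separately. The Python sketch below shows the idea with a stand-in resolver so that it runs without a network; time_call and fake_resolve are hypothetical helpers, and fake_resolve would need to be swapped for a real lookup to measure an actual NSE.

```python
import time


def time_call(resolve, *args):
    """Run resolve(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = resolve(*args)
    return result, time.perf_counter() - start


def fake_resolve(name, qtype):
    """Stand-in resolver: answers A queries only, after a short delay."""
    time.sleep(0.01)
    return "172.16.0.3" if qtype == "A" else None


for qtype in ("A", "AAAA"):
    answer, elapsed = time_call(fake_resolve, "yelb-ui.my-interdomain-vl3-network.", qtype)
    print(f"{qtype}: {answer} in {elapsed * 1000:.1f} ms")
```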

As for the PTR requests, they come from the ping problem itself. From what I can see, ping performs a reverse DNS query when you use it with a domain name instead of an IP:

% ping -4 google.com -c 4
PING  (216.58.213.78) 56(84) bytes of data.
64 bytes from lhr25s01-in-f14.1e100.net (216.58.213.78): icmp_seq=1 ttl=110 time=4.92 ms
64 bytes from lhr25s01-in-f14.1e100.net (216.58.213.78): icmp_seq=2 ttl=110 time=4.91 ms
64 bytes from par21s18-in-f14.1e100.net (216.58.213.78): icmp_seq=3 ttl=110 time=4.98 ms
64 bytes from lhr25s01-in-f78.1e100.net (216.58.213.78): icmp_seq=4 ttl=110 time=5.00 ms

---  ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 4.909/4.950/5.000/0.038 ms

% ping -4 216.58.213.78 -c 4
PING 216.58.213.78 (216.58.213.78) 56(84) bytes of data.
64 bytes from 216.58.213.78: icmp_seq=1 ttl=110 time=4.86 ms
64 bytes from 216.58.213.78: icmp_seq=2 ttl=110 time=4.87 ms
64 bytes from 216.58.213.78: icmp_seq=3 ttl=110 time=4.96 ms
64 bytes from 216.58.213.78: icmp_seq=4 ttl=110 time=4.92 ms

--- 216.58.213.78 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 4.857/4.902/4.963/0.041 ms

@glazychev-art
Contributor

Could you please check time nslookup -debug -type=a yelb-ui.my-interdomain-vl3-network. ?

Got it, thanks. Apparently we also need to process PTR requests.

@fr-Pursuit
Author

Here's what I get when I only query A records:

# time nslookup -debug -type=a yelb-ui.my-interdomain-vl3-network.
Server:         127.0.0.1
Address:        127.0.0.1#53

------------
    QUESTIONS:
        yelb-ui.my-interdomain-vl3-network, type = A, class = IN
    ANSWERS:
    ->  yelb-ui.my-interdomain-vl3-network
        internet address = 172.16.0.3
        ttl = 2402
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
Name:   yelb-ui.my-interdomain-vl3-network
Address: 172.16.0.3


real    0m0.380s
user    0m0.008s
sys     0m0.008s

@fr-Pursuit
Author

I've also noticed major slowdowns in the yelb app, but these don't seem to be caused by NSM: in order for the app to work regardless of how it is deployed, I added the my-interdomain-vl3-network domain in the pod's DNS search path. But since it was added at the end of the search path, the pod has to wait for several failing DNS queries to finish before sending the right query.

Actually, pods connected to a vL3 network seem to use a local resolver running in the cmd-nsc sidecar container. Is there a way to configure it to cache DNS results to avoid having to wait too long each time a DNS resolution has to be made?

@glazychev-art
Contributor

glazychev-art commented Feb 8, 2023

@fr-Pursuit
In fact, the cache should work. Here are the logs, where we see that the response from the first request was cached, and the second one simply takes the result from this cache (dnsCacheHandler):

Feb  8 05:54:55.385 [TRAC] [id:40104] [type:dnsServer] (1) ⎆ sdk/pkg/tools/dnsutils/checkmsg/checkMsgHandler.ServeDNS()
Feb  8 05:54:55.385 [TRAC] [id:40104] [type:dnsServer] (1.1)   message-request={"Id":40104,"Response":false,"Opcode":0,"Authoritative":false,"Truncated":false,"RecursionDesired":true,"RecursionAvailable":false,"Zero":false,"AuthenticatedData":true,"CheckingDisabled":false,"Rcode":0,"Question":[{"Name":"alpine.my-interdomain-vl3-network.my.cluster3.","Qtype":1,"Qclass":1}],"Answer":null,"Ns":null,"Extra":null}
Feb  8 05:54:55.385 [TRAC] [id:40104] [type:dnsServer] (2)  ⎆ sdk/pkg/tools/dnsutils/dnsconfigs/dnsConfigsHandler.ServeDNS()
Feb  8 05:54:55.385 [DEBU] [id:40104] [type:dnsServer] (2.1)    passed clientURLs: [{udp   172.16.1.1   false   } {udp   10.96.0.10   false   }]
Feb  8 05:54:55.385 [DEBU] [id:40104] [type:dnsServer] (2.2)    passed SearchDomains: [ns-floating-vl3-basic.svc.cluster.local svc.cluster.local cluster.local]
Feb  8 05:54:55.385 [TRAC] [id:40104] [type:dnsServer] (3)   ⎆ sdk/pkg/tools/dnsutils/searches/searchDomainsHandler.ServeDNS()
Feb  8 05:54:55.385 [TRAC] [id:40104] [type:dnsServer] (4)    ⎆ sdk/pkg/tools/dnsutils/noloop/noloopDNSHandler.ServeDNS()
Feb  8 05:54:55.385 [TRAC] [id:40104] [type:dnsServer] (5)     ⎆ sdk/pkg/tools/dnsutils/cache/dnsCacheHandler.ServeDNS()
Feb  8 05:54:55.385 [TRAC] [id:40104] [type:dnsServer] (6)      ⎆ sdk/pkg/tools/dnsutils/fanout/fanoutHandler.ServeDNS()
Feb  8 05:54:55.386 [TRAC] [id:40104] [type:dnsServer] (6.1)        message-response={"Id":40104,"Response":true,"Opcode":0,"Authoritative":true,"Truncated":false,"RecursionDesired":false,"RecursionAvailable":false,"Zero":false,"AuthenticatedData":false,"CheckingDisabled":false,"Rcode":0,"Question":[{"Name":"alpine.my-interdomain-vl3-network.my.cluster3.","Qtype":1,"Qclass":1}],"Answer":[{"Hdr":{"Name":"alpine.my-interdomain-vl3-network.my.cluster3.","Rrtype":1,"Class":1,"Ttl":3600,"Rdlength":4},"A":"172.16.1.2"}],"Ns":[],"Extra":[]}



Feb  8 05:54:59.368 [TRAC] [id:23717] [type:dnsServer] (1) ⎆ sdk/pkg/tools/dnsutils/checkmsg/checkMsgHandler.ServeDNS()
Feb  8 05:54:59.368 [TRAC] [id:23717] [type:dnsServer] (1.1)   message-request={"Id":23717,"Response":false,"Opcode":0,"Authoritative":false,"Truncated":false,"RecursionDesired":true,"RecursionAvailable":false,"Zero":false,"AuthenticatedData":true,"CheckingDisabled":false,"Rcode":0,"Question":[{"Name":"alpine.my-interdomain-vl3-network.my.cluster3.","Qtype":1,"Qclass":1}],"Answer":null,"Ns":null,"Extra":null}
Feb  8 05:54:59.368 [TRAC] [id:23717] [type:dnsServer] (2)  ⎆ sdk/pkg/tools/dnsutils/dnsconfigs/dnsConfigsHandler.ServeDNS()
Feb  8 05:54:59.368 [DEBU] [id:23717] [type:dnsServer] (2.1)    passed clientURLs: [{udp   172.16.1.1   false   } {udp   10.96.0.10   false   }]
Feb  8 05:54:59.368 [DEBU] [id:23717] [type:dnsServer] (2.2)    passed SearchDomains: [ns-floating-vl3-basic.svc.cluster.local svc.cluster.local cluster.local]
Feb  8 05:54:59.369 [TRAC] [id:23717] [type:dnsServer] (3)   ⎆ sdk/pkg/tools/dnsutils/searches/searchDomainsHandler.ServeDNS()
Feb  8 05:54:59.369 [TRAC] [id:23717] [type:dnsServer] (4)    ⎆ sdk/pkg/tools/dnsutils/noloop/noloopDNSHandler.ServeDNS()
Feb  8 05:54:59.369 [TRAC] [id:23717] [type:dnsServer] (5)     ⎆ sdk/pkg/tools/dnsutils/cache/dnsCacheHandler.ServeDNS()
Feb  8 05:54:59.369 [TRAC] [id:23717] [type:dnsServer] (5.1)       message-response={"Id":23717,"Response":true,"Opcode":0,"Authoritative":true,"Truncated":false,"RecursionDesired":false,"RecursionAvailable":false,"Zero":false,"AuthenticatedData":false,"CheckingDisabled":false,"Rcode":0,"Question":[{"Name":"alpine.my-interdomain-vl3-network.my.cluster3.","Qtype":1,"Qclass":1}],"Answer":[{"Hdr":{"Name":"alpine.my-interdomain-vl3-network.my.cluster3.","Rrtype":1,"Class":1,"Ttl":3597,"Rdlength":4},"A":"172.16.1.2"}],"Ns":[],"Extra":[]}

This log is from alpine (cmd-nsc).
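
The two traces differ in one visible way: the cached answer comes back with a TTL of 3597 instead of 3600, about four seconds later. The Python sketch below models a cache with that behaviour. It is an illustration only, not the dnsCacheHandler code: TTLCache is a hypothetical class, and truncating the elapsed time to whole seconds is an assumption chosen because it reproduces the logged value.

```python
import time


class TTLCache:
    """Toy DNS cache that returns answers with the remaining TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # (name, qtype) -> (answer, ttl, stored_at)

    def put(self, name, qtype, answer, ttl):
        self._entries[(name, qtype)] = (answer, ttl, self._clock())

    def get(self, name, qtype):
        entry = self._entries.get((name, qtype))
        if entry is None:
            return None
        answer, ttl, stored_at = entry
        # Assumption: elapsed time is truncated to whole seconds.
        remaining = ttl - int(self._clock() - stored_at)
        if remaining <= 0:
            del self._entries[(name, qtype)]  # expired: force a fresh query
            return None
        return answer, remaining


now = [0.0]  # fake clock so the example is deterministic
cache = TTLCache(clock=lambda: now[0])
cache.put("alpine.my-interdomain-vl3-network.my.cluster3.", 1, "172.16.1.2", 3600)

now[0] = 3.98  # roughly the gap between the two logged requests
print(cache.get("alpine.my-interdomain-vl3-network.my.cluster3.", 1))  # ('172.16.1.2', 3597)
```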

Could you also check the latest main branch of deployments-k8s? We slightly increased the query speed, so at least nslookup -type=a should be faster for cmd-nsc.

@fr-Pursuit
Author

fr-Pursuit commented Feb 8, 2023

Weirdly, I get similar logs in the cmd-nsc container, but the NSE's logs indicate that it still handles the requests, even though they should be served from the cache in the local cmd-nsc container.

I ran dig yelb-appserver.my-interdomain-vl3-network on the yelb-appserver container (so only the local NSE should get involved).

I'll retry with the latest main branch (I'm currently using v1.7.0), and I'll get back to you.

@fr-Pursuit
Author

fr-Pursuit commented Feb 8, 2023

Great! With the latest branch, the DNS resolutions are much faster:

# time nslookup -debug -type=a yelb-ui.my-interdomain-vl3-network.
Server:         127.0.0.1
Address:        127.0.0.1#53

------------
    QUESTIONS:
        yelb-ui.my-interdomain-vl3-network, type = A, class = IN
    ANSWERS:
    ->  yelb-ui.my-interdomain-vl3-network
        internet address = 172.16.1.4
        ttl = 3408
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
Name:   yelb-ui.my-interdomain-vl3-network
Address: 172.16.1.4


real    0m0.021s
user    0m0.009s
sys     0m0.009s

However, I have another issue... I deployed yelb using the same manifest as before, and for some reason yelb-db.my-interdomain-vl3-network cannot be resolved. The yelb-db pod was successfully connected to the vL3 network (with the address 172.16.0.2), but the NSE fails when I try to resolve its associated name:

# nslookup -debug -type=a yelb-db.my-interdomain-vl3-network.
Server:         127.0.0.1
Address:        127.0.0.1#53

------------
    QUESTIONS:
        yelb-db.my-interdomain-vl3-network, type = A, class = IN
    ANSWERS:
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
** server can't find yelb-db.my-interdomain-vl3-network: SERVFAIL

The following error appears in the NSE's logs:

Feb  8 12:03:31.680 [TRAC] [id:40970] [type:dnsServer] (1) ⎆ sdk/pkg/tools/dnsutils/dnsconfigs/dnsConfigsHandler.ServeDNS()
Feb  8 12:03:31.681 [TRAC] [id:40970] [type:dnsServer] (1.1)   message-request={"Id":40970,"Response":false,"Opcode":0,"Authoritative":false,"Truncated":false,"RecursionDesired":true,"RecursionAvailable":false,"Zero":false,"AuthenticatedData":false,"CheckingDisabled":false,"Rcode":0,"Question":[{"Name":"yelb-db.my-interdomain-vl3-network.","Qtype":1,"Qclass":1}],"Answer":null,"Ns":null,"Extra":null}
Feb  8 12:03:31.681 [DEBU] [id:40970] [type:dnsServer] (1.2)   passed clientURLs: []
Feb  8 12:03:31.681 [DEBU] [id:40970] [type:dnsServer] (1.3)   passed SearchDomains: []
Feb  8 12:03:31.681 [TRAC] [id:40970] [type:dnsServer] (2)  ⎆ sdk/pkg/tools/dnsutils/noloop/noloopDNSHandler.ServeDNS()
Feb  8 12:03:31.681 [TRAC] [id:40970] [type:dnsServer] (3)   ⎆ sdk/pkg/tools/dnsutils/norecursion/norecursionDNSHandler.ServeDNS()
Feb  8 12:03:31.682 [TRAC] [id:40970] [type:dnsServer] (3.1)     message-request-diff=[{"type":"update","path":["MsgHdr","RecursionDesired"],"from":true,"to":false}]
Feb  8 12:03:31.682 [TRAC] [id:40970] [type:dnsServer] (4)    ⎆ sdk/pkg/tools/dnsutils/memory/memoryHandler.ServeDNS()
Feb  8 12:03:31.682 [TRAC] [id:40970] [type:dnsServer] (5)     ⎆ sdk/pkg/tools/dnsutils/fanout/fanoutHandler.ServeDNS()
Feb  8 12:03:31.682 [ERRO] [id:40970] [fanoutHandler:ServeDNS] [type:dnsServer] (5.1)       no urls to fanout
Feb  8 12:03:31.682 [TRAC] [id:40970] [type:dnsServer] (5.2)       message-response={"Id":40970,"Response":true,"Opcode":0,"Authoritative":false,"Truncated":false,"RecursionDesired":false,"RecursionAvailable":false,"Zero":false,"AuthenticatedData":false,"CheckingDisabled":false,"Rcode":2,"Question":[{"Name":"yelb-db.my-interdomain-vl3-network.","Qtype":1,"Qclass":1}],"Answer":[],"Ns":[],"Extra":[]}

All the other names can be resolved fine: only yelb-db has a problem.

Any idea about what could have gone wrong?

@glazychev-art
Contributor

Is yelb-db a different client from yelb-ui?

Could you upload the full NSE log to which it is connected?

@fr-Pursuit
Author

yelb-db is indeed another client. The logs of the nsc-cmd-init container contained errors about a timeout being exceeded.

I'm currently redeploying my clusters. It's probably better to restart with a clean setup, just to be sure my previous tinkering didn't affect anything.

@fr-Pursuit
Author

fr-Pursuit commented Feb 8, 2023

It appears something was wrong with my setup. After a clean reinstall, everything works fine.

However, the resolution speedup only applies when I use the absolute DNS name:

# time nslookup -debug -type=a yelb-db.my-interdomain-vl3-network.
Server:         127.0.0.1
Address:        127.0.0.1#53

------------
    QUESTIONS:
        yelb-db.my-interdomain-vl3-network, type = A, class = IN
    ANSWERS:
    ->  yelb-db.my-interdomain-vl3-network
        internet address = 172.16.1.3
        ttl = 3406
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
Name:   yelb-db.my-interdomain-vl3-network
Address: 172.16.1.3


real    0m0.017s
user    0m0.005s
sys     0m0.009s

If I use the relative name (my-interdomain-vl3-network being in the DNS search path), it still takes around 400ms every time:

# time nslookup -debug -type=a yelb-db
Server:         127.0.0.1
Address:        127.0.0.1#53

------------
    QUESTIONS:
        yelb-db, type = A, class = IN
    ANSWERS:
    ->  yelb-db.my-interdomain-vl3-network
        internet address = 172.16.1.3
        ttl = 3311
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
Name:   yelb-db.my-interdomain-vl3-network
Address: 172.16.1.3


real    0m0.305s
user    0m0.006s
sys     0m0.012s

@glazychev-art
Contributor

@fr-Pursuit
Thank you for your results and comments! We really appreciate it.

I don't think it's right to solve all DNS problems in one issue. Could you create separate issues for the other problems? For example, one about PTR requests.

@denis-tingaikin
Member

It seems the root cause of the problem is fixed.

The next problem will be considered separately in #1425.
