- Walk through some troubleshooting tips
- Discuss failure modes
- Review resources/DNS outage stories
- Deploy EKS cluster
- Deploy DNS utils images
- DNS resolution
- Walk through failure modes
If the cluster from Lecture 4 is not running, walk through the Deployment from Ch03_L04 - Setting up an overlay network
Useful Docker images that can be deployed to aid in troubleshooting DNS issues
kubectl run --restart=Never -it --image infoblox/dnstools dnstools
dnsutils is used for Kubernetes end-to-end testing
kubectl run --restart=Never -it --image gcr.io/kubernetes-e2e-test-images/dnsutils:1.3 dnsutils
or
kubectl apply -f dnsutils.yml
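dnsutils.yml is not shown in the lecture materials; a minimal sketch of what it could look like, based on the image used above (the command and restart policy are assumptions):
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
spec:
  restartPolicy: Never
  containers:
  - name: dnsutils
    image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
    # keep the pod alive so it can be exec'd into for DNS queries
    command:
      - sleep
      - "3600"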
Note
Alpine-based images may also have issues with search domains since they use musl libc
This can be worked around by using fully qualified names or by migrating to a debian-slim based image.
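For example, from an Alpine-based pod a fully qualified query sidesteps the musl search-domain behaviour (the pod name here is only illustrative):
kubectl run --restart=Never -it --image alpine alpine-dns -- nslookup hello.default.svc.cluster.local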
Deploy the Hello Services from the previous lecture
Hello Pod endpoints
kubectl apply -f deployment.yml
Hello ClusterIP Service
kubectl apply -f service.yml
Hello Headless Service
kubectl apply -f service-headless.yml
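The Service manifests come from the previous lecture; a rough sketch of service.yml and service-headless.yml for reference (selector labels and ports are assumed) - only clusterIP: None makes the second one headless:
# service.yml - ClusterIP Service
apiVersion: v1
kind: Service
metadata:
  name: hello
spec:
  selector:
    app: hello
  ports:
  - port: 80
    targetPort: 8080
---
# service-headless.yml - headless Service
apiVersion: v1
kind: Service
metadata:
  name: hello-headless
spec:
  clusterIP: None   # headless: DNS returns the backing pod IPs directly
  selector:
    app: hello
  ports:
  - port: 80
    targetPort: 8080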
Resolve Hello service
kubectl exec -it dnsutils -- host -v -t a hello
Output
Trying "hello.default.svc.cluster.local"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12359
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;hello.default.svc.cluster.local. IN A
;; ANSWER SECTION:
hello.default.svc.cluster.local. 30 IN A 10.99.200.108
Received 96 bytes from 10.96.0.10#53 in 0 ms
Resolve Hello Headless service
kubectl exec -it dnsutils -- host -v -t a hello-headless
Trying "hello-headless.default.svc.cluster.local"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28887
;; flags: qr aa rd; QUERY: 1, ANSWER: 7, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;hello-headless.default.svc.cluster.local. IN A
;; ANSWER SECTION:
hello-headless.default.svc.cluster.local. 30 IN A 10.244.0.7
hello-headless.default.svc.cluster.local. 30 IN A 10.244.0.8
hello-headless.default.svc.cluster.local. 30 IN A 10.244.0.6
hello-headless.default.svc.cluster.local. 30 IN A 10.244.0.4
hello-headless.default.svc.cluster.local. 30 IN A 10.244.0.5
hello-headless.default.svc.cluster.local. 30 IN A 10.244.0.9
hello-headless.default.svc.cluster.local. 30 IN A 10.244.0.10
Resolve google.com
kubectl exec -it dnsutils -- host -v -t a google.com
Output
Trying "google.com.default.svc.cluster.local"
Trying "google.com.svc.cluster.local"
Trying "google.com.cluster.local"
Trying "google.com"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47222
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;google.com. IN A
;; ANSWER SECTION:
google.com. 8 IN A 216.58.193.142
Received 54 bytes from 10.96.0.10#53 in 0 ms
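Note the three extra lookups generated by the search list (ndots:5). For comparison, querying the absolute name (trailing dot) skips the search list, so only a single Trying "google.com" line should appear:
kubectl exec -it dnsutils -- host -v -t a google.com.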
Add autopath @kubernetes and pods verified to the coredns ConfigMap
kubectl edit configmap coredns -n kube-system
Add the two highlighted lines (marked with *)
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
*       pods verified
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
*   autopath @kubernetes
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
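The reload plugin already in the Corefile should pick up the ConfigMap change after its check interval; one way to confirm is to look for a fresh "plugin/reload: Running configuration MD5" line in the CoreDNS logs:
kubectl logs -n kube-system -l k8s-app=kube-dns | grep reload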
Redeploy the dnsutils pod
kubectl delete po dnsutils
kubectl apply -f dnsutils.yml
Resolve google.com again
kubectl exec -it dnsutils -- host -v -t a google.com
Output
Trying "google.com.default.svc.cluster.local"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52185
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;google.com.default.svc.cluster.local. IN A
;; ANSWER SECTION:
google.com.default.svc.cluster.local. 30 IN CNAME google.com.
google.com. 30 IN A 216.58.193.142
Received 140 bytes from 10.96.0.10#53 in 2 ms
- Verify the /etc/resolv.conf
kubectl exec -it dnsutils -- cat /etc/resolv.conf
Output
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5
The nameserver directive should match the ClusterIP of the CoreDNS Service, in this case 10.96.0.10
kubectl get svc kube-dns -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 15h
- View all CoreDNS pods
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o wide
Output
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-5644d7b6d9-42g9s 1/1 Running 0 16h 10.244.0.3 stack-control-plane <none> <none>
coredns-5644d7b6d9-mdgdb 1/1 Running 0 16h 10.244.0.2 stack-control-plane <none> <none>
- Check to make sure CoreDNS has endpoints
kubectl get ep kube-dns --namespace=kube-system
NAME ENDPOINTS AGE
kube-dns 10.244.0.2:53,10.244.0.3:53,10.244.0.2:53 + 3 more... 16h
kubectl describe ep kube-dns --namespace=kube-system
Output: 2 pods, each exposing 3 ports, for 6 endpoints in total
Name: kube-dns
Namespace: kube-system
Labels: k8s-app=kube-dns
kubernetes.io/cluster-service=true
kubernetes.io/name=KubeDNS
Annotations: endpoints.kubernetes.io/last-change-trigger-time: 2020-03-22T02:55:44Z
Subsets:
Addresses: 10.244.0.2,10.244.0.3
NotReadyAddresses: <none>
Ports:
Name Port Protocol
---- ---- --------
dns 53 UDP
dns-tcp 53 TCP
metrics 9153 TCP
Events: <none>
- Verify logs in CoreDNS
for p in $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name); do kubectl logs --namespace=kube-system $p; done
Output
.:53
2020-03-22T02:55:38.525Z [INFO] plugin/reload: Running configuration MD5 = f64cb9b977c7dfca58c4fab108535a76
2020-03-22T02:55:38.525Z [INFO] CoreDNS-1.6.2
2020-03-22T02:55:38.525Z [INFO] linux/amd64, go1.12.8, 795a3eb
CoreDNS-1.6.2
linux/amd64, go1.12.8, 795a3eb
.:53
2020-03-22T02:55:38.533Z [INFO] plugin/reload: Running configuration MD5 = f64cb9b977c7dfca58c4fab108535a76
2020-03-22T02:55:38.533Z [INFO] CoreDNS-1.6.2
2020-03-22T02:55:38.533Z [INFO] linux/amd64, go1.12.8, 795a3eb
CoreDNS-1.6.2
linux/amd64, go1.12.8, 795a3eb
Known Issue with CoreDNS and /etc/resolv.conf
Some Linux distros use systemd-resolved, which replaces /etc/resolv.conf with a stub resolver config; the real upstream resolvers live in /run/systemd/resolve/resolv.conf
This can be fixed by using kubelet's --resolv-conf flag to point to the correct resolv.conf
https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#known-issues
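A sketch of the two ways to point kubelet at the real file on a systemd-resolved host (the path shown is the usual systemd-resolved location):
# kubelet flag
--resolv-conf=/run/systemd/resolve/resolv.conf
# or the equivalent field in the kubelet config file
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
resolvConf: /run/systemd/resolve/resolv.conf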
Total DNS outage due to OOM Killer
Default deployment memory requests and limits
resources:
  limits:
    memory: 170Mi
  requests:
    cpu: 100m
    memory: 70Mi
View the current deployment's memory limits and requests
kubectl get pods -l k8s-app=kube-dns -n kube-system -o=jsonpath="{range .items[*]}[{.metadata.name}, {.spec.containers[*].resources.limits.memory},{.spec.containers[*].resources.requests.memory} ] {end}"
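If CoreDNS pods are being OOM-killed, one mitigation is to raise the requests and limits on the deployment; the values here are only illustrative:
kubectl -n kube-system set resources deployment coredns --limits=memory=300Mi --requests=cpu=100m,memory=100Mi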
Autoscaler
https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/dns-horizontal-autoscaler
https://kubernetes.io/docs/tasks/administer-cluster/dns-horizontal-autoscaling/
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: coredns
  namespace: kube-system
spec:
  maxReplicas: 20
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coredns
  targetCPUUtilizationPercentage: 50
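Note that the add-on linked above scales with the cluster-proportional-autoscaler rather than an HPA; a sketch of its scaling ConfigMap as described in those docs (the values are illustrative):
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  # replicas = max( ceil(cores/coresPerReplica), ceil(nodes/nodesPerReplica) ), never below min
  linear: '{"coresPerReplica":256,"nodesPerReplica":16,"min":2,"preventSinglePointFailure":true}'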
Ensure you are monitoring and alerting on CoreDNS Metrics
https://coredns.io/plugins/metrics/
CoreDNS exports process and Go runtime metrics as well as CoreDNS-specific metrics
Set thresholds on items such as the following (a sample Prometheus alert sketch follows below)
coredns_dns_request_duration_seconds
- duration to process each query
coredns_dns_request_count_total
- counter of DNS requests made per zone, protocol, and family
Also set thresholds on pod-level metrics such as
process_resident_memory_bytes
- Resident memory size in bytes
Many of the CoreDNS plugins also export metrics when the Prometheus plugin is enabled.
Example:
coredns_health_request_duration_seconds{}
- duration to process an HTTP query to the local /health endpoint.
As this is a local operation it should be fast. A (large) increase in this duration indicates that the CoreDNS process is having trouble keeping up with its query load.
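A minimal Prometheus alerting-rule sketch built on those metrics (the alert name and threshold are only illustrative):
groups:
- name: coredns
  rules:
  - alert: CoreDNSHighLatency
    # 99th percentile query latency per server over the last 5 minutes
    expr: histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le, server)) > 0.1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: CoreDNS 99th percentile latency is above 100ms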
Ensure that your CoreDNS deployment has the priority class set
It is set in the default deployment yaml
Verify Priority Class Name
kubectl get pods -l k8s-app=kube-dns -n kube-system -o=jsonpath="{range .items[*]}[{.metadata.name},{.spec.priorityClassName} ] {end}"
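In the default manifest this resolves to the system-cluster-critical priority class; the relevant pod-spec snippet looks like:
spec:
  priorityClassName: system-cluster-critical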
These plugins will also help you see what is going on inside CoreDNS
errors - any errors encountered during query processing are printed to standard output.
trace - enables OpenTracing of how a request flows through CoreDNS.
log - enables query logging.
health - reports that CoreDNS is up and running by returning a 200 OK HTTP status code.
ready - enables an HTTP endpoint on port 8181 that returns 200 OK once all plugins that are able to signal readiness have done so.
debug - disables the automatic recovery upon a crash so that you get a nice stack trace. Only use for testing and NOT in production.
The health and ready endpoints should be wired into the deployment as liveness and readiness probes
livenessProbe:
  httpGet:
    path: /health
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 60
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: /ready
    port: 8181
    scheme: HTTP
View the current CoreDNS Corefile
kubectl describe cm coredns -n kube-system
Name: coredns
Namespace: kube-system
Labels: <none>
Annotations: <none>
Data
====
Corefile:
----
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
Events: <none>
Make sure to delete this cluster
eksctl delete cluster --name test-cluster
[ℹ] eksctl version 0.12.0
[ℹ] using region us-west-2
[ℹ] deleting EKS cluster "test-cluster"
[ℹ] account is not authorized to use Fargate. Ignoring error
[✔] kubeconfig has been updated
[ℹ] cleaning up LoadBalancer services
[ℹ] 2 sequential tasks: { delete nodegroup "ng-d4923ffb", delete cluster control plane "test-cluster" [async] }
[ℹ] will delete stack "eksctl-test-cluster-nodegroup-ng-d4923ffb"
[ℹ] waiting for stack "eksctl-test-cluster-nodegroup-ng-d4923ffb" to get deleted
[ℹ] will delete stack "eksctl-test-cluster-cluster"
[✔] all cluster resources were deleted