add dns record orphan alert (#862)

* add dns record missing alert

Signed-off-by: craig <cbrookes@redhat.com>

add orphan record mitigation doc

fix lint

improve alert query

* Update doc/user-guides/orphan-dns-records.md

Co-authored-by: Michael Nairn <mnairn@redhat.com>

* Update doc/user-guides/orphan-dns-records.md

Co-authored-by: Michael Nairn <mnairn@redhat.com>

* Update doc/user-guides/orphan-dns-records.md

Co-authored-by: Michael Nairn <mnairn@redhat.com>

---------

Co-authored-by: Michael Nairn <mnairn@redhat.com>
Kuadrant · Sep 24, 2024 · fbc1021
1 parent 3475a1e
commit fbc1021
Showing 6 changed files with 121 additions and 3 deletions.
diff --git a/Makefile b/Makefile
@@ -353,11 +353,12 @@ run: generate fmt vet ## Run a controller from your host.
 docker-build: GIT_SHA=$(shell git rev-parse HEAD || echo "unknown")
 docker-build: DIRTY=$(shell $(PROJECT_PATH)/utils/check-git-dirty.sh || echo "unknown")
 docker-build: ## Build docker image with the manager.
-	$(CONTAINER_ENGINE) build \
+		$(CONTAINER_ENGINE) build \
 		--build-arg QUAY_IMAGE_EXPIRY=$(QUAY_IMAGE_EXPIRY) \
 		--build-arg GIT_SHA=$(GIT_SHA) \
 		--build-arg DIRTY=$(DIRTY) \
 		--build-arg QUAY_IMAGE_EXPIRY=$(QUAY_IMAGE_EXPIRY) \
+		--load \
 		-t $(IMG) .
 
 docker-push: ## Push docker image with the manager.

diff --git a/config/observability/kustomization.yaml b/config/observability/kustomization.yaml
@@ -3,8 +3,7 @@ kind: Kustomization
 
 resources:
   - github.com/prometheus-operator/kube-prometheus?ref=release-0.13
-  - github.com/Kuadrant/gateway-api-state-metrics?ref=0.4.0
-  - github.com/Kuadrant/gateway-api-state-metrics/config/examples/dashboards?ref=0.4.0
+  - github.com/Kuadrant/gateway-api-state-metrics/config/kuadrant?ref=0.5.0
 # To scrape istio metrics, 3 configurations are required:
 # 1. Envoy metrics directly from the istio ingress gateway pod
   - prometheus/monitors/pod-monitor-envoy.yaml

diff --git a/config/observability/rbac/ksm_clusterrole_patch.yaml b/config/observability/rbac/ksm_clusterrole_patch.yaml
@@ -34,6 +34,7 @@
     - dnspolicies
     - ratelimitpolicies
     - authpolicies
+    - dnsrecords
     verbs:
     - list
     - watch
diff --git a/doc/user-guides/orphan-dns-records.md b/doc/user-guides/orphan-dns-records.md
@@ -0,0 +1,94 @@
+## Orphan DNS Records
+
+This document covers multi-cluster DNS, where more than one gateway instance shares a common hostname with other gateways. It assumes you have the [observability](https://docs.kuadrant.io/0.10.0/kuadrant-operator/doc/observability/examples/) stack set up.
+
+### What is an orphan record?
+
+An orphan DNS record is a record, or set of records, owned by an instance of the DNS Operator that no longer has a representation of those records on its cluster.
+
+### How do orphan records occur?
+
+Orphan records can occur when a `DNSRecord` resource (a resource created in response to a `DNSPolicy`) is deleted without giving the owning controller time to clean up the associated records in the DNS provider. Generally, for this to happen you would need to do one of the following: force-remove a `finalizer` from the `DNSRecord` resource; delete the `kuadrant-system` namespace directly; uninstall Kuadrant (delete the subscription if using OLM) without first cleaning up existing policies; or delete a cluster entirely without first cleaning up the associated DNSPolicies. These scenarios are uncommon, but when they do occur they can leave behind records in your DNS provider that may point to IPs or hosts that are no longer valid.
+
+
+### How do you spot that orphan records exist?
+
+There is a Prometheus-based alert that uses metrics exposed by the DNS components to spot this situation. If you have installed the alerts for Kuadrant under the examples folder, you will see an alert called `PossibleOrphanedDNSRecords` in the alerts tab. When it is firing, there are likely to be orphaned records in your provider.
+
+### How do you get rid of an orphan record?
+
+To remove an orphan record, we must first identify the owner that is no longer aware of the record. To do this, we need an existing DNSRecord in another cluster.
+
+Example: You have two clusters that each have a gateway and share the host `apps.example.com`, and you have set up a DNSPolicy for each gateway. On cluster 1 you remove the `kuadrant-system` namespace without first cleaning up the existing DNSPolicies targeting the gateway in your `ingress-gateway` namespace. There is now a set of records that were being managed for that gateway and have not been removed.
+On cluster 2, the DNS Operator managing the existing DNSRecord in that cluster has a record of all owners of that DNS name.
+The Prometheus alert detects that the number of owners does not match the number of DNSRecord resources and fires.
+To remedy this, rather than going to the DNS provider directly and trying to work out which records to remove, you can follow the steps below.
+
+1) Get the owner ID of the DNSRecord on cluster 2 for the shared host
+
+```
+kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.status.ownerID}'
+```
+
+2) Get all the owner IDs
+
+```
+kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.status.domainOwners}'
+
+## output
+["26aacm1z","49qn0wp7"]
+```
+
+3) Create a placeholder DNSRecord with a non-active ownerID
+
+For each owner ID returned whose records we want to remove (it must not be the owner ID of the record we got earlier), we need to create a DNSRecord resource and then delete it. This triggers the running operator in this cluster to clean up the records belonging to that owner.
+
+```
+# this is one of the owner IDs that is **not** the ownerID of the existing DNSRecord on this cluster
+export ownerID=26aacm1z
+
+export rootHost=$(kubectl get dnsrecord.kuadrant.io somerecord -n  my-gateway-ns -o=jsonpath='{.spec.rootHost}')
+
+# a namespace that contains the AWS credentials
+export targetNS=kuadrant-system
+
+kubectl apply -f - <<EOF
+apiVersion: kuadrant.io/v1alpha1
+kind: DNSRecord
+metadata:
+  name: delete-old-loadbalanced-dnsrecord
+  namespace: ${targetNS}
+spec:
+  providerRef:
+    name: my-aws-credentials
+  ownerID: ${ownerID}
+  rootHost: ${rootHost}
+  endpoints:
+    - dnsName: ${rootHost}
+      recordTTL: 60
+      recordType: CNAME
+      targets:
+        - klb.doesnt-exist.${rootHost}
+EOF
+
+```
+
+4) Delete the DNSRecord
+
+```
+kubectl delete dnsrecord.kuadrant.io delete-old-loadbalanced-dnsrecord -n ${targetNS} 
+```
+
+5) Verify
+
+We can verify by checking the DNSRecord again. Note that it may take several minutes for the other record to update. We can force an update by adding a label to the record.
+
+```
+kubectl label dnsrecord.kuadrant.io somerecord test=test -n my-gateway-ns
+
+kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.status.domainOwners}'
+
+```
+
+We should also see the alert eventually stop firing.
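
Steps 1 to 3 of the guide amount to a set difference: take the `domainOwners` list and drop the owner ID of the record that still exists. The following is a minimal sketch of that filtering, not part of the commit. The function name `orphan_candidates` is hypothetical, and it assumes the `domainOwners` value is a flat JSON array of quoted strings, as in the sample output in step 2. An ID it prints may still be active on another cluster, so each one should be confirmed before a placeholder record is created for it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every owner ID in a domainOwners JSON array
# except the local one, one ID per line.
# Usage: orphan_candidates LOCAL_OWNER_ID DOMAIN_OWNERS_JSON
orphan_candidates() {
  local local_id="$1" owners_json="$2"
  # Strip brackets, quotes, and spaces, then split the list on commas.
  printf '%s\n' "$owners_json" \
    | tr -d '[]" ' \
    | tr ',' '\n' \
    | grep -v '^$' \
    | grep -vxF -- "$local_id" || true   # an empty result is not an error
}

# Example with the values from the guide (the kubectl calls are the ones
# shown in steps 1 and 2):
#   local_id=$(kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.status.ownerID}')
#   owners=$(kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.status.domainOwners}')
#   orphan_candidates "$local_id" "$owners"
```

Each ID printed is a candidate value for `ownerID` in the placeholder DNSRecord from step 3.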
diff --git a/examples/alerts/kustomization.yaml b/examples/alerts/kustomization.yaml
@@ -5,3 +5,4 @@ resources:
   - prometheusrules_policies_missing.yaml
   - slo-availability.yaml
   - slo-latency.yaml
+  - orphan_records.yaml
diff --git a/examples/alerts/orphan_records.yaml b/examples/alerts/orphan_records.yaml
@@ -0,0 +1,22 @@
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: dns-records-rules
+  namespace: monitoring
+spec:
+  groups:
+  - name: dns_records
+    rules:
+    - alert: PossibleOrphanedDNSRecords
+      expr: |
+        sum by(rootDomain) (
+          count by(rootDomain) (kuadrant_dnsrecord_status_root_domain_owners) / 
+          count by(rootDomain) (kuadrant_dnsrecord_status) - 
+          count by(rootDomain) (kuadrant_dnsrecord_status)
+        ) > 0
+      for: 5m
+      labels:
+        severity: warning
+      annotations:
+        summary: "The number of DNS Owners is greater than the number of records for root domain '{{ $labels.rootDomain }}'"
+        description: "This alert fires if the number of owners (controllers collaborating on a record set) is greater than the number of records. This may mean a record has been left behind in the provider due to a failed delete."
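
The alert expression can be checked by hand with a little arithmetic. This sketch is not part of the commit and rests on an assumption about the metric shape: that `kuadrant_dnsrecord_status_root_domain_owners` exposes one series per (DNSRecord, owner) pair and that Prometheus sees the records from every cluster. Under that assumption, dividing the owner series count by the record count gives the number of owners, and subtracting the record count gives the number of owners with no matching record. The function name `orphan_alert_value` is hypothetical, and shell integer division stands in for PromQL's floating-point division.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the PromQL expression for one rootDomain:
#   count(owners series) / count(status series) - count(status series)
# Usage: orphan_alert_value OWNER_SERIES RECORD_SERIES
orphan_alert_value() {
  local owner_series="$1" records="$2"
  echo $(( owner_series / records - records ))
}

# Healthy: 2 records, each listing 2 owners, so 4 owner series.
orphan_alert_value 4 2   # prints 0, alert does not fire

# Orphaned: 1 record left, still listing 2 owners, so 2 owner series.
orphan_alert_value 2 1   # prints 1, alert fires once this holds for 5m
```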