add dns record orphan alert (#862)
* add dns record missing alert

Signed-off-by: craig <cbrookes@redhat.com>

add orphan record mitigation doc

fix lint

improve alert query

* Update doc/user-guides/orphan-dns-records.md

Co-authored-by: Michael Nairn <mnairn@redhat.com>

* Update doc/user-guides/orphan-dns-records.md

Co-authored-by: Michael Nairn <mnairn@redhat.com>

* Update doc/user-guides/orphan-dns-records.md

Co-authored-by: Michael Nairn <mnairn@redhat.com>

---------

Co-authored-by: Michael Nairn <mnairn@redhat.com>
maleck13 and mikenairn authored Sep 24, 2024
1 parent 3475a1e commit fbc1021
Showing 6 changed files with 121 additions and 3 deletions.
3 changes: 2 additions & 1 deletion Makefile
Original file line number Diff line number Diff line change
@@ -353,11 +353,12 @@ run: generate fmt vet ## Run a controller from your host.
docker-build: GIT_SHA=$(shell git rev-parse HEAD || echo "unknown")
docker-build: DIRTY=$(shell $(PROJECT_PATH)/utils/check-git-dirty.sh || echo "unknown")
docker-build: ## Build docker image with the manager.
$(CONTAINER_ENGINE) build \
$(CONTAINER_ENGINE) build \
--build-arg QUAY_IMAGE_EXPIRY=$(QUAY_IMAGE_EXPIRY) \
--build-arg GIT_SHA=$(GIT_SHA) \
--build-arg DIRTY=$(DIRTY) \
--build-arg QUAY_IMAGE_EXPIRY=$(QUAY_IMAGE_EXPIRY) \
--load \
-t $(IMG) .

docker-push: ## Push docker image with the manager.
3 changes: 1 addition & 2 deletions config/observability/kustomization.yaml
@@ -3,8 +3,7 @@ kind: Kustomization

resources:
- github.com/prometheus-operator/kube-prometheus?ref=release-0.13
-- github.com/Kuadrant/gateway-api-state-metrics?ref=0.4.0
-- github.com/Kuadrant/gateway-api-state-metrics/config/examples/dashboards?ref=0.4.0
+- github.com/Kuadrant/gateway-api-state-metrics/config/kuadrant?ref=0.5.0
# To scrape istio metrics, 3 configurations are required:
# 1. Envoy metrics directly from the istio ingress gateway pod
- prometheus/monitors/pod-monitor-envoy.yaml
1 change: 1 addition & 0 deletions config/observability/rbac/ksm_clusterrole_patch.yaml
@@ -34,6 +34,7 @@
- dnspolicies
- ratelimitpolicies
- authpolicies
- dnsrecords
verbs:
- list
- watch
94 changes: 94 additions & 0 deletions doc/user-guides/orphan-dns-records.md
@@ -0,0 +1,94 @@
## Orphan DNS Records

This document covers multi-cluster DNS, where more than one gateway instance shares a common hostname with other gateways. It assumes you have the [observability](https://docs.kuadrant.io/0.10.0/kuadrant-operator/doc/observability/examples/) stack set up.

### What is an orphan record?

An orphan DNS record is a record, or set of records, owned by an instance of the DNS Operator that no longer has a representation of those records on its cluster.

### How do orphan records occur?

Orphan records can occur when a `DNSRecord` resource (a resource that is created in response to a `DNSPolicy`) is deleted without allowing the owning controller time to clean up the associated records in the DNS provider. Generally, for this to happen you would need to force-remove a `finalizer` from the `DNSRecord` resource, delete the `kuadrant-system` namespace directly or uninstall Kuadrant (delete the subscription if using OLM) without first cleaning up existing policies, or delete a cluster entirely without first cleaning up the associated DNSPolicies. These scenarios are not common, but when they do occur they can leave behind records in your DNS provider that may point to IPs or hosts that are no longer valid.


### How do you spot that orphan records exist?

A Prometheus-based alert uses metrics exposed by the DNS components to spot this situation. If you have installed the Kuadrant alerts from the examples folder, the alerts tab shows an alert called `PossibleOrphanedDNSRecords`. When it is firing, there are likely to be orphaned records in your provider.
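As a rough illustration of the comparison made by the `PossibleOrphanedDNSRecords` expression in `examples/alerts/orphan_records.yaml`, the sketch below repeats its arithmetic for a single root domain. The two counts are hypothetical sample values, not output from a real cluster; how many series each metric actually exposes depends on your metrics setup.

```shell
# Hypothetical series counts for one rootDomain; substitute the values that
# your own Prometheus returns for the two count by(rootDomain) sub-queries.
owner_series=2    # series of kuadrant_dnsrecord_status_root_domain_owners
record_series=1   # series of kuadrant_dnsrecord_status

# Same operator order as the rule: division first, then subtraction.
# Shell arithmetic is integer-only, whereas PromQL divides as floats.
result=$(( owner_series / record_series - record_series ))

if [ "$result" -gt 0 ]; then
  verdict="alert condition met"
else
  verdict="alert condition not met"
fi
echo "result=$result: $verdict"
```

With these sample values the result is positive, which is the condition the rule must hold for 5 minutes before the alert fires.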

### How do you get rid of an orphan record?

To remove an orphan record, we must first identify the owner that is no longer aware of the record. To do this, we need an existing DNSRecord in another cluster.

Example: You have 2 clusters that each have a gateway and share the host `apps.example.com`, and you have set up a DNSPolicy for each gateway. On cluster 1, you remove the `kuadrant-system` namespace without first cleaning up the existing DNSPolicies targeting the gateway in your `ingress-gateway` namespace. There is now a set of records that were being managed for that gateway and have not been removed.
On cluster 2, the DNS Operator managing the existing DNSRecord in that cluster has a record of all owners of that DNS name.
The Prometheus alert spots that the number of owners does not match the number of DNSRecord resources and fires.
To remedy this, rather than going to the DNS provider directly and trying to work out which records to remove, you can follow the steps below.

1) Get the owner ID of the DNSRecord on cluster 2 for the shared host

```
kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.status.ownerID}'
```

2) Get all the owner IDs

```
kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.status.domainOwners}'
## output
["26aacm1z","49qn0wp7"]
```
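The outputs of steps 1 and 2 can be compared by hand, or with a small sketch like the one below. It assumes that `49qn0wp7` is the owner ID returned in step 1 (the sample output above does not say which ID belongs to cluster 2) and uses hard-coded sample values in place of the `kubectl` calls. If more than two clusters share the host, the owner IDs of every live cluster need excluding, not only the local one.

```shell
# Sample values copied from the output above; in practice, capture them with
# the kubectl commands from steps 1 and 2.
local_owner="49qn0wp7"                     # assumed to be this cluster's ownerID
domain_owners='["26aacm1z","49qn0wp7"]'    # .status.domainOwners

# Strip the JSON brackets, quotes and spaces, then put one owner ID per line.
all_owners=$(printf '%s' "$domain_owners" | tr -d '[]" ' | tr ',' '\n')

# Keep every owner ID that is not an exact match for the local one.
candidates=$(printf '%s\n' "$all_owners" | grep -v -x -F "$local_owner")

echo "$candidates"
```

Each ID left in `candidates` is a possible orphan owner to use as `ownerID` in the next step.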

3) Create a placeholder DNSRecord with a non-active ownerID

For each owner ID returned that is not the owner ID of the record we retrieved earlier, and whose records we want to remove, we need to create a DNSRecord resource and then delete it. This triggers the running operator in this cluster to clean up those records.

```
# An owner ID from domainOwners that is **not** the ownerID of the existing DNSRecord on this cluster
export ownerID=26aacm1z
export rootHost=$(kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.spec.rootHost}')
# The namespace that holds the AWS credentials
export targetNS=kuadrant-system
kubectl apply -f - <<EOF
apiVersion: kuadrant.io/v1alpha1
kind: DNSRecord
metadata:
  name: delete-old-loadbalanced-dnsrecord
  namespace: ${targetNS}
spec:
  providerRef:
    name: my-aws-credentials
  ownerID: ${ownerID}
  rootHost: ${rootHost}
  endpoints:
    - dnsName: ${rootHost}
      recordTTL: 60
      recordType: CNAME
      targets:
        - klb.doesnt-exist.${rootHost}
EOF
```

4) Delete the DNSRecord

```
kubectl delete dnsrecord.kuadrant.io delete-old-loadbalanced-dnsrecord -n ${targetNS}
```
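When several stale owner IDs need cleaning up, steps 3 and 4 repeat once per ID. The sketch below is one way to generate the placeholder manifest per owner; `render_placeholder` is a hypothetical helper, the per-owner name suffix is an assumption made so that the resources do not collide, and the host and namespace are sample values. The `kubectl` calls are left as comments so the block only prints what it would apply.

```shell
# Hypothetical helper: print the placeholder DNSRecord manifest for one owner.
# Arguments: owner ID, root host, namespace holding the provider credentials.
render_placeholder() {
  cat <<EOF
apiVersion: kuadrant.io/v1alpha1
kind: DNSRecord
metadata:
  name: delete-old-loadbalanced-dnsrecord-$1
  namespace: $3
spec:
  providerRef:
    name: my-aws-credentials
  ownerID: $1
  rootHost: $2
  endpoints:
    - dnsName: $2
      recordTTL: 60
      recordType: CNAME
      targets:
        - klb.doesnt-exist.$2
EOF
}

# Sample owner list; replace with the stale owner IDs identified in step 2.
for owner in 26aacm1z; do
  manifest=$(render_placeholder "$owner" apps.example.com kuadrant-system)
  printf '%s\n' "$manifest"
  # To act on it, pipe the manifest to kubectl and then delete the resource:
  # printf '%s\n' "$manifest" | kubectl apply -f -
  # kubectl delete dnsrecord.kuadrant.io "delete-old-loadbalanced-dnsrecord-$owner" -n kuadrant-system
done
```

Printing the manifest first makes it possible to review the owner ID and root host before anything is applied to the cluster.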

5) Verify

We can verify by checking the DNSRecord again. Note that it may take several minutes for the other record to update. We can force an update by adding a label to the record:

```
kubectl label dnsrecord.kuadrant.io somerecord test=test -n my-gateway-ns
kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.status.domainOwners}'
```

We should also see the alert eventually stop firing.
1 change: 1 addition & 0 deletions examples/alerts/kustomization.yaml
@@ -5,3 +5,4 @@ resources:
- prometheusrules_policies_missing.yaml
- slo-availability.yaml
- slo-latency.yaml
- orphan_records.yaml
22 changes: 22 additions & 0 deletions examples/alerts/orphan_records.yaml
@@ -0,0 +1,22 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dns-records-rules
  namespace: monitoring
spec:
  groups:
    - name: dns_records
      rules:
        - alert: PossibleOrphanedDNSRecords
          expr: |
            sum by(rootDomain) (
              count by(rootDomain) (kuadrant_dnsrecord_status_root_domain_owners) /
              count by(rootDomain) (kuadrant_dnsrecord_status) -
              count by(rootDomain) (kuadrant_dnsrecord_status)
            ) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "The number of DNS Owners is greater than the number of records for root domain '{{ $labels.rootDomain }}'"
            description: "This alert fires if the number of owners (controller collaborating on a record set) is greater than the number of records. This may mean a record has been left behind in the provider due to a failed delete"
