-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dns: introduce CRD DNS architecture and DNS status reporting #251
dns: introduce CRD DNS architecture and DNS status reporting #251
Conversation
Here are some status examples: apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
name: default
namespace: openshift-ingress-operator
spec: {}
status:
availableReplicas: 2
conditions:
- lastTransitionTime: "2019-06-13T15:26:46Z"
status: "True"
type: Available
- lastTransitionTime: "2019-06-13T15:26:16Z"
message: The endpoint publishing strategy supports a managed load balancer
reason: WantedByEndpointPublishingStrategy
status: "True"
type: LoadBalancerManaged
- lastTransitionTime: "2019-06-13T15:26:19Z"
message: The LoadBalancer service is provisioned
reason: LoadBalancerProvisioned
status: "True"
type: LoadBalancerReady
- lastTransitionTime: "2019-06-13T15:26:16Z"
message: DNS management is supported and zones are specified in the cluster DNS
config.
reason: Normal
status: "True"
type: DNSManaged
- lastTransitionTime: "2019-06-13T17:29:32Z"
message: The record is provisioned in all reported zones.
reason: NoFailedZones
status: "True"
type: DNSReady
domain: apps.dmace.devcluster.openshift.com
endpointPublishingStrategy:
type: LoadBalancerService
selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
apiVersion: ingress.operator.openshift.io/v1
kind: DNSRecord
metadata:
creationTimestamp: "2019-06-13T16:00:59Z"
finalizers:
- operator.openshift.io/ingress
generation: 1
labels:
ingresscontroller.operator.openshift.io/owning-ingresscontroller: default
name: default-wildcard
namespace: openshift-ingress-operator
ownerReferences:
- apiVersion: operator.openshift.io/v1
controller: true
kind: IngressController
name: default
uid: 953bfbd0-8def-11e9-b5d6-02915d60e384
resourceVersion: "58271"
selfLink: /apis/ingress.operator.openshift.io/v1/namespaces/openshift-ingress-operator/dnsrecords/default-wildcard
uid: 6fdb10e0-8df4-11e9-b5d6-02915d60e384
spec:
dnsName: '*.apps.dmace.devcluster.openshift.com'
recordType: ALIAS
targets:
- a95dad0368def11e9b5d602915d60e38-93147362.us-east-2.elb.amazonaws.com
status:
zones:
- dnsZone:
tags:
Name: dmace-rl775-int
kubernetes.io/cluster/dmace-rl775: owned
- dnsZone:
id: Z3URY6TWQ91KVV |
Here's what things look like when dnsrecord status doesn't report any zones: apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
creationTimestamp: "2019-06-13T15:26:15Z"
finalizers:
- ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
generation: 1
name: default
namespace: openshift-ingress-operator
resourceVersion: "58272"
selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/default
uid: 953bfbd0-8def-11e9-b5d6-02915d60e384
spec: {}
status:
availableReplicas: 2
conditions:
- lastTransitionTime: "2019-06-13T15:26:46Z"
status: "True"
type: Available
- lastTransitionTime: "2019-06-13T15:26:16Z"
message: The endpoint publishing strategy supports a managed load balancer
reason: WantedByEndpointPublishingStrategy
status: "True"
type: LoadBalancerManaged
- lastTransitionTime: "2019-06-13T15:26:19Z"
message: The LoadBalancer service is provisioned
reason: LoadBalancerProvisioned
status: "True"
type: LoadBalancerReady
- lastTransitionTime: "2019-06-13T15:26:16Z"
message: DNS management is supported and zones are specified in the cluster DNS
config.
reason: Normal
status: "True"
type: DNSManaged
- lastTransitionTime: "2019-06-13T15:26:19Z"
message: The record isn't present in any zones.
reason: NoZones
status: "False"
type: DNSReady
domain: apps.dmace.devcluster.openshift.com
endpointPublishingStrategy:
type: LoadBalancerService
selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
apiVersion: ingress.operator.openshift.io/v1
kind: DNSRecord
metadata:
creationTimestamp: "2019-06-13T16:00:59Z"
finalizers:
- operator.openshift.io/ingress
generation: 1
labels:
ingresscontroller.operator.openshift.io/owning-ingresscontroller: default
name: default-wildcard
namespace: openshift-ingress-operator
ownerReferences:
- apiVersion: operator.openshift.io/v1
controller: true
kind: IngressController
name: default
uid: 953bfbd0-8def-11e9-b5d6-02915d60e384
resourceVersion: "39161"
selfLink: /apis/ingress.operator.openshift.io/v1/namespaces/openshift-ingress-operator/dnsrecords/default-wildcard
uid: 6fdb10e0-8df4-11e9-b5d6-02915d60e384
spec:
dnsName: '*.apps.dmace.devcluster.openshift.com'
recordType: ALIAS
targets:
- a95dad0368def11e9b5d602915d60e38-93147362.us-east-2.elb.amazonaws.com
status: {} |
|
||
// If the DNS record was deleted, clean up and return. | ||
if record.DeletionTimestamp != nil { | ||
for i := range zones { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we look at record.Status.Zones instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
} | ||
if len(errs) == 0 { | ||
updated := record.DeepCopy() | ||
if slice.ContainsString(updated.Finalizers, "operator.openshift.io/ingress") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a constant for the annotation
}) | ||
} | ||
|
||
// TODO: only update if status changed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implement this and the AWS client caching can be deleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved the TODO but don't intend to fix now because:
- It's a bit of work to get right
- It involves refactoring the DNS provider implementations
- We're already better off than before because we now only call ensure on creation (so once during startup, and once during ingresscontroller creation), so fixing the last problem here has low ROI for this particular PR (which is already huge)
if dnsConfig.Spec.PublicZone == nil && dnsConfig.Spec.PrivateZone == nil { | ||
return []operatorv1.OperatorCondition{ | ||
operatorv1.OperatorCondition{ | ||
Type: operatorv1.DNSManagedIngressConditionType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the e2e test, if this condition is true, wait for DNS on the ingress status to be ready
The CI failure was caused by https://bugzilla.redhat.com/show_bug.cgi?id=1720678. It should be fixed shortly. |
I think this is actually ready for review |
03afe43
to
6d76a99
Compare
// The targets the DNS record points to | ||
Targets []string `json:"targets,omitempty"` | ||
// RecordType type of record, e.g. CNAME, A, SRV, TXT etc | ||
RecordType string `json:"recordType,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use a new string type and define named constants for the record types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed and switched A to CNAME
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You added a type definition for DNSRecordType
but continue to use string
instead of using DNSRecordType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually intentional for now due to a random issue I was having with generators that will take some time to run down — I need to do a followup here for deepcopy scripts and cleaning up these types.
Also file under: should this be a "v1" API? It's not yet set in stone.
pkg/api/v1/types.go
Outdated
// +kubebuilder:object:root=true | ||
// +kubebuilder:subresource:status | ||
|
||
// Record represents a DNS record. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Record
→DNSRecord
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
return fmt.Errorf("missing alias record") | ||
} | ||
domain, target := alias.Domain, alias.Target | ||
domain, target := record.Spec.DNSName, record.Spec.Targets[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a "TODO" not to ignore all but the first target.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -76,7 +79,7 @@ func (m *manager) Ensure(record *dns.Record) error { | |||
context.TODO(), | |||
*targetZone, | |||
client.ARecord{ | |||
Address: record.ARecord.Address, | |||
Address: record.Spec.Targets[0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a "TODO" not to ignore all but the first target.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@@ -102,7 +105,7 @@ func (m *manager) Delete(record *dns.Record) error { | |||
context.TODO(), | |||
*targetZone, | |||
client.ARecord{ | |||
Address: record.ARecord.Address, | |||
Address: record.Spec.Targets[0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, this needs a "TODO".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
return fmt.Errorf("failed to get current wildcard dnsrecord: %v", err) | ||
} else { | ||
if record != nil { | ||
log.V(1).Info("waiting for wildcard dnsrecord to be deleted", "dnsrecord", record.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Elsewhere we pass record
in its entirety to log.Info
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Domain: domain, | ||
Target: target, | ||
}, | ||
desired := desiredWildcardRecord(ic, service) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could service.Status.LoadBalancer.Ingress
change from non-empty to empty, and should we handle that case? Probably that would only happen in case of an intermittent outage, in which case it leaving the old record might be preferable. In any case, I'd like us to make a deliberate choice, and document it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a TODO with these considerations. Not going to be able to solve it here yet.
if err := r.client.Create(context.TODO(), desired); err != nil { | ||
return nil, fmt.Errorf("failed to create dnsrecord %s/%s: %v", desired.Namespace, desired.Name, err) | ||
} | ||
log.Info("created dnsrecord", "namespace", desired.Namespace, "name", desired.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we log desired
in its entirety?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
var recordType string | ||
|
||
if len(ingress.Hostname) > 0 { | ||
recordType = "ALIAS" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ALIAS" is AWS-specific. Let's document that and think about where the logic to choose between "ALIAS" or "CNAME" belongs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments
} | ||
ingress := &operatorv1.IngressController{} | ||
if err := r.cache.Get(context.TODO(), types.NamespacedName{Namespace: record.Namespace, Name: ingressName}, ingress); err != nil { | ||
return reconcile.Result{}, fmt.Errorf("failed to get dnsrecord for record %s: %v", record.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a deliberate design choice not to handle orphaned DNSRecord
resources?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know that this state implies the dnsrecord
itself an orphan. It usually means the resource was deleted and due to timing we have a benign leftover entry to flush from the controller queue.
Also, how would the resource have been deleted from storage without having passed through finalization in delete()
? So I'm also not sure there's any orphan in the cloud provider, either.
Help me understand the concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a possible state that we need to consider: a DNSRecord
exists with no corresponding IngressController
. If you want to say that we will do nothing to try to recover from this state because the only ways to get there involve, say, etcd corruption or user tinkering, then that is fine, but I want to make sure we don't simply ignore the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so I tried to split the difference for now. I handle not found by logging a specific message, added a TODO, and made the retry far less aggressive in this case. For other errors, I retry somewhat more aggressively since those are likely transient.
@Miciah PTAL. I think all feedback is addressed |
@ironcladlou here is the status of the default wildcard
Do you feel this |
@ironcladlou I also see the operator continuing to reconcile the default
|
It appears #251 (comment) and #251 (comment) are related. Reconciliation for the default |
@ironcladlou after deleting the default |
if err != nil { | ||
errs = append(errs, err) | ||
} else { | ||
log.Info("provider deleted record %s in zone %v", record.Name, zone) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe L154 should be something like:
log.Info("provider deleted record", "name", record.Name, "zone", zone)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@ironcladlou Here is the default-wildcard
A couple things come to mind. First, the spec indicates a |
// | ||
// TODO: If .status.loadbalancer.ingress is processed once as non-empty and then | ||
// later becomes empty, what should we do? Currently we'll treat it as an intent | ||
// to not have a desired record. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
desiredWildcardRecord
will return nil, but ensureWildcardDNSRecord
will not delete the current record if desired
is nil, which is what I was getting at when I suggested adding a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete is deliberately not handled during ensure in this case, so that part... I think mutations of .status.ingress ingress are a larger topic at this point. Unless we compute desired from a persisted "last observed" state I think there are probably more than one of these edge cases we don't coherently account for...
@ironcladlou disregard #251 (comment). I no longer see the records in Route53 after deleting the default |
Agree — not sure yet how to resolve it in the abstract. The fact we're using an A+Alias in Route53 is a provider detail, and one we aren't exposing any configuration around, so maybe it's a doc gap in the meantime? I wonder if this api should be in an alpha or beta group or something?
Agree with this, but would like to address it in a followup if that's okay since we don't actually need it yet. |
} | ||
ingress := &operatorv1.IngressController{} | ||
if err := r.cache.Get(context.TODO(), types.NamespacedName{Namespace: record.Namespace, Name: ingressName}, ingress); err != nil { | ||
return reconcile.Result{}, fmt.Errorf("failed to get dnsrecord for record %s: %v", record.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A separate problem I failed to notice before: failed to get dnsrecord
should be failed to get ingresscontroller
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
76a5c4b
to
32eb66a
Compare
if len(ingress.Hostname) > 0 { | ||
recordType = iov1.CNAMERecordType | ||
target = ingress.Hostname | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the service have .Status.LoadBalancer.Ingress[0]
with empty .Hostname
and .IP
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, there are a lot of unresolved Q's about that status field and our handling of them... but the environments we support are predictable enough to support our current assumptions, so I'm not really focused on solving them in this PR
} | ||
} | ||
|
||
func (r *reconciler) currentWilcardDNSRecord(ic *operatorv1.IngressController) (*iov1.DNSRecord, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currentWilcardDNSRecord
→currentWildcardDNSRecord
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Anything substantial left blocking merge? |
Are we possibly going to want to extricate |
Looks like the ingresscontroller deployment's pods weren't being scheduled, check out the scheduler events complaining about affinity rules... very suspicious |
Maybe? Pretty far out in any case |
Going to retest to see if the scheduling issue is a flake — given the unexplained co-location I observed on my new cluster yesterday I'm getting nervous something is wrong there. |
/retest |
Also, the operator's container is inexplicably killed at 2019-06-18T21:51:02Z. |
Actually, it's the operand's container that is getting killed. I think I'm confused because Prow is doubling the test output and injecting misleading timestamps. |
Okay, looks like a flake after all. Seems unlikely to be related to anything going on in this PR. |
// +kubebuilder:object:root=true | ||
// +kubebuilder:subresource:status | ||
|
||
// DNSRecord represents a DNS record. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about a brief statement such as the following? "This resource kind is intended for internal use by the operator only."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would like to run this entire thing by API folks before code freeze. I'm not even sure if a v1 version is appropriate. We can tweak it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add something about how it is unstable too. The idea is that anyone looking at the resource using oc explain
or OpenShift Console would at least have some idea of what the thing is and know not to mess with it.
failedZones := []configv1.DNSZone{} | ||
for _, zone := range wildcardRecord.Status.Zones { | ||
for _, cond := range zone.Conditions { | ||
if cond.Type == iov1.DNSRecordFailedConditionType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need the following: && cond.Status == string(operatorv1.ConditionTrue)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
conditions := []iov1.DNSZoneCondition{} | ||
err := r.dnsProvider.Ensure(record, zone) | ||
if err != nil { | ||
conditions = append(conditions, iov1.DNSZoneCondition{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't we decide to set Failed=False
when err
is nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we explicitly decided. If we publish True when there's a problem, does that imply we must publish False when there's not a problem? There's no internal reason we need the condition when things are normal, not sure what the conventions are here. We can change in a followup.
return nil, fmt.Errorf("failed to create dnsrecord %s/%s: %v", desired.Namespace, desired.Name, err) | ||
} | ||
log.Info("created dnsrecord", "dnsrecord", desired) | ||
return desired, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure we can do without update handling. I would like to test the following on AWS:
- Scale the default ingress controller to 1.
- Look up the AZ of the sole ingress controller pod.
- Kill ingress controller pods till one appears in a different AZ, which should cause the ELB's DNS name to change.
- Check whether the wildcard DNS name still resolves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I performed the test, and the ELB's DNS name does not change; only the address to which it resolves changes, and because we use an alias on AWS, the wildcard DNS record likewise changes over to the new address without requiring an update by the operator.
}, | ||
{ | ||
description: "private only A", | ||
description: "no ingresses", | ||
domain: "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case should have a non-empty value for domain
because an empty value also causes a nil result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
}, | ||
{ | ||
description: "global ALIAS", | ||
description: "hostname to ALIAS record", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ALIAS
→CNAME
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
recorder record.EventRecorder | ||
operatorNamespace string | ||
dnsProvider dns.Provider | ||
dnsConfig *configv1.DNS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Half of these fields are not used:
recorder
is not used (but maybe it should be).operatorNamespace
is not used and probably should be deleted.dnsConfig
is not used and should be deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
/retest |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ironcladlou, Miciah The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@ironcladlou qq if you don't mind - is this going to work for UPI too on non-AWS env? |
The managed wildcard DNS record remains supported today only on AWS and Azure IPI. This change is a functionally transparent internal refactor and doesn't imply any change in our DNS management capabilities as expressed by the public API. |
Introduce s dnsrecord.ingress.operator.openshift.io resource type to represent
DNS records in a Kube-native API.
Refactor ingress controller to manage dnsrecords rather than inferring DNS
record requirements indirectly through services.
Teach the ingresscontroller status computer about dnsrecords so DNS management
status can surface through all operator CRDs in a way consistent with other
resources (e.g. services).
Introduce a DNS controller which syncs dnsrecords with the cluster's
configured DNS zones, reporting status on the dnsrecord.