KEP: [sig-cluster-lifecycle] addons via operators #746
KEP: Addons via Operators

---
kep-number: 35
title: Addons via Operators
authors:
  - "@justinsb"
owning-sig: sig-cluster-lifecycle
reviewers:
> **Review thread:**
> * We need some reviewers / approvers before we commit this. Maybe something to discuss tomorrow?
> * If this has not been discussed in the SIG already, it should be. The SIG should look at who the approvers are (the chairs?). The related-sigs field should also be filled in; I would suggest SIG Architecture as one of them.
> * [This was raised to SIG Cluster Lifecycle in today's meeting.]
> * Can someone post a summary of where the SIG is at on this? There's no harm in merging and iterating.
> * We can chat about it during our next call.
> * Good point @mattfarina - I'd like to bring it to sig-architecture when we have something concrete to show them.
> * SIG Arch is working to decentralize decision-making: I suggest discussing in SIG Cluster Lifecycle for now, and inviting interested parties to attend.
  - TBD
approvers:
  - TBD

> **Review comment:** Assign me as both reviewer and approver.

editor: TBD
creation-date: 2019-01-28
last-updated: 2019-01-28
status: provisional
---
# Addons via Operators

## Table of Contents

* [Table of Contents](#table-of-contents)
* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Non-Goals](#non-goals)
* [Proposal](#proposal)
* [Risks and Mitigations](#risks-and-mitigations)
* [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)
* [Infrastructure Needed](#infrastructure-needed)

## Summary
We propose to use operators for managing cluster addons. Each addon will have

> **Review thread:**
> * One thing I struggle with is: do we need a whole CRD for every addon, or is kustomize + a default bundle of YAML sufficient? I'm guessing this would depend on the addon.
> * CRDs are supposed to be cheap, so I'd say we should have one for every addon - we do expect there will be per-addon settings, and a CRD-per-addon lets us actually use all this machinery we've invested so much in. The hope is that we can make developing the controllers as easy as kustomize + a default bundle of YAML, though! I'll add a comment to this effect.
> * Does this mean we'll have e.g. …?
> * Yes (modulo naming and pluralization rules).
> * Does it seem a bit weird to do that for singleton addons?
> * Not really to my mind - can you clarify why it's weird? I do think we should establish patterns for singletons (e.g. "name the instance default"). But I love the progress made on CRDs and think we should make use of them!
> * We have followed a convention similar to what @justinsb described: the operator basically only respects a single instance of a named cluster-scoped CR, for certain types of operators (not all).
> * A naming pattern like …
> * Maybe I'm missing something here, but doesn't having a different CRD type for each addon make it more difficult to create tools for generic operations, like listing all addons with a given status, or watching for the installation of addons? What approaches should be used in those cases?
its own CRD, and users will be able to perform limited tailoring of the addon
(install/don’t install, choose version, primary feature selection) by modifying
the CR. The operator encodes any special logic (e.g. dependencies) needed to
install the addon.

> **Review thread:**
> * the CRD.
> * Changing to "the instance of the CRD".
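As a concrete illustration of the per-addon CRD model described above, a cluster admin might tailor an addon through a single custom resource. The API group, kind, and field names below are hypothetical sketches of the convention, not a settled API:

```yaml
# Hypothetical example: tailoring the CoreDNS addon via its CR.
# The group/kind and field names are illustrative only.
apiVersion: addons.k8s.io/v1alpha1
kind: CoreDNS
metadata:
  name: default          # one suggested convention for singleton addons
  namespace: kube-system
spec:
  version: 1.3.1         # pin a specific version...
  # channel: stable      # ...or track a stream of versions instead
  corefile: |            # limited, addon-specific tailoring
    .:53 {
        forward . /etc/resolv.conf
    }
```

The operator watching this CR would be responsible for reconciling the deployed manifests to match it.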
We will create tooling to make it easy to build addon operators that follow the
best practices we identify as part of this work. For example, we expect that
most addons will be declarative, and likely be specified as part of a “cluster
bundle”, so we will make it easy to build basic addon operators that follow
these patterns.

> **Review comment:** Should we just upstream this work into controller-runtime so that both kubebuilder and OperatorKit can leverage it? Or do you have more specific work in mind, i.e. something like AddOn(builder/kit)?

We hope that components will choose to maintain their own operators, encoding
their knowledge of how best to operate their addon.

> **Review thread:**
> * Is the scope to also have managed k8s providers like GKE/EKS move to this addon model? It's currently a pain to not be able to easily modify the addons provided by default on the various providers. But at the same time, will a CRD provide enough configurability, or do we just need a way to expose all these addons so that consumers can modify them? Everyone has their own unique requirements on what part of each addon they want to modify.
> * It's not a requirement for anyone to move to the model, but we would like to create something that tooling chooses to move to. Figuring out ways to support a balance of modification and control to enable that is in scope.
## Motivation

Addons are components that are managed alongside the lifecycle of the cluster.
They are often tied to or dependent on the configuration of other cluster
components. Management of these components has proved complicated. Our
existing solution in the form of the bash addon-manager has many known
shortcomings and is not widely adopted. As we focus more development outside of
the kubernetes/kubernetes repo, we expect more addon components of greater
complexity. This is one of the long-standing backlog items for
sig-cluster-lifecycle.

> **Review comment:** I like the idea of having an addon manager replace the bash scripts. I think it generally improves our cluster turn-up story for users of OSS Kubernetes who do not have the benefit of leveraging a cloud-provider service for their cluster management.

Use of operators is now generally accepted, and the benefits to other
applications are generally recognized. We aim to bring the benefits of
operators to addons also.
### Goals

> **Review thread:**
> * We've been working on Operator Lifecycle Manager for a couple of years, and it shares a lot of these same goals: … There might be some places where OLM isn't perfectly aligned with what is needed for managing addons - but it seems pretty close, and we'd love to jumpstart this work by contributing what we've already done and iterating on it to meet the community's needs. As a first step, we could try to prototype installing today's addons with OLM and see if there are any gaps?
> * Of course - we'd welcome your collaboration!

* Explore the use of operators for managing addons
* Create patterns, libraries & tooling so that addons are of high quality,
  consistent in their API surface (common fields on CRDs, use of the Application
  CRD, consistent labeling of created resources), yet are easy to build.
* Build addons for the basic set of components, acting as a quality reference
  implementation suitable for production use. We aim also to demonstrate the
  utility and explore any challenges, and to verify that the tooling does make
  addon development easy.

> **Review thread (bundles):**
> * How do bundles fit under these goals?
> * Ideally each addon operator references a set of manifests somewhere, rather than baking them into the image, so we don't always have to update the operator (just when the underlying change is non-trivial). Those manifests could be in any format, really - they could be simple YAML files - though I do think we should recommend a format. If we actually pick bundles as our format, this gives us a bit of metadata when pulling from https, and because bundles are CRDs, it also gives us the ability to fetch those manifests from the cluster rather than relying on some https server, i.e. it's "half" of airgap support.
> * Doesn't that mean that the manifests themselves are the artifacts and the image should be embedded in them?
> * Please not files on a disk that have to be rsync'ed.
> * Not sure I follow @smarterclayton - manifests do specify an image name, but I'm missing where you're going with that... @bgrant0607 I don't think we would recommend files on a disk for any production configuration :-)

> **Review thread (API surface):**
> * I'd add another goal here: … Or said another way: … The possible APIs for Kube-like components are unbounded and cannot be scoped, so the more API surface we have here, the greater the burden on the project.
> * @liggitt, because we discussed this recently.
> * These are CRDs, though - they aren't all going to be k8s APIs. We would prefer homogeneity, so we would welcome any input from you and @liggitt on what a "recommended" API should look like. We have a prototype where we defined a few common fields, which feels right - better to have version always at …
> * These are good points. I think they also apply elsewhere, such as the component-config effort, exported metrics, Event contents, component log messages, and kubectl commands. We're accruing operational surface area over time. I think the only way that can be addressed is by enabling the technical oversight of these areas to be distributed, such as by mentoring/shadowing, documenting policies and conventions, writing linters and compatibility tests, etc.
### Non-Goals

* We do not intend to mandate that all installation tools use addon operators;
  installation tools are free to choose their own path.
* Management of non-addons is out of scope (for example, installation of end-user
  applications, or of packaged software that is not an addon).
## Proposal

This is the current plan of action; it is based on experience gathered and work
done for Google’s GKE On-Prem product. However, we don’t expect this will
necessarily be directly applicable in the OSS world, and we are open to change as
we discover new requirements.

> **Review thread:**
> * OpenShift has been exploring this space as well.
> * Is there a summary of your learnings written down?
* Extend kubebuilder & controller-runtime to make it easy to build operators for
  addons

> **Review comment:** This is one of the project goals for the Operator SDK: bootstrapping new operator projects so that they are set up to follow best practices. The SDK is based on the controller-runtime project.

* Build operators for the primary addons currently in the cluster/ directory

> **Review thread:**
> * Could we get more concrete here and enumerate the exact ones?
> * I put in a proposed list: CoreDNS & kube-proxy (needed for conformance), dashboard (demos well), metrics-server (non-trivial), and LocalDNS-Agent (useful). I'm not sure whether you would rather we were more or less ambitious here @timothysc - happy to go either way.

* Plug those addon operators into kube-up / cluster-api / kubeadm / kops /
  others (subject to those projects being interested)
* Develop at least one addon operator outside of kubernetes/kubernetes
  (LocalDNS-Cache?) and figure out how it can be used despite being out-of-tree
* Investigate use of webhooks to prevent accidental mutation of child objects
* Investigate the RBAC story for addons - currently the operator must itself
  have all the permissions that the addon needs, which is not really
  least-privilege. But it is not clear how to side-step this, nor that any of
  the alternatives would be better or more secure.
* Investigate use of patching mechanisms (as seen in `kubectl patch` and
  `kustomize`) to support advanced tailoring of addons. The goal here is to
  make sure that everyone can use the addon operators, even if they “love it but
  just need to change one thing”. This ensures that the addon operators
  themselves can remain bounded in scope and complexity.

> **Review thread (patching):**
> * Patching might make upgrades tricky.
> * Could you please clarify why?
> * If the underlying structure changes dramatically, the patches won't apply.
> * Added a comment about this, noting that we'll need some form of error state, and that a good addon won't change structure just for fun!
> * Is this referring to patching the addons that the operator is managing (the operand), like patching a Deployment? In this context, patching opens the door to changes that could make the safety of upgrades unpredictable for the operator. Can they patch env vars, attached volumes, or command-line arguments for the binary being run in the container? Will you limit what they can / can't patch? The CRD is the safe API contract. Once the admin strays outside of that and starts patching the generated resources, the operator can only offer a best-effort attempt to upgrade. What if a command-line flag on the binary needs to be deprecated or changed? This isn't out of the question and has happened before with kube binaries. If the operator is in control and the feature is defined at the API level, then rollout of the flag change can be done smoothly by the operator. If an admin has instead patched the Deployment to use flags that the operator isn't managing, then when the new version tries to roll out and the patch is reapplied with an invalid flag, the new pods are going to crashloop. This wouldn't have been caught at the patching step - the patch itself was not invalid - and it wouldn't be caught until the middle of the upgrade, when things start to fail.
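To make the patching idea above concrete, an addon CR might carry a small strategic-merge patch that the operator applies to the manifests it generates. The `spec.patches` field and the CRD shape below are hypothetical, illustrating the "love it but just need to change one thing" scenario:

```yaml
# Hypothetical sketch: advanced tailoring via an inline patch.
# The addon kind and the spec.patches field are illustrative only.
apiVersion: addons.k8s.io/v1alpha1
kind: CoreDNS
metadata:
  name: default
  namespace: kube-system
spec:
  version: 1.3.1
  patches:
    # Strategic-merge patch applied to the generated Deployment,
    # e.g. to raise a memory limit without forking the addon.
    - apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: coredns
        namespace: kube-system
      spec:
        template:
          spec:
            containers:
              - name: coredns
                resources:
                  limits:
                    memory: 256Mi
```

As the review thread notes, such a patch applies cleanly only while the generated resources keep a compatible structure, so the operator would need an error state for patches that no longer apply.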
We expect the following functionality to be common to all operators for addons:

* A CRD per addon

> **Review thread:**
> * I'd really like to see example(s) for common core ones, e.g. proxy.
> * Yes, some concrete examples would help a lot. I think there are a few classes of add-ons, and each class has certain characteristics that are shared within the class. One type of add-on comprises the components of Kubernetes that are strongly coupled with Kubernetes releases, such as kube-dns/CoreDNS and kube-proxy; besides version coupling, you normally wouldn't want to run multiples of these at the same time. A networking component is usually less strongly coupled to the Kubernetes version, but it is more critical to cluster functionality, and there is no standard way of running multiple different networks. Secondly, there are add-ons that provide functionality that is critical to a user who runs workloads on the given cluster; ingress controllers and storage drivers fall into this category, and some of these can be seen as operators. There is also a class of add-ons that extends functionality even further, often a combination of components which may or may not be defined as standalone add-ons from one of the above categories; Istio and Knative fall into this category. Additionally, there is a class of workloads that depends on versions of Kubernetes but doesn't provide additional functionality to the cluster directly; observability products, dashboards, deployment tools, and things like security scanners fall into this category.
> * Added an example. @errordeveloper, you raise some good categorizations - some of these are going to be "apps". For example, the nginx-ingress-controller isn't really tied to the k8s cluster lifecycle; we would expect those to be deployed using tools like helm or kustomize or kubectl. I do think we should figure out what to do about singleton objects - I've been calling them "default" in "kube-system", but there are many approaches and even I only passionately hate about half of them ;-) I don't really know how to handle "pick one of a set" type operators (i.e. CNI). Operators do let us have better cleanup procedures, but realistically it's highly disruptive to switch: kops started labelling the addons so that we could prune the other ones, but most CNI providers leave some networking plumbing behind (e.g. iptables). We can't easily do a rolling update in most cases because the network partitions between old-CNI and new-CNI, and there are usually supporting infrastructure tasks, e.g. opening firewall rules for IPIP for Calico. My preference would be that we set up the operators to do all they can, but realistically, if we get a nice migration path for any CNI providers, it'll be because the tooling put in special code to handle a particular transition. We at least have somewhere to put that code now!
> * The AWS ALB ingress controller would only ever work in AWS and needs IAM permissions, so it's tied to a few properties. When it comes to lifecycle and Cluster API, provisioning of such IAM permissions becomes a lifecycle concern. GKE has a built-in ingress controller, and that's currently something a GKE user selects as an add-on; it is managed by the add-on manager, is tied into lifecycle as it gets included during cluster creation, and cannot be removed easily as far as I know. Also, Heptio Contour ships a CRD, and that has version dependencies. I agree that neither of these is tied into cluster lifecycle, but they have dependencies on cluster APIs and provide vital functionality for many users. Even if we say that some things like ingress controllers are "apps with deep Kubernetes integration", in some cases with dependencies on cloud resources associated with the cluster, how do we expect users to make this distinction and find the right way of managing such things?
> * Not the case with Weave Net, and I don't know when it is practically the case; I would expect any well-designed CNI addon to behave well. The firewall-rule case seems similar to the IAM dependencies problem.
> * You're right that IAM provisioning likely brings it into cluster scope; I don't think an ingress controller or something like kube2iam is necessarily a deal-breaker there. I would like to avoid "deploying a CRD => cluster-lifecycle". I know there are versioning concerns, but I would hope we could come up with some reasonable patterns such that in practice helm or something could manage an app that includes CRDs; i.e. although technically there are problems with incompatible CRDs, by being careful about versions we can avoid them. For users, I think we would expect their installation tooling (kubeadm, cluster-api, etc.) to include or reference a small set of cluster addons, and for users to follow that lead. And I expect kubernetes component developers will naturally identify whether they are really a cluster addon or an app, and there will be natural pushback from the installation tooling if they choose cluster addon when they are really an app. If the CNI providers are in fact capable of being switched, that's much easier! We can probably start with simple docs for how to turn on Weave and turn off Calico, and I'm sure Weave will write those docs, just as I'm sure Calico will write the doc in the other direction!
> * Who reviews and approves the CRD APIs for these operators? I.e., if kube owns an "official DNS installation API", do the core Kubernetes API approvers have veto power over what features DNS can support? How do we determine what features are allowed to be added to the "official DNS installation API"? We have a lot of experience with the shortcomings of domain-specific APIs like ingress, service load balancer, and storage, where multiple different implementations can have wildly different feature sets. What happens when the N+1th feature addition to the DNS API gets NACKed by the maintainer of the "official DNS installation API"? Does someone create an "unofficial DNS installation API"? Do we support random extensions? Do we take a very hard line and encourage people to fork and write their own operators? I like the idea of a simple DNS config API; I don't like the idea of the DNS config API becoming just another ingress-API problem, and I don't see how these APIs won't fall into that trap.
> * An alternate approach would be: … Since operators are controllers, the best operators will be a simple deployment, RBAC rules, and maybe a secret. Do we need to go beyond that?
> * Do we need to distinguish the types of addons? Several of the examples, such as kube-proxy and metrics-server (https://github.com/kubernetes-incubator/metrics-server), are components developed as part of the Kubernetes project. Is there a reason why it wouldn't make sense for the owners of those components to also own the operator CRDs for those components? I'd assume SIG Network would own the DNS-related operators. On the simple DNS API: offhand, I'd split the common parameters into an Ingress-like implementation-independent API (or just additional fields on some object[s]), and leave implementation-specific details in the operator CRDs. I haven't seriously looked at what such a split would look like, but candidate fields include those used to generate the default /etc/resolv.conf for containers and those documented as part of kube-dns and CoreDNS configuration.
> * Do we think we have an "official DNS installation API" today?
* Common fields in spec that define the version and/or channel

> **Review thread:**
> * What is a channel? Maintained and supported by the community?
> * A channel is what we're calling a stream of versions. Better names welcome, but I think channel is a good name that wise people wearing red hats use... I think it would be up to a project, e.g. CoreDNS, whether to support a channel for their operator and how they would define it. But again, suggestions & recommendations are welcome.

* Common fields in status that expose the current health & version information
  of the addon
* Addons follow a common structure, with the CR as root object, an Application
  CR, and consistent labels on all objects
* Some form of protection or rapid reconciliation to prevent accidental
  modification of child objects
* Operators are declaratively driven, and can source manifests via https
  (including mirrors), or from data stored in the cluster itself
  (e.g. configmaps or the cluster-bundle CRD, useful for airgapped clusters)
* Operators are able to expose different update behaviours: automatic immediate
  updates; notification of update-available in status; purely manual updates

> **Review thread:**
> * Ahh, automatic updates...
> * I can't say I would recommend it for prod clusters, but totally automated updates of e.g. CoreDNS would be a cool demo!

* Operators are able to observe other CRs to perform basic sequencing
* Addon manifests are able to express an operator minimum version requirement, so
  that an addon with new requirements can require that the operator be updated
  first
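The common spec and status conventions listed above might look something like the following on an addon CR. The field names here are a hypothetical sketch of the convention, not a settled API:

```yaml
# Hypothetical sketch of the common spec/status fields shared by all addon CRs.
apiVersion: addons.k8s.io/v1alpha1
kind: Dashboard
metadata:
  name: default
  namespace: kube-system
spec:
  # Common spec fields: pin a version, or follow a channel of versions.
  version: 1.10.1
  # channel: stable
status:
  # Common status fields: current health and observed version of the addon.
  healthy: true
  version: 1.10.1
```

Keeping these fields in the same place on every addon CRD is what would let generic tooling (e.g. "list all unhealthy addons") work across addons despite each having its own type.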
### Risks and Mitigations

> **Review thread:**
> * I think test and verification for a release is also super important.
> * Good point - added a paragraph specifically for that. I think if we get to parity with the "embedded manifest" approach that we know and love, that'll help initially. And I think we unlock approaches that are a little more flexible than embedding manifests, but I don't think we need to dictate anything there.
This will involve running a large number of new controllers. This will require
more resources; we can mitigate this by combining them into a single binary
(similar to kube-controller-manager).

Automatically updating addons could result in new SPOFs; we can mitigate this
through mirroring (including support for air-gapped mirrors).

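To make the mirroring mitigation concrete, an operator could be pointed at an
in-cluster or mirrored manifest source rather than the default upstream
location. Every field name below is a hypothetical illustration, not a
proposed API:

```yaml
# Hypothetical operator configuration: all field names are illustrative.
# For an air-gapped cluster, manifests could be served from an internal
# mirror or from data stored in the cluster itself (e.g. a configmap).
spec:
  channel: stable
  manifestSource:
    # Option 1: an https mirror inside the air gap
    url: https://mirror.internal.example.com/addons/coredns/
    # Option 2: manifests stored in the cluster itself
    # configMapRef:
    #   name: coredns-manifests
    #   namespace: kube-system
```
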
Providing a good set of addons could result in a monoculture, where mistakes
affect most or all kubernetes clusters (even if we don’t mandate adoption, if
we succeed we hope for widespread adoption). We can continue the strategies we
use for core components such as kube-apiserver: primarily we must keep the
notion of stable vs less-stable releases, to stagger the risk of a bad
rollout. We must also weigh this against the risk that, without coordination,
each piece of tooling must reinvent the wheel; we expect more mistakes (even
measured per cluster) in that scenario.

## Graduation Criteria

We will succeed if addon operators are:

* Used: addon operators are adopted by the majority of cluster installation
tooling
* Useful: users are generally satisfied with the functionality of addon
operators, and are not trying to work around them or making lots of
proposals / PRs to extend them
* Ubiquitous: the majority of components include an operator
* Federated: the components maintain their own operators, encoding their
knowledge of how best to run their addon.

## Implementation History

Addon Operator session given by jrjohnson & justinsb at KubeCon NA - Dec 2018

KEP created - Jan 29 2019

## Infrastructure Needed

Initial development of the tooling can probably take place as part of
kubebuilder.

We should likely create a repo for holding the operators themselves. Eventually
we would hope these would migrate to the various addon components, so we could
also just store these under e.g. cluster-api.

Unclear whether this should be a subproject?