
Design: Dynamic registration of admission controllers and initializers #611

Merged
# Dynamic admission control configuration

## Background

[#132](https://github.com/kubernetes/community/pull/132) proposed making
admission control extensible. In the proposal, the `initializer admission
controller` and the `generic webhook admission controller` are the two
controllers that set default initializers and external admission hooks for
newly created resources. These two admission controllers are in the same binary
as the apiserver. This [section](https://github.com/smarterclayton/community/blob/be132e88f7597ab3927b788a3de6d5ab6de673d2/contributors/design-proposals/admission_control_extension.md#dynamic-configuration)
of #132 gave a preliminary design of the dynamic configuration of the list of
the default admission controls. This document hashes out the implementation
details.

## Goals

* Admins are able to predict what initializers/webhooks will be applied to newly
created objects.

* Do not block the entire cluster if the initializers/webhooks are not ready
after registration.

## Specification

We assume initializers could be "fail open". We need to update #132 if this is
accepted.

The schema is copied from
[#132](https://github.com/kubernetes/community/pull/132) with a few
modifications.

```golang
type AdmissionControlConfiguration struct {
Contributor

I had assumed that we would want to allow components to register initializers independently. I.e. the component that provides "foo" initializer might be completely decoupled from initializer "bar", and therefore want to register in a decoupled way. That however assumes that two components can lazily define "before" and "after".

I think in general, most initializers don't care about before or after, and simply want a chance to fire. For those that do... what should we do?

Member Author

I thought we would never trust initializers to register themselves. I thought it would be a human admin who wrote ACC. A single careless initializer could destroy the carefully ordered list of initializers, right?

Contributor

I'm just thinking that most initializers are going to be like 3rd party controllers. They're independently delivered. If we have one resource, the install steps for those controllers are going to be ugly patch statements. If we have multiple resources, we'll have to declare dependencies.

Member Author

Sure. In the future, will we want to enforce order among webhooks? If not, we can make the webhook part of ACC multiple instances.

Contributor

I think webhook "phases" or "before or after" are inevitable.

Member Author

I think a single ACC is much easier if both initializers and webhooks will have order.

If the initializer and webhook wish to do self registration, they can specify partial order with the latest strategic merge patch syntax.

Contributor

What would that look like?

Member

-1 for using PATCH for installation.

I think we probably eventually want the user to POST all the initializers/webhooks, and then a controller assembles the ordered list. However I think the thing Chao has here will be a lot easier to implement in the remaining time. Having this list here also lets us push off modeling the dependency graph into the future. So I propose to keep it as is for 1.7, even though it means installing these things will be a bit harder.

Member Author

According to https://github.com/kubernetes/community/pull/537/files, it would look like:

resourceInitializers:
  - resource
    $setElementOrder/initializers:
    - existing_initializer_a
    - existing_initializer_b
    - new_initializer
    - existing_initializer_c
    initializers:
    - name: new_initializer
      failurePolicy: "ignore"

Contributor

I'm ok with harder for now.

TypeMeta // although this object could simply be serialized like ComponentConfig
Member Author

I think we still need an ObjectMeta, otherwise we cannot persist the ACC in storage.

To make the ACC a singleton, we can define a canonical name for it and ignore ACCs with other names.

Contributor

We don't do singletons in general. Add object meta, and have validation that only allows one name (for now), which we can relax in the future.


// ResourceInitializers is a list of resources and their default initializers
ResourceInitializers []ResourceDefaultInitializer

ExternalAdmissionHooks []ExternalAdmissionHook
}

// Because the order of initializers matters, and each resource might need a
// different order, the ResourceDefaultInitializers are indexed by Resource.
type ResourceDefaultInitializer struct {
// Resource identifies the type of resource that should be initialized
Resource GroupResource
// Initializers are the default names that will be registered to this resource
Initializers []Initializer
}

type Initializer struct {
// Name is the string that will be registered to the resource that needs
// initialization.
Name string

// **Optional for the alpha implementation**
// FailurePolicy defines what happens if no initializer controller takes
// action. Allowed values are Ignore or Fail. If "Ignore" is set, the
// apiserver removes the initializer from the resource's initializers list
// when the timeout is reached; if "Fail" is set, the apiserver returns a
// timeout error when the timeout is reached.
FailurePolicy FailurePolicyType
Member Author

@smarterclayton do you agree with adding FailurePolicy to initializers? It's not in your original proposal #132.

Contributor

Use case here is "the system doesn't grind to a halt"?

Who in this scenario reaps these? Without this, initializers don't need a controller (or we can avoid it). With it, we must have one.

Member Author

> Use case here is "the system doesn't grind to a halt"?

Yes.

Yes, we'll need a controller. I thought we could let the apiserver do the reaping, but lavalamp pointed out the apiserver could crash.

Member Author

Or we can say initializers always fail closed for 1.7.


// **Optional for the alpha implementation**
// If the timeout is reached, the initializer is removed from the resource's
// initializer list by the apiserver.
// Defaults to XXX seconds.
Timeout *int64
Member

a per-resource per-initializer timeout seems extreme. Better to have a global timeout?

}

type FailurePolicyType string

const (
Ignore FailurePolicyType = "Ignore"
Fail FailurePolicyType = "Fail"
)

type ExternalAdmissionHook struct {
Member Author

@whitlockjc how about this version? It solves Overlapping Criteria Combinations, and Negation can be supported easily in the future.

@lavalamp I think it's not complex for a human to read.

Member

I think there's a lot of corner cases here. It's not easy to predict what calls I'll get if I'm trying to fill out a yaml file.

@whitlockjc (May 16, 2017)

This is just an operation-specific version of my PoC. So while this does address conflicts in criteria combinations, in the event you want to support multiple admission.Operations for the same resource(s), you now need one rule per operation per resource(s). Here are a few questions:

  • RBAC uses APIGroup for the same thing you're using Group for. Worth changing for consistency?
  • RBAC uses Resource for a combination of Resource and Subresource. Worth changing for consistency?
  • Why not allow namespaces as criteria? (this is basically the main criteria my proposal supports that yours does not)

Your proposal seems to be getting closer and closer to the implementation I've proposed. Originally you had one rule with M overlapping criteria. Now you have N rules with a single, non-overlapping criteria.

My concern with this approach is you now need more rules to describe similar rules with overlapping but non-conflicting criteria, not only in the case where you want to support multiple operations for the same resource but also where you want to support the same operation for multiple resources.

Contributor

We should be as consistent as possible with RBAC. I do think resource and subresource should be combined unless @deads2k can articulate that that was a bad decision.

I'd probably prefer to omit namespaces now, and potentially come up with a generic mechanism via annotations or some other concept at a later point.

Contributor

> We should be as consistent as possible with RBAC. I do think resource and subresource should be combined unless @deads2k can articulate that that was a bad decision.

Combining the two seems to match user expectation pretty well when they want to specify multiple values in a single rule.


@smarterclayton What's the reason for not supporting namespaces? That seems like a fairly common need. With namespaces being boundaries, it's very common for people to be interested in processing admission requests for some namespaces but not others. For example, just being able to not process anything from kube-system seems like a starting point for anyone. And if you do multi-tenancy, this is even more important.

// Operations is the list of operations this hook will be invoked on - Create, Update, or *
// for all operations. Defaults to '*'.
Operations []OperationType
// Resources are the resources this hook should be invoked on. '*' is all resources.
Resources []Resource
// Subresources is a list of subresources. If non-empty, this hook should be invoked on
Member Author

@smarterclayton i changed the comment. Does it match what you have in mind?

// all combinations of Resources and Subresources. '*' is all subresources.
Subresources []string

// ClientConfig defines how to talk to the hook.
ClientConfig AdmissionHookClientConfig

// FailurePolicy defines how unrecognized errors from the admission endpoint are handled -
// allowed values are Ignore or Fail. The default value is Fail
FailurePolicy FailurePolicyType
}

type Resource struct {
// Group is the API group the resource belongs to.
Group string
Member

Does this include the version?

Member Author (@caesarxuchao, May 16, 2017)

No. I don't think a webhook should behave differently for different versions of a resource.

Member

Representation could change drastically between versions. This determines the version of an object that will be sent to the webhook. So version does need to be something that is selected on.

// Resource is the name of the resource.
Resource string
}

type OperationType string


I really don't see the purpose of mirroring admission.Operation here. I don't see it happening but what happens if an admission.Operation is modified or added or removed? Why not just use admission.Operation as the type?

Member Author

Letting the API depend on the admission package sounds wrong. For example, if we decide to move the API to k8s.io/api, I don't want to make k8s.io/api depend on k8s.io/apiserver.


But since we're building against/for admission controllers, why is depending on admission wrong?

/cc @deads2k @smarterclayton

Member Author

It will create an import cycle.


Well, that would depend on where we put these new types.

Member Author

I see you are already doing it in pkg/apis/admission: https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/admission/types.go#L52

I think reversing the dependency is better, i.e. putting the definition in the API package, because the API packages are usually the leaves in the dependency graph.


I'm still confused as to what goes into an API group vs. what ends up elsewhere. I could see some of these things in plugin/pkg/admission ending up in pkg/apis/admission myself. As for dependency stuff, I'd rather a stable object model than one that has multiple sources of truth. I realize that the chances of this are low right now but that doesn't mean impossible. Having admission.Operation being used by the internal admission stuff and having a shadow type to reflect it seems like one link that's easy to break. Just my two cents.


const (
All OperationType = "*"
Create OperationType = "Create"
Member

POST and PUT?

Update OperationType = "Update"
)

// AdmissionHookClientConfig contains the information to make a TLS
// connection with the webhook
type AdmissionHookClientConfig struct {
// Service is a reference to the service for this webhook. It must communicate
Member

I would say-- if multiple ports are exposed, it will use 443 if available and be an error condition otherwise. If only a single port is exposed it can just use that.

// on port 443
Service ServiceReference
// CABundle is a PEM-encoded CA bundle which will be used to validate the webhook's server certificate.
CABundle []byte
}

// ServiceReference holds a reference to Service.legacy.k8s.io
type ServiceReference struct {
// Namespace is the namespace of the service
Namespace string
// Name is the name of the service
Name string
}
```
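To make the shape of the API concrete, here is a minimal, hypothetical example of a populated `AdmissionControlConfiguration` built from the types above. The initializer name, the webhook service, and the CA bundle placeholder are invented for illustration only, and the snippet assumes `GroupResource` has the usual `Group`/`Resource` string fields and that the schema above compiles in the same package.

```golang
// exampleACC registers one default initializer on pods (fail open) and one
// external admission webhook on pod and deployment creation (fail closed).
// All names below are placeholders, not part of the proposal.
var exampleACC = AdmissionControlConfiguration{
	ResourceInitializers: []ResourceDefaultInitializer{
		{
			Resource: GroupResource{Group: "", Resource: "pods"},
			Initializers: []Initializer{
				{Name: "podimage.example.com", FailurePolicy: Ignore},
			},
		},
	},
	ExternalAdmissionHooks: []ExternalAdmissionHook{
		{
			Operations: []OperationType{Create},
			Resources: []Resource{
				{Group: "", Resource: "pods"},
				{Group: "apps", Resource: "deployments"},
			},
			ClientConfig: AdmissionHookClientConfig{
				Service:  ServiceReference{Namespace: "default", Name: "image-policy-webhook"},
				CABundle: []byte("<PEM-encoded CA bundle>"), // placeholder
			},
			FailurePolicy: Fail,
		},
	},
}
```

Because everything lives in this single object, an admin can read off the full ordered list of initializers per resource and every webhook that will fire on create, which is the predictability goal stated above.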

## Synchronization of AdmissionControlConfiguration (**optional for the alpha implementation**)

If the `initializer admission controller` and the `generic webhook admission
controller` watch the `AdmissionControlConfiguration` and act upon deltas, their
cached version of the configuration might be arbitrarily delayed. This makes it
impossible to predict what initializers/hooks will be applied to newly created
objects.

We considered a few ways to make the behavior of the `initializer admission
controller` and the `generic webhook admission controller` predictable.

(I prefer #2. #1 is inefficient, and #3 requires a complex schema and is not intuitive.)
Member

Thanks for the clear specification of the options. IMO 2 is an optimization of 1; we should see if 1 is really problematic before doing the extra work. I would guess that an extra trip to etcd is going to be an insignificant amount of latency compared to the initializers/webhooks themselves. @smarterclayton agree?

@wojtek-t This is as good a place as any to ask about how this is going to impact our SLO, whether the 1 second return SLO will still make sense for create requests.

Member

@lavalamp - thanks for looping me in.

So I added a question about #2: https://github.com/kubernetes/community/pull/611/files#r116245850 which I think I don't understand.

The additional read from etcd for every create request sounds problematic, but we need to do some math to compute how many QPS it will generate.
I think that the only thing we are afraid of is the latency of the call to etcd. We probably should care mostly about larger clusters. So let's say we are targeting 100 pods/s throughput. So there will be O(100) create requests then.
In 5000-node clusters, we have 500 writes/s of just Node updates, so it's not that in such clusters we will have 10x more requests - it will be more like 10-20%. Still a lot though.


#### 1. Always do consistent read

The `initializer admission controller` and the `generic webhook admission
controller` always do consistent read of the `AdmissionControlConfiguration`
before applying the configuration to the incoming objects. This adds latency to
every CREATE request. Because the two admission controllers are in the same
process as the apiserver, the latency mainly consists of the consistent read
latency of the backend storage (etcd), and the proto unmarshalling.
Contributor

Fairly expensive today.

Thinking about this... watch from within the same etcd server is effectively free. We need to guarantee that the clock time for any given ACC is within some bound (because an admission controller can be arbitrarily delayed between when the data is read from etcd and when it applies the rules, so a hard boundary is probably too hard).

The scenario of greatest concern is when an apiserver's watch client is delayed relative to the mutation operations being executed, perhaps due to a network hang. If we had a single etcd watch on a set of key ranges, our minimum bound is:

The most recent admission control configuration delivered on the same watch channel as any returned event on that watch channel must be no older than X seconds after the last update operation on this server was performed, otherwise a consistent read must be performed.

That can degrade to just performing a consistent read from each api server every X seconds and delaying / rejecting requests until the read returns.

Contributor

Said another way, our etcd client simply has to ensure that the open watch channel from the server that also watches ACC has delivered ANY event within the time window we care about.

Contributor (@smarterclayton, May 12, 2017)

I don't know if the etcd client and server preserve the guarantee that would be necessary for the optimization there though. The client has to construct a watch, and the etcd server has to guarantee a "delivered before" behavior over the connection (event on ACC key has to be delivered before another key with higher index). gRPC substreams (one for each watch) might not support that if two watches are in different sub streams because the packets for the one stream can be delayed.

Contributor

Or the apiserver's write to the endpoints object has to track the member's observed high etcd index and delay writes if other members of the cluster have higher observed indices (if the watch cache value is higher than the read value of the GET on endpoints).

Contributor

I kind of like the upper bound and the degrading thing. Will make e2e tests slow though.

Contributor

So from first principles:

  1. admin must be able to know when config is applied
  2. admin can always restart an apiserver and guarantee it sees the latest config (fallback)
  3. admin needs to be able to ensure config eventually applies within some bound

Those are good goals for an alpha API. To just do 3, could we:

  1. do a consistent read every 30s
  2. if the consistent read fails, block the admission plugin (fail) until it succeeds

Tests would have to get clever about verifying the config is working. If we're concerned, we can simply add a watch on the config objects and always use the latest, but the consistent read must succeed every 30s. That would allow tests to work most of the time.
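A rough sketch of that idea follows: periodic consistent reads with a fail-closed gate in the admission plugin. It is purely illustrative; the interval, the type and method names, and the `read` callback are placeholders layered on top of the `AdmissionControlConfiguration` type above, not part of the proposal.

```golang
import (
	"fmt"
	"sync"
	"time"
)

// configGate caches the AdmissionControlConfiguration and refuses to admit
// requests when it has not been refreshed by a consistent read recently enough.
type configGate struct {
	mu          sync.Mutex
	current     *AdmissionControlConfiguration
	lastRefresh time.Time
	maxStale    time.Duration // e.g. 30 * time.Second
}

// run performs a consistent (quorum) read on every tick; a failed read simply
// leaves lastRefresh unchanged, which eventually makes config() fail closed.
func (g *configGate) run(read func() (*AdmissionControlConfiguration, error), stop <-chan struct{}) {
	ticker := time.NewTicker(g.maxStale / 3)
	defer ticker.Stop()
	for {
		if cfg, err := read(); err == nil {
			g.mu.Lock()
			g.current, g.lastRefresh = cfg, time.Now()
			g.mu.Unlock()
		}
		select {
		case <-ticker.C:
		case <-stop:
			return
		}
	}
}

// config returns the cached configuration, or an error (which the admission
// plugin turns into a rejected request) when the cache is older than maxStale.
func (g *configGate) config() (*AdmissionControlConfiguration, error) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.current == nil || time.Since(g.lastRefresh) > g.maxStale {
		return nil, fmt.Errorf("admission configuration older than %v; failing closed", g.maxStale)
	}
	return g.current, nil
}
```

As the comment above notes, a watch on the configuration could be layered on top so the gate usually serves fresher data, as long as the periodic consistent read still succeeds within the window.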

Member Author

Why is 2 an explicit goal? That sounds very easy to achieve.

Sure. I'll add this option to the doc, and let's see if we are on the same page.

Contributor

I just mean as - it's the thing that they have to do today, and it's worked so far. So it's ok to say "you must restart your masters to guarantee the latest config applied", which is probably reasonable.

Member Author

Got it.

Member

@smarterclayton - got it. Thanks a lot for the explanation. That makes perfect sense to me.
I added a potential improvement for it in #611 (comment)


#### 2. Optimized version of #1, do consistent read of a smaller object

Instead of having the two controllers do consistent read of the entire
`AdmissionControlConfiguration` object, we let the registry store the
resourceVersion of the `AdmissionControlConfiguration` (perhaps in a configMap),
Contributor

There is no atomic update guarantee, right? The apiserver can crash after updating the configmap but before updating the AdmissionControlConfiguration.
We should be fine if we always update the configmap first.

Member Author

Yeah, apiserver should always update the configmap first.

Member

No, apiserver would check for changes and rebuild the configmap on startup if necessary.

and let the two controllers always do consistent read of the resourceVersion and
Member

Sorry, I think I don't understand this option. Where is the "resourceVersion of the ACC" coming from? Do we still get it from etcd?
If not (just from local cache), I don't understand how it would work e.g. in HA setup.

Member Author

The apiserver that handles the ACC update will persist the updated ACC's rv in a configmap. When updating the rv, the apiserver needs to ensure it is increasing.

only read the entire `AdmissionControlConfiguration` if the local version is
lower than the stored one.
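A sketch of the per-request check under this option is shown below. The type, field, and function names are illustrative only, and the sketch assumes, as discussed in the thread above, that the recorded resourceVersion is kept monotonically increasing by the apiserver.

```golang
// cachedACC is held by the initializer/webhook admission controllers.
type cachedACC struct {
	localRV int64                          // resourceVersion of the cached copy
	config  *AdmissionControlConfiguration // cached configuration

	readMarkerRV func() (int64, error)                          // consistent read of the small marker object (e.g. the configmap)
	readFull     func() (*AdmissionControlConfiguration, error) // consistent read of the full ACC
}

// ensureFresh is called on each CREATE request. It always does a consistent
// read of the small marker object, but only re-reads the full
// AdmissionControlConfiguration when the cached copy is out of date.
func (c *cachedACC) ensureFresh() error {
	storedRV, err := c.readMarkerRV()
	if err != nil {
		return err
	}
	if c.localRV >= storedRV {
		return nil // cached configuration is current
	}
	cfg, err := c.readFull()
	if err != nil {
		return err
	}
	c.config, c.localRV = cfg, storedRV
	return nil
}
```

The saving over option #1 is only in payload size and unmarshalling; a consistent read against etcd still happens on every request.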

#### 3. Don't synchronize, but report what is the cached version

The main goal is *NOT* to always apply the latest
`AdmissionControlConfiguration`, but to make it predictable what
initializers/hooks will be applied. If we introduce the
`generation/observedGeneration` concept to the `AdmissionControlConfiguration`,
then a human (e.g., a cluster admin) can compare the generation with the
observedGeneration and predict whether all the initializers/hooks listed in the
`AdmissionControlConfiguration` will be applied.

In the HA setup, the `observedGeneration` reported by every apiserver's
`initializer admission controller` and `generic webhook admission controller`
may differ, so the API needs to record multiple `observedGeneration` values.

A tentative schema:

```golang
type AdmissionControlConfiguration struct {
...
// Generation is set by the registry
Generation int64
// ObservedGenerations is set by the `initializer admission controller` and
// `generic webhook admission controller` in each apiserver.
ObservedGenerations []ObservedGenerationByServer
}

type ObservedGenerationByServer struct {
// Address of this server
// This can be a hostname, hostname:port, IP or IP:port
Server string
// The entity that reports the observedGeneration
AdmissionController AdmissionControllerType
ObservedGeneration int64
}

type AdmissionControllerType string

const (
InitializerAdmissionController AdmissionControllerType = "initializer admission controller"
GenericWebhookAdmissionController AdmissionControllerType = "generic webhook admission controller"
)
```
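With such a schema, the admin-side check is a simple comparison. A hedged sketch, assuming the field names above and that the admin knows the expected set of apiservers (for brevity it ignores which of the two admission controllers reported; a stricter check would require both per server):

```golang
// configObservedEverywhere reports whether every expected apiserver has
// observed the latest AdmissionControlConfiguration, i.e. whether an admin can
// rely on the listed initializers/hooks being applied to newly created objects.
func configObservedEverywhere(acc *AdmissionControlConfiguration, expectedServers []string) bool {
	observed := map[string]bool{}
	for _, og := range acc.ObservedGenerations {
		if og.ObservedGeneration >= acc.Generation {
			observed[og.Server] = true
		}
	}
	for _, server := range expectedServers {
		if !observed[server] {
			return false // this apiserver may still apply a stale configuration
		}
	}
	return true
}
```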

## What if an initializer controller/webhook is not ready after registration? (**optional for the alpha implementation**)

This will block the entire cluster. We have a few options:

1. only allow initializers/webhooks to be created as "fail open". They can
upgrade themselves to "fail closed" via the normal Update operation. A human
can also update them to "fail closed" later.

2. less preferred: add a readiness check to initializers and webhooks; the `initializer
admission controller` and `generic webhook admission controller` only apply
those that have passed the readiness check. Specifically, we add `readiness` fields to
the `AdmissionControlConfiguration`; then we either create yet another
controller to probe for readiness and update the
`AdmissionControlConfiguration`, or ask each initializer/webhook to update
its readiness in the `AdmissionControlConfiguration`. The former is complex.
The latter is essentially the same as the first approach, except that we need
to introduce the additional concept of "readiness".
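If option 1 were adopted, it could be enforced with a small validation rule on create. A minimal sketch, assuming the schema above; the function name is illustrative and `fmt` is the only dependency:

```golang
import "fmt"

// validateNewACCEntries enforces option 1: newly registered initializers and
// webhooks must start as "fail open" (FailurePolicy == Ignore); they can be
// switched to Fail later via a normal Update once they are known to be serving.
func validateNewACCEntries(acc *AdmissionControlConfiguration) error {
	for _, ri := range acc.ResourceInitializers {
		for _, init := range ri.Initializers {
			if init.FailurePolicy == Fail {
				return fmt.Errorf("initializer %q must be created with failurePolicy Ignore", init.Name)
			}
		}
	}
	for i, hook := range acc.ExternalAdmissionHooks {
		if hook.FailurePolicy == Fail {
			return fmt.Errorf("externalAdmissionHooks[%d] must be created with failurePolicy Ignore", i)
		}
	}
	return nil
}
```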