Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design: Dynamic registration of admission controllers and initializers #611

Merged
Merged
Original file line number Diff line number Diff line change
@@ -0,0 +1,371 @@

## Background

The extensible admission control
[proposal](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/admission_control_extension.md)
proposed making admission control extensible. In the proposal, the `initializer
admission controller` and the `generic webhook admission controller` are the two
controllers that set default initializers and external admission hooks for
resources newly created. These two admission controllers are in the same binary
as the apiserver. This
[section](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/admission_control_extension.md#dynamic-configuration)
gave a preliminary design of the dynamic configuration of the list of the
default admission controls. This document hashes out the implementation details.

## Goals

* Admin is able to predict what initializers/webhooks will be applied to newly
created objects.

* Admin needs to be able to ensure initializers/webhooks config will be applied within some bound

* As a fallback, admin can always restart an apiserver and guarantee it sees the latest config

* Do not block the entire cluster if the intializers/webhooks are not ready
after registration.

## Specification

We assume initializers could be "fail open". We need to update the extensible
admission control
[proposal](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/admission_control_extension.md)
if this is accepted.

The schema is evolved from the prototype in
[#132](https://github.com/kubernetes/community/pull/132).

```golang
// InitializerConfiguration describes the configuration of intializers.
type InitializerConfiguration struct {
metav1.TypeMeta

v1.ObjectMeta

// Initializers is a list of resources and their default initializers
// Order-sensitive.
// When merging multiple InitializerConfigurations, we sort the intializers
// from different InitializerConfigurations by the name of the
// InitializerConfigurations; the order of the intializers from the same
// InitializerConfiguration is preserved.
// +optional
Initializers []Initializer `json:"initializers,omitempty" patchStrategy:"merge" patchMergeKey:"name"`
}

// Initializer describes the name and the failure policy of an initializer, and
// what resources it applies to.
type Initializer struct {
// Name is the identifier of the initializer. It will be added to the
// object that needs to be initialized.
// Name should be fully qualified, e.g., alwayspullimages.kubernetes.io, where
// "alwayspullimages" is the name of the webhook, and kubernetes.io is the name
// of the organization.
// Required
Name string `json:"name"`

// Rules describes what resources/subresources the initializer cares about.
// The intializer cares about an operation if it matches _any_ Rule.
Rules []Rule `json:"rules,omitempty"`

// FailurePolicy defines what happens if the responsible initializer controller
// fails to takes action. Allowed values are Ignore, or Fail. If "Ignore" is
// set, initializer is removed from the initializers list of an object if
// the timeout is reached; If "Fail" is set, apiserver returns timeout error
// if the timeout is reached. The default timeout for each initializer is
// 5s.
FailurePolicy *FailurePolicyType `json:"failurePolicy,omitempty"`
}

// Rule is a tuple of APIGroups, APIVersion, and Resources.It is recommended
// to make sure that all the tuple expansions are valid.
type Rule struct {
// APIGroups is the API groups the resources belong to. '*' is all groups.
// If '*' is present, the length of the slice must be one.
// Required.
APIGroups []string `json:"apiGroups,omitempty"`

// APIVersions is the API versions the resources belong to. '*' is all versions.
// If '*' is present, the length of the slice must be one.
// Required.
APIVersions []string `json:"apiVersions,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if I set groups to [apps, batch, core] and versions to [v1, v2alpha1, v1beta1], and not all versions are in all groups?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's recommended to separate them to different Rules, and make sure in each Rule all the tuple expansions make sense. The best practice is documented in the struct comment.

Also, even if the best practice is not followed, the Rule still works, the invalid tuple expansion is ignored.


// Resources is a list of resources this rule applies to.
//
// For example:
// 'pods' means pods.
// 'pods/log' means the log subresource of pods.
// '*' means all resources, but not subresources.
// 'pods/*' means all subresources of pods.
// '*/scale' means all scale subresources.
// '*/*' means all resources and their subresources.
//
// If '*' or '*/*' is present, the length of the slice must be one.
// Required.
Resources []string `json:"resources,omitempty"`
}

type FailurePolicyType string

const (
// Ignore means the initilizer is removed from the initializers list of an
// object if the initializer is timed out.
Ignore FailurePolicyType = "Ignore"
// For 1.7, only "Ignore" is allowed. "Fail" will be allowed when the
// extensible admission feature is beta.
Fail FailurePolicyType = "Fail"
)

// ExternalAdmissionHookConfiguration describes the configuration of intializers.
type ExternalAdmissionHookConfiguration struct {
metav1.TypeMeta

v1.ObjectMeta
// ExternalAdmissionHooks is a list of external admission webhooks and the
// affected resources and operations.
// +optional
ExternalAdmissionHooks []ExternalAdmissionHook `json:"externalAdmissionHooks,omitempty" patchStrategy:"merge" patchMergeKey:"name"`
}

// ExternalAdmissionHook describes an external admission webhook and the
// resources and operations it applies to.
type ExternalAdmissionHook struct {
// The name of the external admission webhook.
// Name should be fully qualified, e.g., imagepolicy.kubernetes.io, where
// "imagepolicy" is the name of the webhook, and kubernetes.io is the name
// of the organization.
// Required.
Name string `json:"name"`

// ClientConfig defines how to communicate with the hook.
// Required
ClientConfig AdmissionHookClientConfig `json:"clientConfig"`

// Rules describes what operations on what resources/subresources the webhook cares about.
// The webhook cares about an operation if it matches _any_ Rule.
Rules []RuleWithVerbs `json:"rules,omitempty"`

// FailurePolicy defines how unrecognized errors from the admission endpoint are handled -
// allowed values are Ignore or Fail. Defaults to Ignore.
// +optional
FailurePolicy *FailurePolicyType
}

// RuleWithVerbs is a tuple of Verbs and Resources. It is recommended to make
// sure that all the tuple expansions are valid.
type RuleWithVerbs struct {
// Verbs is the verbs the admission hook cares about - CREATE, UPDATE, or *
// for all verbs.
// If '*' is present, the length of the slice must be one.
// Required.
Verbs []OperationType `json:"verbs,omitempty"`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are calling these Verbs but they are not, they are admission.Operations. I think we should be consistent with admission.Attributes and use Operations as the struct member name.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why isn't Operations just a member of Rule instead of being wrapped in another type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rbac called it verbs. I think being consistent with rbac is more important, because it's user-facing.

The rule doesn't contain verbs because it's used by initializers which only cares about CREATE.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the case here. If someone were writing a Go-based AC, they would be operating with admission.Attributes and since our rules will filter/match based on admission.Attributes, I think that is the contract. I get that RBAC consistency is nice but I think this is one case where it's more confusing to do so.

/cc @deads2k @smarterclayton

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@whitlockjc you are right. I thought authorizer interface and admission interface would use the same term, but the are not, authorizer interface's method is called GetVerb, while the admission interface's method is called GetOperation. Is that intentional?

// Rule is embedded, it describes other criteria of the rule, like
// APIGroups, APIVersions, Resources, etc.
Rule `json:",inline"`
}

type OperationType string

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't see the purpose of mirroring admission.Operation here. I don't see it happening but what happens if an admission.Operation is modified or added or removed? Why not just use admission.Operation as the type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let the api depends on the admission package sounds wrong. For example, if we decide to move the API to k8s.io/api, i don't want to make k8s.io/api depends on k8s.io/apiserver.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But since we're building against/for admission controllers, why is depending on admission wrong?

/cc @deads2k @smarterclayton

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will create import cycle.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that would depend on where we put these new types.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you are already doing it in pkg/apis/admission: https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/admission/types.go#L52

I think reversing the dependency is better, i.e. putting the definition in the API definition, because the API packages are usually the leaves in the dependency graph.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still confused as to what goes into an API group vs. what ends up elsewhere. I could see some of these things in plugin/pkg/admission ending up in pkg/apis/admission myself. As for dependency stuff, I'd rather a stable object model than one that has multiple sources of truth. I realize that the chances of this are low right now but that doesn't mean impossible. Having admission.Operation being used by the internal admission stuff and having a shadow type to reflect it seems like one link that's easy to break. Just my two cents.


const (
VerbAll OperationType = "*"
Create OperationType = "CREATE"
Update OperationType = "UPDATE"
Delete OperationType = "DELETE"
Connect OperationType = "CONNECT"
)

// AdmissionHookClientConfig contains the information to make a TLS
// connection with the webhook
type AdmissionHookClientConfig struct {
// Service is a reference to the service for this webhook. If there is only
// one port open for the service, that port will be used. If there are multiple
// ports open, port 443 will be used if it is open, otherwise it is an error.
// Required
Service ServiceReference `json:"service"`
// CABundle is a PEM encoded CA bundle which will be used to validate webhook's server certificate.
// Required
CABundle []byte `json:"caBundle"`
}

// ServiceReference holds a reference to Service.legacy.k8s.io
type ServiceReference struct {
// Namespace is the namespace of the service
// Required
Namespace string `json:"namespace"`
// Name is the name of the service
// Required
Name string `json:"name"`
}
```

Notes:
* There could be multiple InitializerConfiguration and
ExternalAdmissionHookConfiguration. Every service provider can define their
own.

* This schema asserts a global order of initializers, that is, initializers are
applied to different resources in the *same* order, if they opt-in for the
resources.

* The API will be placed at k8s.io/apiserver for 1.7.

* We will figure out a more flexible way to represent the order of initializers
in the beta version.

* We excluded `Retry` as a FailurePolicy, because we want to expose the
flakeness of an admission controller; and admission controllers like the quota
controller are not idempotent.

* There are multiple ways to compose `Rules []Rule` to achieve the same effect.
It is recommended to compact to as few Rules as possible, but make sure all
expansions of the `<Verbs, APIGroups, APIVersions, Resource>` tuple in each
Rule are valid. We need to document the best practice.

## Synchronization of admission control configurations

If the `initializer admission controller` and the `generic webhook admission
controller` watch the admission control configurations and act upon deltas, their
cached version of the configuration might be arbitrarily delayed. This makes it
impossible to predict what initializer/hooks will be applied to newly created
objects.

To make the behavior of `initializer admission controller` and the `generic
webhook admission controller` predictable, we let them do a consistent read (a
"LIST") of the InitializerConfiguration and ExternalAdmissionHookConfiguration
every 1s. If there isn't any successful read in the last 5s, the two admission
controllers block all incoming request. One consistent read per second isn't
going to cause performance issues.

In the HA setup, apiservers must be configured with --etcd-quorum-read=true.

See [Considered but REJECTED alternatives](#considered-but-rejected-alternatives) for considered alternatives.

## Handling initializers/webhooks that are not ready but registered

We only allow initializers/webhooks to be created as "fail open". This could be
enforced via validation. They can upgrade themselves to "fail closed" via the
normal Update operation. A human can also update them to "fail closed" later.

See [Considered but REJECTED alternatives](#considered-but-rejected-alternatives) for considered alternatives.

## Handling fail-open initializers

The original [proposal](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/admission_control_extension.md) assumed initializers always failed closed. It is dangerous since crashed
initializers can block the whole cluster. We propose to allow initializers to
fail open, and in 1.7, let all initializers fail open.

#### Implementation of fail open initializers.

In the initializer prototype
[PR](https://github.com/kubernetes/kubernetes/pull/36721), the apiserver that
handles the CREATE request
[watches](https://github.com/kubernetes/kubernetes/pull/36721/files#diff-2c081fad5c858e67c96f75adac185093R349)
the uninitialized object. We can add a timer there and let the apiserver remove
the timed out initializer.

If the apiserver crashes, then we fall back to a `read repair` mechanism. When
handling a GET request, the apiserver checks the objectMeta.CreationTimestamp of
the object, if a global intializer timeout (e.g., 10 mins) has reached, the
apiserver removes the first initializer in the object.

In the HA setup, apiserver needs to take the clock drift into account as well.

Note that the fallback is only invoked when the initializer and the apiserver
crashes, so it is rare.

See [Considered but REJECTED alternatives](#considered-but-rejected-alternatives) for considered alternatives.

## Future work

1. Figuring out a better schema to represent the order among
initializers/webhooks, e.g., adding fields like lists of initializers that
must execute before/after the current one.

2. #1 will allow parallel initializers as well.

3. implement the fail closed initializers according to
[proposal](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/admission_control_extension.md#initializers).

4. more efficient check of AdmissionControlConfiguration changes. Currently we
do periodic consistent read every second.

5. block incoming requests if the `initializer admission controller` and the
`generic webhook admission controller` haven't acknowledged a recent change
to AdmissionControlConfiguration. Currently we only guarantee a change
becomes effective in 1s.

## Considered but REJECTED alternatives:

### synchronization mechanism

#### Rejected 1. Always do consistent read

Rejected because of inefficiency.

The `initializer admission controller` and the `generic webhook admission
controller` always do consistent read of the `AdmissionControlConfiguration`
before applying the configuration to the incoming objects. This adds latency to
every CREATE request. Because the two admission controllers are in the same
process as the apiserver, the latency mainly consists of the consistent read
latency of the backend storage (etcd), and the proto unmarshalling.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fairly expensive today.

Thinking about this... watch from within the same etcd server is effectively free. We need to guarantee that the clock time for any given ACC is within some bound (because an admission controller can be arbitrarily delayed between when the data is read from etcd and when it applies the rules, so a hard boundary is probably too hard).

The scenario of greatest concern is when an apiservers watch client is delayed relative to the mutation operations being executed, perhaps due to a network hang. If we had a single etcd watch on a set of key ranges, our minimum bound is:

The most recent admission control configuration delivered on the same watch channel as any returned event on that watch channel must be no older than X seconds after the last update operation on this server was performed, otherwise a consistent read must be performed.

That can degrade to just performing a consistent read from each api server every X seconds and delaying / rejecting requests until the read returns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Said another way, our etcd client simply has to ensure that the open watch channel from the server that also watches ACC has delivered ANY event within the time window we care about.

Copy link
Contributor

@smarterclayton smarterclayton May 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if the etcd client and server preserve the guarantee that would be necessary for the optimization there though. The client has to construct a watch, and the etcd server has to guarantee a "delivered before" behavior over the connection (event on ACC key has to be delivered before another key with higher index). gRPC substreams (one for each watch) might not support that if two watches are in different sub streams because the packets for the one stream can be delayed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or the api server write to the endpoints object has to track the member observed high etcd index and delay writes if other members of the cluster have higher observed indices (if the watch cache value is higher than the read value of the GET on endpoints)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of like the upper bound and the degrading thing. Will make e2e tests slow though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So from first principles:

  1. admin must be able to know when config is applied
  2. admin can always restart an apiserver and guarantee it sees the latest config (fallback)
  3. admin needs to be able to ensure config eventually applies within some bound

Those are good goals for an alpha api. To just do 3, could we:

  1. do a consistent read every 30s
  2. if the consistent read fails, block the admission plugin (fail) until it succeeds

Tests would have to get clever about verifying the config is working. If we're concerned, we can simply add a watch on the config objects and always use the latest, but the consistent read must succeed every 30s. That would allow tests to work most of the time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is 2 an explicit goal? That sounds very easy to achieve.

Sure. I'll add this option to the doc, and let's see if we are on the same page.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just mean as - it's the thing that they have to do today, and it's worked so far. So it's ok to say "you must restart your masters to guarantee the latest config applied", which is probably reasonable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smarterclayton - got it. Thanks a lot for explanation. That makes perfect sense to me.
I added a potential improvement for it in #611 (comment)



#### Rejected 2. Don't synchronize, but report what is the cached version

Rejected because it violates Goal 2 on the time bound.

The main goal is *NOT* to always apply the latest
`AdmissionControlConfiguration`, but to make it predictable what
initializers/hooks will be applied. If we introduce the
`generation/observedGeneration` concept to the `AdmissionControlConfiguration`,
then a human (e.g., a cluster admin) can compare the generation with the
observedGeneration and predict if all the initializer/hooks listed in the
`AdmissionControlConfiguration` will be applied.

In the HA setup, the `observedGeneration` reported by of every apiserver's
`initializer admission controller` and `generic webhook admission controller`
are different, so the API needs to record multiple `observedGeneration`.

#### Rejected 3. Always do a consistent read of a smaller object

Rejected because of the complexity.

A consistent read of the AdmissionControlConfiguration object is expensive, we
cannot do it for every incoming request.

Alternatively, we record the resource version of the AdmissionControlConfiguration
in a configmap. The apiserver that handles an update of the AdmissionControlConfiguration
updates the configmap with the updated resource version. In the HA setup, there
are multiple apiservers that update this configmap, they should only
update if the recorded resource version is lower than the local one.

The `initializer admission controller` and the `generic webhook admission
controller` do a consistent read of the configmap *everytime* before applying
the configuration to an incoming request. If the configmap has changed, then
they do a consistent read of the `AdmissionControlConfiguration`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Section for future work here?


### Handling not ready initializers/webhook

#### Rejected 1.

add readiness check to initializer and webhooks, `initializer admission
controller` and `generic webhook admission controller` only apply those have
passed readiness check. Specifically, we add `readiness` fields to
`AdmissionControllerConfiguration`; then we either create yet another controller
to probe for the readiness and update the `AdmissionControllerConfiguration`, or
ask each initializer/webhook to update their readiness in the
`AdmissionControllerConfigure`. The former is complex. The latter is
essentially the same as the first approach, except that we need to introduce the
additional concept of "readiness".

### Handling fail-open initializers

#### Rejected 1. use a controller

A `fail-open initializers controller` will remove the timed out fail-open
initializers from objects' initializers list. The controller uses shared
informers to track uninitialized objects. Every 30s, the controller

* makes a snapshot of the uninitialized objects in the informers.
* indexes the objects by the name of the first initialilzer in the objectMeta.Initializers
* compares with the snapshot 30s ago, finds objects whose first initializers haven't changed
* does a consistent read of AdmissionControllerConfiguration, finds which initializers are fail-open
* spawns goroutines to send patches to remove fail-open initializers