Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add design details for Custom resource definition webhook validation. #1418

Closed
wants to merge 1 commit into from

Conversation

brendandburns
Copy link
Contributor

@kubernetes/sig-api-machinery-api-reviews @lavalamp

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 20, 2017
@lavalamp
Copy link
Member

Yeah, that was my suggestion. Just let me repeat that I'd like to get some miles on the webhooks before adding an auto-configuration layer.

@@ -381,6 +381,28 @@ Note:

2. For migration of CRDs with no validation to CRDs with validation, we can create a controller that will validate and annotate invalid CRs once the spec changes, so that the custom controller can choose to delete them (this is also essentially the status condition of the CRD). This can be achieved, but it is not part of the proposal.

### Admission Webhook

Custom resource definitions use the normal REST endpoint implementation and only customizes the registry and the codecs consequentaly dynamic web-hook admission controllers can be used
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reword, hard to read-- this is missing a comma or something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 13, 2017
@brendandburns
Copy link
Contributor Author

@lavalamp comment addressed, please re-check.

Thanks!

@@ -381,6 +381,29 @@ Note:

2. For migration of CRDs with no validation to CRDs with validation, we can create a controller that will validate and annotate invalid CRs once the spec changes, so that the custom controller can choose to delete them (this is also essentially the status condition of the CRD). This can be achieved, but it is not part of the proposal.

### Admission Webhook

Custom resource definitions use the the same REST endpoint as built-in Kubernetes API objects. Consequentaly standard
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Consequentaly/Consequently

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

create arbitrary admission controllers, it is necessary to have an alternate solution for
adding web-hook validation to custom resource definitions.

The solution is that the `CustomResourceDefinition` itself has an array of webhooks in the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically two arrays: mutating, non-mutating

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

and in many cases the creator of a `CustomResourceDefinition` is third-party extension code
(e.g. an instance of the _operator pattern_) which should not have the ability to
create arbitrary admission controllers, it is necessary to have an alternate solution for
adding web-hook validation to custom resource definitions.
Copy link
Contributor

@sttts sttts Dec 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises the question about how to create+distribute the necessary certificates. Currently, we don't have a mechanism to auto-create and inject

  • the webhook server key+certificate (trusted by the apiextensions-apiserver)
  • and the kube-apiserver proxy client CA (trusted and verified by the webhook).

We need both. I don't want to see every CRD controller using this implementing its own certificate generation code, and even worse creating something insecure.

In OpenShift we have a controller that's listening on a custom annotation on apiserver pods and then creates a secret on-demand. This gives a smooth and secure setup with zero config.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to solve the certificate problem. I had a brief chat with @deads2k a few weeks ago about extracting the service signing certificate controller from OpenShift and making it generally available. Whether we do that or find another solution, I believe we need to do something. Otherwise, I feel it's too much of a burden to ask users to figure out certificate generation, signing, etc. to be able to do things like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @ncdc is saying this, but I think this is a general problem with dynamic webhooks, not specific to this proposal. We need to make it much easier for people to build dynamic webhooks in general.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some text about this.

@brendandburns
Copy link
Contributor Author

Comments addressed, please take another look.

This ensures that third-party extension code can not register admission controllers for
arbitrary API objects.

To achieve this, the `CustomResourceDefinition` has two arrays of `admissionregistration.Webhook`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not much new in this paragraph. just move the type admissionregistration.Webhook into the previous paragraph.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@sttts
Copy link
Contributor

sttts commented Dec 15, 2017

One nit, otherwise lgtm.

There are a number of details like what of admissionregistration.Webhook can be set by the user and which not. But I think we don't need that here.

@brendandburns
Copy link
Contributor Author

@sttts comments addressed, please re-check.

Thanks!

However, because the creation of admission controllers is a fairly high-privilege activity,
and in many cases the creator of a `CustomResourceDefinition` is third-party extension code
(e.g. an instance of the _operator pattern_) which should not have the ability to
create arbitrary admission controllers, it is necessary to have an alternate solution for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing operators in the wild that install CRDs and RBAC rules, I question that there is a real difference to "creation of admission controllers" today. Of course, I totally agree that there should be. But for that we need a similar in-CRD mechanism for RBAC rules.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is clearly a job for a future design PR :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, just thinking out loud here ;)

definition itself. The arrays will be of type `admissionregistration.Webhook`.
One array will contain non-mutating admission controllers for things like
validation, and one will have mutating admission controllers for the purposes of defaulting.
The custom resource controller (a piece of trusted code), will register
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

custom resource definition controller

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@brendandburns
Copy link
Contributor Author

Comments addressed, please re-check.

Thanks!

@sttts
Copy link
Contributor

sttts commented Jan 3, 2018

@deads2k @liggitt ptal

@deads2k
Copy link
Contributor

deads2k commented Jan 4, 2018

Thinking about the overall flow of creating a CRD, I see this

  1. Create a CRD and affect global discovery information, including shadowing existing resources and stealing shortnames. This is currently reserved to cluster admins by default.
  2. Create an RBAC cluster role to allow users to use the new custom resource. This requires passing an access escalation check. Since the custom resource won't exist, that means it is reserved to cluster admins too.
  3. Create an AdmissionRegistration. This currently requires cluster-admin as well.

Given that the first two items require cluster-admin access and the RBAC cluster role implies the power to grant access to any resources, I don't see an avenue for attack that this locks down. If I can create a useful and accessible CRD, I can already read every pod and secret in the system (or grant access to do so).

@lavalamp
Copy link
Member

lavalamp commented Jan 4, 2018 via email

@deads2k
Copy link
Contributor

deads2k commented Jan 5, 2018

I think Brendan probably wants to make things easy, not lock them down.

I'm not sure where this comment is going. In a cluster that cares about security, I think my point stands and if you grant permissions creating separate resources works fine. In a cluster that doesn't care about security, you don't have a restriction on those things and creating separate resources works fine. Are you suggesting there's a third case? Could you describe it?

@lavalamp
Copy link
Member

lavalamp commented Jan 5, 2018 via email

@brendandburns
Copy link
Contributor Author

brendandburns commented Feb 5, 2018

(sorry returning to this)

@deads2k regarding your point, there are two things:

  1. Regardless of security, I think this design prevents accidents. I'd rather not have CRDs accidentally creating badly-written/broken admission controllers and breaking the entire cluster.

  2. My hoped for flow for a secure cluster would look like:

Cluster-admin creates a stub CRD with names/short-names but no real details.
CRD-Operator is created with permissions to update their own CRD, but no other permissions.
CRD-Operator updates the CRD to add a web-hook for validation.

In this model, you can keep CRD creation with cluster-admin, but delegate the details of the CRD definition to the CRD operator.

I don't see anything in your concerns that should block merging this design proposal so that we can start building it, let me know if you feel differently and I'll address those concerns.

Thanks

@k8s-github-robot k8s-github-robot added the kind/design Categorizes issue or PR as related to design. label Feb 5, 2018
@deads2k
Copy link
Contributor

deads2k commented Feb 5, 2018

I don't see anything in your concerns that should block merging this design proposal so that we can start building it, let me know if you feel differently and I'll address those concerns.

My comments and concerns address the core concept of this pull and I think that until we have agreement in concept, we should not merge this change.

CRD-Operator is created with permissions to update their own CRD, but no other permissions.

The power to update the CustomResourceDefinition allows power over global discovery. Allowing the CRD-operator the power to change the CRD is akin to allowing a process to modify its own configuration.

The CRD-operator should not get to decide the shape of the CRD that it uses and it also not be able to control the validation associated with it. When the CRD is created, it should contain its validation rules. This means the responsibility lies with the CRD creator who is also creating the roles. It looks like a cluster-admin to me.

@lavalamp
Copy link
Member

lavalamp commented Feb 5, 2018

Maybe we would do well to work on having some common goal / requirements around usability + security for CRDs. I don't think we can come to agreement on a specific proposal until we agree what we're even trying to accomplish.

@brendandburns
Copy link
Contributor Author

@deads2k given that that is up to the cluster admin to decide via RBAC, I don't see the issue here.

IF the cluster admin wants to force manual install for all CRDs that's their call.

That said, I think there are lots of clusters were people want things to behave like plugins and just dynamically work. But forcing CRD developers to learn about admission controllers and to configure them correctly just adds complexity and work to the job of the CRD developer.

Could you describe the disagreement you see in the core concept?

For me the core-concept(s) are:

  • We should have webhook validation for CRDs
  • We should have an controlled and automated method to make it easier for CRD developers/operators to activate those webhook validators without writing raw Admission Controllers

Nothing in this proposal is affecting the security model one way or the other, the cluster admin is free to set RBAC however they choose.

Please clarify, thanks!

@deads2k
Copy link
Contributor

deads2k commented Feb 6, 2018

We should have an controlled and automated method to make it easier for CRD developers/operators to activate those webhook validators without writing raw Admission Controllers

This piece. It isn't obvious to me what value this adds. You're duplicating an existing API in a different object to save setting three fields. In the use-cases we have already the actor creating a useful CRD already has full permission on cluster, so there isn't much value in duplicating the API.

You're still going to be writing the admission webhook code either way. You're just eliminating writing a rule from the resource manifest. Also, writing a library to make the admission plugin webhook is fairly easy. See https://github.com/openshift/generic-admission-server as an example in golang.

@brendandburns
Copy link
Contributor Author

The most important feature of this isn't the simplification (though I think that's important too) it's preventing someone from making mistakes and accidentally registering an Admission Controller that has broader scope than just the particular CRD.

Given that your objection is based on not seeing the value (and I believe there is the value outlined above) can we agree to disagree and move forward with this proposal? I don't think it hurts anything, and I personally believe there is significant value in terms of ease of use and prevention of mistakes.

@liggitt
Copy link
Member

liggitt commented Feb 7, 2018

Given that your objection is based on not seeing the value (and I believe there is the value outlined above) can we agree to disagree and move forward with this proposal?

I'd prefer we not expand/nest APIs if we don't have to. I'd rather see API additions justified by enabling use cases that are not currently possible.

@brendandburns
Copy link
Contributor Author

I don't think this is nesting, this is like the fact that PodTemplate is embedded in multiple different objects. It is re-use of a common object. We're not nesting a complete AdmissionController.

One of the recurring benefits of CRDs (and the reason why they are way more popular than aggregated API servers) is ease of use.

If you tell someone that if they want to validate their CRD they need to either learn Swagger or learn about admission controllers, you are erecting serious barriers to entry.

I think you need to focus on the usability for the average user, not the person who spends their time reading sig-api-machinery.

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 8, 2018
@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 30, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: deads2k

Assign the PR to them by writing /assign @deads2k in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 29, 2018
@fejta-bot
Copy link

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/design Categorizes issue or PR as related to design. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants