Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add EP for synchronous webhook validation #685

Merged
merged 1 commit into from
Mar 30, 2021

Conversation

imjasonh
Copy link
Contributor

Changes

Adds a proposal as provisional to gather feedback.

See also #596

/kind feature

Submitter Checklist

  • [n/a] Includes tests if functionality changed/was added
  • [n/a] Includes docs if changes are user-facing
  • [y] Set a kind label on this PR
  • [y] Release notes block has been filled in, or marked NONE

See the contributor guide
for details on coding conventions, github and prow interactions, and the code review process.

Release Notes

NONE

@openshift-ci-robot openshift-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. release-note-none Label for when a PR does not need a release note labels Mar 19, 2021
docs/proposals/webhook-validation.md Outdated Show resolved Hide resolved
docs/proposals/webhook-validation.md Outdated Show resolved Hide resolved
docs/proposals/webhook-validation.md Show resolved Hide resolved
docs/proposals/webhook-validation.md Show resolved Hide resolved
Do we want to require that operators run another job to synchronously validate client requests?

Do we want to rely on Knative's webhook packages, or write our own code to manage certs ourselves?
Is there anything already out there in controller-gen we should be using?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a req for change, but rather an aside for those who might look at this, @coreydaley on our team has a non-shipwright assignment that happens to involve investigating the particulars here around webhook cert management outside the confines of knative

hopefully we can take what he discovers and apply it to the consideration here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UPDATE: more details are in #596 though some of them are in Red Hat only slack discussion threads that need to be summarized in something consumable by the shipwright community.

High level summary: there is not at the moment in the operator framework / controller-gen space an equivalent of the auto cert generation / mgmt features in the knavite webhook abstraction layers.

Copy link
Member

@gabemontero gabemontero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this captures all the elements of prior conversations that I recall @imjasonh

thanks

Copy link
Contributor

@qu1queee qu1queee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asking some minor question around the EP, looks good to me in general

docs/proposals/webhook-validation.md Show resolved Hide resolved
docs/proposals/webhook-validation.md Show resolved Hide resolved
docs/proposals/webhook-validation.md Show resolved Hide resolved
docs/proposals/webhook-validation.md Show resolved Hide resolved

## Drawbacks

This requires us to build a new component in the Shipwright installation, and for the operator component to operate it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That list is quite short. :-)

There is one caveat with the webhook: it validates only on API request (create or update is relevant in this context). But, our validations often include another resource which can also change.

We have some validations where this does not really matter, for example the Git repository validation because we have no trigger if this goes away anyway, or the non-existing Build on a BuildRun creation because the BuildRun is a one-shot thing.

But, we also validate the referenced secrets of a Build and would notice if a secret that is referenced by a Build gets deleted (assuming the secret has the necessary annotation) and subsequently mark the Build as not registered. With the webhook, this would not be possible anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points, thanks for bringing them up.

I think it could be simpler (for our implementation, and our users' mental model) for Builds to be assumed to be valid until we find out they're not the next time they're run. I'd like to avoid the scenario where deleting/updating a Secret results in dozens or hundreds of Builds having to be reconciled to check if they're still valid. Especially if that validation includes reaching out to a registry to check that the new credentials are valid, potentially lots of times.

In Tekton, you can create a Task, create a Pipeline that uses the Task, then delete the Task, and the Pipeline doesn't proactively report as invalid. The next time you try to run that Pipeline, PipelineRun creation will fail saying the referenced Task isn't valid. This avoids unnecessary duplicative validation work while a user doesn't care, and puts the error message directly in the user's face the next time they try it.

I'm open to discussing this point more though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Always accepting and storing resources, even invalid ones, can also lead to additional storage of known-invalid objects, which clutter the user experience and can lead to confusion, and in extreme cases, cluster resource exhaustion.

Mixing reconciliation and validation code in the controller also complicates our codebase, leading to PRs like https://github.com/shipwright-io/build/pull/641 which add new validation-centric statuses that the user must watch for.
As we add new validation scenarios, we currently add new `Reason` strings that clients must be aware of, which is cumbersome and won't scale indefinitely.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imo this would not change with a webhook. We will have different validations and should continue to report the problem as specific as possible, for example using an error code that would be very much the same as the reason that we have today. A good client would have to provide the user different assistance depending on what the error code is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A goal (perhaps unstated) of this proposal is to not have to define and maintain a table of all possible validation error scenarios and associated unique codes/reasons. That doesn't scale well much beyond where we're already at, and unless we invest in generating documentation, we will have them fall out of sync eventually. Incorrect or incomplete documentation can be worse than no documentation.

Instead, I'd like to invest in human-readable descriptive error messages that we can proactively put in front of users' faces. Those errors can also include codes (i.e., along the lines of HTTP 400 vs 404) to classify the kinds of errors, if we want.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will change @SaschaSchwarze0 , that does not mean we will get rid of how we populate our Status.Conditions, we still need the core states of a resource that runs to completion, e.g. True, False, Unknown and all potential flavors of the Reason field. What we will do is reduce some of those flavors, e.g. BuildStrategyNotFound and allow the admission webhook to block the creation of a BuildRun if the strategy is not there.

@imjasonh I think the table of error scenarios is still needed till some extent, probably an smaller table of error scenarios but we still need to decide in the logic around when to mark something as False with a Failed Reason, or with a BuildRunTimeout one. I dont see this going away.

Also, I´m not sure if the error categorization discussion is out of the scope on this PR, maybe too much of an implementation detail, which we might want to take somewhere else. I do have some strong opinions around entering the business of parsing errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good discussion. I was mainly focusing on those validations that we would perform before a resource gets created (like referenced secrets and buildstrategy when a Build creation happens). I agree that we will have HTTP status codes which can help us to categorize errors already. Though, I guess we will have many errors that translate into the same HTTP status codes. Not just because of that, I think that errors should be categorized using a computer-readable error code (like the current reasons we have for Builds). I am definitely not against human-readable descriptions. But, those have a tendency to be changed from time to time, and client interfaces - especially localized one's - typically cannot rely on the human-readable description. They need other means to identify an error to guide their users to make the necessary correction.

@qu1queee qu1queee self-requested a review March 29, 2021 10:13
Copy link
Contributor

@qu1queee qu1queee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 29, 2021
@gabemontero
Copy link
Member

/approve

/hold

@imjasonh @SaschaSchwarze0 feel free to unhold if you think the latest discussion threads between you two are resolved. They seemed to be but I was not certain enough to assume that.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 29, 2021
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gabemontero

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 29, 2021
@SaschaSchwarze0
Copy link
Member

SaschaSchwarze0 commented Mar 30, 2021

@imjasonh @SaschaSchwarze0 feel free to unhold if you think the latest discussion threads between you two are resolved. They seemed to be but I was not certain enough to assume that.

I am generally fine with it. Will make sure that the errors that the validation errors that the webhook returns are good enough to be machine-readable when it gets implemented.

/unhold

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 30, 2021
@openshift-merge-robot openshift-merge-robot merged commit d2520f8 into shipwright-io:master Mar 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. release-note-none Label for when a PR does not need a release note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants