Adding GEP-713: Policy Attachment #715
Conversation
```yaml
default:
  maxRetries: 5
targetRef:
  group: networking.acme.io
```
Way outside of scope, but it would be interesting to standardize this.
Yep, agreed. I almost used `gateway.networking.k8s.io` here as the example but wanted to leave that for a later discussion.
Thanks for working on this, Rob.
Huge lift and you have been phenomenally pushing this forward, great work!
I'm still reading through the linked Gdoc and I will have more comments here.
site-src/geps/gep-713.md (Outdated)
Consumer to Backend.

| Ingress | Sidecar Consumer |
|-|-|
Why no GatewayClass?
Or even a cluster-level default without a reference to a class (I'm not sure what the target ref would be in that case)?
I asked the same somewhere; the answer was that it would need to be a cluster-scoped policy to reference a cluster-scoped resource, which seems to make sense to me.
For Istio specifically we do have similar cluster-level policies, but we have a special namespace that is highly privileged and considered the "root", which allows us to bypass the cluster scoping. It works well for us, but I'm not sure it makes sense as part of the API?
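For reference, that Istio pattern looks roughly like this: a mesh-wide policy is just an ordinary namespaced resource placed in the configured root namespace (`istio-system` by default):

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system  # the privileged "root" namespace; policies here apply mesh-wide
spec:
  mtls:
    mode: STRICT
```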
Since I'm proposing direct + local references from policy to target, the policy resource needs to exist in the same scope as the target. That means that any policy attached to a GatewayClass would need to be a cluster scoped resource as well. So far I've thought that GatewayClass parameters would be the simplest way to express defaults at this level. I'll add some notes to the GEP to summarize this.
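A minimal sketch of that approach, assuming a hypothetical cluster-scoped `AcmeClusterDefaults` parameters resource (only `GatewayClass` and `parametersRef` come from the Gateway API):

```yaml
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: GatewayClass
metadata:
  name: acme-lb
spec:
  controllerName: acme.io/gateway-controller
  parametersRef:
    # Hypothetical vendor-specific resource holding class-level defaults.
    group: networking.acme.io
    kind: AcmeClusterDefaults
    name: acme-lb-defaults
```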
> That means that any policy attached to a GatewayClass would need to be a cluster scoped resource as well. So far I've thought that GatewayClass parameters would be the simplest way to express defaults at this level. I'll add some notes to the GEP to summarize this.
Parameters ref can point to a namespaced resource?
What if someone decides to have two policy resources: TimeoutPolicy and TimeoutClusterPolicy, where one is namespace scoped and the other is cluster scoped?
> Parameters ref can point to a namespaced resource?

That's completely fine because it still requires someone with cluster-scoped write access to the GatewayClass to make that reference.

> What if someone decides to have two policy resources: TimeoutPolicy and TimeoutClusterPolicy, where one is namespace scoped and the other is cluster scoped?

I think that's the only way policy attachment to GatewayClass could work. I've added some notes to describe this.
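For illustration, such a pair could look roughly like this (the group, version, and fields are hypothetical; only the `targetRef` shape follows this GEP):

```yaml
# Namespace-scoped policy attaching to a namespaced resource.
apiVersion: networking.acme.io/v1alpha1
kind: TimeoutPolicy
metadata:
  name: store-timeouts
  namespace: store
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: store-route
  default:
    requestTimeoutSeconds: 10
---
# Cluster-scoped variant attaching to the cluster-scoped GatewayClass.
apiVersion: networking.acme.io/v1alpha1
kind: TimeoutClusterPolicy
metadata:
  name: class-timeouts
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: GatewayClass
    name: acme-lb
  default:
    requestTimeoutSeconds: 30
```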
policy references to each Resource. This works very naturally for Ingress use cases where all requests follow a path through Gateways, Routes, and Backends. Adding policy attachment at each level enables different roles to define defaults and allow overrides at different levels.
Based on @howardjohn's comment: https://github.com/kubernetes-sigs/gateway-api/pull/715/files#r668252011
Did we consider having a structure of this sort:
```yaml
spec:
  overridableConfiguration:
  enforcedConfiguration:
```
The word `default` in a precedence order seems confusing. The above proposal is verbose, but it reduces the chances of misunderstanding.
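To make the shape concrete, a hypothetical retry policy using that structure might look like this (everything other than the two proposed sections is made up for illustration):

```yaml
apiVersion: networking.acme.io/v1alpha1
kind: RetryPolicy
metadata:
  name: example
  namespace: store
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: store-route
  # Values that lower levels of the hierarchy may override.
  overridableConfiguration:
    maxRetries: 5
  # Values that always win, regardless of what lower levels set.
  enforcedConfiguration:
    retryOn: "5xx"
```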
This is out of scope for this GEP (as noted already in the doc) so please ignore it.
I've tried to clear up the difference between `default` and `override` in terms of precedence in my latest commit. Hopefully that helps here. I'm also open to a different name than `default` here, but would prefer somewhat shorter names if possible.
site-src/geps/gep-713.md (Outdated)
## Out of scope

* Define all potential policies that may be attached to resources
* Design the full structure and configuration of policies
@robscott, have we considered having a `GenericPolicy.gateway.networking.k8s.io` resource which defines the API and allows for an arbitrary `map[string]string`?

Pros:

- For some tiny features, implementers can use that
- Implementers have concrete knowledge of what it is that we are recommending

Cons:

- If this results in widespread adoption, having type-less CRDs will result in more pain elsewhere in the community
Yeah this is a really interesting idea. I think it's significantly better than annotations, but I'm concerned that it may pull away some implementations that would otherwise create CRDs (which I think results in the ideal UX). After chatting a bit more about this with you I've added this as an alternative in the GEP. It's really more of a potential variation but I wasn't sure where to put that.
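A rough sketch of what such a resource could look like (hypothetical; no `GenericPolicy` resource exists in the API today, and the `settings` field name is made up):

```yaml
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: GenericPolicy  # hypothetical
metadata:
  name: tiny-tuning
  namespace: store
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: store-route
  # Arbitrary, implementation-defined key/value settings (map[string]string).
  settings:
    acme.io/idle-timeout: "30s"
    acme.io/max-retries: "5"
```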
```go
//
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=253
Name string `json:"name"`
```
One thing weird about namespace references:
```yaml
kind: ServicePolicy
metadata:
  namespace: foo
spec:
  targetRef:
    kind: Namespace
    name: bar
```
This is obviously not valid. We could make `targetRef` optional and have it apply to its own Namespace by default if omitted, as one way to give users an easy way to avoid this footgun.

One thing that is really nice about this is that you can just have a single YAML file and apply it to N namespaces. A real-world example that is common in Istio:
```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
spec:
  mtls:
    mode: STRICT
```
Then you can easily enable mTLS in a namespace with `kubectl apply -f strict.yaml -n my-namespace`.
Although that would be a nice shortcut, I'm a bit concerned about how easy it would make it to accidentally apply policy too broadly. This shortcut would be a great way to cut down yaml/templating, but I think making accidental policy applications less likely should be prioritized. I think many would expect a policy with an empty targetRef to apply to nothing, not to the entire namespace it was in. If anyone made this mistake, it could have a pretty large blast radius.
I think that having the default apply to nothing, and having the user need to take definite action to have the Policy apply to larger objects is much, much safer. Obviously the tradeoff is that you'll need more and more specific Policy objects.
My personal experience with k8s RBAC is that this namespace-in-spec requirement makes things really annoying. Almost all Kubernetes types can elide the namespace, which makes re-usable configs, helm charts, and docs much easier.

I see the point about the policy broadness, and would be fine with making an empty targetRef apply to nothing (or probably be invalid if it does nothing?), but I really want to avoid having the namespace in the spec if possible. If you go through the top helm charts, almost half of them break if you do `helm template CHART | kubectl apply -f - -n my-namespace` because of RBAC namespace references.
One alternative is an alias for "same namespace as policy is in". Like:

```yaml
kind: ServicePolicy
metadata:
  namespace: foo
spec:
  targetRef:
    kind: Namespace
    name: .
```

(or some better syntax)
Although I'm a bit hesitant to loosen validation here, I think it might make sense at some point to make name optional when the kind is Namespace. Unfortunately that's impossible to enforce with CRD validation, and since each policy is going to be a different resource, it would be difficult to validate consistently with a webhook.
I'm not sure there's any precedent in k8s for interpreting "." as same namespace, so that may not hold up to Kubernetes API review. It is of course possible for individual implementations to support this as an extended concept though. Since we're trying to avoid breaking changes going post-v1alpha2, I'm biased towards stricter validation up front to retain the option to loosen it in the future if needed.
Given that this is mostly a UX issue, I don't think we absolutely need to solve this now. In the interim, this is absolutely something that can be solved with templating, whether it's Helm, Kustomize, or something else.
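For example, with Helm the namespace can be injected into both places from the release (a sketch; `ServicePolicy` and its fields are the hypothetical policy from above, only `.Release.Namespace` is standard Helm):

```yaml
# templates/service-policy.yaml in a hypothetical chart
apiVersion: networking.acme.io/v1alpha1
kind: ServicePolicy
metadata:
  name: default
  namespace: {{ .Release.Namespace }}
spec:
  targetRef:
    kind: Namespace
    name: {{ .Release.Namespace }}
```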
I mostly agree with everything you said. I would add, as a real world example of why this matters: istio/istio#34035.
Istio fails to properly document security best practices due to this exact same thing, ultimately leading to a "High" severity finding in the NCC audit: https://istio.io/latest/blog/2021/ncc-security-assessment/.
What about making `name` optional, to allow:

```yaml
kind: ServicePolicy
metadata:
  namespace: foo
spec:
  targetRef:
    kind: Namespace
```

to refer to the same namespace?
That very quickly opens the door to a way to apply policy to all resources of a certain kind within the local namespace. Although that could also be helpful, it adds additional complexity in terms of hierarchy and conflict resolution. I'm currently biased towards explicit config even if it requires a bit more templating.
With v1alpha2 and the goal to not make any breaking changes after this release, any optional field is optional forever; we have no path to make it required. On the other hand, we can always go from required -> optional in the future if we're completely sure that's what we'd want.
I think this proposal is excellent, and a great distillation of lots of conversation.
However, I think that we need to make sure that the document is accessible for people without much context, and we need to add a bit more explanation early on to get there. Otherwise, we're going to spend a lot of time giving context to API reviewers that already lack bandwidth.
Although these concerns are not unsolvable, they lead to the conclusion that a Kubectl plugin should be our primary approach to providing visibility here, with a possibility of adding policies to status at a later point.
I agree, but I would really like to end up with some sort of status eventually. I think it makes sense to use the API for a bit before deciding on what it looks like, particularly if there's a kubectl plugin to help.
Yep, completely agree on this one.
### Disadvantages

* May be difficult to understand which policies apply to a request

## Removing BackendPolicy
Conceivably, BackendPolicy could end up as the example Policy object, provided under the project's API group?
Probably outside the scope of this design though.
Yeah I agree with that. I think we could/should try to find enough portable concepts that could fit in a shared policy resource, but don't think that BackendPolicy provides any value as it exists today. I think it was mostly meant to establish a pattern for policy attachment, but the actual config included is so limited that I don't think it would be broadly used.
/hold
As an implementer this seems powerful, extensible, and practical. I like it a lot. I have a couple of small comments, and then I may have some confusion about how policy hierarchy is intended to work which I've put up for discussion. Thank you 🙇
A > B > C. When attaching the concept of defaults and overrides to that, the hierarchy would be expanded to this:

A override > B override > C override > C default > B default > A default.
Am I missing anything? If A > B > C, why is it C default > B default > A default?
Yeah I need to explain this better. The rationale is that overrides are usually top down whereas defaults are usually bottom up. I.e., as a Gateway admin I want to have final say on requirements in my cluster, but defaults should be overridable by app owners or others down the stack. That means that an override attached to the top of the stack should have the highest precedence and a default attached at the same location should have the lowest. I don't know, does that make any sense?
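A small hypothetical example of that flow (group, version, and fields are made up): a default attached at the Gateway level loses to a default attached closer to the workload, while an override at the Gateway level would beat both.

```yaml
# Default attached high in the hierarchy (Gateway): lowest precedence.
apiVersion: networking.acme.io/v1alpha1
kind: TimeoutPolicy
metadata:
  name: gateway-default
  namespace: infra
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: shared-gw
  default:
    requestTimeoutSeconds: 30
---
# Default attached lower (HTTPRoute): wins over the Gateway default, so 10s applies.
# Had the Gateway policy used `override` instead, 30s would apply regardless.
apiVersion: networking.acme.io/v1alpha1
kind: TimeoutPolicy
metadata:
  name: route-default
  namespace: store
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: store-route
  default:
    requestTimeoutSeconds: 10
```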
Added what I hope is a clearer explanation here, let me know if I missed anything or if anything doesn't make sense.
The clarification is fine. Thanks.
Just curious: why not follow both overrides and defaults top down to avoid the footgun? I mean, it would be easier.
Yeah I do think there are legitimate arguments for both. It would be simpler if both forms of hierarchy flowed in the same direction. With that said, my thinking here is that most users will want more specific defaults to be given precedence. In most cases, a Gateway will refer to multiple Routes which will refer to multiple Services. In that case, I think it makes sense for defaults applied to the Gateway to be overridden by defaults attached to a specific Route.
Admittedly it is possible for this to work the other way with multiple Gateways referring to the same Route, but I think that's going to be much less common. I do want to get some more feedback here though.
@hbagdi @jpeach @youngnick @howardjohn What do you think makes the most sense for hierarchy here?
My 2 cents would be that the proposal makes the most sense to me as it stands. It happens to be exactly the same model I came up with independently for the same problem a few years ago in Consul based on our user requests:
- Infrastructure operators want to provide "easy defaults" across a whole environment while empowering service owners to customize where needed without needing central ops intervention (hence defaults gain precedence the more specific their attachment).
- Infrastructure operators still need to retain ultimate control to enforce regulatory compliance issues, or to mitigate security threats without needing to coordinate with many app owners, hence overrides take precedence the more general their attachment.
Thanks to everyone for the great feedback on this! I just pushed a pretty large update to this that I think addresses all the feedback so far. Some highlights:
This GEP LGTM. One minor issue inline.
Thanks for your hard work on this!
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: robscott, shaneutt

The full list of commands accepted by this bot can be found here. The pull request process is described here.
LGTM aside from one concern about status.
I think we either need to define one Condition, or leave status explicitly undefined.
I lean towards defining an `Attached` Condition for now, and worrying about what happens when multiple implementations use the same Policy later (if that indeed ever becomes a problem).

I also suggested a few make-the-language-shouty-RFC changes.
site-src/geps/gep-713.md (Outdated)
When a conflict occurs, implementations MUST set a "Conflict" condition in the `status.conditions` list for all affected policy resources.
We haven't really done a status design for Policy objects yet.
We really should be providing some basic Conditions and Reasons as part of this design that are expected for all policies.
I suggest:
Condition: `type: Attached`

- positive polarity (true is good). Indicates that the policy successfully attached to the `targetRef`. If that fails, we provide Reasons that summarise common errors (of which `Conflict` would be one, as would things like `NotFound` or similar). Exact Reasons to be added should be part of the implementation, but this main flow is important to get right in the design.

(I chose "Attached" although we've used "Admitted" in other places for a similar Condition; I think that it's clearer in this case because that's what the Policy is doing.)
That means there should be some documentation in the Status section, or earlier in the structs, that says something like:
#### Policy Conditions
Implementations implementing policy MUST ensure an `Attached` condition is
present in the `status.conditions` stanza on any objects that reference an object
under their control (that is, on Policy objects that reference Gateway API objects
that fall under a GatewayClass they implement.)
The `Attached` Condition MUST be set to `status: true` if and only if the policy object
is successfully attached to the referenced object with no errors. If there are errors,
the `Attached` Condition MUST be set to `status: false`, with a Reason indicating the
reason why the policy could not attach. Sample Reasons are provided in the
Gateway API structs.
When an object falls out of the implementation's scope, it MUST NOT remove the
Condition itself; that will be handled by the next thing to bring the policy into scope.
Then, this section can be amended to:
Original:

> When a conflict occurs, implementations MUST set a "Conflict" condition in the `status.conditions` list for all affected policy resources.

Suggested:

> When a conflict occurs, implementations MUST set the "Admitted" condition in the `status.conditions` list for all affected policy resources to `status: false`, with a Reason of `Conflict`, and a message that indicates what other policies are conflicting.
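To make the suggested flow concrete, a policy that attached successfully might then carry status along these lines (a sketch; only the `Attached` type comes from the suggestion above, the Reason string and other values are illustrative):

```yaml
status:
  conditions:
  - type: Attached
    status: "True"
    reason: Attached          # illustrative; exact Reasons to be defined
    message: Policy attached to HTTPRoute "store-route"
    observedGeneration: 2
    lastTransitionTime: "2021-10-01T00:00:00Z"
```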
This design does, however, have the problems mentioned in the Status section: if a Policy object is relevant to more than one implementation, then they will fight over the `Admitted` condition.
After reading this over again, I think we should either do what I have here or leave the status flow completely alone for now.
Thanks for catching this! You're completely right that we end up with all the same status problems we had with BackendPolicy. One solution might be to have a controller name in the condition struct, but even that's insufficient as a given controller may be implementing multiple Gateways or even classes that target the same backend. I think this is solvable, but not something we need to solve right now.
I've pulled out any requirements to set conditions in status, and have left the only reference to status as part of the skeleton resource which just includes a generic list of conditions in status. If that makes sense, I'll plan on creating a tracking issue for this discussion if/when this GEP merges.
Thanks for all the great feedback on this GEP! Based on the conversation at today's community meeting, it sounds like we're really close to consensus on this one, and no one has any major hesitations with this merging. Given that, I'm going to remove the hold on this so any LGTM will result in this merging in.

/hold cancel
I should also note the major changes from the last commit I pushed:
With those latest changes, I think this is all good.

/lgtm
Hi @robscott, thanks for your time!
Hey @Amila-Rukshan, thanks for asking! I think we could probably justify adding this to the existing GEP. Because policy attachment is already complex, I think the most important thing will be to show that adding support for targeting sections of resources is actually widely useful. It may be helpful to create an issue describing this first to try to gauge the level of interest in this.
@Amila-Rukshan started a similar discussion here if you want to chime in as well: #1489
What type of PR is this?
/kind gep
What this PR does / why we need it:
This adds a GEP for policy attachment (#713).
Does this PR introduce a user-facing change?:
Deploy Preview: https://deploy-preview-715--kubernetes-sigs-gateway-api.netlify.app/geps/gep-713/