Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Route handling of invalid BackendRefs needs clarification #1112

Closed
1 of 11 tasks
mikemorris opened this issue Apr 13, 2022 · 13 comments
Closed
1 of 11 tasks

Route handling of invalid BackendRefs needs clarification #1112

mikemorris opened this issue Apr 13, 2022 · 13 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@mikemorris
Copy link
Contributor

mikemorris commented Apr 13, 2022

What would you like to be added:

Add documentation, possibly a new page in the Reference section of the website, and additional RouteConditionReason and ListenerConditionReason types as needed to clarify intended handling of invalid BackendRefs configurations. This is a bit more complicated than just #1109, as the implications of initially applying or resolving changes to a Route's BackendRefs or a ReferencePolicy may require triggering actions or settings statuses across several objects.

Why this is needed:

There are a few distinct cases by which a BackendRef could be or become invalid, and I think it may be helpful to define and use more specific terms than "invalid" in the spec, as well as centralize and specify the intended behavior for each case more explicitly.

One specific difficulty is the lack of specificity around when a Route Accepted condition should be set, and whether an Accepted { status: false, reason: Invalid } condition should be set in any of these cases (and/or reflected in a ListenerConditionType status) given the lack of granularity in indicating status per BackendRef versus the status of the Route or Listener as a whole.

It's also slightly unclear if attachedRoutes on ListenerStatus is intended to communicate the same thing as a route having an Accepted { status: true, reason: Accepted } route parent status condition referencing that listener because of the difference between the words used ("attached" vs "accepted", refs #1111).

"not accepted"

This case is when the Route has not been accepted by a Gateway Listener. The HTTPRouteInvalidCrossNamespace conformance test illustrates one example where this is an intended outcome, but it's not clear how this status should be communicated. The test suggests that no RouteParentStatus should be set on the Route because, having been rejected, it does not have an actual parent, only an intended parent - setting an Accepted { status: False } condition (what RouteConditionReason?) could be an alternative way to communicate this status though.

RouteParentStatus conditions

  • None?
  • [Alternatively?] Accepted { status: False }

Listener handling

  • A Route that is in a namespace not allowed by a Gateway Listener's allowedRoutes should be rejected by the GatewayController.
  • This is similar to the InvalidRouteKinds case, but I'm guessing that no similar status should be set due to cross-namespace security concerns because a Route in another namespace shouldn't be able to have any impact on a Gateway to which it is not allowed to bind.

ListenerStatus

  • attachedRoutes should not count this Route
  • Conditions:
    • Ready { status: True, reason: Ready }

TODO:

  • Update the HTTPRouteInvalidCrossNamespace conformance test to check for an Accepted {status: False} condition. This feels reasonable enough because even though the route was not accepted by the gateway (and therefore shouldn't be able to affect the gateway in any way), there should be a definitive controller which evaluated and rejected the route which has sufficient privileges to manage a route status appropriately.
  • Add an InvalidRouteKind RouteConditionReason and update the HTTPRouteDisallowedKind conformance test for communicating the case where a route is rejected because its kind is not specified in the allowedRoutes.kinds field of the listener it is targeting.
  • Add an InvalidRouteNamespace RouteConditionReason and conformance test for communicating the cases (Selector, Same) where a route is rejected because it lives in a namespace that is not allowed by the allowedRoutes.namespaces field of the listener it is targeting.
  • Add an InvalidRouteHostname RouteConditionReason and update the HTTPRouteListenerHostnameMatching conformance test or add a new one for communicating the case where a route is rejected because it does not have at least one intersecting hostname with the listener it is targeting.

"unpermitted"

This case is when a Route has a BackendRef to an object in another namespace, where the object in the other namespace does not have a ReferencePolicy explicitly allowing the reference. This could reasonably be expected to be checked when the Route is created, and should be clarified to note that it should additionally be checked whenever the Route is modified or a ReferencePolicy with a to field resolving to a BackendRef of the route is deleted, modified, or created.

This should be expected to be dynamic - revoking a ReferencePolicy should stop routing traffic to a BackendRef.

RouteParentStatus conditions

  • Accepted { status: True }
  • ResolvedRefs { status: False, reason: RefNotPermitted }
  • Ready { status: False, reason: Invalid } [Not Implemented]

Listener handling

  • As described in HTTPBackendRef

    If there is a cross-namespace reference to an existing object that is not covered by a ReferencePolicy, the controller must ensure the “ResolvedRefs” condition on the Route is set to status: False, with the “RefNotPermitted” reason and not configure this backend in the underlying implementation.

  • This wording is likewise a bit confusing in that it only describes how each specific BackendRef should be handled, and not whether this should impact if the entire Route is determined to be "invalid". Notably, this case omits any mention of "dropping" anything from a Gateway as the "not found" case describes.

  • Should likely implement the same behavior described above in the "not found" section for sending HTTP 503 responses for invalid unpermitted BackendRefs.

ListenerStatus

  • attachedRoutes should count this Route as long as it has RouteConditionType Accepted { status: True }
  • Conditions:
    • ResolvedRefs { status: False, reason: RefNotPermitted }
    • UNRESOLVED: Should this set Ready { status: False, reason: Invalid } if any single BackendRef on an accepted route is not permitted?

TODO:

  • UNRESOLVED: (needs consensus before implementation) Update the HTTPRouteInvalidCrossNamespaceBackendRef conformance test to assert that a route with zero permitted backends:
    • sets either an Accepted {status: True} or Accepted {status: False} route condition
    • is either counted (if accepted) or not counted (if rejected) in listener attachedRoutes
    • sets a ResolvedRefs {status: False, reason: ReasonNotPermitted} listener condition if route issues of accepted routes should "bubble up" to listeners
  • conformance: add test for partial acceptance of a route with one allowed and one unpermitted backend reference #1143

"not found"

This case is when the Route has at least one BackendRef that does not exist or cannot be resolved. This could reasonably be expected to be checked when the Route is created, but unclear if/when state changes could cause this to need to be updated.

UNRESOLVED:

  • Does this case include zero service instances ready?
  • It's unclear if this case should prevent the Route from being accepted by a Gateway when initially applied, even if all BackendRefs on a Route are not found. I could see potential value in a use case for being able to configure Routes before deploying a new service.

RouteParentStatus conditions

  • ResolvedRefs { status: False, reason: InvalidBackendRef }

Listener handling

  • As described in HTTPBackendRef:

    If the referent cannot be found, this HTTPBackendRef is invalid and must be dropped from the Gateway. The controller must ensure the “ResolvedRefs” condition on the Route is set to status: False and not configure this backend in the underlying implementation.

  • The part that's a bit confusing with this wording is that it only describes how each specific BackendRef should be handled, and not whether this should impact if the entire Route is determined to be "invalid". The term "dropped" seems to imply active management of some object directly related to a Gateway (as opposed to the underlying implementation), which seems to be overly broad behavior if inferred to mean removing an entire Route from a Listener if one BackendRef became invalid.

  • The backendRefs field on HTTPRouteRule additionally states:

    If unspecified or invalid (refers to a non-existent resource or a Service with no endpoints), the rule performs no forwarding. If there are also no filters specified that would result in a response being sent, a HTTP 503 status code is returned. 503 responses must be sent so that the overall weight is respected; if an invalid backend is requested to have 80% of requests, then 80% of requests must get a 503 instead.

    (cobbling this information together from disjointed sources is an example of the difficulty described in docs: API specification information architecture obscures important content #1031)

ListenerStatus

  • attachedRoutes should count this Route as long as it has RouteConditionType Accepted { status: True }
  • Conditions
    • ResolvedRefs { status: False, reason: InvalidBackendRef }
      • Would require adding InvalidBackendRef ListenerConditionReason
      • UNRESOLVED: Should a route with this issue should affect listener status?
    • UNRESOLVED: Should this set Ready { status: False, reason: Invalid } if any single BackendRef on an accepted route is not found?

TODO:

  • Add InvalidBackendRef RouteConditionReason, with a description similar to the InvalidCertificateRef ListenerConditionReason
  • Clarify lifecycle watch expectations of when this status would be updated (service created/deleted, zero endpoints ready?)

"invalid"

This case is when the Route is syntactically or semantically invalid.

NOTE:

  • Ideally, the admission webhook should prevent this Route from ever being applied to avoid accidentally dropping traffic if modifying an existing Route.

RouteParentStatus conditions

  • Accepted { status: False, reason: Invalid | InvalidParameters }

Listener handling

  • A Route that is syntactically or semantically invalid should be rejected by the GatewayController or admission webhook.

ListenerStatus

  • attachedRoutes should not count this Route
  • Conditions:
    • Ready { status: True, reason: Ready }

"not ready yet"

This case is when the Route is syntactically or semantically valid, all BackendRefs exist, can be resolved, and are permitted by a ReferencePolicy as applicable, and the route has been accepted by the GatewayController, but the underlying implementation is not fully configured or routing traffic yet. Expected to be a transient state.

NOTE:

  • It can be quite difficult for implementations to know when a route has transitioned from this state to being fully "ready" and actively able to route traffic.

RouteParentStatus conditions

  • Accepted { status: True }
  • Ready { status: False, reason: Pending } [Not Implemented]

Listener handling

  • Not yet defined.

ListenerStatus

  • attachedRoutes should count this Route
  • Conditions:
    • Ready { status: True, reason: Ready }
@andrewstucki
Copy link

So, just to chime in, overall the part of the spec that gives me some of the biggest headaches is how route validation is reflected in statuses. You pointed out a number of areas where something about the route status, as specified in the spec, seems to need to be propagated onto the Gateway's ListenerStatus.

For example, the current usage for ResolvedRefs as a ListenerConditionReason that states:

This reason is used with the “ResolvedRefs” condition when one of the Listener’s Routes has a BackendRef to an object in another namespace, where the object in the other namespace does not have a ReferencePolicy explicitly allowing the reference.

But as you point out other areas of the spec (i.e. under HTTPBackendRef), refer to setting the route status for ResolvedRefs. The latter (i.e. setting the status on the route), makes sense to me, the former does not, because what doesn't have resolved refs? The route, not the listener.

The notion overall that the ListenerStatus itself should ever reflect the status of underlying routes that it has successfully bound doesn't really make sense to me. Why? Because the Listener itself isn't bad, the Route is, and its status should be updated accordingly.

The only time I can see that propagating something to a ListenerStatus based on a Route's binding makes sense is if somehow the underlying behavior of the Listener can't fully be determined and thus the entire thing is in an invalid state. For example, in our current implementation of TCPRoute binding in the Consul API Gateway we consider a TCP-based listener invalid if it has more than a single route bound to it due to our inability to currently load-balance between multiple TCP backends.

@robscott
Copy link
Member

This is great content @mikemorris and @andrewstucki!

The notion overall that the ListenerStatus itself should ever reflect the status of underlying routes that it has successfully bound doesn't really make sense to me. Why? Because the Listener itself isn't bad, the Route is, and its status should be updated accordingly.

I think this is an artifact of the v1alpha1 API where listeners selected Routes. While I think it could potentially be helpful for a Gateway owner to understand that something is attempting to attach to it but can't, it seems like the primary way to communicate that failure in status should be on the route.

I like a lot of the suggestion above for addition to status. Historically we have not required GEPs for new reasons being added to status, I don't think we have any precedent for independent changes that introduced new conditions, not sure what to do there. Maybe a PR that introduces the reasons suggested above would be a good starting point, and we could separately discuss the idea of a "Ready" and/or "Programmed" condition?

@youngnick
Copy link
Contributor

This is an outstanding summary, thanks @mikemorris!

Reading through this, there are a few places where I have assumptions that aren't written down. I may have spoke about these before and assumed they're in the spec, happy to talk about if they're valid assumptions.

I'll go through the questions in the order I read them.

"not accepted"

In general, I think we've tried to have the default status behavior be that if things aren't a valid reference, the referred object is not updated. So, we don't update the ListenerStatus with the names of Routes that don't attach.

However, I agree that Accepted with status: false should be set when a Route has specifically requested to be attached to a Gateway and cannot for some reason. This allows a Route user to understand if their attachment request has succeeded without having to look at the Gateway status.

I think that, as Rob said, a lot of the weirdness here is us not coming back and updating this after the switch to route-initiated attachment.

"not found"

I think that having zero backends inside otherwise-valid backendRefs should not mean that the backendRef is not found. The backendRef is there, but it's empty. If we want to reflect that in the Route status, it should come under Ready or similar Condition. I think it's very undesirable for the Attachment to flap based on the presence or absence of backends.

The assumption I had here, that I think is not documented, is that the rules for backendRefs are:

  • if there's at least one valid backendRef, then the route as a whole is accepted (with caveat Conditions describing the backendRefs that aren't valid). 503s must be returned for the invalid backendRefs.
  • if there are no valid backendRefs, then the route as a whole must not be accepted.

Note that these two rules also require the "zero endpoints doesn't make things unattached" rule.

However, there's some interaction with the backendRefs field docs in that set of rules - should that empty config produce 503s? I'm inclined to say no, personally.

Maybe a better way to summarize this is "Routes that generate some config in the underlying data plane must attach, and routes that generate no config in the underlying data plane must not."?

That would cover the 503 issue, because it's talking about routes that have at least some config.

If we agree here, this absolutely needs better documentation, since it's plainly not present now.

"not permitted"

I agree that this should behave the same as the "not found" case in respect to partial or no validity.

This does mean that a Route that loses its ReferencePolicy will flip to not attached, but that seems desirable to me.

"invalid" and "not ready yet"

I agree with all of this.

I think the next step here is for everyone to read what I think should happen and see if we agree. If so, we should move to a PR.

@mikemorris
Copy link
Contributor Author

mikemorris commented May 4, 2022

So, we don't update the ListenerStatus with the names of Routes that don't attach.

Accepted with status: false should be set when a Route has specifically requested to be attached to a Gateway and cannot for some reason. This allows a Route user to understand if their attachment request has succeeded without having to look at the Gateway status.

👍 , this makes sense to me too. I added a few specific TODO items to the "not found" section for new RouteConditionReasons to help communicate these rejection cases.

We should additionally still clarify if accepted routes should affect listener status, but this at least narrows the scope a bit.

I think that having zero backends inside otherwise-valid backendRefs should not mean that the backendRef is not found.

Alright - I think this makes sense but will likely require a docs update to remove the mention of or a Service with no endpoints as an "invalid" case in each xRouteRule. Definitely agreed that flapping service health should not affect the accepted/attached state of a route.

if there's at least one valid backendRef, then the route as a whole is accepted (with caveat Conditions describing the backendRefs that aren't valid). 503s must be returned for the invalid backendRefs.

👍 LGTM

if there are no valid backendRefs, then the route as a whole must not be accepted.

I'd like to hear from additional implementations on this - I could see a plausible use case for an operator or team wanting to configure routing rules before actually deploying a new or updated backend service.

This does mean that a Route that loses its ReferencePolicy will flip to not attached, but that seems desirable to me.

@youngnick What I'd like to avoid (expanded on in #1143) is the case where revoking a ReferencePolicy permitting one backend reference on a route with several would have a large blast radius, potentially dropping a large amount of permitted traffic instead of only rejecting the newly-unpermitted traffic. I'm uncertain if revoking the ReferencePolicy for the only backend reference on a route should be handled differently and reject/detach it (similar use case to the "not found" zero valid backend refs case).

skriss added a commit to skriss/gateway-api that referenced this issue May 20, 2022
- fixes path prefix match YAML
- removes check for HTTPRoute to be Accepted pending kubernetes-sigs#1112

Signed-off-by: Steve Kriss <krisss@vmware.com>
@youngnick
Copy link
Contributor

I'd like to hear from additional implementations on this - I could see a plausible use case for an operator or team wanting to configure routing rules before actually deploying a new or updated backend service.

I still feel pretty strongly that the guideline of "no config in the data plane means not accepted" is the easiest one to both understand and implement. But interested to talk more about it.

By that logic, I agree that the test in #1143 makes sense, and also agree that revoking the ReferencePolicy for a BackendRef that is the only target in a Route should mean that the Route is no longer Accepted. That is, if revoking a ReferencePolicy means there's no config to add, the Route is not Accepted.

@caboteria
Copy link
Contributor

I'm trying to figure out how exactly the statuses of backendRefs influence the status of parentRefs. Let's see if I understand:

If I were a user I would find it confusing if a parentRef reported RefNotPermitted when the actual parentRef was OK but a backendRef wasn't. The HTTPRouteInvalidReferenceGrant conformance test is an example of this, I think.

I think that the implication of this would be that a Route shouldn't attempt to connect to a parent until it has at least one valid backendRef. Maybe if a Route had no valid backendRefs it could report the parentRef status as something like RefNotAttempted?

@youngnick
Copy link
Contributor

I think that the implication of this would be that a Route shouldn't attempt to connect to a parent until it has at least one valid backendRef. Maybe if a Route had no valid backendRefs it could report the parentRef status as something like RefNotAttempted?

I like this idea.

@mikemorris
Copy link
Contributor Author

mikemorris commented Jun 28, 2022

If I were a user I would find it confusing if a parentRef reported RefNotPermitted when the actual parentRef was OK but a backendRef wasn't. The HTTPRouteInvalidReferenceGrant conformance test is an example of this, I think.

I think that isn't quite what this is intended to express. RouteParentStatus is intended to describe the status of the route, from the route's perspective - the relation to ParentRef is mostly to contextualize conditions when a route targets multiple parents, such as setting Accepted: { status: False, reason: NotAllowedByListeners } if a route failed to attach to one parent due to a restrictive AllowedRoutes Gateway listener config (but may have been accepted by a different parent). Bundling status conditions (such as ResolvedRefs) that could be common/affect all parent references into this same field is a bit confusing/redundant, but hopefully this helps explain the structure.

Whether to bubble this status up to a status field on the parent itself is a separate question, as described in #1112 (comment) (where it probably does not make sense to populate ResolvedRefs on ListenerStatus from the status of a Route backendRef, instead only when something like a certificateRef on a Listener GatewayTLSConfig fails)

Route shouldn't attempt to connect to a parent until it has at least one valid backendRef

I believe this would be a separate logical change, and #1211 addresses some of the implications (and implementation challenges) of this in more detail.

[Oops, didn't mean to close.]

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 26, 2022
@mikemorris
Copy link
Contributor Author

mikemorris commented Oct 12, 2022

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 12, 2022
@youngnick youngnick self-assigned this Oct 17, 2022
@youngnick
Copy link
Contributor

I've reviewed those issues and PRs @mikemorris was tracking, thanks Mike. I think we've largely got this handled, and that invalid backendRefs are pretty well clarified now. The last remaining piece is #1143 - which I think should absolutely be updated, but we should also add or update tests to ensure that partial validity is checked in other cases too.

@youngnick
Copy link
Contributor

With the closing of #1143, I think we can call this one done (yay!)

/close

@k8s-ci-robot
Copy link
Contributor

@youngnick: Closing this issue.

In response to this:

With the closing of #1143, I think we can call this one done (yay!)

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

7 participants