Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GEP: Standard Mechanism to Merge Multiple Gateways #1713

Open
dprotaso opened this issue Feb 8, 2023 · 24 comments · May be fixed by #3213
Open

GEP: Standard Mechanism to Merge Multiple Gateways #1713

dprotaso opened this issue Feb 8, 2023 · 24 comments · May be fixed by #3213
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@dprotaso
Copy link
Contributor

dprotaso commented Feb 8, 2023

What would you like to be added:

Support for merging Gateways should have standard & documented mechanic.

Why this is needed:

Prior Discussions: #1248, #1246

Knative generates on demand per-service certificates using HTTP-01 challenges. There can be O(1000) Knative Services in the cluster which means we have O(1000) distinct certificates. The Gateway Resource is a contention point since is the only place to attach listeners to gateways with certificates.

The spec currently has language to indicate implemenations MAY merge Gateways resources but the mechanic isn't defined.

// determines that the Listeners in the group are "compatible". An
// implementation MAY also group together and collapse compatible
// Listeners belonging to different Gateways.

Secondly, given similar use-cases/needs for control planes to manage listeners it might make sense that merging Gateways is part of extended conformance.

@dprotaso dprotaso added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 8, 2023
@shaneutt shaneutt added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Feb 8, 2023
@arkodg
Copy link
Contributor

arkodg commented Feb 8, 2023

+1 to this proposal.

  • This has been asked by users of Envoy Gateway to make their runtime deployments more efficient (currently a Gateway resource maps to a unique Data Plane (Envoy Proxy) deployment)

  • The spec says compatible listeners can be collapsed, but since Gateways can be applied dynamically
    it is bound to cause traffic disruption in the data plane if the implementation collapses or separates listener implementations in runtime.

  • If there is way to add a knob which sets constraints for the L3/L4 fields of the Listener such as ensuring a port must be unique across all Listeners within all Gateways within a GatewayClass, statically collapsing listeners in implementations should be doable.
    Below is the current spec

    // Each listener in a Gateway must have a unique combination of Hostname,

@howardjohn
Copy link
Contributor

FWIW how we do this in Istio is https://istio.io/latest/docs/tasks/traffic-management/ingress/gateway-api/#manual-deployment. Basically the same way to apply 1 Gateway to a manually deployed Deployment/Service instead of automatically deployed one can be used to apply N Gateways and merge them.

If they want automated deployment and merging, they would make one "root" Gateway and attach all others. Actually that is another use case for #1596

@arkodg
Copy link
Contributor

arkodg commented Feb 8, 2023

@howardjohn conformance tests dont pass when the implementation merges the listeners, envoyproxy/gateway#349 has more context

@howardjohn
Copy link
Contributor

I don't think something implicitly deciding to collapse or not is a great idea which is the root cause of that. In the Istio way users have to explicitly opt into merging, which the conformance tests don't do, so we don't have an issue.

@arkodg
Copy link
Contributor

arkodg commented Feb 8, 2023

I don't think something implicitly deciding to collapse or not is a great idea which is the root cause of that. In the Istio way users have to explicitly opt into merging, which the conformance tests don't do, so we don't have an issue.

yeah, Im hoping the intent to merge is defined by Gateway API, else implementations will add their own

@shaneutt shaneutt added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Feb 15, 2023
@dprotaso
Copy link
Contributor Author

@shaneutt curious what does the label priority/awaiting-more-evidence mean and what's the lifecycle of issues in this sub project?

@shaneutt shaneutt added this to the v1.0.0 milestone Feb 21, 2023
@shaneutt shaneutt added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Feb 21, 2023
@shaneutt
Copy link
Member

@dprotaso we're in the midst of trying to better organize things based on priority. Given that 3 people are interested in this one, priority/awaiting-more-evidence was not really the right fit so I've changed it to priority/important-longterm. Please note that while these labels can be helpful for us to organize work and releases, they aren't necessarily indicative of the actual order in which things will be done. It's reasonable for someone with keen interest to champion an issue like this and take it straight into in-progress if they feel strongly about it.

@dprotaso
Copy link
Contributor Author

@arkodg

The spec says compatible listeners can be collapsed, but since Gateways can be applied dynamically
it is bound to cause traffic disruption in the data plane if the implementation collapses or separates listener implementations in runtime.

Knative uses multiple Kubernetes Services to resolve port conflicts.

Our custom Istio installation has a single k8s deployment serving public traffic and internal traffic. This is because we create one K8s Service with type=LoadBalancer and the other being type=ClusterIP pointing to the same deployment but targetting different ports.

Thus if the Gateway resource had a scope property to define whether gateways were public (to the internet) and local to the cluster then I could forsee merging of

apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: prod-internal
spec:
  gatewayClassName: acme-lb
  group: knative
  scope: Internal
  listeners:  
  - protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: prod-web
spec:
  gatewayClassName: acme-lb
  group: knative
  scope: Public
  listeners:  
  - protocol: HTTP
    port: 80

to be done by creating the following K8s Services

apiVersion: v1
kind: Service
metadata:
   name: prod-internal
spec:
  type: ClusterIP
  ports:
    - name: http2
      port: 80
      targetPort: 8081 # Dynamically allocated port
---
apiVersion: v1
kind: Service
metadata:
   name: prod-web
spec:
  type: LoadBalancer
  ports:
    - name: http2
      port: 80
      targetPort: 80 # Dynamically allocated port

Each gateway could return a unique IP in the status.addresses block.

@arkodg
Copy link
Contributor

arkodg commented Feb 21, 2023

@dprotaso above case helps merge K8s deployments (improving efficiency), there is also a use case where some users might want to further merge gateways within a single K8s service, reducing the number of cloud LBs needed (ratio of cloud LB to gateways is 1: N)

@dprotaso
Copy link
Contributor Author

there is also a use case where some users might want to further merge gateways within a single K8s service

@arkodg Yeah we would benefit from that as well. Curious if you have any details/issues to link about user expectations around this feature?

@arkodg
Copy link
Contributor

arkodg commented Feb 23, 2023

@dprotaso end users prefer lesser Cloud LBs / Services to reduce the management / operational complexity for managing the IPs, security, observailibity of these components

I think your proposed solution of introducing scope does solve the problem of merging Services if multiple Gateways have the same value of scope set.

Now if scope is of type string it is allowing the platform admins to create hints for the underlying implementation to group & isolate Gateway resources into domains in the infrastructure. Isnt that isolation handled at the GatewayClass level today :) ?

@dprotaso
Copy link
Contributor Author

I think my example of scope in #1713 (comment) shows that the same infrastructure could serve different scoped Gateways.

Now if scope is of type string it is allowing the platform admins to create hints for the underlying implementation to group & isolate Gateway resources into domains in the infrastructure. Isnt that isolation handled at the GatewayClass level today :) ?

I think a type string makes sense cause then you can have standard scopes and vendored-prefixed ones.

it is allowing the platform admins to create hints for the underlying implementation to group & isolate Gateway resources into domains in the infrastructure. Isnt that isolation handled at the GatewayClass level today :) ?

I think there are two aspects to this issue

  1. When creating a Gateway how do you hint it should be merged with another (this helps scale Gateway/listener management)
  2. How do you map Gateways onto infrastructure.

I can see 1. being a distinct property on a Gateway - In my example I was using the property group so Gateways would declare themselves part of a set.

For 2. I think this would need to be part of the GatewayClass as it's implementation specific.

@arkodg
Copy link
Contributor

arkodg commented Feb 23, 2023

yeah thinking more on this, I'm a +1 on either group or scope to allow users to bucket Gateways into domains.
This field can be made more powerful than a Gateway annotation if the implementation / validation webhook in Gateway API ensures all Gateway listeners within a specific group are compatible

@youngnick
Copy link
Contributor

I'm very -1 on some sort of group or scope field, that's explicitly what GatewayClass is supposed to be doing. It's intended that an implementation be able to reconcile multiple GatewayClasses with slightly different settings, that's why we used the controllerName field rather than explicitly specifying a single GatewayClass. Any time you think "we need a way to bucket these Gateways", we should be thinking of a GatewayClass thing first.

I think that @howardjohn's ideas in GEP-1762 (#1757) about using Address seem like a pretty viable way to handle both the manually-managed infrastructure use case, and the "how to specify a set of merged Gateways" use case. I had originally anticipated that "merge these Gateways" would be specified as a GatewayClass setting, but it seems like there may be too much shenanigans possible if we do it that way. I'm going to think more about it though.

@dprotaso
Copy link
Contributor Author

Related - @youngnick take a look at the discussion in the GEP - #1653

I have yet to update GEP contents will do that later tonight based on the discussion

@candita
Copy link
Contributor

candita commented Mar 6, 2023

@dprotaso is this a separate GEP from #1653? Why so much discussion here before someone is assigned?

@youngnick
Copy link
Contributor

It is a separate GEP, yes. But I think the eventual solution will be a combination of pieces from #1651 (PR #1653), and #1762 (PR #1757), rather than its own GEP.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2024
@youngnick
Copy link
Contributor

/remove-lifecycle stale

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 22, 2024
@dprotaso
Copy link
Contributor Author

dprotaso commented Feb 6, 2024

FYI - the GEP PR has been updated to support cross-namespace. From my comment #1863 (comment)

I now have a use case that would warrant having a parent in a different namespace. Imagine the cluster operator persona is split into two parts - the cluster operator and the application operator.

The cluster operator provisions a 'parent' Gateway in some system namespace. They're responsible for installing/upgrading the underlying package that implements the Gateway API. They want to ensure costs are kept low so they want their Gateway instance to be shared with app operators. Thus there's only a single LB for the cluster.

The application operator owns a namespace and sets up the namespace to be used by app developers. The app operator will want to define a child Gateway with the right listeners and attaches the 'child' Gateway to 'parent' Gateway mentioned above.

The application developer creates HTTPRoute and attaches them to the child gateway in their namespace.

Some implementations have expressed interest:

@arkodg
Copy link
Contributor

arkodg commented Feb 6, 2024

thanks for restarting work on this GEP @dprotaso. This has been a popular ask for users in Envoy Gateway, and we finally ended up adding a knob (in our previous v0.6.0 release ) at the GatewayClass level (via the parametersRef custom resource called EnvoyProxy) called MergeGateways . This is primarily because there is a 1:1 mapping between GatewayClass and data plane infrastructure in Envoy Gateway today as all infrastructure specific fields like resources and replicas can only be configured via the custom resource thats attached to the GatewayClass

@shaneutt
Copy link
Member

We do need this, especially as several implementations are already doing "something" in this regard.

/triage accepted

However we are targeting early April for v1.1.0, and in review we don't expect that this will be ready or have had enough soak time for that timeline, so we're looking at v1.2.0 for this currently.

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Mar 12, 2024
@shaneutt shaneutt removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Mar 12, 2024
@shaneutt shaneutt added this to the v1.2.0 milestone Mar 12, 2024
@dprotaso
Copy link
Contributor Author

/assign @dprotaso

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: In Progress
8 participants