Add data plane provisioner to support conformance tests #634

Closed · Tracked by #305 · Fixed by #683
pleshakov opened this issue on May 9, 2023 · 2 comments
Labels: area/control-plane, conformance, refined

@pleshakov (Contributor) commented on May 9, 2023

Parent issue -- #305

Implementation PRs (the issue requires multiple PRs)

Overview

This issue introduces a new component -- the provisioner -- to unblock running the conformance tests (short term) and to take care of data plane management for our users (long term).

Background

The Gateway API prescribes that implementations support more than one Gateway resource: for every deployed Gateway resource, the implementation must process it and configure the data plane accordingly. This is implicitly assumed in the Gateway API conformance tests.

Our implementation doesn't support multiple Gateways, which blocks running the conformance tests.
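
For illustration, this is roughly what the conformance suite expects an implementation to handle -- several Gateway resources of the same GatewayClass, each of which must get a working data plane, even when their listeners use the same port-protocol-hostname combination (see the approaches discussion below). The sketch builds two such Gateways with the Gateway API Go types; the names, namespace, and GatewayClass value are made up for illustration.

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	gatewayv1beta1 "sigs.k8s.io/gateway-api/apis/v1beta1"
)

// Two Gateways of the same GatewayClass. A conformant implementation must
// process both and configure a data plane for each, even though their
// listeners conflict from the point of view of a single Gateway.
var conformanceStyleGateways = []gatewayv1beta1.Gateway{
	{
		ObjectMeta: metav1.ObjectMeta{Name: "gateway-1", Namespace: "gateway-conformance-infra"},
		Spec: gatewayv1beta1.GatewaySpec{
			GatewayClassName: "nginx", // illustrative GatewayClass name
			Listeners: []gatewayv1beta1.Listener{
				{Name: "http", Port: 80, Protocol: gatewayv1beta1.HTTPProtocolType},
			},
		},
	},
	{
		ObjectMeta: metav1.ObjectMeta{Name: "gateway-2", Namespace: "gateway-conformance-infra"},
		Spec: gatewayv1beta1.GatewaySpec{
			GatewayClassName: "nginx",
			Listeners: []gatewayv1beta1.Listener{
				// Same port-protocol-hostname as gateway-1's listener.
				{Name: "http", Port: 80, Protocol: gatewayv1beta1.HTTPProtocolType},
			},
		},
	},
}
```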

We have considered a few approaches to unblock the conformance tests:

  • Modify the conformance tests so that they don't require support for multiple Gateways. The discussion happened here and in the community meetings. Our proposals didn't gain enough support to change the tests.
  • Modify NKG so that it supports multiple Gateways:
    • Merging Gateways. With this approach, NKG merges the listeners from all Gateways. We don't pursue this approach because (1) merging is not yet well defined by the Gateway API spec and (2) the listeners in the conformance tests conflict (from the point of view of a single Gateway) -- the same port-protocol-hostname listeners appear across multiple Gateways.
    • Merging Gateways in the data plane configuration. NKG processes Gateways independently but generates a single shared data plane configuration for all Gateways, resolving any conflicts. This approach was suggested here. We don't pursue it because, while it can unblock the conformance tests, it allows conflicting routing rules and listeners to propagate to the data plane configuration.
    • Logical data planes within a single data plane configuration. With this approach, (1) each Gateway gets dedicated ports in the data plane configuration, non-overlapping with the ports of the other Gateways, and (2) NKG provisions a dedicated Service for each Gateway so that clients can access the logical data plane via a dedicated IP. This approach was prototyped here. We don't pursue it because (a) it requires significant changes to how NKG generates NGINX configuration and (b) we're not confident we want to support multiple Gateways this way for production use going forward.
    • Data plane provisioner. A separate component -- the provisioner -- provisions an NKG deployment for each Gateway. We choose this approach because (a) most other Gateway API implementations went this way, (b) we believe there will be demand for it from our users in the future, and (c) it can be implemented in isolation from the core NKG code, so we can easily remove it later if we change course. This approach was prototyped here.

High-Level Requirements

We introduce a new component -- the provisioner -- which will take care of provisioning NKG deployments for Gateways of the configured GatewayClass. It will be a Kubernetes controller that watches for changes to Gateway resources belonging to its GatewayClass and ensures that for each Gateway, a corresponding NKG deployment is running in the cluster. The provisioner will own the GatewayClass resource -- it will report its status.

At the same time, NKG is extended so that:

  • It supports processing a specific Gateway for which it was provisioned by the provisioner.
  • It supports disabling reporting the status of the GatewayClass resource.

Detailed Requirements

Important note -- the existing way of handling multiple Gateways is preserved and remains the recommended one. The provisioner is introduced only to unblock running the conformance tests and is not recommended for production use (for now).

We introduce the provisioner in the same binary as NKG so that we don't need to build a separate container. As a result, the binary will have two modes (commands); a rough command-layout sketch follows the list:

  • Provisioner.
  • NKG - what we have now with a few extensions.
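
Roughly, the two modes could be wired up in a single binary like this (the command names follow the later implementation commits -- static-mode and provisioner-mode -- and the run functions are hypothetical placeholders, not the actual entry points):

```go
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	// The root command only prints help.
	rootCmd := &cobra.Command{
		Use:   "gateway",
		Short: "NGINX Kubernetes Gateway",
	}

	// static-mode: the previously available functionality of the gateway binary.
	staticModeCmd := &cobra.Command{
		Use:   "static-mode",
		Short: "Run NKG as the control plane for the data plane it is deployed with",
		RunE: func(cmd *cobra.Command, args []string) error {
			return runStaticMode() // hypothetical entry point
		},
	}

	// provisioner-mode: provisions an NKG deployment per Gateway of its GatewayClass.
	provisionerModeCmd := &cobra.Command{
		Use:   "provisioner-mode",
		Short: "Provision a static-mode NKG deployment for each Gateway of the GatewayClass",
		RunE: func(cmd *cobra.Command, args []string) error {
			return runProvisionerMode() // hypothetical entry point
		},
	}

	rootCmd.AddCommand(staticModeCmd, provisionerModeCmd)

	if err := rootCmd.Execute(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}

// Placeholders for the real mode implementations.
func runStaticMode() error      { return nil }
func runProvisionerMode() error { return nil }
```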

Provisioner

  • Processes a specific GatewayClass resource (its name is configured via a CLI argument, along with the controller name, which must match the controllerName field of the GatewayClass resource) and reports its status.
  • Processes Gateways that belong to its GatewayClass (a minimal reconciliation sketch follows this list). For each Gateway resource:
    • on Create: Create an NKG deployment with one replica in the namespace where the provisioner is running.
    • on Update: Do nothing.
    • on Delete: Delete the corresponding deployment.
  • Doesn't report the status of Gateway resources -- the corresponding NKG instance will do that.
  • Includes a warning in its log that it is not recommended for production use.
  • Resides in the same binary as NKG so that we don't need to build a separate container.
  • All NKG deployments share the same ServiceAccount, ConfigMaps, and other installation resources; only the Deployments differ. Creating those resources is a prerequisite for using the provisioner.
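
A minimal reconciliation sketch of the behavior above, using controller-runtime; the reconciler struct, the deployment naming scheme, and the buildNKGDeployment helper are illustrative assumptions rather than the actual implementation:

```go
package provisioner

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	gatewayv1beta1 "sigs.k8s.io/gateway-api/apis/v1beta1"
)

// GatewayReconciler ensures that every Gateway of the provisioner's
// GatewayClass has a corresponding single-replica NKG deployment in the
// namespace where the provisioner is running.
type GatewayReconciler struct {
	client.Client
	GatewayClassName string // GatewayClass handled by this provisioner
	Namespace        string // namespace where the NKG deployments are created
}

func (r *GatewayReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Hypothetical deterministic name for the per-Gateway deployment.
	depName := "nkg-" + req.Namespace + "-" + req.Name

	var gw gatewayv1beta1.Gateway
	err := r.Get(ctx, req.NamespacedName, &gw)

	if apierrors.IsNotFound(err) {
		// on Delete: delete the corresponding deployment.
		dep := &appsv1.Deployment{ObjectMeta: metav1.ObjectMeta{Name: depName, Namespace: r.Namespace}}
		if delErr := r.Delete(ctx, dep); delErr != nil && !apierrors.IsNotFound(delErr) {
			return ctrl.Result{}, delErr
		}
		return ctrl.Result{}, nil
	}
	if err != nil {
		return ctrl.Result{}, err
	}

	if string(gw.Spec.GatewayClassName) != r.GatewayClassName {
		// Gateway belongs to another GatewayClass; ignore it.
		return ctrl.Result{}, nil
	}

	// on Create: create an NKG deployment with one replica.
	// on Update: do nothing (AlreadyExists is not treated as an error).
	dep := buildNKGDeployment(depName, r.Namespace)
	if createErr := r.Create(ctx, dep); createErr != nil && !apierrors.IsAlreadyExists(createErr) {
		return ctrl.Result{}, createErr
	}
	return ctrl.Result{}, nil
}

// buildNKGDeployment is a placeholder for generating the static-mode NKG
// Deployment; the pod template, selector, and the CLI argument pointing the
// deployment at its Gateway are omitted here.
func buildNKGDeployment(name, namespace string) *appsv1.Deployment {
	replicas := int32(1)
	return &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec:       appsv1.DeploymentSpec{Replicas: &replicas},
	}
}
```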

NKG

  • Supports configuring a single Gateway resource to watch (via a CLI argument; see the flag sketch below).
  • Supports not reporting the GatewayClass status, so that it doesn't conflict with the provisioner.
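
The static-mode additions might surface as command-line flags roughly like the following; the flag names are assumptions for illustration and may not match the final CLI:

```go
package config

import "flag"

// Sketch of the static-mode flag additions (flag names are hypothetical).
var (
	// When set, static mode processes only this Gateway resource; the
	// provisioner would pass it when creating the per-Gateway deployment.
	gatewayNsName = flag.String("gateway", "",
		"The namespaced name (namespace/name) of the single Gateway resource to watch")

	// Allows turning off GatewayClass status updates so that static mode
	// does not conflict with the provisioner, which owns that status.
	updateGatewayClassStatus = flag.Bool("update-gatewayclass-status", true,
		"Update the status of the GatewayClass resource")
)
```

Static mode would then skip Gateways other than the configured one and, when the status flag is false, not start its GatewayClass status updater.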

Out of Scope

For now, we will not address many concerns that will be required for production use of the provisioner:

  • High availability.
  • Robustness. No validation of user input (Gateway API resources) unless it is needed for the conformance tests.
  • Performance.
  • Configurability. No need to support customization of the provisioned NKG deployments.
  • Reliability. Handling restarts and failures is not necessary. Terminating for error handling is acceptable.

Note: However, while the scope is reduced, we will not compromise on the software quality, including design, implementation and testing.

Note: We will leave open the option to re-evaluate whether supporting multiple Gateways via the provisioner is the optimal approach and will remove it if we find a better one.

Acceptance Criteria

  • The provisioner is introduced and NKG is extended to support the requirements above.
  • The new command-line arguments are documented. However, the installation of the provisioner is not documented in the installation docs, to discourage usage outside of development.
  • A new folder is introduced in the repo for developer docs -- docs for the developers of the NKG project. Add the provisioner installation and running instructions there.
@brianehlert commented:

Stepping back and thinking about this as if it was a feature proposal (not a bandage to run conformance):

I think a new deployment for every Gateway object is excessive and a waste of precious K8s cluster resources.
And systems that do so will be deemed "resource heavy".
Though it might be easier to compartmentalize the proxy config for development purposes.
I think that a deployment per GatewayClass is a far more natural assumption when thinking about production or 'in practice'.

As proposed, with the constraints given, I think it helps us meet the need to align with the conformance test assumptions and gives us a way to probe the resource consumption issue I raise.
I believe that resource consumption will become very important moving forward.
NGINX needs to be true to its roots of lightweight, performant, and highly stable.

@pleshakov (Contributor, Author) commented:

> I think a new deployment for every Gateway object is excessive and a waste of precious K8s cluster resources.
> And systems that do so will be deemed "resource heavy".
> Though it might be easier to compartmentalize the proxy config for development purposes.
> I think that a deployment per GatewayClass is a far more natural assumption when thinking about production or 'in practice'.

This is a valid concern. However, the current state of the Gateway API spec, as expressed through the conformance tests, expects a separate data plane per Gateway. Using a single data plane for multiple Gateways will be supported once explicit support for merging is accepted -- kubernetes-sigs/gateway-api#1863

@pleshakov added the area/control-plane and refined labels and removed the area/control-plane and proposal labels on May 10, 2023
@pleshakov self-assigned this on May 10, 2023
@pleshakov added this to the v0.4.0 milestone on May 10, 2023
@pleshakov added the tests and conformance labels and removed the tests label on May 10, 2023
@pleshakov moved this from 🆕 New to 🏗 In Progress in NGINX Gateway Fabric on May 11, 2023
pleshakov added a commit to pleshakov/nginx-gateway-fabric that referenced this issue May 18, 2023
This commit:
- Changes the cmd package to support multiple commands in
gateway binary.
- Adds the root command, which simply prints help.
- Adds control-plane command, which starts the control plane --
the previously available functionality of the gateway binary.
- Adds provisioner command, currently not implemented.

Needed by nginxinc#634
pleshakov added a commit that referenced this issue May 20, 2023
This commit:
- Changes the cmd package to support multiple commands in
gateway binary.
- Adds the root command, which simply prints help.
- Adds static-mode command, which starts 
NGINX Kubernetes Gateway in static mode --
the previously available functionality of the gateway binary.
- Adds provisioner-mode command, currently not implemented.

Needed by #634
pleshakov added a commit to pleshakov/nginx-gateway-fabric that referenced this issue May 23, 2023
This commit adds two features for the static mode:
- Support configuring a single Gateway resource to watch.
- Support not reporting the GatewayClass status,
so that it doesn't conflict with the status updates done by
the provisioner.

Needed by nginxinc#634
pleshakov added a commit that referenced this issue May 24, 2023
This commit adds two features for the static mode:
- Support configuring a single Gateway resource to watch.
- Support not reporting the GatewayClass status,
so that it doesn't conflict with the status updates done by
the provisioner.

Needed by #634
@pleshakov mentioned this issue on May 25, 2023
@pleshakov moved this from 🏗 In Progress to 👀 In Review in NGINX Gateway Fabric on May 25, 2023
pleshakov added a commit to pleshakov/nginx-gateway-fabric that referenced this issue Jun 1, 2023
Problem:
Support provisioning NGINX data plane per Gateway, as expected
by the Gateway API conformance tests.
See nginxinc#634
for more info

Solution:
- Implement provisioner command, which provisions a Deployment
of NKG (static mode) for each Gateway of the provisioner GatewayClass.
- Add provisioner manifests and docs

Fixes nginxinc#634

Additionally, introduce the PrepareTimeForFakeClient helper function, which
fixes an error that appeared on the GitHub Actions pipeline but not locally (see below).
(To reproduce it locally, run `make TZ=123 unit-tests` and ensure you
compare Conditions in the status as in
Expect(clusterGc.Status.Conditions).To(Equal(expectedConditions)) )

The timezone of the time in a resource field returned by the fake client was different
from the one set in the field when updating the resource.

The commit adds PrepareTimeForFakeClient() which ensures that the time is prepared
correctly so that the timezone is the same.

The problem is only present when comparing status Conditions using gomega like
Expect(clusterGc.Status.Conditions).To(Equal(expectedConditions))
but not present if comparing using cmp like
Expect(helpers.Diff(expectedGc, latestGc)).To(BeEmpty()).

  [FAILED] Expected
      <*time.Location | 0x30d0b00>: {
          name: "Local",
          zone: [
              {name: "UTC", offset: 0, isDST: false},
          ],
          tx: [
              {
                  when: -9223372036854775808,
                  index: 0,
                  isstd: false,
                  isutc: false,
              },
          ],
          extend: "UTC0",
          cacheStart: -9223372036854775808,
          cacheEnd: 9223372036854775807,
          cacheZone: {name: "UTC", offset: 0, isDST: false},
      }
  to equal
      <*time.Location | 0x309f240>: {name: "UTC", zone: nil, tx: nil, extend: "", cacheStart: 0, cacheEnd: 0, cacheZone: nil}
pleshakov added a commit that referenced this issue Jun 1, 2023
Problem:
Support provisioning NGINX data plane per Gateway, as expected
by the Gateway API conformance tests.
See #634
for more info

Solution:
- Implement provisioner command, which provisions a Deployment
of NKG (static mode) for each Gateway of the provisioner GatewayClass.
- Add provisioner manifests and docs

Fixes #634

Additionally, introduce the PrepareTimeForFakeClient helper function, which
fixes an error that appeared on the GitHub Actions pipeline but not locally (see below).
(To reproduce it locally, run `make TZ=123 unit-tests` from a commit before
this one and ensure you compare Conditions in the status as in
Expect(clusterGc.Status.Conditions).To(Equal(expectedConditions)) )

The timezone of the time in a resource field returned by the fake client was different
from the one set in the field when updating the resource.

The commit adds PrepareTimeForFakeClient() which ensures that the time is prepared
correctly so that the timezone is the same.

The problem is only present when comparing status Conditions using gomega like
Expect(clusterGc.Status.Conditions).To(Equal(expectedConditions))
but not present if comparing using cmp like
Expect(helpers.Diff(expectedGc, latestGc)).To(BeEmpty()).

  [FAILED] Expected
      <*time.Location | 0x30d0b00>: {
          name: "Local",
          zone: [
              {name: "UTC", offset: 0, isDST: false},
          ],
          tx: [
              {
                  when: -9223372036854775808,
                  index: 0,
                  isstd: false,
                  isutc: false,
              },
          ],
          extend: "UTC0",
          cacheStart: -9223372036854775808,
          cacheEnd: 9223372036854775807,
          cacheZone: {name: "UTC", offset: 0, isDST: false},
      }
  to equal
      <*time.Location | 0x309f240>: {name: "UTC", zone: nil, tx: nil, extend: "", cacheStart: 0, cacheEnd: 0, cacheZone: nil}
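
A minimal sketch of what a helper like PrepareTimeForFakeClient might do, assuming the only mismatch is the *time.Location (the fake client returns times in the Local location, while the expected Conditions carried UTC); the real helper in the repo may differ:

```go
package helpers

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PrepareTimeForFakeClient normalizes a metav1.Time so that Gomega's Equal
// can compare it against values returned by the fake client, which come back
// in the Local time location. Converting the expected time to Local makes the
// *time.Location of both sides match. (Sketch only.)
func PrepareTimeForFakeClient(t metav1.Time) metav1.Time {
	t.Time = t.Local()
	return t
}
```

In a test, the expected Conditions would then be built with LastTransitionTime: PrepareTimeForFakeClient(someTime) before the Expect(...).To(Equal(...)) comparison.
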
github-project-automation bot moved this from 👀 In Review to ✅ Done in NGINX Gateway Fabric on Jun 1, 2023