
Add a validating webhook for ingress sanity check #3802

Merged
merged 1 commit into kubernetes:master from tjamet:admission-controller
May 2, 2019

Conversation

tjamet
Contributor

@tjamet tjamet commented Feb 22, 2019

What this PR does / why we need it:

  • Add a feature to reject the admission of ingresses that would generate an invalid configuration

Having run the nginx ingress controller in production for a while, we are pretty happy with it, but we occasionally hit an issue where an ingress with a corrupted configuration is injected and prevents any further update of the ingress nginx configuration.
We then need to track down the offending ingress and fix it.
This PR shortens the feedback loop by optionally allowing such an ingress to be rejected before it is inserted, hence preserving the sanity of the configuration.

Once the patch is applied, and the controller configured, this is the user experience in case of an invalid ingress:

[screenshot: the error returned when applying an invalid ingress]

Which issue this PR fixes (will close those issues when the PR gets merged):
fixes #2833
fixes #3218
fixes #3435
fixes #3459

Special notes for your reviewer:
I am aware that this PR needs testing and documentation.
If the option of adding this feature is accepted, I will contribute documentation and tests as well.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Feb 22, 2019
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 22, 2019
@aledbf
Member

aledbf commented Feb 22, 2019

@tjamet first, thank you for working on this!
That said:

  • please rebase
  • to update the dependencies please run:
    • rm -rf images/grpc-fortune-teller/ (you can restore that before the commit)
    • make dep-ensure

@aledbf
Member

aledbf commented Feb 22, 2019

@tjamet as you said, we need tests for this :)
What's the behavior when you have multiple deployments with different classes?

```go
ings := n.store.ListIngresses()
ings = append(ings, &ingress.Ingress{*ing, annotations.NewAnnotationExtractor(n.store).Extract(ing)})

// sort Ingresses using the CreationTimestamp field
```
Member

This is not required because ListIngresses already sorts the ingresses

@tjamet
Contributor Author

tjamet commented Feb 22, 2019

@aledbf thanks for the fast feedback
I am currently rebasing

Indeed I will then work on tests and documentation!
Regarding multiple deployments, the expected behavior is to accept ingresses that do not match the class. That said, I need further tests with multiple deployments

@aledbf
Member

aledbf commented Feb 22, 2019

> Regarding multiple deployments, the expected behavior is to accept ingresses that do not [match the class](c07f553#diff-86ac9ff9d75a0c5005c116e766a6127dR203).

Yes, I saw that, but I am not sure about the webhook behavior with multiple deployments

@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@aledbf aledbf added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 22, 2019
@ElvinEfendi
Member

Also fixes #3588

@ElvinEfendi
Member

@tjamet thanks for working on this! It'd be great if you could describe your approach, discuss how much overhead this adds to the core event loop, and explain why I would not want to have this always on.

@tjamet tjamet force-pushed the admission-controller branch 2 times, most recently from 40a177d to da477d0 on February 24, 2019 14:45
@tjamet
Contributor Author

tjamet commented Feb 25, 2019

@aledbf I see two parameters so far that can change the set of ingresses considered for configuration generation in a given controller:

  • --ingress-class
  • --watch-namespace

I hence suspect multiple deployments can combine these in four different ways when deciding whether or not to consider an ingress.

Following your comment, I added the filter based on the namespace name as well, so the webhook behaves consistently with its deployment. This should handle all combinations of --ingress-class and --watch-namespace, as long as the validating webhook is enabled for all deployments.

The approach, @ElvinEfendi, is pretty simple in the end: it consists of exposing, outside of the core event loop, an HTTP handler compliant with the Kubernetes validating webhook interface, which ensures that any ingress about to be inserted in the cluster would generate a syntactically valid configuration together with the rest of the known ingresses, using the pre-existing features. This handler is exposed through a service referenced by a ValidatingWebhookConfiguration object.
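
For illustration, the wiring could look roughly like this (a minimal sketch; the service name, namespace, webhook path, and rules are assumptions, not the PR's actual manifests):

```yaml
# Hypothetical ValidatingWebhookConfiguration pointing the API server at
# the controller's admission endpoint; names and path are illustrative.
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: check-ingress
webhooks:
  - name: validate.nginx.ingress.kubernetes.io
    rules:
      - apiGroups: ["extensions"]
        apiVersions: ["v1beta1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["ingresses"]
    failurePolicy: Fail
    clientConfig:
      service:
        namespace: ingress-nginx            # assumption
        name: nginx-ingress-webhook         # assumption
        path: /extensions/v1beta1/ingresses # assumption
      caBundle: <base64-encoded CA certificate>
```

The API server then POSTs an AdmissionReview to this service for every ingress create or update, and denies the object whenever the handler answers allowed: false.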

As this handler runs outside of the core event loop, I have run a benchmark that shows the following results (reported by go test -bench '.' -benchmem -run Benchmark -v within ./bin/go-in-docker.sh):

| number of pre-existing ingresses | (Go) memory used per request | time per validating request |
|---------------------------------:|-----------------------------:|----------------------------:|
| 0    | 136342 B/op   | 25926680 ns/op  |
| 20   | 637230 B/op   | 26366018 ns/op  |
| 50   | 1415960 B/op  | 38304690 ns/op  |
| 100  | 2587819 B/op  | 47551526 ns/op  |
| 200  | 5566906 B/op  | 71309565 ns/op  |
| 500  | 12790632 B/op | 128302310 ns/op |
| 1000 | 28805841 B/op | 226466720 ns/op |
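
For reference, the figures above come from Go's standard benchmark harness; a skeleton of that shape (a generic sketch around a stub, not the PR's actual test) would be:

```go
package admission_test

import "testing"

// checkIngress stands in for the validation entry point, which renders
// the nginx configuration with the candidate ingress added and asks
// nginx to verify its syntax.
func checkIngress() error { return nil }

func BenchmarkValidateIngress(b *testing.B) {
	b.ReportAllocs() // produces the B/op column in the table above
	for i := 0; i < b.N; i++ {
		if err := checkIngress(); err != nil {
			b.Fatal(err)
		}
	}
}
```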

Because the request is served through a service using the ingress controller pods as backends, webhook requests are load-balanced across all ingress controllers, which reduces the load on any single pod in case multiple ingresses are inserted at the same time.

Another option to get better performance could be to drop the test against all pre-existing ingresses; validating the new ingress alone would probably still be good enough to detect most configuration errors (mainly syntax errors).

Regarding enabling the feature by default: as a system operator, I would tend to recommend enabling it systematically.
That said, it looks difficult to me to enable it by default right after implementation. It would change the permissions the ingress controller requires, granting it cluster-wide privileges to inject the ValidatingWebhookConfiguration object.
Also, the way validating webhooks work, they require a self-signed certificate generated for the correct <service>.<namespace>.svc domain, which would be more difficult to integrate out of the box inside the ingress controller. My approach would rather be to make it easy to use such a feature in the chart to start with. Happy to read other opinions or concerns

@aledbf
Member

aledbf commented Mar 5, 2019

> This should handle all combinations of --ingress-class and --watch-namespace, as long as the validating webhook is enabled for all deployments.

👍

> Regarding enabling the feature by default: as a system operator, I would tend to recommend enabling it systematically.

This must be optional (at least for now)

> Also, the way validating webhooks work, they require a self-signed certificate generated for the correct <service>.<namespace>.svc domain, which would be more difficult to integrate out of the box inside the ingress controller. My approach would rather be to make it easy to use such a feature in the chart to start with. Happy to read other opinions or concerns

We can provide a script like https://gist.github.com/aledbf/11b3b4f882c632d021532edda8a5dc52 (I need to clean that up) to use the feature already available in k8s to generate such a certificate

@tjamet
Contributor Author

tjamet commented Mar 9, 2019

> We can provide a script like https://gist.github.com/aledbf/11b3b4f882c632d021532edda8a5dc52 (I need to clean that up) to use the feature already available in k8s to generate such a certificate

Sure!
I can add it to this PR
We have it deployed with Helm, using:

```
{{- $cn := printf "%s.%s.svc" ( include "nginx-ingress.validatingWebhook.fullname" . ) .Release.Namespace }}
{{- $ca := genCA (printf "%s-ca" ( include "nginx-ingress.validatingWebhook.fullname" . )) .Values.validatingWebhook.certificateValidity -}}
{{- $cert := genSignedCert $cn nil nil .Values.validatingWebhook.certificateValidity $ca -}}
```

This could also be integrated into the Helm chart afterwards, which I could contribute as well.
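
For completeness, the generated material would then be consumed along these lines (an illustrative continuation; the Secret name and keys are assumptions, and the CA itself would go into the webhook's clientConfig as caBundle: {{ $ca.Cert | b64enc }}):

```
{{- /* hypothetical continuation: store the generated key pair in a TLS Secret */}}
apiVersion: v1
kind: Secret
metadata:
  name: {{ include "nginx-ingress.validatingWebhook.fullname" . }}-certs
type: kubernetes.io/tls
data:
  tls.crt: {{ $cert.Cert | b64enc }}
  tls.key: {{ $cert.Key | b64enc }}
```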

Regarding impact on resources, here are graphs, with the deployment in the middle.
The previous version was 0.19.0
[screenshot: controller resource usage graphs, with the deployment in the middle]

@tjamet tjamet force-pushed the admission-controller branch 2 times, most recently from 50dd1ab to fd69b0b on March 19, 2019 15:58
@zq-david-wang

Just a silly question: would this validating webhook reject ingress resources that are valid for an ingress controller of another kind? For example, the regular expression syntax for traefik is unusual and would cause the nginx ingress controller to crash. If I deploy the nginx ingress controller with this validating webhook, can I still use traefik within the same cluster? (To be honest, the metrics endpoint for traefik is way more useful than nginx's.)

@tjamet tjamet force-pushed the admission-controller branch from fd69b0b to 21efe30 on April 4, 2019 13:12
@tjamet
Contributor Author

tjamet commented Apr 4, 2019

Hi @zq-david-wang

> would this validating webhook reject ingress resources that are valid for an ingress controller of another kind?

The validating webhook filters on the kubernetes.io/ingress.class annotation; ingress resources that match a different class will be accepted.
In case you run multiple ingress controllers in a cluster (we do run several nginx ingress controllers), I would expect the controllers to be deployed with different classes, hence ingresses for other controller kinds (traefik in the case you mention) will be accepted.

In case both controllers run with the same class, both would generate a configuration for the ingress, and the validating webhook will then check the validity of the generated configuration.
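
A rough sketch of that class filter (hypothetical helper and names; the PR's actual code may differ, e.g. in how a missing annotation is treated):

```go
package webhook

import (
	extensions "k8s.io/api/extensions/v1beta1"
)

// shouldValidate reports whether this controller is responsible for the
// ingress: resources carrying another class annotation (e.g. traefik)
// are admitted without any nginx syntax check.
func shouldValidate(ing *extensions.Ingress, controllerClass string) bool {
	class := ing.Annotations["kubernetes.io/ingress.class"]
	if class == "" {
		// no annotation: assume the default-class controller handles it
		return controllerClass == "nginx"
	}
	return class == controllerClass
}
```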

@tjamet
Contributor Author

tjamet commented Apr 11, 2019

Hi,
I think I have addressed the concerns. Are there any others I need to address, or is it good enough for a first version?
I will be happy to keep looking at it and improve it with other ideas later on

@aledbf
Member

aledbf commented Apr 15, 2019

@tjamet please rebase and squash the commits

@tjamet tjamet force-pushed the admission-controller branch from 21efe30 to 5e6478c on April 15, 2019 20:11
@tjamet
Contributor Author

tjamet commented Apr 15, 2019

Rebased and commits squashed

@aledbf aledbf removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 16, 2019
@aledbf aledbf mentioned this pull request Apr 17, 2019
@aledbf
Member

aledbf commented Apr 18, 2019

@tjamet please rebase. After that, I plan to build a dev image to test and merge this PR

In case some ingress has a syntax error in its snippet configuration,
the freshly generated configuration will not be reloaded, to prevent tearing down existing rules.
However, once inserted, this configuration prevents any other valid configuration from being applied, as the broken ingress remains among the ingresses of the cluster.
To solve this problem, implement an optional validation webhook that simulates adding the candidate ingress together with the rest of the ingresses.
In case the generated configuration is not validated by nginx, deny the insertion of the ingress.

In case certificates are mounted using kubernetes secrets, when those
change, the keys are automatically updated in the container volume, and the
controller reloads them using the file watcher.

Related changes:

- Update vendors
- Extract useful functions to check configuration with an additional ingress
- Update documentation for validating webhook
- Add validating webhook examples
- Add a metric for each syntax check success and errors
- Add more certificate generation examples
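
For intuition, the "validated by nginx" step described above can be pictured like this (a self-contained stand-in, not the controller's actual code):

```go
package webhook

import (
	"io/ioutil"
	"os"
	"os/exec"
)

// testNginxConfig writes the candidate configuration, rendered from all
// known ingresses plus the one being admitted, to a temporary file and
// asks nginx to check its syntax without reloading anything.
func testNginxConfig(cfg []byte) error {
	tmp, err := ioutil.TempFile("", "nginx-*.conf")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name())
	if _, err := tmp.Write(cfg); err != nil {
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// `nginx -t -c <file>` verifies the configuration and exits non-zero
	// on syntax errors, which the webhook turns into a denial.
	return exec.Command("nginx", "-t", "-c", tmp.Name()).Run()
}
```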
@ElvinEfendi
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 2, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ElvinEfendi, tjamet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 2, 2019
@k8s-ci-robot k8s-ci-robot merged commit b4f2880 into kubernetes:master May 2, 2019
@ElvinEfendi
Member

Thanks @tjamet!

@tjamet tjamet deleted the admission-controller branch May 3, 2019 08:14
@towolf
Contributor

towolf commented May 5, 2019

This is really cool. Thanks @tjamet.
