Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Live Nginx configuration update without reloading #2174

Merged
merged 1 commit into from
Mar 18, 2018

Conversation

ElvinEfendi
Copy link
Member

@ElvinEfendi ElvinEfendi commented Mar 6, 2018

Requires #2172 to be merged first and published the new Nginx image.

What this PR does / why we need it:
The PR implements live configuration update for Nginx upstreams/k8s endpoints. That means every time you deploy your app Nginx won't be reloaded. This is very important when you are front-ending many applications behind a single deployment of ingress-nginx and those app gets deployed frequently and have significant traffic.

I've mainly focused on laying out proper foundation for further development of live Nginx (re)configuration feature in ingress-nginx. The PR implements only Round Robin algorithm but can easily be extended to support more. I have another PR at #2167 to configure load balancing algorithm per ingress, that configuration will be used by Lua code as well to decide which LB algorithm to use for a given app(namespace/service).

The feature can be enabled by using --enable-dynamic-configuration command line flag. If you wanna see more Lua logs you can also set error-log-level to info to see what it does on endpoints changes.

In order to avoid JSON decoding in the request path, for every Nginx worker I spin up a periodic function that reads raw configuration from a shared Lua dictionary and decodes it into local cache per worker. Then balancer uses its local cache to get the list of endpoints and other backend configuration for given namespace/service.

If you want to test you can use index.docker.io/elvinefendi/nginx-ingress-controller:0.0.1

Next steps: Setup a testing framework for Lua code, add support for EWMA and more LB algorithms, add support for live update of secrets by using certificate_by_lua

Which issue this PR fixes: n/a

Special notes for your reviewer:
I'm aware that there's another similar PR being worked on at #2152. I was told about that only after I started working on this feature at Shopify#16 (comment). There are significant differences in the implementation between the two PRs and I thought there could be value in creating this PR as well.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 6, 2018
@ElvinEfendi
Copy link
Member Author

/assign @aledbf

@aledbf
Copy link
Member

aledbf commented Mar 6, 2018

@ElvinEfendi @valeriano-manassero can we get consensus about which version we are going to push?
We cannot have two version of the same feature.

@ElvinEfendi
Copy link
Member Author

@aledbf I’d love to hear what @valeriano-manassero thinks about going forward with this PR and collaboratively extending it later. I am going to work on this feature in coming weeks consistently and if you want I can put together a small RFC as well to discuss the future of this work in a more organized way.

I have already commented on other PR and raised my concerns. In my humble opinion the controller change in this PR is more concise and cleaner and the Lua implementation is more performant(no ‘json’ decoding in request path), correct(I use lua-resty-lock to avoid possible race condition between different Nginx workers accessing load balancing state in shared dictionary) and easier to extend(configuration.lua has its clear responsibility of accepting raw config piece per component and storing it in general shared dictionary which will be consumed by the respective middleware as needed. For example as of now ‘balancer.lua’ needs ‘backends’ only and it is responsible what needs to be done with that data alone. In future we can POST ‘servers’ config as well that will be consumed in a similar way by i.e ‘certificate.lua’ that serves certs dynamically etc).

We are also already running this code in our test cluster without any issue and also starting to migrate some less critical services to it(I will update with more data later).

Given the above compelling reasons I suggest we move forward with this PR.

Please let me know what do you think.

@valeriano-manassero
Copy link

@aledbf @ElvinEfendi Sorry for the delay, I had a couple of very busy days.
I agree with you this PR is cleaner so it's ok for me to go on.
I will rebase further implementations I need on this PR when you'll merge in master.
Maybe a it's good if you want to discuss with me what I would like to implement for the future.
Thanks.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 7, 2018
@ElvinEfendi
Copy link
Member Author

ElvinEfendi commented Mar 7, 2018

The CI should be fixed once #2172 gets merged and new Nginx image gets published.

@ElvinEfendi
Copy link
Member Author

Thanks @valeriano-manassero! I would be happy to discuss that with you.

@ElvinEfendi
Copy link
Member Author

ElvinEfendi commented Mar 9, 2018

@aledbf could you push the new version of Nginx updated at #2172? Once I have that I'll fix the CI here.

@aledbf
Copy link
Member

aledbf commented Mar 9, 2018

@ElvinEfendi today I will publish the new nginx image

@aledbf
Copy link
Member

aledbf commented Mar 9, 2018

@ElvinEfendi please squash the commits (no more than three or four)

@codecov-io
Copy link

codecov-io commented Mar 9, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@41cefeb). Click here to learn what that means.
The diff coverage is 56.09%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #2174   +/-   ##
=========================================
  Coverage          ?   36.85%           
=========================================
  Files             ?       70           
  Lines             ?     4963           
  Branches          ?        0           
=========================================
  Hits              ?     1829           
  Misses            ?     2855           
  Partials          ?      279
Impacted Files Coverage Δ
internal/ingress/controller/config/config.go 98% <ø> (ø)
internal/ingress/controller/controller.go 0% <0%> (ø)
cmd/nginx/flags.go 83.94% <100%> (ø)
internal/ingress/controller/template/template.go 63.94% <100%> (ø)
internal/ingress/controller/nginx.go 5.19% <15.38%> (ø)
internal/file/bindata.go 51.8% <84.61%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 41cefeb...af6ae14. Read the comment docs.

@ElvinEfendi
Copy link
Member Author

@aledbf looks like you've published the new Nginx image already, thanks!

CI is 🍏 now. Please let me know if you think this PR requires some more changes.

@ElvinEfendi
Copy link
Member Author

@aledbf is there anything blocking this PR from merging?

@aledbf
Copy link
Member

aledbf commented Mar 13, 2018

@ElvinEfendi only testing. If I don't find any issues I will merge this in two days.
Apologies but I've been really busy with other stuff.

@ElvinEfendi
Copy link
Member Author

@aledbf thanks for the update!

if isSticky(host, location, backend.SessionAffinity.CookieSessionAffinity.Locations) {
upstreamName = fmt.Sprintf("sticky-%v", upstreamName)
}
if !dynamicConfigurationEnabled {
Copy link
Member Author

@ElvinEfendi ElvinEfendi Mar 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently buggy, it will ignore https proto when dynamic configuration is enabled. I have fixed it at https://github.com/Shopify/ingress/pull/29/files. @aledbf let me know if you want me include the fix(and corresponding regression test) in this PR or in a separate PR after this gets merged.

related to https://github.com/kubernetes/ingress-nginx/pull/2174/files#diff-f16f9102ce1211ffe529bbb935a8d46eR40

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include the fix in this PR

newBackends = append(newBackends, pcfg.Backends[i].DeepCopy())
}

n.runningConfig.Backends = []*ingress.Backend{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not change the state of n.runningConfig. Please make a copy and use that to make the comparison.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aledbf notice that I'm changing only n.runningConfig.Backends and I have the copy of its original value at https://github.com/kubernetes/ingress-nginx/pull/2174/files#diff-cde3fffe2425ad7efaa8add1d05ae2c0R756 where I restore at https://github.com/kubernetes/ingress-nginx/pull/2174/files#diff-cde3fffe2425ad7efaa8add1d05ae2c0R772.

The idea is to make sure that the change between new and running config is only Backends change. Please let me know if you think there's a better way of doing this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should not change n.runningConfig. I prefer a full copy that we can be disposed of later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there already a utility function to deep copy runningConfig?

@ElvinEfendi
Copy link
Member Author

ElvinEfendi commented Mar 15, 2018

@aledbf I included bug fix(c778654) and corresponding test cases(3879d72).

I've also refactored(2a9f13d) IsDynamicallyConfigurable implementation to not change n.runningConfig and improved the test for it(534a3c6).

I'll squash my commits once you finish reviewing.

@aledbf
Copy link
Member

aledbf commented Mar 16, 2018

@ElvinEfendi please rebase and squash for the last review and merge

@ElvinEfendi ElvinEfendi force-pushed the dynamic-upstreams branch 2 times, most recently from 7d1a560 to d855a5e Compare March 16, 2018 16:51
@ElvinEfendi
Copy link
Member Author

@ElvinEfendi please rebase and squash for the last review and merge

@aledbf done!

end

function _M.call()
if ngx.var.request_method ~= "POST" or ngx.var.request_uri ~= "/configuration/backends" then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to have a GET api to retrieve all configurations. As all configurations now store in memory not files, when something gets wrong, it will be difficult to debug if configuration is correct

@aledbf
Copy link
Member

aledbf commented Mar 18, 2018

@ElvinEfendi I just finished testing the PR and I just have two comments:

  1. If the dynamic configuration is enabled you need to skip the update of upstreams in the file nginx.conf (template)
    Add the flag --v=2 to see what happens
127.0.0.1 - [127.0.0.1] - - [18/Mar/2018:02:18:53 +0000] "POST /configuration/backends HTTP/1.1" 201 5 "-" "Go-http-client/1.1" 2890 0.000 [-] - - - -
I0318 02:18:53.682267       5 controller.go:175] dynamic reconfiguration succeeded, skipping reload
I0318 02:18:53.682301       5 util.go:64] system fs.file-max=180573
I0318 02:18:53.682307       5 nginx.go:560] maximum number of open file descriptors : 89262
I0318 02:18:53.714604       5 nginx.go:658] NGINX configuration diff
I0318 02:18:53.714670       5 nginx.go:659] --- /etc/nginx/nginx.conf	2018-03-18 02:17:17.168970949 +0000
+++ /tmp/new-nginx-cfg589937464	2018-03-18 02:18:53.712665931 +0000
@@ -221,6 +221,7 @@
 
         server 172.17.0.5:8080 max_fails=0 fail_timeout=0;
         server 172.17.0.3:8080 max_fails=0 fail_timeout=0;
+        server 172.17.0.6:8080 max_fails=0 fail_timeout=0;
 
     }
  1. When you send the update to lua please include a log line like glog.V(2).Infof with the body you are sending. Without this is impossible to know what's the configuration in lua
    Even better is you can add an additional route to get the content from lua.

@ElvinEfendi
Copy link
Member Author

Thanks @aledbf!

If the dynamic configuration is enabled you need to skip the update of upstreams in the file nginx.conf (template)

I implemented it that way intentionally thinking that it would be more resilient to failures. But now that I think more about it is probably better to completely skip OnUpdate function call in syncIngress when dynamic reconfiguration is enabled. When we switch back to default behaviour the up-to-date configuration will be generated anyway.

When you send the update to lua please include a log line like glog.V(2).Infof with the body you are sending. Without this is impossible to know what's the configuration in lua
Even better is you can add an additional route to get the content from lua.

👍 I'll do both.

@aledbf
Copy link
Member

aledbf commented Mar 18, 2018

I implemented it that way intentionally thinking that it would be more resilient to failures. But now that I think more about it is probably better to completely skip OnUpdate function call in syncIngress when dynamic reconfiguration is enabled

Please keep in mind you only need to comment the upstream generation. We still need the reload of nginx when an Ingress is added/removed or a new annotation is configured

To be clear, https://github.com/kubernetes/ingress-nginx/blob/master/rootfs/etc/nginx/template/nginx.tmpl#L311-L338 should be inside an if section asking for dynamic configuration being disabled (so this is not generated)

@ElvinEfendi
Copy link
Member Author

@aledbf addressed all your comments in the last 5 commits.

I have also pushed 66137ba to allow requests only from localhost to the Lua configuration endpoint.

@aledbf
Copy link
Member

aledbf commented Mar 18, 2018

If someone else can help to test this please use quay.io/aledbf/nginx-ingress-controller:0.341 with the flag --enable-dynamic-configuration=true in the deployment

@aledbf
Copy link
Member

aledbf commented Mar 18, 2018

@ElvinEfendi this is working just fine. Let's wait until tomorrow and I will merge this.

@aledbf
Copy link
Member

aledbf commented Mar 18, 2018

@ElvinEfendi after this PR we need to clean up the JSON is being sent to LUA because there's a lot of information we are not going to use (and I don't want to use more memory than necessary in nginx)

@aledbf
Copy link
Member

aledbf commented Mar 18, 2018

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf, ElvinEfendi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 18, 2018
@aledbf
Copy link
Member

aledbf commented Mar 18, 2018

@ElvinEfendi thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants