Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve configuration change detection #2656

Merged
merged 4 commits into from
Jun 21, 2018
Merged

Conversation

aledbf
Copy link
Member

@aledbf aledbf commented Jun 17, 2018

What this PR does / why we need it:

Introduce additional information (checksum) to the configuration to allow determine if there was a change or not. Until now a change in the configuration configmap only triggered a reload after a change in some Ingress.

Which issue this PR fixes:

fixes #2567

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 17, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 17, 2018
@aledbf aledbf force-pushed the queue branch 3 times, most recently from ff26ed1 to 5ec16e3 Compare June 17, 2018 16:57
@codecov-io
Copy link

codecov-io commented Jun 17, 2018

Codecov Report

Merging #2656 into master will increase coverage by 0.11%.
The diff coverage is 40.74%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2656      +/-   ##
==========================================
+ Coverage   40.82%   40.93%   +0.11%     
==========================================
  Files          72       72              
  Lines        5078     5088      +10     
==========================================
+ Hits         2073     2083      +10     
+ Misses       2723     2721       -2     
- Partials      282      284       +2
Impacted Files Coverage Δ
internal/ingress/controller/config/config.go 98.23% <ø> (ø) ⬆️
internal/ingress/types_equals.go 12.33% <0%> (-0.09%) ⬇️
internal/ingress/controller/nginx.go 11.51% <0%> (-0.03%) ⬇️
internal/ingress/controller/controller.go 2.24% <0%> (+0.02%) ⬆️
internal/ingress/controller/template/configmap.go 77.39% <66.66%> (-0.6%) ⬇️
internal/task/queue.go 77.92% <70%> (-1.49%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a6978a8...0abbd2b. Read the comment docs.

@ElvinEfendi
Copy link
Member

Until now a change in the configuration configmap only triggered a reload after a change in some Ingress.

What about forcereload we currently have? Is it not supposed to cover this case?

@@ -185,7 +185,7 @@ func NewStatusSyncer(config Config) Sync {
go st.syncQueue.Run(time.Second, stop)
wait.PollUntil(updateInterval, func() (bool, error) {
// send a dummy object to the queue to force a sync
st.syncQueue.Enqueue("sync status")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have to reload/regenerate Nginx config when controller leader change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unnecessary now. Removed.

@@ -185,7 +185,7 @@ func NewStatusSyncer(config Config) Sync {
go st.syncQueue.Run(time.Second, stop)
wait.PollUntil(updateInterval, func() (bool, error) {
// send a dummy object to the queue to force a sync
st.syncQueue.Enqueue("sync status")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's happening currently here? Runtime error because of an incorrect type?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that it is an interface, so eventually it will get passed to cache.DeletionHandlingMetaNamespaceKeyFunc to generate a key. And https://github.com/kubernetes/client-go/blob/8aceb98010c1c18b6b54a35b52fd5b46905e3d7f/tools/cache/store.go#L77 will make sure the key is "sync status"

@@ -153,7 +152,7 @@ Error loading new template: %v

n.t = template
glog.Info("New NGINX configuration template loaded.")
n.SetForceReload(true)
n.syncQueue.Enqueue(task.GetDummyObject("template-change"), false)
Copy link
Member

@ElvinEfendi ElvinEfendi Jun 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for better readability should we create helper functions for these two operations? i.e something like

enqueueTask(obj)
enqueueSkippableTask(obj)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Done.

Copy link
Member

@ElvinEfendi ElvinEfendi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the changes in the PR 👍

But I could not figure out what's the edge case that existing forceReload functionality does not cover but this does. Can you elaborate?

@aledbf
Copy link
Member Author

aledbf commented Jun 18, 2018

What about forcereload we currently have? Is it not supposed to cover this case?

The force reload approach works only if we can enqueue objects. Without the new parameter in enqueue, a change in the configmap could be skipped. Also, to reach the force reload validation the model must be different. This is not true if there is no change in Ingress rules, that's the reason for the new checksum field in the configuration type.

@ElvinEfendi
Copy link
Member

The force reload approach works only if we can enqueue objects

When store.ConfigurationEvent happens at https://github.com/kubernetes/ingress-nginx/pull/2656/files?utf8=%E2%9C%93&diff=unified#diff-cde3fffe2425ad7efaa8add1d05ae2c0R310 we setForceReload and then enqueue that event object at https://github.com/kubernetes/ingress-nginx/pull/2656/files?utf8=%E2%9C%93&diff=unified#diff-cde3fffe2425ad7efaa8add1d05ae2c0L317. So in case of configuration update, this requirement is met.

Without the new parameter in enqueue, a change in the configmap could be skipped.

In case of configmap, how can this happen? In the code I see if t.lastSync > item.Timestamp {. Why would that not be enough specifically for configmap events, but it is enough for Ingress/Endpoints events?

Also, to reach the force reload validation the model must be different

Could you elaborate? 🤔 from what I read, force reload forces the sync method to always execute n.OnUpdate(pcfg) regardless of the model difference.

Timestamp: ts,
Key: key,
Timestamp: ts,
IsSkippable: skippable,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we piggyback on Timestamp field to implement "skippable" logic? For example if we want to enqueue an event that's not skippable we can set its Timestamp to current time + an hour. This way t.lastSync > item.Timestamp will not hold, and worker won't skip the event.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -311,10 +308,11 @@ func (n *NGINXController) Start() {
if evt, ok := event.(store.Event); ok {
glog.V(3).Infof("Event %v received - object %v", evt.Type, evt.Obj)
if evt.Type == store.ConfigurationEvent {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's special with this event type? Why can we not simply enqueue a skippable task?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because event could be en endpoint, service, ingress, secret or configmap but only on a change in a configmap should escape the enqueue skippable logic

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but only on a change in a configmap should escape the enqueue skippable logic

This is the part I don't understand completely. My current understanding why we have to force reload in this case is because in syncIngress function the model we use currently does not include this configmap data, and therefore when the change is only about configmap it does not regenerate the Nginx configuration.

But in this PR you are adding a new field to that model (ConfigurationChecksum) which to my understanding means we don't need this special case anymore when event type is ConfigurationEvent.

@aledbf
Copy link
Member Author

aledbf commented Jun 19, 2018

Could you elaborate? thinking from what I read, force reload forces the sync method to always execute n.OnUpdate(pcfg) regardless of the model difference.

To reach the syncIngress function an item must be processed by the syncQueue. In this queue we use a time window to handle changes using a batch approach to avoid multiple reloads. This means in some scenarios, like simultaneous updates, we could skip a reload. This means we would never reach the isForceReload check.

@aledbf
Copy link
Member Author

aledbf commented Jun 19, 2018

@ElvinEfendi you can tests this easilly following this example #2567 (comment)

@ElvinEfendi
Copy link
Member

@ElvinEfendi you can tests this easilly following this example #2567 (comment)

@aledbf as you can see in that example, the controller logs

I0612 14:43:25.298234       6 event.go:218] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"nginx", Name:"nginx-configuration", UID:"75927380-6e41-11e8-86c0-025000000001", APIVersion:"v1", ResourceVersion:"82157", FieldPath:""}): type: 'Normal' reason: 'UPDATE' ConfigMap nginx/nginx-configuration
I0612 14:43:25.304823       6 controller.go:168] backend reload required
I0612 14:43:25.478106       6 controller.go:178] ingress backend successfully reloaded...

which comes from syncIngress. Does this not tell you that isForceReload check worked as expected?

@aledbf
Copy link
Member Author

aledbf commented Jun 20, 2018

which comes from syncIngress. Does this not tell you that isForceReload check worked as expected?

How are you using for the test, this PR or 0.15.0?
(with 0.15.0 even with setforcereload, the reload doesn't happen)

@ElvinEfendi
Copy link
Member

ElvinEfendi commented Jun 20, 2018

@aledbf I have not run the test myself, that's just the logs from @antoineco's test in the comment, does it matter though?

with 0.15.0 even with setforcereload, the reload doesn't happen

Can this be a race issue? I see that https://github.com/kubernetes/ingress-nginx/blob/master/internal/ingress/controller/store/store.go#L178 is not synchronized and it is being written from one goroutine (https://github.com/kubernetes/ingress-nginx/blob/master/internal/ingress/controller/store/store.go#L700) and red from another goroutine (https://github.com/kubernetes/ingress-nginx/blob/master/internal/ingress/controller/nginx.go#L427)

And you confirmed that this PR fixes the issue?

@aledbf
Copy link
Member Author

aledbf commented Jun 20, 2018

Can this be a race issue?

No, but I will add a mutex to avoid this from happening.

@aledbf
Copy link
Member Author

aledbf commented Jun 20, 2018

I have not run the test myself, that's just the logs from @antoineco's test in the comment.

If you check the generated nginx.con you will see that the reload didn't change the configuration file.

@ElvinEfendi
Copy link
Member

I tried to test this branch but I'm getting

E0620 12:20:44.485143       5 controller.go:174] Unexpected failure reloading the backend:
template: nginx.tmpl:10:33: executing "nginx.tmpl" at <$all.Cfg.Checksum>: can't evaluate field Checksum in type config.Configuration

Also, this issue #2567 seems to be specifically about whitelist-source-range change. I tried to customize error-log-level and it worked as expected. This again tells me the issue is somewhere else than this PR is changing.

@aledbf aledbf force-pushed the queue branch 2 times, most recently from 7bc5b3a to 0ec99f8 Compare June 20, 2018 17:05
@@ -311,10 +308,11 @@ func (n *NGINXController) Start() {
if evt, ok := event.(store.Event); ok {
glog.V(3).Infof("Event %v received - object %v", evt.Type, evt.Obj)
if evt.Type == store.ConfigurationEvent {
n.SetForceReload(true)
n.syncQueue.EnqueueTask(task.GetDummyObject("configmap-change"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why this special case is necessary after you added checksum to the model? Please also refer to https://github.com/kubernetes/ingress-nginx/pull/2656/files#r196736703.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why this special case is necessary after you added checksum to the model?

If we don't add this we could skip one update (don't adding an element to the queue)

key := k8s.MetaNamespaceKey(ingKey)
ing, err := store.GetIngress(key)
if err != nil {
glog.Errorf("could not find Ingress %v in local store", key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "not found" the only error GetIngress returns? Would it be useful to include err message itself in the log as well?

@@ -479,6 +479,18 @@ func New(checkOCSP bool,
if key == configmap {
store.setConfig(cm)
}

ings := store.listers.IngressAnnotation.List()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is going to fix the issue, it seems hacky to me. I think the root of the issue is the way IP whitelisting is implemented at

defBackend := a.r.GetDefaultBackend()

This is adding an extra cost that we can avoid by fixing root cause of the issue:

Reading https://github.com/kubernetes/ingress-nginx/blob/master/internal/ingress/annotations/ipwhitelist/main.go#L80 what it's doing is if whitelist-source-range annotation is not set then we use whitelist-source-range value from the configmap otherwise we use the one from the annotation. IMO that logic should not be part of the annotation parsing - annotation parsing should not be concerned about configmap. I suggest we move that logic into the template, in other words in the template we first check location.Whitelist.CIDR if it is not empty we then use that to generate geo ... config otherwise we use $cfg.WhitelistSourceRange to do so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I share that opinion, the improvement to the config detection is independent from the issue related to whitelist-source-range, this should be made clearer via a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antoineco my comment was mainly about the fact that this fix is a hack, it is not fixing the root of the issue and unnecessarily introducing extra work on configmap updates.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is going to fix the issue, it seems hacky to me. I think the root of the issue is the way IP whitelisting is implemented at

This approach is used in all the annotation parsing step. Changing that in this PR is not going to happen.

IMO that logic should not be part of the annotation parsing - annotation parsing should not be concerned about configmap

You are right here, this should not be part of the annotation parsing but right now we have no alternative. This is the most important reason why I want to move to CRDs and don't use annotations anymore, not only because is complex but it doesn't scale and it's impossible to add the semantics we need in the configuration options (like a list of IP addresses and not a comma-separated string that can contain anything)

I suggest we move that logic into the template

No, that would make that impossible to understand, even now the templating step contains too much logic.


Expect(checksum).NotTo(BeEquivalentTo(newChecksum))
})
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loving this e2e test 😍

@aledbf aledbf merged commit aec40c1 into kubernetes:master Jun 21, 2018
@aledbf aledbf deleted the queue branch June 21, 2018 22:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ingress controller is picking up a change in nginx config map, but it takes a while to apply it to nginx.conf
5 participants