Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove temporal configuration file after a while #3507

Merged
merged 1 commit into from
Jan 4, 2019

Conversation

aledbf
Copy link
Member

@aledbf aledbf commented Dec 3, 2018

Which issue this PR fixes: fixes #3273

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 3, 2018
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 3, 2018

// In case of error the temporal configuration file will
// be available up to five minutes after the error
go func(tmpFile string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you limit the number of max possible goroutines so that we don’t potentially DOS ourselves.

Copy link
Member Author

@aledbf aledbf Dec 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ElvinEfendi actually, the goroutines will sleep 99% of the time, there is no gain in doing what you are proposing. Also, that approach will add unnecessary complexity (we need some queue aware of time)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my worry:
Given a working ingress-nginx deployment in a cluster, someone makes a change that causes Nginx config test to fail. The controller in that case will log an error and spin up a goroutine to delete the temporary file. At the same time controller will retry generating new Nginx configuration. This will cause another goroutine to be span up. And eventually the controller will get OOMKilled (because we retry faster than 5 minutes the goroutines will accumulate). And this will break all the services.

we need some queue aware of time

why can we not just check creation time of the file? Instead of spinning up a new goroutine here we can collect the filenames in a queue and at the start of controller we can start a "garbage collector goroutine" that periodically pops from the queue and checks if the file was created longer than 5 minutes ago and if so delete it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can we not just check creation time of the file?

Good point. I will refactor the PR to check the timestamp, starting only one goroutine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ElvinEfendi ready for review

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my worry:
Given a working ingress-nginx deployment in a cluster, someone makes a change that causes Nginx config test to fail. The controller in that case will log an error and spin up a goroutine to delete the temporary file. At the same time controller will retry generating new Nginx configuration. This will cause another goroutine to be span up. And eventually the controller will get OOMKilled (because we retry faster than 5 minutes the goroutines will accumulate). And this will break all the services.

we need some queue aware of time

why can we not just check creation time of the file? Instead of spinning up a new goroutine here we can collect the filenames in a queue and at the start of controller we can start a "garbage collector goroutine" that periodically pops from the queue and checks if the file was created longer than 5 minutes ago and if so delete it.

@ElvinEfendi @aledbf I think I'm seeing this behavior in nginx-ingress [0].
it can be reproduced like this:

  1. Create a service that listens to let's say 9500 and expose TCP port 9500
  2. make the pod go to crashloopback (eg with a command sleep 2s)
  3. the nginx-ingress-controller will go to status OOMKilled

Shall I create a new issue?

[0]
Release: 0.23.0
Build: git-be1329b22)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With these two manifests you can reproduce it:
nginx-ingress-manifest: https://gist.github.com/strigazi/2fd6a4f139b146f11d50f93811de490c
example-noop-server.yaml: https://gist.github.com/strigazi/d1d11f1efe91ce2045bf5d7517659924

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this a a reproducible issue for me, I'll open an new github issue so that other can find.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 2, 2019
return errors.New(oe)
}

os.Remove(tmpfile.Name())
return nil
return os.Remove(tmpfile.Name())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the behaviour when Nginx config test is successful but for some reason os.Remove returns an error. And in that case controller will re-enqueue the sync event. IMO the old behaviour was better. We can log a warning or something when os.Remove fails to make things better but as long as config test is successful this function should return nil (success).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change reverted

t.Fatal(err)
}

tmpfile, err := ioutil.TempFile("", tempNginxPattern)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better coverage I'd also create another temp file with creation time less than 5m and assert that it does not get deleted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

t.Fatal(err)
}

tmpfile, err = ioutil.TempFile("", tempNginxPattern)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done before calling cleanTempNginxCfg()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, fixed

@ElvinEfendi
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 4, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf, ElvinEfendi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit c326c55 into kubernetes:master Jan 4, 2019
@aledbf aledbf deleted the remove-temporal branch January 5, 2019 00:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

the nginx-cfg file under tmp part will not be cleaned up
4 participants