Since v0.25.0 (maybe?), memory footprint increased by factor of 7 (0.24.1 to 0.26.1, no other change) #4629
Comments
@DaveAurionix did you just update the image when you upgraded from 0.24.1 to 0.26? If you want to test the exact same configuration, change https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/configmap/#worker-shutdown-timeout to
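For reference, worker-shutdown-timeout is a key in the controller's nginx-configuration ConfigMap. A minimal sketch of that change, assuming the pre-0.25 value of 10s that is mentioned later in this thread (the exact value suggested in this comment is not preserved):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
data:
  # 10s matches the pre-0.25 behaviour referenced later in this thread; the exact
  # value suggested here is an assumption.
  worker-shutdown-timeout: "10s"
```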
@aledbf Thank you for responding. No, we also updated the mandatory.yaml to include the changes between 0.24.1 and 0.26.1 (such as the graceful shutdown changes and, from memory, I think the RBAC permissions changed a little). We did read the changelog but may have missed something.

It's worth elaborating on the brief comment above: we waited a minimum of ten minutes in our tests to see if the memory usage would drop when old workers terminated. I believe the workers should terminate after a maximum of 5 minutes, if I've understood the changelog notes and yaml change right? We were seeing high RAM usage immediately on startup as config is first applied, and it wouldn't drop.

I want to do a test with ModSecurity disabled again because I suspect we'll see that the RAM usage is normal without ModSecurity enabled. I'll add a note to confirm that either way when I've done that test. I'll also do a test with the worker termination time back at 10s as a double-check. Even so, from 250MB to 1855MB is quite a jump...
Yes, I agree with that, but the combination of enabling modsecurity and the new worker-shutdown-timeout is the reason for that. Please also add the mimalloc feature in the ingress controller deployment; that will reduce the memory (modsecurity on or off) by ~15%.
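For context, on the 0.26.x images mimalloc is enabled by preloading the library via an environment variable on the controller container. A rough sketch; the library path is an assumption about the image layout and should be checked against the docs for the version in use:

```yaml
# Fragment of the controller Deployment (sketch only).
spec:
  template:
    spec:
      containers:
        - name: nginx-ingress-controller
          env:
            # Assumed path to the bundled mimalloc library in the 0.26.x image.
            - name: LD_PRELOAD
              value: /usr/local/lib/libmimalloc.so
```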
First test using this extra line in the configmap:
The pods are being killed before initialisation completes because the memory limit is being hit. 0.24.1 was around 250MB with this same config. I'll try with ModSecurity off next.
With ModSecurity off, the pods initialise much faster with these average stats (worker-shutdown-timeout is back up to 300s for this test):
I'm investigating our rule customisations in ModSecurity (although Ingress-NGINX 0.24.1 was OK with them).
With ModSecurity on, but no changes to the shipped configuration, RAM goes high again (~1900Mi) and the pods struggle to initialise. The pod logs show this repeating a few times, but I'm not sure if it's a symptom of memory starvation rather than part of the problem.
@aledbf The only test left is the mimalloc change you mentioned, but given the finding above around ModSecurity, I'm not sure the test is worth the time yet? I'm fairly sure we're seeing that Ingress-NGINX 0.24.1 would load ModSecurity OK, but 0.26.1 won't. I'm equally sure the picture is not that simple or others would have found it sooner, so I know I'm missing something. Thoughts?
Yes, please add the flag --v=2 to the ingress controller to see the reason for the reload. Also, what is the number of NGINX workers?
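For anyone following along, the verbosity flag goes on the controller's container args in the Deployment; a minimal sketch:

```yaml
# Fragment of the controller Deployment: raise log verbosity so reload reasons
# show up in the controller logs.
containers:
  - name: nginx-ingress-controller
    args:
      - /nginx-ingress-controller
      - --v=2
```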
It's worth noting that in 0.24.1 we "turn on" ModSecurity in a slightly quirky way to improve memory usage. For the 0.26.1 tests we removed this server snippet because Ingress-NGINX is now loading ModSecurity earlier. Maybe this is the bug? Is it now loading ModSecurity host-wide instead of server-wide? 0.24.1 server-snippet:
0.26.1 server snippet:
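Neither of those snippets is reproduced in this thread. Purely as a hypothetical illustration of the kind of server-level ModSecurity snippet being discussed (not the reporter's actual configuration; the rules file path is an assumption about the image layout), such a snippet is typically applied via an Ingress annotation:

```yaml
# Hypothetical illustration only - not the snippet from the original report.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/server-snippet: |
      modsecurity on;
      modsecurity_rules_file /etc/nginx/owasp-modsecurity-crs/nginx-modsecurity.conf;
```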
Will do. Workers were left at the default, which seemed to be 2 per pod.
To be clear: the 1900Mi RAM stats were per-pod.
--v=2 gave this:
The dump of the successful configuration did include a @custom_upstream-default-backend_500 location section, so I assume NGINX is immediately being triggered to re-configure itself and then failing on this second attempt shown in the log lines above. I'm not aware of anything changing configmaps or ingresses to trigger this re-config.
@aledbf I think I've found it. I'm not a ModSecurity expert, but I think rules are now being loaded once inside EVERY location block (since 0.25.0 maybe?). In the dumped successful config block, every location block contains this line:

In 0.24.1, turning on modsecurity in the configmap but only loading rules through the server snippet I posted before means that the rules are only loaded once (I think - it might be once per host, which is not as good as once for the server but better than once per location). We can't use that snippet in 0.26.1 because Ingress-NGINX is now loading the rules itself in every location. It didn't do that in 0.24.1. Does that seem to make sense?
@DaveAurionix the logic is this https://github.com/kubernetes/ingress-nginx/blob/master/rootfs/etc/nginx/template/nginx.tmpl#L1082
This means the modsecurity feature is not enabled in the configmap |
Sorry, to clarify: the configmap still features this top-level toggle:

I've only read that linked source code in a hurry, but I suspect that the "$all.Cfg.EnableModsecurity" parameter is coming from the configmap. When we toggle that documented configmap value to "true", RAM usage goes to 1900Mi. When we set it to "false", RAM usage drops to 88Mi. On 0.24.1 the RAM usage was 250Mi with that value "true".
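The documented ConfigMap toggle being referred to here is enable-modsecurity, which maps to $all.Cfg.EnableModsecurity in the template; schematically (the rest of the reporter's ConfigMap isn't shown):

```yaml
data:
  enable-modsecurity: "true"   # ~1900Mi per pod in these tests; "false" gave ~88Mi
```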
@DaveAurionix to test if the load of the configuration is the problem,
Using the server-snippet annotation in the host, you need to use it like this:
This change loads the rules only once per server block.
@aledbf That would be amazing, thank you. I'll try that as soon as I can and will get back to you (might be a day or two now).
@aledbf Sorry I've just had time to go through the code that you linked to. I understand what you mean now. The exact output we're seeing in the logs is interesting in that regard. We are only seeing this once at http level (which is good):
We are seeing exactly this repeated at each location level (note: no
I think we're hitting this line: https://github.com/kubernetes/ingress-nginx/blob/master/rootfs/etc/nginx/template/nginx.tmpl#L1094
We're not setting
I'll try the custom template test and will get back to you.
Adding

However @aledbf, I removed the whole section https://github.com/kubernetes/ingress-nginx/blob/nginx-0.26.1/rootfs/etc/nginx/template/nginx.tmpl#L1078 to line 1096 (slightly broader than the snippet you suggested) and the RAM usage dropped to ~100Mi per pod. I couldn't add the server snippet directly because Ingress-NGINX was already adding the rules at server level, so we had conflicting rule errors on initialisation.

After a lot of manual testing we've finally cracked it. The workaround is to add

I think the bug is on this line: https://github.com/kubernetes/ingress-nginx/blob/nginx-0.26.1/rootfs/etc/nginx/template/nginx.tmpl#L1089

That line of code is saying that if OWASP core rules are not enabled and there are no location-snippet rules (and $location.ModSecurity.OWASPRules is true), then include all of the OWASP core rules anyway. This means that if people turn on ModSecurity but don't turn on rules, one copy of the rules is included in every location block, which consumes a huge amount of RAM and is not at all expected behaviour. The expected behaviour (in my opinion) is that if people turn on ModSecurity but don't turn on rules, no rules are included.

I'm trying to think of a suggested fix. From our tests it appears that $location.ModSecurity.OWASPRules is evaluating to true by default, causing these default rules to be included. Do you know what might be causing that? Our Ingress resource doesn't mention ModSecurity now, and our ConfigMap is as posted above. I don't know where that is set.

On a positive note, the changes in 0.25.0 to load ModSecurity efficiently at http level work brilliantly if both those flags are set to true, and we're seeing a 50% drop in base memory footprint as a result (~250MB with 0.24.1 to ~110MB with 0.26.1 and both configmap flags true).
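As I read the workaround described above, in ConfigMap form it amounts to enabling both documented flags so the rules are loaded once at http level rather than once per location:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
data:
  # Both flags on: ModSecurity plus the OWASP core rule set, loaded once at http level.
  enable-modsecurity: "true"
  enable-owasp-modsecurity-crs: "true"
```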
I can confirm that this action drops memory usage from 4 GB down to 320 MB. @DaveAurionix THANK YOU!
Somehow, this is weird but true. Previously, either OWASP was disabled globally and still rendered in all locations, or it was enabled globally. This commit fixes the logic issue by correcting the and-clause in the if-statement. This reduces the baseline resource usage when modsecurity is enabled globally.
Is this a request for help?: Not in the short-term, although this is blocking us from upgrading past version 0.24.1
What keywords did you search in NGINX Ingress controller issues before filing this one?: memory, RAM, ModSecurity, footprint
Is this a BUG REPORT or FEATURE REQUEST?: Bug report
NGINX Ingress controller version: 0.26.1
Kubernetes version: 1.14.6
Environment:
Cloud provider or hardware configuration: Azure Kubernetes Service
OS: Ubuntu 16.04
Kernel (e.g. uname -a): Linux nginx-ingress-controller-756c5867f6-hnpk8 4.15.0-1059-azure #64-Ubuntu SMP Fri Sep 13 17:02:44 UTC 2019 x86_64 GNU/Linux
Install tools: kubectl apply of mandatory.yaml and cloud-generic.yaml.
Others:
What happened: We upgraded NGINX Ingress from 0.24.1 to 0.26.1. The pods were continuously being OOMKilled. When we upped the memory request from 512MB to 2560MB, the pods would start. Anecdotally, we tried 0.25.0 a while back and aborted due to pod startup failures; on reflection it may have been the same issue (equally, it may not have been).
What you expected to happen: We expected 0.26.1 to use LESS memory due to ModSecurity being configured at server-level since 0.25.0 IIRC (#4091), not location-level as it is in 0.24.1
How to reproduce it (as minimally and precisely as possible): Enable ModSecurity, configure one ingress resource with about 50 paths split across about 7 hosts, deploy a 0.24.1 controller and check RAM usage, update to 0.26.1 and re-check RAM usage.
Anything else we need to know:
kubectl top pod on 0.24.1 shows 2 pods with RAM 249Mi, 247Mi
kubectl top pod on 0.26.1 shows 2 pods with RAM 1855Mi, 1852Mi - these don't reduce even after 10 minutes of waiting
The nginx-configuration configmap we use is as follows:
The Ingress resource configures TLS to reference a certificate shared by about 7 hosts, with a total of 50 paths spread across those hosts. Some paths reference the same upstream service.
In mandatory.yml we added two container arguments (default-backend-service and default-ssl-certificate), the result being:
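The exact manifest isn't reproduced here; the shape of those two extra arguments on the controller container would be roughly as below, with placeholder service and secret names standing in for the report's actual values:

```yaml
# Rough sketch - names after the $(POD_NAMESPACE)/ prefix are placeholders.
containers:
  - name: nginx-ingress-controller
    args:
      - /nginx-ingress-controller
      - --configmap=$(POD_NAMESPACE)/nginx-configuration
      - --default-backend-service=$(POD_NAMESPACE)/custom-default-backend
      - --default-ssl-certificate=$(POD_NAMESPACE)/shared-tls-certificate
```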
We also configure resource requests and limits that we discover from performance tests.
Whilst writing this I've realised that the memory increase is a factor of 7 and we have 7 hosts configured in the Ingress resource. It might just be coincidence, but I wonder if something that is server-wide in 0.24.1 has moved to host-wide in 0.26.1, causing the RAM to increase so significantly?