Dynamic Configuration does not balance to all backends #3290
I have since tried this with version 0.20.0 and the balancing behaviour still seems very strange. Sometimes only one backend never gets traffic, sometimes 2 or 3. There seems to be no pattern to it.
@rlees85 given you are seeing the above uneven load balancing, can you provide your Nginx configuration and the output showing the configured backends? Also, are you seeing any Nginx errors/warnings in the logs when this happens? I cannot reproduce this; as you can see, all 1000 requests are distributed almost evenly across all 10 available replicas.
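For anyone wanting to run the same kind of distribution check, a minimal sketch is below; the hostname and the `pod_name=` field in the response are assumptions, not something from this thread (it presupposes an echo-style backend that reports which pod served each request):

```sh
# Sketch of a distribution check. Assumptions (not from the thread):
# the ingress answers at http://app.example.com/ and the backend echoes
# a "pod_name=..." field identifying the pod that served the request.
for i in $(seq 1 1000); do
  curl -s http://app.example.com/ | grep -o 'pod_name=[^ ]*'
done | sort | uniq -c | sort -rn
```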
This is using 0.18.0 and Round Robin. I could not reproduce this with latest master either.
Thanks for the response! The extra debug step to show the backends is going to be really useful. I'm away at the moment but will get all the requested information on Monday. With a bit of luck I'll have just done something stupid; if that is the case I will give details and close.
I've re-setup this environment and am still having problems.
{
"name": "cloud-dt1-hybris-storefront-8081",
"service": {
"metadata": {
"creationTimestamp": null
},
"spec": {
"ports": [
{
"name": "hybris-http",
"protocol": "TCP",
"port": 8081,
"targetPort": 8081
}
],
"selector": {
"app.kubernetes.io/instance": "storefront",
"app.kubernetes.io/name": "hybris",
"app.kubernetes.io/part-of": "hybris"
},
"clusterIP": "10.80.4.10",
"type": "ClusterIP",
"sessionAffinity": "None"
},
"status": {
"loadBalancer": {}
}
},
"port": 8081,
"secure": false,
"secureCACert": {
"secret": "",
"caFilename": "",
"pemSha": ""
},
"sslPassthrough": false,
"endpoints": [
{
"address": "10.80.148.2",
"port": "8081",
"maxFails": 0,
"failTimeout": 0
},
{
"address": "10.80.184.2",
"port": "8081",
"maxFails": 0,
"failTimeout": 0
},
{
"address": "10.80.236.2",
"port": "8081",
"maxFails": 0,
"failTimeout": 0
},
{
"address": "10.80.236.3",
"port": "8081",
"maxFails": 0,
"failTimeout": 0
}
],
"sessionAffinityConfig": {
"name": "cookie",
"cookieSessionAffinity": {
"name": "route",
"hash": "sha1",
"locations": {
"_": [
"/"
]
}
}
}
}
{
"name": "upstream-default-backend",
"service": {
"metadata": {
"creationTimestamp": null
},
"spec": {
"ports": [
{
"protocol": "TCP",
"port": 80,
"targetPort": 8080
}
],
"selector": {
"app.kubernetes.io/instance": "storefront",
"app.kubernetes.io/name": "default-http-backend",
"app.kubernetes.io/part-of": "nginx"
},
"clusterIP": "10.80.18.162",
"type": "ClusterIP",
"sessionAffinity": "None"
},
"status": {
"loadBalancer": {}
}
},
"port": 0,
"secure": false,
"secureCACert": {
"secret": "",
"caFilename": "",
"pemSha": ""
},
"sslPassthrough": false,
"endpoints": [
{
"address": "10.80.156.7",
"port": "8080",
"maxFails": 0,
"failTimeout": 0
}
],
"sessionAffinityConfig": {
"name": "",
"cookieSessionAffinity": {
"name": "",
"hash": ""
}
}
}
Get Pods (restricted to namespace, as this is how I am running nginx in namespace-restricted mode):
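(The pod listing referenced here would come from something like the command below; the namespace name is a placeholder, since the report only says the controller runs in namespace-restricted mode:)

```sh
kubectl get pods -n hybris-storefront -o wide   # namespace is a placeholder
```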
Ignore Hybris Backoffice; that is handled by a separate ingress. nginx.conf (some bits omitted, marked with the word
Additional information: before and after from a few cURLs (only the first two pods received traffic; the second two were only hit by the Kubernetes healthcheck, twice).
You seem to have session affinity enabled for your app (there's a known load-balancing issue with the current implementation). Is that intentional? When session affinity is enabled, the load-balance annotation is ignored.
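For reference, cookie session affinity of the kind shown in the backend dump above is configured through Ingress annotations. A minimal sketch, matching the cookie name (route) and hash (sha1) from the dump; the Ingress name, host, and backend service are placeholders, and the load-balance annotation is included only to illustrate that it is ignored while affinity is on:

```yaml
apiVersion: extensions/v1beta1        # Ingress API group in use on Kubernetes 1.10
kind: Ingress
metadata:
  name: storefront                    # placeholder
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "route"
    nginx.ingress.kubernetes.io/session-cookie-hash: "sha1"
    nginx.ingress.kubernetes.io/load-balance: "round_robin"   # ignored while affinity is enabled
spec:
  rules:
  - host: storefront.example.com      # placeholder
    http:
      paths:
      - path: /
        backend:
          serviceName: hybris-storefront   # placeholder; the dump shows port 8081
          servicePort: 8081
```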
Anyone having this issue, please try the dev image.
We have the same issue and are using the same configs for stickiness. We're testing the above dev fix and we'll let you know how it goes.
Before and after the dev fix for sticky sessions, for 2 pods: before 16h40 we were using 0.20 and only one of the pods was receiving traffic (besides health checks) and thus had CPU load. After 16h40 we are running the dev version and traffic is well balanced, and so is CPU. We'll continue to run this dev version for now and see how it behaves over the next few days. Any idea when this will make it to a final/production release? Thanks!
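(For anyone reproducing this comparison: per-pod CPU can be checked quickly with kubectl top, assuming cluster metrics are available; the namespace is a placeholder:)

```sh
kubectl top pods -n hybris-storefront   # placeholder namespace; requires metrics-server/heapster
```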
The code is already merged in master (the dev image is built from master).
That is good news. To give more details on the query distribution before and after (queries per second on 2 different pods that are part of the same service): you can see that the second pod was only getting the health-check queries, not the client queries, before the dev fix. Afterwards, the queries were well distributed. What I cannot confirm so far (looking into this as we speak) is whether stickiness is still respected by the new code. I'm unsure if there is an automated test in the build that checks whether stickiness works properly, so I'm checking this manually to be safe.
Cookie-based stickiness still works well. All incoming connections that present a cookie are directed to the same pod, as they should be. I'll follow up again after a few days to confirm everything still works, but so far so good!
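For anyone repeating the manual check, a sketch using curl and the route cookie from the dump above; the hostname is a placeholder, and it assumes the backend response identifies the serving pod:

```sh
# First request: capture the affinity cookie set by the controller.
curl -s -c /tmp/cookies.txt http://storefront.example.com/ > /dev/null
grep route /tmp/cookies.txt

# Replay the cookie; with stickiness working, every request should land
# on the same pod (assumes the response contains a pod_name=... field).
for i in $(seq 1 20); do
  curl -s -b /tmp/cookies.txt http://storefront.example.com/ | grep -o 'pod_name=[^ ]*'
done | sort | uniq -c
```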
Great news. Yes, my stickiness is intentional; Hybris seems to work better if sessions are not passed around nodes any more than they need to be. I didn't know about the known issue with stickiness, but I'm happy to hear it's known and likely fixed.
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.):
What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.): dynamic, backend, upstream, ewma
Similar to #2797, but I am NOT using an external service.
Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG
NGINX Ingress controller version: 0.18.0
I understand this is not the latest, but 0.19.0 and 0.20.0 are broken in other ways (missing Prometheus metrics). Looking at the changelog for these versions, I don't see anything about fixes for missing backends.
Kubernetes version (use kubectl version): 1.10.9
Environment:
Kernel (e.g. uname -a): 4.18 (local); Kubernetes nodes: not sure
What happened:
Not all backends are being balanced to with Dynamic Configuration enabled. I have tried with round_robin and ewma.
What you expected to happen:
All backends to receive traffic.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know:
After enabling verbose logging I can see that all of my backends are being 'seen', just not used. As Dynamic Configuration is implemented entirely in Lua, troubleshooting past this point is pretty much impossible.
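For context, the verbose logging referred to here is typically enabled by raising the controller's --v flag (the documented levels at the time were roughly: --v=2 logs configuration diffs, --v=3 logs service/Ingress/endpoint changes, --v=5 puts NGINX in debug mode) and then grepping the logs; the pod and namespace names below are placeholders:

```sh
# Inspect what the controller reports about backend/upstream updates.
kubectl logs -n ingress-nginx nginx-ingress-controller-xxxxx | grep -i backend
```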
Reverting back to --enable-dynamic-configuration=false and least_conn balancing, everything works fine.
The graphs indicate CPU usage on the backends. With Dynamic Configuration only two ever get load. Without it, many more get load and the HPA starts scaling up as expected.
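For anyone needing the same workaround, a sketch of the two pieces involved: the flag goes in the controller Deployment's container args (exactly as quoted above), and the algorithm is set through the controller ConfigMap's load-balance key. The ConfigMap name and namespace below are placeholders for whatever your deployment already uses:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration      # placeholder; use the ConfigMap your controller already reads
  namespace: ingress-nginx       # placeholder namespace
data:
  load-balance: "least_conn"     # NGINX least_conn balancing once dynamic configuration is off
```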
Slightly concerned by the fact that Dynamic Configuration will be mandatory in the next release...