Ingress Controller forwarding traffic to a POD(IP) even after termination #7330

Closed
alopsing opened this issue Jul 8, 2021 · 11 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@alopsing

alopsing commented Jul 8, 2021

NGINX Ingress controller version: 0.29.0

Kubernetes version: 1.19

Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-18T16:12:00Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.8-eks-96780e", GitCommit:"96780e1b30acbf0a52c38b6030d7853e575bcdf3", GitTreeState:"clean", BuildDate:"2021-03-10T21:32:29Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}

Environment: Production
Cloud provider or hardware configuration: Amazon EKS
OS: Linux

What happened:

As soon as we delete a pod, we see 502 errors from the NGINX Ingress controller. It is only a short blip, but it produces a significant number of errors in production.

Log message:

2021/07/08 19:44:28 [error] 43#43: *5131 connect() failed (111: Connection refused) while connecting to upstream, client: <CLIENT_IP>, server: <SERVER_NAME>, request: "POST /api/v1/create HTTP/1.1", upstream: "http://10.53.24.125:8080/api/v1/create", host: "<SERVER_NAME>"

Note that the upstream IP in the NGINX log above (10.53.24.125) is the IP of the pod that was just deleted.

What you expected to happen:

When a pod is deleted, the NGINX Ingress controller should not forward requests to the deleted pod's IP, but it appears to keep the pod IP cached, which should not be the case.
To isolate the problem, we accessed the Kubernetes Service via port-forward and saw no issues there.
It was only the NGINX controller that reported 502 errors.

How to reproduce it:

When there is a reasonable amount of load on the application, delete a pod in the deployment and you should instantly see the errors mentioned above. (We were serving about 200 TPS when this happened.)
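For reference, a rough reproduction sketch (the hostname, namespace, and pod name below are placeholders, not from our setup): keep POSTing through the ingress while a backend pod is deleted, and log any 5xx responses.

# placeholder host/names; run the request loop, then delete a backend pod
while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' -X POST https://<INGRESS_HOST>/api/v1/create)
  [ "$code" -ge 500 ] && echo "$(date +%T) got HTTP $code"
  sleep 0.05
done &
kubectl -n <namespace> delete pod <backend-pod>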

Anything else we need to know:

  • We are running NGINX Ingress Controller v0.29.0 on an EKS 1.19 cluster. We also tried upgrading to v0.33 and v0.45, but the issue still exists.

  • We tried updating the ConfigMap with the settings below, but with no luck:

ssl-session-cache-size: 100m
ssl_session_cache: 'off'

/kind bug

@alopsing alopsing added the kind/bug Categorizes issue or PR as related to a bug. label Jul 8, 2021
@longwuyuan
Contributor

/remove-kind bug
/triage needs-information

  • Right before deleting the pod manually, open another terminal and watch the Endpoints (ep) objects for updates to the address list (see the watch example at the end of this comment)
  • Please post all log messages from right around the time of the manual deletion
  • Does this also happen when you scale down?

To reproduce this problem, please add information here, including:

  • kubectl get deploy,po,svc,ing -A -o wide
  • kubectl -n describe deploy
  • kubectl -n describe pod
  • kubectl -n describe svc
  • kubectl -n describe ing
  • kubectl -n get events
  • Any other related information
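
For example, a minimal way to watch the Endpoints of the affected Service while the pod is being deleted (namespace and service name are placeholders):

kubectl -n <namespace> get endpoints <service-name> --watch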

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 9, 2021
@toredash
Contributor

Pretty sure this is related to the refresh cycle in nginx, which happens every second.

Please try adding a preStop hook with a "sleep 10" command to the affected deployment, and terminate a pod after the change is applied. I'm pretty sure this will point you to the real issue, which isn't really nginx-controller related.
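
For reference, a minimal sketch of that suggestion on a hypothetical Deployment (the names, image, and port are placeholders; it assumes the image ships /bin/sh, and terminationGracePeriodSeconds must stay longer than the preStop sleep):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                            # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 30   # keep this longer than the sleep below
      containers:
        - name: my-app
          image: my-app:latest            # placeholder image
          ports:
            - containerPort: 8080
          lifecycle:
            preStop:
              exec:
                # keep the pod serving briefly so the controller's periodic
                # backend sync drops it from the upstream list before SIGTERM
                command: ["/bin/sh", "-c", "sleep 10"]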

@tao12345666333
Member

This is a general question about graceful shutdown.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 10, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 9, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

(the triage bot's /close comment above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@narqo
Contributor

narqo commented Jan 21, 2022

@toredash @tao12345666333 could you elaborate on why you think this is related to an issue on the application's graceful-shutdown side?

From the Kubernetes docs on Pod Lifecycle:

[..] At the same time as the kubelet is starting graceful shutdown, the control plane removes that shutting-down Pod from Endpoints (and, if enabled, EndpointSlice) objects where these represent a Service with a configured selector.

I can confirm that by "watching" the Endpoints from the application: the Pod's IP is removed from the list right when the termination call happens.

ingress-nginx's own docs say that the controller uses Endpoints objects to build/rebuild the upstreams.

Based on these two facts, it's not clear how a slow shutdown of the application could cause the described issue. I expected the controller to react to the change in Endpoints right after the Pod's termination request, and to remove that Pod's IP from the upstream.

@toredash
Contributor

@narqo
Did you try the suggested change to add a preStop hook to the affected deployment?

nginx will poll the k8s API at a 1s interval for an updated Endpoints list.
Source:

local BACKENDS_SYNC_INTERVAL = 1

For a large number of endpoints, it can take time to compute the new list and reload the nginx process. As a result, it can take up to a second (or more) for nginx to detect a deleted backend/Pod and stop forwarding traffic to it.
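
For context, roughly how that interval is used inside the controller's Lua balancer (a paraphrased sketch, not the exact upstream source): a recurring ngx timer re-syncs the backend list once per second, so an endpoint deleted between ticks keeps receiving traffic until the next sync completes.

local BACKENDS_SYNC_INTERVAL = 1

local function sync_backends()
  -- fetch the backend/endpoint list most recently pushed by the controller
  -- and rebuild the in-memory balancers used to pick upstream peers
end

local function init_worker()
  sync_backends()  -- sync once immediately at worker start
  -- then re-sync on a fixed 1-second timer; anything that changes between
  -- ticks (e.g. a deleted pod) is only picked up on the next tick
  local ok, err = ngx.timer.every(BACKENDS_SYNC_INTERVAL, sync_backends)
  if not ok then
    ngx.log(ngx.ERR, "error setting up timer.every for sync_backends: ", err)
  end
end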

From the Kubernetes docs on Pod Lifecycle:

[..] At the same time as the kubelet is starting graceful shutdown, the control plane removes that shutting-down Pod from Endpoints (and, if enabled, EndpointSlice) objects where these represent a Service with a configured selector.

I can confirm that by "watching" the Endpoints from the application: the POD's IP is removed from the list right on the terminate call.

This is also true, but I think you are assuming/expecting that changes are reflected in your environment instantaneously, which is not something the code even tries to do.

ingress-nginx's own docs say, the controller uses Endpoints objects to build/re-build the upstreams.

That is true, and I don't see that anyone has stated otherwise in this issue.

I expected controller to react on the change in Endpoints right after the POD's termination request has happened, and to
remove this POD's IP from the upstream.

It does, but I believe your expectations are not aligned with how the code works at the moment. Please look at the Lua code mentioned above regarding the backend sync. The code does not attempt to detect backend changes in real time.

@jason1004

Had the same problem after restarting the pod; how can I fix it now?

@jason1004

Solved. The reason was that the ingress controller pod had run out of disk space; there is an error log:

I1217 08:00:10.109340 8 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"aaaa-nginx-ingress-controller-xxxxx", UID:"xxxxxxxxxxxx", APIVersion:"v1", ResourceVersion:"xxxxxx", FieldPath:""}): type: 'Warning' reason: 'RELOAD' Error reloading NGINX: write /etc/nginx/opentracing.json: no space left on device
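
A quick way to check for this condition (placeholder namespace and pod name; assumes df is available in the controller image):

kubectl -n <namespace> exec <ingress-nginx-controller-pod> -- df -h /etc/nginx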
