
[nginx] Lost client requests when updating a deployment and using keep alive #489

Closed
foxylion opened this issue Mar 23, 2017 · 23 comments

@foxylion
Contributor

foxylion commented Mar 23, 2017

I use the following setup to reproduce this issue:

  1. Set up an nginx ingress controller
  2. Add the following deployment (including services and ingress):
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: zero-downtime-test
spec:
  rules:
  - host: zero-downtime.domain.example
    http:
      paths:
      - path: /
        backend:
          serviceName: zero-downtime-test
          servicePort: 80
---
kind: Service
apiVersion: v1
metadata:
  name: zero-downtime-test
spec:
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: zero-downtime-test
---
kind: Service
apiVersion: v1
metadata:
  name: zero-downtime-test-node-port
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: zero-downtime-test
---
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: zero-downtime-test
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: zero-downtime-test
    spec:
      containers:
      - name: backend
        image: nginx:1.11
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "1"]
        livenessProbe:
          httpGet:
            path: /
            port: 80
            scheme: HTTP
        readinessProbe:
          httpGet:
            path: /
            port: 80
            scheme: HTTP
        ports:
        - containerPort: 80
          protocol: TCP
  3. Now the deployment is reachable on http://node:30086 or http://zero-downtime.domain.example.
  4. Set up two load tests (in my case JMeter) which access the backend either via the node port directly or through ingress.
  5. Update the zero-downtime-test deployment.
  6. The load test accessing the node port does not report any broken requests. The load test accessing through ingress reports failed requests for a short period.
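The failure pattern in step 6 can be sketched without a cluster. A client that reuses one keep-alive socket (as JMeter does) only sees errors when the server side drops that socket mid-use. This is a minimal sketch with a local stand-in server; the server, ports, and request counts are illustrative, not part of the original setup:

```python
import http.client
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 => keep-alive by default
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):  # keep the demo output quiet
        pass

# stand-in backend (hypothetical; in the real setup this is nginx behind ingress)
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# one persistent connection, reused for every request -- like JMeter with keep-alive on
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
ok = failed = 0
for _ in range(50):
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()
        resp.read()
        ok += resp.status == 200
    except (http.client.HTTPException, ConnectionError, OSError):
        failed += 1  # a dropped keep-alive socket surfaces here during a rollout
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
server.shutdown()
print(ok, failed)
```

While the server stays up, all requests succeed; the `failed` branch is what fires when a reloading nginx closes the keep-alive socket underneath the client.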

I think this may have something to do with nginx not correctly closing its connections to the backend when keep-alive is used.

The problem is not visible when disabling "keep alive" connections in JMeter.

@aledbf
Member

aledbf commented Mar 23, 2017

@foxylion this was reported a long time ago kubernetes-retired/contrib#1123
Please check the nginx answer https://trac.nginx.org/nginx/ticket/1022

@foxylion
Contributor Author

foxylion commented Mar 23, 2017

Thanks for the issue reference. I didn't look in the old issue tracker, sorry.

This makes me think the nginx inside the controller does not handle the closed connections of the pods it routes to very well.

JMeter handles everything fine when it communicates directly with the pods (over the node port), but fails when communicating through the controller (which is never restarted).

@yoanisgil

@aledbf Like @foxylion, I want to achieve zero-downtime deployments, and right now I'm kind of lost on how to do this. Is there a workaround? Maybe we should document this to make sure people know it is a problem they might face (especially since the Kubernetes docs state that it can do zero-downtime deployments).

@aledbf
Member

aledbf commented Mar 23, 2017

Kubernetes doc states that it can do zero downtime deployments

This is true. The real issue here is how nginx handles requests when keep-alive is enabled.

@foxylion
Contributor Author

foxylion commented Mar 23, 2017

The real issue here is how nginx handles requests when keep-alive is enabled.

Okay, so we are seeing the same issue. My question now is:
Is there anything we can do to fix this on the ingress-controller side?

If not, we should document this prominently, because no one would expect that ingress (with nginx) does not support the broad claim of Kubernetes being able to do zero-downtime deployments.

(I really hope there is anything we can do to allow this)

@yoanisgil

@aledbf from my very short experience using k8s, if you don't implement a technique like the one described here: https://www.chrismoos.com/2016/09/28/zero-downtime-deployments-kubernetes/

you won't get zero-downtime deployments out of the box. We did some tests on our side, and if we disable support for keep-alive connections on the server side, the approach described in that article does indeed work.

We're going to try shortening the keep-alive timeout to make sure it is far below the pod's termination grace period and see where that gets us.

@yoanisgil

@foxylion can you try reducing the keep-alive timeout to something well below the pod's termination grace period? In theory, this gives clients enough time to finish their connections.
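For the nginx ingress controller, the keep-alive timeout is configurable via the controller's ConfigMap (`keep-alive`, in seconds). A sketch, assuming the default 30-second `terminationGracePeriodSeconds`; the ConfigMap name and namespace below are illustrative and must match the controller's `--configmap` flag:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-ingress-controller   # illustrative; must match the controller's --configmap flag
  namespace: kube-system
data:
  # keep idle client connections well below the pod termination grace period
  keep-alive: "10"
```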

@foxylion
Contributor Author

foxylion commented Mar 23, 2017

We're going to try shortening the keep-alive timeout to make sure it is far below the pod's termination grace period and see where that gets us.

The thing is, the keep-alive timeout is defined as "seconds until an idle connection is closed". A connection that is in constant use will never be closed. So this will (in my opinion) not always work.

@yoanisgil

I don't see a reference to idle connections here:

http://nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_timeout

@foxylion
Contributor Author

foxylion commented Mar 23, 2017

@yoanisgil If their implementation is matching the RFC draft for the Keep-Alive header, then timeout should be interpreted as:

A host sets the value of the timeout parameter to the time that the host will allow an idle connection to remain open before it is closed. A connection is idle if no data is sent or received by a host.

There is also a 'max' parameter which defines the maximum number of requests that can be performed on a single connection. But I can't find anything about a 'max seconds'.
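For illustration, a server implementing that draft would advertise both parameters in its response headers; the values here are examples, not what nginx actually sends:

```http
HTTP/1.1 200 OK
Connection: keep-alive
Keep-Alive: timeout=5, max=100
```

Here `timeout=5` says the connection may be closed after 5 idle seconds, and `max=100` says at most 100 requests will be served over this connection.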

Also, you read things like this everywhere:

Internet Explorer reuses the same TCP/IP socket that was used to receive the initial request until the socket is idle for one minute.

@yoanisgil

@foxylion thanks for the information. We'll keep testing and see where that gets us. In the meantime it would be nice if someone with more in depth experience/knowledge from the k8s community could jump in and share a thought or two.

@aledbf
Member

aledbf commented Mar 24, 2017

@foxylion @yoanisgil to really fix this we need to avoid reloading nginx in order to update the upstreams.
Please check this comment #257 (comment)

@foxylion
Contributor Author

foxylion commented Mar 24, 2017

@aledbf I'm having a hard time understanding this. When I disable keep-alive on my client (JMeter) there are absolutely no issues with ingress nginx. Only when keep-alive is enabled does ingress nginx drop some of the requests during a rollout.

  • problem with keep-alive when: client -> node-port -> service -> nginx -> pod
  • no problem without keep-alive when: client -> node-port -> service -> nginx -> pod
  • no problem with keep-alive when: client -> node-port -> service -> pod

So the problem is that nginx drops requests when reloading on a configuration change, but only when the connections to nginx use keep-alive?

@aledbf
Member

aledbf commented Mar 24, 2017

So the problem is that nginx drops requests when reloading on a configuration change, but only when the connections to nginx use keep-alive?

Exactly. This is a known issue in nginx. It is not related to the ingress controller or Kubernetes.

@aledbf
Member

aledbf commented Mar 24, 2017

@foxylion at some point the old nginx worker is going to close the existing connection.
Please use a custom template and change

keepalive_timeout {{ $cfg.KeepAlive }}s;

to

keepalive_timeout {{ $cfg.KeepAlive }}s {{ $cfg.KeepAlive }}s;

This change sets the optional second argument of keepalive_timeout, which advertises the timeout to clients in the "Keep-Alive: timeout=..." response header.
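With the controller's default KeepAlive setting of 75 seconds, the edited template would render to something like the following (the value is illustrative):

```nginx
# first value: how long nginx keeps an idle client connection open
# second value: advertised to clients via the "Keep-Alive: timeout=..." response header
keepalive_timeout 75s 75s;
```

Advertising the timeout lets well-behaved clients stop reusing a socket before the server closes it.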

Also, from this page https://wiki.apache.org/jmeter/JMeterSocketClosed follow these instructions to change the default JMeter configuration:

  • Enabling retry
  • Enabling stale check

@foxylion
Contributor Author

@aledbf Thanks a lot! I'll try this out.

@vbisserie

@foxylion, with @yoanisgil we found a way to ask nginx to renew the HTTP connections.

    lifecycle:
      preStop:
        exec:
          command: ["/bin/bash", "-c", "python /srv/app/readiness.py --stop && sleep 8 && nginx -s reload && sleep 5"]

"nginx -s reload" creates new workers, and all connections are thereby restarted.

@aledbf
Member

aledbf commented Mar 26, 2017

@foxylion any update?

@foxylion
Contributor Author

@aledbf I'm sorry, but currently there is no time to investigate further.
Should we close the issue, and I'll reopen it if I have something new?

@aledbf
Member

aledbf commented Mar 27, 2017

Should we close the issue and I reopen it if I have something new?

Closing. Please reopen if you have more information

@aledbf aledbf closed this as completed Mar 27, 2017
@foxylion
Contributor Author

foxylion commented Mar 28, 2017

I started some testing on the issue, but have not yet finished.
If someone is interested: https://github.com/foxylion-playground/k8s-ingress-repro

(Will reopen when finished testing)

@a-chernykh

I think we are experiencing the same problem with keep-alive HTTP connections. What makes it worse for us: we deploy multiple services to the cluster and they all have ingress resources. This means that whenever a service is updated (a new version is deployed, or services are scaled up/down), the ingress controller runs nginx -s reload, which terminates all keep-alive connections. This breaks our integration test suite, which runs POST/PUT requests through ingress; we get weird HTTP responses and EOFs from the server.

@imduffy15

Hi all,

Sorry for commenting on an old issue. Is there any new information, or are there workarounds for this?

I've been using a preStop sleep longer than my longest connection timeout to place the instance in a terminating state and remove it from the LB (so no new connections are received) before letting it go into the nginx graceful shutdown process. The graceful shutdown time is my longest connection timeout * 2.
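The pattern described above can be sketched as a pod spec fragment. The 30s/60s values are illustrative and must be derived from your own longest connection timeout:

```yaml
spec:
  # graceful shutdown budget: longest connection timeout * 2
  terminationGracePeriodSeconds: 60
  containers:
  - name: backend
    lifecycle:
      preStop:
        exec:
          # sleep longer than the longest connection timeout, so the endpoint
          # is removed from the load balancer before the container gets SIGTERM
          command: ["sleep", "30"]
```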

6 participants