
[nginx] Lost client requests when updating a deployment and using keep alive #489

Closed
foxylion opened this issue Mar 23, 2017 · 23 comments

@foxylion
Contributor

foxylion commented Mar 23, 2017

I use the following setup to reproduce this issue:

  1. Set up an nginx ingress controller
  2. Add the following deployment (including services and ingress):
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: zero-downtime-test
spec:
  rules:
  - host: zero-downtime.domain.example
    http:
      paths:
      - path: /
        backend:
          serviceName: zero-downtime-test
          servicePort: 80
---
kind: Service
apiVersion: v1
metadata:
  name: zero-downtime-test
spec:
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: zero-downtime-test
---
kind: Service
apiVersion: v1
metadata:
  name: zero-downtime-test-node-port
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: zero-downtime-test
---
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: zero-downtime-test
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: zero-downtime-test
    spec:
      containers:
      - name: backend
        image: nginx:1.11
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "1"]
        livenessProbe:
          httpGet:
            path: /
            port: 80
            scheme: HTTP
        readinessProbe:
          httpGet:
            path: /
            port: 80
            scheme: HTTP
        ports:
        - containerPort: 80
          protocol: TCP
  3. Now the deployment is reachable on http://node:30086 or http://zero-downtime.domain.example.
  4. Set up two load tests (in my case JMeter) which access the backend either via the node port directly or through ingress.
  5. Update the zero-downtime-test deployment.
  6. The load test accessing the node port does not report any broken requests. The load test accessing through ingress reports failed requests for a short period.
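The failure pattern in step 6 can be sketched without a cluster. A client that reuses one keep-alive socket (as JMeter does) only sees errors when the server side drops that socket mid-use. This is a minimal sketch with a local stand-in server; the server, ports, and request counts are illustrative, not part of the original setup:

```python
import http.client
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 => keep-alive by default
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):  # keep the demo output quiet
        pass

# stand-in backend (hypothetical; in the real setup this is nginx behind ingress)
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# one persistent connection, reused for every request -- like JMeter with keep-alive on
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
ok = failed = 0
for _ in range(50):
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()
        resp.read()
        ok += resp.status == 200
    except (http.client.HTTPException, ConnectionError, OSError):
        failed += 1  # a dropped keep-alive socket surfaces here during a rollout
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
server.shutdown()
print(ok, failed)
```

While the server stays up, all requests succeed; the `failed` branch is what fires when a reloading nginx closes the keep-alive socket underneath the client.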

I think this may have something to do with nginx not correctly closing its connections to the backend when keep-alive is used.

The problem is not visible when disabling "keep alive" connections in JMeter.

@aledbf
Member

aledbf commented Mar 23, 2017

@foxylion this was reported a long time ago kubernetes-retired/contrib#1123
Please check the nginx answer https://trac.nginx.org/nginx/ticket/1022

@foxylion
Contributor Author

foxylion commented Mar 23, 2017

Thanks for the issue reference. I didn't look in the old issue tracker, sorry.

This makes me think the nginx inside the controller does not handle the closed connections of the pods it routes to very well.

JMeter handles everything fine when it communicates directly with the pods (over the node port), but fails when communicating through the controller (which is never restarted).

@yoanisgil

@aledbf Like @foxylion, I want to achieve zero-downtime deployments, and right now I'm kind of lost on how to do this. Is there a workaround? Maybe we should document this to make sure people know it is a problem they might face (especially since the Kubernetes docs state that it can do zero-downtime deployments).

@aledbf
Member

aledbf commented Mar 23, 2017

Kubernetes doc states that it can do zero downtime deployments

This is true. The real issue here is how nginx handles requests when keep-alive is enabled.

@foxylion
Contributor Author

foxylion commented Mar 23, 2017

The real issue here is how nginx handles requests when keep-alive is enabled.

Okay, so we are seeing the same issue. My question now is:
Is there anything we can do to fix this on the ingress-controller side?

If not, we should document this prominently, because no one would expect that ingress (with nginx) does not support the broad claim of Kubernetes being able to do zero-downtime deployments.

(I really hope there is anything we can do to allow this)

@yoanisgil

@aledbf from my very short experience using k8s, if you don't implement a technique like the one described here: https://www.chrismoos.com/2016/09/28/zero-downtime-deployments-kubernetes/

you won't get zero-downtime deployments out of the box. We did some tests on our side, and if we disable support for keep-alive connections on the server side, the approach described in that article does indeed work.

We're going to try shortening the keep-alive timeout to make sure it is far below the pod's termination grace period and see where that gets us.

@yoanisgil

@foxylion can you try reducing the keep-alive timeout to something well below the pod's termination grace period? In theory, this gives clients enough time to finish their connections.
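For the nginx ingress controller, the keep-alive timeout is configurable via the controller's ConfigMap (`keep-alive`, in seconds). A sketch, assuming the default 30-second `terminationGracePeriodSeconds`; the ConfigMap name and namespace below are illustrative and must match the controller's `--configmap` flag:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-ingress-controller   # illustrative; must match the controller's --configmap flag
  namespace: kube-system
data:
  # keep idle client connections well below the pod termination grace period
  keep-alive: "10"
```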

@foxylion
Contributor Author

foxylion commented Mar 23, 2017

We're going to try shortening the keep-alive timeout to make sure it is far below the pod's termination grace period and see where that gets us.

The thing is, the keep-alive timeout is defined as "seconds until an idle connection is closed". A connection that is in constant use will never be closed. So this will (in my opinion) not always work.

@yoanisgil

I don't see a reference to idle connections here:

http://nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_timeout

@foxylion
Contributor Author

foxylion commented Mar 23, 2017

@yoanisgil If their implementation is matching the RFC draft for the Keep-Alive header, then timeout should be interpreted as:

A host sets the value of the timeout parameter to the time that the host will allow an idle connection to remain open before it is closed. A connection is idle if no data is sent or received by a host.

There is also a 'max' parameter which defines the maximum number of requests that can be performed on a single connection. But I can't find anything about a 'max seconds'.
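For illustration, a server implementing that draft would advertise both parameters in its response headers; the values here are examples, not what nginx actually sends:

```http
HTTP/1.1 200 OK
Connection: keep-alive
Keep-Alive: timeout=5, max=100
```

Here `timeout=5` says the connection may be closed after 5 idle seconds, and `max=100` says at most 100 requests will be served over this connection.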

Also, you read things like this everywhere:

Internet Explorer reuses the same TCP/IP socket that was used to receive the initial request until the socket is idle for one minute.

@yoanisgil

@foxylion thanks for the information. We'll keep testing and see where that gets us. In the meantime it would be nice if someone with more in depth experience/knowledge from the k8s community could jump in and share a thought or two.

@aledbf
Member

aledbf commented Mar 24, 2017

@foxylion @yoanisgil to really fix this we need to avoid reloading nginx in order to update the upstreams.
Please check this comment #257 (comment)

@foxylion
Contributor Author

foxylion commented Mar 24, 2017

@aledbf I'm having a hard time understanding this. When I disable keep-alive on my client (JMeter) there are absolutely no issues with ingress nginx. Only when keep-alive is enabled does ingress nginx drop some of the requests during a rollout.

  • problem with keep-alive when: client -> node-port -> service -> nginx -> pod
  • no problem without keep-alive when: client -> node-port -> service -> nginx -> pod
  • no problem with keep-alive when: client -> node-port -> service -> pod

So the problem is that nginx drops requests when reloading on a configuration change, but only when the connections to nginx use keep-alive?

@aledbf
Member

aledbf commented Mar 24, 2017

So the problem is that nginx drops requests when reloading on a configuration change, but only when the connections to nginx use keep-alive?

Exactly. This is a known issue in nginx. It is not related to the ingress controller or Kubernetes.

@aledbf
Member

aledbf commented Mar 24, 2017

@foxylion at some point the old nginx worker is going to close the existing connection.
Please use a custom template and change

keepalive_timeout {{ $cfg.KeepAlive }}s;

to

keepalive_timeout {{ $cfg.KeepAlive }}s {{ $cfg.KeepAlive }}s;

This change sets the optional second argument of keepalive_timeout, which advertises the timeout to clients in the "Keep-Alive: timeout=..." response header.
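With the controller's default KeepAlive setting of 75 seconds, the edited template would render to something like the following (the value is illustrative):

```nginx
# first value: how long nginx keeps an idle client connection open
# second value: advertised to clients via the "Keep-Alive: timeout=..." response header
keepalive_timeout 75s 75s;
```

Advertising the timeout lets well-behaved clients stop reusing a socket before the server closes it.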

Also, from this page https://wiki.apache.org/jmeter/JMeterSocketClosed follow these instructions to change the default JMeter configuration:

  • Enabling retry
  • Enabling stale check

@foxylion
Contributor Author

@aledbf Thanks a lot! I'll try this out.

@vbisserie

@foxylion, with @yoanisgil we found a way to ask nginx to renew the HTTP connections.

    lifecycle:
      preStop:
        exec:
          command: ["/bin/bash", "-c", "python /srv/app/readiness.py --stop && sleep 8 && nginx -s reload && sleep 5"]

"nginx -s reload" creates new workers, and all connections are thereby restarted.

@aledbf
Member

aledbf commented Mar 26, 2017

@foxylion any update?

@foxylion
Contributor Author

@aledbf I'm sorry, but currently there is no time to investigate further.
Should we close the issue, and I'll reopen it if I have something new?

@aledbf
Member

aledbf commented Mar 27, 2017

Should we close the issue and I reopen it if I have something new?

Closing. Please reopen if you have more information

@aledbf aledbf closed this as completed Mar 27, 2017
@foxylion
Contributor Author

foxylion commented Mar 28, 2017

I started some testing on the issue, but have not yet finished.
If someone is interested: https://github.com/foxylion-playground/k8s-ingress-repro

(Will reopen when finished testing)

@a-chernykh

I think we are experiencing the same problem with keep-alive HTTP connections. What makes it worse for us: we deploy multiple services to the cluster and they all have ingress resources. This means that whenever a service is updated (a new version is deployed, or services are scaled up/down), the ingress controller runs nginx -s reload, which terminates all keep-alive connections. This breaks our integration test suite, which runs POST/PUT requests through ingress; we get weird HTTP responses and EOFs from the server.

@imduffy15

Hi all,

Sorry for commenting on an old issue. Is there any new information, or are there workarounds for this?

I've been using a preStop sleep longer than my longest connection timeout to place the instance in a terminating state and remove it from the LB (so no new connections are received) before letting it go into the nginx graceful shutdown process. The graceful shutdown time is my longest connection timeout * 2.
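The pattern described above can be sketched as a pod spec fragment. The 30s/60s values are illustrative and must be derived from your own longest connection timeout:

```yaml
spec:
  # graceful shutdown budget: longest connection timeout * 2
  terminationGracePeriodSeconds: 60
  containers:
  - name: backend
    lifecycle:
      preStop:
        exec:
          # sleep longer than the longest connection timeout, so the endpoint
          # is removed from the load balancer before the container gets SIGTERM
          command: ["sleep", "30"]
```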

6 participants