Regression issue with keep alive connections #27363

Closed
OrKoN opened this issue Apr 23, 2019 · 27 comments
Labels
http Issues or PRs related to the http subsystem.

Comments

@OrKoN
Contributor

OrKoN commented Apr 23, 2019

  • Version: 10.15.3
  • Platform: Linux
  • Subsystem:

Hi,

We updated Node.js from 10.15.0 to 10.15.3 for a service that runs behind the AWS Application Load Balancer. After that, our test suite revealed an issue we didn't see before the update: HTTP 502 errors thrown by the load balancer. Previously, this happened when the Node.js server closed a connection before the load balancer did. We solved it by setting server.keepAliveTimeout = X, where X is higher than the keep-alive timeout on the load balancer side.

With version 10.15.3, setting server.keepAliveTimeout = X does not work anymore and we see regular 502 errors from the load balancer. I have checked the Node.js changelog, and there was a change related to keep-alive connections in 10.15.2 (1a7302bd48) which might have caused the issue we are seeing.

Does anyone know if the mentioned change can cause the issue we are seeing? In particular, I believe the problem is that the connection is closed before the specified keep-alive timeout.
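For context, a minimal sketch of the workaround we had in place before the update (the 60-second ALB idle timeout and the 65-second value here are illustrative assumptions, not our exact settings):

const http = require('http');

const server = http.createServer((req, res) => {
  res.end('ok');
});

// The ALB idle timeout defaults to 60 seconds; keeping the Node.js side
// open longer means the load balancer, not Node.js, closes idle connections.
server.keepAliveTimeout = 65 * 1000;

server.listen(3000);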

@BridgeAR BridgeAR added the http Issues or PRs related to the http subsystem. label Apr 24, 2019
@BridgeAR
Member

// cc @nodejs/http

@bnoordhuis
Member

The slowloris mitigations only apply to the HTTP headers parsing stage. Past that stage the normal timeouts apply (barring bugs, of course.)

Is it an option for you to try out 10.15.1 and 10.15.2, to see if they exhibit the same behavior?

@OrKoN
Contributor Author

OrKoN commented Apr 24, 2019

In our test suite, there are about 250 HTTP requests. I have run the test suite four times for each of the following Node versions: 10.15.0, 10.15.1, 10.15.2. For 10.15.0 & 10.15.1 there were zero HTTP failures. For 10.15.2 there were on average two failures per test-suite run (HTTP 502). In every run a different test case fails, so the failures are not deterministic.

I tried to build a simple Node server and reproduce the issue with it, but so far without success. We will try to figure out the exact pattern and volume of requests needed to reproduce the issue. Timing and the speed of the client might matter.

@shuhei
Contributor

shuhei commented Apr 27, 2019

I guess that headersTimeout should be longer than keepAliveTimeout, because after the first request of a keep-alive connection, headersTimeout is applied to the period between the end of the previous request (even before its response is sent) and the first parsing of the next request.

@OrKoN What happens with your test suite if you put a headersTimeout longer than keepAliveTimeout?
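A minimal sketch of the relationship being suggested here (the concrete values are placeholders; the point is only the ordering):

const http = require('http');
const server = http.createServer((req, res) => res.end('ok'));

// load balancer idle timeout < keepAliveTimeout < headersTimeout
server.keepAliveTimeout = 65 * 1000;
server.headersTimeout = 70 * 1000;

server.listen(3000);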

@shuhei
Contributor

shuhei commented Apr 28, 2019

Created a test case that reproduces the issue. It fails on 10.15.2 and 10.15.3. (Somehow headersTimeout seems to work only when headers are sent in multiple packets.)

To illustrate the issue with an example of two requests on a keep-alive connection:

  1. A connection is made
  2. The server receives the first packet of the first request's headers
  3. The server receives the second packet of the first request's headers
  4. The server sends the response for the first request
  5. (...idle time...)
  6. The server receives the first packet of the second request's headers
  7. The server receives the second packet of the second request's headers

keepAliveTimeout works for 4-6 (the period between 4 and 6). headersTimeout works for 3-7. So headersTimeout should be longer than keepAliveTimeout in order to keep connections open until keepAliveTimeout.

I wonder whether headersTimeout should include 3-6 at all. 6-7 seems more intuitive for the name and should be enough for mitigating Slowloris DoS, because 3-4 is up to the server and 4-6 is covered by keepAliveTimeout.
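Roughly, the shape of that test case looks like the sketch below (an illustrative sketch rather than the exact linked test; the timeout values, the ~4-second idle period, and the 10-byte split point are arbitrary choices made here to send headers in two packets):

const http = require('http');
const net = require('net');

const server = http.createServer((req, res) => res.end('ok'));
server.keepAliveTimeout = 5000; // covers 4-6 above
server.headersTimeout = 3000;   // covers 3-7 above, deliberately shorter here

server.listen(0, () => {
  const { port } = server.address();
  const socket = net.connect(port, '127.0.0.1');
  const request = 'GET / HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\n\r\n';
  const sendInTwoPackets = () => {
    socket.write(request.slice(0, 10));                     // first packet of headers
    setTimeout(() => socket.write(request.slice(10)), 100); // second packet of headers
  };

  sendInTwoPackets(); // first request (steps 2-3)
  // Second request after ~4s idle: within keepAliveTimeout (5s) but past
  // headersTimeout (3s) measured from the end of the first request.
  setTimeout(sendInTwoPackets, 4000);

  socket.on('error', (err) => console.error(err.code)); // ECONNRESET/EPIPE on affected versions
  socket.on('close', () => server.close());
});

On 10.15.2/10.15.3 the second request is expected to fail even though the connection was idle for less than keepAliveTimeout; raising headersTimeout above keepAliveTimeout makes it pass.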

@OrKoN
Contributor Author

OrKoN commented Apr 29, 2019

@shuhei so you mean that headersTimeout spans multiple requests on the same connection? I have not tried changing headersTimeout because I expected it to apply to a single request only, and we have no long requests in our test suite. It looks like the headers timer should reset when a new request arrives, but currently it is defined by the first request on the connection.

@shuhei
Contributor

shuhei commented Apr 29, 2019

@OrKoN Yes, headersTimeout spans parts of two requests on the same connection, including the interval between the two requests. Before 1a7302bd48, it was applied only to the first request. That commit started resetting the headers timer when a request finishes, in order to apply headersTimeout to subsequent requests on the same connection.

@OrKoN
Contributor Author

OrKoN commented Apr 29, 2019

I see. So it looks like an additional place to reset the timer would be the beginning of a new request? And parserOnIncoming is only called once the headers are parsed, so it needs to be some other place then.

P.S. I will run our tests with an increased headersTimeout today to see if it helps.

@OrKoN
Contributor Author

OrKoN commented Apr 29, 2019

So we have applied the workaround (headersTimeout > keepAliveTimeout) and the errors are gone. 🎉

@alaz

alaz commented May 1, 2019

I faced this issue too. I had configured my Nginx load balancer to use keepalive when connecting to Node upstreams, had already seen it dropping connections once before, and had found the reason. I switched to Node 10 after that and was surprised to see this happening again: Nginx reports that Node closed the connection unexpectedly and then disables that upstream for a while.

I have not seen this problem since tweaking the header timeouts yesterday as proposed by @OrKoN above. I think this is a serious bug, since it results in load balancers switching nodes off and on.

Why doesn't anybody else find this bug alarming? My guess is that:

  1. there are no traces of it on the Node instances themselves. No log messages, nothing.
  2. web users connecting to Node services directly may simply not notice that a few connections are dropped. The rate was not high in my case (maybe a couple dozen per day while we serve millions of connections daily), so the chance of a particular visitor experiencing this is relatively small.
  3. I found the bug indirectly, based on the load balancer's logs: not everyone keeps a close eye on the logs.

@yoavain
Contributor

yoavain commented May 23, 2019

We're having the same problem after upgrading from 8.x to 10.15.3.
However, I don't think it's a regression from 10.15.2 to 10.15.3.
This discussion goes way back to issue #13391.
I forked the example code there and created a new test case that fails on all the following versions:
12.3.1, 10.15.3, 10.15.2, 10.15.1, 10.15.0, 8.11.2

The original code did not fail in a consistent way, which led me to believe there's some kind of race condition where, between the keepAliveTimeout check and the connection termination, a new request can try to reuse the connection.

So I tweaked the test so that (see the sketch at the end of this comment):

  1. The server's time to answer a request is 3 * keepAliveTimeout minus a few (random) milliseconds (keeping 2-3 requests alive).
  2. The client fires not just one request, but a request every (exactly) keepAliveTimeout. This makes sure the client requests are aligned with the server connection's keepAliveTimeout.

The results are pretty consistent:

Error: socket hang up
    at createHangUpError (_http_client.js:343:17)
    at Socket.socketOnEnd (_http_client.js:444:23)
    at Socket.emit (events.js:205:15)
    at endReadableNT (_stream_readable.js:1137:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:9) {
  code: 'ECONNRESET'
}

You can clone the code from yoavain/node8keepAliveTimeout

npm install
npm run test -- --keepAliveTimeout 5000
(Note that the keepAliveTimeout is also the client request interval.)

When setting keepAliveTimeout to 0, the problem is gone.

npm run test -- --keepAliveTimeout 5000 --keepAliveDisabled
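For reference, the request/response pattern described above looks roughly like this (a sketch under those assumptions, not the repo's exact code; the port and handler are placeholders):

const http = require('http');

const KEEP_ALIVE_TIMEOUT = 5000;

const server = http.createServer((req, res) => {
  // Answer after ~3x keepAliveTimeout minus a few random milliseconds,
  // so 2-3 requests are in flight on the keep-alive connections at any time.
  const delay = 3 * KEEP_ALIVE_TIMEOUT - Math.ceil(Math.random() * 10);
  setTimeout(() => res.end('ok'), delay);
});
server.keepAliveTimeout = KEEP_ALIVE_TIMEOUT;

server.listen(3000, () => {
  const agent = new http.Agent({ keepAlive: true });
  // Fire a request exactly every keepAliveTimeout so reuse attempts line up
  // with the server-side idle expiry.
  setInterval(() => {
    http.get({ port: 3000, agent }, (res) => res.resume())
      .on('error', (err) => console.error(err.message, err.code)); // "socket hang up" / ECONNRESET
  }, KEEP_ALIVE_TIMEOUT);
});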

@dansantner

Thanks for the info, guys! This is a nasty issue that reared its head when we went straight from 10.14 to 12. Node kept dropping our connections before the AWS Load Balancer knew about it. Once I set the ELB timeout < keepAliveTimeout < headersTimeout (we weren't even setting that one), the problem went away.

@markfermor

markfermor commented Oct 16, 2019

The original code did not fail in a consistent way, which led me to believe there's some kind of race condition where, between the keepAliveTimeout check and the connection termination, a new request can try to reuse the connection.

I can confirm I'm pretty sure we're seeing this as well (v10.13.0). We have Nginx in front of Node.js within K8s. We were seeing random "connection reset by peer" or "upstream prematurely closed connection" errors for requests Nginx was sending to the Node.js apps. On all these occasions the problem occurred on connections established by Nginx to Node.

Right around the default 5-second keepAliveTimeout on the Node.js side, Nginx decided to reuse its open/established connection to the Node process and send another request (technically just outside the 5-second timeout limit on the Node side, by <2ms). Node.js accepted this new request over the existing connection and responded with an ACK packet, then <2ms later followed up with an RST packet closing the connection. However, stracing the Node.js process, I could see the app code had received the request and was processing it, but before the response could be sent, Node had already closed the connection.

I would second the thought that there is a slight race condition between the point where the connection is about to be closed by Node.js and it still accepting an incoming request.

To avoid this, we simply increased the Node.js keepAliveTimeout to be higher than Nginx's, thus giving Nginx control over the keep-alive connections. http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive_timeout

A screenshot of a packet capture taken on the Node.js side of the connection is attached.

@kirillgroshkov

Wow, very interesting thread. I have a suspicion that we're facing a similar issue in AppEngine Node.js Standard: ~100 502 errors a day out of ~1M requests per day total (~0.01% of all requests).

@michalschott

Can confirm this is still the case in the v10.19.0 release.

OrKoN added a commit to OrKoN/node that referenced this issue Mar 17, 2020
For keep-alive connections, the headersTimeout may fire during
subsequent request because the measurement was reset after
a request and not before a request.

Fixes: nodejs#27363
@sroze

sroze commented Mar 24, 2020

We investigated an issue affecting a very small subset of requests as well, and this was the root cause. The behaviour we see is exactly what @markfermor described; you can read more in our investigation details. The following configuration lines did indeed solve the issue:

server.keepAliveTimeout = 76 * 1000;
server.headersTimeout = 77 * 1000;

@kirillgroshkov

This is what solved it for us in AppEngine:

this.server.keepAliveTimeout = 600 * 1000

BethGriggs pushed a commit that referenced this issue Apr 7, 2020
For keep-alive connections, the headersTimeout may fire during
subsequent request because the measurement was reset after
a request and not before a request.

PR-URL: #32329
Fixes: #27363
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
@vrozov

vrozov commented Apr 9, 2020

@mcollina @OrKoN Is there a plan to release 10.15.4 with the fix?

@mcollina
Member

mcollina commented Apr 9, 2020

Node 10.15.4 is not happening. At best, it might be backported to Node 10.20.1 or a similar release.

@wyardley

wyardley commented Apr 9, 2020

Would love to see it in 10.20.x or similar if it’s at all feasible to port it back.

targos pushed a commit that referenced this issue Apr 12, 2020
For keep-alive connections, the headersTimeout may fire during
subsequent request because the measurement was reset after
a request and not before a request.

PR-URL: #32329
Fixes: #27363
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
@glasser
Contributor

glasser commented May 14, 2020

Note that this also appears to be broken in v12, not just the recent v10s.

@yokomotod

Since #32329 was merged, I no longer need to set headersTimeout bigger than keepAliveTimeout, right?

MylesBorins pushed a commit that referenced this issue Aug 18, 2020
For keep-alive connections, the headersTimeout may fire during
subsequent request because the measurement was reset after
a request and not before a request.

Backport-PR-URL: #34131
PR-URL: #32329
Fixes: #27363
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
@Xilis

Xilis commented Oct 7, 2020

@yokomotod if you're asking for v12, v12.19.0 contains the fix

v12.18.4...v12.19.0#diff-feaf3339998a19f0baf3f82414762c22

https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V12.md#12.19.0

@Samjin

Samjin commented Dec 17, 2021

@yokomotod if you're asking for v12, v12.19.0 contains the fix

v12.18.4...v12.19.0#diff-feaf3339998a19f0baf3f82414762c22

https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V12.md#12.19.0

If the default is 60s, then we still need to override that if the ELB has a longer duration, right?

@Xilis

Xilis commented Dec 17, 2021

@yokomotod if you're asking for v12, v12.19.0 contains the fix
v12.18.4...v12.19.0#diff-feaf3339998a19f0baf3f82414762c22
https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V12.md#12.19.0

If the default is 60s, then we still need to override that if the ELB has a longer duration, right?

Yes, the application keep-alive timeout should be higher than that of whatever is in front of it (nginx, ELB, ...). Whether to change the timeout on the application side or on the load balancer side depends on your setup (e.g. the Google Cloud HTTP load balancer does not allow changing the timeout value). I'd suggest making your application agnostic to how it is deployed and making this value configurable, for example through an environment variable.

The Node.js default depends on the version you're running (I think it is 5 seconds for all non-obsolete versions); you can check for your version in the docs here.
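A minimal sketch of that suggestion (the environment variable names here are made up for illustration):

const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));

// Keep both timeouts above the idle timeout of whatever sits in front
// (nginx, ELB, ...), and let each deployment choose the exact values.
server.keepAliveTimeout = Number(process.env.KEEP_ALIVE_TIMEOUT_MS || 65000);
server.headersTimeout = Number(process.env.HEADERS_TIMEOUT_MS || 66000);

server.listen(Number(process.env.PORT || 3000));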

ijjk added a commit to vercel/next.js that referenced this issue Mar 17, 2023
…46052)

Resolves #39689, partially resolves #28642 (see notes below)
Inspired by #44627

In #28642 it was also asked to expose `server.headersTimeout`, but it is
probably not needed for most use cases and not implemented even in `next
start`. It was needed to change this option before
nodejs/node#27363.
There also exists a rare bug that is described here
nodejs/node#32329 (comment). To fix
this exposing `server.headersTimeout` might be required both in
`server.js` and in `next start`.

Co-authored-by: JJ Kasper <jj@jjsweb.site>
@pankaz

pankaz commented Feb 7, 2024

Hi everyone. Thank you for sharing these insights. Is there a chance that the fix is missing in v20.11.0?

This issue does not occur in 16.16.0 or 18.19.0. However, I'm noticing 502 errors when I use v20.11.0. Would love to get feedback on whether other members in the community are facing this issue.

@hjr3

hjr3 commented Apr 3, 2024

However, I'm noticing 502 errors when I use v20.11.0.

I am seeing the same issue on v20.9.0 using Fastify as my server.

abhishekumar-tyagi pushed a commit to abhishekumar-tyagi/node that referenced this issue May 5, 2024
For keep-alive connections, the headersTimeout may fire during
subsequent request because the measurement was reset after
a request and not before a request.

PR-URL: nodejs/node#32329
Fixes: nodejs/node#27363
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>