
HTTP/1.1 server aborts upload request in Node.js 18 #47421

Closed
nwtgck opened this issue Apr 5, 2023 · 23 comments
Labels
http Issues or PRs related to the http subsystem.

Comments

@nwtgck

nwtgck commented Apr 5, 2023

Version

18.0.0, 18.15.0

(14.21.3, 16.20.0, and 17.9.1 work fine)

Platform

GitHub Actions ubuntu-20.04

Here is uname -a:

Linux fv-az222-460 5.15.0-1034-azure #41~20.04.1-Ubuntu SMP Sat Feb 11 17:02:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

http

What steps will reproduce the bug?

First of all, this is a very strange bug, but I made a simple piece of code for reproducing it.

Here is the whole code:
nwtgck/public-code@02b852b...6f51243

Run the following code with node server1.js. The server consumes the request body, responds "Handling..." immediately, and ends the response with "Finished!" once the body has been fully read.

// server1.js
const http = require("http");

const server = http.createServer((req, res) => {
  console.log(`${req.method} ${req.url}`);
  req.on("data", (data) => console.log(`on data: ${data.length}B`));
  req.on("close", () => console.log("req closed"));
  req.on("aborted", () => console.log("req aborted"));
  req.on("error", (err) => console.log("req error", err));
  req.on("end", () => {
    console.log("req end");
    res.end("Finished!\n");
  });
  res.writeHead(200);
  res.write("Handling...\n");
});

server.listen(3000, () => {
  console.log(`Listening on ${server.address().port}...`);
});

Upload the text sequence 1-100, rate-limited to 10 bytes per second with pv, as follows.

seq 100 | pv -L 10 | curl -sST- http://127.0.0.1:3000

After approximately 23-26 seconds, curl suddenly stops.

Here is the server-side stdout, which shows an ECONNRESET error.

...
on data: 1B
on data: 1B
req aborted
req error Error: aborted
    at connResetException (node:internal/errors:717:14)
    at abortIncoming (node:_http_server:754:17)
    at socketOnClose (node:_http_server:748:3)
    at Socket.emit (node:events:525:35)
    at TCP.<anonymous> (node:net:322:12) {
  code: 'ECONNRESET'
}
req closed

How often does it reproduce? Is there a required condition?

GitHub Actions reproduces it. My local Ubuntu 20.04 machine (real hardware), GitHub Codespaces, an M1 Mac, and Vagrant on an Intel Mac did not reproduce it.

GitHub Actions (without sleep 300)

Here is the result:

  • Node 14.21.3: 0/10 failed
  • Node 16.20.0: 0/10 failed
  • Node 17.9.1: 0/10 failed
  • Node 18.0.0: 8/10 failed
  • Node 18.15.0: 9/10 failed

In other words, Node 14-17 succeeds every time, while Node 18 fails in more than 80% of runs.

The result was obtained by re-running the GitHub Actions workflow 10 times:
https://github.com/nwtgck/public-code/actions/runs/4602240359

GitHub Actions (with sleep 300)

Here is the really strange part: when I added a 5-minute wait, the server on Node 18 worked fine. Both Node 18.0.0 and 18.15.0 worked well in all 10 runs (0/10 failed).

All I did was add sleep 300 at the very beginning, before checkout: nwtgck/public-code@04d99e2

(whole code: nwtgck/public-code@02b852b...04d99e2)

The sleep result was also obtained by re-running the workflow 10 times:
https://github.com/nwtgck/public-code/actions/runs/4602334265

What is the expected behavior? Why is that the expected behavior?

The expected behavior is that curl uploads successfully:

$ seq 100 | pv -L 10 | curl -sST- http://127.0.0.1:3000
Handling...
Finished!

Additional information

I confirmed that CircleCI also had the same issue before I simplified the reproduction. The machine.image values were ubuntu-2004:2023.02.1 and ubuntu-2004:202010-01. I will create a simple CircleCI reproduction if a Node.js member needs it.

@VoltrexKeyva added the http label (Issues or PRs related to the http subsystem) on Apr 5, 2023
@bnoordhuis
Member

Thanks for the test case but ECONNRESET means the other end closed the connection. It's pretty much guaranteed that this isn't a node bug.

My best guess is you're hitting some kind of firewall rule on Actions and CircleCI.

@nwtgck
Author

nwtgck commented Apr 6, 2023

@bnoordhuis Thank you so much for thinking about the issue! I'll try to investigate whether some kind of firewall rule is involved.

I captured packets using tcpdump. In the last captured packets, the server sends FIN, ACK, and RST.

I ran the two commands:

  • sudo tcpdump -i any -n "host 127.0.0.1 and port 3000" -w a.pcap
  • seq 100 | pv -L 10 | curl -sST- http://127.0.0.1:3000 (same command as the first comment in the issue)

I hope this is helpful.

Here is the whole pcap file in zip:
a.pcap.zip

@bnoordhuis
Member

Oh, I understand what you mean now. Node v18.x changed the default of http.createServer()'s requestTimeout option - which prevents slowloris attacks (basically what you're doing with pv(1)) - from disabled to 300,000 milliseconds, i.e., 5 minutes.
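
You can confirm the default on a given Node version by inspecting the server property (a quick sketch; nothing here beyond the documented server.requestTimeout property):

// Quick check of the effective HTTP request timeout on the running Node version.
const http = require("http");

const server = http.createServer(() => {});
console.log("requestTimeout:", server.requestTimeout); // 300000 (5 minutes) on Node 18+, 0 (disabled) on Node 14.11-17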

@nwtgck
Author

nwtgck commented Apr 7, 2023

Thanks! I'm not sure requestTimeout is related to this issue, though. The docs say requestTimeout "Sets the timeout value in milliseconds for receiving the entire request from the client."

The client data from seq 100 | pv -L 10 finishes in about 30 seconds, which is far less than 5 minutes.

The workaround in the first comment is like this:

# Wait 5 min
sleep 300
# Run server
node server1.js &
# <sufficient wait for starting the server>

# <the curl command>

I just waited 5 min before starting the server in GitHub Actions.

But anyway, I figured out http.createServer({ requestTimeout: 0 }, ...) resolves the issue.
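
For reference, a minimal sketch of that workaround applied to the repro server above (hypothetical file name server2.js; the only change is the requestTimeout: 0 option, which disables the request timeout):

// server2.js (hypothetical name): the repro server with the request timeout disabled
const http = require("http");

const server = http.createServer({ requestTimeout: 0 }, (req, res) => {
  console.log(`${req.method} ${req.url}`);
  req.on("data", (data) => console.log(`on data: ${data.length}B`));
  req.on("end", () => {
    console.log("req end");
    res.end("Finished!\n");
  });
  res.writeHead(200);
  res.write("Handling...\n");
});

server.listen(3000, () => {
  console.log(`Listening on ${server.address().port}...`);
});

With requestTimeout: 0 set, the curl command from the first comment completes.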

@bnoordhuis
Member

Those "on data: 1B" events suggest it's feeding the data byte by byte but anyway, happy to hear it's working for you now. I'll close out the issue.

@bnoordhuis closed this as not planned on Apr 8, 2023
@nwtgck
Author

nwtgck commented Apr 8, 2023

@bnoordhuis Thank you so much for dealing with the issue.

However, my last comment may not be accurate; I only found a workaround for the error.
I suspect there is still an underlying problem, because it is very strange that waiting 5 minutes before starting the server also resolves the error, and specifying requestTimeout: Number.MAX_SAFE_INTEGER does not solve it.

@bnoordhuis
Member

I can see how such magic incantations can be unsatisfactory. Just so we're on the same page: you agree those "on data: 1B" events mean the data is arriving one byte at a time?

Because it makes sense that node's slowloris protection kicks in under such conditions, and it also explains why a zero (but not a non-zero) requestTimeout makes the problem go away: zero disables the protection.

@gramakri

gramakri commented Apr 9, 2023

I am having the same issue in our upload logic since updating to Node 18. It works fine on Node 16. The exact stack trace is this:

restore: error in pipline Error: aborted
    at connResetException (node:internal/errors:717:14)
    at abortIncoming (node:_http_server:755:17)
    at socketOnClose (node:_http_server:749:3)
    at Socket.emit (node:events:525:35)
    at TCP.<anonymous> (node:net:322:12) {
  code: 'ECONNRESET'
}

Setting requestTimeout: 0 and headersTimeout: 0 (i.e. disabling them) sorts out the issue. But there seems to be a bug in the default server: I have set those timeouts to high numbers, yet they have no effect (headersTimeout: 5 * 60 * 1000, requestTimeout: 4 * 60 * 60 * 1000).
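
For concreteness, a sketch of that configuration (placeholder handler and port; only the timeout values come from above):

const http = require("http");

// Placeholder handler; the real one streams the upload elsewhere.
const server = http.createServer((req, res) => {
  req.pipe(res);
});

// Large, non-zero timeouts - per the behavior described above, these had no
// effect on the aborts; only setting them to 0 did.
server.headersTimeout = 5 * 60 * 1000;      // 5 minutes
server.requestTimeout = 4 * 60 * 60 * 1000; // 4 hours

server.listen(3000);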

What's curious is that this does not happen all the time. I can often reproduce it reliably; it stays reproducible for a couple of hours and then suddenly starts working again! I am going crazy.

@gramakri

gramakri commented Apr 9, 2023

@bnoordhuis @ShogunPanda (cc'ed Paolo since he seems to have touched this code after node 17).

OK, I have some more concrete information after debugging this a bit. I compiled node from source and put some debug messages in the Expired function - https://github.com/nodejs/node/blob/main/src/node_http_parser.cc#L1114

The code is:

--- a/src/node_http_parser.cc
+++ b/src/node_http_parser.cc
@@ -32,7 +32,7 @@
 
 #include <cstdlib>  // free()
 #include <cstring>  // strdup(), strchr()
-
+#include <iostream>
 
 // This is a binding to llhttp (https://github.com/nodejs/llhttp)
 // The goal is to decouple sockets from parsing for more javascript-level
@@ -1121,6 +1121,10 @@ void ConnectionsList::Expired(const FunctionCallbackInfo<Value>& args) {
   const uint64_t request_deadline =
     request_timeout > 0 ? now - request_timeout : 0;
 
+  std::cout << "Expiry checker " << "hto: " << headers_timeout << " rto: " << request_timeout
+    << " now: " << now
+    << " hd: " << headers_deadline << " rd: " << request_deadline << std::endl;
+
   uint32_t i = 0;
   auto iter = list->active_connections_.begin();
   auto end = list->active_connections_.end();
@@ -1140,6 +1144,7 @@ void ConnectionsList::Expired(const FunctionCallbackInfo<Value>& args) {
         return;
       }
 
+      std::cout << "Something expired " << " headers_completed_: " << parser->headers_completed_ << " last_message_start_: " << parser->last_message_start_ << std::endl;
       list->active_connections_.erase(parser);
     }
   }

When the upload fails, I got:

Expiry checker hto: 300000000000 rto: 14400000000000 now: 3633345688065 hd: 3333345688065 rd: 18446733307055239681
Something expired  headers_completed_: 1 last_message_start_: 3632136659343

The headersTimeout (5 minutes) and requestTimeout (4 hours) are correctly printed above. I don't understand what now is: it doesn't match the epoch, and the libuv docs say it is measured against an arbitrary reference point. OK. Well, request_deadline = now - request_timeout is then some negative number - is that what you are expecting? request_deadline is also a uint64_t, so request_deadline > 0 && parser->last_message_start_ < request_deadline succeeds (it is always greater than 0, being unsigned, and after the negative value wraps around it is also a very large number). And then it proceeds to terminate the connection with a 408.
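
To make the wraparound concrete, here is the same arithmetic sketched in JavaScript, with BigInt standing in for uint64_t (the values are taken from the debug output above):

// Emulate the uint64_t arithmetic from ConnectionsList::Expired with BigInt.
const now = 3633345688065n;             // uv_hrtime()-style value from the debug output (ns)
const requestTimeout = 14400000000000n; // 4 hours in ns

// C++: request_deadline = request_timeout > 0 ? now - request_timeout : 0
// Since requestTimeout > now, the unsigned subtraction wraps around:
const requestDeadline = BigInt.asUintN(64, now - requestTimeout);
console.log(requestDeadline); // 18446733307055239681n - matches the "rd:" value above

// last_message_start_ (3632136659343) is far below that huge deadline, so the
// connection is treated as expired and terminated.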

@ShogunPanda
Contributor

request_deadline should always be positive, as it is "now" minus the "timeout".
That number is then compared to each request's start time. Any request with a start time smaller than that is considered expired.

I think I never checked whether request_timeout (or headers_timeout) might be greater than uv_hrtime(). In that case the check is simply not applicable and should therefore be implicitly disabled.

We should add the check, and a warning for the user if this happens. Would you like to send a PR for this?
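
For clarity, the guard being suggested could look roughly like this (a JavaScript/BigInt sketch purely for illustration; the actual change would be in node_http_parser.cc):

// Only compute a deadline when the timeout is non-zero and smaller than "now";
// otherwise the check is not applicable and stays implicitly disabled (deadline 0).
function computeDeadline(now, timeout) {
  if (timeout > 0n && now > timeout) {
    return now - timeout; // safe: cannot underflow
  }
  return 0n; // 0 means "no deadline" - the expiry comparison is skipped
}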

@gramakri

@ShogunPanda I can give it a shot (but give me a day or two). Can you open this bug in the meantime?

@bnoordhuis
Member

@nwtgck mentioned this only happened in CI environments but not locally. It could be time base-related but it could also be something else. I suggest opening a new issue specifically for the bug you've identified.

@gramakri

@bnoordhuis my bug report is a similar situation (I guess?). This is reproducible only inside the Docker container created by our CI. We have a home-grown CI that creates VMs on DigitalOcean; one of the services, which uses Node.js, runs in a Docker container, and that is where this bug is seen.

@bnoordhuis @ShogunPanda Also, can someone tell me what uv_hrtime is? Is it just a monotonically increasing clock? Can it be, say, 1 or 2 or any other arbitrary number at Node.js startup? (As you can see, I am only a Node user; I have little idea about the Node code itself. :-))

@bnoordhuis
Member

Is this just a monotonically increasing clock?

Yes. The only guarantee at startup is that it's somewhere between 0 and 2**64-1 inclusive. :-)
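
If you want to poke at it from JavaScript, process.hrtime.bigint() is (as far as I know) backed by the same uv_hrtime() clock; only differences between two readings are meaningful:

const t0 = process.hrtime.bigint(); // nanoseconds since an arbitrary reference point
setTimeout(() => {
  const t1 = process.hrtime.bigint();
  console.log(`raw value: ${t1}`);                    // says nothing about wall-clock time
  console.log(`elapsed: ${(t1 - t0) / 1000000n} ms`); // roughly 100 ms
}, 100);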

@gramakri

gramakri commented Apr 19, 2023

@bnoordhuis @ShogunPanda I am trying to understand the code at https://github.com/nodejs/node/blob/main/src/node_http_parser.cc#L1110

  const uint64_t request_deadline =
    request_timeout > 0 ? now - request_timeout : 0;
...
     (
        request_deadline > 0 &&
        parser->last_message_start_ < request_deadline)
    ) {

last_message_start_ is set to the time when the last packet was received, so the calculation is not based on the time when the first packet was received. My understanding is that the docs describe the latter: per https://nodejs.org/api/http.html#serverrequesttimeout, requestTimeout "Sets the timeout value in milliseconds for receiving the entire request from the client." Can you clarify whether I am reading the code or the docs wrong?

If the intent is to prevent DoS attacks, then shouldn't the check be based on the "entire request" time and not the "last message" time?

@ShogunPanda
Contributor

@gramakri I know it is misleading, but in the C++ parser a "message" is either an HTTP request or an HTTP response. In our case, it's the HTTP request.

In other words, parser->last_message_start_ is when the first byte of the current (i.e., latest) HTTP request was received on the socket (considering HTTP keep-alive). When it is the first message on the socket, it is set to the connection time, not to when the first byte was actually received (to prevent idle-socket, Slowloris-like DDoS attacks).

@gsong

gsong commented May 17, 2023

I'm seeing this as well when deploying an Express app to @superfly. The specific use case is for Express to proxy multi-file large uploads to Cloudinary via stream.

# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

When using v18.16.0, I experience the error; however, rolling back to v16.20.0 resolves it.

Error: aborted
  File "node:internal/errors", line 717, col 14, in connResetException
  File "node:_http_server", line 754, col 17, in abortIncoming
  File "node:_http_server", line 748, col 3, in socketOnClose
  File "node:events", line 525, col 35, in Socket.emit
  File "node:domain", line 489, col 12, in Socket.emit
  File "node:net", line 322, col 12, in TCP.<anonymous>
  File "node:internal/async_hooks", line 130, col 17, in TCP.callbackTrampoline

I've confirmed that there's no memory leak as the memory usage stays steady through uploading 70+ files.

I wasn't able to replicate this locally on my Mac M1 Pro.

@gsong

gsong commented May 17, 2023

This can also be replicated using v20.2.0.

@gsong

gsong commented May 18, 2023

@bnoordhuis Is there a way to reopen this ticket for further investigation? I'm not sure what the workaround would be.

@gsong

gsong commented May 18, 2023

The same error can be replicated on v16.20.0 if you close the browser during the file transfer. I can replicate this remotely and locally.

@bnoordhuis
Member

I can reopen but what is there to investigate? If you've identified a bug, you should send a pull request with a fix.

@h-a-s-k

h-a-s-k commented Mar 26, 2024

So how is one supposed to upload large files with Node.js without setting a timeout for the whole server? This is the only place I've found discussing the same issue. At first I thought it was a browser issue, but it is not.

I'm trying to upload a big (>5 GB) file to localhost using XMLHttpRequest in the browser with an infinite timeout.
It times out after 5 minutes of uploading on the server, and the browser usually shows 400 Bad Request with no response.

This is the ONLY fix I found for it

Server.requestTimeout = 2 * 60 * 60 * 1000;

How come this can't be set per connection or per route? These don't fix it at all, by the way:

Server.on(`connection`, function (socket) {
    socket.setTimeout(2 * 60 * 60 * 1000);
});
Server.keepAliveTimeout = 2 * 60 * 60 * 1000;
Server.timeout = 2 * 60 * 60 * 1000;
Server.setTimeout(2 * 60 * 60 * 1000);

@ShogunPanda
Contributor

You can't set up a timer on a single request due to how connectionsCheckingInterval works on the server.
We need simple timeout management, otherwise HTTP server performance would be impacted.
