Destroy sessions when an error occurs #97

Open
sttk opened this issue Mar 16, 2023 · 6 comments
Labels: bug (Something isn't working)

Comments


sttk commented Mar 16, 2023

The http2-wrapper version I used was 2.2.0.

I created a BFF with got that sends requests to a backend server on GCP Cloud Run, and encountered repeated, unrecoverable ETIMEDOUT errors. (This ETIMEDOUT error was raised by got and passed to http2-wrapper's clientRequest._destroy.)
While this error persisted, none of the requests got sent reached the backend.

According to logs I embedded in a forked http2-wrapper, while the error occurred, the sessions were always retrieved from agent.sessions.

On the other hand, when an error occurred but the connection recovered, the sessions were newly created.

I could force this recovery behavior by adding code to clientRequest._destroy that destroys the sessions in clientRequest.agent.sessions, deletes the entries in clientRequest.agent.queue, and deletes the session in clientRequest.agent.tlsSessionCache for the request's origin (clientRequest[kOrigin]).
However, this modification is not perfect, because there were still times when unrecoverable errors occurred. (At those times, sessions were still retrieved from agent.sessions.)

Shouldn't there be such a step to destroy sessions and remove them from the queue, etc. when an error occurs?

P.S. These unrecoverable errors never occurred when setting agent: false in http2.auto's options.
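
For reference, a minimal sketch of that workaround (assumes ESM so top-level await works; the URL is a placeholder):

// With agent: false, http2-wrapper bypasses the shared Agent, so every
// request gets its own session and a dead session is never reused.
import http2wrapper from 'http2-wrapper';

const request = await http2wrapper.auto('https://backend.example.com/api', {
  agent: false
}, response => {
  response.resume();
});

request.end();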


sttk commented Mar 16, 2023

The code where I use got:

const gotOpts = {
  method: 'post',
  headers: headers,
  body: body,
  https: { rejectUnauthorized: false },
  http2: true,
  request: http2wrapper.auto,
  timeout: { lookup: 100, connect: 60000, secureConnect: 1000, send: 10000, response: 10000 },
};
const response = await got(url, gotOpts);
...

And the code I added to http2-wrapper to force a connection to recover when an error occurs:

(client-request.js)
...
class ClientRequest extends Writable {
  ...
  async _destroy(error, callback) {
    // --- start of added code ---
    // On any error, tear down every session for this request's origin so the
    // next request creates a fresh connection instead of reusing a dead one.
    if (this.agent) {
      if (error) {
        const origin = this[kOrigin];
        this.agent.destroySessionsByOrigin(origin);
      }
    }
    // --- end of added code ---
    ...
  }
(agent.js)
...
class Agent extends EventEmitter {
  ...
  destroySessionsByOrigin(origin) {
    const {sessions, queue} = this;

    // Destroy every cached session that is authoritative for this origin,
    // and drop its cached TLS session ticket.
    for (const normalizedOptions in sessions) {
      const thisSessions = sessions[normalizedOptions];
      for (let index = 0; index < thisSessions.length; index++) {
        const session = thisSessions[index];
        if (session[kOriginSet].includes(origin)) {
          session.destroy();
        }
      }
      const name = `${normalizedOptions}:${origin}`;
      this.tlsSessionCache.delete(name);
    }

    // Mark any queued connection attempts to this origin as destroyed.
    for (const normalizedOptions in queue) {
      const entries = queue[normalizedOptions];
      for (const normalizedOrigin in entries) {
        if (normalizedOrigin === origin) {
          entries[normalizedOrigin].destroyed = true;
        }
      }
      const name = `${normalizedOptions}:${origin}`;
      this.tlsSessionCache.delete(name);
    }
  }

  destroy() {
  ...

@thesauri

Hi @sttk ! We are experiencing an issue where got sometimes ends up in a state where none of the requests reach the backend. Instead, requests time out with timings that look like this:

timings: {
  "start": 1698668866493,
  "socket": 1698668866494,
  "upload": 1698668866494,
  "error": 1698668896495,
  "phases": {
    "wait": 1,
    "request": null,
    "total": 30002
  }
}

This looks fishy to us: to our understanding, "request" might be undefined, but it should never be null.
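
In case it helps, this is roughly how we capture these timings when a request times out (a sketch; it assumes the thrown got error exposes the timings, either directly or via error.request):

try {
  await got(url, gotOpts);
} catch (error) {
  // Assumption: got attaches the timer data to the error (error.timings)
  // or to the underlying request (error.request.timings).
  const timings = error.timings ?? error.request?.timings;
  console.log(JSON.stringify({
    code: error.code,
    phases: timings?.phases
  }, null, 2));
}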

We suspect that this has something to do with HTTP/2, and it sounds similar to what you're describing here:

While this error persisted, none of the requests got sent reached the backend.

However, we haven't been able to reproduce the issue locally yet. Did you find a way to reliably reproduce this issue? If so, how did you do that?

@thesauri

Hi @sttk and @szmarczak !

I was able to reproduce the issue now, please find an example in this repository: https://github.com/thesauri/http2-wrapper-timeout-bug

In the example, the got client ends up in a state where requests stop working completely:

  1. First, the example code fires a large number of requests at an endpoint that doesn't respond
  2. Then, the example code tries requesting an endpoint that should respond normally
  3. These requests now fail with timeout errors too (a rough sketch of this pattern follows below)
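
Roughly, the pattern is this (a sketch of the idea, not the repository's actual code; URLs and the request count are placeholders):

import got from 'got';
import http2wrapper from 'http2-wrapper';

const client = got.extend({
  http2: true,
  request: http2wrapper.auto,
  https: { rejectUnauthorized: false },
  retry: { limit: 0 },
  timeout: { request: 5000 },
  throwHttpErrors: false
});

// 1. Fire a large number of requests at an endpoint that never responds.
await Promise.allSettled(
  Array.from({ length: 1000 }, () => client('https://localhost:9999/hang'))
);

// 2. Request an endpoint that should respond normally…
// 3. …but these requests now time out as well.
const response = await client('https://localhost:9999/ok');
console.log(response.statusCode);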

To verify that the issue is in http2-wrapper, and not in got, I tried removing the http2: true flag and request: http2wrapper.auto to ensure that HTTP/1.1 is used. With HTTP/1.1, the second set of requests works just fine.

Also, as a further observation, reducing the number of non-responding requests (to, say, n=100) makes the issue go away in the example.

Unfortunately, we do run into this issue every once in a while in our production environment, which runs with real-world flakiness. The only solutions we've found so far are either disabling HTTP/2 completely or restarting the server when it happens.
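
As a stop-gap sketch (the client name is ours), disabling HTTP/2 just means extending got without the HTTP/2 transport:

import got from 'got';

// No http2: true and no request: http2wrapper.auto, so got uses node:https
// and plain HTTP/1.1.
const client = got.extend({
  http2: false
});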

@szmarczak
Owner

Interesting... Commenting out limit: 0 in @thesauri's example triggers a Node.js assertion:

$ node app.js 
Waiting for timeout requests to finish...
(node:24727) [DEP0111] DeprecationWarning: Access to process.binding('http_parser') is deprecated.
(Use `node --trace-deprecation ...` to show where the warning was created)
Server listening on port 9999
node:internal/assert:14
    throw new ERR_INTERNAL_ASSERTION(message);
    ^

Error [ERR_INTERNAL_ASSERTION]: This is caused by either a bug in Node.js or incorrect usage of Node.js internals.
Please open an issue with this stack trace at https://github.com/nodejs/node/issues

    at assert (node:internal/assert:14:11)
    at internalConnectMultiple (node:net:1118:3)
    at Timeout.internalConnectMultipleTimeout (node:net:1687:3)
    at listOnTimeout (node:internal/timers:575:11)
    at process.processTimers (node:internal/timers:514:7) {
  code: 'ERR_INTERNAL_ASSERTION'
}

Node.js v21.1.0

@mantysalo

Hi @szmarczak,

I have created a reproduction of a similar issue here:

https://github.com/mantysalo/http2-issue

This only seems to manifest itself when using Got with retry enabled, so I opened an issue in the Got repository as well.


mantysalo commented Nov 22, 2023

I managed to reproduce this by using just the http2-wrapper library, so I closed sindresorhus/got#2310.

I have changed the reproduction repository to use only http2-wrapper and the node:http2 module.

There are some interesting things I've noticed while debugging this:

The issue only happens if the socket's Send-Q is full

By "full" I mean that the value under Send-Q in netstat -anpo does not increase any further, even though the application keeps making requests.

See netstat logs below 👇

Netstat logs
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0  94506 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (76.46/0/14)
tcp        0  94650 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (75.45/0/14)
tcp        0  94772 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (74.45/0/14)
tcp        0  94916 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (73.45/0/14)
tcp        0  95060 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (72.45/0/14)
tcp        0  95167 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (71.44/0/14)
tcp        0  95239 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (70.44/0/14)
tcp        0  95348 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (69.43/0/14)
tcp        0  95492 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (68.43/0/14)
tcp        0  95636 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (67.42/0/14)
tcp        0  95780 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (66.42/0/14)
tcp        0  95924 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (65.41/0/14)
tcp        0  96068 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (64.41/0/14)
tcp        0  96212 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (63.41/0/14)
tcp        0  96356 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (62.40/0/14)
tcp        0  96500 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (61.40/0/14)
tcp        0  96607 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (60.39/0/14)
tcp        0  96716 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (59.39/0/14)
tcp        0  96823 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (58.39/0/14)
tcp        0  96932 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (57.38/0/14)
tcp        0  97076 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (56.38/0/14)
tcp        0  97220 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (55.38/0/14)
tcp        0  97364 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (54.37/0/14)
tcp        0  97508 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (53.37/0/14)
tcp        0  97652 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (52.36/0/14)
## Send-Q is now full and does not increase even though requests keep firing
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (51.36/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (50.35/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (49.35/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (48.34/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (47.33/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (46.33/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (45.32/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (44.32/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (43.32/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (42.31/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (41.31/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (40.31/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (39.30/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (38.30/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (37.29/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (36.29/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (35.28/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (34.27/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (33.27/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (32.26/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (31.25/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (30.24/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (29.24/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (28.24/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (27.23/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (26.23/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (25.23/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (24.22/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (23.21/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (22.21/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (21.20/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (20.20/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (19.19/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (18.18/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (17.18/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (16.18/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (15.17/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (14.17/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (13.16/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (12.15/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (11.15/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (10.14/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (9.14/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (8.13/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (7.13/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (6.12/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (5.11/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (4.10/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (3.10/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (2.09/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (1.09/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (0.08/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (0.00/0/14)
tcp        0  97776 172.21.0.4:59642        172.21.0.2:3001         ESTABLISHED 1/node               probe (0.00/0/14)

This explains why I saw the issue with got, as got adds extra headers to the request, thus making the Send-Q fill up quicker.

If the Send-Q is not full and the socket is closed by the OS, the following debug logs are printed:

http2-issue-client-1        | NET 1: destroy
http2-issue-client-1        | NET 1: close
http2-issue-client-1        | NET 1: close handle
http2-issue-client-1        | HTTP2 1: Http2Session <invalid>: socket error [read ETIMEDOUT]
http2-issue-client-1        | HTTP2 1: Http2Session client: destroying
http2-issue-client-1        | HTTP2 1: Http2Session client: start closing/destroying Error: read ETIMEDOUT
http2-issue-client-1        |     at TLSWrap.onStreamRead (node:internal/stream_base_commons:217:20) {
http2-issue-client-1        |   errno: -110,
http2-issue-client-1        |   code: 'ETIMEDOUT',
http2-issue-client-1        |   syscall: 'read'
http2-issue-client-1        | }

However, when the Send-Q is full, no such logs are printed, and the Http2Session keeps using the socket even though it has been closed by the OS. (It's weird that this doesn't produce any errors.)
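
One way to confirm the session is wedged (a sketch, not code from the reproduction repository): ping the Http2Session with a deadline. On a healthy connection the PING ACK comes back quickly; on a session stuck behind a full Send-Q it never does, even though no error event fires.

// `session` is an existing node:http2 ClientHttp2Session; the 5 s deadline
// is arbitrary.
function checkSessionAlive(session) {
  const deadline = setTimeout(() => {
    // No PING ACK in time: the socket is most likely dead, so drop the session.
    session.destroy(new Error('HTTP/2 ping timed out'));
  }, 5000);

  session.ping((error, duration) => {
    clearTimeout(deadline);
    if (!error) {
      console.log(`PING ACK after ${duration} ms`);
    }
  });
}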

The issue happens with the node:http2 module as well

I've included the code in the reproduction repository. There might very well be an error in my code, as I'm not very familiar with the http2 module, but I find it odd that the hang only happens when the Send-Q is full, while when it's not full, clear errors are printed in the debug logs.
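
For context, driving node:http2 directly looks roughly like this (a sketch of the idea, not the repository's actual code; the URL is a placeholder):

import http2 from 'node:http2';

const session = http2.connect('https://localhost:3001', {
  rejectUnauthorized: false
});

const request = session.request({ ':method': 'GET', ':path': '/' });

// Give up on the stream if nothing happens within 10 s.
request.setTimeout(10000, () => {
  request.close(http2.constants.NGHTTP2_CANCEL);
});

request.on('response', headers => {
  console.log(headers[':status']);
});

request.on('close', () => {
  session.close();
});

request.resume();
request.end();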

EDIT:

Looks like it may be a very old bug in Node: nodejs/node#4417 has basically all the same symptoms.
