fetch can make worker_threads stuck and process hang #2026

Closed
AriPerkkio opened this issue Mar 26, 2023 · 30 comments
Labels
bug Something isn't working

Comments

@AriPerkkio

Bug Description

When undici is run inside worker_threads, it may prevent the worker from terminating and make the whole process hang.

Reproducible By

The Node script below runs undici in worker threads and makes four requests in each worker before exiting. If this does not reproduce in your environment (e.g. on a fast machine), try increasing the number of requests and const ROUNDS.

"undici": "^5.21.0"
// Save as `undici.mjs` and run as `node undici.mjs`
import { isMainThread, parentPort, Worker } from "node:worker_threads";
import { fileURLToPath } from "node:url";
import { cpus } from "node:os";

const THREADS = cpus().length - 1;
const ROUNDS = 10;
const filename = fileURLToPath(import.meta.url);

if (isMainThread) {
  const stuckWorkers = [];
  const interval = setInterval(() => {
    if (stuckWorkers.length > 0) {
      console.log(`There are ${stuckWorkers.length} stuck workers`);
    }
  }, 1000);

  const timeout = setTimeout(() => {
    setTimeout(() => console.log("Still running"), 5_000); // This is never called, so process.exit does something, but it's still unable to terminate the worker running undici
    console.log("Forcing process.exit on main thread");
    process.exit();
  }, 30_000);

  const task = async () => {
    const worker = new Worker(filename, {});

    return new Promise((resolve) => {
      worker.on("message", (message) => {
        if (message !== "DONE") return;

        const timer = setTimeout(() => {
          console.log("Unable to terminate");
          stuckWorkers.push(worker);
        }, 10_000);
        worker.terminate().then(() => {
          clearTimeout(timer);
          resolve();
        });
      });
    });
  };

  const pool = Array(ROUNDS).fill(task);

  async function execute() {
    const task = pool.shift();

    if (task) {
      await task();
      return execute();
    }
  }

  await Promise.all(
    Array(THREADS)
      .fill(execute)
      .map((task) => task())
  );
  console.log("All done!");
  clearTimeout(timeout);
  clearInterval(interval);
} else {
  const { fetch } = await import("undici");

  const request = async (pathWithParams) => {
    const url = new URL(pathWithParams, "https://api.spacexdata.com");

    const timer = setTimeout(
      () => console.log("Request taking more than 10s"), // This is never logged so requests are not stuck
      10_000
    );
    const output = await fetch(url.href, {
      method: "GET",
      headers: { "Content-Type": "application/json" },
    });
    clearTimeout(timer);
    return output;
  };
  await Promise.all([
    request("/v4/starlink").then((r) => r.json()),
    request("/v4/starlink/5eed770f096e59000698560d").then((r) => r.json()),
    request("/v4/starlink").then((r) => r.json()),
    request("/v4/starlink/5eed770f096e59000698560d").then((r) => r.json()),
  ]);

  parentPort.postMessage("DONE");
}

Expected Behavior

The package should be safe to run inside worker_threads. If it's not, it should detect such an environment and throw an error.
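For illustration, such a guard could be as small as the following (just a sketch of the idea, not a proposal for undici's actual API or error wording):

import { isMainThread } from "node:worker_threads";

// Hypothetical guard: refuse to run inside a worker thread until the hang is understood.
if (!isMainThread) {
  throw new Error("running undici fetch inside worker_threads is not supported");
}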

Logs & Screenshots

ari ~/repro  $ node undici.mjs 
Unable to terminate
There are 1 stuck workers
...
There are 1 stuck workers
Forcing process.exit on main thread
^Z
[1]+  Stopped                 node undici.mjs

ari ~/repro  $ pgrep node
24498

ari ~/repro  $ kill  24498
ari ~/repro  $ pgrep node
24498
# ^^ Ok this one is really stuck

ari ~/repro  $ kill  -9 24498
[1]+  Killed: 9               node undici.mjs

ari ~/repro  $ pgrep node
# no output, process killed

Environment

macOS 13.2.1
Node 18.14.0

Additional context

vitest-dev/vitest#3077

@AriPerkkio
Author

Looks like the same minimal reproduction gets stuck even with Node's fetch. Should this be moved to the nodejs/node issue tracker instead, or does Node use undici under the hood?
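(For reference, the only change needed in the repro is dropping the dynamic import in the worker branch, e.g.:)

// const { fetch } = await import("undici");   <- remove this line
const fetch = globalThis.fetch; // Node 18+ ships fetch globally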

@AriPerkkio changed the title from "worker_threads can get stuck and make process hang" to "fetch can make worker_threads stuck and process hang" on Mar 26, 2023
@ronag

This comment was marked as outdated.

@ronag
Member

ronag commented Mar 26, 2023

Yep. I can confirm this happens.

@ronag
Member

ronag commented Mar 26, 2023

Can you repro with just undici.request? Is this a fetch-specific issue?

@ronag
Member

ronag commented Mar 26, 2023

https://www.npmjs.com/package/why-is-node-running
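(A minimal sketch of how a dump like the one below can be produced with that package; the exact timing and placement here are an assumption:)

// Import it first so the hooks see every async resource, then dump the open
// handles once the process was expected to have exited already.
import whyIsNodeRunning from "why-is-node-running";

setTimeout(() => {
  whyIsNodeRunning(); // prints each active handle and the stack that created it
}, 5_000).unref();    // unref so the timer itself does not keep the process alive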

There are 10 handle(s) keeping the process running

# TLSWRAP
node:internal/async_hooks:202                     
/Users/ronagy/GitHub/undici/lib/core/connect.js:96 - socket = tls.connect({
/Users/ronagy/GitHub/undici/lib/client.js:1069     - client[kConnector]({
/Users/ronagy/GitHub/undici/lib/client.js:1068     - const socket = await new Promise((resolve, reject) => {
/Users/ronagy/GitHub/undici/lib/client.js:1262     - connect(client)

# DNSCHANNEL
node:internal/async_hooks:202      
node:internal/dns/utils:80         
node:internal/dns/utils:75         
node:internal/dns/utils:242        
node:internal/dns/utils:252        
node:internal/bootstrap/loaders:333
node:internal/bootstrap/loaders:364

# TLSWRAP
node:internal/async_hooks:202                     
/Users/ronagy/GitHub/undici/lib/core/connect.js:96 - socket = tls.connect({
/Users/ronagy/GitHub/undici/lib/client.js:1069     - client[kConnector]({
/Users/ronagy/GitHub/undici/lib/client.js:1068     - const socket = await new Promise((resolve, reject) => {
/Users/ronagy/GitHub/undici/lib/client.js:1262     - connect(client)

# TLSWRAP
node:internal/async_hooks:202                     
/Users/ronagy/GitHub/undici/lib/core/connect.js:96 - socket = tls.connect({
/Users/ronagy/GitHub/undici/lib/client.js:1069     - client[kConnector]({
/Users/ronagy/GitHub/undici/lib/client.js:1068     - const socket = await new Promise((resolve, reject) => {
/Users/ronagy/GitHub/undici/lib/client.js:1262     - connect(client)

# TLSWRAP
node:internal/async_hooks:202                     
/Users/ronagy/GitHub/undici/lib/core/connect.js:96 - socket = tls.connect({
/Users/ronagy/GitHub/undici/lib/client.js:1069     - client[kConnector]({
/Users/ronagy/GitHub/undici/lib/client.js:1068     - const socket = await new Promise((resolve, reject) => {
/Users/ronagy/GitHub/undici/lib/client.js:1262     - connect(client)

# ZLIB
node:internal/async_hooks:202                      
/Users/ronagy/GitHub/undici/lib/fetch/index.js:2011 - decoders.push(zlib.createBrotliDecompress())
/Users/ronagy/GitHub/undici/lib/core/request.js:232 - return this[kHandler].onHeaders(statusCode, headers, resume, statusText)
/Users/ronagy/GitHub/undici/lib/client.js:789       - pause = request.onHeaders(statusCode, headers, this.resume, statusText) === false
/Users/ronagy/GitHub/undici/lib/client.js:406       - return currentParser.onHeadersComplete(statusCode, Boolean(upgrade), Boolean(shouldKeepAlive)) || 0
wasm://wasm/00036ac6:1                             

# ZLIB
node:internal/async_hooks:202                      
/Users/ronagy/GitHub/undici/lib/fetch/index.js:2011 - decoders.push(zlib.createBrotliDecompress())
/Users/ronagy/GitHub/undici/lib/core/request.js:232 - return this[kHandler].onHeaders(statusCode, headers, resume, statusText)
/Users/ronagy/GitHub/undici/lib/client.js:789       - pause = request.onHeaders(statusCode, headers, this.resume, statusText) === false
/Users/ronagy/GitHub/undici/lib/client.js:406       - return currentParser.onHeadersComplete(statusCode, Boolean(upgrade), Boolean(shouldKeepAlive)) || 0
wasm://wasm/00036ac6:1                             

# ZLIB
node:internal/async_hooks:202                      
/Users/ronagy/GitHub/undici/lib/fetch/index.js:2011 - decoders.push(zlib.createBrotliDecompress())
/Users/ronagy/GitHub/undici/lib/core/request.js:232 - return this[kHandler].onHeaders(statusCode, headers, resume, statusText)
/Users/ronagy/GitHub/undici/lib/client.js:789       - pause = request.onHeaders(statusCode, headers, this.resume, statusText) === false
/Users/ronagy/GitHub/undici/lib/client.js:406       - return currentParser.onHeadersComplete(statusCode, Boolean(upgrade), Boolean(shouldKeepAlive)) || 0
wasm://wasm/00036ac6:1                             

# ZLIB
node:internal/async_hooks:202                      
/Users/ronagy/GitHub/undici/lib/fetch/index.js:2011 - decoders.push(zlib.createBrotliDecompress())
/Users/ronagy/GitHub/undici/lib/core/request.js:232 - return this[kHandler].onHeaders(statusCode, headers, resume, statusText)
/Users/ronagy/GitHub/undici/lib/client.js:789       - pause = request.onHeaders(statusCode, headers, this.resume, statusText) === false
/Users/ronagy/GitHub/undici/lib/client.js:406       - return currentParser.onHeadersComplete(statusCode, Boolean(upgrade), Boolean(shouldKeepAlive)) || 0
wasm://wasm/00036ac6:1      

@ronag
Member

ronag commented Mar 26, 2023

@jasnell How is createBrotliDecompress keeping a ref that stops shutdown?

@ronag
Member

ronag commented Mar 26, 2023

@KhafraDev any ideas?

@AriPerkkio
Author

Can you repro with just undici.request? Is this a fetch-specific issue?

undici.request does not cause this issue. I tested by running 100 rounds with request() successfully. The fetch() case causes the issue after 10 rounds on my laptop.
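A request()-only worker branch looks roughly like this (a sketch rather than the exact script I ran; only the client call changes):

const { request } = await import("undici");

const requestJson = async (pathWithParams) => {
  const url = new URL(pathWithParams, "https://api.spacexdata.com");
  const { body } = await request(url.href, { method: "GET" });
  return body.json(); // consume the body so the connection can be reused
};

await Promise.all([
  requestJson("/v4/starlink"),
  requestJson("/v4/starlink/5eed770f096e59000698560d"),
  requestJson("/v4/starlink"),
  requestJson("/v4/starlink/5eed770f096e59000698560d"),
]);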

@ronag
Member

ronag commented Mar 26, 2023

And now I can't reproduce it. 😢

@KhafraDev
Member

KhafraDev commented Mar 26, 2023

I was hoping this wouldn't come up 😐. I've had this issue in the past, with no idea how to debug it or even cause it. It might be an issue with web readable streams (i.e. something like nodejs/node#44985?), but I'm unsure.

If it is an issue with zlib, that'd make sense at least. request doesn't decompress the body so you'd only experience it in fetch (such as this issue).
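In other words, to exercise the same zlib path with request you would have to wire up the decoder yourself, roughly like this (a sketch, assuming the server honours the accept-encoding header):

import { request } from "undici";
import { createBrotliDecompress } from "node:zlib";

const { headers, body } = await request("https://api.spacexdata.com/v4/starlink", {
  headers: { "accept-encoding": "br" },
});

// request() hands back the raw body; fetch() would have piped it through zlib for you.
const stream = headers["content-encoding"] === "br" ? body.pipe(createBrotliDecompress()) : body;

const chunks = [];
for await (const chunk of stream) chunks.push(chunk);
console.log(Buffer.concat(chunks).length, "decompressed bytes");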

@AriPerkkio
Author

And now I can't reproduce it. 😢

This reproduction case is a bit flaky. You may need to add more parallel requests or increase const ROUNDS to get the workers stuck. I guess it's also possible the API applies some rate limit if you run the script repeatedly. I think the issue comes up when the CPU load is high enough, so maybe run some expensive task, like zipping a file, in the background. I'm using an Apple M2 Air and can reproduce the issue easily.

Are there any workarounds available for now? How could we make the main thread force the workers to terminate when worker.terminate() and process.exit() are not enough? worker.terminate() does do something to the workers, since they no longer respond to worker.postMessage calls, but they are still stuck somewhere.

@KhafraDev
Member

KhafraDev commented Mar 28, 2023

It should be noted that the worker.terminate docs explicitly say that it will "[s]top all JavaScript execution in the worker thread as soon as possible". There's no guarantee that it'll actually close the worker if there is a resource (seemingly a stream from the debugging above) still ref'd.

Do you notice this issue if you run tests sequentially? We run hundreds of fetch tests in workers w/o issue, but one at a time. We noticed some issues with sockets staying open (not because of fetch) when running in parallel, but this could be a similar issue?

@AriPerkkio
Author

AriPerkkio commented Mar 29, 2023

[...] docs explicitly say that it will "[s]top all JavaScript execution in the worker thread as soon as possible". There's no guarantee that it'll actually close the worker if there is a resource (seemingly a stream from the debugging above) still ref'd.

Yes, worker.terminate is likely working correctly. I would expect that once await fetch() has resolved, it would not leave any active resources in the background. By the time the end of the worker file is reached, I would expect all JavaScript to have run. But fetch seems to leave some side effect here.

Also, when I replace worker_threads with child_process, this issue does not happen. And when I keep worker_threads but replace undici with node-fetch, the issue does not happen either.

Do you notice this issue if you run tests sequentially?

The issue comes up even when only a single worker is used. But if the worker fires fetch calls sequentially instead of in parallel with Promise.all, it does not come up. Also, those sequential fetch calls seem to be much faster than the parallel ones.
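The sequential variant is simply this (a sketch of the pattern using the same request helper as in the original repro, not the exact code I ran):

// Awaiting each call before starting the next one: no hang in my testing.
for (const path of [
  "/v4/starlink",
  "/v4/starlink/5eed770f096e59000698560d",
  "/v4/starlink",
  "/v4/starlink/5eed770f096e59000698560d",
]) {
  const response = await request(path);
  await response.json();
}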

We run hundreds of fetch tests in workers w/o issue, but one at a time.

Here is a failing test case: #2032

@KhafraDev
Member

That test case isn't failing on my machine (both Windows and Ubuntu, using Node v19.8.1).

@AriPerkkio
Author

Sure, the point was that it fails in CI: https://github.com/nodejs/undici/actions/runs/4551655590/jobs/8026203674#step:14:12832. CI should be used as the reference instead of developer machines. It's much easier to reproduce the issue consistently that way.

@KhafraDev
Member

KhafraDev commented Mar 29, 2023

Failing, yes, but it's hard to fix an issue that can't be debugged locally.


Decompressing the body has a... pretty large number of issues that I can see. I wouldn't be surprised if createBrotliDecompress is causing issues, as it's the central cause of another issue that's hard to debug. Actually, looking at the code in #2032, spacexdata is sending us a br content-encoding header too.

If we take a look at https://github.com/node-fetch/node-fetch/blob/e87b093fd678a9ea39c5b17b2a1bdfc4691eedc7/src/index.js#L277-L366 (which is how node-fetch decompresses the body), it's more or less identical to what undici is doing in this instance. I'm unsure what exactly is causing undici to hang here and not node-fetch, unless of course it's not an issue with zlib, but something else?
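The shared pattern is roughly the following (a simplified sketch, not the actual undici or node-fetch source):

import zlib from "node:zlib";
import { pipeline } from "node:stream";

// Both libraries look at content-encoding and pipe the raw body through the
// matching zlib decoder(s) before handing the stream to the user.
function decompressBody(rawBody, contentEncoding = "") {
  const decoders = [];
  for (const coding of contentEncoding.split(",").map((c) => c.trim().toLowerCase()).reverse()) {
    if (coding === "gzip" || coding === "x-gzip") decoders.push(zlib.createGunzip());
    else if (coding === "deflate") decoders.push(zlib.createInflate());
    else if (coding === "br") decoders.push(zlib.createBrotliDecompress());
  }
  if (decoders.length === 0) return rawBody;
  // pipeline() returns the last stream, which is what the caller reads from;
  // on failure it destroys every stream and the returned one emits 'error'.
  return pipeline(rawBody, ...decoders, (err) => {
    if (err) console.error("decompression failed:", err);
  });
}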

@silverwind
Contributor

silverwind commented Mar 30, 2023

I've been experiencing this as well: random hangs that only reproduce on Linux and never on macOS. The one time I observed it interactively in a Vagrant VM, I saw 200% CPU usage; the process was completely stuck somewhere and did not even accept SIGINT from the keyboard.

There were no brotli-encoded responses involved; all responses were uncompressed and came from a local Node.js server.

@AriPerkkio
Author

I wouldn't be surprised if createBrotliDecompress is causing issues

I've been trying to reproduce this issue without fetch at all, using just node:stream and node:zlib with workers. So far I haven't been able to reproduce it that way. Below is a script that reads the API's raw response from the filesystem and passes it to createBrotliDecompress:

/*
curl --raw -H 'Accept-Encoding: br' https://api.spacexdata.com/v4/starlink > raw-response
*/
import { createBrotliDecompress } from "node:zlib";
import { pipeline } from "node:stream";
import { createReadStream } from "node:fs";
import { isMainThread, parentPort, Worker } from "node:worker_threads";
import { fileURLToPath } from "node:url";
import { cpus } from "node:os";

const THREADS = cpus().length - 1;
const ROUNDS = 100;

if (isMainThread) {
  const timeout = setTimeout(() => {
    console.log("Forcing process.exit on main thread");
    process.exit();
  }, 60_000);

  const task = async () => {
    const worker = new Worker(fileURLToPath(import.meta.url), {});
    const [promise, resolve] = getPromise();
    worker.on("message", (m) => m === "DONE" && resolve());
    await promise;

    const timer = setTimeout(() => {
      console.log("Unable to terminate");
    }, 10_000);
    await worker.terminate();
    clearTimeout(timer);
  };

  const pool = Array(ROUNDS).fill(task);

  async function execute() {
    const task = pool.shift();
    if (task) {
      await task();
      return execute();
    }
  }

  await Promise.all(
    Array(THREADS)
      .fill(execute)
      .map((task) => task())
  );
  console.log("All done!");
  clearTimeout(timeout);
} else {
  const pool = Array(10).fill(decompress);
  await Promise.all(pool.map((task) => task()));
  parentPort.postMessage("DONE");
}

async function decompress() {
  const [promise, resolve, reject] = getPromise();
  const output = pipeline(
    createReadStream("./raw-response"),
    createBrotliDecompress(),
    (err) => err && reject(err)
  );

  const chunks = [];

  output.on("readable", () => {
    let chunk;
    while (null !== (chunk = output.read())) {
      chunks.push(chunk);
    }
  });

  output.on("end", resolve);
  output.on("error", reject);

  await promise;

  const data = chunks.join("");
  const json = JSON.parse(data);
  console.assert(json != null, `JSON unexpected: ${JSON.stringify(json)}`);
}

function getPromise() {
  let resolve, reject;
  const promise = new Promise((...args) => {
    [resolve, reject] = args;
  });

  return [promise, resolve, reject];
}

it's hard to fix an issue that can't be debugged locally

I'm happy to help by providing debugging logs or similar if there is anything you'd like to see. I'm able to reproduce the original issue consistently on my local machine.

I've been trying to look for hanging HTTP sockets with netstat when the workers get stuck, but haven't spotted anything unusual. I'm running out of ideas on what to look for next.

@silverwind
Contributor

silverwind commented Mar 31, 2023

vitest-dev/vitest#2008 (comment) may be relevant as well; there I managed to attach a Chrome debugger via node --inspect to such a hanging process, but I'm not sure if those traces are of any help, or even whether it's the exact same issue.

It would definitely help if we could get similar traces via node --inspect.

@mcollina
Member

mcollina commented Apr 10, 2023

I wonder if this is the same as nodejs/node#47228

@silverwind
Contributor

I'm not sure. I did observe this with Node 18. Also, that hang is with an idle CPU; this one was with a pegged CPU.

@AriPerkkio
Author

I wonder if this is the same as nodejs/node#47228

I don't think so. That issue mentions that it does not reproduce on Node 18, while this one does. Also, by passing the keepalive: false option to fetch I'm still able to get the worker_threads stuck with this issue's reproduction case.
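i.e. the reproduction still hangs with roughly:

const output = await fetch(url.href, {
  method: "GET",
  keepalive: false, // makes no difference here
  headers: { "Content-Type": "application/json" },
});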

ronag added a commit that referenced this issue Apr 11, 2023
ronag added a commit that referenced this issue Apr 11, 2023
ronag added a commit that referenced this issue Apr 11, 2023
ronag added a commit that referenced this issue Apr 11, 2023
@TheEssem

TheEssem commented Apr 13, 2023

Can confirm that this issue occurs on both Node v17.9.0 and v19.8.1 using undici v5.21.2, and not just with fetch. Here is a smaller example to help reproduce:

index.js:

import { Worker } from "worker_threads";
import path from "path";
import { fileURLToPath } from "url";

new Worker(path.join(path.dirname(fileURLToPath(import.meta.url)), "./worker.js"));

worker.js:

import { request } from "undici";

request("https://media.tenor.com/qy_WcGdRzfgAAAAC/xluna-high-five.gif").then(res => res.body.arrayBuffer()).then(buf => Buffer.from(buf))
  .then(() => ({
    buffer: Buffer.alloc(10),
    fileExtension: "bin"
  }))
  .then(() => {
    process.exit();
  });

@mcollina
Member

Now removing process.exit fixes it:

import { request } from "undici";

request("https://media.tenor.com/qy_WcGdRzfgAAAAC/xluna-high-five.gif").then(res => res.body.arrayBuffer()).then(buf => Buffer.from(buf))
  .then(() => ({
    buffer: Buffer.alloc(10),
    fileExtension: "bin"
  }))
  .then(() => {
    //// comment out process.exit and the worker would not exit anymore
    // process.exit();
  });

I think this might be a bug in Node.js after all. @addaleax, have you got an idea why process.exit() in a worker might cause the worker to freeze?

Even worse, adding NODE_DEBUG=worker makes the worker actually terminate correctly.

@gajus

gajus commented Nov 3, 2023

What's the simplest way to override all global references to fetch?

Presumably, if we can do that, then we should be able to avoid the bug.

It seems we are hit pretty hard by this, with every couple of test runs flaking as a result (vitest-dev/vitest#4415).

@mcollina
Member

mcollina commented Nov 3, 2023

That issue seems unrelated; undici does not allocate any FileHandle.

@gajus

gajus commented Nov 3, 2023

That issue seems unrelated; undici does not allocate any FileHandle.

We ended up removing all use of fetch from our codebase, restricting any remaining use by overriding globalThis.fetch, and the issue disappeared. 🤷
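The kind of override we used is a guard along these lines (a sketch; the exact code and error message are assumptions):

// Fail loudly if anything still reaches for the global fetch.
globalThis.fetch = () => {
  throw new Error("global fetch is disabled here; use the project's HTTP client instead");
};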

@divmgl

divmgl commented Nov 16, 2023

Yep, running into this now with fetch. I had to put this at the top of our Vitest setupTests.ts:

import fetch from "node-fetch"

globalThis.fetch = fetch

Issue is now resolved.

@tsctx
Member

tsctx commented Mar 15, 2024

Hi @AriPerkkio,
Is this fixed in the latest Node (v21.7.1)?
A WebStreams fix (nodejs/node#51526) has been added. It seems to have fixed this issue.

@AriPerkkio
Author

Great work @tsctx and @jasnell (from nodejs/node#51255, 👋)!

This issue seems to be fixed. I ran these tests using undici@6.2.0 and the following Node versions:

  • v18.19.1 stuck on first run
  • v21.7.1 10/10 runs without any worker hangs even with stress
  • v21.7.0 10/10 runs without any worker hangs
  • v21.6.2 10/10 runs without any worker hangs
  • v21.5.0 stuck on first run

Looks like the fix was originally included in v21.6.2 via nodejs/node#51255, but reworked for v21.7.1 in nodejs/node#51526.

This also seems to fix two other node:worker_threads hangs that were not using Undici at all.
I just now tested both of these with v21.7.1 and they no longer hang. With v18.19.1 the worker hangs are still easily reproduced.

Let's close this issue, as the fix on Node's side has landed and the root cause was not in Undici.
