
Memory Leak in dd-trace http plugin #4286

Closed

viral-barot opened this issue May 7, 2024 · 14 comments

@viral-barot

viral-barot commented May 7, 2024

I have encountered a memory leak issue with the dd-trace package.
When I disable the http plugin with DD_TRACE_DISABLED_INSTRUMENTATIONS=http there is no memory leak.

I have tried upgrading and downgrading the dd-trace package between versions 3.x and 5.x, and also tried different Node versions, but I am still facing the same issue.

node version : 18.17.1
dd-trace version : 5.10.0
restify version : 11.1.0
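
For reference, this is roughly how the http plugin was disabled. The environment-variable form is what was actually used; the programmatic form is only a sketch, assuming dd-trace's tracer.use() plugin config accepts an enabled flag:

// Option 1 (what was actually used): disable the http integration via an env var.
//   DD_TRACE_DISABLED_INSTRUMENTATIONS=http node --require dd-trace/init app.js
//
// Option 2 (sketch only; assumes the plugin config accepts `enabled: false`):
const tracer = require('dd-trace').init();
tracer.use('http', { enabled: false }); // skip creating spans for the http integration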

Memory Heap Snapshot

http plugin Disabled - Heap Snapshot.zip

http plugin Enabled - Heap Snapshot.zip

@viral-barot viral-barot changed the title Memory Leak in dd-trace http plugin Memory Leak in dd-trace http plugin May 7, 2024
@sunsheeppoplar

@tlhunter sorry for the tag, not sure of the protocol for raising issues to folks at DD... wondering if this is being looked at in the v4 and v5 lines

@SkylerRogers

I too am having this issue. We're seeing an elevated memory footprint in our Node application after upgrading the dd-trace npm package. Has anyone on the Datadog side been able to confirm this?

@ZeroCho

ZeroCho commented May 21, 2024

For me, using cross-fetch instead of the native Node.js fetch solved the memory leak (a sketch of the swap follows the list). I guess something is wrong with dc-polyfill. Every fat span in the continuous profiler is related to fetch code.

  1. The waitForImage method calls fetch, which remains in memory even after waitForImage ends; this span disappears if I replace it with cross-fetch.
    (screenshot)
  2. replicate uses fetch internally and it remains in memory even after requests are finished. If I replace it with cross-fetch, this span disappears.
    (screenshot)
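
A minimal sketch of the swap, assuming cross-fetch is installed and acts as a drop-in replacement for the WHATWG fetch API (waitForImage is just the method name from item 1; its signature here is hypothetical):

const fetch = require('cross-fetch'); // instead of relying on Node's built-in fetch

// hypothetical wrapper based on the waitForImage method mentioned above
async function waitForImage (url) {
  const res = await fetch(url); // with cross-fetch, the lingering fetch span reportedly goes away
  return res.ok;
}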

@tayler-king

tayler-king commented Jun 22, 2024

Also noticing increased memory usage, growing at a linear rate, after adding dd-trace to our project.

Have disabled the http plugin for now and will report back if that resolves the leak.

Edit: Can confirm disabling the http plugin has fixed the leak.

@tlhunter
Member

I'll be looking into this issue today.

Does anybody have a simple reproduction of the issue? Ideally a single file that generates outgoing requests? Oftentimes memory leaks are caused by interplay with third-party packages, but without knowing the dependencies everyone in this thread is using I'll be working in the dark.
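
For anyone who wants a starting point, here is a rough sketch of the kind of single-file repro that would help; the target URL, request count, and logging interval are placeholders:

// repro-sketch.js -- generate many outbound requests and watch RSS.
// Run it twice: `node --expose-gc repro-sketch.js` and
// `node --expose-gc --require dd-trace/init repro-sketch.js`.
const TARGET = process.env.TARGET_URL || 'http://localhost:8001'; // placeholder upstream

async function main () {
  for (let i = 1; i <= 1000000; i++) {
    const res = await fetch(TARGET); // Node >= 18 global fetch
    await res.text();                // consume the body so the request fully completes
    if (i % 100000 === 0) {
      if (global.gc) global.gc();    // only available with --expose-gc
      console.log(i, (process.memoryUsage().rss / 1024 / 1024).toFixed(1) + 'MB');
    }
  }
}

main();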

@tlhunter
Member

I've been working on this for a couple of hours and have not perfectly reproduced the situation where requests made via fetch with dd-trace enabled cause a memory leak and where setting DD_TRACE_DISABLED_INSTRUMENTATIONS=http alleviates it. My attempt at a repro is a very basic app with an http server. It might be that the leak only happens when fetch is paired with a given framework like Express or Nest.
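
If the framework pairing is the trigger, a variant along these lines might surface it; this is only a sketch (Express is assumed to be installed, and the port and upstream URL are placeholders):

// express-repro-sketch.js -- the same outbound fetch, routed through an Express handler.
const express = require('express'); // assumed dependency for this sketch

const app = express();

app.get('/', async (req, res) => {
  const upstream = await fetch('http://localhost:8001'); // Node >= 18 global fetch
  res.send('client: ' + await upstream.text());
});

app.listen(8000, () => console.log('express repro listening on :8000'));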

@rmagrin

rmagrin commented Jun 27, 2024

I'm not sure if it is related to this issue, but the tests in an app that I'm working on started failing with out of memory issues when I upgraded dd-trace from 4.2.0 to 4.41.0. I didn't have time to identify which version introduced the problem.

The app is built with Nest.js (v9) and Express (v4), running on Node v18. The tests use Jest v29.

Unfortunately I was not able to reproduce this outside the app's context, so I haven't been able to create a simple reproduction of the issue.

@tlhunter
Member

Whew, this ended up being an essay, but I hope the information is helpful and is something we can link to in the future.

Background

Here is a pretty straightforward application that I've been using to simulate a high-load application (Node.js v18.20.2). It works by sending millions of requests to a simple Node.js http server, which in turn makes an outbound http request to another server. It's running with the latest version of the tracer. Heap snapshots are generated before and after a run of exactly 1,000,000 requests (though by that point exactly 2,000,003 requests had already been made).

I then compare the before and after dumps of the 1MM requests in Chrome using the heap snapshot comparison feature. In this case I am seeing a ~500KB growth in memory, which is pretty much a rounding error and could possibly be attributed to things like optimizations that V8 makes under the hood. The vanilla application without the tracer grows by ~200KB during this same time.

Thanks for the original heap dump, @viral-barot; however, just a single request often isn't enough to track down a memory leak. For example, the application may be lazy-loading files from disk to serve a request, caches aren't necessarily primed, code isn't optimized, etc. There is constant jitter when comparing two dumps, and if a single object is allocated somewhere it gets lost in the noise. Instead, by tracking exactly 1MM requests, I'm able to look through the heap and keep an eye out for situations where 1 million objects are retained (or clear fractions and multiples of 1 million, like 500k or 2 million).

For example, in the past when a confirmed memory leak occurred and one million requests were made, we would see a million ServerResponse and IncomingMessage objects retained and 2MM DatadogSpan instances. These are super easy to spot in the V8 memory comparison. But in my case I only see a few entries, which could be garbage that hasn't yet been collected or something like that. Here's a screenshot:

(screenshot of the heap snapshot comparison)

So there could be a memory leak in the tracer but my attempt to reproduce it has failed. The tracer has code paths for every integration we support so the leak might only occur when a particular pair of integrations is used. In order for me to properly reproduce the memory leak at this point we'll need either of the following:

  1. Everyone provides me with their package.json dependencies and I can look for similar patterns and attempt to create a repro. This is a slow process and I still might not reproduce the memory leaks that folks are seeing. Not only do the packages matter but also the coding patterns.
  2. Someone in this thread copies and pastes subsets of their application code into the repro files and attempts to reproduce the memory leak on their machine. This is the fastest and most accurate way to get a repro since you know more about your app than I could.

Defining a Memory Leak

Also, I think it's worth defining what a memory leak is for anyone trying to identify or even reproduce one. A Node.js application of course always consumes some memory; maybe a fresh basic process is 40MB. While serving requests it will use more, perhaps 60MB. After requests are complete it will idle at a certain level, maybe 50MB. While serving 10,000 concurrent requests the memory usage might be 200MB, but the process will settle back down to around 50MB. So while the memory usage grows and contracts, none of that, as defined here, is a memory leak.

Any observability tool, including the Datadog tracer, incurs additional overhead. The amount incurred is nearly impossible to measure in general, as it changes depending on the packages used in the application. Even coding patterns and patterns of asynchrony can change the memory usage. However, just to throw out a number, we could pretend that a given basic app has a 20% memory overhead. In that case we might end up with memory usage like this:

Scenario   Fresh   Mid-request   Settled   Concurrent   Settled
Vanilla    40MB    60MB          50MB      200MB        50MB
Traced     48MB    72MB          60MB      240MB        60MB

So while there is memory overhead (in this case 20%), there isn't a memory leak. This can muddy the waters though: a container might be configured with enough memory for a vanilla application during peak usage, or for an idling traced application, but not enough for a traced application during peak usage. For example, a memory limit of 225MB would cause this traced app to be killed during peak usage.

Thus a true memory leak is a program whose memory grows in an unbounded manner tied to some operation happening in the application, such as an HTTP request/response or gRPC messages being received (or, in the case of a cron-style server, a function of time such as setInterval() calls). For this reason I'm performing millions of requests, since a memory leak in a web server usually results in unbounded memory growth associated with the number of requests (inbound or outbound).

Essentially, request / response objects can stick around in memory after the request is complete.
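
To make that concrete, here is a deliberately leaky sketch of that pattern; the retained array is purely illustrative and not something that exists in dd-trace:

// Illustrative only: a server that leaks by retaining every request/response pair.
// After 1MM requests, a heap comparison would show ~1MM IncomingMessage and
// ServerResponse objects that never go away -- the signature of a true leak.
const http = require('http');

const retained = []; // never cleared, so it grows without bound

http.createServer((req, res) => {
  retained.push({ req, res }); // the bug: references outlive the request
  res.end('ok');
}).listen(8000);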

Repro Files

Save these files in a directory and run the following commands; a series of benchmarks, heap dumps, and memory measurements will be taken:

$ npm install
$ chmod +x ./benchmark.sh
$ ./benchmark.sh
  • mem-dd-0.heapsnapshot can be compared with mem-dd-1.heapsnapshot and represents 1MM traced requests
  • mem-no-dd-0.heapsnapshot can be compared with mem-no-dd-1.heapsnapshot and represents 1MM vanilla requests

Here's my benchmarking and heap dump generation code:

app.js

console.log('PID:', process.pid);

const http = require('http');
const URL = require('url');
const fs = require('fs');
const path = require('path');
const v8 = require('v8');

const host = 'localhost';
const port = 8000;

// record the PID so benchmark.sh can `kill` this process later
fs.writeFileSync(
  path.join(__dirname, 'app.pid'),
  String(process.pid)
);

// routes: /gc forces garbage collection (requires --expose-gc), /stats reports RSS,
// /dump writes a heap snapshot, and anything else proxies a request to the faux server
const requestListener = async function (req, res) {
  const url = URL.parse(req.url);
  const params = new URLSearchParams(url.search);
  switch (url.pathname) {
    case '/gc':
      gc();
      res.end('garbage collected');
      break;

    case '/stats':
      const mb = (process.memoryUsage().rss / 1024 / 1024).toFixed(3)+'MB';
      res.end(mb);
      break;

    case '/dump':
      const filename = params.get('filename') ? params.get('filename') + '.heapsnapshot' : undefined
      console.log(`writing snapshot ${filename}...`);
      v8.writeHeapSnapshot(filename);
      console.log(`wrote snapshot ${filename}.`);
      res.end(`wrote snapshot ${filename}`);
      break;

    default:
      const r = await fetch(`http://localhost:8001`);
      const t = await r.text();
      res.end('client: ' + t);
      break;
  }
};

const server = http.createServer(requestListener);

server.listen(port, host, () => {
  console.log(`Server is running on http://${host}:${port}`);
});

benchmark.sh

echo "START COMMON SERVER"
npm run start-server &


# NO DATADOG

echo "START CLIENT WITHOUT DATADOG"
npm run start-no-dd &
sleep 1

echo "PRIME THE CACHE"
npm run req
npm run req
npm run req
sleep 1
npm run gc
sleep 1

echo "GET THE MEMORY USAGE BEFORE BENCHMARK"
npm run mem > mem-no-dd-0.txt
cat mem-no-dd-0.txt

echo "RUN THE FIRST BENCHMARK"
npm run benchmark
sleep 10
npm run gc
sleep 5
npm run gc
sleep 5

echo "GET THE MEMORY USAGE AFTER BENCHMARK"
npm run mem > mem-no-dd-1.txt
cat mem-no-dd-1.txt

echo "RUN THE SECOND BENCHMARK"
curl http://localhost:8000/dump?filename=mem-no-dd-0
npm run benchmark
npm run benchmark
sleep 10
npm run gc
sleep 5
npm run gc
sleep 5
curl http://localhost:8000/dump?filename=mem-no-dd-1

echo "GET THE MEMORY USAGE AFTER SECOND BENCHMARK"
npm run mem > mem-no-dd-2.txt
cat mem-no-dd-2.txt

echo "KILL THE NON-DATADOG CLIENT"
kill `cat app.pid`


# WITH DATADOG

echo "START CLIENT WITHOUT DATADOG"
npm run start-dd &
sleep 1

echo "PRIME THE CACHE"
npm run req
npm run req
npm run req
sleep 1
npm run gc
sleep 1

echo "GET THE MEMORY USAGE BEFORE BENCHMARK"
npm run mem > mem-dd-0.txt
cat mem-dd-0.txt

echo "RUN THE FIRST BENCHMARK"
npm run benchmark
sleep 10
npm run gc
sleep 5
npm run gc
sleep 5

echo "GET THE MEMORY USAGE AFTER BENCHMARK"
npm run mem > mem-dd-1.txt
cat mem-dd-1.txt

echo "RUN THE SECOND BENCHMARK"
curl http://localhost:8000/dump?filename=mem-dd-0
npm run benchmark
npm run benchmark
sleep 10
npm run gc
sleep 5
npm run gc
sleep 5
curl http://localhost:8000/dump?filename=mem-dd-1

echo "GET THE MEMORY USAGE AFTER SECOND BENCHMARK"
npm run mem > mem-dd-2.txt
cat mem-dd-2.txt

echo "KILL THE DATADOG CLIENT"
kill `cat app.pid`

faux-server.js

const http = require('http');

const host = 'localhost';
const port = 8001;

const requestListener = function (req, res) {
  res.end('ok from faux server');
};

const server = http.createServer(requestListener);

server.listen(port, host, () => {
  console.log(`Faux server is running on http://${host}:${port}`);
});

package.json

{
  "name": "gh-4286",
  "version": "1.0.0",
  "description": "memory leak test",
  "main": "app.js",
  "scripts": {
    "start": "echo 'use start-dd or start-no-dd instead' ; exit 1",
    "start-server": "node --expose-gc faux-server.js",
    "start-dd": "node --require='dd-trace/init' --expose-gc app.js",
    "start-no-dd": "node --expose-gc app.js",
    "watch": "watch -n 0.5 ps aux `cat app.pid`",
    "benchmark": "autocannon -a 1000000 http://localhost:8000",
    "req": "curl --silent http://localhost:8000/",
    "gc": "curl --silent http://localhost:8000/gc",
    "mem": "curl --silent http://localhost:8000/stats",
    "dump": "curl --silent http://localhost:8000/dump"
  },
  "author": "Thomas Hunter II <tlhunter@datadog.com>",
  "license": "BSD",
  "dependencies": {
    "dd-trace": "^5.17.0"
  }
}

@rmagrin

rmagrin commented Jun 28, 2024

@tlhunter I totally understand your point that it is impossible to fix this kind of issue if you can't reproduce it. I will try to build a sample app that reproduces the issue I'm having.

BTW, I checked multiple versions and the issue happens when I upgrade from v4.18.0 to v4.19.0.

@mariuszbeltowski

This comment was marked as off-topic.

@tayler-king

This comment was marked as off-topic.

@tlhunter
Member

tlhunter commented Jun 28, 2024

@tlhunter I totally understand your point that it is impossible to fix this kind of issue if you can't reproduce it. I will try to build a sample app that reproduces the issue I'm having.

BTW, I checked multiple versions and the issue happens when I upgrade from v4.18.0 to v4.19.0.

@rmagrin you bisected / installed a bunch of tracer versions and narrowed it down to exactly v4.19.0? That was a big release and it could be related to any of these commits:

ed9d465 2023-11-15 12:50 -0800 Thomas Hunter II               o make 128bit ids default even in v4.x release line (#3800)
55ab6f5 2023-11-16 03:15 +0800 Stephen Belanger               o Otel span name translator (#3766)
4612573 2023-11-15 14:08 -0500 Sam Brenner                    o fix webpack build errors in next.js test (#3798)
a3ca04b 2023-11-15 08:57 -0800 Thomas Hunter II               o core: enable 128-bit ids by default for v5+ (#3656)
2b7f039 2023-11-15 08:56 -0800 Thomas Hunter II               o restify: emit on DC channels w/ async handlers (#3782)
a575228 2023-11-15 14:00 +0100 Carles Capell                  o Update native-iast-taint-tracking to v1.6.4 (#3787)
e018b48 2023-11-14 12:51 -0500 William Conti                  o feat: add DSM pathway hash to Kafka spans (#3763)
25dbe58 2023-11-14 16:06 +0100 Attila Szegedi                 o GC events profiler (#3770)
edf821e 2023-11-14 15:24 +0100 Igor Unanua                    o ASM - Update collected request headers [APPSEC-11226] (#3795)
8b00bfd 2023-11-14 15:11 +0100 Igor Unanua                    o Obfuscate secret tokens (#3786)
abd004c 2023-11-13 10:40 -0500 Bryan English                  o enable arm builds for single-step (#3791)
51ebae3 2023-11-13 16:17 +0100 Attila Szegedi                 o Only consider the active span and its ancestors when looking for web tags (#3780)
7008d43 2023-11-13 16:07 +0100 Attila Szegedi                 o Eagerly release cached reference to web span tags when the span ends (#3781)
2cbdabc 2023-11-13 11:21 +0100 Juan Antonio Fernández de Alba o [ci-visibility] Update git metadata extraction (#3771)
b522fb2 2023-11-10 17:22 +0100 Attila Szegedi                 o Cache web span lookup, so we only perform it once per span (#3779)
6d35a3b 2023-11-09 09:42 +0100 Ugaitz Urien                   o Run nextjs system-test in tracer CI (#3777)
b70bdb3 2023-11-08 18:25 +0100 Attila Szegedi                 o PROF-8545: Restore eager release of tags, adapt endpoint profiling code (#3759)
57b711e 2023-11-08 17:08 +0100 Juan Antonio Fernández de Alba o [ci-visibility] Add flags to force code coverage reporting and test skipping (#3767)
072fa41 2023-11-08 16:58 +0100 simon-id                       o Use exact version for @datadog/native-appsec (#3778)
36836f9 2023-11-08 09:11 -0500 Ida Liu                        o Modified telemetry.enabled to comply with instrumentation telemetry specs (#3765)
e4e3eeb 2023-11-08 10:25 +0100 simon-id                       o Update AppSec rules to 1.9.0 (#3772)
1ad75e9 2023-11-06 01:14 -0800 Thomas Hunter II               o use dc-polyfill instead of diagnostics_channel directly (#3722)
be67462 2023-11-03 12:13 -0400 William Conti                  o add payload size to kafka stats (#3734)

Someone else suspected dc-polyfill could be contributing to the issue, and commit 1ad75e9 above is the one that introduced it. That said, there's no way of telling at this point whether any of the memory leaks reported in this thread are even related.

@tlhunter
Member

FYI, I'm closing any replies unrelated to the http integration as being off-topic.

If anyone else has concerns about potential memory leaks, please create an issue via the helpdesk. There's a link on the GitHub create issue screen that takes you to the helpdesk. Such issues are private and allow for attaching potentially sensitive information such as package.json files which may not be appropriate for a public forum such as GitHub issues.

Using the helpdesk not only prioritizes these issues but also helps prevent off-topic comments from muddying the water. Memory leaks are highly dependent on things like what combination of packages are used, what version of Node.js is used, etc.

@rmagrin

rmagrin commented Jul 1, 2024

@tlhunter I totally understand your point that it is impossible to fix this kind of issue if you can't reproduce it. I will try to build a sample app that reproduces the issue I'm having.
BTW, I checked multiple versions and the issue happens when I upgrade from v4.18.0 to v4.19.0.

@rmagrin you bisected / installed a bunch of tracer versions and narrowed it down to exactly v4.19.0? That was a big release and it could be related to any of these commits:

@tlhunter exactly. I created a branch of my app with one commit for each version of dd-trace between v4.7.0 (the last version I knew didn't have the issue) and v4.41.0, and used git bisect to figure out that v4.19.0 was the one that introduced the issue. Afterwards I double-checked by running the tests with v4.18.0 and v4.19.0 and confirmed that the issue only happens with v4.19.0.

A few more data points that could help here:

  • v5.0.0 and v5.17.0 have the same issue. I didn't test any versions in between;
  • I can't reproduce the issue when running Node.js v20, only on v18.
