
MQTT Memory Leaks on consumer connections #5471

Closed
slice-srinidhis opened this issue May 27, 2024 · 27 comments · Fixed by #5566

Labels
defect Suspected defect such as a bug or regression

Comments

@slice-srinidhis

Observed behavior

We are seeing a consistent memory leak when using NATS with MQTT consumers that only make connections. The pattern appears when we scale the system to 5K connections, even though few messages are produced or consumed. consumer_inactive_threshold: 0.2s is set, but the memory is not released. The only way to clean up is to restart the pods.
(Memory usage screenshots from 2024-04-22 and 2024-04-24 attached.)
From the pprof, createInternalClient appears to account for a large share of the heap (pprof attached for reference: heap_pprof_24-04.pdf).

Kindly look into the issue and let us know if any parameter tweaking would help resolve it.

Expected behavior

Optimal memory management without memory leaks for MQTT.

Server and client version

NATS Server version 2.10.14

Host environment

Kubernetes v1.25

Steps to reproduce

Set up NATS with MQTT; when clients make connections at scale, the leak is reproducible.
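A minimal server configuration sketch for this kind of setup (ports, store path, and the threshold value are illustrative placeholders, not the exact production config):

```conf
# nats-server.conf (sketch)
jetstream {
  store_dir: "/data/jetstream"
}
mqtt {
  port: 1883
  # Threshold used during the tests described in this issue.
  consumer_inactive_threshold: "0.2s"
}
```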

@slice-srinidhis slice-srinidhis added the defect Suspected defect such as a bug or regression label May 27, 2024
@neilalexander
Member

Can you please attach the /debug/pprof/allocs?debug=0 file itself instead of the PDF extract?
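For example, assuming profiling is enabled on the server via prof_port (host and port below are placeholders), the raw profile can be fetched with:

```sh
# Binary allocation profile, same format as the .pb.gz files attached later in this thread.
curl -o allocs.pb.gz "http://<server-host>:65432/debug/pprof/allocs?debug=0"
```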

@levb
Contributor

levb commented May 28, 2024

(apologies, posted a comment that was incorrect, deleted)

@slice-srinidhis
Author

@neilalexander Please find the pprofs attached: 28-05-05-19 was taken during the first iteration and 28-05-07-30 during the second, where the older memory had accumulated. I have also attached the memory graph for reference.
28-05-07-30.pb.gz

28-05-05-19.pb.gz

(Memory graph screenshot from 2024-05-28 attached.)

@levb
Contributor

levb commented May 28, 2024

@slice-srinidhis Do you know the details of the MQTT connections: clean or stored? If stored, can you please provide information about how many subscriptions there are in the sessions? Do you use MQTT retained messages?

@slice-arpitkhatri

@levb

  • We have cleanSession set to true.
  • There is a single subscription per client on topic notifications/{userId} (QoS 2; see the sketch below).
  • We are not using MQTT retained messages.
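For reference, a minimal sketch of such a client using the Eclipse Paho Go library (broker URL, client ID, and user ID are placeholders, not our production code):

```go
package main

import (
	"log"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	// Clean session, one QoS 2 subscription per client.
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://nats-mqtt.example.com:1883"). // placeholder broker URL
		SetClientID("client-42").                      // placeholder client ID
		SetCleanSession(true)

	c := mqtt.NewClient(opts)
	if tok := c.Connect(); tok.Wait() && tok.Error() != nil {
		log.Fatal(tok.Error())
	}

	// Single subscription on notifications/{userId} at QoS 2.
	tok := c.Subscribe("notifications/user-42", 2, func(_ mqtt.Client, m mqtt.Message) {
		log.Printf("received %q on %s", m.Payload(), m.Topic())
	})
	if tok.Wait() && tok.Error() != nil {
		log.Fatal(tok.Error())
	}

	select {} // keep the connection open
}
```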

@derekcollison
Member

Any updates here?

@neilalexander
Member

@slice-arpitkhatri When the memory usage is quite high, can you please also supply the output of /debug/pprof/goroutine?debug=1?

That output should contain account/asset names etc, so if you would rather send privately vs posting here, then please email to neil@nats.io. Thanks!
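For example (same assumption as above that profiling is enabled via prof_port; host and port are placeholders):

```sh
curl -o goroutine.txt "http://<server-host>:65432/debug/pprof/goroutine?debug=1"
```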

@slice-arpitkhatri

@levb @neilalexander: please find the attached files.

pprof.goroutine.004.pb.gz
pprof.goroutine.005.pb.gz
pprof.goroutine.003.pb.gz
pprof.goroutine.002.pb.gz
pprof.goroutine.001.pb.gz

Let us know if you need anything else. Thanks!

@levb
Contributor

levb commented Jun 5, 2024

@slice-arpitkhatri @neilalexander is it possible to get on a zoom call, with access to your cluster, so we could gather more data together?

@slice-arpitkhatri

Yes, sure. Could you please let me know what times work best for you?

@levb
Contributor

levb commented Jun 5, 2024

@slice-arpitkhatri @neilalexander I can do any time tomorrow June 6 after 6AM Pacific, or Friday any time after 5AM PDT.

@levb
Contributor

levb commented Jun 5, 2024

@slice-arpitkhatri Let's do Friday, June 7, any time that works for you. Please keep in mind that @neilalexander is in GMT; I can make it work on my end. Let us know. You can email me at lev@synadia.com to set up the call.

@slice-arpitkhatri

Hi @levb @neilalexander, we have tried out the following suggestions that you guys proposed in the last meeting:

  1. Changed inactive_consumer_threshold to 10s.
  2. Tested with QOS 1 instead of QOS 2.

We've conducted performance tests for both cases and have not observed any meaningful change in memory consumption.

Attaching the memory graphs for the same:

  • Memory graph with inactive_consumer_threshold: 10s (screenshot from 2024-06-10 attached)
  • Memory graph with QoS 1 (screenshot from 2024-06-10 attached)

@levb
Contributor

levb commented Jun 10, 2024

@slice-arpitkhatri What is the motivation for setting a "low" inactivity threshold ("low" relative to the frequency of messages coming through)? Since your clients use clean sessions, the consumers will normally be deleted automatically when the sessions disconnect. If a server cold-restarts and consumers are left undeleted, a considerably longer value (24h?) may be acceptable for the cleanup.

I have read through the history of the config option and the code over the weekend, and I am halfway through testing/investigating what happens to an MQTT session when its consumers go away "under the hood". There is definitely potential for it getting "confused", but I am not through with the code yet.

Please let us know if setting a "long" inactivity threshold helps avoid (or slow down) the leak.
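Roughly, that would mean something like the following in the server's mqtt block (a sketch; the value reflects the suggestion above):

```conf
mqtt {
  consumer_inactive_threshold: "24h"
}
```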

@slice-arpitkhatri

Hi @levb, we have changed the inactive_consumer_threshold to 24 hours. I have attached the memory graph after making this change. However, we are still facing the memory leak issue.

(Memory graph screenshot from 2024-06-11 attached.)

@slice-arpitkhatri

We've basically run our tests with different values of inactive_consumer_threshold (0.2s, 10s, 24hrs). We've also run tests after removing the inactive_consumer_threshold from the config. However, we have not observed any changes in memory consumption.

@neilalexander
Member

Thanks for confirming, Arpit. Can you please provide updated memory profiles from a period when memory usage is high? Thanks!

@derekcollison
Member

@slice-arpitkhatri which mqtt library are you using and could you provide us a small sample mqtt app that shows the behavior? At this point we would want to have a sample app and watch its interactions with the NATS system to help us track down any issues.

Thanks.

@slice-arpitkhatri

@derekcollison In production, we are using HiveMQ Library for Android and CocoaMQTT Library for iOS.
For performance testing purposes, we are using Paho MQTT in Golang. I have attached a sample app which we are using for performance testing.

perfConsumer.go.zip

++ @neilalexander @levb

@derekcollison
Member

And you can see the issue using the Go client, correct?

@slice-arpitkhatri

We're encountering it regardless of the client, both in production (HiveMQ Kotlin & CocoaMQTT Swift) and during performance testing (Paho MQTT in Golang).

During performance testing, we are only using Paho MQTT in Golang (sample app shared above).

@slice-arpitkhatri

And you can see the issue using the Go client, correct?

To answer your question clearly, yes, we can see the issue using the Go client.

@derekcollison
Member

Thanks, and how is your performance testing conducted?

@slice-arpitkhatri

We have 15 pods running, each establishing ~340 connections (5k connections in total, all non-durable, with random client IDs). They subscribe to "topic/{i}" where 0 < i < 340. These connections are terminated after one minute, at which point the sample app mentioned above spins up another set of 5k connections with different client IDs.

We're implementing this to simulate the production traffic pattern. Locally, we're running a producer script which publishes messages at 10 TPS to "topic/{i}", where i is a random integer between 0 and 340. Note that this is not 10 TPS per topic; it's 10 TPS collectively, essentially 10 TPS at the broker.

Please let me know in case you have any further queries. Thanks.
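Not the attached perfConsumer.go itself, but a rough sketch of the connection-churn pattern described above (broker URL and counts are placeholders):

```go
package main

import (
	"fmt"
	"log"
	"math/rand"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

const (
	broker      = "tcp://nats-mqtt.example.com:1883" // placeholder
	connsPerPod = 340
)

func main() {
	for round := 0; ; round++ {
		clients := make([]mqtt.Client, 0, connsPerPod)
		for i := 0; i < connsPerPod; i++ {
			opts := mqtt.NewClientOptions().
				AddBroker(broker).
				SetClientID(fmt.Sprintf("perf-%d-%d-%d", round, i, rand.Int63())). // random client ID
				SetCleanSession(true)
			c := mqtt.NewClient(opts)
			if tok := c.Connect(); tok.Wait() && tok.Error() != nil {
				log.Printf("connect %d: %v", i, tok.Error())
				continue
			}
			topic := fmt.Sprintf("topic/%d", i)
			if tok := c.Subscribe(topic, 2, func(mqtt.Client, mqtt.Message) {}); tok.Wait() && tok.Error() != nil {
				log.Printf("subscribe %s: %v", topic, tok.Error())
			}
			clients = append(clients, c)
		}
		// Hold the connections for a minute, then drop them all and start a new round
		// with fresh client IDs, mimicking the production churn.
		time.Sleep(time.Minute)
		for _, c := range clients {
			c.Disconnect(250)
		}
	}
}
```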

@derekcollison
Member

Thanks for the information, much appreciated.

@derekcollison
Member

@slice-srinidhis Thank you for your patience. We finally tracked it down and fixed it. The fix is on main and will be part of the 2.10.17 release.

derekcollison pushed a commit that referenced this issue Jun 21, 2024

MQTT s.clear(): do not wait for JS responses when disconnecting the session (#5575)

Related to #5471

Previously we were making `jsa.NewRequest` as it is needed when
connecting a clean session. On disconnect, there is no reason to wait
for the response (and tie up the MQTT read loop of the client).

This should specifically help situations when a client app with many
MQTT connections and QOS subscriptions disconnects suddenly, causing a
flood of JSAPI deleteConsumer requests.

Test: n/a, not sure how to instrument for it.
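Not the actual server change, just an illustration of the difference using the nats.go client: a blocking request that waits for the JetStream API reply versus a fire-and-forget publish of the same API message (stream and consumer names are placeholders):

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Placeholder stream/consumer names, for illustration only.
	subj := fmt.Sprintf("$JS.API.CONSUMER.DELETE.%s.%s", "MY_STREAM", "MY_CONSUMER")

	// Request/reply: the caller blocks until the JS API responds (or times out).
	if _, err := nc.Request(subj, nil, 2*time.Second); err != nil {
		log.Printf("request: %v", err)
	}

	// Fire-and-forget: publish the same API message without waiting for a reply,
	// so the caller (the MQTT read loop, in the server's case) is not tied up.
	if err := nc.Publish(subj, nil); err != nil {
		log.Printf("publish: %v", err)
	}
	if err := nc.Flush(); err != nil {
		log.Printf("flush: %v", err)
	}
}
```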
neilalexander pushed a commit that referenced this issue Jun 21, 2024

MQTT s.clear(): do not wait for JS responses when disconnecting the session (#5575)
@slice-srinidhis
Author

Thanks for the fix @derekcollison @neilalexander. We have deployed the latest release in production and can see nats_core_mem_bytes releasing memory and no longer growing (Graph 1). However, the container/pod memory keeps growing and is not released back to the system, although the growth is not as rapid as before (Graph 2). Can you help us with any parameter that can be tuned so the pod doesn't go OOM? We currently have GOGC set to 50.
(Graph 1: NATS memory; Graph 2: pod memory. Screenshots attached.)
