
MQTT Memory Leaks on consumer connections #5471

Closed
slice-srinidhis opened this issue May 27, 2024 · 27 comments · Fixed by #5566

Labels
defect Suspected defect such as a bug or regression

Comments

@slice-srinidhis

Observed behavior

We are seeing a consistent memory leak when using NATS with MQTT consumers that only make connections. The pattern appears when we scale the system to 5K connections, even though few messages are produced or consumed. consumer_inactive_threshold: 0.2s is set, but the memory is not released. The only way to clean up is to restart the pods.
(Memory usage screenshots from 2024-04-22 and 2024-04-24 attached.)
From the pprof, createInternalClient appears to account for a large share of the heap (pprof attached for reference: heap_pprof_24-04.pdf).

Kindly look into the issue and let us know if any parameter tweaking would help resolve it.

Expected behavior

Optimal memory management without memory leaks for MQTT.

Server and client version

NATS Server version 2.10.14

Host environment

Kubernetes v1.25

Steps to reproduce

Set up NATS with MQTT; when clients make connections at scale, the leak is reproducible.
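A minimal server configuration sketch for this kind of setup (ports, store path, and the threshold value are illustrative placeholders, not the exact production config):

```conf
# nats-server.conf (sketch)
jetstream {
  store_dir: "/data/jetstream"
}
mqtt {
  port: 1883
  # Threshold used during the tests described in this issue.
  consumer_inactive_threshold: "0.2s"
}
```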

@slice-srinidhis slice-srinidhis added the defect Suspected defect such as a bug or regression label May 27, 2024
@neilalexander
Member

Can you please attach the /debug/pprof/allocs?debug=0 file itself instead of the PDF extract?
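For example, assuming profiling is enabled on the server via prof_port (host and port below are placeholders), the raw profile can be fetched with:

```sh
# Binary allocation profile, same format as the .pb.gz files attached later in this thread.
curl -o allocs.pb.gz "http://<server-host>:65432/debug/pprof/allocs?debug=0"
```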

@levb
Contributor

levb commented May 28, 2024

(apologies, posted a comment that was incorrect, deleted)

@slice-srinidhis
Author

@neilalexander Please find the pprofs attached: 28-05-05-19 was taken during the first iteration and 28-05-07-30 during the second, where the older memory had accumulated. I have also attached the memory graph for reference.
28-05-07-30.pb.gz

28-05-05-19.pb.gz

(Memory graph screenshot from 2024-05-28 attached.)

@levb
Contributor

levb commented May 28, 2024

@slice-srinidhis Do you know the details of the MQTT connections: clean or stored? If stored, can you please provide information about how many subscriptions there are in the sessions? Do you use MQTT retained messages?

@slice-arpitkhatri

@levb

  • We have cleanSession set to true.
  • There is a single subscription per client on topic notifications/{userId} (QoS 2; see the sketch below).
  • We are not using MQTT retained messages.
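For reference, a minimal sketch of such a client using the Eclipse Paho Go library (broker URL, client ID, and user ID are placeholders, not our production code):

```go
package main

import (
	"log"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	// Clean session, one QoS 2 subscription per client.
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://nats-mqtt.example.com:1883"). // placeholder broker URL
		SetClientID("client-42").                      // placeholder client ID
		SetCleanSession(true)

	c := mqtt.NewClient(opts)
	if tok := c.Connect(); tok.Wait() && tok.Error() != nil {
		log.Fatal(tok.Error())
	}

	// Single subscription on notifications/{userId} at QoS 2.
	tok := c.Subscribe("notifications/user-42", 2, func(_ mqtt.Client, m mqtt.Message) {
		log.Printf("received %q on %s", m.Payload(), m.Topic())
	})
	if tok.Wait() && tok.Error() != nil {
		log.Fatal(tok.Error())
	}

	select {} // keep the connection open
}
```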

@derekcollison
Member

Any updates here?

@neilalexander
Member

@slice-arpitkhatri When the memory usage is quite high, can you please also supply the output of /debug/pprof/goroutine?debug=1?

That output should contain account/asset names etc, so if you would rather send privately vs posting here, then please email to neil@nats.io. Thanks!
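For example (same assumption as above that profiling is enabled via prof_port; host and port are placeholders):

```sh
curl -o goroutine.txt "http://<server-host>:65432/debug/pprof/goroutine?debug=1"
```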

@slice-arpitkhatri

@levb @neilalexander: please find the attached files.

pprof.goroutine.004.pb.gz
pprof.goroutine.005.pb.gz
pprof.goroutine.003.pb.gz
pprof.goroutine.002.pb.gz
pprof.goroutine.001.pb.gz

Let us know if you need anything else. Thanks!

@levb
Contributor

levb commented Jun 5, 2024

@slice-arpitkhatri @neilalexander is it possible to get on a zoom call, with access to your cluster, so we could gather more data together?

@slice-arpitkhatri

Yes, sure. Could you please let me know what times work best for you?

@levb
Contributor

levb commented Jun 5, 2024

@slice-arpitkhatri @neilalexander I can do any time tomorrow June 6 after 6AM Pacific, or Friday any time after 5AM PDT.

@levb
Contributor

levb commented Jun 5, 2024

@slice-arpitkhatri Let's do Friday, June 7, any time that works for you. Please keep in mind that @neilalexander is in GMT; I can make it work on my end. Let us know. You can email me at lev@synadia.com to set up the call.

@slice-arpitkhatri

Hi @levb @neilalexander, we have tried out the following suggestions that you guys proposed in the last meeting:

  1. Changed inactive_consumer_threshold to 10s.
  2. Tested with QOS 1 instead of QOS 2.

We've conducted performance tests for both cases and have not observed any meaningful change in memory consumption.

Attaching the memory graphs for the same:

  • Memory graph with inactive_consumer_threshold: 10s (screenshot from 2024-06-10 attached)
  • Memory graph with QoS 1 (screenshot from 2024-06-10 attached)

@levb
Contributor

levb commented Jun 10, 2024

@slice-arpitkhatri What is the motivation for setting a "low" inactivity threshold ("low" relative to the frequency of messages coming through)? Since your clients use clean sessions, the consumers will normally be deleted automatically when the sessions disconnect. If a server cold-restarts and consumers are left undeleted, a considerably longer value (24h?) may be acceptable for the cleanup.

I have read through the history of the config option and the code over the weekend, and I am halfway through testing/investigating what happens to an MQTT session when its consumers go away "under the hood". There is definitely potential for it getting "confused", but I am not through with the code yet.

Please let us know if setting a "long" inactivity threshold helps avoid (or slow down) the leak.
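Roughly, that would mean something like the following in the server's mqtt block (a sketch; the value reflects the suggestion above):

```conf
mqtt {
  consumer_inactive_threshold: "24h"
}
```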

@slice-arpitkhatri

Hi @levb, we have changed the inactive_consumer_threshold to 24 hours. I have attached the memory graph after making this change. However, we are still facing the memory leak issue.

(Memory graph screenshot from 2024-06-11 attached.)

@slice-arpitkhatri

We've basically run our tests with different values of inactive_consumer_threshold (0.2s, 10s, 24hrs). We've also run tests after removing the inactive_consumer_threshold from the config. However, we have not observed any changes in memory consumption.

@neilalexander
Member

Thanks for confirming, Arpit. Can you please provide updated memory profiles from a period when memory usage is high? Thanks!

@derekcollison
Member

@slice-arpitkhatri which mqtt library are you using and could you provide us a small sample mqtt app that shows the behavior? At this point we would want to have a sample app and watch its interactions with the NATS system to help us track down any issues.

Thanks.

@slice-arpitkhatri

@derekcollison In production, we are using HiveMQ Library for Android and CocoaMQTT Library for iOS.
For performance testing purposes, we are using Paho MQTT in Golang. I have attached a sample app which we are using for performance testing.

perfConsumer.go.zip

++ @neilalexander @levb

@derekcollison
Member

And you can see the issue using the Go client, correct?

@slice-arpitkhatri

We're encountering it regardless of the client, both in production (HiveMQ Kotlin & CocoaMQTT Swift) and during performance testing (Paho MQTT in Golang).

During performance testing, we are only using Paho MQTT in Golang (sample app shared above).

@slice-arpitkhatri

And you can see the issue using the Go client, correct?

To answer your question clearly, yes, we can see the issue using the Go client.

@derekcollison
Member

Thanks, and how is your performance testing conducted?

@slice-arpitkhatri

We have 15 pods running, each establishing ~340 connections (5k connections in total, all non-durable, with random client IDs). They subscribe to "topic/{i}" where 0 < i < 340. These connections are terminated after one minute, at which point the sample app mentioned above spins up another set of 5k connections with different client IDs.

We're implementing this to simulate the production traffic pattern. Locally, we're running a producer script which publishes messages at 10 TPS to "topic/{i}", where i is a random integer between 0 and 340. Note that this is not 10 TPS per topic; it's 10 TPS collectively, essentially 10 TPS at the broker.

Please let me know in case you have any further queries. Thanks.
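Not the attached perfConsumer.go itself, but a rough sketch of the connection-churn pattern described above (broker URL and counts are placeholders):

```go
package main

import (
	"fmt"
	"log"
	"math/rand"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

const (
	broker      = "tcp://nats-mqtt.example.com:1883" // placeholder
	connsPerPod = 340
)

func main() {
	for round := 0; ; round++ {
		clients := make([]mqtt.Client, 0, connsPerPod)
		for i := 0; i < connsPerPod; i++ {
			opts := mqtt.NewClientOptions().
				AddBroker(broker).
				SetClientID(fmt.Sprintf("perf-%d-%d-%d", round, i, rand.Int63())). // random client ID
				SetCleanSession(true)
			c := mqtt.NewClient(opts)
			if tok := c.Connect(); tok.Wait() && tok.Error() != nil {
				log.Printf("connect %d: %v", i, tok.Error())
				continue
			}
			topic := fmt.Sprintf("topic/%d", i)
			if tok := c.Subscribe(topic, 2, func(mqtt.Client, mqtt.Message) {}); tok.Wait() && tok.Error() != nil {
				log.Printf("subscribe %s: %v", topic, tok.Error())
			}
			clients = append(clients, c)
		}
		// Hold the connections for a minute, then drop them all and start a new round
		// with fresh client IDs, mimicking the production churn.
		time.Sleep(time.Minute)
		for _, c := range clients {
			c.Disconnect(250)
		}
	}
}
```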

@derekcollison
Member

Thanks for the information, much appreciated.

@derekcollison
Member

@slice-srinidhis Thank you for your patience. We finally tracked it down and fixed it. The fix is on main and will be part of the 2.10.17 release.

derekcollison pushed a commit that referenced this issue Jun 21, 2024

MQTT s.clear(): do not wait for JS responses when disconnecting the session (#5575)

Related to #5471

Previously we were making `jsa.NewRequest` as it is needed when
connecting a clean session. On disconnect, there is no reason to wait
for the response (and tie up the MQTT read loop of the client).

This should specifically help situations when a client app with many
MQTT connections and QOS subscriptions disconnects suddenly, causing a
flood of JSAPI deleteConsumer requests.

Test: n/a, not sure how to instrument for it.
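Not the actual server change, just an illustration of the difference using the nats.go client: a blocking request that waits for the JetStream API reply versus a fire-and-forget publish of the same API message (stream and consumer names are placeholders):

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Placeholder stream/consumer names, for illustration only.
	subj := fmt.Sprintf("$JS.API.CONSUMER.DELETE.%s.%s", "MY_STREAM", "MY_CONSUMER")

	// Request/reply: the caller blocks until the JS API responds (or times out).
	if _, err := nc.Request(subj, nil, 2*time.Second); err != nil {
		log.Printf("request: %v", err)
	}

	// Fire-and-forget: publish the same API message without waiting for a reply,
	// so the caller (the MQTT read loop, in the server's case) is not tied up.
	if err := nc.Publish(subj, nil); err != nil {
		log.Printf("publish: %v", err)
	}
	if err := nc.Flush(); err != nil {
		log.Printf("flush: %v", err)
	}
}
```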
neilalexander pushed a commit that referenced this issue Jun 21, 2024

MQTT s.clear(): do not wait for JS responses when disconnecting the session (#5575)
@slice-srinidhis
Author

Thanks for the fix @derekcollison @neilalexander. We have deployed the latest release in production and can see nats_core_mem_bytes releasing memory and no longer growing (Graph 1). However, the container/pod memory keeps growing and is not released back to the system, although the growth is not as rapid as before (Graph 2). Can you help us with any parameter that can be tuned so the pod doesn't go OOM? We currently have GOGC set to 50.
(Graph 1: NATS memory; Graph 2: pod memory. Screenshots attached.)
