
Memory leak in v3.4.4 for cats.effect.CallbackStack #3359

Closed
floating-morality opened this issue Jan 13, 2023 · 14 comments

@floating-morality commented Jan 13, 2023

CallbackStack leaks badly in v3.4.4.

[screenshot]

After downgrading to v3.4.2 the issue disappeared.

[screenshot]

@djspiewak (Member)

Very concerning.

Do you have any more details on what reproduces this? A heap dump would be the most helpful thing, or some subset of code. In lieu of that, are you doing anything interesting with Deferred?

@durban (Contributor) commented Jan 13, 2023

Are there long-lived Deferreds which are never completed, but whose get is repeatedly run and cancelled? (If I remember correctly, when a get is cancelled, the linked-list node itself is not removed; I guess this could cause problems.)
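A minimal sketch of that suspected pattern (hypothetical, not taken from the reporter's app): a Deferred that is never completed, whose get is repeatedly raced and therefore cancelled.

```scala
import cats.effect.{IO, IOApp}
import scala.concurrent.duration._

object CancelledGets extends IOApp.Simple {
  def run: IO[Unit] =
    IO.deferred[Unit].flatMap { d =>
      // `d` is never completed; every race cancels the losing `get`,
      // which (per the discussion here) leaves its CallbackStack node behind
      IO.race(d.get, IO.sleep(1.millis)).foreverM
    }
}
```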

@djspiewak (Member)

Yeah this is absolutely true (regarding the linked structure not being removed). Basically it trades off some memory in the cancel case because it's assumed that the "long lived, lots of cancels, never completed" case is very rare.
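Roughly, the tradeoff looks like this (a simplified sketch, not the actual CallbackStack implementation): cancellation just clears the callback slot and leaves the node linked.

```scala
// Simplified sketch: cancelling a `get` merely nulls the callback slot in
// O(1); the node itself stays linked until the Deferred completes, which is
// exactly the memory given up in the cancel case.
final class CallbackNode[A](var callback: A => Unit, val next: CallbackNode[A]) {
  def clear(): Unit = callback = null // cheap cancel, but the node remains
}
```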

@TimWSpence (Member)

> Yeah this is absolutely true (regarding the linked structure not being removed). Basically it trades off some memory in the cancel case because it's assumed that the "long lived, lots of cancels, never completed" case is very rare.

Sadly I think Ember would like a word with you there. Connecting via JMX I can see the number of CallbackStacks monotonically increasing there.

From a very quick glance at the Ember server code, I would guess that shutdown.signal is a problem at least, as it's a global Deferred that is getted by every connection-handling fiber.

@TimWSpence (Member)

> Sadly I think Ember would like a word with you there. Connecting via JMX I can see the number of CallbackStacks monotonically increasing there.

Meant to say that I was hitting that hello world server in a tight loop. From what I can see, we leak 4 CallbackStacks per (successful) request.
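As a hedged sketch of the shape described above (hypothetical names, not Ember's actual code): every connection fiber races the global shutdown signal against its handler, so each request that completes normally cancels one get.

```scala
import cats.effect.{Deferred, IO}

// Each connection races the global shutdown signal against its handler;
// every request that finishes first cancels one in-flight `shutdown.get`.
def handleConnection(shutdown: Deferred[IO, Unit], handler: IO[Unit]): IO[Unit] =
  IO.race(shutdown.get, handler).void
```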

@djspiewak (Member)

This is VERY awesome @TimWSpence. Can you get a heap dump so we can be sure about where the ownership is on these objects?

@vasilmkd (Member)

@TimWSpence Did the leak in Ember start with the last couple of releases, or was it there from before?

@TimWSpence (Member) commented Jan 13, 2023

> This is VERY awesome @TimWSpence. Can you get a heap dump so we can be sure about where the ownership is on these objects?

Sure! Here's one after a couple of minutes of hammering it in a tight loop with curl. I haven't had time to dig through it properly yet, but I definitely saw the Ember shutdown in there. Some fs2 stuff as well - not sure if that's because of the interruptWhen in Ember or if there are separate instances internal to fs2.

dump.zip

@TimWSpence (Member)

> @TimWSpence Did the leak in Ember start with the last couple of releases, or was it there from before?

I only observed it today, so I'm afraid I can't be sure. I don't have time now, but next week I can downgrade the http4s version for that reproduction and see if it still leaks. From what I've seen I believe it will, but it's obviously worth confirming!

@durban (Contributor) commented Jan 14, 2023

Somewhat simpler reproduction with FS2:

```scala
import cats.effect.{ExitCode, IO}
import cats.effect.IOApp

import fs2.Stream

object Example extends IOApp {

  final override def run(args: List[String]): IO[ExitCode] = {
    IO.deferred[Either[Throwable, Unit]].flatMap { p =>
      val s = Stream.eval(IO { () }).repeat.interruptWhen(p)
      s.compile.drain.as(ExitCode.Success)
    }
  }
}
```

This is with FS2 3.4.0; it does not leak with Cats Effect 3.4.2, but does leak with 3.4.4. There is an IODeferred with a lot of CallbackStacks (with null callbacks). The IODeferred is held by 2 fs2.internal.InterruptContext instances. I suspect FS2 does exactly what we discussed: repeatedly cancelling a get on a deferred. Maybe here: https://github.com/typelevel/fs2/blob/v3.4.0/core/shared/src/main/scala/fs2/internal/InterruptContext.scala#L91 ?
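A hedged sketch of that mechanism (not FS2's actual code): interruptWhen effectively races the stream's work against p.get, and every race the work wins cancels an in-flight get.

```scala
import cats.effect.{Deferred, IO}

// Each "step" races the interruption signal against the work; when the work
// wins, the losing `p.get` is cancelled, leaving a cleared CallbackStack node.
def step[A](p: Deferred[IO, Either[Throwable, Unit]], work: IO[A]): IO[Option[A]] =
  IO.race(p.get, work).map {
    case Left(_)  => None    // interrupted
    case Right(a) => Some(a) // work finished first; the `get` was cancelled
  }
```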

@djspiewak (Member)

Yeah, I think that's exactly what's happening. I'm pondering whether there's a way to adjust CallbackStack to be more self-cleaning, rather than leaving the remnants of the structure in place.
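One possible shape of such self-cleaning, as a sketch (hypothetical, not whatever Cats Effect ends up implementing): walk the list and unlink nodes whose callback slot was nulled by a cancelled get.

```scala
object PackSketch {
  final class Node[A](var callback: A => Unit, var next: Node[A])

  // Single-threaded sketch; the real structure is concurrent, which is
  // what makes unlinking hard in the first place.
  def pack[A](head: Node[A]): Node[A] = {
    var h = head
    while ((h ne null) && (h.callback eq null)) h = h.next // drop cleared heads
    var cur = h
    while (cur ne null) {
      var n = cur.next
      while ((n ne null) && (n.callback eq null)) n = n.next // skip cleared nodes
      cur.next = n                                           // unlink them
      cur = n
    }
    h
  }
}
```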

@mernst-github (Contributor) commented Jan 8, 2024

Observing this issue after upgrading an app to CE 3.5.2/fs2 3.9.3 (from a somewhat weird version mix: cats-effect: 3.3.14-6-d8a0441, cats-effect-std/kernel:3.5.1, fs2: 3.3.0):

```
$ jmap -histo:live 40
 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:     250914017     6021936408  cats.effect.CallbackStack
   2:        211480       69001416  [B (java.base@21.0.1)
...
```

I can confirm that the reproducer above does not exhibit the leak with this combination.

Seeing a very large stack instance, xxx k deep:

[screenshot]

Currently speculating that this is related to parEvalMapUnordered on an infinite fs2.Stream (which is specific to that app and is implemented using Deferred).

@mernst-github (Contributor) commented Jan 8, 2024

Reproducer:

```scala
import cats.effect.IO
import cats.effect.unsafe.implicits.global

object Reproducer extends App {
  fs2.Stream[IO, Unit](()) // single-element stream, repeated forever
    .repeat
    .parEvalMap(2)(IO(_))  // concurrency of 2 is enough to trigger it
    .compile
    .drain
    .unsafeRunSync()
}
```

If you run this with a concurrent `while sleep 1; do jmap -histo:live $PID | grep cats.effect.CallbackStack\$; done`, you see the number of cats.effect.CallbackStack instances grow in a zig-zag pattern to increasingly large (~2^n?) values. This points towards the #pack logic.
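For intuition on why a backoff-style packing heuristic would produce exactly that shape, here is a toy simulation (entirely hypothetical, not CE's actual #pack logic): cleared nodes accumulate until a threshold, a pack pass removes them, and the threshold doubles each time.

```scala
object ZigZagSketch extends App {
  var live = 0      // cleared-but-still-linked nodes
  var threshold = 1 // pack after this many, then back off by doubling
  for (_ <- 1 to 1000) {
    live += 1
    if (live >= threshold) {
      println(s"pack at $live live nodes") // peaks at 1, 2, 4, 8, ... (~2^n)
      live = 0
      threshold *= 2
    }
  }
}
```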

@armanbilge (Member)

Thanks for reporting, I opened #3935.
