
KSP 1.9.21-1.0.15 leaking memory and causing OOMs #1653

Closed

Bencodes opened this issue Dec 11, 2023 · 31 comments

@Bencodes
Contributor

Bencodes commented Dec 11, 2023

KSP 1.9.21-1.0.15 has started leaking memory and causing OOM issues when used in a persistent worker environment. I suspect it's related to 85f9cac, where instances are being placed into a global static list.

@neetopia
Contributor

That commit essentially reverts 6bf8193, which just reverts back to the old behavior for KSP1, so it should not be causing issues. Do you have a setup that I can use to reproduce your OOM issue?

@Bencodes
Contributor Author

We are able to repro this in a build that uses persistent workers via https://github.com/bazelbuild/rules_kotlin. Bazel persistent workers are just standalone JVM processes that accept and process work requests.
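
To make the failure mode concrete, here is a rough, self-contained sketch of what happens in a worker-style setup (the names below are illustrative stand-ins, not KSP's or rules_kotlin's actual classes):

    // Illustrative sketch only: a persistent worker is a long-lived JVM loop, so anything
    // parked in a global/static cache by one compile action survives into the next action
    // unless the processor (or KSP itself) clears it at the end of the round.
    object GlobalSymbolCache { // stand-in for a JVM-wide static cache
        val retained = mutableListOf<ByteArray>()
    }

    fun handleWorkRequest(id: Int) {
        // Each "compile action" adds to the cache but never removes anything.
        GlobalSymbolCache.retained.add(ByteArray(10 * 1024 * 1024)) // ~10 MB pinned per request
        println("request $id done, cached entries: ${GlobalSymbolCache.retained.size}")
    }

    fun main() {
        // The same JVM serves many requests back to back, like a Bazel persistent worker,
        // so without GlobalSymbolCache.retained.clear() between requests the heap only grows.
        repeat(100) { handleWorkRequest(it) }
    }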

As a quick test I added this code to the SymbolProcessor#finish call in one of our KSP processors, which seems to eliminate the issue:

    // Look up KSP's global object cache manager via reflection (it isn't reachable through
    // a public API here) and invoke clear() on its companion object.
    val clazz = Class.forName("com.google.devtools.ksp.KSObjectCacheManager")
    val companion = clazz.getDeclaredField("Companion") // Kotlin exposes the companion as a static field
    val instance = companion.get(null)
    val method = instance.javaClass.getDeclaredMethod("clear")
    method.invoke(instance) // drops everything cached for the round

@Bencodes
Contributor Author

Bencodes commented Dec 11, 2023

Is there any reason not to tear down the cache along with the Resolver, like what's being done here: e518b81c?

Ex: https://github.com/google/ksp/compare/main...Bencodes:ksp:clear-ksobjectcachemanager-during-resolver-tear-down?expand=1
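
Conceptually the branch above boils down to something like the following (a sketch with stand-in types; the real tear-down hook lives inside KSP's compiler plugin and may differ):

    // Sketch of the idea only: clear the global object cache at the same point where the
    // per-round Resolver is released, so cached nodes cannot outlive the round in a
    // long-lived JVM. ObjectCacheStandIn and ResolverStandIn are not KSP's real classes.
    object ObjectCacheStandIn {
        private val cache = HashMap<Any, Any>()
        fun put(key: Any, value: Any) { cache[key] = value }
        fun clear() = cache.clear()
    }

    class ResolverStandIn {
        fun tearDown() {
            // Dropping the resolver reference alone is not enough: the static cache still
            // pins everything it references, so it has to be cleared at the same time.
            ObjectCacheStandIn.clear()
        }
    }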

@neetopia
Contributor

The reason for tearing down the ResolverImpl was to allow GC to recycle ResolverImpl instances from previous rounds, and the cache clear is already happening every round here. Does manually clearing the cache help with your use case? If so, it might be a cache-clearing issue in an error-state condition. Still, I don't see how this issue is related to 85f9cac.

@Bencodes
Contributor Author

Manually clearing the cache does seem to help. I've run our benchmark suite a few times against our entire repo and the numbers look consistent with Kotlin 1.9.20 + KSP 1.9.20-1.0.14.

Happy to test any other theories if you can think of anything else that might be causing this.

@neetopia
Contributor

Interesting. I am wondering why it was not present until 1.9.21-1.0.15; let me try to figure it out before proceeding with a fix.

@Bencodes
Contributor Author

Bencodes commented Dec 11, 2023

Dug a bit deeper just to triple-check the memory-leak theory using jmap -dump:format=b,file=out.hprof PID, looking at 3 cases:

Kotlin 1.9.20 + KSP 1.9.20-1.0.14 building all apps

Observed memory stays around 1.5gb - 2.5gb for most of the build and never seems to exceed 3gb. I managed to grab a dump that's about 2.5gb in size.

Kotlin 1.9.21 + KSP 1.9.21-1.0.15 building all apps

Observed memory pretty quickly grows to more than 17gb. Running jmap clears a lot of this memory out, resulting in a dump between 3.5gb and 4.5gb, but the memory climbs back up to 17gb pretty quickly.

The KSP worker also eventually locks up during the build once it hits the memory cap and has nothing else to consume.

Kotlin 1.9.21 + KSP 1.9.21-1.0.15 + reflection hack to wipe the memory

Observed memory consumption remains pretty low, hovering around 3.5 - 4.5gb. Running jmap also clears out some of this memory, resulting in a dump of about 3gb. The observed memory as the build continues remains around 3.5 - 4.5gb.
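
For anyone who wants to watch this from inside the worker process instead of taking full heap dumps with jmap, a small sketch using the standard java.lang.management API (plain JDK tooling, nothing KSP-specific):

    import java.lang.management.ManagementFactory

    // Logs current heap usage; calling this at the end of each work request makes the steady
    // growth (or the effect of clearing the cache) visible without a multi-gigabyte dump.
    fun logHeapUsage(tag: String) {
        val heap = ManagementFactory.getMemoryMXBean().heapMemoryUsage
        val usedMb = heap.used / (1024 * 1024)
        val maxMb = heap.max / (1024 * 1024)
        println("[$tag] heap used: $usedMb MB / max: $maxMb MB")
    }

    fun main() {
        logHeapUsage("demo")
    }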

@jbarr21

jbarr21 commented Dec 12, 2023

This also causes OOMs for our developers. We are using the Buck build system, which has a long-lived Buck daemon process.

@neetopia
Contributor

Can this issue be observed with any 1.0.13 KSP versions?

@jbarr21

jbarr21 commented Dec 13, 2023

Can this issue be observed with any 1.0.13 KSP versions?

@neetopia We see no issues on 1.0.13 & 1.0.14

@neetopia
Contributor

@Bencodes would you be ok if I proceed with your fix for now to unblock impacted users?

@Bencodes
Contributor Author

@neetopia I'm fine with that.

@Bencodes
Contributor Author

@Bencodes would you be ok if I proceed with your fix for now to unblock impacted users?

I can also try to throw together a demo project that demonstrates the issue if you want something to debug against, and I can provide heap dumps if that's useful.

@neetopia
Contributor

That would help. For now I am OK with proceeding with your fix; the cause should be a pointer that isn't released, causing GC to fail to recycle the cache. We will revisit it later.

@neetopia
Contributor

Please try 1.9.21-1.0.16 for the fix. Thanks for the help with the issue investigation.

@jbarr21

jbarr21 commented Dec 14, 2023

@neetopia We are no longer seeing OOMs/GC thrash in our Bazel builds on 1.0.16, but our Buck builds are still leaking and failing after 2x build time due to GC of leaks in our long-lived Buck daemon process. I'm investigating further to see why.

@neetopia
Contributor

@jbarr21 Thanks for checking. Can you file another issue to track this, since it looks like a different issue to me?

@JoelWilcox

Just ran into this while updating Kotlin to 1.9.21 and KSP to 1.9.21-1.0.15 in Anvil. I haven't had a chance to dig into it yet, but I've also given the new 1.9.21-1.0.16 release a try, which still seems to result in OOMs for us. Also, we are using Gradle for our builds.

Related PR: https://github.com/square/anvil/pull/814/files
And build: https://github.com/square/anvil/actions/runs/7215286211/job/19659225800?pr=814

@jbarr21

jbarr21 commented Dec 14, 2023

Raised #1656

@Bencodes
Contributor Author

Just tested 1.9.21-1.0.16 and it's still causing OOMs for us.

@neetopia
Contributor

@Bencodes is the reflection hack you mentioned the same as the commit in your PR?

@Bencodes
Contributor Author

@Bencodes is the reflection hack you mentioned the same as the commit in your PR?

Yes, and it's being used like this:

internal class SomeProcessor(
    // ...
) : SymbolProcessor {
    // ...
    override fun finish() {
        val clazz = Class.forName("com.google.devtools.ksp.KSObjectCacheManager")
        val companion = clazz.getDeclaredField("Companion")
        val instance = companion.get(null)
        val method = instance.javaClass.getDeclaredMethod("clear")
        method.invoke(instance)
    }
}

@neetopia
Contributor

It looks incorrect to me: in your commit you are calling clear on the class com.google.devtools.ksp.processing.impl.KSObjectCacheManager, while your reflection is calling a class in a different package.

For now, if this reflection hack works for anyone on the older version (1.0.15), please give it a try; otherwise please use 1.0.14 or earlier until we figure out the root cause and fix it.

@Bencodes
Contributor Author

Oh wait, you are right. I didn't realize that there are two cache managers with the exact same name in the KSP codebase. So that means calling clear on the com.google.devtools.ksp.KSObjectCacheManager one is the correct solution, instead of calling clear on com.google.devtools.ksp.processing.impl.KSObjectCacheManager:

❯ fd KSObjectCacheManager
common-util/src/main/kotlin/com/google/devtools/ksp/KSObjectCacheManager.kt
compiler-plugin/src/main/kotlin/com/google/devtools/ksp/processing/impl/KSObjectCacheManager.kt
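
For anyone else applying the workaround on 1.9.21-1.0.15 and unsure which of the two classes is on their processor classpath, a defensive variant of the reflection hack could try both fully qualified names (sketch only; it mirrors the companion-object layout of the snippet above and silently skips anything that doesn't match):

    // Workaround sketch for 1.9.21-1.0.15 only: clear whichever KSObjectCacheManager is
    // reachable. The two class names come from the file listing above; nothing else is assumed.
    fun clearKspObjectCaches() {
        val candidates = listOf(
            "com.google.devtools.ksp.KSObjectCacheManager",
            "com.google.devtools.ksp.processing.impl.KSObjectCacheManager",
        )
        for (name in candidates) {
            try {
                val clazz = Class.forName(name)
                val companion = clazz.getDeclaredField("Companion").get(null)
                companion.javaClass.getDeclaredMethod("clear").invoke(companion)
            } catch (ignored: ReflectiveOperationException) {
                // Class, Companion field, or clear() not present in this layout; skip it.
            }
        }
    }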

@johnjohndoe

Isn't this issue resolved with the 1.9.21-1.0.16 release? A commit isn't linked here, though.

Bencodes added a commit to Bencodes/ksp that referenced this issue Dec 17, 2023
github-actions bot pushed a commit that referenced this issue Jan 6, 2024
@neetopia
Contributor

neetopia commented Jan 9, 2024

For everyone affected by the OOM issue, please try out this snapshot to verify the fix:
snapshot repository: https://oss.sonatype.org/content/repositories/snapshots/
snapshot version: 1.9.21-1.0.17-SNAPSHOT
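
For reference, one way to wire the snapshot into a Gradle Kotlin DSL build is sketched below (adjust to however your build already applies KSP; the plugin id shown is KSP's standard Gradle plugin id, everything else is an assumption about your setup):

    // settings.gradle.kts (sketch): make the Sonatype snapshot repository visible to plugin
    // resolution so the SNAPSHOT version of the KSP Gradle plugin can be found.
    pluginManagement {
        repositories {
            maven("https://oss.sonatype.org/content/repositories/snapshots/")
            gradlePluginPortal()
            mavenCentral()
        }
    }

    // build.gradle.kts (sketch): pin the module to the snapshot version. The same snapshot
    // repository also needs to be visible wherever module dependencies are resolved, since
    // the symbol-processing artifacts come from it as well.
    plugins {
        id("com.google.devtools.ksp") version "1.9.21-1.0.17-SNAPSHOT"
    }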

@JoelWilcox

I confirmed that 1.9.21-1.0.17-SNAPSHOT fixes the OOM issue for Anvil 👍

@jbarr21

jbarr21 commented Jan 10, 2024

We are no longer seeing OOMs in our Buck or Bazel builds on 1.9.21-1.0.17-SNAPSHOT.

As @Bencodes previously noted, memory usage on newer KSP versions has increased, which will require us to reconfigure some of our max memory & thread values.

@shyvum

shyvum commented Jan 14, 2024

@neetopia Given the fix has been verified by two folks, can you go ahead and tag a hotfix release?

@neetopia
Contributor

The fix is shipped with 1.9.22-1.0.17.

@Bencodes
Contributor Author

Thanks!

@neetopia will the artifacts be uploaded to the 1.9.22-1.0.17 GitHub release?
