
Tuning Glibc Environment Variables (e.g. MALLOC_ARENA_MAX) #320

Closed
ipsi opened this issue Aug 12, 2016 · 12 comments

@ipsi

ipsi commented Aug 12, 2016

This is the other issue we discussed @nebhale.

While looking into some Java memory issues recently, I came across the following two articles from Heroku (should I just call them "The unnamable ones"? :P):

Tuning glibc Memory Behavior
Testing Cedar-14 Memory Use

These suggest that tuning MALLOC_ARENA_MAX can help with memory problems in applications. It appears that it also causes problems for Hadoop, though that's probably their own fault for monitoring VMEM (seriously, guys, why?!).

In my own testing I have noticed that setting MALLOC_ARENA_MAX seems to help slow down memory growth, though it does not prevent it entirely.

There are performance implications from setting MALLOC_ARENA_MAX (I've always set it to 2, as suggested by Heroku), which fall out like so (a toy model of these rules is sketched just after the list):

  • A new thread is created and asks for an arena.
  • If fewer than MALLOC_ARENA_MAX arenas are in use (the limit defaults to 8 * cpu_count), it acquires a new one.
  • Otherwise, it shares an existing one.
  • When mallocing new memory, a thread with exclusive use of an arena has no need to lock.
  • If the arena is shared, it has to lock, resulting in contention for memory. Prior to glibc introducing arenas, all threads contended for the same memory.
  • When a thread exits, glibc marks its arena as "free", but does not return it to the OS.
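
For what it's worth, here is a toy Java model of those selection rules as I understand them. This is purely illustrative - it is not glibc code, and the names and the 8 * cpus default just restate the bullets above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Toy model of the arena-selection rules described above. NOT glibc code:
// each new "thread" gets its own arena until the limit is reached, after
// which arenas are shared (and therefore contended).
public class ArenaModel {
    static final int CPU_COUNT = Runtime.getRuntime().availableProcessors();
    static final int ARENA_MAX =
            Integer.getInteger("malloc.arena.max", 8 * CPU_COUNT); // default per the bullets

    static final List<Object> arenas = new ArrayList<>(); // each Object stands in for an arena

    // Called when a "thread" first asks for an arena.
    static synchronized Object acquireArena() {
        if (arenas.size() < ARENA_MAX) {
            Object arena = new Object();   // fewer than the limit in use: take a fresh one
            arenas.add(arena);
            return arena;
        }
        // Limit reached: share an existing arena, which means locking on malloc.
        return arenas.get(ThreadLocalRandom.current().nextInt(arenas.size()));
    }

    public static void main(String[] args) {
        System.out.println("cpus=" + CPU_COUNT + ", arena max=" + ARENA_MAX);
        for (int i = 0; i < ARENA_MAX + 4; i++) {
            acquireArena(); // simulate ARENA_MAX + 4 threads asking for an arena
        }
        System.out.println("arenas in use: " + arenas.size());
    }
}
```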

That's my understanding at any rate. So I think most Java apps on CF are basically unaffected for the following reasons:

  • Limited amount of thread-thrashing. If this is a major concern, the container can be configured with min_threads = max_threads. Most threads are long-lived and process a lot of requests.
  • Java does not malloc on a per-thread basis - it mallocs the heap space at startup, and then writes to it as needed. As far as I know, once a thread is running, there should be no need to do an OS-level malloc, meaning we'd never have OS-level contention for memory (that's pulled up into the Java runtime).
    • NIO and things like ByteBuffers might result in malloc calls - not sure about that (see the test sketch after this list).
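
To poke at that last point, something like the following can be used - a hypothetical test program I'm sketching here, not anything shipped with the buildpack. Each thread allocates direct ByteBuffers, which HotSpot allocates via malloc, so per-thread arena behaviour becomes visible. Run it once with MALLOC_ARENA_MAX unset and once with MALLOC_ARENA_MAX=2 and compare RSS and the pmap output:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical test program: many threads doing malloc-backed allocations via
// direct ByteBuffers, then parking so the process can be inspected with
// `pmap <pid>` or `ps -o rss`. Sizes and thread counts are arbitrary.
public class ArenaGrowthTest {
    public static void main(String[] args) throws Exception {
        int threadCount = 32;
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            Thread t = new Thread(() -> {
                List<ByteBuffer> retained = new ArrayList<>();
                for (int j = 0; j < 16; j++) {
                    // allocateDirect goes through malloc in HotSpot, so each
                    // thread exercises whichever arena glibc hands it.
                    retained.add(ByteBuffer.allocateDirect(128 * 1024));
                }
                try {
                    Thread.sleep(Long.MAX_VALUE); // keep the thread (and its arena) alive
                } catch (InterruptedException ignored) {
                }
            });
            t.start();
            workers.add(t);
        }
        System.out.println("Allocated from " + threadCount + " threads; inspect with pmap now.");
        for (Thread t : workers) {
            t.join();
        }
    }
}
```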

With all of that said, after digging deeper into this, it seems like setting MALLOC_ARENA_MAX may have just hidden a deeper problem, and I'm not sure whether I would advise setting it by default. It didn't seem to adversely affect performance, but both applications I tested were effectively non-performant by design, so don't take that as gospel.

I had initially thought that there may be memory concerns here, but I'm not so sure anymore. I think the following scenarios are what would happen if there were a memory leak:

Unlimited MALLOC_ARENA_MAX

  • Thread asks for arena
  • Thread gets arena
  • Thread fills arena, never frees memory
  • Thread asks for new arena
  • Repeat until the cgroup OOM killer hits.

MALLOC_ARENA_MAX set to 2

  • Thread asks for arena
  • Thread gets arena
  • Thread fills arena, never frees memory
  • Thread asks for new arena, but no more available.
  • Thread instead writes to "main" arena (or the native program heap in this case), which is unbounded. Given enough time, cgroup OOM killer hits.

(Actual allocation decisions are a lot more complex - this is heavily simplified. See Understanding glibc malloc for an in-depth explanation).
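
For anyone who wants to reproduce the symptom rather than reason about it, a contrived native leak is easy to write. The sketch below is my own illustration (nothing to do with the buildpack): it leaks memory via sun.misc.Unsafe from several threads and never frees it, so RSS climbs until the cgroup OOM killer hits, regardless of what MALLOC_ARENA_MAX is set to.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Contrived native-memory leak (works on Java 8): each thread repeatedly
// mallocs 1 MiB via Unsafe.allocateMemory, touches it so it becomes resident,
// and never frees it. Watch RSS with `pmap` or `ps` while it runs.
public class NativeLeak {
    public static void main(String[] args) throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        final Unsafe unsafe = (Unsafe) f.get(null);

        for (int i = 0; i < 4; i++) {
            Thread t = new Thread(() -> {
                while (true) {
                    long address = unsafe.allocateMemory(1024 * 1024);
                    unsafe.setMemory(address, 1024 * 1024, (byte) 1); // touch it so RSS grows
                    // the address is deliberately dropped: the memory is never freed
                    try {
                        Thread.sleep(100);
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            });
            t.start();
        }
    }
}
```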

This would explain why I saw the heap space reported by pmap (which is not the Java Heap!) going up and up in one of my test runs - there was a memory leak somewhere in native code, the arenas got filled up, and once they were full, it had to fall back to using the native heap.

With all that said, some allowance for the native memory used by individual threads should probably be made. If it's small enough, it could probably stay in the "native" part of the calculator. Otherwise, since it's relative to the number of threads used, it may need to be calculated somehow.
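
As a straw man for "calculated somehow", the allowance could be a simple linear function of thread count. The constants below are placeholders I'm making up for illustration - they are not values the memory calculator actually uses:

```java
// Back-of-the-envelope estimate of per-thread native memory. All constants
// are assumptions for illustration only.
public class NativeAllowanceSketch {
    public static void main(String[] args) {
        int threads = 200;            // e.g. Tomcat's default max threads
        long stackKb = 1024;          // -Xss1m, the usual 64-bit Linux default
        long perThreadMallocKb = 64;  // guess at malloc'd bookkeeping per thread
        int arenas = 2;               // MALLOC_ARENA_MAX=2
        long perArenaKb = 1024;       // rough guess at resident overhead per arena

        long totalKb = threads * (stackKb + perThreadMallocKb) + arenas * perArenaKb;
        System.out.printf("native allowance ~ %d MiB%n", totalKb / 1024);
    }
}
```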

Regardless of the decision to set it, I do think that documenting it would be very useful (along with how it relates to Java, etc), as would providing a link to other glibc configuration environment variables. Finding this information was not as easy as I would have liked. I think this should also be documented for other buildpacks as well - Ruby and Go, particularly, are more likely to see issues with memory arenas - note that the Heroku examples all talked about Ruby.

@lhotari

lhotari commented Aug 25, 2016

@ipsi Are you aware of #160 and #163 ?

@lhotari

lhotari commented Aug 25, 2016

I just received a tweet from @TimGerlach from SAP and he seems interested in improvements in this area as well. @TimGerlach, could you elaborate on your requirements?

@lhotari

lhotari commented Aug 25, 2016

This would explain why I saw the heap space reported by pmap (which is not the Java Heap!) going up and up in one of my test runs - there was a memory leak somewhere in native code, the arenas got filled up, and once they were full, it had to fall back to using the native heap.

Have you checked this article written by Evan Jones:

TL;DR: Always close GZIPInputStream and GZIPOutputStream since they use native memory via zlib. To track down leaks, use jemalloc and turn on sampling profiling using the MALLOC_CONF environment variable.

Recently Heroku blogged about tracking down a similar bug in Kafka. The GOV.UK GDS team's article about debugging native memory leaks might also be helpful.

Evan Jones has also blogged about a native memory leak in Java's ByteBuffer.
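
In case it's useful for anyone landing here, the "always close" advice boils down to try-with-resources. A minimal sketch (the file name is just an example):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class GzipRead {
    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[8192];
        // try-with-resources guarantees close(), which ends the Inflater and
        // releases its native zlib buffers even if the read loop throws.
        try (InputStream in = new GZIPInputStream(new FileInputStream("payload.gz"))) {
            while (in.read(buf) != -1) {
                // drain the stream
            }
        }
    }
}
```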

@TimGerlach

Thanks @lhotari for mentioning me and pointing me to this issue. I am going to elaborate on #319 as it is more closely related to our topic.

@ipsi
Author

ipsi commented Aug 25, 2016

@lhotari I'm aware of the first one - was not aware of the second issue. As far as I can tell, there is actually a HotSpot bug, and this might be what you're seeing as well: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8164293

I saw the GZIP thing, and even ran my app with JEMalloc, but did not see the same issues that they reported. Given that this issue (a) only appears in Java 8, and (b) disappears when HotSpot is disabled, I suspect that there are multiple issues that could cause apps to increase in memory over time.

@lhotari

lhotari commented Aug 25, 2016

@ipsi Thanks for pointing out the JDK bug.

Did you take a look at the assumptions I presented in #163 (comment) ?
One assumption is that tuning CodeCacheExpansionSize and MinMetaspaceExpansion would also reduce malloc memory fragmentation.
There must be multiple factors that cause the same symptom of slow growth of the process RSS. That was the main reason for filing #163 .

@tootedom

tootedom commented Jan 5, 2017

Hi there,

I've noticed that TieredCompilation is seemingly resulting in continual growth. Disabling TieredCompilation (-XX:-TieredCompilation) has stopped this growth from occurring on several of our applications (all using jdk8). I submitted a bug to Oracle today for evaluation; it could be related to the already mentioned bug: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8164293
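
If anyone wants to confirm from inside a running container whether the flag actually took effect, something like this works on HotSpot (just a quick sketch):

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class CheckTieredCompilation {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Prints "true" unless the JVM was started with -XX:-TieredCompilation.
        System.out.println("TieredCompilation = "
                + bean.getVMOption("TieredCompilation").getValue());
    }
}
```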

Just thought I'd mention it, in case this helps you out.

/dom

@ipsi
Author

ipsi commented Jan 12, 2017

FYI, the bug I raised has been marked as resolved in https://bugs.openjdk.java.net/browse/JDK-8164293; the target version is 8u152, which has an expected release date of 2017-10-16.

@nebhale
Member

nebhale commented Jan 12, 2017

Well, that's a hilarious date.

@mweirauch

FYI, the fix has been backported to 8u131, which has already been released: https://bugs.openjdk.java.net/browse/JDK-8178124

@nebhale
Member

nebhale commented May 8, 2017

Well that's good news. I'll leave this open for a couple of weeks to allow people to test against v3.16 and v4.0 and let me know if there is still an outstanding issue.

@nebhale nebhale self-assigned this May 26, 2017
@nebhale nebhale added this to the v3.16 milestone May 26, 2017
@nebhale
Member

nebhale commented May 26, 2017

No complaints, so I'm closing this out.

@nebhale nebhale closed this as completed May 26, 2017
jtwaleson pushed a commit to mendix/cf-mendix-buildpack that referenced this issue Feb 21, 2018
The default behavior is 8x the number of detected CPUs. As Cloud
Foundry typically uses large host machines with smaller containers,
and the Java process is unaware of the difference in allocated CPUs,
the numbers are way off. This often leads to high native memory usage,
followed by a cgroup OOM killer event.

We go with Heroku's recommendation of lowering to a
setting of 2 for small instances. We also grow the setting linearly
with memory to be more in line with the default setting in Mendix
Cloud v3.

References:
- cloudfoundry/java-buildpack#163
- https://devcenter.heroku.com/articles/testing-cedar-14-memory-use
- cloudfoundry/java-buildpack#320