elasticsearch data node crashing with OutOfMemoryError #30930

Closed · farin99 opened this issue May 29, 2018 · 9 comments

farin99 commented May 29, 2018

Elasticsearch version (bin/elasticsearch --version):
6.2.4
Plugins installed: [
ingest-attachment
ingest-geoip
mapper-murmur3
mapper-size
repository-azure
repository-gcs
repository-s3
]

JVM version (java -version):
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
OS version (uname -a if on a Unix-like system):
Linux prod-elasticsearch-hot-001 4.13.0-1018-azure #21-Ubuntu SMP Thu May 17 13:58:38 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:

Many of our data nodes crashed together with OutOfMemoryError.

I can send a link to the memory dump in DM.

Call stack of one of the nodes:
<Thread 77> <--- OutOfMemoryError happened in this thread State: BLOCKED
java.lang.OutOfMemoryError.&lt;init&gt;() OutOfMemoryError.java:48
io.netty.util.internal.PlatformDependent.allocateUninitializedArray(int) PlatformDependent.java:200
io.netty.buffer.PoolArena$HeapArena.newByteArray(int) PoolArena.java:676
io.netty.buffer.PoolArena$HeapArena.newChunk(int, int, int, int) PoolArena.java:686
io.netty.buffer.PoolArena.allocateNormal(PooledByteBuf, int, int) PoolArena.java:244
io.netty.buffer.PoolArena.allocate(PoolThreadCache, PooledByteBuf, int) PoolArena.java:226
io.netty.buffer.PoolArena.reallocate(PooledByteBuf, int, boolean) PoolArena.java:397
io.netty.buffer.PooledByteBuf.capacity(int) PooledByteBuf.java:118
io.netty.buffer.AbstractByteBuf.ensureWritable0(int) AbstractByteBuf.java:285
io.netty.buffer.AbstractByteBuf.ensureWritable(int) AbstractByteBuf.java:265
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf, int, int) AbstractByteBuf.java:1077
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf, int) AbstractByteBuf.java:1070
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf) AbstractByteBuf.java:1060
io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteBufAllocator, ByteBuf, ByteBuf) ByteToMessageDecoder.java:92
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ChannelHandlerContext, Object) ByteToMessageDecoder.java:263
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:348
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Object) AbstractChannelHandlerContext.java:340
io.netty.handler.logging.LoggingHandler.channelRead(ChannelHandlerContext, Object) LoggingHandler.java:241
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:348
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Object) AbstractChannelHandlerContext.java:340
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(ChannelHandlerContext, Object) DefaultChannelPipeline.java:1359
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:348
io.netty.channel.DefaultChannelPipeline.fireChannelRead(Object) DefaultChannelPipeline.java:935
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read() AbstractNioByteChannel.java:134
io.netty.channel.nio.NioEventLoop.processSelectedKey(SelectionKey, AbstractNioChannel) NioEventLoop.java:645
io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(Set) NioEventLoop.java:545
io.netty.channel.nio.NioEventLoop.processSelectedKeys() NioEventLoop.java:499
io.netty.channel.nio.NioEventLoop.run() NioEventLoop.java:459
io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
java.lang.Thread.run() Thread.java:748


jaymode commented May 29, 2018

Do you have any indication that there is a memory leak or another problem internal to Elasticsearch? Unless you do, I will close this issue as not being a bug. OutOfMemoryErrors can happen simply from overloading the cluster with aggregations and/or indexing. For recommendations and help with fixing these issues, you can start a new thread at https://discuss.elastic.co/c/elasticsearch. You may also open the heap dump you have in a tool like Eclipse Memory Analyzer (MAT) to provide more details when asking for help in the forums.
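If you want a first look at the heap yourself before starting a forum thread, the stock JDK 8 tools can produce a class histogram or an HPROF dump that MAT can open; a minimal sketch, where the PID and output path are placeholders:

```
# Histogram of live objects on the affected data node (replace <pid> with the Elasticsearch PID;
# note that -histo:live forces a full GC so only live objects are counted)
jmap -histo:live <pid> | head -n 30

# Full heap dump in HPROF format, which Eclipse Memory Analyzer (MAT) can open
jmap -dump:live,format=b,file=/tmp/es-heap.hprof <pid>
```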

jaymode closed this as completed May 29, 2018

ArielCoralogix commented May 29, 2018

Hey @jaymode! Shouldn't the request circuit breaker protect us from exactly this case? It's set on our cluster with its default value.
https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html#request-circuit-breaker
@danielmitterdorfer @jimczi @clintongormley, it looks like you have already discussed this here: #20250
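For reference, the breaker state can be inspected per node, and the request breaker limit is a dynamic setting, so it can be tightened below the 60% default without a restart; a hedged sketch, with an illustrative host and percentage:

```
# Per-node breaker limits, estimated usage, and trip counts
curl -s 'localhost:9200/_nodes/stats/breaker?pretty'

# Lower the request breaker below its default of 60% of the heap (40% is only an example)
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "indices.breaker.request.limit": "40%"
  }
}'
```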


jaymode commented May 29, 2018

The circuit breakers are a best-effort attempt to prevent OOM, but it is still possible to overload Elasticsearch and get an OOM. For example, you might have the breaker set to 60% of the total heap, but you may not actually have 60% of your total heap free, so you can still get an OOM.

@ArielCoralogix

That's a pretty strange way to treat this. IMHO, a query, no matter how complex, should not crash an entire, properly configured cluster.

@amnons77

Hey @jaymode,
According to the circuit breaker documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html#request-circuit-breaker
the default is 60% of the JVM heap, not of the total machine memory. We didn't change the circuit breaker defaults, and we set the JVM to use only half of the total machine memory. My question is: is it possible to configure Elasticsearch so that it does not crash with an OOM, or is this by design and the solution is to add more nodes?
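For reference, heap sizing and the dump-on-OOM behaviour live in config/jvm.options; a minimal sketch with illustrative values (not a sizing recommendation) for a machine whose heap is set to roughly half of its RAM:

```
# config/jvm.options (illustrative values)
# Min and max heap set to the same value, here about half of a 64 GB machine
-Xms31g
-Xmx31g

# Shipped enabled by default, which is why Elasticsearch writes a heap dump on OOM
-XX:+HeapDumpOnOutOfMemoryError
```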


jaymode commented May 30, 2018

@ArielCoralogix We are continuously working on improving our handling of memory and adding safeguards to prevent OOM errors. This issue is closed because there is nothing more here than “some data nodes crashed with OOM and I can give you a heap dump”, which is not actionable. We use GitHub for confirmed bugs and feature requests and our forums as the place to get help with issues like this. There are other open issues for specific items that relate to circuit breakers.

@amnons77 I was referring to the JVM heap in my previous answer. Today we cannot prevent OOMs 100% of the time. I cannot give you an answer without more details, and the forum is the place to get help with these kinds of questions.


farin99 commented May 30, 2018

@jaymode IMHO an OOM is always a bug, and the memory dump should be analyzed to find the root cause.
We are trying to analyze it ourselves, but of course it is much easier for someone who is more familiar with the code base :)
Regarding more information, please let us know what you need. As this is a production issue in a multi-user environment, it is hard to provide the specific use case. I would argue that this is why Elasticsearch creates a memory dump on OOM by default.


jaymode commented May 30, 2018

> IMHO an OOM is always a bug

As a development team, we try our best to prevent OOMs. There are cases of OOM that we know need work, and there are other cases we cannot control, such as high GC overhead causing the JVM to throw an OOM even when memory can still be allocated.

> the memory dump should be analyzed to find the root cause.

GitHub is not the right place for that analysis. As I mentioned earlier, the forums would be a good place to ask for help; developers and community members are active there.

> Regarding more information, please let us know what you need

- Cluster topology
- Number of CPUs and load
- Amount of memory
- Size of heap
- Changed JVM parameters
- Number of indices and shards
- Number of search req/s
- Did anything spike at the time of the issue?
- SSDs or spinning disks
- Logs
- Any GC logs?

Basically, as much information as you can provide when you ask for help on the forums.
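Most of that list can be pulled directly from the cluster APIs; a sketch of the calls that cover it, with an illustrative host:

```
# Per-node roles, heap, RAM, CPU, and load
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.max,heap.percent,ram.max,cpu,load_1m'

# Index and shard counts across the cluster
curl -s 'localhost:9200/_cluster/stats?pretty'

# Per-node search/indexing counters plus JVM and circuit-breaker state
curl -s 'localhost:9200/_nodes/stats/indices,jvm,breaker?pretty'
```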


farin99 commented Jun 1, 2018

@jaymode From the memory dump it seems that org.apache.lucene.search.DisjunctionMaxQuery objects take ~70% of the heap. Would you please consider reopening the bug or helping us get to the bottom of this crash?
All of the information you requested, plus our observations from the memory dump, is in the discussion.
Thanks!
