elasticsearch data node crashing with OutOfMemoryError #30930
Comments
Do you have any indication that there is a memory leak in Elasticsearch? Unless you have an indication that there is a memory leak or another problem internal to Elasticsearch, I will close this issue as not being a bug. OutOfMemoryErrors can happen due to overloading the cluster with aggregations and/or indexing. For recommendations and help with fixing these issues, you can start a new thread at https://discuss.elastic.co/c/elasticsearch. You may also open the heap dump you have in a tool like Eclipse Memory Analyzer (MAT) to provide more details when asking for help on the forums.
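For reference, the .hprof files that MAT reads can be produced automatically by running the JVM with -XX:+HeapDumpOnOutOfMemoryError (Elasticsearch's default jvm.options typically includes this flag), or on demand through the HotSpot diagnostic MBean. Below is a minimal, illustrative sketch of the on-demand route; it assumes a HotSpot JVM such as the Oracle 1.8.0_171 build reported here, and the output path is hypothetical.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import com.sun.management.HotSpotDiagnosticMXBean;

// Minimal sketch: trigger an .hprof heap dump on a running HotSpot JVM,
// the same kind of file Eclipse MAT analyzes. The output path is hypothetical.
public class HeapDumpExample {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
                server, "com.sun.management:type=HotSpotDiagnostic", HotSpotDiagnosticMXBean.class);
        // 'true' dumps only live (reachable) objects, which forces a full GC first.
        diagnostic.dumpHeap("/tmp/es-data-node.hprof", true);
    }
}
```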
Hey @jaymode! Shouldn't the request circuit breaker protect us from exactly this case? It is set on our cluster with its default value.
The circuit breakers are a best-effort attempt to prevent OOM, but it is still possible to overload Elasticsearch and get an OOM. For example, you might have the breaker set to 60% of the total heap, but if you do not actually have 60% of your total heap free, you can still get an OOM.
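To make that arithmetic concrete, here is a minimal sketch (not Elasticsearch code; the heap sizes are hypothetical) of how a request can stay under a breaker limit expressed as a fraction of the total heap and still exceed the heap that is actually free:

```java
// Illustrative only: a breaker budget derived from *total* heap says nothing
// about how much heap is actually *free* at the moment the request arrives.
public class BreakerHeadroom {
    public static void main(String[] args) {
        long totalHeapBytes   = 30L * 1024 * 1024 * 1024; // 30 GB heap (hypothetical)
        long alreadyUsedBytes = 20L * 1024 * 1024 * 1024; // 20 GB already in use
        double breakerLimit   = 0.60;                     // breaker at 60% of total heap

        long breakerBudget = (long) (totalHeapBytes * breakerLimit); // 18 GB allowed
        long freeHeap      = totalHeapBytes - alreadyUsedBytes;      // only 10 GB free

        long requestBytes = 12L * 1024 * 1024 * 1024;         // a 12 GB request
        boolean breakerTrips = requestBytes > breakerBudget;  // false: under the 60% budget
        boolean oomLikely    = requestBytes > freeHeap;       // true: exceeds free heap anyway

        System.out.printf("breaker trips: %b, OOM likely: %b%n", breakerTrips, oomLikely);
    }
}
```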
That's a pretty strange way to treat this. IMHO, a query, no matter how complex, should not crash an entire cluster if the cluster is properly configured.
Hey @jaymode
@ArielCoralogix We are continuously working on improving our handling of memory and adding safeguards to prevent OOM errors. This issue is closed because there is nothing more here than "some data nodes crashed with OOM and I can give you a heap dump", which is not actionable. We use GitHub for confirmed bugs and features, and our forums as the place to get help with issues like this. There are other open issues for specific items that relate to circuit breakers. @amnons77 I am referring to the JVM heap in my previous answer. Today we cannot prevent OOM 100% of the time. I cannot give you an answer without more details, and the forum is the place to get help with these kinds of questions.
@jaymode IMHO an OOM is always a bug, and the memory dump should be analyzed to find the root cause.
As a development team, we try our best to prevent OOM. There are cases of OOM that we know need work, and there are other cases we cannot control, such as high GC overhead that leads the JVM to throw an OOM even when memory can still be allocated.
GitHub is not the right place for this analysis. As I mentioned earlier, the forums would be a good place to ask for help. Developers and community members are active on the forums.
Cluster topology, and basically as much information as you can provide when you ask for help on the forums.
@jaymode From the memory dump it seems that org.apache.lucene.search.DisjunctionMaxQuery objects take up ~70% of the heap. Would you please consider reopening the bug or help us get to the bottom of this crash?
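For context, DisjunctionMaxQuery is the Lucene query type that Elasticsearch can produce when a search spans several fields (for example, multi_match or multi-field query_string searches), so a very large query over many terms and many fields can materialize a large number of these objects on the heap while it executes. The sketch below shows that general shape at the Lucene level; it is illustrative only, the field and term names are hypothetical, and it is not the actual query behind this crash.

```java
import java.util.Arrays;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Minimal sketch (field/term names hypothetical): one DisjunctionMaxQuery per
// search term, each spanning several fields. Hundreds of terms over many fields
// multiply the number of query objects held on the heap during the search.
public class DisMaxShape {
    public static void main(String[] args) {
        String[] fields = {"title", "body", "tags"};
        String[] terms  = {"error", "memory", "crash"};

        BooleanQuery.Builder bool = new BooleanQuery.Builder();
        for (String term : terms) {
            Query[] perField = Arrays.stream(fields)
                    .map(f -> (Query) new TermQuery(new Term(f, term)))
                    .toArray(Query[]::new);
            // Tie-breaker 0.0f: only the best-matching field contributes to the score.
            bool.add(new DisjunctionMaxQuery(Arrays.asList(perField), 0.0f),
                     BooleanClause.Occur.SHOULD);
        }
        System.out.println(bool.build());
    }
}
```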
Elasticsearch version (bin/elasticsearch --version): 6.2.4

Plugins installed: [ingest-attachment, ingest-geoip, mapper-murmur3, mapper-size, repository-azure, repository-gcs, repository-s3]

JVM version (java -version):
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux prod-elasticsearch-hot-001 4.13.0-1018-azure #21-Ubuntu SMP Thu May 17 13:58:38 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Many of our data nodes crashed at the same time with an OutOfMemoryError.
I can send a link to the memory dump in a DM.
Call stack of one of the nodes:
<Thread 77> <--- OutOfMemoryError happened in this thread; State: BLOCKED
java.lang.OutOfMemoryError.<init>() OutOfMemoryError.java:48
io.netty.util.internal.PlatformDependent.allocateUninitializedArray(int) PlatformDependent.java:200
io.netty.buffer.PoolArena$HeapArena.newByteArray(int) PoolArena.java:676
io.netty.buffer.PoolArena$HeapArena.newChunk(int, int, int, int) PoolArena.java:686
io.netty.buffer.PoolArena.allocateNormal(PooledByteBuf, int, int) PoolArena.java:244
io.netty.buffer.PoolArena.allocate(PoolThreadCache, PooledByteBuf, int) PoolArena.java:226
io.netty.buffer.PoolArena.reallocate(PooledByteBuf, int, boolean) PoolArena.java:397
io.netty.buffer.PooledByteBuf.capacity(int) PooledByteBuf.java:118
io.netty.buffer.AbstractByteBuf.ensureWritable0(int) AbstractByteBuf.java:285
io.netty.buffer.AbstractByteBuf.ensureWritable(int) AbstractByteBuf.java:265
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf, int, int) AbstractByteBuf.java:1077
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf, int) AbstractByteBuf.java:1070
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf) AbstractByteBuf.java:1060
io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteBufAllocator, ByteBuf, ByteBuf) ByteToMessageDecoder.java:92
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ChannelHandlerContext, Object) ByteToMessageDecoder.java:263
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:348
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Object) AbstractChannelHandlerContext.java:340
io.netty.handler.logging.LoggingHandler.channelRead(ChannelHandlerContext, Object) LoggingHandler.java:241
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:348
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Object) AbstractChannelHandlerContext.java:340
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(ChannelHandlerContext, Object) DefaultChannelPipeline.java:1359
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:348
io.netty.channel.DefaultChannelPipeline.fireChannelRead(Object) DefaultChannelPipeline.java:935
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read() AbstractNioByteChannel.java:134
io.netty.channel.nio.NioEventLoop.processSelectedKey(SelectionKey, AbstractNioChannel) NioEventLoop.java:645
io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(Set) NioEventLoop.java:545
io.netty.channel.nio.NioEventLoop.processSelectedKeys() NioEventLoop.java:499
io.netty.channel.nio.NioEventLoop.run() NioEventLoop.java:459
io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
java.lang.Thread.run() Thread.java:748