Besu uses 100 % CPU after losing the internet connection #5348

Closed
ahamlat opened this issue Apr 15, 2023 · 9 comments
Assignees
Labels
bug Something isn't working mainnet P3 Medium (ex: JSON-RPC request not working with a specific client library due to loose spec assumption) TeamChupa GH issues worked on by Chupacabara Team

Comments

@ahamlat
Contributor

ahamlat commented Apr 15, 2023

Description

A Discord user reported Besu consuming 100% CPU after restarting their router.
RAM usage was affected too (it grew from 5.19 GB to 11.7 GB out of 32 GB total).

Below are the logs shared by the user:

rocketpool_eth1  |      at app//io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
rocketpool_eth1  |      at app//io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
rocketpool_eth1  |      at app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
rocketpool_eth1  |      at java.base@17.0.6/java.lang.Thread.run(Thread.java:833)
rocketpool_eth1  | 2023-04-14 22:33:32.190+00:00 | vert.x-eventloop-thread-0 | ERROR | ContextImpl | Unhandled exception
rocketpool_eth1  | java.lang.OutOfMemoryError: Java heap space
rocketpool_eth1  | 2023-04-14 22:33:18.618+00:00 | dnsjava NIO selector | ERROR | Besu | Uncaught exception in thread "dnsjava NIO selector"
rocketpool_eth1  | java.lang.OutOfMemoryError: Java heap space
rocketpool_eth1  |      at java.base/java.util.concurrent.ConcurrentHashMap$ValuesView.iterator(ConcurrentHashMap.java:4740)
rocketpool_eth1  |      at org.xbill.DNS.NioTcpClient.checkTransactionTimeouts(NioTcpClient.java:53)
rocketpool_eth1  |      at org.xbill.DNS.NioTcpClient$$Lambda$1006/0x000000080113d220.run(Unknown Source)
rocketpool_eth1  |      at org.xbill.DNS.NioClient$$Lambda$1016/0x0000000801141950.accept(Unknown Source)
rocketpool_eth1  |      at java.base/java.util.concurrent.CopyOnWriteArrayList.forEach(CopyOnWriteArrayList.java:807)
rocketpool_eth1  |      at org.xbill.DNS.NioClient.runSelector(NioClient.java:114)
rocketpool_eth1  |      at org.xbill.DNS.NioClient$$Lambda$1008/0x000000080113d890.run(Unknown Source)
rocketpool_eth1  |      at java.base/java.lang.Thread.run(Thread.java:833)
rocketpool_eth1  | 2023-04-14 22:33:35.573+00:00 | nioEventLoopGroup-2-1 | WARN  | NioServerSocketChannel | Failed to create a new channel from an accepted socket.
rocketpool_eth1  | java.lang.OutOfMemoryError: Java heap space
rocketpool_eth1  | 2023-04-14 22:33:36.043+00:00 | EthScheduler-Transactions-0 | ERROR | Besu | Uncaught exception in thread "EthScheduler-Transactions-0"
rocketpool_eth1  | java.lang.OutOfMemoryError: Java heap space
rocketpool_eth1  | 2023-04-14 22:33:37.515+00:00 | vertx-blocked-thread-checker | WARN  | BlockedThreadChecker | Thread Thread[vert.x-eventloop-thread-1,5,main] has been blocked for 19215 ms, time limit is 2000 ms
rocketpool_eth1  | io.vertx.core.VertxException: Thread blocked
rocketpool_eth1  |      at java.base@17.0.6/jdk.internal.misc.Unsafe.park(Native Method)
rocketpool_eth1  |      at java.base@17.0.6/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
rocketpool_eth1  |      at java.base@17.0.6/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:715)
rocketpool_eth1  |      at java.base@17.0.6/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:938)
rocketpool_eth1  |      at java.base@17.0.6/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)
rocketpool_eth1  |      at java.base@17.0.6/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)
rocketpool_eth1  |      at java.base@17.0.6/java.util.concurrent.LinkedBlockingQueue.signalNotEmpty(LinkedBlockingQueue.java:175)
rocketpool_eth1  |      at java.base@17.0.6/java.util.concurrent.LinkedBlockingQueue.offer(LinkedBlockingQueue.java:423)
rocketpool_eth1  |      at java.base@17.0.6/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1357)
rocketpool_eth1  |      at app//io.vertx.core.impl.ContextImpl.executeBlocking(ContextImpl.java:172)
rocketpool_eth1  |      at app//io.vertx.core.impl.ContextImpl.executeBlocking(ContextImpl.java:137)
rocketpool_eth1  |      at app//io.vertx.core.impl.EventLoopContext.executeBlocking(EventLoopContext.java:22)
rocketpool_eth1  |      at app//io.vertx.core.impl.AbstractContext.executeBlocking(AbstractContext.java:138)
rocketpool_eth1  |      at app//io.vertx.core.impl.EventLoopContext.executeBlocking(EventLoopContext.java:22)
rocketpool_eth1  |      at app//io.vertx.core.Vertx.executeBlocking(Vertx.java:566)
rocketpool_eth1  |      at app//org.hyperledger.besu.ethereum.p2p.discovery.VertxPeerDiscoveryAgent$VertxAsyncExecutor.execute(VertxPeerDiscoveryAgent.java:316)
rocketpool_eth1  |      at app//org.hyperledger.besu.ethereum.p2p.discovery.internal.PeerDiscoveryController.createPacket(PeerDiscoveryController.java:613)
rocketpool_eth1  |      at app//org.hyperledger.besu.ethereum.p2p.discovery.internal.PeerDiscoveryController.lambda$bond$11(PeerDiscoveryController.java:531)
rocketpool_eth1  |      at app//org.hyperledger.besu.ethereum.p2p.discovery.internal.PeerDiscoveryController$$Lambda$1095/0x000000080118dff0.accept(Unknown Source)
rocketpool_eth1  |      at app//org.hyperledger.besu.ethereum.p2p.discovery.internal.PeerDiscoveryController$PeerInteractionState.execute(PeerDiscoveryController.java:787)
rocketpool_eth1  |      at app//org.hyperledger.besu.ethereum.p2p.discovery.internal.PeerDiscoveryController.dispatchInteraction(PeerDiscoveryController.java:657)
rocketpool_eth1  |      at app//org.hyperledger.besu.ethereum.p2p.discovery.internal.PeerDiscoveryController.bond(PeerDiscoveryController.java:553)
rocketpool_eth1  |      at app//org.hyperledger.besu.ethereum.p2p.discovery.internal.PeerDiscoveryController.onMessage(PeerDiscoveryController.java:324)
rocketpool_eth1  |      at app//org.hyperledger.besu.ethereum.p2p.discovery.PeerDiscoveryAgent.lambda$handleIncomingPacket$5(PeerDiscoveryAgent.java:307)
rocketpool_eth1  |      at app//org.hyperledger.besu.ethereum.p2p.discovery.PeerDiscoveryAgent$$Lambda$1163/0x00000008011a8240.accept(Unknown Source)
rocketpool_eth1  |      at java.base@17.0.6/java.util.Optional.ifPresent(Optional.java:178)
rocketpool_eth1  |      at app//org.hyperledger.besu.ethereum.p2p.discovery.PeerDiscoveryAgent.handleIncomingPacket(PeerDiscoveryAgent.java:307)
rocketpool_eth1  |      at app//org.hyperledger.besu.ethereum.p2p.discovery.VertxPeerDiscoveryAgent.lambda$handlePacket$10(VertxPeerDiscoveryAgent.java:297)
rocketpool_eth1  |      at app//org.hyperledger.besu.ethereum.p2p.discovery.VertxPeerDiscoveryAgent$$Lambda$1156/0x00000008011a67e8.handle(Unknown Source)
rocketpool_eth1  |      at app//io.vertx.core.impl.future.FutureImpl$3.onSuccess(FutureImpl.java:141)
rocketpool_eth1  |      at app//io.vertx.core.impl.future.FutureBase.lambda$emitSuccess$0(FutureBase.java:54)
rocketpool_eth1  |      at app//io.vertx.core.impl.future.FutureBase$$Lambda$816/0x000000080109d198.run(Unknown Source)
rocketpool_eth1  |      at app//io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
rocketpool_eth1  |      at app//io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
rocketpool_eth1  |      at app//io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
rocketpool_eth1  |      at app//io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:391)
rocketpool_eth1  |      at app//io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
rocketpool_eth1  |      at app//io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
rocketpool_eth1  |      at app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
rocketpool_eth1  |      at java.base@17.0.6/java.lang.Thread.run(Thread.java:833)
rocketpool_eth1  | 2023-04-14 22:33:44.810+00:00 | vertx-blocked-thread-checker | WARN  | BlockedThreadChecker | Thread Thread[vert.x-eventloop-thread-1,5,main] has been blocked for 6800 ms, time limit is 2000 ms
rocketpool_eth1  | io.vertx.core.VertxException: Thread blocked
rocketpool_eth1  |      at app//io.vertx.core.impl.future.FutureImpl.onComplete(FutureImpl.java:137)
rocketpool_eth1  |      at app//io.vertx.core.impl.future.PromiseImpl.onComplete(PromiseImpl.java:23)
rocketpool_eth1  |      at app//io.vertx.core.impl.AbstractContext.setResultHandler(AbstractContext.java:229)
rocketpool_eth1  |      at app//io.vertx.core.impl.AbstractContext.executeBlocking(AbstractContext.java:139)
rocketpool_eth1  |      at app//io.vertx.core.impl.EventLoopContext.executeBlocking(EventLoopContext.java:22)
rocketpool_eth1  |      at app//io.vertx.core.Vertx.executeBlocking(Vertx.java:566)
rocketpool_eth1  |      at app//io.vertx.core.Vertx.executeBlocking(Vertx.java:573)
rocketpool_eth1  |      at app//org.hyperledger.besu.ethereum.p2p.discovery.VertxPeerDiscoveryAgent.handlePacket(VertxPeerDiscoveryAgent.java:283)
rocketpool_eth1  |      at app//org.hyperledger.besu.ethereum.p2p.discovery.VertxPeerDiscoveryAgent$$Lambda$1031/0x000000080115f9a0.handle(Unknown Source)
rocketpool_eth1  |      at app//io.vertx.core.impl.EventLoopContext.emit(EventLoopContext.java:50)
rocketpool_eth1  |      at app//io.vertx.core.impl.ContextImpl.emit(ContextImpl.java:274)
rocketpool_eth1  |      at app//io.vertx.core.impl.EventLoopContext.emit(EventLoopContext.java:22)
rocketpool_eth1  |      at app//io.vertx.core.datagram.impl.DatagramSocketImpl$Connection.handlePacket(DatagramSocketImpl.java:541)
rocketpool_eth1  |      at app//io.vertx.core.datagram.impl.DatagramSocketImpl$Connection.handleMessage(DatagramSocketImpl.java:521)
rocketpool_eth1  |      at app//io.vertx.core.net.impl.ConnectionBase.read(ConnectionBase.java:155)
rocketpool_eth1  |      at app//io.vertx.core.net.impl.VertxHandler.channelRead(VertxHandler.java:153)
rocketpool_eth1  |      at app//io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
rocketpool_eth1  |      at app//io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
rocketpool_eth1  |      at app//io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
rocketpool_eth1  |      at app//io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
rocketpool_eth1  |      at app//io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
rocketpool_eth1  |      at app//io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
rocketpool_eth1  |      at app//io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
rocketpool_eth1  |      at app//io.netty.channel.epoll.EpollDatagramChannel.processPacket(EpollDatagramChannel.java:616)
rocketpool_eth1  |      at app//io.netty.channel.epoll.EpollDatagramChannel.recvmsg(EpollDatagramChannel.java:651)
rocketpool_eth1  |      at app//io.netty.channel.epoll.EpollDatagramChannel.access$100(EpollDatagramChannel.java:56)
rocketpool_eth1  |      at app//io.netty.channel.epoll.EpollDatagramChannel$EpollDatagramChannelUnsafe.epollInReady(EpollDatagramChannel.java:490)
rocketpool_eth1  |      at app//io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$1.run(AbstractEpollChannel.java:425)
rocketpool_eth1  |      at app//io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
rocketpool_eth1  |      at app//io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
rocketpool_eth1  |      at app//io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
rocketpool_eth1  |      at app//io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:391)
rocketpool_eth1  |      at app//io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
rocketpool_eth1  |      at app//io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
rocketpool_eth1  |      at app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
rocketpool_eth1  |      at java.base@17.0.6/java.lang.Thread.run(Thread.java:833)

Besu shouldn't trigger an OutOfMemoryError after losing its internet connection.

@non-fungible-nelson non-fungible-nelson added bug Something isn't working mainnet P3 Medium (ex: JSON-RPC request not working with a specific client library due to loose spec assumption) labels Apr 17, 2023
@non-fungible-nelson non-fungible-nelson added the TeamChupa GH issues worked on by Chupacabara Team label May 1, 2023
@schwitin

schwitin commented Jun 2, 2023

Same issue after a downtime of about 10 days (power supply failure).

@ahamlat
Contributor Author

ahamlat commented Jun 30, 2023

From what I gathered using debug logs, this issue could be related to Besu's enode being blacklisted by other nodes. I would recommend deleting the key file inside the Besu data path and restarting Besu.
If anyone is facing this issue, try the steps above (sketched below) and let us know if it resolved the issue.
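A minimal sketch of those steps, assuming a systemd-managed Besu with its data path at /var/lib/besu and the node key stored in a file named key (all of these are assumptions; Docker setups such as Rocket Pool keep the data path on a mounted volume instead):

# Assumptions: systemd service named "besu", data path /var/lib/besu, node key file named "key".
sudo systemctl stop besu
rm /var/lib/besu/key        # Besu generates a fresh node key (and therefore a new enode) on the next start
sudo systemctl start besu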

@Pablosan

I'm pretty sure I had tried that earlier with no success, based on something I found in the Besu documentation. I just tested again, and after deleting the key file in the Besu directory I am still seeing blocked threads (Thread[#33,vert.x-eventloop-thread-1,5,main] and Thread[#32,vert.x-eventloop-thread-0,5,main]) and still getting stack traces.

Eventually I get:

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "vertx-blocked-thread-checker"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "dnsjava NIO selector"
2023-06-30 10:59:21.180-05:00 | vertx-blocked-thread-checker | ERROR | Besu | Uncaught exception in thread "vertx-blocked-thread-checker"
java.lang.OutOfMemoryError: Java heap space

After that I repeatedly get stack traces thrown by DNSResolver, AbstractChannel, and rejectedExecution.

Again, this is all after I deleted the key file from the besu directory as you suggested. So unfortunately this did not resolve my issue.

@ahamlat
Contributor Author

ahamlat commented Jul 5, 2023

Thanks for the feedback @Pablosan.
Analysing the heap dump you shared, I was able to find what causes the OutOfMemoryError and the peak in CPU usage.
We can see in the screenshot below that one thread is holding more than 99% of the heap.
[screenshots: heap dump showing a single thread retaining more than 99% of the heap]

Looking into the stack trace of the thread holding 99% of the heap, we can see that this happens because Besu tries to do a huge reorg to block number 15_537_445 (0a46f5e956718147fe6b8f16301496283c28059fd07c16e6f842c0a3dfc1a711).

[screenshot: stack trace of the thread performing the reorg]

The shouldContinueDownloading method calls rewindToBlock(0a46f5e956718147fe6b8f16301496283c28059fd07c16e6f842c0a3dfc1a711), which calls handleChainReorg. This method in DefaultBlockchain creates two local list variables, and each one consumes around 2 GiB of memory because the chosen pivot block is too far behind the current chain head in the database (more than 292 days old).

while (currentOldChainWithReceipts.getNumber() > currentNewChainWithReceipts.getNumber()) {
      // If oldChain is longer than new chain, walk back until we meet the new chain by number,
      // updating as we go.
      updater.removeBlockHash(currentOldChainWithReceipts.getNumber());

      removedTransactions.addAll(
          currentOldChainWithReceipts.getBlock().getBody().getTransactions());
      addRemovedLogsWithMetadata(removedLogsWithMetadata, currentOldChainWithReceipts);

      currentOldChainWithReceipts = getParentBlockWithReceipts(currentOldChainWithReceipts);
    }


The peak in CPU usage is related to GC threads trying to free up some memory with no success, as the lists just keep growing.
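To see why no heap size can absorb this, here is an illustrative back-of-envelope calculation (all figures are assumptions for the sake of the example, not measurements from this node):

# Back-of-envelope only: block time, transactions per block and retained bytes per
# transaction are rough assumptions, not values taken from the heap dump.
BLOCKS=$(( 292 * 24 * 3600 / 12 ))   # ~2.1 million blocks at ~12 s per block
TXS=$(( BLOCKS * 150 ))              # ~150 transactions per block
echo "$(( TXS * 500 / 1024 / 1024 / 1024 )) GiB retained at ~500 bytes per transaction"

Even with conservative figures this is far beyond any realistic Besu heap, so the walk-back loop above can never complete.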

@ahamlat
Contributor Author

ahamlat commented Jul 6, 2023

To try to understand why Besu trusts this old pivot block and triggers such a huge reorg, @matkt suggested checking the file at [Besu_data_path]/fastsync/pivotBlockHeader.rlp.
Decoding the file shared by @Pablosan, I found the same block Besu is trying to rewind to (block number 15_537_445).
I suggested that @Pablosan delete this file and restart Besu to see if things improve.
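A sketch of that suggestion, reusing the assumed /var/lib/besu data path and besu systemd service from the earlier example; the fastsync/pivotBlockHeader.rlp location is the one mentioned above:

# Assumptions: data path /var/lib/besu, systemd service named "besu".
sudo systemctl stop besu
rm /var/lib/besu/fastsync/pivotBlockHeader.rlp   # drop the stale pivot block header left over from the earlier fast sync
sudo systemctl start besu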

@Pablosan

Pablosan commented Jul 8, 2023

Thanks for all the details @ahamlat. Removing the pivotBlockHeader.rlp file did help but it didn't solve the issue entirely. After removing the file I also had to run:

curl --location --request POST 'http://localhost:8545' \
--header 'Content-Type: application/json' \
--data-raw '{
    "jsonrpc": "2.0",
    "method": "debug_resyncWorldState",
    "params": [],
    "id": 1
}'

This clears the database and starts the syncing process from the beginning, but it did get my node back online eventually. On my home machine it took about 38 hours (using SNAP sync on a Mac Mini M1 with 16 GB RAM, 1 GbE Ethernet to a 1 Gigabit fiber connection, and a Thunderbolt 4 2 TB SSD).
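For anyone repeating these steps, the standard eth_syncing JSON-RPC call against the same endpoint is a convenient way to watch the resync; it returns sync progress while the node is catching up and false once it is back in sync:

curl --location --request POST 'http://localhost:8545' \
--header 'Content-Type: application/json' \
--data-raw '{
    "jsonrpc": "2.0",
    "method": "eth_syncing",
    "params": [],
    "id": 1
}'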

@ahamlat
Contributor Author

ahamlat commented Jul 8, 2023

Thanks for the feedback @Pablosan. Checking the logs you shared with me makes me think you're using fast sync and not snap sync.
BTW, the RPC call only triggers a resync of the world state; the blockchain part is not cleared.
I'm not sure I understand why it didn't fix the issue, given that your node was able to get back online.
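If the node was indeed started with fast sync, a fresh sync can be switched to snap sync via the --sync-mode option. As an illustration only (the data path below is an example, and older Besu releases used X_SNAP while the feature was still experimental):

besu --data-path=/var/lib/besu --sync-mode=SNAP   # keep the rest of your existing options unchanged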

@Pablosan

Pablosan commented Jul 8, 2023

My apologies. It did fix the issue, but it didn't get me back to a fully synced node without the extra step I mentioned above. Regarding the long sync time: I must have something wrong in my command-line args. I'll take a closer look and see if I notice something amiss.

@non-fungible-nelson
Contributor

Reference #5699
