
Node Hangs #282

Open
aszepieniec opened this issue Dec 10, 2024 · 26 comments
Labels
bug: Something isn't working
concurrency: neptune-core async tasks should run concurrently without blocking each other for long

Comments

@aszepieniec
Contributor

I observed that my node is hanging. Not sure why. Last few log messages:

2024-12-05T21:53:23.854059211Z  INFO ThreadId(02) neptune_cash::peer_loop: Block with height 2210 is valid. mined: 2024-12-05T21:53:17.487+00:00
2024-12-05T21:53:23.854096191Z  WARN ThreadId(02) neptune_cash::peer_loop: Received 1 blocks from peer but incoming blocks are less canonical than current tip.
2024-12-05T21:53:23.854124811Z  WARN ThreadId(02) neptune_cash: executed handle_peer_message()::PeerMessage::Block in 0.320761143 secs.  exceeds slow fn threshold of 0.001 secs.  location: /home/runner/work/neptune-core/neptune-core/src/peer_loop.rs:532:17
2024-12-05T21:53:23.854308201Z  INFO ThreadId(02) neptune_cash::peer_loop: Got new block from peer [::ffff:107.181.240.234]:41672, height 2210, mined 2024-12-05T21:53:17.487+00:00
2024-12-05T21:53:23.854733721Z  INFO ThreadId(02) neptune_cash::mine_loop: Miner task received beta block height 2210
2024-12-05T21:53:23.942761497Z  INFO ThreadId(03) neptune_cash::peer_loop: Block with height 2209 is valid. mined: 2024-12-05T21:32:56.300+00:00
2024-12-05T21:53:24.177705063Z  INFO ThreadId(02) neptune_cash::peer_loop: Block with height 2210 is valid. mined: 2024-12-05T21:53:17.487+00:00
2024-12-05T21:53:24.261931122Z  INFO ThreadId(03) neptune_cash::peer_loop: Block with height 2210 is valid. mined: 2024-12-05T21:53:17.487+00:00
2024-12-05T21:53:24.49263834Z  INFO ThreadId(02) neptune_cash::peer_loop: Block with height 2210 is valid. mined: 2024-12-05T21:53:17.487+00:00
2024-12-05T21:53:24.57640586Z  INFO ThreadId(03) neptune_cash::peer_loop: Block with height 2210 is valid. mined: 2024-12-05T21:53:17.487+00:00
2024-12-05T21:53:24.811175735Z  INFO ThreadId(02) neptune_cash::peer_loop: Block with height 2210 is valid. mined: 2024-12-05T21:53:17.487+00:00
2024-12-05T21:53:24.811267855Z  WARN ThreadId(03) neptune_cash::peer_loop: Received 1 blocks from peer but incoming blocks are less canonical than current tip.
2024-12-05T21:53:24.811291875Z  WARN ThreadId(03) neptune_cash: executed handle_peer_message()::PeerMessage::Block in 1.278402247 secs.  exceeds slow fn threshold of 0.001 secs.  location: /home/runner/work/neptune-core/neptune-core/src/peer_loop.rs:532:17
2024-12-05T21:53:24.811448405Z  WARN ThreadId(03) neptune_cash::peer_loop: Received 1 blocks from peer but incoming blocks are less canonical than current tip.
2024-12-05T21:53:24.811475485Z  WARN ThreadId(03) neptune_cash: executed handle_peer_message()::PeerMessage::Block in 1.278658637 secs.  exceeds slow fn threshold of 0.001 secs.  location: /home/runner/work/neptune-core/neptune-core/src/peer_loop.rs:532:17
2024-12-05T21:53:24.811571325Z  WARN ThreadId(03) neptune_cash::peer_loop: Received 10 blocks from peer but incoming blocks are less canonical than current tip.
2024-12-05T21:53:24.811692815Z  WARN ThreadId(03) neptune_cash: executed handle_peer_message()::PeerMessage::Block in 3.857305739 secs.  exceeds slow fn threshold of 0.001 secs.  location: /home/runner/work/neptune-core/neptune-core/src/peer_loop.rs:532:17
2024-12-05T21:53:24.811788645Z  WARN ThreadId(02) neptune_cash::peer_loop: Received 1 blocks from peer but incoming blocks are less canonical than current tip.
2024-12-05T21:53:24.811810895Z  WARN ThreadId(03) neptune_cash::peer_loop: Received 1 blocks from peer but incoming blocks are less canonical than current tip.
2024-12-05T21:53:24.811843046Z  WARN ThreadId(02) neptune_cash: executed handle_peer_message()::PeerMessage::Block in 1.278586039 secs.  exceeds slow fn threshold of 0.001 secs.  location: /home/runner/work/neptune-core/neptune-core/src/peer_loop.rs:532:17
2024-12-05T21:53:24.811851025Z  WARN ThreadId(03) neptune_cash: executed handle_peer_message()::PeerMessage::Block in 0.957542794 secs.  exceeds slow fn threshold of 0.001 secs.  location: /home/runner/work/neptune-core/neptune-core/src/peer_loop.rs:532:17
2024-12-05T21:53:24.812810554Z  WARN ThreadId(01) neptune_cash: executed run()::select::mp_resync_timer in 0.001559499 secs.  exceeds slow fn threshold of 0.001 secs.  location: /home/runner/work/neptune-core/neptune-core/src/main_loop.rs:1468:21
2024-12-05T21:53:24.994879006Z  WARN ThreadId(03) neptune_cash: executed handle_peer_message()::PeerMessage::BlockRequestByHeight in 0.182773032 secs.  exceeds slow fn threshold of 0.001 secs.  location: /home/runner/work/neptune-core/neptune-core/src/peer_loop.rs:850:17
2024-12-05T21:53:25.11751951Z  WARN ThreadId(02) neptune_cash: executed handle_peer_message()::PeerMessage::BlockRequestByHash in 0.003222898 secs.  exceeds slow fn threshold of 0.001 secs.  location: /home/runner/work/neptune-core/neptune-core/src/peer_loop.rs:826:17
2024-12-05T21:54:49.019940768Z  INFO ThreadId(01) neptune_cash::main_loop: Attempting to reconnect to peer with lost connection: 139.162.193.206:9798
2024-12-05T21:54:49.020016008Z  INFO ThreadId(01) neptune_cash::main_loop: Performing peer discovery
2024-12-05T21:54:49.020037568Z  INFO ThreadId(01) neptune_cash::main_loop: Connecting to peer [::ffff:79.157.213.217]:9798 with distance 2
2024-12-05T21:54:49.020395938Z  INFO ThreadId(03) neptune_cash::connect_to_peers: Established outgoing TCP connection with 139.162.193.206:9798
2024-12-05T21:54:49.020673748Z  INFO ThreadId(02) neptune_cash::connect_to_peers: Established incoming TCP connection with [::ffff:139.162.193.206]:48210
2024-12-05T21:54:49.020956617Z  WARN ThreadId(02) neptune_cash::connect_to_peers: Incoming connection refused: SelfConnect
2024-12-05T21:54:49.021014877Z ERROR ThreadId(02) neptune_cash::main_loop: Got error: Refusing incoming connection. Reason: SelfConnect
2024-12-05T21:54:49.021043447Z ERROR ThreadId(03) neptune_cash::connect_to_peers: An error occurred: Outgoing connection attempt refused. Reason: SelfConnect. Connection closing
2024-12-05T21:54:49.021066747Z  INFO ThreadId(03) neptune_cash::connect_to_peers: Connection closing
2024-12-05T21:55:02.601627685Z  WARN ThreadId(02) neptune_cash::connect_to_peers: Failed to establish connection: Connection timed out (os error 110)
2024-12-05T21:55:02.601689695Z  INFO ThreadId(02) neptune_cash::connect_to_peers: Connection closing
2024-12-05T21:55:04.838148482Z  INFO ThreadId(02) neptune_cash::peer_loop: Got block proposal from peer.
2024-12-05T21:55:05.152632797Z  INFO ThreadId(02) neptune_cash::peer_loop: Rewarding peer ::ffff:66.248.205.33 for NewBlockProposal
2024-12-05T21:55:05.152686637Z  WARN ThreadId(02) neptune_cash: executed handle_peer_message()::PeerMessage::BlockProposal in 0.314540425 secs.  exceeds slow fn threshold of 0.001 secs.  location: /home/runner/work/neptune-core/neptune-core/src/peer_loop.rs:1128:17
2024-12-05T21:55:05.152760117Z  INFO ThreadId(01) neptune_cash::main_loop: Received new favorable block proposal for mining operation.
2024-12-05T21:55:05.153318377Z  INFO ThreadId(03) neptune_cash::mine_loop: Miner received message about new block proposal for guessing.
2024-12-05T21:55:05.154226766Z  INFO ThreadId(572) neptune_cash::mine_loop: Mining on block with 1 outputs and difficulty 704690060. Attempting to find block with height 2211 with digest less than target: 05169561069146539959,10291194454790015845,06096537823767786088,16985605236488289992,00000000026177102695
2024-12-05T21:55:05.224629293Z  INFO ThreadId(02) neptune_cash::peer_loop: Got block proposal from peer.
2024-12-05T21:55:05.224682013Z  WARN ThreadId(02) neptune_cash::peer_loop: Rejecting new block proposal:
Insufficient fee. Proposal was 64;
current fee is: Some(NeptuneCoins(256000000000000005090463958695936))
2024-12-05T21:55:05.225672992Z  WARN ThreadId(02) neptune_cash::peer_loop: Punishing peer 51.15.139.238 for NonFavorableBlockProposal
2024-12-05T21:55:05.225688382Z  WARN ThreadId(02) neptune_cash: executed handle_peer_message()::PeerMessage::BlockProposal in 0.001061049 secs.  exceeds slow fn threshold of 0.001 secs.  location: /home/runner/work/neptune-core/neptune-core/src/peer_loop.rs:1128:17
2024-12-05T21:56:31.651265531Z  INFO ThreadId(02) neptune_cash::peer_loop: Got new block from peer 51.15.139.238:9798, height 2211, mined 2024-12-05T21:56:29.441+00:00
2024-12-05T21:56:32.264116975Z  INFO ThreadId(02) neptune_cash::peer_loop: Got new block from peer 151.115.78.81:9798, height 2211, mined 2024-12-05T21:56:29.441+00:00

The dashboard fails to make a connection. Ctrl-C does not work. I ran it with the following command: neptune-core --max-num-peers 25 --bootstrap --guess --sleepy-guessing --peers 51.15.139.238:9798 --peers 139.162.193.206:9798. After a reboot it seems to be syncing okay, but I am noticing this worrying error popping up a bunch of times:

2024-12-10T14:25:48.367618193Z ERROR ThreadId(01) neptune_cash::main_loop: Failed to send pause message to miner thread:                                                                                                             
no available capacity                                                                                                                                                                                                                
@aszepieniec
Contributor Author

aszepieniec commented Dec 11, 2024

It happened again. Same setup. I lost the log output, though. Version 0.0.10, running the published binaries.

@Sword-Smith
Member

I suspect peer_loop is responsible. I have a hunch (from the logs) that it's related to the receipt of two blocks in quick succession from two different peers. I've found that the problem is lessened when I connect to fewer peers.

@aszepieniec
Contributor Author

It happened again. Here are some logs, now with RUST_LOG=debug.

2024-12-11T17:13:41.911369067Z  WARN ThreadId(02) neptune_cash::connect_to_peers: Failed to establish connection: Connection timed out (os error 110)
2024-12-11T17:13:41.911448297Z  INFO ThreadId(02) neptune_cash::connect_to_peers: Connection closing
2024-12-11T17:13:49.542654191Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received peer list req from peer [::ffff:107.181.240.34]:49882
2024-12-11T17:13:49.542721401Z DEBUG ThreadId(02) neptune_cash::peer_loop: Responding with: [(51.15.139.238:9798, 180273059814229484062836173416189655133), ([::ffff:23.239.118.34]:9798, 329399519625511720329395729569714136442), ([::ffff:66.248.205.33]:9798, 107800094576276670568220704743898518523), ([::ffff:107.181.240.34]:9798, 161088251798632610242008530126239684818), ([::ffff:107.181.240.218]:9798, 122409577406447673762203669590964243985), ([::ffff:107.181.240.234]:9798, 188388803456246908798729605308083576652), ([::ffff:107.181.240.242]:9798, 57990064333588526416683932831105694764), ([::ffff:178.197.248.35]:9798, 44082800370450425291539636841743982849), ([2a03:4000:68:82:38a1:3ff:fe6a:5bb1]:9798, 254624299172714942717748692359326123465)]
2024-12-11T17:13:57.07514971Z DEBUG ThreadId(01) neptune_cash::main_loop: Timer: Membership proof resync job
2024-12-11T17:13:57.075187139Z DEBUG ThreadId(01) neptune_cash::models::state: Membership proof syncing not needed
2024-12-11T17:13:57.812022458Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received block proposal notification from peer [::ffff:23.239.118.34]:36570
2024-12-11T17:13:57.812071108Z  INFO ThreadId(02) neptune_cash::peer_loop: Got unfavorable block proposal notification from [::ffff:23.239.118.34]:36570 peer; rejecting. Reason:
Insufficient fee. Proposal was 64;
current fee is: Some(NeptuneCoins(256000000000000005090463958695936))
2024-12-11T17:14:17.574486707Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received peer list req from peer [2a03:4000:68:82:38a1:3ff:fe6a:5bb1]:9798
2024-12-11T17:14:17.574692697Z DEBUG ThreadId(02) neptune_cash::peer_loop: Responding with: [(51.15.139.238:9798, 180273059814229484062836173416189655133), ([::ffff:23.239.118.34]:9798, 329399519625511720329395729569714136442), ([::ffff:66.248.205.33]:9798, 107800094576276670568220704743898518523), ([::ffff:107.181.240.34]:9798, 161088251798632610242008530126239684818), ([::ffff:107.181.240.218]:9798, 122409577406447673762203669590964243985), ([::ffff:107.181.240.234]:9798, 188388803456246908798729605308083576652), ([::ffff:107.181.240.242]:9798, 57990064333588526416683932831105694764), ([::ffff:178.197.248.35]:9798, 44082800370450425291539636841743982849), ([2a03:4000:68:82:38a1:3ff:fe6a:5bb1]:9798, 254624299172714942717748692359326123465)]
2024-12-11T17:14:17.7558627Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received block notification from peer [::ffff:107.181.240.34]:49882
2024-12-11T17:14:17.75590405Z DEBUG ThreadId(02) neptune_cash::peer_loop: Got BlockNotification of height 2981
2024-12-11T17:14:17.75591072Z DEBUG ThreadId(02) neptune_cash::peer_loop: block_is_new: true
2024-12-11T17:14:17.75591616Z DEBUG ThreadId(02) neptune_cash::peer_loop: sending BlockRequestByHeight to peer for block with height 2981
2024-12-11T17:14:18.232971368Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received block notification from peer [::ffff:107.181.240.218]:50096
2024-12-11T17:14:18.233086618Z DEBUG ThreadId(02) neptune_cash::peer_loop: Got BlockNotification of height 2981
2024-12-11T17:14:18.233106438Z DEBUG ThreadId(02) neptune_cash::peer_loop: block_is_new: true
2024-12-11T17:14:18.233122568Z DEBUG ThreadId(02) neptune_cash::peer_loop: sending BlockRequestByHeight to peer for block with height 2981
2024-12-11T17:14:18.237492375Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received block notification from peer [::ffff:107.181.240.234]:59456
2024-12-11T17:14:18.237524385Z DEBUG ThreadId(02) neptune_cash::peer_loop: Got BlockNotification of height 2981
2024-12-11T17:14:18.237530205Z DEBUG ThreadId(02) neptune_cash::peer_loop: block_is_new: true
2024-12-11T17:14:18.237535385Z DEBUG ThreadId(02) neptune_cash::peer_loop: sending BlockRequestByHeight to peer for block with height 2981
2024-12-11T17:14:18.705266768Z DEBUG ThreadId(03) neptune_cash::peer_loop: Received block from peer [::ffff:107.181.240.34]:49882
2024-12-11T17:14:18.705375568Z  INFO ThreadId(03) neptune_cash::peer_loop: Got new block from peer [::ffff:107.181.240.34]:49882, height 2981, mined 2024-12-11T17:14:17.258+00:00
2024-12-11T17:14:18.705395738Z DEBUG ThreadId(03) neptune_cash::peer_loop: Fetching parent block
2024-12-11T17:14:18.707422456Z DEBUG ThreadId(03) neptune_cash::peer_loop: Completed parent block fetching from DB: found
2024-12-11T17:14:18.707870546Z DEBUG ThreadId(03) neptune_cash::peer_loop: attempting to validate 1 block
2024-12-11T17:14:18.72981466Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received block notification from peer [::ffff:107.181.240.242]:41262
2024-12-11T17:14:18.72983464Z DEBUG ThreadId(02) neptune_cash::peer_loop: Got BlockNotification of height 2981
2024-12-11T17:14:18.72984098Z DEBUG ThreadId(02) neptune_cash::peer_loop: block_is_new: true
2024-12-11T17:14:18.72984663Z DEBUG ThreadId(02) neptune_cash::peer_loop: sending BlockRequestByHeight to peer for block with height 2981
2024-12-11T17:14:18.763384137Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received block notification from peer [::ffff:23.239.118.34]:36570
2024-12-11T17:14:18.763476087Z DEBUG ThreadId(02) neptune_cash::peer_loop: Got BlockNotification of height 2981
2024-12-11T17:14:18.763496997Z DEBUG ThreadId(02) neptune_cash::peer_loop: block_is_new: true
2024-12-11T17:14:18.763510867Z DEBUG ThreadId(02) neptune_cash::peer_loop: sending BlockRequestByHeight to peer for block with height 2981
2024-12-11T17:14:19.053956804Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received block from peer [::ffff:107.181.240.218]:50096
2024-12-11T17:14:19.054112414Z  INFO ThreadId(02) neptune_cash::peer_loop: Got new block from peer [::ffff:107.181.240.218]:50096, height 2981, mined 2024-12-11T17:14:17.258+00:00
2024-12-11T17:14:19.054121094Z DEBUG ThreadId(02) neptune_cash::peer_loop: Fetching parent block
2024-12-11T17:14:19.055522393Z DEBUG ThreadId(02) neptune_cash::peer_loop: Completed parent block fetching from DB: found
2024-12-11T17:14:19.056605552Z DEBUG ThreadId(02) neptune_cash::peer_loop: attempting to validate 1 block

@dan-da
Collaborator

dan-da commented Dec 12, 2024

Does it completely stop writing to the log after it hangs, or some log messages still appear?

What is neptune-core CPU usage like when it hangs?

If CPU usage is high, that can suggest an infinite loop somewhere. If it is idle/nil, that can suggest a deadlock.

You can build with --features log-lock_events and run with
export RUST_LOG="debug,neptune_core=trace"

See the script run-multiple-instances-from-genesis-log-locks.sh for an example, or just run it.

That way the log will record every lock try_acquire, acquire, and release event. If there is a deadlock, you should see an acquisition that never releases and a try_acquire that never reaches acquire.

Or if you determine a reliable way to reproduce, let me know and I can try it out.

@aszepieniec
Contributor Author

aszepieniec commented Dec 12, 2024

Might be related: a while after rebooting I observe this message:

2024-12-12T10:42:35.029833128Z ERROR ThreadId(02) neptune_cash::main_loop: Got error: Refusing incoming connection. Reason: MaxPeerNumberExceeded

which one does not expect when running in bootstrapping mode.

EDIT: Disregard this message. I think it is unrelated and deserves a separate issue.

@aszepieniec
Contributor Author

Does it completely stop writing to the log after it hangs, or some log messages still appear?

Complete stop. No more log messages.

What is neptune-core CPU usage like when it hangs?

I did not think to check htop, but I did not notice any reduced responsiveness on that machine in other respects. I'll be sure to check next time. In the meantime, my money is on "idle".

If CPU usage is high that can suggest an infinite loop somewhere. If idle/nil, that can suggest a deadlock.

You can build with --features log-lock_events and run with
export RUST_LOG="debug,neptune_core=trace"

See script run-multiple-instances-from-genesis-log-locks.sh for example, or just run it.

Can I do this with the release executables? The VPS is not powerful enough to compile. That said, I should be able to compile and generate an installer using dist on my laptop and then install that on the VPS. Will that preserve the log feature?

That way the log will display each time a lock is try_acquire, acquire, and release. If a deadlock you should see an acquisition that never releases and a try_acquire that never reaches acquire.

That sounds really useful.

Or if you determine a reliable way to reproduce, let me know and I can try it out.

Right now your best bet is to use the same command I was using and wait ... neptune-core --max-num-peers 25 --bootstrap --guess --sleepy-guessing --peers 51.15.139.238:9798 --peers 139.162.193.206:9798

@aszepieniec
Contributor Author

My node crashed again. This time it was running with --max-peers 5 and all other flags unchanged.

Some more log output:

2024-12-12T11:25:15.188938397Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received block notification from peer [::ffff:23.239.118.34]:50810
2024-12-12T11:25:15.188975067Z DEBUG ThreadId(02) neptune_cash::peer_loop: Got BlockNotification of height 3002
2024-12-12T11:25:15.188981307Z DEBUG ThreadId(02) neptune_cash::peer_loop: block_is_new: true
2024-12-12T11:25:15.188994957Z DEBUG ThreadId(02) neptune_cash::peer_loop: sending BlockRequestByHeight to peer for block with height 3002
2024-12-12T11:25:15.762506458Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received block notification from peer [::ffff:107.181.240.218]:49854
2024-12-12T11:25:15.762539287Z DEBUG ThreadId(02) neptune_cash::peer_loop: Got BlockNotification of height 3002
2024-12-12T11:25:15.762544907Z DEBUG ThreadId(02) neptune_cash::peer_loop: block_is_new: true
2024-12-12T11:25:15.762549707Z DEBUG ThreadId(02) neptune_cash::peer_loop: sending BlockRequestByHeight to peer for block with height 3002
2024-12-12T11:25:15.813250547Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received block notification from peer [::ffff:107.181.240.234]:34388
2024-12-12T11:25:15.813275147Z DEBUG ThreadId(02) neptune_cash::peer_loop: Got BlockNotification of height 3002
2024-12-12T11:25:15.813281137Z DEBUG ThreadId(02) neptune_cash::peer_loop: block_is_new: true
2024-12-12T11:25:15.813289477Z DEBUG ThreadId(02) neptune_cash::peer_loop: sending BlockRequestByHeight to peer for block with height 3002
2024-12-12T11:25:16.014798865Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received block from peer [::ffff:23.239.118.34]:50810
2024-12-12T11:25:16.014856285Z  INFO ThreadId(02) neptune_cash::peer_loop: Got new block from peer [::ffff:23.239.118.34]:50810, height 3002, mined 2024-12-12T11:25:14.652+00:00
2024-12-12T11:25:16.014871425Z DEBUG ThreadId(02) neptune_cash::peer_loop: Fetching parent block
2024-12-12T11:25:16.015585894Z DEBUG ThreadId(03) neptune_cash::peer_loop: Completed parent block fetching from DB: found
2024-12-12T11:25:16.016052494Z DEBUG ThreadId(03) neptune_cash::peer_loop: attempting to validate 1 block
2024-12-12T11:25:16.453797928Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received block notification from peer [2606:c700:4020:cb:b62e:99ff:fedb:5e9f]:9798
2024-12-12T11:25:16.453836948Z DEBUG ThreadId(02) neptune_cash::peer_loop: Got BlockNotification of height 3002
2024-12-12T11:25:16.453843898Z DEBUG ThreadId(02) neptune_cash::peer_loop: block_is_new: true
2024-12-12T11:25:16.453852578Z DEBUG ThreadId(02) neptune_cash::peer_loop: sending BlockRequestByHeight to peer for block with height 3002
2024-12-12T11:25:16.592005331Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received block from peer [::ffff:107.181.240.218]:49854
2024-12-12T11:25:16.592042651Z  INFO ThreadId(02) neptune_cash::peer_loop: Got new block from peer [::ffff:107.181.240.218]:49854, height 3002, mined 2024-12-12T11:25:14.652+00:00
2024-12-12T11:25:16.592054171Z DEBUG ThreadId(02) neptune_cash::peer_loop: Fetching parent block
2024-12-12T11:25:16.5927871Z DEBUG ThreadId(02) neptune_cash::peer_loop: Completed parent block fetching from DB: found
2024-12-12T11:25:16.59388097Z DEBUG ThreadId(02) neptune_cash::peer_loop: attempting to validate 1 block

@Sword-Smith added the bug (Something isn't working) and high priority labels on Dec 13, 2024
@aszepieniec
Contributor Author

What is neptune-core CPU usage like when it hangs?

I did not think to check htop, but I did not notice any reduced responsiveness on that machine in other respects. I'll be sure to check next time. In the meantime, my money is on "idle".

Yep. 2 CPUs and both are idle.

About logging traces: I tried yesterday evening but couldn't produce trace output. In the meantime, Thorkil identified a potential cause: "neptune_core=trace" should be "neptune_cash=trace". Just recording this discrepancy here so that we don't lose track of to-dos.

aszepieniec added a commit that referenced this issue Dec 13, 2024
Hunting #282.

Co-authored-by: Thorkil Schmidiger <thor@neptune.cash>
aszepieniec added a commit that referenced this issue Dec 13, 2024
Hunting #282.

Co-authored-by: Thorkil Schmidiger <thor@neptune.cash>
@aszepieniec
Contributor Author

It happened again, now with lock-event logging enabled and with #d7854b2 / #6c66b18.

2024-12-13T22:09:24.525621281Z DEBUG ThreadId(03) neptune_cash::peer_loop: Received block from peer [::ffff:107.181.240.234]:59664
2024-12-13T22:09:24.52657253Z  INFO ThreadId(03) neptune_cash::peer_loop: Got new block from peer [::ffff:107.181.240.234]:59664, height 3226, mined 2024-12-13T22:08:22.915+00:00
2024-12-13T22:09:24.526843821Z DEBUG ThreadId(03) neptune_cash::peer_loop: Fetching parent block
2024-12-13T22:09:24.527214721Z TRACE ThreadId(03) neptune_cash: TryAcquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 3, (`tokio-runtime-worker`)
        |-- tokio task 125
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:338:14
        |-- lock_event=TryAcquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 338, col: 14 }) }
2024-12-13T22:09:24.52769211Z TRACE ThreadId(03) neptune_cash: Acquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 3, (`tokio-runtime-worker`)
        |-- tokio task 125
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:338:14
        |-- waited for acquire: 0.000054269 secs
        |-- lock_event=Acquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 729843, tv_nsec: 842501192 }), acquire_at: Some(Instant { tv_sec: 729843, tv_nsec: 842555461 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 338, col: 14 }) }
2024-12-13T22:09:24.572160531Z TRACE ThreadId(03) neptune_cash: TryAcquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 3, (`tokio-runtime-worker`)
        |-- tokio task 140
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:1327:98
        |-- lock_event=TryAcquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 1327, col: 98 }) }
2024-12-13T22:09:24.572294181Z TRACE ThreadId(03) neptune_cash: Acquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 3, (`tokio-runtime-worker`)
        |-- tokio task 140
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:1327:98
        |-- waited for acquire: 0.00001755 secs
        |-- lock_event=Acquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 729843, tv_nsec: 887158532 }), acquire_at: Some(Instant { tv_sec: 729843, tv_nsec: 887176082 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 1327, col: 98 }) }
2024-12-13T22:09:24.572641311Z TRACE ThreadId(03) neptune_cash: Release tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 3, (`tokio-runtime-worker`)
        |-- tokio task 140
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:1327:98
        |-- waited for acquire: 0.00001755 secs
        |-- held: 0.00034805 secs
        |-- lock_event=Release { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 729843, tv_nsec: 887158532 }), acquire_at: Some(Instant { tv_sec: 729843, tv_nsec: 887176082 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 1327, col: 98 }) }
2024-12-13T22:09:24.572829111Z DEBUG ThreadId(03) neptune_cash::peer_loop: Received block from peer [::ffff:107.181.240.218]:54726
2024-12-13T22:09:24.573017831Z  INFO ThreadId(03) neptune_cash::peer_loop: Got new block from peer [::ffff:107.181.240.218]:54726, height 3226, mined 2024-12-13T22:08:22.915+00:00
2024-12-13T22:09:24.573114521Z DEBUG ThreadId(03) neptune_cash::peer_loop: Fetching parent block
2024-12-13T22:09:24.573346181Z TRACE ThreadId(03) neptune_cash: TryAcquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 3, (`tokio-runtime-worker`)
        |-- tokio task 140
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:338:14
        |-- lock_event=TryAcquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 338, col: 14 }) }
2024-12-13T22:09:24.573541151Z TRACE ThreadId(03) neptune_cash: Acquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 3, (`tokio-runtime-worker`)
        |-- tokio task 140
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:338:14
        |-- waited for acquire: 0.00001838 secs
        |-- lock_event=Acquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 729843, tv_nsec: 888400382 }), acquire_at: Some(Instant { tv_sec: 729843, tv_nsec: 888418762 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 338, col: 14 }) }
2024-12-13T22:09:24.646802579Z TRACE ThreadId(03) neptune_cash: TryAcquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 3, (`tokio-runtime-worker`)
        |-- tokio task 142
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:1327:98
        |-- lock_event=TryAcquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 1327, col: 98 }) }
2024-12-13T22:09:24.646931899Z TRACE ThreadId(03) neptune_cash: Acquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 3, (`tokio-runtime-worker`)
        |-- tokio task 142
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:1327:98
        |-- waited for acquire: 0.00002074 secs
        |-- lock_event=Acquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 729843, tv_nsec: 961792170 }), acquire_at: Some(Instant { tv_sec: 729843, tv_nsec: 961812910 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 1327, col: 98 }) }
2024-12-13T22:09:24.646985169Z TRACE ThreadId(03) neptune_cash: Release tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 3, (`tokio-runtime-worker`)
        |-- tokio task 142
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:1327:98
        |-- waited for acquire: 0.00002074 secs
        |-- held: 0.00005978 secs
        |-- lock_event=Release { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 729843, tv_nsec: 961792170 }), acquire_at: Some(Instant { tv_sec: 729843, tv_nsec: 961812910 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 1327, col: 98 }) }
2024-12-13T22:09:24.647037109Z DEBUG ThreadId(03) neptune_cash::peer_loop: Received block from peer [::ffff:107.181.240.242]:58338
2024-12-13T22:09:24.647091879Z  INFO ThreadId(03) neptune_cash::peer_loop: Got new block from peer [::ffff:107.181.240.242]:58338, height 3226, mined 2024-12-13T22:08:22.915+00:00
2024-12-13T22:09:24.647120069Z DEBUG ThreadId(03) neptune_cash::peer_loop: Fetching parent block
2024-12-13T22:09:24.647146919Z TRACE ThreadId(03) neptune_cash: TryAcquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 3, (`tokio-runtime-worker`)
        |-- tokio task 142
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:338:14
        |-- lock_event=TryAcquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 338, col: 14 }) }
2024-12-13T22:09:24.647194209Z TRACE ThreadId(03) neptune_cash: Acquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 3, (`tokio-runtime-worker`)
        |-- tokio task 142
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:338:14
        |-- waited for acquire: 0.00000883 secs
        |-- lock_event=Acquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 729843, tv_nsec: 962069940 }), acquire_at: Some(Instant { tv_sec: 729843, tv_nsec: 962078770 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 338, col: 14 }) }
2024-12-13T22:09:24.647827819Z TRACE ThreadId(03) neptune_cash: Release tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 3, (`tokio-runtime-worker`)
        |-- tokio task 140
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:338:14
        |-- waited for acquire: 0.00001838 secs
        |-- held: 0.074290946 secs
        |-- lock_event=Release { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 729843, tv_nsec: 888400382 }), acquire_at: Some(Instant { tv_sec: 729843, tv_nsec: 888418762 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 338, col: 14 }) }
2024-12-13T22:09:24.648211059Z DEBUG ThreadId(03) neptune_cash::peer_loop: Completed parent block fetching from DB: found
2024-12-13T22:09:24.651307017Z DEBUG ThreadId(03) neptune_cash::peer_loop: attempting to validate 1 block
2024-12-13T22:09:24.651460377Z DEBUG ThreadId(03) neptune_cash::peer_loop: validating with respect to current timestamp 1734127764651
2024-12-13T22:09:24.651649567Z DEBUG ThreadId(03) neptune_cash::peer_loop: new block has proof of work? true
^C

@dan-da
Collaborator

dan-da commented Dec 14, 2024

This does not appear to be a deadlock involving the locks I instrumented, which are the GlobalStateLock and DbtSchema::pending_writes lock.

This is both good and bad news. Bad news because a deadlock would be a smoking gun easily seen in the logs and usually easy to resolve once correctly identified.

Analysis of the above log:

  1. Everything in the log occurs within about 125 ms: 2024-12-13T22:09:24.525 to .651.
  2. The GlobalState lock is never acquired for Write.
  3. Multiple read locks are successfully acquired, indicating that there is no pre-existing write-lock being held.
  4. I count 5 TryAcquire, 5 Acquire, and 3 Release events, indicating that two read-locks are still being held.
  5. We can track events by thread and tokio task. All events occur in thread 3, on 3 separate tasks: 125, 140, 142.
125: TryAcquire
125: Acquire
140: TryAcquire
140: Acquire
140: Release
140: TryAcquire
140: Acquire
142: TryAcquire
142: Acquire
142: Release
142: TryAcquire
142: Acquire
140: Release

125: never released.  acquired at: src/peer_loop.rs:338:14
142: never released.  acquired at: src/peer_loop.rs:338:14

Further thoughts:

  1. It would appear that some code is stuck waiting on something else, perhaps network activity or blocking on a full channel.
  2. The report that no further log messages are emitted is interesting to me. Typically the main task wakes up for timer events periodically and logs. If that's not happening, it seems to indicate that the main task is itself stuck in such a way that it is not calling select!().

Yeah, my hunch right now (without even looking at any code yet) is that we may have a full channel somewhere and tasks are blocking during channel send().
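
For illustration, here is a minimal sketch of that failure mode (assuming a bounded tokio mpsc channel; the actual neptune-core channel types may differ). On a full channel, send().await parks the sending task until a receiver drains the channel, while try_send() fails immediately with tokio's "no available capacity" error, the same wording as the miner-pause error reported earlier in this thread:

    use tokio::sync::mpsc;

    #[tokio::main]
    async fn main() {
        // Bounded channel with a single slot.
        let (tx, _rx) = mpsc::channel::<u32>(1);

        tx.send(1).await.unwrap(); // fills the only slot

        // A second send().await would now park this task until some receiver
        // drains the channel; if nothing ever receives, the task hangs here:
        // tx.send(2).await.unwrap();

        // try_send() does not block; on a full channel it returns an error.
        assert!(tx.try_send(2).is_err());
    }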

@dan-da
Collaborator

dan-da commented Dec 14, 2024

OK, so peer_loop:338 is this:

        let parent_block = self
            .global_state_lock
            .lock_guard()
            .await
            .chain
            .archival_state()
            .get_block(parent_digest)
            .await?;

This read-lock is acquired and released in the same statement. The fact that it is not being released indicates that the statement never completes. archival_state() is just a getter, so it appears that get_block().await is not completing.
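
As a toy sketch of the same shape (not the actual neptune-core types): the guard produced by the read().await is a temporary of the whole statement, so it is only dropped after the inner await resolves; if that await never resolves, the lock is never released.

    use std::sync::Arc;
    use tokio::sync::RwLock;

    // Stand-in for get_block(parent_digest).await.
    async fn fetch(height: u64) -> u64 {
        height
    }

    #[tokio::main]
    async fn main() {
        let state = Arc::new(RwLock::new(42u64));

        // The read guard is a temporary that lives until the end of this
        // statement, i.e. until fetch(...).await has completed. If fetch()
        // never completed, the Release event would never be logged.
        let parent = fetch(*state.read().await).await;
        println!("{parent}");
    }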

@dan-da
Collaborator

dan-da commented Dec 14, 2024

A next step could be to add logging inside ArchivalState::get_block() and possibly ::get_block_from_block_record().

ArchivalState::get_block() reads from both the DB and the filesystem, and the DB likely reads from the filesystem itself. It's possible the machine is having some kind of disk error.

I'm wondering if this has been reproduced on any other machine?

Anyway, logging inside get_block() could show whether it's the DB read or the call to get_block_from_block_record() that does not complete.

It's also possible I'm on the wrong path here, but just following what the logs seem to be saying...

@aszepieniec
Contributor Author

Thanks, @dan-da. This analysis is very helpful.

I sprinkled a bunch of debug! statements throughout get_block and get_block_from_block_record in branch asz/282_hunt-deadlock. That's the version that my node is running now. I will report the logs when it crashes next.

I observed that there was a suspicious .unwrap() inside get_block_from_block_record, which might explain the silent crash. If this .unwrap() is the issue, we should now get more informative error messages.
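
The instrumentation follows a simple pattern; roughly (a hypothetical sketch, the real get_block body and types differ), every await point is bracketed by debug! calls, so the last "(deadlock-hunt)" message printed before a hang identifies the call that never completed:

    use std::future::Future;
    use tracing::debug;

    // Hypothetical sketch; db_lookup and read_from_file stand in for the
    // block-index DB read and get_block_from_block_record respectively.
    async fn get_block_sketch(
        db_lookup: impl Future<Output = Option<u64>>,
        read_from_file: impl Future<Output = Vec<u8>>,
    ) -> Option<Vec<u8>> {
        debug!("(deadlock-hunt) entering get_block");
        let maybe_record = db_lookup.await;
        debug!("(deadlock-hunt) got maybe block record (wrapped in option)");
        let _record = maybe_record?;
        debug!("(deadlock-hunt) got definite block record (unwrapped option)");
        let block = read_from_file.await;
        debug!("(deadlock-hunt) got block; returning and exiting get_block");
        Some(block)
    }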

I'm wondering if this has been reproduced on any other machine?

It seems to be happening reliably on my VPS, but I cannot reproduce it on my laptop or on my desktop. However, issues #286 and #285 might be linked to or outright duplicates of this issue.

@aszepieniec
Contributor Author

So the messages with "(deadlock-hunt)" are new in the latest commit.

2024-12-16T13:10:01.446274713Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received block notification from peer [::ffff:23.239.118.34]:50510
2024-12-16T13:10:01.446345103Z DEBUG ThreadId(02) neptune_cash::peer_loop: Got BlockNotification of height 3553
2024-12-16T13:10:01.446388433Z TRACE ThreadId(02) neptune_cash: TryAcquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 2, (`tokio-runtime-worker`)
        |-- tokio task 40
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:786:26
        |-- lock_event=TryAcquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 786, col: 26 }) }
2024-12-16T13:10:01.446545462Z TRACE ThreadId(02) neptune_cash: Acquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 2, (`tokio-runtime-worker`)
        |-- tokio task 40
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:786:26
        |-- waited for acquire: 0.00001189 secs
        |-- lock_event=Acquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 956680, tv_nsec: 761412723 }), acquire_at: Some(Instant { tv_sec: 956680, tv_nsec: 761424613 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 786, col: 26 }) }
2024-12-16T13:10:01.446631522Z TRACE ThreadId(02) neptune_cash: Release tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 2, (`tokio-runtime-worker`)
        |-- tokio task 40
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:786:26
        |-- waited for acquire: 0.00001189 secs
        |-- held: 0.00009277 secs
        |-- lock_event=Release { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 956680, tv_nsec: 761412723 }), acquire_at: Some(Instant { tv_sec: 956680, tv_nsec: 761424613 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 786, col: 26 }) }
2024-12-16T13:10:01.446833992Z DEBUG ThreadId(02) neptune_cash::peer_loop: block_is_new: true
2024-12-16T13:10:01.446918802Z TRACE ThreadId(02) neptune_cash: TryAcquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 2, (`tokio-runtime-worker`)
        |-- tokio task 40
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:800:60
        |-- lock_event=TryAcquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 800, col: 60 }) }
2024-12-16T13:10:01.447049322Z TRACE ThreadId(02) neptune_cash: Acquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 2, (`tokio-runtime-worker`)
        |-- tokio task 40
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:800:60
        |-- waited for acquire: 0.00001076 secs
        |-- lock_event=Acquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 956680, tv_nsec: 761917603 }), acquire_at: Some(Instant { tv_sec: 956680, tv_nsec: 761928363 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 800, col: 60 }) }
2024-12-16T13:10:01.447119182Z TRACE ThreadId(02) neptune_cash: Release tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 2, (`tokio-runtime-worker`)
        |-- tokio task 40
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:800:60
        |-- waited for acquire: 0.00001076 secs
        |-- held: 0.00007728 secs
        |-- lock_event=Release { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 956680, tv_nsec: 761917603 }), acquire_at: Some(Instant { tv_sec: 956680, tv_nsec: 761928363 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 800, col: 60 }) }
2024-12-16T13:10:01.447296831Z DEBUG ThreadId(02) neptune_cash::peer_loop: sending BlockRequestByHeight to peer for block with height 3553
2024-12-16T13:10:01.447464781Z  WARN ThreadId(02) neptune_cash: executed handle_peer_message()::PeerMessage::BlockNotification in 0.001112078 secs.  exceeds slow fn threshold of 0.001 secs.  location: /home/alan/repos/neptune-core/src/peer_loop.rs:775:17
2024-12-16T13:10:01.517870835Z TRACE ThreadId(02) neptune_cash: TryAcquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 2, (`tokio-runtime-worker`)
        |-- tokio task 47
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:1329:98
        |-- lock_event=TryAcquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 1329, col: 98 }) }
2024-12-16T13:10:01.518263365Z TRACE ThreadId(02) neptune_cash: Acquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 2, (`tokio-runtime-worker`)
        |-- tokio task 47
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:1329:98
        |-- waited for acquire: 0.00004002 secs
        |-- lock_event=Acquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 956680, tv_nsec: 833095866 }), acquire_at: Some(Instant { tv_sec: 956680, tv_nsec: 833135886 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 1329, col: 98 }) }
2024-12-16T13:10:01.518362235Z TRACE ThreadId(02) neptune_cash: Release tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 2, (`tokio-runtime-worker`)
        |-- tokio task 47
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:1329:98
        |-- waited for acquire: 0.00004002 secs
        |-- held: 0.00011089 secs
        |-- lock_event=Release { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 956680, tv_nsec: 833095866 }), acquire_at: Some(Instant { tv_sec: 956680, tv_nsec: 833135886 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 1329, col: 98 }) }
2024-12-16T13:10:01.518606565Z DEBUG ThreadId(02) neptune_cash::peer_loop: Received block from peer [::ffff:107.181.240.234]:47176
2024-12-16T13:10:01.518702634Z  INFO ThreadId(02) neptune_cash::peer_loop: Got new block from peer [::ffff:107.181.240.234]:47176, height 3553, mined 2024-12-16T13:09:59.584+00:00
2024-12-16T13:10:01.518814524Z DEBUG ThreadId(02) neptune_cash::peer_loop: Fetching parent block
2024-12-16T13:10:01.518901794Z TRACE ThreadId(02) neptune_cash: TryAcquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 2, (`tokio-runtime-worker`)
        |-- tokio task 47
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:335:64
        |-- lock_event=TryAcquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 335, col: 64 }) }
2024-12-16T13:10:01.519039334Z TRACE ThreadId(02) neptune_cash: Acquire tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 2, (`tokio-runtime-worker`)
        |-- tokio task 47
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:335:64
        |-- waited for acquire: 0.00001258 secs
        |-- lock_event=Acquire { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 956680, tv_nsec: 833905735 }), acquire_at: Some(Instant { tv_sec: 956680, tv_nsec: 833918315 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 335, col: 64 }) }
2024-12-16T13:10:01.519105834Z DEBUG ThreadId(02) neptune_cash::peer_loop: (deadlock-hunt) Acquired global state lock.
2024-12-16T13:10:01.519225374Z DEBUG ThreadId(02) neptune_cash::models::state::archival_state: (deadlock-hunt) entering ArchivalState::get_block
2024-12-16T13:10:01.520130772Z DEBUG ThreadId(02) neptune_cash::models::state::archival_state: (deadlock-hunt) got maybe block record (wrapped in option)
2024-12-16T13:10:01.520229092Z DEBUG ThreadId(02) neptune_cash::models::state::archival_state: (deadlock-hunt) got definite block record (unwrapped option)
2024-12-16T13:10:01.520303603Z DEBUG ThreadId(02) neptune_cash::models::state::archival_state: (deadlock-hunt) entering ArchivalState::get_block_from_record
2024-12-16T13:10:01.520385603Z DEBUG ThreadId(02) neptune_cash::models::state::archival_state: (deadlock-hunt): got block file path
2024-12-16T13:10:01.520570622Z DEBUG ThreadId(02) neptune_cash::models::state::archival_state: (deadlock-hunt): opening block file was successful
2024-12-16T13:10:01.520653972Z DEBUG ThreadId(02) neptune_cash::models::state::archival_state: (deadlock-hunt): spawning task to read the right data ...
2024-12-16T13:10:01.55001059Z DEBUG ThreadId(02) neptune_cash::models::state::archival_state: (deadlock-hunt): got block result
2024-12-16T13:10:01.55027795Z DEBUG ThreadId(02) neptune_cash::models::state::archival_state: (deadlock-hunt) got block; returning and 
exiting ArchivalState::get_block
2024-12-16T13:10:01.5503613Z DEBUG ThreadId(02) neptune_cash::peer_loop: (deadlock-hunt) Got parent block.
2024-12-16T13:10:01.55047994Z TRACE ThreadId(02) neptune_cash: Release tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 2, (`tokio-runtime-worker`)
        |-- tokio task 47
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:335:64
        |-- waited for acquire: 0.00001258 secs
        |-- held: 0.031422585 secs
        |-- lock_event=Release { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 956680, tv_nsec: 833905735 }), acquire_at: Some(Instant { tv_sec: 956680, tv_nsec: 833918315 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 335, col: 64 }) }
2024-12-16T13:10:01.550693419Z DEBUG ThreadId(02) neptune_cash::peer_loop: (deadlock-hunt) Dropped global state lock.
2024-12-16T13:10:01.550773139Z DEBUG ThreadId(02) neptune_cash::peer_loop: Completed parent block fetching from DB: found
2024-12-16T13:10:01.554851075Z DEBUG ThreadId(02) neptune_cash::peer_loop: attempting to validate 1 block
2024-12-16T13:10:01.555064394Z DEBUG ThreadId(02) neptune_cash::peer_loop: validating with respect to current timestamp 1734354601555
2024-12-16T13:10:01.555174184Z DEBUG ThreadId(02) neptune_cash::peer_loop: new block has proof of work? true

@dan-da
Collaborator

dan-da commented Dec 16, 2024

The latest logs put suspicion on Block::is_valid(). It calls Block::is_valid_internal(), which is quite a complicated method. I would suggest instrumenting it with logging as well, and/or reviewing it carefully.

@dan-da
Collaborator

dan-da commented Dec 16, 2024

Another thing I sometimes find useful is to run neptune-core inside the gdb debugger.

gdb target/debug/neptune-core
...
> run <neptune-core-args>

Then, when the hang occurs, hit Ctrl-C in gdb to pause execution. At that point, one can inspect the running threads and the stack of each. Relevant commands: info threads, thread <id>, where.

Resume execution with c. Ctrl-C to stop again, etc.

Using Ctrl-C to stop (instead of a breakpoint) is kind of a shotgun approach. But if one or more threads are truly stuck waiting on something then it should show up in the stack trace(s).

@aszepieniec
Contributor Author

Well that's interesting.

2024-12-16T15:10:02.366922486Z DEBUG ThreadId(03) neptune_cash::peer_loop: (deadlock-hunt) Got parent block.
2024-12-16T15:10:02.366999346Z TRACE ThreadId(03) neptune_cash: Release tokio lock `GlobalState` of type `RwLock` for `Read` by
        |-- thread 3, (`tokio-runtime-worker`)
        |-- tokio task 161
        |-- acquirer: /home/alan/repos/neptune-core/src/peer_loop.rs:335:64
        |-- waited for acquire: 0.00000987 secs
        |-- held: 0.13680294 secs
        |-- lock_event=Release { info: LockInfo { name: Some("GlobalState"), lock_type: RwLock }, acquisition: Read, try_acquire_at: Some(Instant { tv_sec: 963881, tv_nsec: 545068718 }), acquire_at: Some(Instant { tv_sec: 963881, tv_nsec: 545078588 }), location: Some(Location { file: "/home/alan/repos/neptune-core/src/peer_loop.rs", line: 335, col: 64 }) }
2024-12-16T15:10:02.367139216Z  WARN ThreadId(03) neptune_cash: read-lock held for 0.13693257 seconds. (exceeds max: 0.1 secs)  location: /home/alan/repos/neptune-core/src/peer_loop.rs:335:64
2024-12-16T15:10:02.367227027Z DEBUG ThreadId(03) neptune_cash::peer_loop: (deadlock-hunt) Dropped global state lock.
2024-12-16T15:10:02.367296917Z DEBUG ThreadId(03) neptune_cash::peer_loop: Completed parent block fetching from DB: found
2024-12-16T15:10:02.371385773Z DEBUG ThreadId(03) neptune_cash::peer_loop: attempting to validate 1 block
2024-12-16T15:10:02.371460863Z DEBUG ThreadId(03) neptune_cash::peer_loop: validating with respect to current timestamp 1734361802371
2024-12-16T15:10:02.371555254Z DEBUG ThreadId(03) neptune_cash::peer_loop: new block has proof of work? true
2024-12-16T15:10:02.371616254Z DEBUG ThreadId(03) neptune_cash::models::blockchain::block: (deadlock-hunt) testing block height
2024-12-16T15:10:02.371639094Z DEBUG ThreadId(03) neptune_cash::models::blockchain::block: (deadlock-hunt) testing block predecessor hash
2024-12-16T15:10:02.373238212Z DEBUG ThreadId(03) neptune_cash::models::blockchain::block: (deadlock-hunt) testing block MMRA
2024-12-16T15:10:02.373265252Z DEBUG ThreadId(03) neptune_cash::models::blockchain::block: (deadlock-hunt) testing block time against minimum
2024-12-16T15:10:02.373280602Z DEBUG ThreadId(03) neptune_cash::models::blockchain::block: (deadlock-hunt) testing block difficuly update
2024-12-16T15:10:02.373299572Z DEBUG ThreadId(03) neptune_cash::models::blockchain::block: (deadlock-hunt) testing block cumulative pow
2024-12-16T15:10:02.373319992Z DEBUG ThreadId(03) neptune_cash::models::blockchain::block: (deadlock-hunt) testing block future-dating
2024-12-16T15:10:02.373333312Z DEBUG ThreadId(03) neptune_cash::models::blockchain::block: (deadlock-hunt) testing block appendices
2024-12-16T15:10:04.118320595Z DEBUG ThreadId(02) neptune_cash::models::blockchain::block: (deadlock-hunt) testing block proof
2024-12-16T15:10:04.91875862Z DEBUG ThreadId(03) neptune_cash::models::blockchain::block: (deadlock-hunt) testing block proof

@dan-da
Collaborator

dan-da commented Dec 16, 2024

Probably the call to BlockProgram::verify(), which I see calls triton_vm::verify()...

@aszepieniec
Contributor Author

Let me note something here for the record, just in case it's relevant later.

I'm changing this function in BlockProgram:

    pub(crate) fn verify(block_body: &BlockBody, appendix: &BlockAppendix, proof: &Proof) -> bool {
        let claim = Self::claim(block_body, appendix);
        triton_vm::verify(Stark::default(), &claim, proof)
    }

into something resembling this:

    pub(crate) async fn verify(
        block_body: &BlockBody,
        appendix: &BlockAppendix,
        proof: &Proof,
    ) -> bool {
        let claim = Self::claim(block_body, appendix);
        task::spawn_blocking(|| triton_vm::verify(Stark::default(), &claim, proof))
            .await
            .expect("should be able to verify block proof in new tokio task")
    }

Note the async and the spawn_blocking.

The point is that the second one doesn't compile. The compiler generates this error:

error[E0373]: closure may outlive the current function, but it borrows `claim`, which is owned by the current function
  --> src/models/blockchain/block/validity/block_program.rs:47:30
   |
47 |         task::spawn_blocking(|| triton_vm::verify(Stark::default(), &claim, proof))
   |                              ^^ may outlive borrowed value `claim`   ----- `claim` is borrowed here
   |
note: function requires argument type to outlive `'static`
  --> src/models/blockchain/block/validity/block_program.rs:47:9
   |
47 |         task::spawn_blocking(|| triton_vm::verify(Stark::default(), &claim, proof))
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
help: to force the closure to take ownership of `claim` (and any other referenced variables), use the `move` keyword
   |
47 |         task::spawn_blocking(move || triton_vm::verify(Stark::default(), &claim, proof))
   |                              ++++

error[E0521]: borrowed data escapes outside of associated function
  --> src/models/blockchain/block/validity/block_program.rs:47:9
   |
44 |         proof: &Proof,
   |         -----  - let's call the lifetime of this reference `'1`
   |         |
   |         `proof` is a reference that is only valid in the associated function body
...
47 |         task::spawn_blocking(|| triton_vm::verify(Stark::default(), &claim, proof))
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |         |
   |         `proof` escapes the associated function body here
   |         argument requires that `'1` must outlive `'static`

There are two lifetime issues, one for claim and one for proof. The claim is owned by the current function, so it can be moved without any problems. The proof, on the other hand, is only passed in by reference. I fixed it by cloning it (and binding it to a variable owned by the function), but it makes me wonder whether this lifetime issue is at play in generating the crash. We'll have more information if the modified code produces the same error.
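
Concretely, the compiling variant looks roughly like this (the exact code in the branch may differ): claim is moved into the closure, and the proof is cloned into an owned binding, so that spawn_blocking's 'static bound is satisfied.

    pub(crate) async fn verify(
        block_body: &BlockBody,
        appendix: &BlockAppendix,
        proof: &Proof,
    ) -> bool {
        let claim = Self::claim(block_body, appendix);
        // Owned copy of the proof so the closure can be 'static.
        let proof = proof.clone();
        task::spawn_blocking(move || triton_vm::verify(Stark::default(), &claim, &proof))
            .await
            .expect("should be able to verify block proof in new tokio task")
    }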

@dan-da
Collaborator

dan-da commented Dec 17, 2024

Some further thoughts/observations, in case they're helpful:

  1. When I did a previous pass adding spawn_blocking(), that was primarily around prove()-type operations. I don't think I did any for verify(), but it sounds as though probably all of them should be in a spawn_blocking, to be concurrency friendly. I notice there is another triton_vm::verify() call in an assert inside Block::make_block_template_with_valid_proof(), and some more sprinkled around the codebase. Some, e.g. in ProofCollection, may only be called inside the context of a proving job outside neptune-core, in which case they are already fine.

  2. I might be missing something but I don't see that the referenced params to BlockProgram::verify() should cause any problem for the pre-existing code. The compiler is satisfied with those lifetimes. When switching to spawn_blocking() it complains because the data will be used in a separate thread, so it must be owned and implement Send. That's normal.

  3. Why does the entire program appear to hang? My present understanding (best guess) of what is happening is that (a) the main task and peer task(s) are executing on the same thread (i.e. thread_id 3), so they are cooperatively multitasking with each other; (b) a peer task calls Block::is_valid(), which eventually calls this triton_vm::verify(), and somewhere inside it stalls waiting for something to happen; (c) the peer-loop task has stalled the OS thread and is no longer cooperating, so the main task on the same OS thread no longer executes. This explains why no further timer events are logged, and why it does not process the CTRL-C signal. (A toy illustration follows this list.)

  4. If (3) is correct, then the spawn_blocking() should enable the main task (and other peer tasks) to continue executing, because the triton_vm::verify() call will be executing on a separate OS thread in tokio's blocking threadpool. I do not expect this to fix the root problem, because it's possible/likely that the underlying cause still occurs. What I expect to see is that the ::verify() task hangs and the waiting peer_loop task awaits it indefinitely, but the main() task is able to continue processing events. So it might actually sort of hide the problem until we perhaps notice that peer_loop tasks are stacking up. Or maybe it will be more obvious because blocks stop being added to the chain; I'm not sure. Anyway, it's something we should add some code to detect/log.

  5. If it's correct that the triton_vm::verify() is hanging, then obviously the cause needs to be determined. I will leave that analysis to the triton_vm experts.

  6. We haven't established for certain that it is the call to triton_vm::verify() that is hanging. I recommend adding log statements around it to be 100% certain, and also log the exact parameters, because the params can be used as input to hopefully recreate the issue repeatably in a test-case or small example.
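To make point 3 above concrete, here is a toy, self-contained sketch (not neptune-core code; the durations and the single-threaded runtime are chosen only to make the effect obvious) showing how one blocking call starves every other task on the same worker thread, and how spawn_blocking avoids that:

```rust
use std::time::Duration;
use tokio::task;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    // Stand-in for the main loop: it should tick every 100 ms.
    let ticker = tokio::spawn(async {
        loop {
            println!("tick");
            tokio::time::sleep(Duration::from_millis(100)).await;
        }
    });

    // Variant A: blocking the only worker thread directly stops the ticker
    // (and would also stop CTRL-C handling) until the sleep returns.
    // std::thread::sleep(Duration::from_secs(5));

    // Variant B: the blocking work runs on tokio's blocking threadpool,
    // so the ticker keeps printing while we wait.
    task::spawn_blocking(|| std::thread::sleep(Duration::from_secs(5)))
        .await
        .unwrap();

    ticker.abort();
}
```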

@aszepieniec
Copy link
Contributor Author

With triton_vm::verify wrapped in a spawn_blocking call, the node has been running for 1.5 days now (and counting), whereas previously it would hang after only a couple of hours. I think it is safe to say that this change either fixes the problem outright or else transforms it into an annoyance rather than a critical bug. I'll prepare a PR soon for this change.

So it might actually sort of hide the problem until we perhaps notice that peer_loop tasks are stacking up. Or maybe it will be more obvious because block's stop being added to the chain, I'm not sure. Anyway it's something we should add some code to detect/log.

At the time of writing, my node is synced to block height 3672, which is the current block height reported by neptunefundamentals but is not the block height reported by explorer.neptune.cash. I don't know what to make of this discrepancy.

How can we determine whether the peer_loop tasks are stacking up? If we could determine programmatically how many tasks are alive, this would be useful information for a periodic log message.

I recommend adding log statements around it to be 100% certain, and also log the exact parameters, because the params can be used as input to hopefully recreate the issue repeatably in a test-case or small example.

I will add the log statements. By parameters, do you mean the function arguments? If not, please elaborate. If so, it's tricky, because the proof is around half a megabyte in size, which you obviously can't print in the log. You could save it to a file, but then the extra async overhead might cause the problem to go away. Probably the cleanest solution is to log the block height and then later fish the block proof out of the database using an RPC endpoint designed for just that purpose. Let's verify with 100% certainty that triton_vm::verify is hanging before we go down this route.

@aszepieniec
Copy link
Contributor Author

the node has been running for 1.5 days now

3.5 days now. Reinforces the conclusion. Dropping "high priority" tag.

@aszepieniec aszepieniec added concurrency neptune-core async tasks should run concurrently without blocking eachother for long and removed high priority labels Dec 20, 2024
@dan-da
Copy link
Collaborator

dan-da commented Dec 20, 2024

At the time of writing, my node is synced to block height 3672, which is the current block height reported by neptunefundamentals but is not the block height reported by explorer.neptune.cash. I don't know what to make of this discrepancy.

I was doing some updates to explorer.neptune.cash a couple days ago, so it might've been behind for a time.

At this moment, both sites are at block height 3631.

It's interesting that this is 41 blocks behind the 3672 you reported two days ago. I'm not sure what to make of that... a large re-org, perhaps?

How can we determine whether the peer_loop tasks are stacking up? If we could determine programmatically how many tasks are alive, this would be useful information for a periodic log message.

I haven't given the "how" much thought. Simplest might be to log when a peer task starts and ends, then parse the log to find the difference between the start count and the complete count (hoping no panic occurs in a peer task).

We could implement some kind of programmatic counter, or keep a Vec of task handles.

tokio may provide API(s) for counting tasks.
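One possible shape for such a counter (hypothetical names, not existing neptune-core code) is an atomic counter with an RAII guard, so the count stays correct even if a peer task returns early or unwinds:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

#[derive(Clone)]
struct TaskCounter(Arc<AtomicUsize>);

struct TaskGuard(Arc<AtomicUsize>);

impl TaskCounter {
    fn new() -> Self {
        Self(Arc::new(AtomicUsize::new(0)))
    }

    // Call at the top of each peer task; hold the guard for the task's lifetime.
    fn enter(&self) -> TaskGuard {
        self.0.fetch_add(1, Ordering::Relaxed);
        TaskGuard(Arc::clone(&self.0))
    }

    fn alive(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

impl Drop for TaskGuard {
    // Runs when the task finishes, returns early, or unwinds after a panic.
    fn drop(&mut self) {
        self.0.fetch_sub(1, Ordering::Relaxed);
    }
}
```

The main loop's existing periodic timer could then log the value of alive(), which would make stacked-up peer_loop tasks visible in the log without any log parsing.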

I recommend adding log statements around it to be 100% certain, and also log the exact parameters, because the params can be used as input to hopefully recreate the issue repeatably in a test-case or small example.

I will add the log statements. By parameters, do you mean the function arguments? If no, please elaborate. If yes, it's tricky because the proof is around half a megabyte in size, which you obviously can't print in the log. You could save it to a file but then the extra async overhead might cause the problem to go away. Probably the cleanest solution is to log the block height and then later fish the block proof from the database using an RPC endpoint designed for just that purpose. Let's verify with 100% certainty that triton_vm::verify is hanging before we go down this route.

Yes, I meant the function arguments.

I'd suggest temporarily removing the spawn_blocking and logging before/after the triton_vm::verify call, to be certain that's the cause.

With regard to logging the function arguments, that could be done after we are certain.
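Something along these lines would do it (a sketch only; the wrapper name, the block_height parameter, and how it is threaded through are assumptions made for illustration):

```rust
use std::time::Instant;
use tracing::info;
use triton_vm::prelude::{Claim, Proof, Stark};

// Log entry/exit and elapsed time around the suspect call, plus a small
// identifier (here a hypothetical block height) for fishing out the proof later.
fn verify_with_logging(claim: &Claim, proof: &Proof, block_height: u64) -> bool {
    info!(block_height, "entering triton_vm::verify");
    let start = Instant::now();
    let ok = triton_vm::verify(Stark::default(), claim, proof);
    info!(
        block_height,
        result = ok,
        elapsed_ms = start.elapsed().as_millis() as u64,
        "returned from triton_vm::verify"
    );
    ok
}
```

If the last line in the log is "entering triton_vm::verify" with no matching "returned from", that pins the hang on the verifier rather than on the surrounding async code.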

@aszepieniec
Copy link
Contributor Author

I'm running the node now

  • without the spawn_blocking wrap around the call to triton_vm::verify
  • with the async changes that were necessary to make that work (or specifically: without undoing those changes)
  • with log statements before and after the suspicious call.

Will report when I have data.

@dan-da
Copy link
Collaborator

dan-da commented Dec 20, 2024

  • with the async changes that were necessary to make that work (or specifically: without undoing those changes)

It would be interesting if that alone made the problem go away, though I don't see why it would.

@aszepieniec
Copy link
Contributor Author

  • with the async changes that were necessary to make that work (or specifically: without undoing those changes)

It would be interesting if that alone made the problem go away, though I don't see why it would.

Well, the node is still running after 24 hours ... Previously it would crash after only two or three hours.
