Node Hangs #282
It happened again. Same setup. I lost the log output though. Version 0.0.10 and published binaries.
I suspect peer_loop to be responsible. I have a hunch (from the logs) that it's related to the receipt of two blocks in quick succession, from two different peers. I've found the problem to be lessened when I connect to fewer peers.
It happened again. Here are some logs, now with RUST_LOG=debug.
Does it completely stop writing to the log after it hangs, or do some log messages still appear? What is neptune-core's CPU usage like when it hangs? If CPU usage is high, that can suggest an infinite loop somewhere. If it is idle, that can suggest a deadlock. You can build with --features log-lock_events; see the script run-multiple-instances-from-genesis-log-locks.sh for an example of how to run it, or just run that script. That way the log will show each time a lock is try_acquire'd, acquired, and released. If it is a deadlock, you should see an acquisition that never releases and a try_acquire that never reaches acquire. Or if you determine a reliable way to reproduce, let me know and I can try it out.
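For readers unfamiliar with this style of lock instrumentation, a minimal sketch of the idea (not the actual neptune-core implementation of log-lock_events) is a wrapper around a tokio RwLock that emits a log event for every acquisition attempt and acquisition; a release event would similarly come from a custom guard's Drop impl:

```rust
use tokio::sync::{RwLock, RwLockReadGuard};

// Hypothetical wrapper for illustration only.
struct LoggedRwLock<T> {
    name: &'static str,
    inner: RwLock<T>,
}

impl<T> LoggedRwLock<T> {
    async fn read(&self) -> RwLockReadGuard<'_, T> {
        tracing::debug!(lock = self.name, "try_acquire (read)");
        let guard = self.inner.read().await;
        tracing::debug!(lock = self.name, "acquire (read)");
        guard
        // A custom guard type whose Drop impl logs "release (read)"
        // would complete the picture.
    }
}
```

In a deadlock, the resulting log shows a try_acquire that is never followed by its acquire, and an earlier acquire that is never followed by its release.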
Might be related: a while after rebooting I observe this message:
which one does not expect when running in bootstrapping mode. EDIT: Disregard this message. I think it is unrelated and deserves a separate issue.
Complete stop. No more log messages.
I did not think to check CPU usage.
Can I do this with the release executables? The VPS is not powerful enough to compile. That said, I should be able to compile and generate an installer using
That sounds really useful.
Right now your best bet is to use the same command I was using and wait ...
My node crashed again. This time it was running with ... Some more log output:
Yep. 2 CPUs and both are idle. About logging traces: I tried yesterday evening but couldn't produce trace output. In the meantime Thorkil identified a potential cause: "neptune_core=trace" should be "neptune_cash=trace". Just recording this discrepancy here so that we don't lose track of todos.
Hunting #282. Co-authored-by: Thorkil Schmidiger <thor@neptune.cash>
Hunting #282. Co-authored-by: Thorkil Schmidiger <thor@neptune.cash>
Again, now with logging lock events and with #d7854b2 / #6c66b18.
This does not appear to be a deadlock involving the locks I instrumented, which are the GlobalStateLock and the DbtSchema::pending_writes lock. This is both good and bad news. Bad news because a deadlock would be a smoking gun, easily seen in the logs and usually easy to resolve once correctly identified. Analysis of the above log:
Further thoughts:
Yeah, my hunch right now (without even looking at any code yet) is that we may have a full channel somewhere and tasks are blocking during channel send().
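To illustrate the failure mode being hypothesized, here is a standalone sketch (not neptune-core code): with a bounded tokio mpsc channel, send().await parks the sending task indefinitely once the buffer is full and the receiver stops draining it, and CPU stays idle while it waits, which matches the observed symptoms.

```rust
use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    // Small bounded channel, standing in for an internal task-to-task channel.
    let (tx, mut rx) = mpsc::channel::<u32>(2);

    // A "peer loop" stand-in that keeps sending messages.
    let producer = tokio::spawn(async move {
        for i in 0u32.. {
            // Once two messages sit un-received in the buffer, this await
            // never completes: the task hangs with zero CPU usage.
            tx.send(i).await.unwrap();
            println!("sent {i}");
        }
    });

    // A consumer that takes one message and then stops draining,
    // e.g. because it is itself stuck on some other await.
    rx.recv().await;
    sleep(Duration::from_secs(3600)).await;

    drop(rx);
    let _ = producer.await;
}
```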
ok, so peer_loop:338 is this:
This read-lock is acquired and released in the same statement. The fact that it is not being released indicates that the statement never completes. archival_state() is just a getter, so it appears that get_block().await is not completing.
A next step could be to add logging inside ArchivalState::get_block() and possibly ::get_block_from_block_record(). ArchivalState::get_block() reads from both the DB and the filesystem, and the DB likely reads from the filesystem too. It's possible the machine might be having some kind of disk error; I'm wondering if this has been reproduced on any other machine? Anyway, logging inside get_block() could show whether it's the DB read or the call to get_block_from_block_record() that does not complete. It's also possible I'm on the wrong path here, but I'm just following what the logs seem to be saying...
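As a rough illustration of the kind of instrumentation being suggested, the sketch below logs before and after every await so the last message in the log identifies the step that never completed. The function bodies and names here are stand-ins, not the real ArchivalState code:

```rust
use std::time::Duration;
use tokio::time::sleep;

// Stand-ins for the DB lookup and the block-file read.
async fn db_lookup(digest: u64) -> Option<u64> {
    sleep(Duration::from_millis(10)).await;
    Some(digest)
}

async fn read_block_from_file(record: u64) -> String {
    sleep(Duration::from_millis(10)).await;
    format!("block {record}")
}

// The suggested instrumentation: a log line before and after each await.
async fn get_block_instrumented(digest: u64) -> Option<String> {
    tracing::debug!("get_block: starting DB lookup");
    let record = db_lookup(digest).await?;
    tracing::debug!("get_block: DB lookup done, reading block from disk");
    let block = read_block_from_file(record).await;
    tracing::debug!("get_block: file read done");
    Some(block)
}
```

If the hang is in the DB read, the last line logged is "starting DB lookup"; if it is in the file read, it is "DB lookup done, reading block from disk".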
Thanks, @dan-da. This analysis is very helpful. I sprinkled a bunch of log statements around the code in question, and I observed that there was a suspicious ...
It seems to happen reliably on my VPS, but I cannot reproduce it on my laptop or on my desktop. However, issues #286 and #285 might be linked to this issue, or might be outright duplicates of it.
So the messages with "(deadlock-hunt)" are new in the latest commit.
The latest logs put suspicion on Block::is_valid(). It calls Block::is_valid_internal(), which is quite a complicated method. I would suggest instrumenting it with logging as well, and/or reviewing it carefully.
Another thing I sometimes find useful is to run neptune-core inside the gdb debugger.
Then when the hang occurs, hit Ctrl-C in gdb to pause execution. At that point, one can inspect the running threads and the stack of each; the relevant commands include info threads and thread apply all bt. Resume execution with continue. Using Ctrl-C to stop (instead of a breakpoint) is kind of a shotgun approach, but if one or more threads are truly stuck waiting on something then it should show up in the stack trace(s).
Well that's interesting.
Probably the call to BlockProgram::verify(), which I see calls triton_vm::verify()...
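One standard precaution when an expensive, synchronous, CPU-bound call like a proof verification sits inside an async task is to move it onto tokio's blocking thread pool so it cannot tie up an async worker thread. The sketch below shows the general pattern with placeholder names; it is not the actual neptune-core code and does not assert that this is the root cause here.

```rust
use tokio::task;

// Placeholder for an expensive synchronous verification, standing in
// for something like a call into triton_vm::verify().
fn expensive_verify(proof_bytes: Vec<u8>) -> bool {
    proof_bytes
        .iter()
        .fold(0u64, |acc, b| acc.wrapping_add(*b as u64))
        % 2
        == 0
}

async fn validate_block(proof_bytes: Vec<u8>) -> bool {
    // Running the verification on the blocking pool keeps the async
    // worker threads free to drive peer loops, RPC handlers, etc.
    task::spawn_blocking(move || expensive_verify(proof_bytes))
        .await
        .expect("verification task panicked")
}
```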
Let me note something here for the record, just in case it's relevant later. I'm changing this function:
into something resembling this:
Note the difference between the two versions. The point is that the second one doesn't compile. The compiler generates this error:
There are two lifetime issues.
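For context on why a change like this can run into lifetime errors (a generic illustration, not the actual function in question): tokio::task::spawn_blocking requires its closure to be 'static, so the closure cannot capture borrowed arguments of the surrounding async fn. The usual workaround is to take ownership of the needed data (clone or copy it) and move it into the closure.

```rust
use tokio::task;

fn verify_sync(data: &[u8]) -> bool {
    !data.is_empty()
}

// Does NOT compile: the closure borrows `data`, but spawn_blocking
// requires an owned, 'static closure, so the borrow may not live
// long enough.
//
// async fn verify_async_broken(data: &[u8]) -> bool {
//     task::spawn_blocking(|| verify_sync(data)).await.unwrap()
// }

// Compiles: copy the data into an owned Vec and move it in.
async fn verify_async(data: &[u8]) -> bool {
    let owned: Vec<u8> = data.to_vec();
    task::spawn_blocking(move || verify_sync(&owned))
        .await
        .unwrap()
}
```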
Some further thoughts/observations, in case they are helpful:
With the change of wrapping the verification call in spawn_blocking in place, here is where things stand.
At the time of writing, my node is synced to block height 3672, which is the current block height reported by neptunefundamentals but is not the block height reported by explorer.neptune.cash. I don't know what to make of this discrepancy. How can we determine whether the peer_loop tasks are stacking up? If we could determine programmatically how many tasks are alive, this would be useful information for a periodic log message.
I will add the log statements. By parameters, do you mean the function arguments? If not, please elaborate. If so, it's tricky because the proof is around half a megabyte in size, which you obviously can't print in the log. You could save it to a file, but then the extra async overhead might cause the problem to go away. Probably the cleanest solution is to log the block height and then later fish the block proof out of the database using an RPC endpoint designed for just that purpose. Let's verify with 100% certainty that the call to triton_vm::verify() is really what hangs.
3.5 days now. That reinforces the conclusion. Dropping the "high priority" tag.
I was doing some updates to explorer.neptune.cash a couple of days ago, so it might've been behind for a time. At this moment, both sites are at block height 3631. It's interesting that this is 41 blocks behind the 3672 you reported 2 days ago. I'm not sure what to make of that... a large re-org, perhaps?
I haven't given the "how" much thought. Simplest might be to log when a peer task starts and ends, then parse the log to find the difference between the start count and the completion count (hopefully no panic occurs in a peer task). We could implement some kind of programmatic counter, or a Vec of task handles; tokio may provide API(s) for counting tasks.
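A simple way to get such a count, sketched here under the assumption that an atomic counter plus a drop guard is acceptable (this is not an existing neptune-core facility):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Guard that decrements the counter when the task ends, including
/// when the task body returns early or unwinds from a panic.
struct PeerTaskGuard(Arc<AtomicUsize>);

impl Drop for PeerTaskGuard {
    fn drop(&mut self) {
        self.0.fetch_sub(1, Ordering::Relaxed);
    }
}

fn spawn_peer_loop(alive: Arc<AtomicUsize>) -> tokio::task::JoinHandle<()> {
    alive.fetch_add(1, Ordering::Relaxed);
    tokio::spawn(async move {
        let _guard = PeerTaskGuard(alive);
        // ... the actual peer loop would run here ...
    })
}

// A periodic task elsewhere can then log the current count, e.g.:
// tracing::info!(alive_peer_tasks = alive.load(Ordering::Relaxed));
```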
Yes, I meant the function arguments. I'd suggest temporarily removing the spawn-blocking and logging before/after the triton_vm::verify call, to be certain that's the cause. With regards to logging the function arguments, that could be done after we are certain.
I'm running the node now
Will report when I have data.
It would be interesting if that alone made the problem go away, though I don't see why it would.
Well, the node is still running after 24 hours ... Previously it would crash after only two or three hours.
I observed that my node is hanging. Not sure why. Last few log messages:
The dashboard fails to make a connection. Ctrl-C does not work. I ran it with the following command:
neptune-core --max-num-peers 25 --bootstrap --guess --sleepy-guessing --peers 51.15.139.238:9798 --peers 139.162.193.206:9798
After reboot it seems to be syncing okay, but I am noticing this worrying error popping up a bunch of times: