BABE: vrf verification fails #12752
This is going into Substrate. We have also seen this in Cumulus, and I have reports of it failing on normal Substrate nodes as well. Something is going on here that isn't good.
In Cumulus this can be reproduced by running the following command:
(As the `while` already indicates, it takes a few runs for it to happen.)
VRF verification sometimes fails with this substrate
Command to run (it should be run from the substrate dir; zombienet 1.3.18+ is needed):
Interestingly, changing
BTW, as far as I have seen, there was no recent update of any direct dependencies related to schnorrkel. (I looked into Cargo.lock for this.)
Using the reproduction invocation posted by @bkchr I've whipped up a crappy script to launch a bunch of invocations in parallel and check whether any of them fail. It looks like the chance of reproducing this is roughly ~10%, and every time it takes roughly ~90 seconds for the problem to appear (based on 160 total test runs: 5 serial rounds of 32 parallel runs). Here's the script in case anyone wants to try it themselves:

#!/usr/bin/ruby
$running = true
$children = {}
$counter = 0

trap("SIGINT") do
  $running = false
end

def logs_for nth
  "/tmp/r_%02i.txt" % [ nth ]
end

def launch
  nth = $counter
  $counter += 1
  pid = fork do
    logs = logs_for nth
    exec "RUSTFLAGS='-Cdebug-assertions=y' cargo test --release -p cumulus-test-service -- --ignored test_migrate_solo_to_para &> #{logs}"
    # exec "/local/cpp/cumulus/target/release/deps/migrate_solo_to_para-6b0b8c9fbc7bb58c --ignored &> #{logs}"
  end
  puts "#{Time.now}: Launched child \##{nth}: #{pid}"
  $children[nth] = pid
end

COUNT = 32
# COUNT = 1
COUNT.times { |nth| launch }

while $running && !$children.empty?
  sleep 0.2
  $children.each do |nth, pid|
    begin
      Process.kill 0, pid
    rescue Errno::ESRCH
      puts "#{Time.now}: Child finished: \##{nth} (PID #{pid})"
      $children.delete nth
      if $?.exitstatus != 0
        puts "#{Time.now}: REPRODUCED! (exit status, Job \##{nth}, PID #{pid})"
        $running = false
      end
      next
    end
    path = logs_for nth
    data = File.read( path )
    if data.include? "assertion failed: pubkey.vrf_verify"
      puts "#{Time.now}: REPRODUCED! (assert failed, Job \##{nth}, PID #{pid})"
      $running = false
      next
    elsif data.include? "test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out"
      puts "#{Time.now}: Finished OK (Job \##{nth}, PID #{pid})"
      $children.delete nth
    end
  end
end

$children.each do |_, pid|
  Process.kill "INT", pid
end
end

Copy to the root of
It's 10% of runs that fail after 90 sec? The other runs do not fail? Is 90 sec 15 blocks, or much, much more? 15 is considerably larger than the gap between primary and secondary slots, unless the primary slots were slowed down to make it more like Aura? Trying a
It seems to fail right after the 10th block is imported. Yes, the other runs do not fail. I printed out all of the arguments to the
So it looks like it's basically failing at the first block of a new epoch:
(Same slot as was passed to the failing
Looks like there's something wrong on the block producer here; this isn't an issue where an imported block fails verification, it fails on the block producer itself. (Although when compiled without debug assertions it'll most likely fail when that block is imported on another node, which is most likely what @michalkucharczyk saw in his reproduction.)
We already figured out the problem, but for posterity's sake let me summarize the issue and paste most of the relevant info I wrote in the Element chat, just so that it doesn't get lost forever.

This fails on the block producer right at the epoch boundary. When claiming a slot on the client (primary or secondary, it doesn't matter) to build a block, BABE picks the wrong epoch, which means the randomness is wrong, which means the VRF is wrong, which results in the debug assertion failing inside of FRAME, where the correct epoch index was picked (hence the randomness there is correct). This only fails in FRAME when debug assertions are enabled; without them, AFAIK, the bad block would just get built and then fail later when another node tries to import it and verify the VRF on import.

The reason why it picks the wrong epoch is a bug in the pruning of the fork tree. (See the PR which fixes the issue for exactly where and why.) It looks like when it fails, the failing node picks the epoch descriptor for the previous epoch:
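To illustrate the mechanism (wrong epoch, therefore wrong randomness, therefore a failing VRF check), here is a minimal sketch. It assumes schnorrkel's VRF API (`vrf_sign`/`vrf_verify`, as in the 0.9-era releases) and uses a hypothetical `make_transcript` helper that only mimics the idea of BABE's transcript (slot, epoch index, epoch randomness); it is not Substrate's actual transcript construction:

```rust
use merlin::Transcript;
use schnorrkel::Keypair;

// Sketch only: a hypothetical transcript builder standing in for BABE's,
// committing to the slot, the epoch index and the epoch randomness.
fn make_transcript(randomness: &[u8; 32], slot: u64, epoch_index: u64) -> Transcript {
    let mut t = Transcript::new(b"babe-vrf-sketch");
    t.append_u64(b"slot number", slot);
    t.append_u64(b"epoch index", epoch_index);
    t.append_message(b"chain randomness", randomness);
    t
}

fn main() {
    let keypair = Keypair::generate();
    let slot = 1_000u64;

    // The block producer picked the wrong (previous) epoch, so it signs the
    // slot claim against the old randomness and epoch index.
    let old_randomness = [1u8; 32];
    let (inout, proof, _) = keypair.vrf_sign(make_transcript(&old_randomness, slot, 9));
    let vrf_output = inout.to_output();

    // The verifier (the runtime, or another node on import) rebuilds the
    // transcript from the epoch the slot actually belongs to, so the proof
    // no longer checks out -- the situation the debug assertion catches.
    let new_randomness = [2u8; 32];
    let verified = keypair
        .public
        .vrf_verify(make_transcript(&new_randomness, slot, 10), &vrf_output, &proof)
        .is_ok();
    assert!(!verified, "a wrong-epoch slot claim must not verify");
}
```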
Here's a different reproduction where, within the same run, Alice produced a block and Bob failed.

Fork tree of Alice:
Fork tree of Bob:
Notice the hash of the node in

Here's Alice searching for the epoch (hashes simplified for readability):
And here's the same for Bob:
So:
On Alice this returns block 6 (0x9862), which is equal to the
On Bob this returns block 5 (0xe0cc), which is not equal to the
So it looks like 0x85e1 was built by Alice while 0x9862 was built by Bob, and they both imported both of these blocks, but 0x9862 got finalized. Here's what the full tree of blocks looks like, with block numbers and block hashes:
And finally, here's the sequence of calls made on the fork tree which results in the wrong node being pruned (different reproduction, so the hashes are different):
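For intuition, the epoch lookup can be pictured with a much simplified sketch. This is not the actual `fork_tree` crate API or the real epoch-changes structure; it is just a hypothetical ancestor walk showing how, once pruning has kept the wrong node, the lookup for a block on the other fork silently falls back to the previous epoch's descriptor:

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct EpochDescriptor {
    epoch_index: u64,
    randomness: [u8; 32],
}

struct Chain {
    // child hash -> parent hash
    parent: HashMap<&'static str, &'static str>,
    // blocks at which a new epoch's descriptor was recorded
    epoch_changes: HashMap<&'static str, EpochDescriptor>,
}

impl Chain {
    /// Walk back from `block` (inclusive) and return the most recent
    /// recorded epoch descriptor on that chain of ancestors.
    fn epoch_for(&self, mut block: &'static str) -> Option<&EpochDescriptor> {
        loop {
            if let Some(desc) = self.epoch_changes.get(block) {
                return Some(desc);
            }
            block = self.parent.get(block).copied()?;
        }
    }
}

fn main() {
    // Two competing forks both crossed the epoch boundary:
    //   genesis -> a1 -> a2 (epoch 1 recorded) -> a3
    //                \-> b2 (epoch 1 entry lost to pruning) -> b3
    let chain = Chain {
        parent: HashMap::from([
            ("a1", "genesis"),
            ("a2", "a1"),
            ("a3", "a2"),
            ("b2", "a1"),
            ("b3", "b2"),
        ]),
        epoch_changes: HashMap::from([
            ("genesis", EpochDescriptor { epoch_index: 0, randomness: [0; 32] }),
            ("a2", EpochDescriptor { epoch_index: 1, randomness: [1; 32] }),
            // "b2"'s entry is gone because the wrong node was pruned.
        ]),
    };

    // Authoring on top of a3 correctly finds epoch 1 ...
    assert_eq!(chain.epoch_for("a3").unwrap().epoch_index, 1);
    // ... while authoring on top of b3 silently falls back to epoch 0, i.e.
    // to the previous epoch's randomness: the wrong-epoch slot claim.
    assert_eq!(chain.epoch_for("b3").unwrap().epoch_index, 0);
}
```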
I just got the same error on polkadot
@alexgparity have you collected more logs? Can you also link the file, please?
loop_out.txt.zip
@alexgparity I started investigating your issue a bit. First thing: cleaning up the log in order to focus on the interesting parts. From what I can see, this is what happens (the XXX entries are my annotations):
Besides this, which should be handled correctly anyway, I see that (just before the error triggers) there is a jump in time from 22:xx to 08 AM.
Can you explain why?
@davxy The jump in time is because I put my computer to sleep for the night and then started using it again in the morning.
But I also got the
That error is different. As I said in the chat, it happened because the CI machine was too slow for some reason and the job triggered the 2h timeout, so it was killed. It's unrelated to the VRF verification failure.
Could that "too slow" not be because the test also looped on that panic, like it did for me locally, and that's why it didn't terminate in time and was killed? We don't have detailed logs for that job, do we?
@alexgparity Well, it looks like the issue has something to do with BABE skipping 1+ epochs.
#13135 should solve the problem, if this is the only one :-) ... BABE epoch skipping is quite a new feature ...
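For context on what "epoch skipping" means here, a hedged sketch of the underlying idea (illustrative constants and helper, not the pallet's actual code): rather than assuming epochs always advance one at a time, the epoch index is re-derived from the slot itself, so a chain that produced no blocks for several epochs can jump straight to the correct epoch.

```rust
// Illustrative only: derive the epoch index directly from the slot instead of
// incrementing it once per observed epoch change.
fn epoch_index(current_slot: u64, genesis_slot: u64, epoch_duration: u64) -> u64 {
    (current_slot - genesis_slot) / epoch_duration
}

fn main() {
    let genesis_slot = 1_000;
    let epoch_duration = 10; // slots per epoch (made up for the example)

    // The last authored block was in epoch 2 ...
    assert_eq!(epoch_index(1_025, genesis_slot, epoch_duration), 2);
    // ... and after a long gap with no blocks, the next slot lands in
    // epoch 7, not epoch 3.
    assert_eq!(epoch_index(1_078, genesis_slot, epoch_duration), 7);
}
```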
Is there a way to recover from this issue on the collator nodes? We just ran into it on our testnet. Normally we'd just reset the testnet, but we are currently running a competition on it and a reset would not be ideal right now.
Parachains don't support BABE. If you mean on the relay chain, it should work if you have the changes included in your node.
I was referring to the embedded relay chain node in the parachain node.
We upgraded the parachain nodes to 0.9.37 (only the nodes, not the runtime, because there are no blocks). But that did not help them produce new blocks. Next, we set

To give a bit more context, this is what we see in the logs:
Logs
2023-02-18 12:39:59.767 DEBUG tokio-runtime-worker sync: [Relaychain] 12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi disconnected
2023-02-18 12:39:59.766 WARN tokio-runtime-worker sync: [Relaychain] 💔 Verification failed for block 0xdef612b6bd9d3b6a2ff93a1e9310f385b7a8b715f3f368135731772a3c6f0792 received from peer: 12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi, "VRF verification failed: EquationFalse"
2023-02-18 12:39:59.748 DEBUG tokio-runtime-worker sync: [Relaychain] Connected 12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi
2023-02-18 12:39:59.745 DEBUG tokio-runtime-worker libp2p_ping: [Relaychain] Ping received from PeerId("12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi")
2023-02-18 12:39:59.744 DEBUG tokio-runtime-worker libp2p_ping: [Relaychain] Ping sent to PeerId("12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi")
2023-02-18 12:39:59.743 DEBUG tokio-runtime-worker sub-libp2p: [Relaychain] Libp2p => Connected(PeerId("12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi"))
2023-02-18 12:39:59.743 DEBUG tokio-runtime-worker libp2p_swarm: [Relaychain] Connection established: PeerId("12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi") Listener { local_addr: "/ip4/10.64.1.123/tcp/30334/ws", send_back_addr: "/ip4/10.64.1.125/tcp/56238/ws" }; Total (peer): 1. Total non-banned (peer): 1
Did you try to resync? It would also be best to use the 0.9.38 branch or even latest master. We have made some fixes since then to make the skipping of missing epochs smoother and less error prone.
Awesome, that did the trick!
What exactly did the trick? :D
Sorry, I was too excited that it was working again. What worked:
The `collating_using_undying_collator` test started to fail spuriously on master. Example job: https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/2068264
It looks like it happens more often than before after paritytech/polkadot#6127 (comment).
This could be a legitimate issue and needs further investigation.