
BABE: vrf verification fails #12752

Closed
ordian opened this issue Nov 21, 2022 · 26 comments · Fixed by #12758
@ordian (Member) commented Nov 21, 2022

The collating_using_undying_collator test started to fail spuriously on master:

2022-11-21 13:27:12 [//Bob] Evicting failed runtime instance error=Runtime panicked: assertion failed: pubkey.vrf_verify(transcript.clone(), vrf_output, vrf_proof).is_ok()
2022-11-21 13:27:12 [//Bob] Proposing failed: Import failed: Error at calling runtime api: Execution failed: Runtime panicked: assertion failed: pubkey.vrf_verify(transcript.clone(), vrf_output, vrf_proof).is_ok() 
thread 'tokio-runtime-worker' panicked at 'assertion failed: pubkey.vrf_verify(transcript.clone(), vrf_output, vrf_proof).is_ok()', /usr/local/cargo/git/checkouts/substrate-7e08433d4c370a21/a73a35e/frame/babe/src/lib.rs:378:29

Example job: https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/2068264

It looks like this happens more often than before since paritytech/polkadot#6127 (comment).

This could be a legitimate issue and needs further investigation.

@bkchr (Member) commented Nov 21, 2022

I'm moving this into Substrate. We have also seen this in Cumulus, and I have reports of it failing on normal Substrate nodes as well. Something is going on here that isn't good.

@bkchr bkchr transferred this issue from paritytech/polkadot Nov 21, 2022
@bkchr bkchr changed the title spurious integration test failure: pubkey.vrf_verify(transcript.clone(), vrf_output, vrf_proof).is_ok() BABE: vrf verification fails Nov 21, 2022
@bkchr (Member) commented Nov 21, 2022

In Cumulus this can be reproduced by running the following command:

while RUSTFLAGS="-Cdebug-assertions=y" cargo test --release -p cumulus-test-service -- --ignored test_migrate_solo_to_para; do true; done

(as the while loop already indicates, it can take several runs for the failure to appear)

@michalkucharczyk (Contributor) commented Nov 21, 2022

VRF verification sometimes fails with the Substrate warp-sync zombienet network from #12751 (at 4850ad2).

2022-11-21 14:24:33 💔 Verification failed for block 0x4fcee0198698a2ef2c28719ee35b3262704e1ac8bef3aedc85ffaba6055c94ea received from peer: 12D3KooWQCkBm1BYtkHpocxCwMgR8yjitEeHGx8spzcDLGt2gkBm, "VRF verification failed: EquationFalse"   

Command to run (should be run from the substrate dir; zombienet 1.3.18+ is needed):

cd substrate/
zombienet-linux test --provider podman zombienet/0001-basic-warp-sync/0001-test-warp-sync.zndsl

Interestingly, changing charlie's config from db_snapshot to --sync full "hides" the problem.

@bkchr (Member) commented Nov 21, 2022

BTW, as far as I can see, there was no recent update of any direct dependencies related to schnorrkel. (I looked into Cargo.lock for this.)

@koute (Contributor) commented Nov 21, 2022

Using the reproduction command posted by @bkchr, I've whipped up a quick-and-dirty script to launch a bunch of invocations in parallel and check whether any of them fail.

It looks like the chance for this to reproduce is roughly 10%, and each time it takes roughly 90 seconds for the problem to appear (based on 160 total test runs: 5 serial rounds of 32 parallel runs each).

Here's the script in case anyone wants to try it themselves:

#!/usr/bin/ruby

$running = true
$children = {}
$counter = 0

trap("SIGINT") do
    $running = false
end

def logs_for nth
    "/tmp/r_%02i.txt" % [ nth ]
end

def launch
    nth = $counter
    $counter += 1

    pid = fork do
        logs = logs_for nth
        exec "RUSTFLAGS='-Cdebug-assertions=y' cargo test --release -p cumulus-test-service -- --ignored test_migrate_solo_to_para &> #{logs}"
        # exec "/local/cpp/cumulus/target/release/deps/migrate_solo_to_para-6b0b8c9fbc7bb58c --ignored &> #{logs}"
    end
    puts "#{Time.now}: Launched child \##{nth}: #{pid}"
    $children[nth] = pid
end

COUNT = 32
# COUNT = 1
COUNT.times { |nth| launch }

while $running && !$children.empty?
    sleep 0.2
    $children.each do |nth, pid|
        begin
            # Probe whether the child is still alive (signal 0 only performs an existence check).
            Process.kill 0, pid
        rescue Errno::ESRCH
            puts "#{Time.now}: Child finished: \##{nth} (PID #{pid})"
            $children.delete nth

            if $? && $?.exitstatus != 0
                puts "#{Time.now}: REPRODUCED! (exit status, Job \##{nth}, PID #{pid})"
                $running = false
            end
            next
        end

        path = logs_for nth
        data = File.read( path )
        if data.include? "assertion failed: pubkey.vrf_verify"
            puts "#{Time.now}: REPRODUCED! (assert failed, Job \##{nth}, PID #{pid})"
            $running = false
            next
        elsif data.include? "test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out"
            puts "#{Time.now}: Finished OK (Job \##{nth}, PID #{pid})"
            $children.delete nth
        end
    end
end

$children.each do |_, pid|
    Process.kill "INT", pid
end

Copy it to the root of cumulus and modify the COUNT = 32 line to change how many reproductions are launched in parallel. The logs are saved to /tmp.

@burdges commented Nov 21, 2022

So 10% of runs fail after 90 sec? The other runs do not fail? Is 90 sec about 15 blocks, or much more?

15 is considerably larger than the gap between primary and secondary slots, unless the primary slots were slowed down to make it more like Aura? Trying a !predigest.is_primary() check in https://github.com/paritytech/substrate/blob/master/frame/babe/src/lib.rs#L380 would tell you whether it's only primary VRFs failing.
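
For illustration, here's a minimal, self-contained sketch of that kind of gate. PreDigest and vrf_ok are toy stand-ins (not the sp_consensus_babe types or the pallet's actual verification call); the point is just that gating the assertion on is_primary() makes only primary VRF failures panic:

// Toy sketch of the suggested diagnostic: only assert for primary VRFs, so a failure
// tells you which slot type is affected. Types and names here are illustrative stand-ins.
#[allow(dead_code)]
enum PreDigest {
    Primary,
    SecondaryPlain,
    SecondaryVRF,
}

impl PreDigest {
    fn is_primary(&self) -> bool {
        matches!(self, PreDigest::Primary)
    }
}

fn check_vrf(predigest: &PreDigest, vrf_ok: bool) {
    // Non-primary pre-digests short-circuit the assertion.
    debug_assert!(!predigest.is_primary() || vrf_ok, "primary VRF verification failed");
}

fn main() {
    check_vrf(&PreDigest::SecondaryVRF, false); // ignored by the gate
    check_vrf(&PreDigest::Primary, true);       // passes
}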

@koute (Contributor) commented Nov 21, 2022

> So 10% of runs fail after 90 sec? The other runs do not fail? Is 90 sec about 15 blocks, or much more?

It seems to fail right after the 10th block is imported.

Yes, the other runs do not fail.

I printed out all of the arguments to vrf_verify from a failing test; here they are:

2022-11-22 00:45:00 [//Alice] Evicting failed runtime instance error=Runtime panicked: XXX: assertion failed: pubkey.vrf_verify:
key=#PublicKey( RistrettoPoint( CompressedRistretto: [212, 53, 147, 199, 21, 253, 211, 28, 97, 20, 26, 189, 4, 169, 159, 214, 130, 44, 133, 88, 133, 76, 205, 227, 154, 86, 132, 231, 165, 109, 162, 125] ) )
randomness=[2, 232, 2, 244, 166, 226, 138, 102, 132, 237, 9, 130, 42, 88, 216, 122, 74, 210, 211, 143, 83, 217, 31, 210, 129, 101, 20, 125, 168, 0, 36, 78]
slot=Slot(278174250)
epoch_index=#2
vrf_output=#VRFOutput(VRFOutput([156, 34, 168, 123, 10, 12, 27, 135, 196, 209, 164, 241, 244, 244, 41, 150, 104, 157, 3, 251, 173, 204, 214, 251, 166, 62, 134, 38, 32, 151, 184, 27]))
vrf_proof=#VRFProof(VRFProof { c: Scalar{bytes: [148, 32, 50, 76, 180, 35, 209, 103, 127, 47, 5, 207, 165, 127, 123, 207, 211, 3, 72, 158, 169, 54, 64, 111, 215, 22, 12, 115, 34, 46, 78, 11],}, s: Scalar{bytes: [42, 49, 167, 228, 80, 197, 54, 112, 39, 241, 239, 90, 86, 217, 53, 154, 77, 127, 140, 52, 107, 109, 230, 36, 137, 170, 71, 204, 98, 62, 12, 14],} })
-> out=#Err(EquationFalse)

2022-11-22 00:45:00 [//Bob] Evicting failed runtime instance error=Runtime panicked: XXX: assertion failed: pubkey.vrf_verify:
key=#PublicKey( RistrettoPoint( CompressedRistretto: [142, 175, 4, 21, 22, 135, 115, 99, 38, 201, 254, 161, 126, 37, 252, 82, 135, 97, 54, 147, 201, 18, 144, 156, 178, 38, 170, 71, 148, 242, 106, 72] ) )
randomness=[2, 232, 2, 244, 166, 226, 138, 102, 132, 237, 9, 130, 42, 88, 216, 122, 74, 210, 211, 143, 83, 217, 31, 210, 129, 101, 20, 125, 168, 0, 36, 78]
slot=Slot(278174250)
epoch_index=#2
vrf_output=#VRFOutput(VRFOutput([94, 177, 179, 20, 198, 243, 118, 222, 39, 243, 135, 227, 104, 94, 214, 113, 133, 10, 247, 152, 171, 182, 41, 85, 38, 158, 129, 29, 69, 8, 235, 26]))
vrf_proof=#VRFProof(VRFProof { c: Scalar{bytes: [167, 116, 132, 118, 94, 166, 126, 66, 252, 255, 177, 52, 0, 86, 116, 49, 97, 137, 125, 151, 121, 166, 235, 249, 165, 14, 244, 51, 11, 81, 141, 3],}, s: Scalar{bytes: [143, 225, 86, 22, 151, 17, 93, 156, 112, 43, 99, 114, 11, 143, 156, 231, 28, 110, 182, 102, 236, 6, 236, 149, 155, 85, 157, 185, 77, 173, 63, 0],} })
-> out=#Err(EquationFalse)

@koute (Contributor) commented Nov 21, 2022

So it looks like it's basically failing at the first block of a new epoch:

2022-11-22 00:44:30 [//Bob] 👶 Next epoch starts at slot 278174250

(The same slot that was passed to the failing vrf_verify.)

@koute (Contributor) commented Nov 21, 2022

Looks like there's something wrong on the block producer here; this isn't a case of an imported block failing verification, it fails on the block producer itself. (Although when compiled without debug assertions, it will most likely fail when that block is imported on another node, which is most likely what @michalkucharczyk saw in his reproduction.)

@koute (Contributor) commented Nov 23, 2022

We already figured out the problem, but for posterity's sake let me summarize the issue and paste most of the relevant info I wrote in the Element chat, so that it doesn't get lost forever.

This fails on the block producer right at the epoch boundary. When claiming a slot on the client (primary or secondary, it doesn't matter) to build a block, BABE picks the wrong epoch, which means the randomness is wrong, which means the VRF output is wrong, which makes the debug assertion fail inside FRAME, where the correct epoch index is picked (and hence the randomness there is correct). This only fails in FRAME when debug assertions are enabled; without them, AFAIK, the bad block would simply be built and then fail later when another node tried to import it and verify the VRF.

The reason it picks the wrong epoch is a bug in the pruning of the fork tree. (See the PR that fixes the issue for exactly where and why.)
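
To make the failure mode concrete, here's a small standalone sketch using the schnorrkel and merlin crates directly (the transcript labels and values are illustrative, not necessarily the exact ones Substrate uses, and in the real bug the randomness differs as well since it comes from the wrong epoch's data). Signing against one epoch index and verifying against another fails exactly like the EquationFalse errors above:

use merlin::Transcript;
use schnorrkel::Keypair;

// Build a BABE-style VRF transcript; labels are illustrative.
fn make_transcript(randomness: &[u8; 32], slot: u64, epoch: u64) -> Transcript {
    let mut t = Transcript::new(b"BABE");
    t.append_u64(b"slot number", slot);
    t.append_u64(b"current epoch", epoch);
    t.append_message(b"chain randomness", randomness);
    t
}

fn main() {
    let keypair = Keypair::generate(); // needs schnorrkel's default features
    let randomness = [7u8; 32];

    // The block producer signs with the epoch it (wrongly) picked: index 1.
    let (inout, proof, _) = keypair.vrf_sign(make_transcript(&randomness, 42, 1));
    let output = inout.to_output();

    // The runtime rebuilds the transcript with the correct epoch index 2: verification fails.
    assert!(keypair
        .public
        .vrf_verify(make_transcript(&randomness, 42, 2), &output, &proof)
        .is_err());

    // With a matching transcript it verifies fine.
    assert!(keypair
        .public
        .vrf_verify(make_transcript(&randomness, 42, 1), &output, &proof)
        .is_ok());
}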


It looks like, when this happens, the failing node picks the epoch descriptor of the previous epoch:

2022-11-22 13:56:01 [//Alice] 👶 New epoch 1 launching at block 0x42c0…7fc6 (block slot 278182160 >= start slot 278182160, src = NetworkBroadcast).
2022-11-22 13:56:01 [//Alice] 👶 Next epoch starts at slot 278182165
2022-11-22 13:56:01 [//Bob] 👶 New epoch 1 launching at block 0xccb7…d57e (block slot 278182160 >= start slot 278182160, src = NetworkBroadcast).
2022-11-22 13:56:01 [//Bob] 👶 Next epoch starts at slot 278182165

2022-11-22 13:56:30 [//Bob] 👶 New epoch 2 launching at block 0xc2f8…112f (block slot 278182165 >= start slot 278182165, src = Own).
2022-11-22 13:56:30 [//Bob] 👶 Next epoch starts at slot 278182170
2022-11-22 13:56:30 [//Bob] XXX: Attempting to claim slot 278182165 with: epoch_descriptor=Signaled(EpochIdentifier { position: Regular, hash: 0xccb772b6ffb23ca137a18ee3ceb75c120cab426be6e0db225721ef6fb8abd57e, number: 6 }, EpochHeader { start_slot: Slot(278182165), end_slot: Slot(278182170) })
2022-11-22 13:56:30 [//Alice] XXX: Attempting to claim slot 278182165 with: epoch_descriptor=Signaled(EpochIdentifier { position: Genesis1, hash: 0x46cbfa68979cd3c45cc3997cef55bffe2526aeb1c382b156b11201176ed691d6, number: 1 }, EpochHeader { start_slot: Slot(278182160), end_slot: Slot(278182165) })
2022-11-22 13:56:30 [//Alice] Evicting failed runtime instance error=Runtime panicked: XXX: assertion failed: pubkey.vrf_verify

Here's a different reproduction where within the same reproduction Alice produced a block and Bob failed:

Fork tree of Alice:

inner: ForkTree {
    roots: [
        Node {
            hash: 0x4b5fbe4c2f20b796096a0396c29b3669d9d10f79e0cc2634b922165e644de4ba,
            number: 1,
            data: Genesis(
                EpochHeader {
                    start_slot: Slot(
                        278183811,
                    ),
                    end_slot: Slot(
                        278183816,
                    ),
                },
                EpochHeader {
                    start_slot: Slot(
                        278183816,
                    ),
                    end_slot: Slot(
                        278183821,
                    ),
                },
            ),
            children: [
                Node {
                    hash: 0x9862372020c3a0f7a974a16cdabb28c6f5a9f5931505a63fbca52bfc4936b3ab,
                    number: 6,
                    data: Regular(
                        EpochHeader {
                            start_slot: Slot(
                                278183821,
                            ),
                            end_slot: Slot(
                                278183826,
                            ),
                        },
                    ),
                    children: [],
                },
            ],
        },
    ],
    best_finalized_number: None,
},

Fork tree of Bob:

inner: ForkTree {
    roots: [
        Node {
            hash: 0x4b5fbe4c2f20b796096a0396c29b3669d9d10f79e0cc2634b922165e644de4ba,
            number: 1,
            data: Genesis(
                EpochHeader {
                    start_slot: Slot(
                        278183811,
                    ),
                    end_slot: Slot(
                        278183816,
                    ),
                },
                EpochHeader {
                    start_slot: Slot(
                        278183816,
                    ),
                    end_slot: Slot(
                        278183821,
                    ),
                },
            ),
            children: [
                Node {
                    hash: 0x85e1a8f436aabef4f4aebac1f71b8d6b2ef64879d4ec9ee9061fb9a3fa5edb3c,
                    number: 6,
                    data: Regular(
                        EpochHeader {
                            start_slot: Slot(
                                278183821,
                            ),
                            end_slot: Slot(
                                278183826,
                            ),
                        },
                    ),
                    children: [],
                },
            ],
        },
    ],
    best_finalized_number: None,
},

Notice the hash of the node in children. On Alice it's 0x9862... and on Bob it's 0x85e1.

Here's Alice searching for the epoch: (hashes simplified for readability)

2022-11-22 16:42:06 [//Alice] XXX: aux_data: parent=Header { parent_hash: 0xd072..., number: 10
2022-11-22 16:42:06 [//Alice] XXX: is_descendent_of: current=Some((FAKE_HEAD, 0xd4d0...)) base=0x9862... hash=FAKE_HEAD
2022-11-22 16:42:06 [//Alice] XXX: is_descendent_of: true ancestor=HashAndNumber { number: 6, hash: 0x9862... }
2022-11-22 16:42:06 [//Alice] XXX: aux_data: result=Ok(Signaled(EpochIdentifier { position: Regular, hash: 0x9862..., number: 6 }, EpochHeader { start_slot: Slot(278183821), end_slot: Slot(278183826) }))

And here's the same for Bob:

2022-11-22 16:42:06 [//Bob] XXX: aux_data: parent=Header { parent_hash: 0xd072..., number: 10
2022-11-22 16:42:06 [//Bob] XXX: is_descendent_of: current=Some((FAKE_HEAD, 0xd4d0...)) base=0x85e1... hash=FAKE_HEAD
2022-11-22 16:42:06 [//Bob] XXX: is_descendent_of: false ancestor=HashAndNumber { number: 5, hash: 0xe0cc... }
2022-11-22 16:42:06 [//Bob] XXX: is_descendent_of: current=Some((FAKE_HEAD, 0xd4d0...)) base=0x4b5f... hash=FAKE_HEAD
2022-11-22 16:42:06 [//Bob] XXX: is_descendent_of: true ancestor=HashAndNumber { number: 1, hash: 0x4b5f... }
2022-11-22 16:42:06 [//Bob] XXX: aux_data: result=Ok(Signaled(EpochIdentifier { position: Genesis1, hash: 0x4b5f...., number: 1 }, EpochHeader { start_slot: Slot(278183816), end_slot: Slot(278183821) }))

So is_descendent_of is first called with base equal to the hash of the node in children from the fork tree. That calls sp_blockchain::lowest_common_ancestor with one hash being the fake head and the other hash being the base.

On Alice this returns block 6 (0x9862), which is equal to the base hash (0x9862), so is_descendent_of returns true and the epoch from that block (epoch 2) is picked.

On Bob this returns block 5 (0xe0cc), which is not equal to the base hash (0x85e1), so is_descendent_of returns false, and it falls back deeper into the tree and only picks epoch 1 for block production.
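
As a rough, self-contained illustration of that descendancy check (toy code, not the sp_blockchain implementation; ancestry is just a parent map over the hashes from this reproduction, with intermediate blocks elided):

use std::collections::HashMap;

// Toy stand-in for the check described above: `base` is an ancestor of `head` iff
// walking the parent links from `head` reaches `base` (equivalently, iff
// lowest_common_ancestor(base, head) == base).
fn is_descendent_of(parents: &HashMap<&str, &str>, base: &str, head: &str) -> bool {
    let mut current = head;
    loop {
        if current == base {
            return true;
        }
        match parents.get(current) {
            Some(parent) => current = parent,
            None => return false,
        }
    }
}

fn main() {
    // Block 5 (0xe0cc) has two competing children at height 6; the chain continued
    // on 0x9862 up to the head 0xd4d0 (blocks 7..9 elided).
    let parents: HashMap<&str, &str> =
        [("0x9862", "0xe0cc"), ("0x85e1", "0xe0cc"), ("0xd4d0", "0x9862")]
            .into_iter()
            .collect();

    // Alice's fork tree stores 0x9862: the head descends from it, so epoch 2 is picked.
    assert!(is_descendent_of(&parents, "0x9862", "0xd4d0"));
    // Bob's fork tree stores 0x85e1: the head does not descend from it, so he falls
    // back towards the tree root and ends up with epoch 1.
    assert!(!is_descendent_of(&parents, "0x85e1", "0xd4d0"));
}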

2022-11-22 16:41:36 [//Bob] 🔖 Pre-sealed block for proposal at 6. Hash now 0x9862
2022-11-22 16:41:36 [//Bob] 👶 New epoch 1 launching at block 0x9862…b3ab
2022-11-22 16:41:37 [//Bob] ✨ Imported #6 (0x9862…b3ab)
2022-11-22 16:41:37 [//Bob] 👶 New epoch 1 launching at block 0x85e1…db3c
2022-11-22 16:41:39 [//Bob] ✨ Imported #6 (0x85e1…db3c)
2022-11-22 16:41:40 [//Bob] 💤 Idle (3 peers), best: #6 (0x9862…b3ab), finalized #3 (0xe44e…8d1b), ⬇ 1.0kiB/s ⬆ 1.6kiB/s
2022-11-22 16:41:42 [//Bob] 🙌 Starting consensus session on top of parent 0x9862
2022-11-22 16:41:36 [//Alice] 🔖 Pre-sealed block for proposal at 6. Hash now 0x85e1
2022-11-22 16:41:36 [//Alice] 👶 New epoch 1 launching at block 0x85e1…db3c
2022-11-22 16:41:37 [//Alice] ✨ Imported #6 (0x85e1…db3c)
2022-11-22 16:41:37 [//Alice] 👶 New epoch 1 launching at block 0x9862…b3ab
2022-11-22 16:41:39 [//Alice] ♻️  Reorg on #6,0x85e1…db3c to #6,0x9862…b3ab, common ancestor #5,0xe0cc…57e8
2022-11-22 16:41:39 [//Alice] ✨ Imported #6 (0x9862…b3ab)
2022-11-22 16:41:53 [//Alice] 💤 Idle (3 peers), best: #8 (0x9313…c66a), finalized #6 (0x9862…b3ab), ⬇ 1.4kiB/s ⬆ 1.2kiB/s

So it looks like 0x85e1 was built by Alice while 0x9862 was built by Bob, and they both imported both of these blocks, but 0x9862 got finalized.

Here's what the full tree of blocks looks like, with block numbers and hashes:

0 5e8f (genesis)
└ 1 4b5f (epoch 0)
  └ 2 315c (epoch 0)
     └ 3 e44e (epoch 0)
       └ 4 ffaa (epoch 0)
         └ 5 e0cc (epoch 0)
           ├ 6 9862 (epoch 1)
           | └ 7 9efe (epoch 1)
           |   └ 8 9313 (epoch 1)
           |     └ 9 d072 (epoch 1)
           |       └ 10 d4d0 (epoch 1)
           |         └ 11 ae3c (epoch 2)
           └ 6 85e1 (epoch 1)

And finally, here's the sequence of calls made on the fork tree that results in the wrong thing being pruned (a different reproduction, so the hashes differ):

2022-11-22 20:38:52 [//Bob] XXX: epoch_changes: rebalance    
2022-11-22 20:38:52 [//Bob] XXX: epoch_changes: prune_finalized: hash=0x5e8f0b6d60b48da41007e493f4b78e1a549bdca634131b7c39baa6c5c60bcb0e number=0 slot=Slot(0)    
2022-11-22 20:38:55 [//Bob] XXX: epoch_changes: prune_finalized: hash=0x5e8f0b6d60b48da41007e493f4b78e1a549bdca634131b7c39baa6c5c60bcb0e number=0 slot=Slot(0)    
2022-11-22 20:38:56 [//Bob] XXX: epoch_changes: import: hash=0xebc98b3226f50a4bbed366bd7d66a116eb933b1d0a5c40dbd33f4b83dd9598be number=1 parent_hash=0x5e8f0b6d60b48da41007e493f4b78e1a549bdca634131b7c39baa6c5c60bcb0e epoch=IncrementedEpoch(Genesis(Epoch { epoch_index: 0, start_slot: Slot(278186189), duration: 5, authorities: [(Public(d43593c715fdd31c61141abd04a99fd6822c8558854ccde39a5684e7a56da27d (5GrwvaEF...)), 1), (Public(8eaf04151687736326c9fea17e25fc5287613693c912909cb226aa4794f26a48 (5FHneW46...)), 1)], randomness: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], config: BabeEpochConfiguration { c: (1, 4), allowed_slots: AllowedSlots::PrimaryAndSecondaryVRFSlots } }, Epoch { epoch_index: 1, start_slot: Slot(278186194), duration: 5, authorities: [(Public(d43593c715fdd31c61141abd04a99fd6822c8558854ccde39a5684e7a56da27d (5GrwvaEF...)), 1), (Public(8eaf04151687736326c9fea17e25fc5287613693c912909cb226aa4794f26a48 (5FHneW46...)), 1)], randomness: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], config: BabeEpochConfiguration { c: (1, 4), allowed_slots: AllowedSlots::PrimaryAndSecondaryVRFSlots } }))    
2022-11-22 20:39:24 [//Bob] XXX: epoch_changes: prune_finalized: hash=0x5c53e2839e55131ec24a6b447546a9674591a58a8ddaa076293eec9d045564cb number=3 slot=Slot(278186191)    
2022-11-22 20:39:25 [//Bob] XXX: epoch_changes: import: hash=0x3160db38092ed5d82940b7f15a26f1c73770f14dfd63f619120ca7b585f294a1 number=6 parent_hash=0x40991791639b090113ea9d7e893043e0a6cb7e55f362c6b877b1a6a0907457ed epoch=IncrementedEpoch(Regular(Epoch { epoch_index: 2, start_slot: Slot(278186199), duration: 5, authorities: [(Public(d43593c715fdd31c61141abd04a99fd6822c8558854ccde39a5684e7a56da27d (5GrwvaEF...)), 1), (Public(8eaf04151687736326c9fea17e25fc5287613693c912909cb226aa4794f26a48 (5FHneW46...)), 1)], randomness: [2, 232, 2, 244, 166, 226, 138, 102, 132, 237, 9, 130, 42, 88, 216, 122, 74, 210, 211, 143, 83, 217, 31, 210, 129, 101, 20, 125, 168, 0, 36, 78], config: BabeEpochConfiguration { c: (1, 4), allowed_slots: AllowedSlots::PrimaryAndSecondaryVRFSlots } }))    
2022-11-22 20:39:26 [//Bob] XXX: epoch_changes: prune_finalized: hash=0x5c53e2839e55131ec24a6b447546a9674591a58a8ddaa076293eec9d045564cb number=3 slot=Slot(278186191)    
2022-11-22 20:39:26 [//Bob] XXX: epoch_changes: import: hash=0x46b9293899d277b7331d396470003ec12e9de064c5087f6460ab57cb79bd59f1 number=6 parent_hash=0x40991791639b090113ea9d7e893043e0a6cb7e55f362c6b877b1a6a0907457ed epoch=IncrementedEpoch(Regular(Epoch { epoch_index: 2, start_slot: Slot(278186199), duration: 5, authorities: [(Public(d43593c715fdd31c61141abd04a99fd6822c8558854ccde39a5684e7a56da27d (5GrwvaEF...)), 1), (Public(8eaf04151687736326c9fea17e25fc5287613693c912909cb226aa4794f26a48 (5FHneW46...)), 1)], randomness: [2, 232, 2, 244, 166, 226, 138, 102, 132, 237, 9, 130, 42, 88, 216, 122, 74, 210, 211, 143, 83, 217, 31, 210, 129, 101, 20, 125, 168, 0, 36, 78], config: BabeEpochConfiguration { c: (1, 4), allowed_slots: AllowedSlots::PrimaryAndSecondaryVRFSlots } }))

@ghost commented Jan 16, 2023

I just got the same error on polkadot master, together with the test ensure_test_service_build_blocks has been running for over 60 seconds error.

2023-01-14 08:46:06.015  WARN             tokio-runtime-worker wasm-runtime: [//Alice] Evicting failed runtime instance error=Runtime panicked: assertion failed: pubkey.vrf_verify(transcript.clone(), vrf_output, vrf_proof).is_ok()
2023-01-14 08:46:06.015 TRACE             tokio-runtime-worker state: [//Alice] Return ext_id=9303 was_native=true result=Err(RuntimePanicked("assertion failed: pubkey.vrf_verify(transcript.clone(), vrf_output, vrf_proof).is_ok()"))
2023-01-14 08:46:06.015 TRACE             tokio-runtime-worker state-db-pin: [//Alice] Unpinned block: 0x5c73779d6d41e1f5b4abee3f8a0a989c2e6457c67faa5c50bd373856d48e0e61    
2023-01-14 08:46:06.015  WARN             tokio-runtime-worker babe: [//Alice] Proposing failed: Import failed: Error at calling runtime api: Execution failed: Runtime panicked: assertion failed: pubkey.vrf_verify(transcript.clone(), vrf_output, vrf_proof).is_ok()    

@bkchr (Member) commented Jan 16, 2023

@alexgparity you have collected more logs, right? Can you also link the file, please?

@ghost commented Jan 16, 2023

loop_out.txt.zip
Full log is attached. I ran while cargo test --profile testnet --test build-blocks ensure_test_service_build_blocks; do truncate -s 0 loop_out; done 2> loop_out. It took quite a few hours to hit the error.

@davxy (Member) commented Jan 18, 2023

@alexgparity I started investigating your issue a bit.

First thing: cleaning up the log to focus on the interesting parts.

From what I can see, this is what happens (the XXX lines are my annotations):

XXX (Alice) Creates block #1 (0xa3dd…1d12)
2023-01-13 22:06:48.023  INFO             tokio-runtime-worker babe: [//Alice] 🔖 Pre-sealed block for proposal at 1. Hash now 0xa3dd ...
...
XXX (Alice) Imports its block #1
XXX (Alice) Next epoch starts @ slot 278940673 (~ 22:07)
2023-01-13 22:06:48.024  INFO             tokio-runtime-worker babe: [//Alice] 👶 Next epoch starts at slot 278940673      
...
XXX (Bob) Import Alice's block #1
XXX (Bob) Next epoch starts @ slot 278940673 (~ 22:07)
2023-01-13 22:06:48.028  INFO             tokio-runtime-worker babe: [//Bob] 👶 Next epoch starts at slot 278940673   
...
XXX (Bob) Creates block #1 (0x5c73…0e61)
2023-01-13 22:06:48.044  INFO             tokio-runtime-worker babe: [//Bob] 🔖 Pre-sealed block for proposal at 1. Hash now 0x5c73...    
...
XXX (Bob) Imports its block #1
XXX (Bob) Next epoch starts @ slot 278940673 (~ 22:07)
2023-01-13 22:06:48.045  INFO             tokio-runtime-worker babe: [//Bob] 👶 Next epoch starts at slot 278940673    
...
XXX (Alice) Import Bob's block #1
2023-01-13 22:06:48.047  INFO             tokio-runtime-worker babe: [//Alice] 👶 Next epoch starts at slot 278940673    
XXX So far we thus have a fork at block 1 on both nodes (and thus two different epoch 0/1 parameters / randomness values)

Besides this, which should be handled correctly anyway, I see that (just before the error triggers) there is a jump in time from 22:xx to 08 AM.

2023-01-14 08:36:51.807  WARN             tokio-runtime-worker babe: [//Bob] Proposing failed: Import failed: Error at calling runtime api: Execution failed: Runtime panicked: assertion failed: pubkey.vrf_verify(transcript.clone(), vrf_output, vrf_proof).is_ok()    

Can you explain why?

@ghost commented Jan 18, 2023

@davxy The jump in time is because I put my computer to sleep for the night and started using it again in the morning.

@ghost commented Jan 18, 2023

But I also got the test ensure_test_service_build_blocks has been running for over 60 seconds error on CI https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/2251466 and wanted to reproduce it locally with logs. Do you think putting my machine to sleep could have triggered the issue somehow?

@ordian (Member, Author) commented Jan 18, 2023

> But I also got the test ensure_test_service_build_blocks has been running for over 60 seconds error on CI https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/2251466

That error is different. As I said in the chat, it happened because the CI machine was too slow for some reason and the job hit the 2h timeout, so it was killed. It's unrelated to the VRF verification failure.

@ghost commented Jan 18, 2023

> But I also got the test ensure_test_service_build_blocks has been running for over 60 seconds error on CI https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/2251466
>
> That error is different. As I said in the chat, it happened because the CI machine was too slow for some reason and the job hit the 2h timeout, so it was killed. It's unrelated to the VRF verification failure.

Could that "too slow" actually be this same issue, i.e. the test looped on the panic like it did for me locally, and that's why it didn't terminate in time and was killed?

We don't have detailed logs for that job, do we?

@davxy (Member) commented Jan 18, 2023

@alexgparity Well, it looks like the issue has something to do with BABE skipping one or more epochs.
Let's look at Bob:

  1. Bob produces the first block (which announces the data for the current epoch 0 and the next epoch 1).
  2. Some epochs are skipped without any block being produced.
  3. In some epoch X > 1 it wakes up and tries to produce a block.
  4. As a fallback, it uses the best epoch data it can get (the data of epoch 1).
  5. This data contains the information used to compute the VRF output (in particular, epoch index = 1).
  6. On finalization of the produced block, the runtime updates the EpochIndex value to X.
  7. Thus the VRF verification here uses epoch index X, which gives an output different from the one embedded in the header digest (see the sketch below).
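
A minimal numeric sketch of that mismatch, assuming the usual relation epoch_index = (slot - genesis_slot) / epoch_duration (the numbers only loosely mirror the log above and are purely illustrative):

// Sketch of the client/runtime epoch-index mismatch described in the list above.
fn epoch_index(slot: u64, genesis_slot: u64, epoch_duration: u64) -> u64 {
    (slot - genesis_slot) / epoch_duration
}

fn main() {
    let (genesis_slot, epoch_duration) = (278_940_668u64, 5u64);

    // The client falls back to the last epoch data it has and signs its VRF with index 1...
    let client_epoch_index = 1;

    // ...but the block is produced many slots later (the machine slept through several
    // epochs), so the runtime ends up with a much larger index X for that slot.
    let runtime_epoch_index = epoch_index(278_940_700, genesis_slot, epoch_duration);

    // Different epoch index => different VRF transcript => EquationFalse on verification.
    assert_ne!(client_epoch_index, runtime_epoch_index);
}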

#13135 should solve the problem

If this is the only one :-) ... BABE epoch skipping is quite a new feature ...

@nud3l
Copy link

nud3l commented Feb 18, 2023

Is there a way to recover from this issue on the collator nodes? We just ran into it on our testnet. Normally we'd just reset the testnet, but we're currently running a competition on it and a reset would not be ideal right now.

@bkchr (Member) commented Feb 18, 2023

Parachains don't support BABE. If you mean on the relay chain, it should work if you have the changes included in your node.

@nud3l commented Feb 19, 2023

I was referring to the embedded relay chain node in the parachain node.

  • Relay chain: validators were at 0.9.37 when the issue occurred and are producing blocks.
  • Parachain: collators have an embedded node at 0.9.36 and reject the new relay chain blocks. They are not producing new blocks.

We upgraded the parachain nodes to 0.9.37 (only the nodes, not the runtime, since no blocks are being produced), but that did not get them producing new blocks again. Next, we set execution=native for the embedded relay chain node in the collator, but that did not seem to do the trick either.

To give a bit more context, this is what we see in the logs:

  1. Collator node connects to the relaychain node to get the latest block.
  2. Collator node gets block but runs into the VRF error. 💔 Verification failed for block 0xdef612b6bd9d3b6a2ff93a1e9310f385b7a8b715f3f368135731772a3c6f0792 received from peer: 12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi, "VRF verification failed: EquationFalse"
  3. Subsequent connections to the relaychain nodes are then refused and we get /ip4/127.0.0.1/tcp/30335/p2p/12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi provided a different peer ID 12D3KooWP5Ye7wZLtDvjb6qJnpGMVvH6kqFgWTkN3x1Z3SfScEUY than the one you expect 12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi.

Logs

2023-02-18 12:39:59.767 DEBUG tokio-runtime-worker sync: [Relaychain] 12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi disconnected    
2023-02-18 12:39:59.766  WARN tokio-runtime-worker sync: [Relaychain] 💔 Verification failed for block 0xdef612b6bd9d3b6a2ff93a1e9310f385b7a8b715f3f368135731772a3c6f0792 received from peer: 12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi, "VRF verification failed: EquationFalse"     
2023-02-18 12:39:59.748 DEBUG tokio-runtime-worker sync: [Relaychain] Connected 12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi      
2023-02-18 12:39:59.745 DEBUG tokio-runtime-worker libp2p_ping: [Relaychain] Ping received from PeerId("12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi")    
2023-02-18 12:39:59.744 DEBUG tokio-runtime-worker libp2p_ping: [Relaychain] Ping sent to PeerId("12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi")    
2023-02-18 12:39:59.743 DEBUG tokio-runtime-worker sub-libp2p: [Relaychain] Libp2p => Connected(PeerId("12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi"))    
2023-02-18 12:39:59.743 DEBUG tokio-runtime-worker libp2p_swarm: [Relaychain] Connection established: PeerId("12D3KooWEG2yLqePwnfa6s3CycUuqkzCC4LdG63Ke8qfrfhj7jPi") Listener { local_addr: "/ip4/10.64.1.123/tcp/30334/ws", send_back_addr: "/ip4/10.64.1.125/tcp/56238/ws" }; Total (peer): 1. Total non-banned (peer): 1 

@bkchr (Member) commented Feb 19, 2023

Did you try to resync? It would also be best to use the 0.9.38 branch or even the latest master. We have made some fixes since then to make the skipping of missing epochs smoother and less error-prone.

@nud3l commented Feb 20, 2023

Awesome, that did the trick!

@bkchr (Member) commented Feb 20, 2023

What exactly did the trick? :D

@nud3l commented Feb 20, 2023

Sorry, I was too excited that it was working again.

What worked:

  • Spin up a new parachain node with the relay chain part set to --execution=native (on polkadot 0.9.37)
  • The new node syncs to the latest blocks
  • Reset the existing collator nodes one by one (all on 0.9.37) and they sync and produce blocks again
