This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Bad signature synchronization error of past blocks with Aura #10103

Closed
remzrn opened this issue Oct 26, 2021 · 26 comments · Fixed by #12492
Labels
J2-unconfirmed Issue might be valid, but it’s not yet known.

Comments

@remzrn
Contributor

remzrn commented Oct 26, 2021

Hello everyone,
While migrating Edgeware, which uses Aura, to Substrate 4, I encountered the following issue: upon authority changes at each epoch, the chain fails with a bad signature error.
Here is how I traced it:
The error message happens only here (and in babe, but Edgeware does not use babe)

#[display(fmt = "Bad signature on {:?}", _0)]

And the enum is only present (for aura) in the check_header function here:

Err(Error::BadSignature(hash))

So I built two nodes, an old one and a new one, with respectively modified Substrate versions that printed out the information. It seems that on block 700 (at the end of the first session), the authority set of the working (old) version is extended, ending up with an expected_author that differs from my upgraded version, which keeps the same authority set. But I don't really understand what I missed during the migration that caused this, so I went on a bit more:
The function check_header is only used in aura (also in pow and babe, but whatever), and the authorities are fetched there:

let authorities = authorities(self.client.as_ref(), &BlockId::Hash(parent_hash))

Which seems to reference this function:

fn authorities<A, B, C>(client: &C, at: &BlockId<B>) -> Result<Vec<A>, ConsensusError>

Printing the two outputs also shows the difference in the authority set, with 4.0.0-dev not picking up the change, while it was correct in Substrate 3.
I migrated the chain following the examples in node-template, and it does not seem like any change is required in the runtime API implementation for the Aura API. Besides, since I am syncing past blocks, I suppose the runtime that is executed is the one that was effective at production time, so the authorities should be the same.
Unfortunately, I could not get any further, since the runtime API construction is too complicated for me to understand and involves macros etc.
I also have a weird feeling that the issue might somehow be related to this one too, with some information from a block not being processed or not going through the adequate call chain.

Any help or pointer would be really appreciated. I am ready to try things to trace further, but since the runtime that is executed is a very old one, I cannot print the state once the API is called.

@DylanVerstraete

@remzrn we face a similar issue on our networks. Did you get it resolved already?

@bkchr
Member

bkchr commented Sep 23, 2022

@DylanVerstraete try to revert this PR: #9132, it may be the problem

@DylanVerstraete

DylanVerstraete commented Sep 23, 2022

@DylanVerstraete try to revert this PR: #9132, it may be the problem

Reverting only that PR does not really work, since this code depends on the ProvideCache trait, which was removed at some point.

@bkchr
Member

bkchr commented Sep 23, 2022

The important point is probably to add a call to initialize_block before calling authorities. So, something like this:

let runtime_api = client.runtime_api();
runtime_api.initialize_block();
runtime_api.authorities();

@DylanVerstraete

The important point is probably to add a call to initialize_block before calling authorities. So, something like this:

let runtime_api = client.runtime_api();
runtime_api.initialize_block();
runtime_api.authorities();

Something like this? https://github.com/DylanVerstraete/substrate/blob/c66577f77b10269f4f7b2070fad7f16803170e60/client/consensus/aura/src/import_queue.rs#L208

I tried it, doesn't work. I still get the same error...

@DylanVerstraete

DylanVerstraete commented Sep 23, 2022

I will try to explain this in a simple example,

at block 10 there are 2 authorities (Bob, Alice), in block 11 a call was made to add an authority (Ferdie), in block 12 grandpa authorities changed.

The client can import block 10 and block 11 but it fails at block 12.

When block 12 is being imported, it tries to read the authorities from the parent block, see:

let authorities = authorities(self.client.as_ref(), &BlockId::Hash(parent_hash))
BUT in our case block 12 is created by Ferdie (the newly added authority). So it only finds Alice and Bob as authorities and tries to verify the signature, but fails because the block was signed by Ferdie. When reading the session validators storage at block 11, I get Alice and Bob, but Ferdie is not there yet; Ferdie gets added to the session validators at block 12.

When I patch this code to request the authorities from the currently imported block (12) I get following error:

Current state of blockchain has invalid authorities set

All of this does not really explain why it worked on Substrate V3, so I am more confused than ever now.
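The timing mismatch described above can be sketched as a small, self-contained simulation (the types and names here are invented for illustration and are not the real Substrate APIs): the verifier checks a block's author against the authority set read from the parent block's state, but the new authority only appears in that state one block after it starts authoring.

```rust
// Invented stand-in for the per-block data the verifier sees.
struct BlockInfo {
    number: u32,
    author: &'static str,
    // Authority set stored in this block's post-state.
    authorities: Vec<&'static str>,
}

// Verify a block the way the Aura import queue is described above:
// the author must be a member of the authorities fetched at the parent.
fn verify(parent: &BlockInfo, block: &BlockInfo) -> Result<(), String> {
    if parent.authorities.contains(&block.author) {
        Ok(())
    } else {
        Err(format!(
            "bad signature on block {}: {} not in parent authorities {:?}",
            block.number, block.author, parent.authorities
        ))
    }
}

fn main() {
    let b10 = BlockInfo { number: 10, author: "Alice", authorities: vec!["Alice", "Bob"] };
    // Ferdie is added in block 11, but state only reflects it in block 12.
    let b11 = BlockInfo { number: 11, author: "Bob", authorities: vec!["Alice", "Bob"] };
    let b12 = BlockInfo { number: 12, author: "Ferdie", authorities: vec!["Alice", "Bob", "Ferdie"] };

    assert!(verify(&b10, &b11).is_ok());
    // Block 12 is authored by Ferdie, but the parent's state still lists
    // only Alice and Bob, so verification fails exactly as in the issue.
    assert!(verify(&b11, &b12).is_err());
}
```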

@DylanVerstraete

DylanVerstraete commented Sep 23, 2022

Okay, I think I found the issue to all of this. For adding / removing validators we use: https://github.com/gautamdhameja/substrate-validator-set

Which implements the SessionManager. Now, when adding a validator, this pallet doesn't modify the Aura pallet storage. Rather, it forces a session rotation; when the session is rotated, the validator gets added to the Aura pallet storage automatically, but this happens a block later than the actual adding of the validator.

I reworked the implementation of authorities in our runtime api: https://github.com/threefoldtech/tfchain/pull/462/files and upgraded the runtime, the sync now properly continues because it can fetch the authorities from the substrate validator set pallet instead from aura.

Still not sure why it worked on Substrate V3 but not on polkadot-0.9.24.

@DylanVerstraete

@bkchr Seems like the fix from my last comment is not actually working. What could I be missing here?

@remzrn
Contributor Author

remzrn commented Oct 7, 2022

@DylanVerstraete : I thought I had, but we discovered a couple of days ago that we hadn't, unfortunately.
I did something similar to the fix that @bkchr proposed. You can see the change is quite trivial: edgeware-network/edg-aura@c92b1fa.
Initializing the block works for syncing the past, because it rotates the session and so updates the validators, similarly to what you are doing, it seems (but it does not work going forward, see below).
I think I understand the logic of your fix, but I don't get why it works for the synchronization of past blocks: the runtime that's applied for those should not be the one you modified but the one that's on-chain for the past blocks, so any change in your runtime should not really be reflected in the logic that processes past blocks. Am I missing something?

@bkchr : The problem actually gets nastier when initializing the block in the Aura import queue. It synchronizes the past perfectly, but then you can get into a situation where the validator set changes in such a way that the validator set to produce the block is one that gets excluded on the era change. That bricks the chain: the validator produces the block, but in the header-checking function the authorities have already been updated by the initialization, so the block is seen as invalid because it was produced by an unknown authority. All validators then drop it, because the new authority set is used for verification but the block that should enshrine that set is never accepted, and the chain halts completely.

I don't see any elegant way to solve this, as the logic changed in Substrate at that time, so we have to choose between not being able to sync from scratch if we use current Substrate, or risking a chain halt at any point if we modify it to initialize the block.
Do you see any way to handle both cases? We are on a very old version of Substrate (v2.0.0) and this is a blocker for us to upgrade.

@DylanVerstraete

DylanVerstraete commented Oct 7, 2022

@remzrn My applied fix does not work because when syncing history it uses the runtime at that block.

What we have noticed in our research about this issue is the following:

We use https://github.com/gautamdhameja/substrate-validator-set for adding/removing validators. The previous version (which was compatible with Substrate 3.0.0) used to force-rotate a session when you added a validator; this new validator would then create the block after getting added. This causes issues when trying to sync past this block with any Substrate 4.0.0-dev version, because the author of that block is not recognised in the Aura authorities.

We migrated to a Substrate 4.0.0-dev version some weeks ago. To sync our network from 0, we use the latest binary that was still on Substrate 3.0.0. This syncs properly until the block where we upgraded our network to Substrate 4.0.0; it stops syncing when it reaches the point where we performed the upgrade. From that point on, we need to use our latest binary (Substrate 4.0.0-dev) to sync to the current head of our chain.

We chose to not modify any substrate related packages because we don't really know in detail how they work together.

@remzrn
Contributor Author

remzrn commented Oct 7, 2022

@DylanVerstraete : Yes it makes sense that a change in the runtime does not help with the past, which is why we tried to slightly amend substrate.
I was thinking about a lot of ugly workarounds, but frankly I don't think asking validators to change binaries in the middle of a sync is very doable. I would rather hardcode a block number at which we shall switch the logic, even though I hope parity can think of something more elegant.
We used to have a lot of Edgeware-specific modifications of Substrate, since we have a really old chain that has been a pain to migrate in the past. But here the change is actually minimal, as you can see in the commit, because it only impacts the Aura client and none of the primitives. The downside is that it introduces a need to keep that part up to date with Substrate changes. I would personally do this, but at the moment we cannot have both past syncing and a guarantee of chain health going forward even if we modify it, so we are delaying the upgrade, and it gets more and more difficult to catch up.

@DylanVerstraete

@remzrn I'm not really following why your change would not work for future blocks that contain an authority set change. Isn't the point of your change to circumvent this?

@remzrn
Contributor Author

remzrn commented Oct 7, 2022

The point of the change, at a meta level, is just to comply with the logic that was enforced when the chain history was built. Initializing the block achieves this, it seems.

The problem is the disconnect between who builds the block and who the authorities are. An example: three validators v0, v1, v2 are in the set. Block 100 is an era change; v0 and v1 stop, v3 and v4 are set to join, and v2 remains. Block #100 will be produced by, say, v1, which sends the block to everybody (in the new Substrate logic, the validators are still the old ones at the turn block, so v1 can produce the block). The other validators receive this block and check it. They call the verify function, which checks the header, but with the "fix" it initializes the block, which applies the authority change, so v1 is seen as illegitimate. So they all reject block 100.

Every validator will then in turn attempt to produce another one. But since the authority change did not get enacted (the block was not approved), we still have the old validators trying to produce blocks and subsequently rejecting them. Only v2 can actually produce blocks properly, because it is part of both the old and the new set.
So you get the following issues: 1. Validators chosen to produce blocks exactly at the turn of the era will produce an invalid block if the new era excludes them from the set (could that mean slashing? Not sure, since the signature is seen as invalid, so I don't think it would try to punish the validator, but a bad block is bad enough). 2. If more than 2/3 of the validators change during the era switch, they will all refuse the new blocks (the excluded 2/3 will try to produce but get refused by the other validators, and the 2/3 newcomers won't be recognized as authorities yet), so you brick your chain.

PS: This example is simplified, but we hit this case on a testnet just a few days before our planned release. It may sound convoluted, but it does happen and it will happen.
PPS: Maybe a way to summarize this: not initializing the block makes the relation between the block producer and the authority set inconsistent with the old logic; initializing the block makes it inconsistent with the new logic. So there is a coupling between the block producer choice and the authorities that we break when we initialize the block, whereas it was totally consistent in the older versions.
PPPS: An additional thought on how to upgrade. You mentioned syncing with old binaries and complying with the new logic in new binaries, but I am not sure how that is possible: since the consensus is not part of the runtime, validators will need to update their clients before the runtime upgrade can happen, and they won't all do it at exactly the same block. This means some will run the new client and some the old client at the same time, regardless of the runtime version, and so they risk disagreeing about who the authorities are at the end of an era if the validator set changes.
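The two verification conventions at the era boundary can be contrasted in a tiny sketch (types and function names invented for illustration): the old set {v0, v1, v2} hands over to {v2, v3, v4} at block 100, and v1 authors block 100.

```rust
// Old convention: check the author against the authority set as of the
// parent block, i.e. before the pending era change is applied.
fn old_convention(parent_authorities: &[&str], author: &str) -> bool {
    parent_authorities.contains(&author)
}

// New convention with initialize_block before verification: the pending
// authority change is already applied, so the author is checked against
// the *new* set and an outgoing validator looks illegitimate.
fn new_convention_with_init(post_init_authorities: &[&str], author: &str) -> bool {
    post_init_authorities.contains(&author)
}

fn main() {
    let old_set = ["v0", "v1", "v2"];
    let new_set = ["v2", "v3", "v4"];
    let author_of_block_100 = "v1";

    // v1 legitimately produced block 100 under the old convention...
    assert!(old_convention(&old_set, author_of_block_100));
    // ...but is rejected once the block is initialized before verification,
    // which is the bricking scenario described above.
    assert!(!new_convention_with_init(&new_set, author_of_block_100));
}
```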

@DylanVerstraete

You mentioned syncing with old binaries and complying with the new logic in new binaries, but I am not sure how it is possible, since the consensus is not part of the runtime, validators will need to update their clients before the runtime upgrade can happen and they won't do it all exactly at the same block.

This is true; we have done this on 2 networks already and it all went smoothly, since no authority changes were made between running the old binary and the new binary (we don't have a notion of eras on our chain, only sessions).

@bkchr
Member

bkchr commented Oct 12, 2022

@bkchr : The problem gets actually nastier when initializing the block in the aura import queue. It synchronizes the past perfectly, but then you can get into a situation where the validator set changes in such a way that the validator that is set to produce the block is actually one that gets excluded on the era change.

The problem with your fix is that you did not revert the commit/PR I mentioned. If you take a close look at the code I removed back then, it fetched the authorities from the cache. This cache was filled on import here. So, while writing this I realized that we actually never called on_initialize on block import, meaning we don't need to do this now.
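The cache-on-import flow being discussed can be sketched roughly like this (an invented simplification, not the real Substrate backend API): on import, the authority set announced by a block is remembered under its hash; on verification, the importer looks the set up at the parent hash instead of re-reading runtime state.

```rust
use std::collections::HashMap;

type Hash = u64; // stand-in for a real block hash
type AuthorityId = String;

#[derive(Default)]
struct AuthorityCache {
    by_block: HashMap<Hash, Vec<AuthorityId>>,
}

impl AuthorityCache {
    // Called on import: remember the set that is valid *after* this block.
    fn record(&mut self, block: Hash, authorities: Vec<AuthorityId>) {
        self.by_block.insert(block, authorities);
    }

    // Called on verification of a child block: the set to check it against
    // is the one recorded at its parent.
    fn authorities_at(&self, parent: Hash) -> Option<&Vec<AuthorityId>> {
        self.by_block.get(&parent)
    }
}

fn main() {
    let mut cache = AuthorityCache::default();
    // Block 11 announces the enlarged set including Ferdie.
    cache.record(11, vec!["Alice".into(), "Bob".into(), "Ferdie".into()]);
    // Verifying block 12: the lookup at the parent (11) finds Ferdie, so a
    // block authored by Ferdie would pass, unlike the state lookup in the
    // issue above.
    let set = cache.authorities_at(11).unwrap();
    assert!(set.contains(&"Ferdie".to_string()));
}
```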

@DylanVerstraete

@bkchr will this code be reverted in the next release?

@bkchr
Member

bkchr commented Oct 12, 2022

@DylanVerstraete which code?

@DylanVerstraete

@bkchr the removal of the cache:

client
.cache()
.and_then(|cache| cache
.get_at(&well_known_cache_keys::AUTHORITIES, at)
.unwrap_or(None)
.and_then(|(_, _, v)| Decode::decode(&mut &v[..]).ok())
)

@bkchr
Member

bkchr commented Oct 12, 2022

No, I removed this on purpose. Back then I didn't think that anyone would use Aura in production. However, we are working on a solution for you.

@DylanVerstraete in what way did you fix your problem? Or have you not fixed it yet? I don't really get why the old binary stops syncing. Do you have a db from right before the syncing stops? Then I could try this on my own.

@DylanVerstraete

DylanVerstraete commented Oct 12, 2022

@bkchr Sorry for the confusion, but we did not really fix our problem. We circumvent it by running the old binary until the block where we upgraded our network to polkadot-0.9.24 dependencies (syncing stops there automatically). We then run the upgraded version of our chain to sync the remainder.

To get to the point where the syncing stops you can do this:

git clone git@github.com:threefoldtech/tfchain.git
cd tfchain/substrate_node
cargo build --release

./target/release/tfchain --chain chainspecs/dev/chainSpecRaw.json --pruning=archive -d /tmp/dev

Syncing will stop at height: 113245

Height 113245 is the first height where we did an authority change.

@DylanVerstraete

This explains why the syncing stops: #10103 (comment)

@remzrn
Contributor Author

remzrn commented Oct 13, 2022

@bkchr : Thanks for the pointers, but I think that's not the only problem: I thought about the digest logs and the cache, but at the time concluded it was not where the information about the authorities could come from, because the cache does not seem to be implemented for full clients.
The code you point out where the update happens just returns a (key, value) set that gets propagated into a map downstream and is meant to be cached. The caching seems to happen via a function called update_cache, defined in the client backend trait:

fn update_cache(&mut self, cache: HashMap<well_known_cache_keys::Id, Vec<u8>>);

This trait is implemented in only 3 places: the database backend, the in-memory client, and the light client, but only the light client actually implements it (I suppose for the purpose of caching auth data, it should be relevant only in memory):

light.rs:

fn update_cache(&mut self, cache: HashMap<well_known_cache_keys::Id, Vec<u8>>) {
    self.cache = cache;
}

db:

fn update_cache(&mut self, _cache: HashMap<well_known_cache_keys::Id, Vec<u8>>) {
    // Currently cache isn't implemented on full nodes.
}

in_mem.rs:

fn update_cache(&mut self, _cache: HashMap<CacheKeyId, Vec<u8>>) {}

I checked the history yesterday, and it seems there used to be an in-memory cache that got removed by this commit:
1c842c4#diff-264272e65c1ba70e4f8ed755f2a9cbfddff835b2e10f4545dcb097e295c74959

Could it be that the consensus actually got broken by this (authority changes would not get cached, so a validator from the old authorities could try to produce a block, but that block would fail verification later, because it was still initialized before being verified, so the expected author belongs to the new authorities), and that your later removal of the block initialization actually fixed it (though by taking a different convention at the turn of the epoch than was applicable before)?

Despite this, it also looks like the caching traits etc. have all been removed by further commits, so re-enabling the cache would not be possible without forking at least the blockchain primitives, their backend, and a big part of the client, and forward-porting the old logic there.
Instead, I think on Edgeware it could be acceptable to switch the consensus: for example, I could produce a client built from our old Substrate 2 version that disables block initialization and caching from a given block onward (probably the easier option). We would then ask validators to upgrade between the release and the block marking the consensus change. Once this is done, our consensus should be in line with new Substrate, and then we can perform a second upgrade to Substrate 4 including the runtime. Do you see any issue with this strategy?

@bkchr
Member

bkchr commented Oct 13, 2022

This pr should fix your problem: #12492

It introduces a compatibility mode. I added some docs and explanations to the PR and the code itself. If you have more questions, I'm here to help you ;)
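The idea behind such a compatibility mode can be sketched roughly as follows (a simplified illustration; the actual type in the PR is more involved and wired into the import queue): below a configured block number the verifier keeps the old behaviour of initializing the block before fetching the authorities, and from that block on it uses the new behaviour.

```rust
// Simplified sketch of a compatibility-mode switch for the Aura verifier.
enum CompatibilityMode {
    /// Always use the new logic.
    None,
    /// Use the old initialize-then-fetch logic for blocks below `until`.
    UseInitializeBlock { until: u32 },
}

// Decide, per block, whether to call initialize_block before fetching
// the authorities, so historical blocks verify under the old convention.
fn should_initialize_block(mode: &CompatibilityMode, block_number: u32) -> bool {
    match mode {
        CompatibilityMode::None => false,
        CompatibilityMode::UseInitializeBlock { until } => block_number < *until,
    }
}

fn main() {
    let mode = CompatibilityMode::UseInitializeBlock { until: 1_000_000 };
    assert!(should_initialize_block(&mode, 113_245)); // historical block: old logic
    assert!(!should_initialize_block(&mode, 2_000_000)); // post-switch: new logic
}
```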

@DylanVerstraete

@bkchr awesome thanks!

@remzrn
Contributor Author

remzrn commented Oct 13, 2022

On a testnet, the behaviour seems as expected before and after the block change (I tried to put it in the middle of an era, to avoid turn-specific behaviours), and it also seems to sync the past blocks as expected. I will let it run to see if I reach the tip of the chain correctly, but usually the problems, if any, pop up rather early, so it seems all good!

@remzrn
Contributor Author

remzrn commented Oct 14, 2022

Fixed by #12492
