Skip to content
This repository has been archived by the owner on Jun 3, 2024. It is now read-only.

Goerli Deadlock Discussion #78

Closed
q9f opened this issue Mar 5, 2021 · 30 comments
Closed

Goerli Deadlock Discussion #78

q9f opened this issue Mar 5, 2021 · 30 comments

Comments

@q9f
Copy link
Member

q9f commented Mar 5, 2021

Online:

0x22ea9f6b28db76a7162054c05ed812deb2f519cd 5/25 ConsenSys
0x4c2ae482593505f0163cdefc073e81c63cda4107 4/25 Nethermind
0x8b24eb4e6aae906058242d83e51fb077370c4720 4/25 Infura/EG
0x9d525e28fe5830ee92d7aa799c4d21590567b595 4/25 Görli/Ronin
0xa6dd2974b96e959f2c8930024451a30afec24203 4/25 EF/Geth
0xe0a2bd4258d2768837baa26a28fe71dc079f84c7 4/25 Görli/Afri

Offline

@q9f
Copy link
Member Author

q9f commented Mar 5, 2021

Screenshot_2021-03-05 Goerli Blocks Etherscan

Looking at the Gas usage here...

Not sure if this is the right issue to track this, but an observation: https://goerli.etherscan.io/block/4391762 seems to have pushed the gas limit > 8_000_000, and just a few blocks later the network is stuck.

@q9f
Copy link
Member Author

q9f commented Mar 5, 2021

My Besu validator reports

2021-03-05 15:59:14.828+01:00 | pool-7-thread-165 | INFO  | BlockMiner | Produced #4,391,756 / 27 tx / 0 om / 2,761,361 (34.5%) gas / (0xeebf54c658a1819a9dde2b908229aa9419c2e6611fdc30441cb139d8924f0d42) in 2.317s
2021-03-05 15:59:27.440+01:00 | EthScheduler-Workers-3 | INFO  | PersistBlockTask | Imported #4,391,757 / 19 tx / 0 om / 1,721,786 (21.5%) gas / (0x695e144860e22185d91303f4a4b66b7125798364f3c48ba44480bf1bcf9267ce) in 0.345s. Peers: 153
2021-03-05 15:59:43.719+01:00 | EthScheduler-Workers-3 | INFO  | PersistBlockTask | Imported #4,391,758 / 18 tx / 0 om / 1,635,302 (20.4%) gas / (0x1b85e78e2a8d2f428ede62360f650460fbd48c93e0094cb1369c4ec81cf04e97) in 0.474s. Peers: 156
2021-03-05 15:59:58.257+01:00 | EthScheduler-Workers-3 | INFO  | PersistBlockTask | Imported #4,391,759 / 1 tx / 0 om / 86,468 (1.1%) gas / (0x7e0ae72abd6c30cfdc869ad8236cd4d9492a7fa83ff7d0ed85b06c094a1d1c5e) in 0.048s. Peers: 151
2021-03-05 16:00:15.308+01:00 | nioEventLoopGroup-3-5 | INFO  | SyncTargetManager | Found common ancestor with peer 0x938443ca0b5cec2d18... at block 4391759
2021-03-05 16:00:19.147+01:00 | EthScheduler-Workers-1 | INFO  | PersistBlockTask | Imported #4,391,760 / 18 tx / 0 om / 1,550,189 (19.4%) gas / (0xd9582ac06bfdded5eb8f6e175ee989ba1d07f75a8aa07240458d236139338cca) in 4.667s. Peers: 151
2021-03-05 16:00:27.541+01:00 | pool-7-thread-165 | INFO  | BlockMiner | Produced #4,391,760 / 98 tx / 0 om / 7,919,067 (99.0%) gas / (0x6a26a79141a64dbc48ee2c4acfb201e0ea481cd525598a6e4e9c076e1cd5588e) in 14.384s
2021-03-05 16:00:30.544+01:00 | nioEventLoopGroup-3-8 | INFO  | SyncTargetManager | Found common ancestor with peer 0x1130a02957bfb331cf... at block 4391760
2021-03-05 16:00:32.497+01:00 | EthScheduler-Workers-2 | INFO  | PersistBlockTask | Imported #4,391,761 / 76 tx / 0 om / 7,987,801 (99.8%) gas / (0x18d380c818a4f83ac58b4baf90d451babb853e76db94ceab936b754135e66fd0) in 4.123s. Peers: 150
2021-03-05 16:00:46.093+01:00 | nioEventLoopGroup-3-8 | INFO  | SyncTargetManager | Found common ancestor with peer 0x1130a02957bfb331cf... at block 4391761
2021-03-05 16:00:48.043+01:00 | EthScheduler-Workers-0 | INFO  | PersistBlockTask | Imported #4,391,762 / 86 tx / 0 om / 8,003,662 (100.0%) gas / (0x33a3e5e27d446ddb02851650c23aeab96e1bbae14f6d76045847dd68afdd3099) in 3.305s. Peers: 155
2021-03-05 16:01:03.455+01:00 | nioEventLoopGroup-3-1 | INFO  | SyncTargetManager | Found common ancestor with peer 0x25dd3567ab8010729d... at block 4391762
2021-03-05 16:01:05.835+01:00 | EthScheduler-Workers-3 | INFO  | PersistBlockTask | Imported #4,391,763 / 93 tx / 0 om / 7,947,426 (99.3%) gas / (0x1ea3b6ed0b94b5de4f16dbb5754d370792b50b4a76694b35e77587a656959275) in 5.628s. Peers: 152
2021-03-05 16:01:08.469+01:00 | pool-7-thread-166 | INFO  | BlockMiner | Produced #4,391,763 / 100 tx / 0 om / 7,955,772 (99.4%) gas / (0x0dec78fe99c1b21b139adf33c1b9d52df5bf406f268bf92656b6beb559e13ea1) in 9.320s
2021-03-05 16:01:17.705+01:00 | pool-7-thread-165 | INFO  | BlockMiner | Produced #4,391,764 / 101 tx / 0 om / 7,999,036 (99.9%) gas / (0x37b349ddf0d27c825545674c20d50c66036fba3a0e915de1d02a4126ef26017e) in 3.862s
2021-03-05 16:01:20.942+01:00 | EthScheduler-Transactions-1 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.077+01:00 | EthScheduler-Transactions-1 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.083+01:00 | EthScheduler-Transactions-0 | WARN  | TransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.133+01:00 | EthScheduler-Transactions-1 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.120+01:00 | EthScheduler-Transactions-0 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.295+01:00 | EthScheduler-Transactions-1 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.296+01:00 | EthScheduler-Transactions-1 | WARN  | TransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.328+01:00 | EthScheduler-Transactions-1 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.386+01:00 | EthScheduler-Transactions-0 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.662+01:00 | EthScheduler-Transactions-0 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.715+01:00 | EthScheduler-Transactions-1 | WARN  | TransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.761+01:00 | EthScheduler-Transactions-0 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.784+01:00 | EthScheduler-Transactions-0 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.830+01:00 | EthScheduler-Transactions-0 | WARN  | TransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.831+01:00 | EthScheduler-Transactions-0 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.883+01:00 | EthScheduler-Transactions-1 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.939+01:00 | EthScheduler-Transactions-0 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:21.955+01:00 | EthScheduler-Transactions-0 | WARN  | TransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:22.039+01:00 | EthScheduler-Transactions-1 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:22.145+01:00 | EthScheduler-Transactions-1 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:22.146+01:00 | EthScheduler-Transactions-1 | WARN  | TransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:22.146+01:00 | EthScheduler-Transactions-1 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:22.284+01:00 | EthScheduler-Transactions-0 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:22.509+01:00 | EthScheduler-Transactions-0 | WARN  | TransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:22.575+01:00 | EthScheduler-Transactions-1 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:23.194+01:00 | EthScheduler-Transactions-0 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:23.354+01:00 | EthScheduler-Transactions-0 | WARN  | PendingTransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:23.615+01:00 | EthScheduler-Transactions-0 | WARN  | TransactionsMessageProcessor | 1000 expired transaction messages have been skipped.
2021-03-05 16:01:31.128+01:00 | nioEventLoopGroup-3-2 | INFO  | SyncTargetManager | Found common ancestor with peer 0x5fc7f70ab8181b80c8... at block 4391764
2021-03-05 16:01:33.020+01:00 | EthScheduler-Workers-3 | INFO  | PersistBlockTask | Imported #4,391,765 / 102 tx / 0 om / 7,978,500 (99.7%) gas / (0xf5c1b228cda5efa13b8265ba68bd1ef4aa82d9305d9ca6e83276e89ad6448e7d) in 2.340s. Peers: 150
2021-03-05 16:01:39.725+01:00 | EthScheduler-Workers-3 | INFO  | PersistBlockTask | Imported #4,391,764 / 100 tx / 0 om / 7,974,859 (99.7%) gas / (0xb36a93a579b0d298fa37f5223035cfcd1c4e7bf5fe4e686a7d62c296a0c43dbc) in 1.957s. Peers: 147

@q9f
Copy link
Member Author

q9f commented Mar 5, 2021

We have two competing heads

  • A) Afri, Besu, Nethermind
  • B) Infura, Geth
  • (unknown: Ronin)

And 2 offline validators

  • Prysm
  • POA

So we either have to manually resync Geth and/or Infura validators.

Or wait for Prysm and/or POA to online their validators.

Edit: Prysm appears to be online but stuck or resyncing?

@prestonvanloon
Copy link
Member

Our geth node appears to be resyncing very slowly after updating to geth 1.10.0. The node went offline for some time yesterday after the update due to deletion of deprecated flags and now has been syncing goerli for about 28 hours now. Still syncing state entries with no ETA.

Latest logs:

INFO [03-05|16:19:02.907] Imported new state entries               count=0    elapsed=11.569s     processed=22338000 pending=93681  trieretry=186 coderetry=0  duplicate=199 unexpected=1137554
INFO [03-05|16:20:02.942] Imported new state entries               count=966  elapsed=2.227s      processed=22338966 pending=93980  trieretry=0   coderetry=0  duplicate=199 unexpected=1137554
INFO [03-05|16:20:23.677] Imported new state entries               count=1536 elapsed=4.015s      processed=22340502 pending=95375  trieretry=0   coderetry=0  duplicate=199 unexpected=1137554
WARN [03-05|16:20:46.521] Dropping unsynced node during sync       id=f1cfce005fd2fbbe conn=dyndial addr=52.208.13.240:30303   type=Geth/v1.9.25-stable-...
INFO [03-05|16:21:41.478] Writing clean trie cache to disk         path=/ethereum/geth/triecache threads=1
INFO [03-05|16:21:50.803] Regenerated local transaction journal    transactions=0 accounts=0
INFO [03-05|16:21:50.807] Persisted the clean trie cache           path=/ethereum/geth/triecache elapsed=9.329s
INFO [03-05|16:22:23.595] Imported new state entries               count=1152 elapsed=2.567s      processed=22341654 pending=94999  trieretry=0   coderetry=0  duplicate=199 unexpected=1137554
INFO [03-05|16:24:18.922] New local node record                    seq=150 id=587b289d77568df6 ip=34.66.20.107 udp=30303 tcp=30303
INFO [03-05|16:25:02.315] Imported new state entries               count=768  elapsed=2.112s      processed=22342422 pending=95457  trieretry=0   coderetry=0  duplicate=199 unexpected=1137554

@q9f
Copy link
Member Author

q9f commented Mar 5, 2021

Ok, Igor confirmed their node is also resyncing.

@q9f
Copy link
Member Author

q9f commented Mar 5, 2021

Ok, Ronin is also resyncing.

So whoever is done with the sync first will kick off a new block

@q9f
Copy link
Member Author

q9f commented Mar 5, 2021

Screenshot at 2021-03-05 17-36-36

@shemnon
Copy link

shemnon commented Mar 5, 2021

Syncing Goerli for 28 hours? Something must be amiss. Even Mantis can full sync goerli faster than that.

@shemnon
Copy link

shemnon commented Mar 5, 2021

Do geth and besu have different fork choice rules? Do we need to soft fork to the same tie breaker?

@prestonvanloon
Copy link
Member

prestonvanloon commented Mar 5, 2021

We restarted our node and now it is syncing again from genesis... We may need to roll back to 1.9.

Edit: We'll wait a few hours to see if this syncs successfully.

@q9f
Copy link
Member Author

q9f commented Mar 5, 2021

Do geth and besu have different fork choice rules? Do we need to soft fork to the same tie breaker?

no, the problem is (once again) the race for out of turn blocks creating competing heads with the same total difficulty and the corner case that there is a deadlock situation (unable to decide on the same best head). this was discussed in EIPs 1955, 2181, 2182, and 2183.

@konradkonrad
Copy link

Yeah

Not sure if this is the right issue to track this, but an observation: https://goerli.etherscan.io/block/4391762 seems to have pushed the gas limit > 8_000_000, and just a few blocks later the network is stuck.

The gas limit is a red herring -- I skimmed through the last 12h of blocks and there were multiple streaks with a gas limit above 8mio.

@shemnon
Copy link

shemnon commented Mar 5, 2021

I think we need better fork choice rules in the protocol then. Something like (in order)

  • Highest Difficulty
  • Lowest Block Number (prefer more in-order blocks)
  • lowest block hash (some arbitrator that is deterministic and never produces ties)

These can be soft forked without any wire changes.

We may want to add something before the final arbitrator that prefer signers that are closest to in-order. But "first seen" is a always a chain halt waiting to happen.

@q9f
Copy link
Member Author

q9f commented Mar 5, 2021

I just thought the same.

Highest Difficulty
Lowest Block Number (prefer more in-order blocks)
lowest block hash (some arbitrator that is deterministic and never produces ties)

This should resolve this fairly easily. Wondering if this would even be beneficial for other consensus engines or if we should limit this to Clique only.

@roninkaizen
Copy link

running again :)

@q9f
Copy link
Member Author

q9f commented Mar 5, 2021

Thanks everyone.

@q9f q9f closed this as completed Mar 5, 2021
@q9f
Copy link
Member Author

q9f commented Mar 5, 2021

@shemnon should this be an EIP or shall we discuss this elsewhere with the teams?

@shemnon
Copy link

shemnon commented Mar 5, 2021

let's get consensus before we post an EIP.

@roninkaizen
Copy link

it´s stuck again, confirmed!

@roninkaizen roninkaizen reopened this Mar 5, 2021
@roninkaizen
Copy link

restarted OpenEthereum Goerli suprn0de#Z1 from zero-
at 0.21 cest- i hope it works to work it out like that.

@ryanschneider
Copy link

So I noticed that our geth nodes were on:

  • number: 0x4306cd hash: 0x3f97052725a19587f7f9223d9cf307797193cb4532669130a39f887a77d949c8 TD: 0x62dba9

But then reorg'ed back to:

  • number: 0x4306cc hash: 0x72c2021f3e4eadf958ca5d1456b18c9f0b0d6f7e9f9efd9b0c2dc660a03e3e44 TD: 0x62dba9

Which is a one less height but with the same TD. And then a and a new block 0x4306cd was never received (not sure if never sealed or not).

What I don't understand is why did it reorg to a lower block number with the same TD, shouldn't it only reorg once it sees a higher TD? Is this basically what Danno was suggesting in #78 (comment)?

Note that this is a normal node not a signer/validator.

@prestonvanloon
Copy link
Member

Update from Prysmatic Labs. We are having difficulty syncing with geth 1.10, we have rolled back to 1.9 and it is syncing state entries. Not sure how long it takes to sync goerli, but it has been 3h37m since we rolled back. There seem to be around 20000 state entries left.

@prestonvanloon
Copy link
Member

We are synced now and proposed a block. We will stay on 1.9 until it is required to upgrade for Berlin.

@shemnon
Copy link

shemnon commented Mar 6, 2021

Looks like geth always prefers shorter chains - https://github.com/ethereum/go-ethereum/blob/master/core/blockchain.go#L1596

@q9f
Copy link
Member Author

q9f commented Mar 6, 2021

Looks like geth always prefers shorter chains - https://github.com/ethereum/go-ethereum/blob/master/core/blockchain.go#L1596

We are stuck again but now both chains are the same length and same total difficulty. maybe we can agree on lowest hash?

@q9f
Copy link
Member Author

q9f commented Mar 6, 2021

Also wondering if it's related to Swarm filling up all blocks. (Should not, but this should also be resolved asap, currently 5k pending transactions.)

https://twitter.com/q9fmz/status/1367897982495059977

@q9f
Copy link
Member Author

q9f commented Mar 6, 2021

Looks like geth always prefers shorter chains - https://github.com/ethereum/go-ethereum/blob/master/core/blockchain.go#L1596

Would it be possible to patch Besu with such a forkchoice rule (diff, number, hash) so we could test it on Goerli?

@Eknir
Copy link

Eknir commented Mar 7, 2021

Also wondering if it's related to Swarm filling up all blocks. (Should not, but this should also be resolved asap, currently 5k pending transactions.)

Hi guys, Swarm product owner here.

The surge of blockchain transactions coming from Swarm comes from users who want to earn the right to get airdropped BZZ tokens when Swarm goes live on mainnet (see announcement). There are currently 9000+ nodes in the Swarm network, of which the vast majority is not controlled by us.

Most transactions we are seeing are from users cashing out cheques. A cheque is some sort of a layer-2 construct, allowing users to pay each other without going on L1. The cashing-out of a cheque is a L1 transaction, causing a transfer of tokens (see smart-contracts).

The airdrop-announcement is misunderstood by some users, who believe that each cheque is worth a certain amount of airdropped BZZ, whereas in fact the worth of the cheques is what will give you the right for airdropped BZZ. Because of this, the number of transactions arising from Swarm usage is much more than strictly needed.

We attempt to rectify this misunderstanding, by clearly outlining that the frequently cashing out cheques does not yield any benefit for them and by making our users aware that the Goerli network is a public resource which should be mindfully utilized. Please let me know how the situation evolves.

You can find me on the Swarm Discord as @Eknir.

@Eknir
Copy link

Eknir commented Mar 7, 2021

I have one question. To what degree is the current situation on the Goerli network caused by our users?

@q9f
Copy link
Member Author

q9f commented Mar 9, 2021

Answered in DM (not at all)

@q9f q9f closed this as completed Mar 9, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants