Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve light client time to warp sync finished #864

Open
tomaka opened this issue Jul 6, 2023 · 2 comments
Open

Improve light client time to warp sync finished #864

tomaka opened this issue Jul 6, 2023 · 2 comments

Comments

@tomaka
Copy link
Contributor

tomaka commented Jul 6, 2023

Here's an example break down of how long it takes from smoldot's initialization (when we enter the smoldot Rust code) to warp syncing being completely finished in a best case scenario: (done on top of #863)

  • Parsing the chain spec (assuming a stripped down chain spec): 30ms
  • Open first connection + various handshakes (WebSocket + encryption): 158ms
  • Peering with that node: 56ms
  • Download warp syncing proof: 85ms (in this example the proof is one fragment)
  • Verify warp syncing proof: 82ms
  • Download the hash of the runtime code to check whether it matches the one in cache: 56ms
  • Build the runtime with wasmi: 328ms
  • Download call proofs containing various storage items such as the current list of authorities: 60ms
  • Perform some runtime calls to verify what we've just downloaded: 8ms

Overall that's roughly 415ms of waiting for networking messages, and 410ms of verifying signatures and compiling the runtime (where I rely on third party libraries that I don't really want to dig into).

What we can do is:

  • Download the runtime and call proofs as soon as we have downloaded the end of the warp sync proof. We can simply assume that the proof is correct, and later discard the download if it turns out that it isn't. The call proofs can be done immediately before downloading the runtime, by assuming that the runtime is using the latest versions of all APIs. That would save 116ms in the example above.

  • WebRTC should remove one round-trip time I believe, which would save 50 to 60ms.

  • Look into why parsing the chain spec takes so long.

@tomaka
Copy link
Contributor Author

tomaka commented Aug 15, 2023

cc wasmi-labs/wasmi#732

@tomaka
Copy link
Contributor Author

tomaka commented Dec 23, 2023

With all the changes that have been done since this issue has been opened, I've just measured a 547ms initialization time with Westend.

Here are the logs:

[14:17:47.469] [smoldot] Smoldot v2.0.15
[14:17:47.484] [smoldot] Chain initialization complete for westend2. Name: "Westend". Genesis hash: 0xe143…423e. Database starting at: 0x1254…62f8 (#18797921)
[14:17:47.484] [smoldot] Chain specification of westend2 contains a non-empty `telemetryEndpoints` field. Smoldot doesn't support telemetry endpoints and as such this field is unused.
[14:17:47.484] [smoldot] Chain specification of westend2 contains a list of bad blocks. Bad blocks are not implemented in the light client. An appropriate way to silence this warning is to remove the bad blocks from the chain specification, which can safely be done:
- For relay chains: if the chain specification contains a checkpoint and that the bad blocks have a block number inferior to this checkpoint.
- For parachains: if the bad blocks have a block number inferior to the current parachain finalized block.
[14:17:47.485] [network] Chains <= AddChain(id=westend2)
[14:17:47.485] [network] Slots(westend2) ∋ 12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS
[14:17:47.485] [network] Connections(12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS) <= StartConnecting(remote_addr=/dns/westend-bootnode-1.polkadot.io/tcp/30334/ws, local_peer_id=12D3KooWKj46bPNVbD5yWbmLjG4Q4ERTRhgEsu5aU3XWpm3HFNwc)
[14:17:47.489] [network] Slots(westend2) ∋ 12D3KooWBdcVpUwnB3AgwZQXcyrvd4yzPBXtSLknvoBSHQZNoftP
[14:17:47.489] [network] Connections(12D3KooWBdcVpUwnB3AgwZQXcyrvd4yzPBXtSLknvoBSHQZNoftP) <= StartConnecting(remote_addr=/ip4/192.96.202.172/tcp/11002, local_peer_id=12D3KooWKURvBbK5f4ZFnbEfZr1A4bTQToVGuN75pisjpeZ8KcXs)
[14:17:47.490] [network] Slots(westend2) ∋ 12D3KooWJvPDCZmReU46ghpCMJCPVUvUCav4WQdKtXQhZgJdH6tZ
[14:17:47.490] [network] Connections(12D3KooWJvPDCZmReU46ghpCMJCPVUvUCav4WQdKtXQhZgJdH6tZ) <= StartConnecting(remote_addr=/dns/westend-bootnode.turboflakes.io/tcp/30410/wss, local_peer_id=12D3KooWLPyJctrzqfcvMyBGpCuKGdtiVdXaM2vQjKiDbKF8Wiiy)
[14:17:47.505] [network] Slots(westend2) ∋ 12D3KooWAM4W23Mx74bboG2PJ1NUjSDR7HJHnvpALrG5MX4tcurg
[14:17:47.506] [network] Connections(12D3KooWAM4W23Mx74bboG2PJ1NUjSDR7HJHnvpALrG5MX4tcurg) <= StartConnecting(remote_addr=/ip4/64.62.224.4/tcp/30336/ws, local_peer_id=12D3KooWMBX3pLxSPkz8Cx2U8gDQEDjvsni82wAuo57hCrVCz1nB)
[14:17:47.506] [runtime-westend2] Worker <= Reset(finalized_block: 0x1254…62f8)
[14:17:47.506] [runtime-westend2] Worker => NewDownload(block=0x1254…62f8)
[14:17:47.506] [runtime-westend2] Worker <= FailedDownload(blocks=[0x1254…62f8], error=StorageQuery(StorageQueryError { errors: [] }))
[14:17:47.506] [json-rpc-westend2] JSON-RPC => {"jsonrpc":"2.0","id":1,"method":"chainHead_unstable_finalizedDatabase","params":[]}
[14:17:47.506] [json-rpc-westend2] JSON-RPC <= {"jsonrpc":"2.0","id":1,"result":"{\"genesisHash\":\"e143f23803ac50e8f6f8e62695d1ce9e4e1d68aa36c1cd2cfd15340213f3423e\",\"chain\":{\"version\":\"1\",\"finalized_block_header\":\"ec8be824f11ef80eac4c47cbd182c6323bac3b5d322cbf1395e61d93367133ad86557b04…
[14:17:47.563] [network] Connections(12D3KooWJvPDCZmReU46ghpCMJCPVUvUCav4WQdKtXQhZgJdH6tZ) <= StartConnecting(remote_addr=/dns/westend-bootnode.turboflakes.io/tcp/30310, local_peer_id=12D3KooWNJDU9JXSfYzJn7ctDz3XtHV3E8jzSN4a2kNwfFDHtfAA)
[14:17:47.563] [network] Connections(12D3KooWJvPDCZmReU46ghpCMJCPVUvUCav4WQdKtXQhZgJdH6tZ, /dns/westend-bootnode.turboflakes.io/tcp/30410/wss) => Shutdown(handshake_finished=false)
[14:17:47.563] [network] Slots(westend2) ∌ 12D3KooWJvPDCZmReU46ghpCMJCPVUvUCav4WQdKtXQhZgJdH6tZ (reason=pre-handshake-disconnect, ban-duration=5s)
[14:17:47.563] [network] Slots(westend2) ∋ 12D3KooWJXF6rDeJHibrXXg3QAKku8AovaCgr8rsrXNqxZd3W6K3
[14:17:47.563] [network] Connections(12D3KooWJXF6rDeJHibrXXg3QAKku8AovaCgr8rsrXNqxZd3W6K3) <= StartConnecting(remote_addr=/ip4/195.144.22.130/tcp/30333, local_peer_id=12D3KooWE7Fgj3aK6tzLaZboS1vv5HDBBXurVkedwuy3uGizsCzL)
[14:17:47.610] [network] Connections(12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS, /dns/westend-bootnode-1.polkadot.io/tcp/30334/ws) => HandshakeFinished
[14:17:47.610] [network] Gossip(westend2, 12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS) <= Open
[14:17:47.652] [network] Gossip(westend2, 12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS) => Opened(best_height=18797926, best_hash=0x9d0f…d736)
[14:17:47.736] [network] Gossip(westend2, 12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS) => GrandpaNeighborPacket(round_number=1209, set_id=7251, commit_finalized_height=18797924)
[14:17:47.736] [network] Connections(12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS) <= WarpSyncRequest(chain=westend2, start=0x1254…62f8)
[14:17:47.778] [network] Connections(12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS) => WarpSyncRequest(chain=westend2, num_fragments=1, finished=true)
[14:17:47.780] [sync-service-westend2] Sync => WarpSyncFragmentVerified(sender=12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS, verified_hash=0x4612…eef8, verified_height=18797924)
[14:17:47.780] [network] Connections(12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS) <= StorageProofRequest(chain=westend2, block=0x4612…eef8)
[14:17:47.786] [network] Connections(12D3KooWJvPDCZmReU46ghpCMJCPVUvUCav4WQdKtXQhZgJdH6tZ, /dns/westend-bootnode.turboflakes.io/tcp/30310) => HandshakeFinished
[14:17:47.822] [network] Connections(12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS) => StorageProofRequest(chain=westend2, total_size=780 B)
[14:17:47.885] [sync-service-westend2] Sync => WarpSyncRuntimeBuild(success=true, duration=63.267ms)
[14:17:47.886] [network] Connections(12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS) <= CallProofRequest(westend2, 0x4612…eef8, BabeApi_configuration)
[14:17:47.928] [network] Connections(12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS) => CallProofRequest(westend2, total_size: 2.0 kiB)
[14:17:47.928] [network] Connections(12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS) <= CallProofRequest(westend2, 0x4612…eef8, BabeApi_next_epoch)
[14:17:47.930] [network] Connections(12D3KooWBdcVpUwnB3AgwZQXcyrvd4yzPBXtSLknvoBSHQZNoftP, /ip4/192.96.202.172/tcp/11002) => HandshakeFinished
[14:17:47.930] [network] Gossip(westend2, 12D3KooWBdcVpUwnB3AgwZQXcyrvd4yzPBXtSLknvoBSHQZNoftP) <= Open
[14:17:47.970] [network] Connections(12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS) => CallProofRequest(westend2, total_size: 2.1 kiB)
[14:17:47.970] [network] Connections(12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS) <= CallProofRequest(westend2, 0x4612…eef8, BabeApi_current_epoch)
[14:17:48.013] [network] Connections(12D3KooWPVPzs42GvRBShdUMtFsk4SvnByrSdWqb6aeAAHvLMSLS) => CallProofRequest(westend2, total_size: 2.1 kiB)
[14:17:48.016] [sync-service-westend2] Sync => WarpSyncBuildChainInformation(success=true)
[14:17:48.016] [sync-service-westend2] GrandPa warp sync finished to #18797924 (0x4612…eef8)
[14:17:48.016] [runtime-westend2] Worker <= Reset(finalized_block: 0x4612…eef8)
[14:17:48.016] [runtime-westend2] Finalized block runtime ready. Spec version: 1005000. Size of `:code`: 1.5 MiB.
[14:17:48.016] [runtime-westend2] Worker => RuntimeKnown(finalized_hash=0x4612…eef8)

Parsing chain spec: 15ms
Connecting to peer and performing handshake: 125ms
Opening block announces substream: 42ms
Waiting for Grandpa neighbor packet: 84ms
Downloading and verifying warp sync proof: 44ms
Downloading the runtime: 42ms
Compiling the runtime: 63ms
Downloading the call proofs: 27ms
Executing the calls: 3ms

Three things to note:

  • There's a new item "waiting for Grandpa neighor packet" that wasn't there before. This is because we now wait before we know the finalized block of a peer before sending a warp sync request, whereas before we just sent requests blindly. This was introduced in Warp sync state machine refactoring #1060. It is overall more correct and robust to do it this way. There's unfortunately not much we can do to improve this time, as it is unfortunately coming from Substrate doing some internal messages processing.
  • Downloading the call proofs is supposed to happen at the same time as the runtime is downloaded. This isn't the case because at the time when we have downloaded the runtime we haven't connected to any other peer.
  • Downloading the call proofs is supposed to happen at the same time as the runtime is compiled. This isn't the case because the sync service prioritizes processing queued things over starting new downloads. This fixable after Unify all services in full-node/light-base to the same paradigms #1382.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant