[loader] Generation counters to keep global module cache in sync #15167

georgemitenkov · 2024-11-02T15:00:57Z

Description

Previously, the global module cache assumed linear execution: blocks are executed in order, each adding and removing entries from the global cache. This assumption does not hold if:

We replay transactions. We replay chunks which are not ordered by versions:

# From prepare.txt generated by replay run.
# It seems that chunks are run out of order (versions decrease)

...
[
  "0-0 1875415027 1876000000 Replay epoch 9070 - 9071, 584974 txns starting from version 1875415027.",
  "0-46 1840383848 1841128591 Replay epoch 8978 - 8979, 744744 txns starting from version 1840383848.",
  "0-92 1809104670 1809555455 Replay epoch 8886 - 8887, 450786 txns starting from version 1809104670.",
  "0-138 1779570671 1780137777 Replay epoch 8794 - 8795, 567107 txns starting from version 1779570671.",
  ...
]
...

We may retry blocks, have non-linear history in the future, etc.

This PR adds a generation counter to the module cache, which is incremented every block. For each module in the cache, we also keep track of its generation. If a module's generation differs from the block's generation, we check that the cached entry is the same as the one in state (comparing the hash and state value metadata). If they are not equal, we fall back to the MV data structures; if they are equal, we set the module's generation to the block's generation. As a result, only the first access to each module is validated per block.
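A minimal sketch of the scheme described above, not the code in this PR. All names here (`GlobalModuleCache`, `StateEntry`, `Lookup`) are hypothetical, the hash and metadata are reduced to plain integers, and dropping a stale entry on mismatch is an assumption about how invalidation could work:

```rust
use std::collections::HashMap;

/// What the state view holds for a module: a hash of its bytes plus a
/// stand-in for state value metadata (hypothetical, simplified types).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct StateEntry {
    pub hash: u64,
    pub metadata: u64,
}

/// A cached module together with the generation it was last validated in.
struct CachedModule {
    entry: StateEntry,
    generation: u64,
}

/// Outcome of a lookup in the global cache.
#[derive(Debug, PartialEq, Eq)]
pub enum Lookup {
    /// Already validated in this block; no state check needed.
    Hit,
    /// First access in this block; the entry matched state and was re-stamped.
    Revalidated,
    /// Missing or stale; the caller falls back to the per-block (MV) path.
    Fallback,
}

#[derive(Default)]
pub struct GlobalModuleCache {
    generation: u64,
    modules: HashMap<String, CachedModule>,
}

impl GlobalModuleCache {
    /// Called once at the start of every block.
    pub fn start_block(&mut self) {
        self.generation += 1;
    }

    /// Inserts a module stamped with the current generation.
    pub fn insert(&mut self, name: &str, entry: StateEntry) {
        let generation = self.generation;
        self.modules
            .insert(name.to_string(), CachedModule { entry, generation });
    }

    /// `state` is what the block's state view holds for `name`, if anything.
    pub fn get(&mut self, name: &str, state: Option<&StateEntry>) -> Lookup {
        let current = self.generation;
        let cached = match self.modules.get_mut(name) {
            Some(cached) => cached,
            None => return Lookup::Fallback,
        };
        if cached.generation == current {
            return Lookup::Hit;
        }
        if state == Some(&cached.entry) {
            cached.generation = current;
            return Lookup::Revalidated;
        }
        // Assumption: a stale entry is simply dropped from the global cache.
        self.modules.remove(name);
        Lookup::Fallback
    }
}
```

In this sketch, a second `get` for the same module within a block returns `Lookup::Hit` without consulting state, which is where the per-block cost of validation is bounded.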

This does have a small impact on performance (e.g., ~1k TPS on no-op), but not enough to cause regressions, and it is still better than the V1 implementation when there are many modules. For now, this is sufficient to 1) pass replay, 2) protect against retries, and 3) ensure that we always sync with the state view when executing a block. The last part is actually great: it allows us to share the cache even across multiple tests, etc., because we invalidate entries which do not pass the check w.r.t. the state view.

Replay run: https://github.com/aptos-labs/aptos-core/actions/runs/11643658386. There are only a few mainnet failures; the 300+ errors, including the weird backwards compatibility error, are gone, which confirms the hypothesis.

How Has This Been Tested?

Unit tests
Replay

Key Areas to Review

Type of Change

Which Components or Systems Does This Change Impact?

Checklist

I have read and followed the CONTRIBUTING doc
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I identified and added all stakeholders and component owners affected by this change as reviewers
I tested both happy and unhappy path of the functionality
I have made corresponding changes to the documentation

trunk-io · 2024-11-02T15:01:01Z

⏱️ 150h 37m total CI duration on this PR

Slowest 15 Jobs	Cumulative Duration	Recent Runs
replay-mainnet / replay-verify (34)	4h	🟩 🟩
replay-mainnet / replay-verify (3)	3h 40m	🟥 🟩
execution-performance / single-node-performance	3h 36m	🟥 🟥 🟥 🟥 🟥 (+2 more)
single-node-performance	3h 24m	🟥 🟥 🟥 🟥 🟥 (+1 more)
replay-mainnet / replay-verify (14)	3h 24m	🟩 🟥
replay-mainnet / replay-verify (8)	3h 16m	🟩 🟩
replay-mainnet / replay-verify (4)	3h 15m	🟩 🟩
replay-mainnet / replay-verify (2)	3h 12m	🟥 🟥
replay-mainnet / replay-verify (21)	3h 12m	🟩 🟥
replay-mainnet / replay-verify (11)	3h 12m	🟩 🟩
replay-mainnet / replay-verify (10)	3h 11m	🟩 🟩
replay-mainnet / replay-verify (9)	3h 11m	🟩 🟩
replay-mainnet / replay-verify (6)	3h 10m	🟩 🟩
replay-mainnet / replay-verify (19)	3h 10m	🟩 🟩
replay-mainnet / replay-verify (7)	3h 10m	🟩 🟩


georgemitenkov · 2024-11-02T15:01:14Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

[single-node-perf] Adjust calibration for V2 loader #15175
[loader] Generation counters to keep global module cache in sync #15167 👈
[DO NOT LAND] Enable V2 loader #15155 : 1 other dependent PR (#15168 )
main

This stack of pull requests is managed by Graphite. Learn more about stacking.


github-actions · 2024-11-04T14:20:38Z

✅ Forge suite `realistic_env_max_load` success on `cbb791043967abe0d90db0d717d949a23d87ecf1`

two traffics test: inner traffic : committed: 14325.29 txn/s, latency: 2770.84 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 3300 ms), latency samples: 5446760
two traffics test : committed: 99.97 txn/s, latency: 1640.34 ms, (p50: 1400 ms, p70: 1500, p90: 1600 ms, p99: 9800 ms), latency samples: 1780
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.999, avg: 1.558", "ConsensusProposalToOrdered: max: 0.330, avg: 0.294", "ConsensusOrderedToCommit: max: 0.390, avg: 0.377", "ConsensusProposalToCommit: max: 0.680, avg: 0.671"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 1.05s no progress at version 2169650 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 8.50s no progress at version 2169648 (avg 8.50s) [limit 15].
Test Ok

github-actions · 2024-11-04T14:24:31Z

✅ Forge suite `compat` success on `1086a5e00d773704731ab84fb4ed3538613b2250` ==> `cbb791043967abe0d90db0d717d949a23d87ecf1`

Compatibility test results for 1086a5e00d773704731ab84fb4ed3538613b2250 ==> cbb791043967abe0d90db0d717d949a23d87ecf1 (PR)
1. Check liveness of validators at old version: 1086a5e00d773704731ab84fb4ed3538613b2250
compatibility::simple-validator-upgrade::liveness-check : committed: 12651.08 txn/s, latency: 2262.22 ms, (p50: 1700 ms, p70: 1900, p90: 2300 ms, p99: 23500 ms), latency samples: 500880
2. Upgrading first Validator to new version: cbb791043967abe0d90db0d717d949a23d87ecf1
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 6839.75 txn/s, latency: 4113.78 ms, (p50: 4700 ms, p70: 5000, p90: 5000 ms, p99: 5200 ms), latency samples: 123760
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 7052.56 txn/s, latency: 4616.59 ms, (p50: 5000 ms, p70: 5100, p90: 5500 ms, p99: 6200 ms), latency samples: 238520
3. Upgrading rest of first batch to new version: cbb791043967abe0d90db0d717d949a23d87ecf1
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6375.50 txn/s, latency: 4477.92 ms, (p50: 5100 ms, p70: 5400, p90: 5500 ms, p99: 5600 ms), latency samples: 114940
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 5956.98 txn/s, latency: 5355.49 ms, (p50: 5500 ms, p70: 5600, p90: 7200 ms, p99: 7500 ms), latency samples: 198960
4. upgrading second batch to new version: cbb791043967abe0d90db0d717d949a23d87ecf1
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 6376.65 txn/s, latency: 4504.71 ms, (p50: 5100 ms, p70: 5500, p90: 5900 ms, p99: 6100 ms), latency samples: 119140
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 6604.18 txn/s, latency: 4959.58 ms, (p50: 5400 ms, p70: 5900, p90: 6200 ms, p99: 6800 ms), latency samples: 229340
5. check swarm health
Compatibility test for 1086a5e00d773704731ab84fb4ed3538613b2250 ==> cbb791043967abe0d90db0d717d949a23d87ecf1 passed
Test Ok

msmouse · 2024-11-04T16:39:26Z

> We replay chunks which are not ordered by versions

No, the job ranges are run from different processes... But I guess it's still an issue we need to address

georgemitenkov · 2024-11-04T16:43:03Z

@msmouse How? The global cache within a single replay job gets speculative updates. Isn't this only possible because we execute out of order, i.e., a job executes txns 20-30, then 10-20, etc.?

msmouse · 2024-11-04T16:50:07Z

okay it's this:

aptos-core/execution/executor/src/chunk_executor/mod.rs

Line 624 in 4767333

Ok(version + 1)

If a mismatch is hit at a version, say 1026, while it's trying to verify chunk 1000-1999, it will try to replay 1027-2026 next.
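The arithmetic in this comment can be illustrated with a small sketch. This is not the chunk executor's code: only the `version + 1` step comes from the referenced line, while the function name `next_chunk`, the inclusive-range convention, and the fixed chunk length are assumptions made for illustration.

```rust
/// Hypothetical helper: given the inclusive range of the chunk being
/// verified and an optional version at which a mismatch was hit, returns
/// the inclusive range of the next chunk to replay.
pub fn next_chunk(begin: u64, end: u64, mismatch_at: Option<u64>) -> (u64, u64) {
    // Assumption: every chunk has the same length as the current one.
    let len = end - begin + 1;
    let next_begin = match mismatch_at {
        // Mirrors `Ok(version + 1)`: resume right after the mismatching version.
        Some(version) => version + 1,
        // No mismatch: continue right after the end of the chunk.
        None => end + 1,
    };
    (next_begin, next_begin + len - 1)
}
```

Under these assumptions, `next_chunk(1000, 1999, Some(1026))` yields `(1027, 2026)`, a range that overlaps versions of the chunk that was being verified when the mismatch was hit.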

georgemitenkov mentioned this pull request Nov 2, 2024

[DO NOT LAND] Enable V2 loader #15155

Open

24 tasks

georgemitenkov added the CICD:run-execution-performance-test (Run execution performance test) and CICD:run-execution-performance-full-test (Run execution performance test, full version) labels Nov 2, 2024

georgemitenkov mentioned this pull request Nov 2, 2024

[experiment] Shared cache #15168

Closed

22 tasks

georgemitenkov changed the title ~~[experiment] Generation to keep cache in sync~~ [loader] Generation counters to keep global module cache in sync Nov 2, 2024

georgemitenkov force-pushed the george/orderless-module-cache branch from 5d432d0 to 108a23d Compare November 2, 2024 18:19

georgemitenkov requested review from runtian-zhou, gelash, zekun000, igor-aptos and ziaptos November 4, 2024 11:14

georgemitenkov marked this pull request as ready for review November 4, 2024 11:15

georgemitenkov requested review from sasha8, danielxiangzl, davidiw, wrwg and vgao1996 as code owners November 4, 2024 11:15

georgemitenkov requested review from msmouse and removed request for davidiw, sasha8, wrwg and danielxiangzl November 4, 2024 11:15

georgemitenkov added 3 commits November 4, 2024 13:39

[experiment] Generation to keep cache in sync

0cd0fd2

[nits] Multiple improvements

12867e0

- Add more tests - Refactor APIs of block AptosVM - Enable global cache for e2e-move-tests

[fix] Fix global cache test to have > 0 txns

cbb7910

georgemitenkov force-pushed the george/orderless-module-cache branch from db24883 to cbb7910 Compare November 4, 2024 13:52

georgemitenkov mentioned this pull request Nov 4, 2024

[single-node-perf] Adjust calibration for V2 loader #15175

Open

22 tasks
