State trie storage #1063
Closed
Mirko-von-Leipzig
started this conversation in
Ideas
Replies: 1 comment
-
Implemented :) And even better - no reference counting required! |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Background
Starknet state is made provable using patricia-merkle tries. This trie is DAG containing two internal node variants -
Binary
andEdge
. These get used to traverse a key's path along the DAG until one arrives at the leaf node containing the value. At its heart is it a kv-store with merkle paths i.e. its hashes all the way down.Currently storing the trie's account for ~70% of our storage aka its huge.
Starknet has several such tries who's root node becomes a commitment to the data it is trie'ing over.
key = storage address
,value = storage value
.key = contract address
,value = state hash
.state hash
is a mapping involving class, contract storage commitment and nonce. It is not entirely relevant to this discussion so I won't go into more detail.key = class hash
,value = H(CLASS_LEAF_VERSION, casm hash)
Status Quo
The transaction trie is never persisted, but is calculated ephemerally for verification and then discarded.
Each other type of trie has its own table, i.e. there are three tables storing node data. This means all contract storage tries share a single table.
We also store the tries for all time. i.e. we so far have never removed any. This is mostly due to us using them for our RPC methods, however with #1038 we no longer have this requirement. This discussion is about how we can leverage this new freedom to reduce storage requirements.
Use cases
The class and storage commitment tries have one main purpose -- trustless verification of the starknet state. For this purpose we only need the very latest state trie to build on or to reorg from.
The other purpose is to provide merkle proofs. Merkle proofs are currently only used by
pathfinder_getProof
, but a more important future use-case is to provide state trie chunk proofs for nodes trying to fast sync via p2p.What do we want to provide
pathfinder_getProof
for someN
most blocks.Proposal
Nth
block. The latestN
diffs are stored in memory until it is time to persist them. Alternatively, we can persist every time some maximum diff size is achieved i.e. limit by how-much memory is used.pathfinder_getProof
archive support can be configured by specifying how quickly to prune older tries. For (1) we only need a single trie copy, but the more block data we retain, the less we have to re-calculate for older proofs. We can make this configurable as well.For chunk proofs, I think a good middle-ground is to enforce the storage of at least one trie that is proven on L1. This can then be very trusted by peers with low chance of failure. Alternatively, depending on how large L1 to L2 gap is, maybe we just rely on some
latest-N
again.Pruning & actual storage
Turns out pruning a forest of DAGs is not a lot of fun ito disk IO. Let's say you want to remove a specific trie. You start at the root node, traverse it and delete its children recursively. The issue is that these nodes may be used in tries.
Here are some known strategies that work.
pathfinder --prune
trie ID, block number, bit path
. This uses the most space, but the benefit is you can simply delete a node without any additional checks as every "duplicate" node will have its own ID and be a separate entity. This is used by juno atm.We used to have (3) for inserting (but no deleting, since that wasn't done). However I removed it, thinking we could simplify without it. Turns out we cannot 😬 I suggest we do reference counting again, with less disk IO by only persisting at selected blocks. And somehow smarter sql queries? We previously found that persisting trie data was causing huge disk IO.
Storage considerations
We should consider using a more hash-lookup specialized database for storing trie node data. However, this can be left for a future exercise, unless somehow has some good ideas right away.
The other consideration is whether storing the full path can somehow be made more storage efficient - I don't know how kv stores work wrt storing a key like
"{block_number}_{contract_address}_{node_path}" = {node_data}
, but if they hash the key to a constu64
size or something then maybe this way is even better?How does reference counting work?
Inserting or removing an instance of a state trie is a recursive procedure. Note that there is no difference between doing so for a full trie, or a subtrie within the trie -- both are just about inserting / removing a root and its children.
For insertion:
For deletion:
Note this procedure also applies if we are storing multiple such trie's in memory.
Beta Was this translation helpful? Give feedback.
All reactions