
Reconsider database key for trie node #4527

Open
bowenwang1996 opened this issue Jul 15, 2021 · 12 comments
Labels
A-storage (Area: storage and databases)
A-transaction-runtime (Area: transaction runtime: transaction and receipts processing, state transition, etc.)
C-discussion (Category: Discussion, leading to research or enhancement)
C-enhancement (Category: An issue proposing an enhancement or a PR with one)
T-core (Team: issues relevant to the core team)

Comments

@bowenwang1996 (Collaborator) commented Jul 15, 2021

Today, when we save a trie node, the key we use is the concatenation of the shard id and the trie node hash. This is problematic when we want to split one shard into multiple shards, because it does not allow us to simply split at the trie level; instead we need to rebuild the trie with a different shard id in the database key. The reason we decided to include the shard id in the database key in the first place was #2568, but, as @ilblackdragon suggested, this seems to be overkill.
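For reference, a minimal sketch of the current keying scheme as described above (types, names, and byte order are illustrative assumptions, not the actual nearcore code):

```rust
type ShardId = u64;
/// 32-byte hash of the serialized trie node.
type CryptoHash = [u8; 32];

/// Current scheme: key = shard_id || node_hash, so identical nodes that occur
/// in different shards are stored (and ref-counted) separately.
fn trie_node_db_key(shard_id: ShardId, node_hash: &CryptoHash) -> Vec<u8> {
    let mut key = Vec::with_capacity(8 + 32);
    key.extend_from_slice(&shard_id.to_le_bytes());
    key.extend_from_slice(node_hash);
    key
}
```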

@ilblackdragon (Member) commented Jul 16, 2021

The source of this consideration is that if we only needed to split a Merkle trie or tree that is aligned by account_id (or its hash), that would be easy to do in a single walk down the trie/tree.

Generally, ref counting is a source of huge problems across the stack (including performance). Ref counting has also made the storage code extremely complex.

For context, ref counting was introduced because two account records with the same information (for example, exactly the same balance, the same nonce, and no contract) hash to the same key.
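To make that concrete, here is a toy sketch (not the actual storage code) of hash-keyed storage with reference counts, where a plain delete would be unsafe because identical values share a key:

```rust
use std::collections::HashMap;

/// Toy hash-keyed store with reference counts. Two accounts with identical
/// contents serialize to identical bytes and therefore hash to the same key,
/// so the count tracks how many tries still reference the entry.
#[derive(Default)]
struct RefCountedStore {
    nodes: HashMap<[u8; 32], (Vec<u8>, u64)>, // hash -> (node bytes, refcount)
}

impl RefCountedStore {
    fn insert(&mut self, hash: [u8; 32], data: Vec<u8>) {
        // Identical data hashes to the same key: bump the count instead of
        // storing a second copy.
        let entry = self.nodes.entry(hash).or_insert((data, 0));
        entry.1 += 1;
    }

    fn remove(&mut self, hash: &[u8; 32]) {
        // Only physically delete when the last reference is gone; a plain
        // delete would also destroy an entry still used elsewhere.
        let now_unused = match self.nodes.get_mut(hash) {
            Some(entry) => {
                entry.1 -= 1;
                entry.1 == 0
            }
            None => false,
        };
        if now_unused {
            self.nodes.remove(hash);
        }
    }
}
```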

Two proposals were made to address this issue:

  • Remove the ref count and replace the node-data-hash key with a globally incrementing id in the underlying database.
  • Use node ids as offsets into an underlying flat file, removing RocksDB.

Also, given that this reconsideration will require database rework, we should reconsider the account layout as well. Currently account info is split across a few prefixes (e.g. 0x0 + account_id for account info, 0x1 + account_id + key for keys, etc.).
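A rough sketch of that prefixing, using the byte values from the comment above (illustrative only; the actual nearcore trie-key layout has more columns and a defined account_id separator):

```rust
/// Account info lives under prefix 0x00 followed by the account id.
fn account_info_key(account_id: &str) -> Vec<u8> {
    let mut key = vec![0x00u8];
    key.extend_from_slice(account_id.as_bytes());
    key
}

/// Per-account keys live under prefix 0x01 followed by the account id and the key.
fn account_subkey(account_id: &str, subkey: &[u8]) -> Vec<u8> {
    let mut key = vec![0x01u8];
    key.extend_from_slice(account_id.as_bytes());
    key.extend_from_slice(subkey);
    key
}
```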

Separately, we need to handle contract data, which is stored under a separate hash of the contract. We also need to figure out whether we want global contract storage (discussed somewhat here: https://gov.near.org/t/store-protocol-config-in-a-contract/2105) or shard-specific contract storage.

In general, if we are considering switching to any other kind of storage, it is best to batch all of this together and do it in a single migration. Depending on how long it takes to remap data from the old storage into the new one, if it's under a few minutes on an average server, we may want to do it in a single pass after some block, versus working out a mechanism to do it lazily or over an epoch.

@ilblackdragon (Member)

Alternatively, we should consider not maintaining a persistent tree on disk.

We would only write final data to disk; all non-final changes would be maintained in memory.
This would also dramatically improve read performance, because we could maintain a direct hash-map snapshot of the latest final data.

Also, writing to disk would not block chunk application in this case, which means it can be done in parallel.

This creates a potential attack vector: maintaining a large number of blocks without finality. One option to address this would be to make the block wait time a function of the distance to the last final block. This would allow potentially censored or slowed-down validators to propagate their approvals rather than having blocks marked as missing.
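A minimal sketch of that rule, with made-up constants, just to show the shape of the idea (not an existing nearcore parameter):

```rust
use std::time::Duration;

/// Illustrative only: the further the head is from the last final block, the
/// longer a producer waits for outstanding approvals before proceeding,
/// giving censored or slowed-down validators time to propagate them.
fn block_wait_time(head_height: u64, last_final_height: u64) -> Duration {
    const BASE_WAIT_MS: u64 = 600; // made-up baseline
    const EXTRA_PER_BLOCK_MS: u64 = 100; // made-up increment per block of distance
    const MAX_EXTRA_MS: u64 = 5_000; // cap so the wait time stays bounded

    let distance = head_height.saturating_sub(last_final_height);
    let extra = distance.saturating_mul(EXTRA_PER_BLOCK_MS).min(MAX_EXTRA_MS);
    Duration::from_millis(BASE_WAIT_MS + extra)
}
```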

@bowenwang1996 (Collaborator, Author)

Alternatively, we should consider not maintaining a persistent tree on disk. […]

This is an interesting idea, but I think it is orthogonal to the issue we have here. Even if we only write final data, we can still end up with clashes, since two shard tries can contain some trie nodes that are identical.

@ilblackdragon (Member)

but I think it is orthogonal to the issue we have here.

It can be orthogonal, but if we don't need a persistent tree on disk, there are a bunch of other things we can do to simplify the storage, and this includes removing ref counting, which would address the initial issues that led to adding shard_id to the storage key.

For example, with this we can switch to a binary Merkle tree.
We already know that the current hexary trie is too complicated, both to maintain and to verify (and proofs end up very large).
Ethereum also wants to move away from hexary, last time we talked with them (https://eips.ethereum.org/EIPS/eip-3102).

Then, given that we have a single binary Merkle tree in storage, the layout on disk can be a hash map indexed by data key. In the leaf nodes, key -> Leaf(parent_id, data), we would store the integer id of the parent.
Intermediate nodes then become just id -> Node(parent_id, left_id, left_hash, right_id, right_hash).
This way there are no collisions at the storage level and no need for reference counting.
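A hedged sketch of that layout as Rust types (names and exact fields are assumptions drawn from the description above, not existing nearcore code):

```rust
/// Globally unique, incrementing id used as the storage key for intermediate
/// nodes; keying by id instead of hash removes collisions and ref counting.
type NodeId = u64;
/// Merkle hash, kept inside nodes only for building proofs.
type Hash = [u8; 32];

/// Leaf nodes are indexed by the data key itself: key -> Leaf(parent_id, data).
struct Leaf {
    parent_id: NodeId,
    data: Vec<u8>,
}

/// Intermediate nodes: id -> Node(parent_id, left_id, left_hash, right_id, right_hash).
struct Node {
    parent_id: NodeId,
    left_id: NodeId,
    left_hash: Hash,
    right_id: NodeId,
    right_hash: Hash,
}
```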

Reads go directly to the underlying storage. Writes to storage happen when a block becomes final: a whole batch of updates is saved, in which a set of {id -> Node} and {key -> Leaf} entries is updated and some new ones are added.

Note that most of the intermediate nodes don't need to be stored on disk and can be recomputed from the in-memory data, since at this point it's known that storage access is a much bigger bottleneck than computing a bunch of hashes.

Also, splitting, merging, and removing sub-trees becomes much simpler with binary Merkle trees.

@bowenwang1996 (Collaborator, Author) commented Jul 17, 2021

@ilblackdragon if we don't have a persistent trie on disk, then we likely need to limit how far back a challenge can go to a small number of blocks. Otherwise a malicious actor could stall the chain for a long time by challenging a chunk from some 20000 blocks earlier. Another issue is that we need to prune while we process new blocks, because if we wait for garbage collection, memory usage could blow up.

we would store the integer id of the parent.

Those ids are globally unique, I assume?

@ilblackdragon (Member) commented Jul 19, 2021

Otherwise a malicious actor could stall the chain for a long time by challenging a chunk from some 20000 blocks earlier.

Well, one could say that if you can challenge a block, it's not truly final. So I agree that limiting how far back one can challenge is important. This is also true in the current implementation: we already limit challenges to the current epoch, beyond which one can't really challenge because validators may not have the state.

Another issue is that we need to prune while we process new blocks, because if we wait for garbage collection, memory usage could blow up.

Prune what exactly? Do you mean prune the in-memory trie of changes?
The changes on disk automatically remove old items and add new items as part of the change transaction (similar to how it's done today).

Those ids are globally unique, I assume?

Yes.

@bowenwang1996 (Collaborator, Author)

Prune what exactly? Do you mean prune the in-memory trie of changes?

Yes. Otherwise we would keep all the old forks in memory until the garbage collector catches up, which could potentially be exploited.

@ilblackdragon (Member)

Yes. Otherwise we would keep all the old forks in memory until the garbage collector catches up, which could potentially be exploited.

Finalization will automatically prune everything in memory that was created before the final block, including forks and uncles.
This can pretty much be just a [block_id -> state_root] map: when some block_id is finalized, all state changes for block ids between the last final block and this final block can be removed from memory. With an in-memory persistent trie this should be quick, and it can also be done in parallel with receiving and processing upcoming blocks.
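A rough sketch of that bookkeeping (types and names are assumptions for illustration, not existing nearcore code; keying by height instead of block hash is a simplification):

```rust
use std::collections::HashMap;

type BlockHeight = u64;
type StateRoot = [u8; 32];

/// In-memory changes for one non-final block: flushed to disk only if the
/// block becomes final, dropped otherwise.
struct BlockChanges {
    state_root: StateRoot,
    trie_updates: Vec<(Vec<u8>, Option<Vec<u8>>)>, // key -> new value (None = delete)
}

#[derive(Default)]
struct NonFinalState {
    by_height: HashMap<BlockHeight, Vec<BlockChanges>>, // may hold several forks per height
}

impl NonFinalState {
    /// On finality, drop everything at or below the finalized height,
    /// including forks and uncles. Writing the finalized block's changes to
    /// disk is a separate step, omitted here.
    fn on_finalized(&mut self, final_height: BlockHeight) {
        self.by_height.retain(|height, _| *height > final_height);
    }
}
```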

@bowenwang1996 (Collaborator, Author)

Finalization will automatically prune everything in memory that was created before the final block, including forks and uncles. […]

@ilblackdragon that means we will have trouble accessing state that is more than a few blocks old, which is not only problematic for RPC nodes but also poses issues for state sync.

@bowenwang1996 (Collaborator, Author)

Discussed this today with the team, and we think that in general it is a good idea to replace refcounting with global ids. One implication is that the same node may be replicated multiple times in storage instead of being stored once with a refcount. This may be problematic because today certain contracts are deployed on a lot of different accounts (multisig, for example) and they are relatively large (~300kb). Replicating them for each account could increase storage usage significantly.

@ilblackdragon (Member)

Yes, this brings up the global contract storage question: we should have a way to deploy a contract once so that it is stored across all shards and available via its hash. This would avoid the problem of replicated contracts being overcharged, and the resharding problem where you don't know which shard needs which contracts.

IMO it makes sense to draft a storage revamp NEP that includes all of these questions and resolves them via a single transition.

stale bot commented Oct 22, 2021

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.
