
Reconsider database key for trie node #4527

Open
bowenwang1996 opened this issue Jul 15, 2021 · 12 comments
Labels
A-storage (Area: storage and databases)
A-transaction-runtime (Area: transaction runtime: transaction and receipts processing, state transition, etc.)
C-discussion (Category: Discussion, leading to research or enhancement)
C-enhancement (Category: An issue proposing an enhancement or a PR with one)
T-core (Team: issues relevant to the core team)

Comments

@bowenwang1996 (Collaborator) commented Jul 15, 2021

Today, when we save a trie node, the key we use is the concatenation of the shard id and the trie node hash. This is problematic when we want to split one shard into multiple shards, because it does not allow us to simply split at the trie level; instead we need to rebuild the trie with a different shard id in the database key. The reason we decided to include the shard id in the database key in the first place was #2568, but, as @ilblackdragon suggested, this seems to be overkill.
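For reference, a minimal sketch of the current keying scheme as described above (types, names, and byte order are illustrative assumptions, not the actual nearcore code):

```rust
type ShardId = u64;
/// 32-byte hash of the serialized trie node.
type CryptoHash = [u8; 32];

/// Current scheme: key = shard_id || node_hash, so identical nodes that occur
/// in different shards are stored (and ref-counted) separately.
fn trie_node_db_key(shard_id: ShardId, node_hash: &CryptoHash) -> Vec<u8> {
    let mut key = Vec::with_capacity(8 + 32);
    key.extend_from_slice(&shard_id.to_le_bytes());
    key.extend_from_slice(node_hash);
    key
}
```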

@ilblackdragon (Member) commented Jul 16, 2021

The source of this consideration is that if we only needed to split a Merkle trie or tree that is aligned by account_id (or its hash), that would be easy to do in a single walk down the trie/tree.

Generally, ref counting is a source of huge problems across the stack (including performance). Ref counting has also made the storage code extremely complex.

For context, ref counting was introduced because two account records with the same information (for example, exactly the same balance, the same nonce, and no contract) hash to the same key.
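To make that concrete, here is a toy sketch (not the actual storage code) of hash-keyed storage with reference counts, where a plain delete would be unsafe because identical values share a key:

```rust
use std::collections::HashMap;

/// Toy hash-keyed store with reference counts. Two accounts with identical
/// contents serialize to identical bytes and therefore hash to the same key,
/// so the count tracks how many tries still reference the entry.
#[derive(Default)]
struct RefCountedStore {
    nodes: HashMap<[u8; 32], (Vec<u8>, u64)>, // hash -> (node bytes, refcount)
}

impl RefCountedStore {
    fn insert(&mut self, hash: [u8; 32], data: Vec<u8>) {
        // Identical data hashes to the same key: bump the count instead of
        // storing a second copy.
        let entry = self.nodes.entry(hash).or_insert((data, 0));
        entry.1 += 1;
    }

    fn remove(&mut self, hash: &[u8; 32]) {
        // Only physically delete when the last reference is gone; a plain
        // delete would also destroy an entry still used elsewhere.
        let now_unused = match self.nodes.get_mut(hash) {
            Some(entry) => {
                entry.1 -= 1;
                entry.1 == 0
            }
            None => false,
        };
        if now_unused {
            self.nodes.remove(hash);
        }
    }
}
```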

Two proposals were made to address this issue:

  • Remove the ref count and replace the node-data-hash key with a globally incrementing id in the underlying database.
  • Use node ids as offsets into an underlying flat file, removing RocksDB.

Also, given that this reconsideration will require database rework, we should reconsider the account layout as well. Currently account info is split across a few prefixes (e.g. 0x0 + account_id for account info, 0x1 + account_id + key for keys, etc.).
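A rough sketch of that prefixing, using the byte values from the comment above (illustrative only; the actual nearcore trie-key layout has more columns and a defined account_id separator):

```rust
/// Account info lives under prefix 0x00 followed by the account id.
fn account_info_key(account_id: &str) -> Vec<u8> {
    let mut key = vec![0x00u8];
    key.extend_from_slice(account_id.as_bytes());
    key
}

/// Per-account keys live under prefix 0x01 followed by the account id and the key.
fn account_subkey(account_id: &str, subkey: &[u8]) -> Vec<u8> {
    let mut key = vec![0x01u8];
    key.extend_from_slice(account_id.as_bytes());
    key.extend_from_slice(subkey);
    key
}
```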

Separately, we need to handle contract data, which is stored under a separate hash of the contract. We also need to figure out whether we want global contract storage (discussed somewhat here: https://gov.near.org/t/store-protocol-config-in-a-contract/2105) or shard-specific contract storage.

In general, if we are considering switching to any other kind of storage, it is best to batch all of this together and do it in a single migration. Depending on how long it takes to remap data from the old storage into the new one, if it's under a few minutes on an average server, we may want to do it in a single pass after some block, versus working out a mechanism to do it lazily or over an epoch.

@ilblackdragon (Member)

Alternatively, we should consider not maintaining a persistent tree on disk.

We would only write final data to disk; all non-final changes would be maintained in memory.
This would also dramatically improve read performance, because we could maintain a direct hash-map snapshot of the latest final data.

Also, writing to disk would not block chunk application in this case, which means it can be done in parallel.

This creates a potential attack vector: maintaining a large number of blocks without finality. One option to address this would be to make the block wait time a function of the distance to the last final block. This would allow potentially censored or slowed-down validators to propagate their approvals rather than having blocks marked as missing.
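A minimal sketch of that rule, with made-up constants, just to show the shape of the idea (not an existing nearcore parameter):

```rust
use std::time::Duration;

/// Illustrative only: the further the head is from the last final block, the
/// longer a producer waits for outstanding approvals before proceeding,
/// giving censored or slowed-down validators time to propagate them.
fn block_wait_time(head_height: u64, last_final_height: u64) -> Duration {
    const BASE_WAIT_MS: u64 = 600; // made-up baseline
    const EXTRA_PER_BLOCK_MS: u64 = 100; // made-up increment per block of distance
    const MAX_EXTRA_MS: u64 = 5_000; // cap so the wait time stays bounded

    let distance = head_height.saturating_sub(last_final_height);
    let extra = distance.saturating_mul(EXTRA_PER_BLOCK_MS).min(MAX_EXTRA_MS);
    Duration::from_millis(BASE_WAIT_MS + extra)
}
```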

@bowenwang1996 (Collaborator, Author)

Alternatively, we should consider not maintaining a persistent tree on disk. […]

This is an interesting idea, but I think it is orthogonal to the issue we have here. Even if we only write final data, we can still end up with clashes, since two shard tries can contain some trie nodes that are identical.

@ilblackdragon (Member)

but I think it is orthogonal to the issue we have here.

It can be orthogonal, but if we don't need a persistent tree on disk, there are a bunch of other things we can do to simplify the storage, and this includes removing ref counting, which would address the initial issues that led to adding shard_id to the storage key.

For example, with this we can switch to a binary Merkle tree.
We already know that the current hexary trie is too complicated, both to maintain and to verify (and proofs end up very large).
Ethereum also wants to move away from hexary, last time we talked with them (https://eips.ethereum.org/EIPS/eip-3102).

Then, given that we have a single binary Merkle tree in storage, the layout on disk can be a hash map indexed by data key. In the leaf nodes, key -> Leaf(parent_id, data), we would store the integer id of the parent.
Intermediate nodes then become just id -> Node(parent_id, left_id, left_hash, right_id, right_hash).
This way there are no collisions at the storage level and no need for reference counting.
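A hedged sketch of that layout as Rust types (names and exact fields are assumptions drawn from the description above, not existing nearcore code):

```rust
/// Globally unique, incrementing id used as the storage key for intermediate
/// nodes; keying by id instead of hash removes collisions and ref counting.
type NodeId = u64;
/// Merkle hash, kept inside nodes only for building proofs.
type Hash = [u8; 32];

/// Leaf nodes are indexed by the data key itself: key -> Leaf(parent_id, data).
struct Leaf {
    parent_id: NodeId,
    data: Vec<u8>,
}

/// Intermediate nodes: id -> Node(parent_id, left_id, left_hash, right_id, right_hash).
struct Node {
    parent_id: NodeId,
    left_id: NodeId,
    left_hash: Hash,
    right_id: NodeId,
    right_hash: Hash,
}
```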

Reads go directly to the underlying storage. Writes to storage happen when a block becomes final: a whole batch of updates is saved, in which a set of {id -> Node} and {key -> Leaf} entries is updated and some new ones are added.

Note that most of the intermediate nodes don't need to be stored on disk and can be recomputed from the in-memory data, since at this point it's known that storage access is a much bigger bottleneck than computing a bunch of hashes.

Also, splitting, merging, and removing sub-trees becomes much simpler with binary Merkle trees.

@bowenwang1996 (Collaborator, Author) commented Jul 17, 2021

@ilblackdragon if we don't have a persistent trie on disk, then we likely need to limit how far back a challenge can go to a small number of blocks. Otherwise a malicious actor could stall the chain for a long time by challenging a chunk from some 20000 blocks earlier. Another issue is that we need to prune while we process new blocks, because if we wait for garbage collection, memory usage could blow up.

we would store the integer id of the parent.

Those ids are globally unique, I assume?

@ilblackdragon (Member) commented Jul 19, 2021

Otherwise a malicious actor could stall the chain for a long time by challenging a chunk from some 20000 blocks earlier.

Well, one could say that if you can challenge a block, it's not truly final. So I agree that limiting how far back one can challenge is important. This is also true in the current implementation: we already limit challenges to the current epoch, beyond which one can't really challenge because validators may not have the state.

Another issue is that we need to prune while we process new blocks, because if we wait for garbage collection, memory usage could blow up.

Prune what exactly? Do you mean prune the in-memory trie of changes?
The changes on disk automatically remove old items and add new items as part of the change transaction (similar to how it's done today).

Those ids are globally unique, I assume?

Yes.

@bowenwang1996 (Collaborator, Author)

Prune what exactly? Do you mean prune the in-memory trie of changes?

Yes. Otherwise we would keep all the old forks in memory until the garbage collector catches up, which could potentially be exploited.

@ilblackdragon (Member)

Yes. Otherwise we would keep all the old forks in memory until the garbage collector catches up, which could potentially be exploited.

Finalization will automatically prune everything in memory that was created before the final block, including forks and uncles.
This can pretty much be just a [block_id -> state_root] map: when some block_id is finalized, all state changes for block ids between the last final block and this final block can be removed from memory. With an in-memory persistent trie this should be quick, and it can also be done in parallel with receiving and processing upcoming blocks.
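A rough sketch of that bookkeeping (types and names are assumptions for illustration, not existing nearcore code; keying by height instead of block hash is a simplification):

```rust
use std::collections::HashMap;

type BlockHeight = u64;
type StateRoot = [u8; 32];

/// In-memory changes for one non-final block: flushed to disk only if the
/// block becomes final, dropped otherwise.
struct BlockChanges {
    state_root: StateRoot,
    trie_updates: Vec<(Vec<u8>, Option<Vec<u8>>)>, // key -> new value (None = delete)
}

#[derive(Default)]
struct NonFinalState {
    by_height: HashMap<BlockHeight, Vec<BlockChanges>>, // may hold several forks per height
}

impl NonFinalState {
    /// On finality, drop everything at or below the finalized height,
    /// including forks and uncles. Writing the finalized block's changes to
    /// disk is a separate step, omitted here.
    fn on_finalized(&mut self, final_height: BlockHeight) {
        self.by_height.retain(|height, _| *height > final_height);
    }
}
```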

@bowenwang1996 (Collaborator, Author)

Finalization will automatically prune everything in memory that was created before the final block, including forks and uncles. […]

@ilblackdragon that means we will have trouble accessing state that is more than a few blocks old, which is not only problematic for RPC nodes but also poses issues for state sync.

@bowenwang1996 (Collaborator, Author)

Discussed this today with the team, and we think that in general it is a good idea to replace refcounting with global ids. One implication is that the same node may be replicated multiple times in storage instead of being stored once with a refcount. This may be problematic because today certain contracts are deployed on a lot of different accounts (multisig, for example) and they are relatively large (~300kb). Replicating them for each account could increase storage usage significantly.

@ilblackdragon (Member)

Yes, this brings up the global contract storage question: we should have a way to deploy a contract once so that it is stored across all shards and available via its hash. This would avoid the problem of replicated contracts being overcharged, and the resharding problem where you don't know which shard needs which contracts.

IMO it makes sense to draft a storage revamp NEP that includes all of these questions and resolves them via a single transition.

stale bot commented Oct 22, 2021

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.
