From 11728cf58f3e6b6934fc5dff43d7f28a734408aa Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Mon, 25 Jan 2021 23:26:49 +0100 Subject: [PATCH 01/21] ADR-040: Storage and SMT State Commitments --- docs/architecture/README.md | 3 +- ...r-040-storage-and-smt-state-commitments.md | 118 ++++++++++++++++++ 2 files changed, 120 insertions(+), 1 deletion(-) create mode 100644 docs/architecture/adr-040-storage-and-smt-state-commitments.md diff --git a/docs/architecture/README.md b/docs/architecture/README.md index a979f30e418c..32d27c7665be 100644 --- a/docs/architecture/README.md +++ b/docs/architecture/README.md @@ -73,4 +73,5 @@ Read about the [PROCESS](./PROCESS.md). - [ADR 028: Public Key Addresses](./adr-028-public-key-addresses.md) - [ADR 032: Typed Events](./adr-032-typed-events.md) - [ADR 035: Rosetta API Support](./adr-035-rosetta-api-support.md) -- [ADR 037: Governance Split Votes](./adr-037-gov-split-vote.md) \ No newline at end of file +- [ADR 037: Governance Split Votes](./adr-037-gov-split-vote.md) +- [ADR 040: Storage and SMT State Commitments](./adr-040-storage-and-smt-state-commitments.md) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md new file mode 100644 index 000000000000..08057c2fbf24 --- /dev/null +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -0,0 +1,118 @@ +# ADR 040: Storage and SMT State Commitments + +## Changelog + +- 2020-01-15: Draft + +## Status + +DRAFT Not Implemented + + +## Abstract + +Sparse Merke Tree (SMT) is a version of a Merkle Tree with various storage and performance optimizations. This ADR defines a separation of state commitments from data storage and the SDK transition from IAVL to SMT. + + +## Context + +Currently, Cosmos SDK uses IAVL for both state commitments and data storage. + +IAVL has effectively become an orphaned project within the Cosmos ecosystem and it's proven to be an inefficient state commitment. +In the current design, IAVL is used for both data storage and as a Merkle Tree for state commitments. IAVL is meant to be a standalone Merkelized key/value database, however it's using a KV DB engine to store all tree nodes. So, each node is stored in a separate record in the KV DB. This causes many inefficiencies and problems: + ++ Each object select requires a tree traversal from the root ++ Each edge traversal requires a DB query (nodes are not stored in a memory) ++ Creating snapshots is [expensive](https://github.com/cosmos/cosmos-sdk/issues/7215#issuecomment-684804950). It takes about 30 seconds to export less than 100 MB of state (as of March 2020). ++ Updates in IAVL may trigger tree reorganization and possible O(log(n)) hashes re-computation, which can become a CPU bottleneck. ++ The leaf structure is pretty expensive: it contains the `(key, value)` pair, additional metadata such as height, version. The entire node is hashed, and that hash is used as the key in the underlying database, [ref](https://github.com/cosmos/iavl/blob/master/docs/node/node.md +). + + +Moreover, the IAVL project lacks a support and we already see better, well adapted alternatives. Instead of optimizing the IAVL, we are looking in another solutions for both storage and state commitments. + + +## Decision + +We propose separate the concerns of state commitment (**SC**), needed for consensus, and state storage (**SS**), needed for state machine. Finally we replace IAVL with LazyLedger SMT. 
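
As a rough illustration of what the SDK would be building on, the snippet below exercises the LazyLedger SMT library directly (update, root, proof, verification). It follows the usage shown in the library's README at the time of writing; exact names and signatures may differ between versions, and this is not the proposed SDK integration, only a sketch.

```go
package main

import (
	"crypto/sha256"
	"fmt"

	"github.com/lazyledger/smt"
)

func main() {
	// Two map stores: one for tree nodes, one for values referenced by the tree.
	nodes, values := smt.NewSimpleMap(), smt.NewSimpleMap()
	tree := smt.NewSparseMerkleTree(nodes, values, sha256.New())

	// Update a key and keep the new root commitment.
	root, err := tree.Update([]byte("acc/addr1"), []byte("100atom"))
	if err != nil {
		panic(err)
	}

	// Produce a Merkle proof and verify it against the root.
	proof, err := tree.Prove([]byte("acc/addr1"))
	if err != nil {
		panic(err)
	}
	fmt.Println(smt.VerifyProof(proof, root, []byte("acc/addr1"), []byte("100atom"), sha256.New()))
}
```
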
+ + +### Decouple state commitment from storage + +Separation of storage and SMT will allow a specialization in terms of various optimization patterns. + +SMT will use it's own storage (could use the same database underneath) from the state machine store. For every `(key, value)` pair, the SMT will store `hash(key)` in a path and `hash(key, value)` in a leaf. + +For data access we propose 2 additional KV buckets: +1. B1: `key → value`: the principal object storage, used by a state machine, behind the SDK `KVStore` interface. +2. B2: `hash(key, value) → key`: an index needed to extract a value (through: B2 -> B1) having a only a Merkle Path. Recall that SMT will store `hash(key, value)` in it's leafs. +3. we could use more buckets to optimize the app usage if needed. + +Above, we propose to use KV DB. However, for state machine we could use RDBMS, which we discuss below. + + +### Requirements + +State Storage requirements: ++ range queries ++ quick (key, value) access ++ creating a snapshot + +State Commitment requirements: ++ fast updates ++ path length should be short ++ creating a snapshot ++ pruning + + +### LazyLedger SMT for State Commitment + +A Sparse Merkle tree is based on the idea of a complete Merkle tree of an intractable size. The assumption here is that as the size of the tree is intractable, there would only be a few leaf nodes with valid data blocks relative to the tree size, rendering the tree as sparse. + +TODO: ++ describe pruning and version management + +### Snapshots + +One of the Stargate core features are snapshots and fast sync. Currently this feature is implemented through IAVL. +Many underlying DB engines support snapshotting. Hence, we propose to reuse that functionality and limit the supported DB engines to ones which support snapshots (Badger, RocksDB, BoltDB) + +### Pruning + +TODO: discuss state storage pruning requirements + +## Consequences + + +### Backwards Compatibility + +This ADR doesn't introduce any SDK level API changes. + +We change a storage layout, so storage migration and a blockchain reboot is required. + +### Positive + ++ Decoupling state from state commitment introduce better engineering opportunities for further optimizations and better storage patterns. ++ Performance improvements. ++ Joining SMT based camp which has wider and proven adoption than IAVL. + +### Negative + ++ Storage migration + +### Neutral + ++ Deprecating IAVL, which is one of the core proposals of Cosmos Whitepaper. + +## Further Discussions + +### RDBMS + +Use of RDBMS instead of simple KV store for state. Use of RDBMS will require an SDK API breaking change (`KVStore` interface), will allow better data extraction and indexing solutions. Instead of saving an object as a single blob of bytes, we could save it as record in a table in the state storage layer, and as a `hash(key, protobuf(object))` in the SMT as outlined above. To verify that an object registered in RDBMS is same as the one committed to SMT, one will need to load it from RDBMS, marshal using protobuf, hash and do SMT search. 
+ + +## References + +- [IAVL What's Next?](https://github.com/cosmos/cosmos-sdk/issues/7100) +- [IAVL overview](https://docs.google.com/document/d/16Z_hW2rSAmoyMENO-RlAhQjAG3mSNKsQueMnKpmcBv0/edit#heading=h.yd2th7x3o1iv) of it's state v0.15 +- [State commitments and storage report](https://paper.dropbox.com/published/State-commitments-and-storage-review--BDvA1MLwRtOx55KRihJ5xxLbBw-KeEB7eOd11pNrZvVtqUgL3h) From 662ec91ca7e9b823d5e2b52f84d5ec19c3e6c609 Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Tue, 26 Jan 2021 12:44:30 +0100 Subject: [PATCH 02/21] Update docs/architecture/adr-040-storage-and-smt-state-commitments.md Co-authored-by: Ismail Khoffi --- docs/architecture/adr-040-storage-and-smt-state-commitments.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index 08057c2fbf24..7a7a6d0732f8 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -29,7 +29,7 @@ In the current design, IAVL is used for both data storage and as a Merkle Tree f ). -Moreover, the IAVL project lacks a support and we already see better, well adapted alternatives. Instead of optimizing the IAVL, we are looking in another solutions for both storage and state commitments. +Moreover, the IAVL project lacks support and a maintainer and we already see better and well-established alternatives. Instead of optimizing the IAVL, we are looking into other solutions for both storage and state commitments. ## Decision From 5fdbe5d1ee6692bde56b42e5c3ca5111ad248e4c Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Tue, 26 Jan 2021 12:45:17 +0100 Subject: [PATCH 03/21] Update docs/architecture/adr-040-storage-and-smt-state-commitments.md Co-authored-by: Ismail Khoffi --- docs/architecture/adr-040-storage-and-smt-state-commitments.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index 7a7a6d0732f8..6bbc8967aca5 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -39,7 +39,7 @@ We propose separate the concerns of state commitment (**SC**), needed for consen ### Decouple state commitment from storage -Separation of storage and SMT will allow a specialization in terms of various optimization patterns. +Separation of storage and commitment (by the SMT) will allow to optimize the different components according to their usage and access patterns. SMT will use it's own storage (could use the same database underneath) from the state machine store. For every `(key, value)` pair, the SMT will store `hash(key)` in a path and `hash(key, value)` in a leaf. From fa8e9e3be75c26a9418e6679ef2e005f53356d02 Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Tue, 26 Jan 2021 13:08:16 +0100 Subject: [PATCH 04/21] Added more details for snapshotting and pruning. 
--- ...r-040-storage-and-smt-state-commitments.md | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index 6bbc8967aca5..055f708b4378 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -34,7 +34,7 @@ Moreover, the IAVL project lacks support and a maintainer and we already see bet ## Decision -We propose separate the concerns of state commitment (**SC**), needed for consensus, and state storage (**SS**), needed for state machine. Finally we replace IAVL with LazyLedger SMT. +We propose separate the concerns of state commitment (**SC**), needed for consensus, and state storage (**SS**), needed for state machine. Finally we replace IAVL with [LazyLedger SMT](https://github.com/lazyledger/smt). ### Decouple state commitment from storage @@ -57,29 +57,35 @@ State Storage requirements: + range queries + quick (key, value) access + creating a snapshot ++ prunning (garbage collection) State Commitment requirements: + fast updates + path length should be short + creating a snapshot -+ pruning ++ pruning (garbage collection) ### LazyLedger SMT for State Commitment A Sparse Merkle tree is based on the idea of a complete Merkle tree of an intractable size. The assumption here is that as the size of the tree is intractable, there would only be a few leaf nodes with valid data blocks relative to the tree size, rendering the tree as sparse. -TODO: -+ describe pruning and version management ### Snapshots One of the Stargate core features are snapshots and fast sync. Currently this feature is implemented through IAVL. -Many underlying DB engines support snapshotting. Hence, we propose to reuse that functionality and limit the supported DB engines to ones which support snapshots (Badger, RocksDB, BoltDB) +Many underlying DB engines support snapshotting. Hence, we propose to reuse that functionality and limit the supported DB engines to ones which support snapshots (Badger, RocksDB, BoltDB) using a _copy on write_ mechanism. ### Pruning -TODO: discuss state storage pruning requirements +At minimum SC doesn't need to keep old versions. However we need to be able to process transactions and roll-back state updates if transaction fails. This can be done in the following way:dDuring transaction processing, we keep all state change requests (writes) in a `CacheWrapper` abstraction (as it's done today). Only when we commit on a root store, all changes are written to the the SMT. + +We can use the same approach for SM Storage. However, we need to keep few past versions (configurable by user, eg: 10 past versions every 100 blocks) in a form of snapshot. Ideally we would like to shift that functionality to a DB engine itself. + +TODO: Verify which DB engines support that. +Otherwise, the solution is to implement a sort of _mark and sweep GC_: once per defined period, a GC will start, mark old objects and prune them. This will require encoding a version mechanism in a KV store. + + ## Consequences @@ -99,6 +105,7 @@ We change a storage layout, so storage migration and a blockchain reboot is requ ### Negative + Storage migration ++ LL SMT doesn't support pruning - we will need to add and test that functionality. 
### Neutral From 864927ea71831b575dbf26ea535213781db8987d Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Tue, 26 Jan 2021 13:22:26 +0100 Subject: [PATCH 05/21] updated links and references --- .../adr-040-storage-and-smt-state-commitments.md | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index 055f708b4378..a03bf7fc5bb0 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -34,7 +34,7 @@ Moreover, the IAVL project lacks support and a maintainer and we already see bet ## Decision -We propose separate the concerns of state commitment (**SC**), needed for consensus, and state storage (**SS**), needed for state machine. Finally we replace IAVL with [LazyLedger SMT](https://github.com/lazyledger/smt). +We propose separate the concerns of state commitment (**SC**), needed for consensus, and state storage (**SS**), needed for state machine. Finally we replace IAVL with [LazyLedger SMT](https://github.com/lazyledger/smt). LazyLedger SMT is based on Diem (called jellyfish) design [*] - it uses a compute-optimised SMT by replacing subtrees with only default values with a single node (same approach is used by Ethereum2 as well). ### Decouple state commitment from storage @@ -44,7 +44,7 @@ Separation of storage and commitment (by the SMT) will allow to optimize the dif SMT will use it's own storage (could use the same database underneath) from the state machine store. For every `(key, value)` pair, the SMT will store `hash(key)` in a path and `hash(key, value)` in a leaf. For data access we propose 2 additional KV buckets: -1. B1: `key → value`: the principal object storage, used by a state machine, behind the SDK `KVStore` interface. +1. B1: `key → value`: the principal object storage, used by a state machine, behind the SDK `KVStore` interface: provides direct access by key and allows prefix iteration (KV DB backend must support it). 2. B2: `hash(key, value) → key`: an index needed to extract a value (through: B2 -> B1) having a only a Merkle Path. Recall that SMT will store `hash(key, value)` in it's leafs. 3. we could use more buckets to optimize the app usage if needed. @@ -82,7 +82,7 @@ At minimum SC doesn't need to keep old versions. However we need to be able to p We can use the same approach for SM Storage. However, we need to keep few past versions (configurable by user, eg: 10 past versions every 100 blocks) in a form of snapshot. Ideally we would like to shift that functionality to a DB engine itself. -TODO: Verify which DB engines support that. +TODO: Verify which DB engines support that. I'm pretty confident this (pruning and versioning)can and should be offloaded to a DB engine. Otherwise, the solution is to implement a sort of _mark and sweep GC_: once per defined period, a GC will start, mark old objects and prune them. This will require encoding a version mechanism in a KV store. @@ -111,6 +111,7 @@ We change a storage layout, so storage migration and a blockchain reboot is requ + Deprecating IAVL, which is one of the core proposals of Cosmos Whitepaper. + ## Further Discussions ### RDBMS @@ -120,6 +121,9 @@ Use of RDBMS instead of simple KV store for state. 
Use of RDBMS will require an ## References -- [IAVL What's Next?](https://github.com/cosmos/cosmos-sdk/issues/7100) -- [IAVL overview](https://docs.google.com/document/d/16Z_hW2rSAmoyMENO-RlAhQjAG3mSNKsQueMnKpmcBv0/edit#heading=h.yd2th7x3o1iv) of it's state v0.15 -- [State commitments and storage report](https://paper.dropbox.com/published/State-commitments-and-storage-review--BDvA1MLwRtOx55KRihJ5xxLbBw-KeEB7eOd11pNrZvVtqUgL3h) ++ [IAVL What's Next?](https://github.com/cosmos/cosmos-sdk/issues/7100) ++ [IAVL overview](https://docs.google.com/document/d/16Z_hW2rSAmoyMENO-RlAhQjAG3mSNKsQueMnKpmcBv0/edit#heading=h.yd2th7x3o1iv) of it's state v0.15 ++ [State commitments and storage report](https://paper.dropbox.com/published/State-commitments-and-storage-review--BDvA1MLwRtOx55KRihJ5xxLbBw-KeEB7eOd11pNrZvVtqUgL3h) ++ [LazyLedger SMT](https://github.com/lazyledger/smt) ++ Facebook Diem (Libra) SMT [design](https://developers.diem.com/papers/jellyfish-merkle-tree/2021-01-14.pdf) ++ [Trillian Revocation Transparency](https://github.com/google/trillian/blob/master/docs/papers/RevocationTransparency.pdf), [Trillian Verifiable Data Structures](https://github.com/google/trillian/blob/master/docs/papers/VerifiableDataStructures.pdf). From 78215b25bb540028500c278ab68fc5a75ef48497 Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Tue, 26 Jan 2021 13:26:27 +0100 Subject: [PATCH 06/21] add blockchains which already use SMT --- docs/architecture/adr-040-storage-and-smt-state-commitments.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index a03bf7fc5bb0..4359400eafd1 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -100,7 +100,7 @@ We change a storage layout, so storage migration and a blockchain reboot is requ + Decoupling state from state commitment introduce better engineering opportunities for further optimizations and better storage patterns. + Performance improvements. -+ Joining SMT based camp which has wider and proven adoption than IAVL. ++ Joining SMT based camp which has wider and proven adoption than IAVL. Example projects which decided on SMT: Ethereum2, Diem (Libra), Trillan, Tezos, LazyLedger. ### Negative From 6dd0323b26dc1ed80fbf70215bf504e09fb24d67 Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Wed, 27 Jan 2021 13:09:54 +0100 Subject: [PATCH 07/21] reorganize versioning and pruning --- ...r-040-storage-and-smt-state-commitments.md | 22 ++++++++++++++----- 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index 4359400eafd1..affd746cbd0c 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -73,18 +73,28 @@ A Sparse Merkle tree is based on the idea of a complete Merkle tree of an intrac ### Snapshots -One of the Stargate core features are snapshots and fast sync. Currently this feature is implemented through IAVL. -Many underlying DB engines support snapshotting. Hence, we propose to reuse that functionality and limit the supported DB engines to ones which support snapshots (Badger, RocksDB, BoltDB) using a _copy on write_ mechanism. 
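
These engines expose their snapshot capability behind slightly different APIs. A minimal, engine-agnostic shape of the capability this ADR relies on could look as follows; every name here is illustrative and does not refer to an existing SDK or database interface.

```go
// Iterator is a minimal range-scan cursor over a read-only view.
type Iterator interface {
	Next() bool
	Key() []byte
	Value() []byte
	Close() error
}

// StateReader is a read-only view pinned to one saved version.
type StateReader interface {
	Get(key []byte) ([]byte, error)
	Iterator(start, end []byte) (Iterator, error)
	Close() error
}

// VersionedKVStore is a hypothetical wrapper over a DB engine with
// copy-on-write snapshots (e.g. what Badger or RocksDB provide natively).
type VersionedKVStore interface {
	Set(key, value []byte) error
	// SaveVersion records the current state under a block height,
	// typically as a cheap copy-on-write snapshot inside the engine.
	SaveVersion(height int64) error
	// ReaderAt returns a read-only view pinned to a previously saved height.
	ReaderAt(height int64) (StateReader, error)
	// DeleteVersion releases a snapshot that falls outside the retention policy.
	DeleteVersion(height int64) error
}
```
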
+One of the Stargate core features are snapshots and fast sync delivered in the `/snapshot` package. Currently this feature is implemented through IAVL. +Many underlying DB engines support snapshotting. Hence, we propose to reuse that functionality and limit the supported DB engines to ones which support snapshots (Badger, RocksDB, ...) using a _copy on write_ mechanism (we can't create a full copy - it would be too big). -### Pruning +The number of snapshots should be configurable by user (eg: 10 past versions - one every 100 blocks). + +Pruning old snapshots is effectively done by DB. If DB allows to configure max number of snapshots, then we are done. Otherwise, we need to hook this mechanism into `EndBlocker`. + +### Versioning At minimum SC doesn't need to keep old versions. However we need to be able to process transactions and roll-back state updates if transaction fails. This can be done in the following way:dDuring transaction processing, we keep all state change requests (writes) in a `CacheWrapper` abstraction (as it's done today). Only when we commit on a root store, all changes are written to the the SMT. -We can use the same approach for SM Storage. However, we need to keep few past versions (configurable by user, eg: 10 past versions every 100 blocks) in a form of snapshot. Ideally we would like to shift that functionality to a DB engine itself. +We can use the same approach for SM Storage. + +#### Accessing old, committed state versions + +ABCI interface requires a support for querying data in past versions. The version is specified by a block height (so we query for an object by key `K` at a version committed in block height `H`). The query is defined in the `abci.Query` structure. The number of old versions we support for `abci.Query` is configurable. -TODO: Verify which DB engines support that. I'm pretty confident this (pruning and versioning)can and should be offloaded to a DB engine. -Otherwise, the solution is to implement a sort of _mark and sweep GC_: once per defined period, a GC will start, mark old objects and prune them. This will require encoding a version mechanism in a KV store. +TODO: Verify if we can use same mechanism as for snapshots (offloading it to DB engine): ++ what's DB storage impact - are snapshots expensive? +If this won't work, then we will integrate other mechanism discussed in https://github.com/cosmos/cosmos-sdk/discussions/8297#discussioncomment-309918. +Pruning custom versions could be done using _mark and sweep GC_: once per defined period, a GC will start, mark old objects and prune them. This will require encoding a version mechanism in a KV store. ## Consequences From 250b5ff82de47223b2ea6b656f881c9b10abfd91 Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Fri, 29 Jan 2021 02:23:54 +0100 Subject: [PATCH 08/21] Update docs/architecture/adr-040-storage-and-smt-state-commitments.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Tomasz Zdybał --- docs/architecture/adr-040-storage-and-smt-state-commitments.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index affd746cbd0c..fdeffbb1c34d 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -82,7 +82,7 @@ Pruning old snapshots is effectively done by DB. 
If DB allows to configure max n ### Versioning -At minimum SC doesn't need to keep old versions. However we need to be able to process transactions and roll-back state updates if transaction fails. This can be done in the following way:dDuring transaction processing, we keep all state change requests (writes) in a `CacheWrapper` abstraction (as it's done today). Only when we commit on a root store, all changes are written to the the SMT. +At minimum SC doesn't need to keep old versions. However we need to be able to process transactions and roll-back state updates if transaction fails. This can be done in the following way: during transaction processing, we keep all state change requests (writes) in a `CacheWrapper` abstraction (as it's done today). Only when we commit on a root store, all changes are written to the the SMT. We can use the same approach for SM Storage. From 374916f996508a4784c391689ae0362f13a97232 Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Fri, 29 Jan 2021 03:24:33 +0100 Subject: [PATCH 09/21] Update docs/architecture/adr-040-storage-and-smt-state-commitments.md Co-authored-by: Ismail Khoffi --- docs/architecture/adr-040-storage-and-smt-state-commitments.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index fdeffbb1c34d..09e9cc1ca389 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -48,7 +48,7 @@ For data access we propose 2 additional KV buckets: 2. B2: `hash(key, value) → key`: an index needed to extract a value (through: B2 -> B1) having a only a Merkle Path. Recall that SMT will store `hash(key, value)` in it's leafs. 3. we could use more buckets to optimize the app usage if needed. -Above, we propose to use KV DB. However, for state machine we could use RDBMS, which we discuss below. +Above, we propose to use KV DB. However, for the state machine, we could use an RDBMS, which we discuss below. ### Requirements From e90bf8aac6602b51f3f1b34758e2ddb4d181c7c3 Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Fri, 29 Jan 2021 02:56:12 +0100 Subject: [PATCH 10/21] adding a paragraph about state management --- .../adr-040-storage-and-smt-state-commitments.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index 09e9cc1ca389..5588a8ce1dc6 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -97,6 +97,17 @@ If this won't work, then we will integrate other mechanism discussed in https:// Pruning custom versions could be done using _mark and sweep GC_: once per defined period, a GC will start, mark old objects and prune them. This will require encoding a version mechanism in a KV store. +### Managing versions and pruning + +Number of historical versions for `abci.Query` and snapshots for fast sync is part of a node configuration, not a chain configuration. +As outlined above, snapshot and versioning feature is fully offloaded to the underlying DB engine. However, we still need to have a process to instrument the DB engine to create or remove a version. +The `rootmulti.Store` keeps track of the version number. 
The `Store.Commit` function increments the version on each call, and checks if it needs to remove old versions. We need to add support for not `IAVL` store types there. + +NOTE: `Commit` must be called exactly once per block. Otherwise we risk going out of sync for the version number and block height. + +TODO: It seams we don't need to update the `MultiStore` interface - it encapsulates a `Commiter` interface, which has the `Commit`, `SetPruning`, `GetPruning` functions. However, we may consider splitting that interface into `Committer` and `PrunningCommiter` - only the multiroot should implement `PrunningCommiter`. + + ## Consequences From 8602b3e33381e9ebf5110c41e3482bfb8bfda311 Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Thu, 25 Feb 2021 23:38:46 +0100 Subject: [PATCH 11/21] adr-40: update 'accessing old state' section --- .../adr-040-storage-and-smt-state-commitments.md | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index 5588a8ce1dc6..fe29a5f7717d 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -76,7 +76,7 @@ A Sparse Merkle tree is based on the idea of a complete Merkle tree of an intrac One of the Stargate core features are snapshots and fast sync delivered in the `/snapshot` package. Currently this feature is implemented through IAVL. Many underlying DB engines support snapshotting. Hence, we propose to reuse that functionality and limit the supported DB engines to ones which support snapshots (Badger, RocksDB, ...) using a _copy on write_ mechanism (we can't create a full copy - it would be too big). -The number of snapshots should be configurable by user (eg: 10 past versions - one every 100 blocks). +New snapshot will be created in every `EndBlocker`. The number of snapshots should be configurable by user (eg: 100 past blocks and one snapshot every 100 blocks for past 2000 blocks). Pruning old snapshots is effectively done by DB. If DB allows to configure max number of snapshots, then we are done. Otherwise, we need to hook this mechanism into `EndBlocker`. @@ -88,13 +88,11 @@ We can use the same approach for SM Storage. #### Accessing old, committed state versions -ABCI interface requires a support for querying data in past versions. The version is specified by a block height (so we query for an object by key `K` at a version committed in block height `H`). The query is defined in the `abci.Query` structure. The number of old versions we support for `abci.Query` is configurable. +One of the functional requirements is to access old state. This is done with `abci.Query` structure. The version is specified by a block height (so we query for an object by key `K` at a version committed in block height `H`). The number of old versions supported for `abci.Query` is configurable. Moreover, SDK could provide a way to directly access the state. However, a state machines shouldn't do that - since the number of snapshots is configurable, it would lead to a not deterministic execution. -TODO: Verify if we can use same mechanism as for snapshots (offloading it to DB engine): -+ what's DB storage impact - are snapshots expensive? +We validated the Snapshot mechanism for querying old state versions. 
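
A minimal sketch of how such a historical query could be served, assuming a hypothetical `openView` helper that maps a block height to the matching DB snapshot; none of these names are existing SDK APIs.

```go
// StateView is an assumed read-only handle over the snapshot taken at one height.
type StateView interface {
	Get(key []byte) ([]byte, error)
	Close() error
}

// openView is assumed to resolve a block height to the matching DB snapshot,
// or fail if that version has been pruned.
var openView func(height int64) (StateView, error)

// QueryAtHeight sketches how an abci.Query for key K at height H is answered:
// pin the snapshot for H, read the key, release the view.
func QueryAtHeight(height int64, key []byte) ([]byte, error) {
	view, err := openView(height)
	if err != nil {
		return nil, err // version not retained by this node's configuration
	}
	defer view.Close()
	return view.Get(key)
}
```
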
-If this won't work, then we will integrate other mechanism discussed in https://github.com/cosmos/cosmos-sdk/discussions/8297#discussioncomment-309918. -Pruning custom versions could be done using _mark and sweep GC_: once per defined period, a GC will start, mark old objects and prune them. This will require encoding a version mechanism in a KV store. +Pruning custom versions could be done using a Garbage Collector: once per defined period, a GC will start, and remove old snapshots. This will require encoding a version mechanism in a KV store. ### Managing versions and pruning From aedce213bd25191e6dc4082b78409b52d451c4b6 Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Fri, 23 Apr 2021 16:12:01 +0200 Subject: [PATCH 12/21] update based on all recent discussions and validations --- ...r-040-storage-and-smt-state-commitments.md | 54 +++++++++++-------- 1 file changed, 31 insertions(+), 23 deletions(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index fe29a5f7717d..7202597a516b 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -28,7 +28,6 @@ In the current design, IAVL is used for both data storage and as a Merkle Tree f + The leaf structure is pretty expensive: it contains the `(key, value)` pair, additional metadata such as height, version. The entire node is hashed, and that hash is used as the key in the underlying database, [ref](https://github.com/cosmos/iavl/blob/master/docs/node/node.md ). - Moreover, the IAVL project lacks support and a maintainer and we already see better and well-established alternatives. Instead of optimizing the IAVL, we are looking into other solutions for both storage and state commitments. @@ -61,7 +60,7 @@ State Storage requirements: State Commitment requirements: + fast updates -+ path length should be short ++ tree path should be short + creating a snapshot + pruning (garbage collection) @@ -71,39 +70,36 @@ State Commitment requirements: A Sparse Merkle tree is based on the idea of a complete Merkle tree of an intractable size. The assumption here is that as the size of the tree is intractable, there would only be a few leaf nodes with valid data blocks relative to the tree size, rendering the tree as sparse. -### Snapshots - -One of the Stargate core features are snapshots and fast sync delivered in the `/snapshot` package. Currently this feature is implemented through IAVL. -Many underlying DB engines support snapshotting. Hence, we propose to reuse that functionality and limit the supported DB engines to ones which support snapshots (Badger, RocksDB, ...) using a _copy on write_ mechanism (we can't create a full copy - it would be too big). - -New snapshot will be created in every `EndBlocker`. The number of snapshots should be configurable by user (eg: 100 past blocks and one snapshot every 100 blocks for past 2000 blocks). +### Snapshots for storage sync and versioning +s +One of the Stargate core features are snapshots and fast sync delivered in the `/snapshot` package. This feature is implemented in SDK and requires a storage support. Currently the only supported is IAVL. -Pruning old snapshots is effectively done by DB. If DB allows to configure max number of snapshots, then we are done. Otherwise, we need to hook this mechanism into `EndBlocker`. +Database snapshot is a view of DB state at a certain time or transaction. 
It's not a full copy of a database (it would be too big), usually a snapshot mechanism is based on a _copy on write_ and it allows to efficiently deliver DB state at a certain stage. +Some DB engines support snapshotting. Hence, we propose to reuse that functionality for the state sync and versioning (described below). It will the supported DB engines to ones which efficiently implement snapshots. In a final section we will discuss evaluated DBs. -### Versioning - -At minimum SC doesn't need to keep old versions. However we need to be able to process transactions and roll-back state updates if transaction fails. This can be done in the following way: during transaction processing, we keep all state change requests (writes) in a `CacheWrapper` abstraction (as it's done today). Only when we commit on a root store, all changes are written to the the SMT. +New snapshot will be created in every `EndBlocker`. The `rootmulti.Store` keeps track of the version number and implements the `MultiStore` interface. `MultiStore` encapsulates a `Commiter` interface, which has the `Commit`, `SetPruning`, `GetPruning` functions which will be used for creating and removing snapshots. The `Store.Commit` function increments the version on each call, and checks if it needs to remove old versions. We will need to update the SMT interface to implement the `Commiter` interface. +NOTE: `Commit` must be called exactly once per block. Otherwise we risk going out of sync for the version number and block height. +NOTE: For the SDK storage, we may consider splitting that interface into `Committer` and `PrunningCommiter` - only the multiroot should implement `PrunningCommiter` (cache and prefix store don't need pruning). -We can use the same approach for SM Storage. +Number of historical versions (snapshots) for `abci.Query` and fast sync is part of a node configuration, not a chain configuration. A configuration should allow to specify number of past blocks and number of past blocks modulo some number (eg: 100 past blocks and one snapshot every 100 blocks for past 2000 blocks). Archival nodes can keep all snapshots. -#### Accessing old, committed state versions +Pruning old snapshots is effectively done by DB. Whenever we update a record in SC, SMT will create a new one without removing the old one. Since we are using a snapshot for each block, we must update the mechanism and immediately remove an orphaned from the storage. This is a safe operation - snapshots will keep track of the records which should be available for past versions. -One of the functional requirements is to access old state. This is done with `abci.Query` structure. The version is specified by a block height (so we query for an object by key `K` at a version committed in block height `H`). The number of old versions supported for `abci.Query` is configurable. Moreover, SDK could provide a way to directly access the state. However, a state machines shouldn't do that - since the number of snapshots is configurable, it would lead to a not deterministic execution. +To manage the active snapshots we will either us a DB _max number of snapshots_ option (if available), or will remove snapshots in the `EndBlocker`. The latter option can be done efficiently by identifying snapshots with block height. -We validated the Snapshot mechanism for querying old state versions. +#### Accessing old state versions -Pruning custom versions could be done using a Garbage Collector: once per defined period, a GC will start, and remove old snapshots. 
This will require encoding a version mechanism in a KV store. +One of the functional requirements is to access old state. This is done through `abci.Query` structure. The version is specified by a block height (so we query for an object by a key `K` at block height `H`). The number of old versions supported for `abci.Query` is configurable. Accessing an old state is done by using available snapshots. +`abci.Query` doesn't need old state of SC. So, for efficiency, we should keep SC and SM Storage in different databases (however using the same DB engine). We will only create snapshots for +Moreover, SDK could provide a way to directly access the state. However, a state machines shouldn't do that - since the number of snapshots is configurable, it would lead to a not deterministic execution. -### Managing versions and pruning +We positively validated a snapshot mechanism for querying old state with regards to the database we evaluated. -Number of historical versions for `abci.Query` and snapshots for fast sync is part of a node configuration, not a chain configuration. -As outlined above, snapshot and versioning feature is fully offloaded to the underlying DB engine. However, we still need to have a process to instrument the DB engine to create or remove a version. -The `rootmulti.Store` keeps track of the version number. The `Store.Commit` function increments the version on each call, and checks if it needs to remove old versions. We need to add support for not `IAVL` store types there. -NOTE: `Commit` must be called exactly once per block. Otherwise we risk going out of sync for the version number and block height. +### Rollbacks -TODO: It seams we don't need to update the `MultiStore` interface - it encapsulates a `Commiter` interface, which has the `Commit`, `SetPruning`, `GetPruning` functions. However, we may consider splitting that interface into `Committer` and `PrunningCommiter` - only the multiroot should implement `PrunningCommiter`. +We need to be able to process transactions and roll-back state updates if transaction fails. This can be done in the following way: during transaction processing, we keep all state change requests (writes) in a `CacheWrapper` abstraction (as it's done today). Once we finish the block processing, in the `Endblocker`, we commit a root store - at that time, all changes are written to the SMT and to the SM Storage and a snapshot is created. ## Consequences @@ -131,8 +127,19 @@ We change a storage layout, so storage migration and a blockchain reboot is requ + Deprecating IAVL, which is one of the core proposals of Cosmos Whitepaper. +## Alternative designs. + +Most of the alternative designs were evaluated in [state commitments and storage report](https://paper.dropbox.com/published/State-commitments-and-storage-review--BDvA1MLwRtOx55KRihJ5xxLbBw-KeEB7eOd11pNrZvVtqUgL3h). + +Ethereum research published [Verkle Tire](https://notes.ethereum.org/_N1mutVERDKtqGIEYc-Flw#fnref1) - an idea of combining polynomial commitments with merkle tree in order to reduce the tree height. This concept has a very good potential, but we think it's too early to implement it. The current, SMT based design could be easily updated to the Verkle Tire once other research implement all necessary libraries. The main advantage of the design described in this ADR is the separation of state commitments from the data storage and designing a more powerful interface. 
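
To summarize the flow this ADR proposes (writes buffered during transaction processing, then flushed to state storage and state commitment on block commit), a stripped-down sketch is given below. The types and names are assumptions made for this sketch only, not existing SDK APIs; snapshot creation and pruning are left to the storage backend as discussed above.

```go
// KVWriter is the write side of the state storage (SS): key → value.
type KVWriter interface{ Set(key, value []byte) error }

// Commitment is the write side of the state commitment (SC), i.e. the SMT.
type Commitment interface {
	Update(key, value []byte) ([]byte, error)
	Root() []byte
}

type kv struct{ Key, Value []byte }

// CommitBlock flushes the writes buffered for one block into SS and SC and
// returns the new SMT root; the root is what ends up in the block header.
func CommitBlock(writes []kv, ss KVWriter, sc Commitment) ([]byte, error) {
	for _, w := range writes {
		if err := ss.Set(w.Key, w.Value); err != nil {
			return nil, err
		}
		if _, err := sc.Update(w.Key, w.Value); err != nil {
			return nil, err
		}
	}
	return sc.Root(), nil
}
```
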
+ + ## Further Discussions +### Evaluated KV Databases + +We verified existing databases KV databases for evaluating snapshot support. The following DBs provide efficient snapshot mechanism: Badger, RocksDB, [Pebbe](https://github.com/cockroachdb/pebble). DB which don't provide such support or are not production ready: boltdb, leveldb, goleveldb, membdb, lmdb. + ### RDBMS Use of RDBMS instead of simple KV store for state. Use of RDBMS will require an SDK API breaking change (`KVStore` interface), will allow better data extraction and indexing solutions. Instead of saving an object as a single blob of bytes, we could save it as record in a table in the state storage layer, and as a `hash(key, protobuf(object))` in the SMT as outlined above. To verify that an object registered in RDBMS is same as the one committed to SMT, one will need to load it from RDBMS, marshal using protobuf, hash and do SMT search. @@ -146,3 +153,4 @@ Use of RDBMS instead of simple KV store for state. Use of RDBMS will require an + [LazyLedger SMT](https://github.com/lazyledger/smt) + Facebook Diem (Libra) SMT [design](https://developers.diem.com/papers/jellyfish-merkle-tree/2021-01-14.pdf) + [Trillian Revocation Transparency](https://github.com/google/trillian/blob/master/docs/papers/RevocationTransparency.pdf), [Trillian Verifiable Data Structures](https://github.com/google/trillian/blob/master/docs/papers/VerifiableDataStructures.pdf). ++ Design and implementation [discussion](https://github.com/cosmos/cosmos-sdk/discussions/8297). From f70427948a0e250436e83e5de6f21095888b145c Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Tue, 27 Apr 2021 12:44:05 +0200 Subject: [PATCH 13/21] adding more explanation about KV interface --- .../adr-040-storage-and-smt-state-commitments.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index 7202597a516b..188e576cad41 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -35,16 +35,18 @@ Moreover, the IAVL project lacks support and a maintainer and we already see bet We propose separate the concerns of state commitment (**SC**), needed for consensus, and state storage (**SS**), needed for state machine. Finally we replace IAVL with [LazyLedger SMT](https://github.com/lazyledger/smt). LazyLedger SMT is based on Diem (called jellyfish) design [*] - it uses a compute-optimised SMT by replacing subtrees with only default values with a single node (same approach is used by Ethereum2 as well). +The storage model presented here doesn't deal with data structure nor serialization. It's a Key-Value database, where both key and value are binaries. The storage user is responsible about data serialization. ### Decouple state commitment from storage + Separation of storage and commitment (by the SMT) will allow to optimize the different components according to their usage and access patterns. -SMT will use it's own storage (could use the same database underneath) from the state machine store. For every `(key, value)` pair, the SMT will store `hash(key)` in a path and `hash(key, value)` in a leaf. +SMT will use it's own storage (could use the same database underneath) from the state machine store. 
For every `(key, value)` pair, the SMT will store `hash(key)` in a path (needed to evenly distribute keys in the tree) and `hash(key, value)` in a leaf (to bind the (key, value) pair stored in the `SS`). Since we don't know a structure of a value (in particular if it contains the key) we hash both the key and the value in the `SC` leaf. For data access we propose 2 additional KV buckets: 1. B1: `key → value`: the principal object storage, used by a state machine, behind the SDK `KVStore` interface: provides direct access by key and allows prefix iteration (KV DB backend must support it). -2. B2: `hash(key, value) → key`: an index needed to extract a value (through: B2 -> B1) having a only a Merkle Path. Recall that SMT will store `hash(key, value)` in it's leafs. +2. B2: `hash(key, value) → key`: an index needed to extract a value (through: SMT → B2 → B1) having a only a Merkle Path. Recall that SMT will store `hash(key, value)` in it's leafs. 3. we could use more buckets to optimize the app usage if needed. Above, we propose to use KV DB. However, for the state machine, we could use an RDBMS, which we discuss below. @@ -83,14 +85,14 @@ NOTE: For the SDK storage, we may consider splitting that interface into `Commit Number of historical versions (snapshots) for `abci.Query` and fast sync is part of a node configuration, not a chain configuration. A configuration should allow to specify number of past blocks and number of past blocks modulo some number (eg: 100 past blocks and one snapshot every 100 blocks for past 2000 blocks). Archival nodes can keep all snapshots. -Pruning old snapshots is effectively done by DB. Whenever we update a record in SC, SMT will create a new one without removing the old one. Since we are using a snapshot for each block, we must update the mechanism and immediately remove an orphaned from the storage. This is a safe operation - snapshots will keep track of the records which should be available for past versions. +Pruning old snapshots is effectively done by DB. Whenever we update a record in `SC`, SMT will create a new one without removing the old one. Since we are using a snapshot for each block, we must update the mechanism and immediately remove an orphaned from the storage. This is a safe operation - snapshots will keep track of the records which should be available for past versions. To manage the active snapshots we will either us a DB _max number of snapshots_ option (if available), or will remove snapshots in the `EndBlocker`. The latter option can be done efficiently by identifying snapshots with block height. #### Accessing old state versions One of the functional requirements is to access old state. This is done through `abci.Query` structure. The version is specified by a block height (so we query for an object by a key `K` at block height `H`). The number of old versions supported for `abci.Query` is configurable. Accessing an old state is done by using available snapshots. -`abci.Query` doesn't need old state of SC. So, for efficiency, we should keep SC and SM Storage in different databases (however using the same DB engine). We will only create snapshots for +`abci.Query` doesn't need old state of `SC`. So, for efficiency, we should keep `SC` and `SS` in different databases (however using the same DB engine). We will only create snapshots for Moreover, SDK could provide a way to directly access the state. However, a state machines shouldn't do that - since the number of snapshots is configurable, it would lead to a not deterministic execution. 
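
As an illustration of such a node-level setting, a retention policy along the lines of "keep every version for the last N blocks, plus one version every M blocks within some window" could be expressed as below; the structure and field names are assumptions for this sketch only.

```go
// SnapshotOptions is an illustrative node configuration (not a chain/consensus
// parameter) describing which past versions remain queryable.
type SnapshotOptions struct {
	KeepRecent int64 // keep every version for the last KeepRecent blocks
	KeepEvery  int64 // additionally keep one version every KeepEvery blocks...
	KeepWindow int64 // ...within this many past blocks (0 = keep forever)
}

// Retain reports whether the snapshot for `version` should still be kept
// when the node is at `current` block height.
func (o SnapshotOptions) Retain(version, current int64) bool {
	age := current - version
	if age <= o.KeepRecent {
		return true
	}
	if o.KeepEvery > 0 && version%o.KeepEvery == 0 {
		return o.KeepWindow == 0 || age <= o.KeepWindow
	}
	return false
}
```

For example, "100 past blocks and one snapshot every 100 blocks for past 2000 blocks" would correspond to `{KeepRecent: 100, KeepEvery: 100, KeepWindow: 2000}` in this sketch.
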
@@ -99,7 +101,7 @@ We positively validated a snapshot mechanism for querying old state with regards ### Rollbacks -We need to be able to process transactions and roll-back state updates if transaction fails. This can be done in the following way: during transaction processing, we keep all state change requests (writes) in a `CacheWrapper` abstraction (as it's done today). Once we finish the block processing, in the `Endblocker`, we commit a root store - at that time, all changes are written to the SMT and to the SM Storage and a snapshot is created. +We need to be able to process transactions and roll-back state updates if transaction fails. This can be done in the following way: during transaction processing, we keep all state change requests (writes) in a `CacheWrapper` abstraction (as it's done today). Once we finish the block processing, in the `Endblocker`, we commit a root store - at that time, all changes are written to the SMT and to the `SS` and a snapshot is created. ## Consequences From 7537c84884f24656a24797e3c163faf679979664 Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Wed, 28 Apr 2021 15:42:36 +0200 Subject: [PATCH 14/21] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Tomasz Zdybał Co-authored-by: Marko --- .../adr-040-storage-and-smt-state-commitments.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index 188e576cad41..28aa250d3af5 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -35,7 +35,7 @@ Moreover, the IAVL project lacks support and a maintainer and we already see bet We propose separate the concerns of state commitment (**SC**), needed for consensus, and state storage (**SS**), needed for state machine. Finally we replace IAVL with [LazyLedger SMT](https://github.com/lazyledger/smt). LazyLedger SMT is based on Diem (called jellyfish) design [*] - it uses a compute-optimised SMT by replacing subtrees with only default values with a single node (same approach is used by Ethereum2 as well). -The storage model presented here doesn't deal with data structure nor serialization. It's a Key-Value database, where both key and value are binaries. The storage user is responsible about data serialization. +The storage model presented here doesn't deal with data structure nor serialization. It's a Key-Value database, where both key and value are binaries. The storage user is responsible for data serialization. ### Decouple state commitment from storage @@ -46,7 +46,7 @@ SMT will use it's own storage (could use the same database underneath) from the For data access we propose 2 additional KV buckets: 1. B1: `key → value`: the principal object storage, used by a state machine, behind the SDK `KVStore` interface: provides direct access by key and allows prefix iteration (KV DB backend must support it). -2. B2: `hash(key, value) → key`: an index needed to extract a value (through: SMT → B2 → B1) having a only a Merkle Path. Recall that SMT will store `hash(key, value)` in it's leafs. +2. B2: `hash(key, value) → key`: an index needed to extract a value (through: SMT → B2 → B1) having only a Merkle Path. Recall that SMT will store `hash(key, value)` in it's leafs. 3. we could use more buckets to optimize the app usage if needed. 
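
To make the role of the two buckets concrete, the sketch below shows one possible write path and the B2 → B1 lookup used to resolve a value starting from an SMT leaf hash. The leaf-hash encoding and the in-memory maps standing in for the buckets are assumptions of this sketch, not a prescribed implementation.

```go
// hashKV stands for the SMT leaf hash, e.g. sha256 over an unambiguous
// encoding of (key, value); the exact encoding is not fixed by this ADR sketch.
var hashKV func(key, value []byte) []byte

// set keeps B1 (key → value) and B2 (hash(key,value) → key) in sync;
// the SMT in the state-commitment store is updated with the same leaf hash.
func set(b1, b2 map[string][]byte, key, value []byte) {
	b1[string(key)] = value
	b2[string(hashKV(key, value))] = key
}

// resolve follows leaf hash → B2 → B1 to recover the value that a
// Merkle path commits to.
func resolve(b1, b2 map[string][]byte, leaf []byte) (value []byte, ok bool) {
	key, ok := b2[string(leaf)]
	if !ok {
		return nil, false
	}
	value, ok = b1[string(key)]
	return value, ok
}
```
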
Above, we propose to use KV DB. However, for the state machine, we could use an RDBMS, which we discuss below. @@ -73,7 +73,6 @@ A Sparse Merkle tree is based on the idea of a complete Merkle tree of an intrac ### Snapshots for storage sync and versioning -s One of the Stargate core features are snapshots and fast sync delivered in the `/snapshot` package. This feature is implemented in SDK and requires a storage support. Currently the only supported is IAVL. Database snapshot is a view of DB state at a certain time or transaction. It's not a full copy of a database (it would be too big), usually a snapshot mechanism is based on a _copy on write_ and it allows to efficiently deliver DB state at a certain stage. From 1cc123ea9f04e99324a7bb383f86689508d69403 Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Wed, 28 Apr 2021 22:33:19 +0200 Subject: [PATCH 15/21] Apply suggestions from code review Co-authored-by: Marko --- .../adr-040-storage-and-smt-state-commitments.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index 28aa250d3af5..e421b9ae753d 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -40,7 +40,7 @@ The storage model presented here doesn't deal with data structure nor serializat ### Decouple state commitment from storage -Separation of storage and commitment (by the SMT) will allow to optimize the different components according to their usage and access patterns. +Separation of storage and commitment (by the SMT) will allow the optimization of different components according to their usage and access patterns. SMT will use it's own storage (could use the same database underneath) from the state machine store. For every `(key, value)` pair, the SMT will store `hash(key)` in a path (needed to evenly distribute keys in the tree) and `hash(key, value)` in a leaf (to bind the (key, value) pair stored in the `SS`). Since we don't know a structure of a value (in particular if it contains the key) we hash both the key and the value in the `SC` leaf. @@ -49,7 +49,7 @@ For data access we propose 2 additional KV buckets: 2. B2: `hash(key, value) → key`: an index needed to extract a value (through: SMT → B2 → B1) having only a Merkle Path. Recall that SMT will store `hash(key, value)` in it's leafs. 3. we could use more buckets to optimize the app usage if needed. -Above, we propose to use KV DB. However, for the state machine, we could use an RDBMS, which we discuss below. +Above, we propose to use a KV DB. However, for the state machine, we could use an RDBMS, which we discuss below. ### Requirements @@ -73,7 +73,7 @@ A Sparse Merkle tree is based on the idea of a complete Merkle tree of an intrac ### Snapshots for storage sync and versioning -One of the Stargate core features are snapshots and fast sync delivered in the `/snapshot` package. This feature is implemented in SDK and requires a storage support. Currently the only supported is IAVL. +One of the Stargate core features are snapshots and state sync delivered in the `/snapshot` package. This feature is implemented in SDK and requires storage support. Currently IAVL is the only supported backend. Database snapshot is a view of DB state at a certain time or transaction. 
It's not a full copy of a database (it would be too big), usually a snapshot mechanism is based on a _copy on write_ and it allows to efficiently deliver DB state at a certain stage. Some DB engines support snapshotting. Hence, we propose to reuse that functionality for the state sync and versioning (described below). It will the supported DB engines to ones which efficiently implement snapshots. In a final section we will discuss evaluated DBs. @@ -84,7 +84,7 @@ NOTE: For the SDK storage, we may consider splitting that interface into `Commit Number of historical versions (snapshots) for `abci.Query` and fast sync is part of a node configuration, not a chain configuration. A configuration should allow to specify number of past blocks and number of past blocks modulo some number (eg: 100 past blocks and one snapshot every 100 blocks for past 2000 blocks). Archival nodes can keep all snapshots. -Pruning old snapshots is effectively done by DB. Whenever we update a record in `SC`, SMT will create a new one without removing the old one. Since we are using a snapshot for each block, we must update the mechanism and immediately remove an orphaned from the storage. This is a safe operation - snapshots will keep track of the records which should be available for past versions. +Pruning old snapshots is effectively done by the database. Whenever we update a record in `SC`, SMT will create a new one without removing the old one. Since we are snapshoting each block, we update the mechanism and immediately remove an orphaned from the storage. This is a safe operation - snapshots will keep track of the records which should be available for past versions. To manage the active snapshots we will either us a DB _max number of snapshots_ option (if available), or will remove snapshots in the `EndBlocker`. The latter option can be done efficiently by identifying snapshots with block height. @@ -93,14 +93,14 @@ To manage the active snapshots we will either us a DB _max number of snapshots_ One of the functional requirements is to access old state. This is done through `abci.Query` structure. The version is specified by a block height (so we query for an object by a key `K` at block height `H`). The number of old versions supported for `abci.Query` is configurable. Accessing an old state is done by using available snapshots. `abci.Query` doesn't need old state of `SC`. So, for efficiency, we should keep `SC` and `SS` in different databases (however using the same DB engine). We will only create snapshots for -Moreover, SDK could provide a way to directly access the state. However, a state machines shouldn't do that - since the number of snapshots is configurable, it would lead to a not deterministic execution. +Moreover, SDK could provide a way to directly access the state. However, a state machines shouldn't do that - since the number of snapshots is configurable, it would lead to nondeterministic execution. We positively validated a snapshot mechanism for querying old state with regards to the database we evaluated. ### Rollbacks -We need to be able to process transactions and roll-back state updates if transaction fails. This can be done in the following way: during transaction processing, we keep all state change requests (writes) in a `CacheWrapper` abstraction (as it's done today). Once we finish the block processing, in the `Endblocker`, we commit a root store - at that time, all changes are written to the SMT and to the `SS` and a snapshot is created. 
+We need to be able to process transactions and roll-back state updates if a transaction fails. This can be done in the following way: during transaction processing, we keep all state change requests (writes) in a `CacheWrapper` abstraction (as it's done today). Once we finish the block processing, in the `Endblocker`, we commit a root store - at that time, all changes are written to the SMT and to the `SS` and a snapshot is created. ## Consequences @@ -110,7 +110,7 @@ We need to be able to process transactions and roll-back state updates if transa This ADR doesn't introduce any SDK level API changes. -We change a storage layout, so storage migration and a blockchain reboot is required. +We change the storage layout of the state machine, so a storage migration and network upgrade are required to incorporate these changes. ### Positive @@ -139,7 +139,7 @@ Ethereum research published [Verkle Tire](https://notes.ethereum.org/_N1mutVERDK ### Evaluated KV Databases -We verified existing databases KV databases for evaluating snapshot support. The following DBs provide efficient snapshot mechanism: Badger, RocksDB, [Pebbe](https://github.com/cockroachdb/pebble). DB which don't provide such support or are not production ready: boltdb, leveldb, goleveldb, membdb, lmdb. +We evaluated existing KV databases for snapshot support. The following databases provide an efficient snapshot mechanism: Badger, RocksDB, [Pebble](https://github.com/cockroachdb/pebble). Databases which don't provide such support or are not production ready: boltdb, leveldb, goleveldb, membdb, lmdb. ### RDBMS From d321dac2f3f99afc968dad2352facd18519328cb Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Wed, 28 Apr 2021 23:10:25 +0200 Subject: [PATCH 16/21] review comments --- .../adr-040-storage-and-smt-state-commitments.md | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index e421b9ae753d..5ebcd20eac47 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -33,7 +33,7 @@ Moreover, the IAVL project lacks support and a maintainer and we already see bet ## Decision -We propose separate the concerns of state commitment (**SC**), needed for consensus, and state storage (**SS**), needed for state machine. Finally we replace IAVL with [LazyLedger SMT](https://github.com/lazyledger/smt). LazyLedger SMT is based on Diem (called jellyfish) design [*] - it uses a compute-optimised SMT by replacing subtrees with only default values with a single node (same approach is used by Ethereum2 as well). +We propose to separate the concerns of state commitment (**SC**), needed for consensus, and state storage (**SS**), needed for the state machine. Finally we replace IAVL with [LazyLedger's SMT](https://github.com/lazyledger/smt). LazyLedger SMT is based on Diem (called jellyfish) design [*] - it uses a compute-optimised SMT by replacing subtrees with only default values with a single node (same approach is used by Ethereum2 as well). The storage model presented here doesn't deal with data structure nor serialization. It's a Key-Value database, where both key and value are binaries. The storage user is responsible for data serialization.
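To make the proposed split concrete, here is a minimal Go sketch of how the two concerns could be expressed as separate components behind a root store. The names (`StateStore`, `StateCommitment`, `RootStore`, `Iterator`) and method sets are illustrative assumptions, not the SDK API defined by this ADR:

```go
// Illustrative sketch only: possible interfaces for the decoupled state
// storage (SS) and state commitment (SC) described in this ADR.
package store

// Iterator abstracts range queries over SS (needed for prefix iteration).
type Iterator interface {
	Next() bool
	Key() []byte
	Value() []byte
	Close() error
}

// StateStore (SS) is a plain key-value store with binary keys and values.
type StateStore interface {
	Get(key []byte) ([]byte, error)
	Set(key, value []byte) error
	Delete(key []byte) error
	Iterate(start, end []byte) (Iterator, error)
}

// StateCommitment (SC) is the SMT used only for consensus commitments.
type StateCommitment interface {
	Update(key, value []byte) error // stores hash(key, value) at path hash(key)
	Delete(key []byte) error
	Root() []byte // commitment root reported to consensus
}

// RootStore applies every write to both components.
type RootStore struct {
	ss StateStore
	sc StateCommitment
}

func (rs *RootStore) Set(key, value []byte) error {
	if err := rs.ss.Set(key, value); err != nil {
		return err
	}
	return rs.sc.Update(key, value)
}
```

Keeping `SS` and `SC` behind independent interfaces is what allows each backend to be tuned, snapshotted, or replaced without touching the other.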
@@ -44,7 +44,7 @@ Separation of storage and commitment (by the SMT) will allow the optimization of SMT will use it's own storage (could use the same database underneath) from the state machine store. For every `(key, value)` pair, the SMT will store `hash(key)` in a path (needed to evenly distribute keys in the tree) and `hash(key, value)` in a leaf (to bind the (key, value) pair stored in the `SS`). Since we don't know a structure of a value (in particular if it contains the key) we hash both the key and the value in the `SC` leaf. -For data access we propose 2 additional KV buckets: +For data access we propose 2 additional KV buckets (namespaces for the key-value pairs, sometimes called [column family](https://github.com/facebook/rocksdb/wiki/Terminology)): 1. B1: `key → value`: the principal object storage, used by a state machine, behind the SDK `KVStore` interface: provides direct access by key and allows prefix iteration (KV DB backend must support it). 2. B2: `hash(key, value) → key`: an index needed to extract a value (through: SMT → B2 → B1) having only a Merkle Path. Recall that SMT will store `hash(key, value)` in it's leafs. 3. we could use more buckets to optimize the app usage if needed. @@ -73,6 +73,7 @@ A Sparse Merkle tree is based on the idea of a complete Merkle tree of an intrac ### Snapshots for storage sync and versioning + One of the Stargate core features are snapshots and state sync delivered in the `/snapshot` package. This feature is implemented in SDK and requires storage support. Currently IAVL is the only supported backend. Database snapshot is a view of DB state at a certain time or transaction. It's not a full copy of a database (it would be too big), usually a snapshot mechanism is based on a _copy on write_ and it allows to efficiently deliver DB state at a certain stage. @@ -82,21 +83,24 @@ New snapshot will be created in every `EndBlocker`. The `rootmulti.Store` keeps NOTE: `Commit` must be called exactly once per block. Otherwise we risk going out of sync for the version number and block height. NOTE: For the SDK storage, we may consider splitting that interface into `Committer` and `PrunningCommiter` - only the multiroot should implement `PrunningCommiter` (cache and prefix store don't need pruning). -Number of historical versions (snapshots) for `abci.Query` and fast sync is part of a node configuration, not a chain configuration. A configuration should allow to specify number of past blocks and number of past blocks modulo some number (eg: 100 past blocks and one snapshot every 100 blocks for past 2000 blocks). Archival nodes can keep all snapshots. +Number of historical versions (snapshots) for `abci.Query` and fast sync is part of a node configuration, not a chain configuration (configuration implied by the blockchain consensus). A configuration should allow to specify number of past blocks and number of past blocks modulo some number (eg: 100 past blocks and one snapshot every 100 blocks for past 2000 blocks). Archival nodes can keep all snapshots. -Pruning old snapshots is effectively done by the database. Whenever we update a record in `SC`, SMT will create a new one without removing the old one. Since we are snapshoting each block, we update the mechanism and immediately remove an orphaned from the storage. This is a safe operation - snapshots will keep track of the records which should be available for past versions. +Pruning old snapshots is effectively done by a database. 
Whenever we update a record in `SC`, SMT will create a new one without removing the old one. Since we are snapshoting each block, we update the mechanism and immediately remove an orphaned from the storage. This is a safe operation - snapshots will keep track of the records which should be available for past versions. To manage the active snapshots we will either us a DB _max number of snapshots_ option (if available), or will remove snapshots in the `EndBlocker`. The latter option can be done efficiently by identifying snapshots with block height. #### Accessing old state versions One of the functional requirements is to access old state. This is done through `abci.Query` structure. The version is specified by a block height (so we query for an object by a key `K` at block height `H`). The number of old versions supported for `abci.Query` is configurable. Accessing an old state is done by using available snapshots. -`abci.Query` doesn't need old state of `SC`. So, for efficiency, we should keep `SC` and `SS` in different databases (however using the same DB engine). We will only create snapshots for +`abci.Query` doesn't need old state of `SC`. So, for efficiency, we should keep `SC` and `SS` in different databases (however using the same DB engine). Moreover, SDK could provide a way to directly access the state. However, a state machines shouldn't do that - since the number of snapshots is configurable, it would lead to nondeterministic execution. -We positively validated a snapshot mechanism for querying old state with regards to the database we evaluated. +We positively [validated](https://github.com/cosmos/cosmos-sdk/discussions/8297) a snapshot mechanism for querying old state with regards to the database we evaluated. + +### State Proofs +For any object stored in State Store (SS), we have a corresponding object in `SC`. A proof for object `V` identified by a key `K` is a branch of `SC`, where the path corresponds to the key `hash(K)`, and the leaf is `hash(K, V)`. ### Rollbacks From 80d01220985ebf94ca3584db6c800e55092e271f Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Wed, 28 Apr 2021 23:57:58 +0200 Subject: [PATCH 17/21] adding paragraph about committing to an object without storing it --- .../adr-040-storage-and-smt-state-commitments.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index 5ebcd20eac47..98f08e0d508d 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -107,6 +107,12 @@ For any object stored in State Store (SS), we have corresponding object in `SC`. We need to be able to process transactions and roll-back state updates if a transaction fails. This can be done in the following way: during transaction processing, we keep all state change requests (writes) in a `CacheWrapper` abstraction (as it's done today). Once we finish the block processing, in the `Endblocker`, we commit a root store - at that time, all changes are written to the SMT and to the `SS` and a snapshot is created. +### Committing to an object without saving it + +We identified use cases where modules will need to save an object commitment without storing the object itself. Sometimes clients receive complex objects, and they have no way to prove the correctness of such an object without knowing the storage layout.
For those use cases it would be easier to commit to the object without storing it directly. + + + ## Consequences @@ -149,6 +155,10 @@ We verified existing databases KV databases for evaluating snapshot support. The ### RDBMS Use of RDBMS instead of simple KV store for state. Use of RDBMS will require an SDK API breaking change (`KVStore` interface), will allow better data extraction and indexing solutions. Instead of saving an object as a single blob of bytes, we could save it as record in a table in the state storage layer, and as a `hash(key, protobuf(object))` in the SMT as outlined above. To verify that an object registered in RDBMS is same as the one committed to SMT, one will need to load it from RDBMS, marshal using protobuf, hash and do SMT search. +### Off Chain Store + +We were discussing a use case where modules can use a support database, which is not automatically committed. The module will be responsible for having a sound storage model and can optionally use the feature discussed in the _Committing to an object without saving it_ section. + ## References From 962a28bc9a779aa1487e54c8cdcd12f27a8d9915 Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Fri, 30 Apr 2021 13:18:22 +0200 Subject: [PATCH 18/21] review updates --- ...r-040-storage-and-smt-state-commitments.md | 22 ++++++++++++++------ 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index 98f08e0d508d..66a2ef13f3a3 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -11,21 +11,21 @@ DRAFT Not Implemented ## Abstract -Sparse Merke Tree (SMT) is a version of a Merkle Tree with various storage and performance optimizations. This ADR defines a separation of state commitments from data storage and the SDK transition from IAVL to SMT. +Sparse Merkle Tree ([SMT](https://osf.io/8mcnh/)) is a version of a Merkle Tree with various storage and performance optimizations. This ADR defines a separation of state commitments from data storage and the SDK transition from IAVL to SMT. ## Context -Currently, Cosmos SDK uses IAVL for both state commitments and data storage. +Currently, Cosmos SDK uses IAVL for both state [commitments](https://cryptography.fandom.com/wiki/Commitment_scheme) and data storage. -IAVL has effectively become an orphaned project within the Cosmos ecosystem and it's proven to be an inefficient state commitment. +IAVL has effectively become an orphaned project within the Cosmos ecosystem and it has proven to be an inefficient state commitment data structure. In the current design, IAVL is used for both data storage and as a Merkle Tree for state commitments. IAVL is meant to be a standalone Merkelized key/value database, however it's using a KV DB engine to store all tree nodes. So, each node is stored in a separate record in the KV DB. This causes many inefficiencies and problems: -+ Each object select requires a tree traversal from the root -+ Each edge traversal requires a DB query (nodes are not stored in a memory) ++ Each object query requires a tree traversal from the root. Subsequent queries for the same object are cached on the SDK level. ++ Each edge traversal requires a DB query. + Creating snapshots is [expensive](https://github.com/cosmos/cosmos-sdk/issues/7215#issuecomment-684804950). It takes about 30 seconds to export less than 100 MB of state (as of March 2020).
+ Updates in IAVL may trigger tree reorganization and possible O(log(n)) hashes re-computation, which can become a CPU bottleneck. -+ The leaf structure is pretty expensive: it contains the `(key, value)` pair, additional metadata such as height, version. The entire node is hashed, and that hash is used as the key in the underlying database, [ref](https://github.com/cosmos/iavl/blob/master/docs/node/node.md ++ The node structure is pretty expensive - it contains standard tree node elements (key, value, left and right element) and additional metadata such as height, version (which is not required by the SDK). The entire node is hashed, and that hash is used as the key in the underlying database, [ref](https://github.com/cosmos/iavl/blob/master/docs/node/node.md ). Moreover, the IAVL project lacks support and a maintainer and we already see better and well-established alternatives. Instead of optimizing the IAVL, we are looking into other solutions for both storage and state commitments. @@ -33,7 +33,7 @@ Moreover, the IAVL project lacks support and a maintainer and we already see bet ## Decision -We propose to separate the concerns of state commitment (**SC**), needed for consensus, and state storage (**SS**), needed for state machine. Finally we replace IAVL with [LazyLedgers' SMT](https://github.com/lazyledger/smt). LazyLedger SMT is based on Diem (called jellyfish) design [*] - it uses a compute-optimised SMT by replacing subtrees with only default values with a single node (same approach is used by Ethereum2 as well). +We propose to separate the concerns of state commitment (**SC**), needed for consensus, and state storage (**SS**), needed for the state machine. Finally we replace IAVL with [LazyLedger's SMT](https://github.com/lazyledger/smt). LazyLedger SMT is based on Diem (called jellyfish) design [*] - it uses a compute-optimised SMT by replacing subtrees with only default values with a single node (same approach is used by Ethereum2) and implements compact proofs. The storage model presented here doesn't deal with data structure nor serialization. It's a Key-Value database, where both key and value are binaries. The storage user is responsible for data serialization. @@ -42,7 +42,9 @@ The storage model presented here doesn't deal with data structure nor serializat Separation of storage and commitment (by the SMT) will allow the optimization of different components according to their usage and access patterns. -SMT will use it's own storage (could use the same database underneath) from the state machine store. For every `(key, value)` pair, the SMT will store `hash(key)` in a path (needed to evenly distribute keys in the tree) and `hash(key, value)` in a leaf (to bind the (key, value) pair stored in the `SS`). Since we don't know a structure of a value (in particular if it contains the key) we hash both the key and the value in the `SC` leaf. +`SC` (SMT) is used to commit to data and compute Merkle proofs. `SS` is used to directly access data. To avoid collisions, both `SS` and `SC` will use a separate storage namespace (they could use the same database underneath). `SS` will store each `(key, value)` pair directly (map key → value). +SMT is a Merkle tree structure: we don't store keys directly. For every `(key, value)` pair, `hash(key)` is stored in a path (we hash the key to evenly distribute keys in the tree) and `hash(key, value)` in a leaf. Since we don't know the structure of a value (in particular whether it contains the key), we hash both the key and the value in the `SC` leaf.
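The mapping described above can be illustrated with a small, self-contained sketch. SHA-256 and the naive `hash(key || value)` concatenation below are assumptions made only for illustration; the concrete byte-level leaf and path encodings are defined by the SMT implementation:

```go
// Illustrative sketch (not SDK code) of how a (key, value) pair maps to the
// two stores described above, assuming SHA-256 as the hash function.
package main

import (
	"crypto/sha256"
	"fmt"
)

// smtPath is where the pair lives in the SMT: hash(key) spreads keys evenly.
func smtPath(key []byte) [32]byte { return sha256.Sum256(key) }

// smtLeaf binds the pair committed in SC to the record kept in SS.
func smtLeaf(key, value []byte) [32]byte {
	return sha256.Sum256(append(append([]byte{}, key...), value...))
}

func main() {
	key, value := []byte("balances/addr1"), []byte("100atom")

	// SS: direct map key -> value (fast reads, range queries).
	ss := map[string][]byte{string(key): value}

	// SC: SMT leaf hash(key, value) stored at path hash(key).
	fmt.Printf("SS record: %q -> %q\n", key, ss[string(key)])
	fmt.Printf("SC path:   %x\n", smtPath(key))
	fmt.Printf("SC leaf:   %x\n", smtLeaf(key, value))
}
```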
For data access we propose 2 additional KV buckets (namespaces for the key-value pairs, sometimes called [column family](https://github.com/facebook/rocksdb/wiki/Terminology)): 1. B1: `key → value`: the principal object storage, used by a state machine, behind the SDK `KVStore` interface: provides direct access by key and allows prefix iteration (KV DB backend must support it). @@ -85,7 +87,7 @@ NOTE: For the SDK storage, we may consider splitting that interface into `Commit Number of historical versions (snapshots) for `abci.Query` and fast sync is part of a node configuration, not a chain configuration (configuration implied by the blockchain consensus). A configuration should allow to specify number of past blocks and number of past blocks modulo some number (eg: 100 past blocks and one snapshot every 100 blocks for past 2000 blocks). Archival nodes can keep all snapshots. -Pruning old snapshots is effectively done by a database. Whenever we update a record in `SC`, SMT will create a new one without removing the old one. Since we are snapshoting each block, we update the mechanism and immediately remove an orphaned from the storage. This is a safe operation - snapshots will keep track of the records which should be available for past versions. +Pruning old snapshots is effectively done by a database. Whenever we update a record in `SC`, SMT won't update nodes - instead it create new nodes on the update path, without removing the old one. Since we are snapshoting each block, we need to update that mechanism to immediately remove orphaned nodes from the storage. This is a safe operation - snapshots will keep track of the records which should be available for past versions. To manage the active snapshots we will either us a DB _max number of snapshots_ option (if available), or will remove snapshots in the `EndBlocker`. The latter option can be done efficiently by identifying snapshots with block height. @@ -120,7 +122,7 @@ We identified use-cases, where modules will need to save an object commitment wi This ADR doesn't introduce any SDK level API changes. -We change the storage layout of the state machine, a storage migration and network upgrade is required to incorporate these changes. +We change the storage layout of the state machine, a storage hard fork and network upgrade is required to incorporate these changes. SMT provides a merkle proof functionality, however it is not compatible with ICS23. Updating the proofs for ICS23 compatibility is required. ### Positive From bb897980ba68d4eb03b2fad89f6464a5fab3f3c0 Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Wed, 5 May 2021 18:15:21 +0200 Subject: [PATCH 19/21] Apply suggestions from code review Co-authored-by: Roy Crihfield <30845198+roysc@users.noreply.github.com> --- .../adr-040-storage-and-smt-state-commitments.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index 66a2ef13f3a3..d4358181e98a 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -81,11 +81,11 @@ One of the Stargate core features are snapshots and state sync delivered in the Database snapshot is a view of DB state at a certain time or transaction. 
It's not a full copy of a database (it would be too big), usually a snapshot mechanism is based on a _copy on write_ and it allows to efficiently deliver DB state at a certain stage. Some DB engines support snapshotting. Hence, we propose to reuse that functionality for the state sync and versioning (described below). It will the supported DB engines to ones which efficiently implement snapshots. In a final section we will discuss evaluated DBs. -New snapshot will be created in every `EndBlocker`. The `rootmulti.Store` keeps track of the version number and implements the `MultiStore` interface. `MultiStore` encapsulates a `Commiter` interface, which has the `Commit`, `SetPruning`, `GetPruning` functions which will be used for creating and removing snapshots. The `Store.Commit` function increments the version on each call, and checks if it needs to remove old versions. We will need to update the SMT interface to implement the `Commiter` interface. +A new state sync snapshot will be created in every `EndBlocker`. The `rootmulti.Store` keeps track of the version number and implements the `CommitMultiStore` interface. `CommitMultiStore` encapsulates a `Committer` interface, which has the `Commit`, `SetPruning`, `GetPruning` functions which will be used for creating and removing snapshots. The `Store.Commit` function increments the version on each call, and checks if it needs to remove old versions. We will need to update the SMT interface to implement the `Committer` interface. NOTE: `Commit` must be called exactly once per block. Otherwise we risk going out of sync for the version number and block height. -NOTE: For the SDK storage, we may consider splitting that interface into `Committer` and `PrunningCommiter` - only the multiroot should implement `PrunningCommiter` (cache and prefix store don't need pruning). +NOTE: For the SDK storage, we may consider splitting that interface into `Committer` and `PruningCommitter` - only the multiroot should implement `PruningCommitter` (cache and prefix store don't need pruning). -Number of historical versions (snapshots) for `abci.Query` and fast sync is part of a node configuration, not a chain configuration (configuration implied by the blockchain consensus). A configuration should allow to specify number of past blocks and number of past blocks modulo some number (eg: 100 past blocks and one snapshot every 100 blocks for past 2000 blocks). Archival nodes can keep all snapshots. +Number of historical versions for `abci.Query` and state sync snapshots is part of a node configuration, not a chain configuration (configuration implied by the blockchain consensus). A configuration should allow to specify number of past blocks and number of past blocks modulo some number (eg: 100 past blocks and one snapshot every 100 blocks for past 2000 blocks). Archival nodes can keep all past versions. Pruning old snapshots is effectively done by a database. Whenever we update a record in `SC`, SMT won't update nodes - instead it create new nodes on the update path, without removing the old one. Since we are snapshoting each block, we need to update that mechanism to immediately remove orphaned nodes from the storage. This is a safe operation - snapshots will keep track of the records which should be available for past versions. @@ -96,9 +96,9 @@ To manage the active snapshots we will either us a DB _max number of snapshots_ One of the functional requirements is to access old state. This is done through `abci.Query` structure. 
The version is specified by a block height (so we query for an object by a key `K` at block height `H`). The number of old versions supported for `abci.Query` is configurable. Accessing an old state is done by using available snapshots. `abci.Query` doesn't need old state of `SC`. So, for efficiency, we should keep `SC` and `SS` in different databases (however using the same DB engine). -Moreover, SDK could provide a way to directly access the state. However, a state machines shouldn't do that - since the number of snapshots is configurable, it would lead to nondeterministic execution. +Moreover, SDK could provide a way to directly access the state. However, a state machine shouldn't do that - since the number of snapshots is configurable, it would lead to nondeterministic execution. -We positively [validated](https://github.com/cosmos/cosmos-sdk/discussions/8297) a snapshot mechanism for querying old state with regards to the database we evaluated. +We positively [validated](https://github.com/cosmos/cosmos-sdk/discussions/8297) a versioning mechanism for querying old state with regards to the database we evaluated. ### State Proofs From 19d2126a638c25eecd9a170cbb8fd251fa9c0b9a Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Wed, 5 May 2021 22:47:13 +0200 Subject: [PATCH 20/21] review udpates --- ...r-040-storage-and-smt-state-commitments.md | 22 ++++++++++--------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index d4358181e98a..a943d0763255 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -48,7 +48,7 @@ SMT is a merkle tree structure: we don't store keys directly. For every `(key, v For data access we propose 2 additional KV buckets (namespaces for the key-value pairs, sometimes called [column family](https://github.com/facebook/rocksdb/wiki/Terminology)): 1. B1: `key → value`: the principal object storage, used by a state machine, behind the SDK `KVStore` interface: provides direct access by key and allows prefix iteration (KV DB backend must support it). -2. B2: `hash(key, value) → key`: an index needed to extract a value (through: SMT → B2 → B1) having only a Merkle Path. Recall that SMT will store `hash(key, value)` in it's leafs. +2. B2: `hash(key, value) → key`: a reverse index to get a key from an SMT path. Recall that SMT will store `(k, v)` as `(hash(k), hash(key, value))`. So, we can get an object value by composing `SMT_path → B2 → B1`. 3. we could use more buckets to optimize the app usage if needed. Above, we propose to use a KV DB. However, for the state machine, we could use an RDBMS, which we discuss below. @@ -60,37 +60,38 @@ State Storage requirements: + range queries + quick (key, value) access + creating a snapshot -+ prunning (garbage collection) ++ historical versioning ++ pruning (garbage collection) State Commitment requirements: + fast updates + tree path should be short -+ creating a snapshot + pruning (garbage collection) ### LazyLedger SMT for State Commitment -A Sparse Merkle tree is based on the idea of a complete Merkle tree of an intractable size. The assumption here is that as the size of the tree is intractable, there would only be a few leaf nodes with valid data blocks relative to the tree size, rendering the tree as sparse. 
+A Sparse Merkle tree is based on the idea of a complete Merkle tree of an intractable size. The assumption here is that as the size of the tree is intractable, there would only be a few leaf nodes with valid data blocks relative to the tree size, rendering a sparse tree. ### Snapshots for storage sync and versioning -One of the Stargate core features are snapshots and state sync delivered in the `/snapshot` package. This feature is implemented in SDK and requires storage support. Currently IAVL is the only supported backend. - Database snapshot is a view of DB state at a certain time or transaction. It's not a full copy of a database (it would be too big), usually a snapshot mechanism is based on a _copy on write_ and it allows to efficiently deliver DB state at a certain stage. Some DB engines support snapshotting. Hence, we propose to reuse that functionality for the state sync and versioning (described below). It will the supported DB engines to ones which efficiently implement snapshots. In a final section we will discuss evaluated DBs. -A new state sync snapshot will be created in every `EndBlocker`. The `rootmulti.Store` keeps track of the version number and implements the `CommitMultiStore` interface. `CommitMultiStore` encapsulates a `Committer` interface, which has the `Commit`, `SetPruning`, `GetPruning` functions which will be used for creating and removing snapshots. The `Store.Commit` function increments the version on each call, and checks if it needs to remove old versions. We will need to update the SMT interface to implement the `Committer` interface. +One of the Stargate core features is a _snapshot sync_ delivered in the `/snapshot` package. It provides a way to trustlessly sync a blockchain without repeating all transactions from the genesis. This feature is implemented in SDK and requires storage support. Currently IAVL is the only supported backend. It works by streaming to a client a snapshot of a `SS` at a certain version together with a header chain. + +A new `SS` snapshot will be created in every `EndBlocker` and identified by a block height. The `rootmulti.Store` keeps track of the available snapshots to offer `SS` at a certain version. The `rootmulti.Store` implements the `CommitMultiStore` interface, which encapsulates a `Committer` interface. `Committer` has a `Commit`, `SetPruning`, `GetPruning` functions which will be used for creating and removing snapshots. The `rootStore.Commit` function creates a new snapshot and increments the version on each call, and checks if it needs to remove old versions. We will need to update the SMT interface to implement the `Committer` interface. NOTE: `Commit` must be called exactly once per block. Otherwise we risk going out of sync for the version number and block height. NOTE: For the SDK storage, we may consider splitting that interface into `Committer` and `PruningCommitter` - only the multiroot should implement `PruningCommitter` (cache and prefix store don't need pruning). Number of historical versions for `abci.Query` and state sync snapshots is part of a node configuration, not a chain configuration (configuration implied by the blockchain consensus). A configuration should allow to specify number of past blocks and number of past blocks modulo some number (eg: 100 past blocks and one snapshot every 100 blocks for past 2000 blocks). Archival nodes can keep all past versions. -Pruning old snapshots is effectively done by a database. 
Whenever we update a record in `SC`, SMT won't update nodes - instead it create new nodes on the update path, without removing the old one. Since we are snapshoting each block, we need to update that mechanism to immediately remove orphaned nodes from the storage. This is a safe operation - snapshots will keep track of the records which should be available for past versions. +Pruning old snapshots is effectively done by a database. Whenever we update a record in `SC`, SMT won't update nodes - instead it creates new nodes on the update path, without removing the old one. Since we are snapshoting each block, we need to update that mechanism to immediately remove orphaned nodes from the storage. This is a safe operation - snapshots will keep track of the records which should be available for past versions. To manage the active snapshots we will either us a DB _max number of snapshots_ option (if available), or will remove snapshots in the `EndBlocker`. The latter option can be done efficiently by identifying snapshots with block height. + #### Accessing old state versions One of the functional requirements is to access old state. This is done through `abci.Query` structure. The version is specified by a block height (so we query for an object by a key `K` at block height `H`). The number of old versions supported for `abci.Query` is configurable. Accessing an old state is done by using available snapshots. @@ -98,12 +99,14 @@ One of the functional requirements is to access old state. This is done through Moreover, SDK could provide a way to directly access the state. However, a state machine shouldn't do that - since the number of snapshots is configurable, it would lead to nondeterministic execution. -We positively [validated](https://github.com/cosmos/cosmos-sdk/discussions/8297) a versioning mechanism for querying old state with regards to the database we evaluated. +We positively [validated](https://github.com/cosmos/cosmos-sdk/discussions/8297) a versioning and snapshot mechanism for querying old state with regards to the database we evaluated. + ### State Proofs For any object stored in State Store (SS), we have corresponding object in `SC`. A proof for object `V` identified by a key `K` is a branch of `SC`, where the path corresponds to the key `hash(K)`, and the leaf is `hash(K, V)`. + ### Rollbacks We need to be able to process transactions and roll-back state updates if a transaction fails. This can be done in the following way: during transaction processing, we keep all state change requests (writes) in a `CacheWrapper` abstraction (as it's done today). Once we finish the block processing, in the `Endblocker`, we commit a root store - at that time, all changes are written to the SMT and to the `SS` and a snapshot is created. @@ -114,7 +117,6 @@ We need to be able to process transactions and roll-back state updates if a tran We identified use-cases, where modules will need to save an object commitment without storing an object itself. Sometimes clients are receiving complex objects, and they have no way to prove a correctness of that object without knowing the storage layout. For those use cases it would be easier to commit to the object without storing it directly. 
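As a hedged illustration of the use case above, a module could store only a hash of the received object and let clients check the off-chain object against that commitment. The key layout, the hash function (SHA-256) and the plain map standing in for a `KVStore` are assumptions made for this sketch only:

```go
// Illustrative sketch: a module commits to an object by storing only its hash.
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// commitObject stores hash(bz) under key; the object bytes are not saved.
func commitObject(kv map[string][]byte, key, bz []byte) {
	sum := sha256.Sum256(bz)
	kv[string(key)] = sum[:]
}

// verifyObject lets a client check an off-chain object against the commitment.
func verifyObject(kv map[string][]byte, key, bz []byte) bool {
	sum := sha256.Sum256(bz)
	return bytes.Equal(kv[string(key)], sum[:])
}

func main() {
	kv := map[string][]byte{}
	obj := []byte(`{"complex":"object","received":"off-chain"}`)

	commitObject(kv, []byte("commitments/obj1"), obj)
	fmt.Println("object matches commitment:", verifyObject(kv, []byte("commitments/obj1"), obj))
}
```

The serialization used before hashing (e.g. protobuf) would have to be fixed by the module so that clients can reproduce the same commitment.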
- ## Consequences From 356f9876d8ac4aae3d8e7c67ab84bfbcbd814195 Mon Sep 17 00:00:00 2001 From: Robert Zaremba Date: Fri, 7 May 2021 13:56:48 +0200 Subject: [PATCH 21/21] adding clarification --- .../architecture/adr-040-storage-and-smt-state-commitments.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/architecture/adr-040-storage-and-smt-state-commitments.md b/docs/architecture/adr-040-storage-and-smt-state-commitments.md index a943d0763255..0f009e0b4cf0 100644 --- a/docs/architecture/adr-040-storage-and-smt-state-commitments.md +++ b/docs/architecture/adr-040-storage-and-smt-state-commitments.md @@ -74,7 +74,9 @@ State Commitment requirements: A Sparse Merkle tree is based on the idea of a complete Merkle tree of an intractable size. The assumption here is that as the size of the tree is intractable, there would only be a few leaf nodes with valid data blocks relative to the tree size, rendering a sparse tree. -### Snapshots for storage sync and versioning +### Snapshots for storage sync and state versioning + +Below, by a plain _snapshot_ we refer to a database snapshot mechanism, not to the _ABCI snapshot sync_. The latter will be referred to as _snapshot sync_ (which will directly use the DB snapshot as described below). Database snapshot is a view of DB state at a certain time or transaction. It's not a full copy of a database (it would be too big), usually a snapshot mechanism is based on a _copy on write_ and it allows to efficiently deliver DB state at a certain stage. Some DB engines support snapshotting. Hence, we propose to reuse that functionality for the state sync and versioning (described below). It will the supported DB engines to ones which efficiently implement snapshots. In a final section we will discuss evaluated DBs.
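A rough sketch of how per-block DB snapshots could back versioned reads is given below. The `SnapshotDB`, `Snapshot` and `VersionedStore` names are hypothetical; real engines (RocksDB, Badger, Pebble) expose their own snapshot handles, and the SDK integration would go through the `Committer` interface discussed earlier:

```go
// Hedged sketch of how per-block DB snapshots could serve historical queries
// (abci.Query at height H). All types here are illustrative assumptions.
package store

import "errors"

type Snapshot interface {
	Get(key []byte) ([]byte, error)
	Close() error
}

type SnapshotDB interface {
	Get(key []byte) ([]byte, error) // latest state
	NewSnapshot() Snapshot          // copy-on-write view, cheap to create
}

// VersionedStore keeps one snapshot per retained block height.
type VersionedStore struct {
	db        SnapshotDB
	snapshots map[int64]Snapshot // height -> snapshot, pruned per node config
}

func NewVersionedStore(db SnapshotDB) *VersionedStore {
	return &VersionedStore{db: db, snapshots: map[int64]Snapshot{}}
}

// Commit is called exactly once per block: record a snapshot for height h.
func (vs *VersionedStore) Commit(h int64) {
	vs.snapshots[h] = vs.db.NewSnapshot()
	// pruning of old heights (per node configuration) would happen here
}

// GetAt serves a query for a past height from the retained snapshot.
func (vs *VersionedStore) GetAt(h int64, key []byte) ([]byte, error) {
	snap, ok := vs.snapshots[h]
	if !ok {
		return nil, errors.New("version pruned or not available")
	}
	return snap.Get(key)
}
```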