Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MMR Storage Optimization #2873

Open
antiochp opened this issue Jun 5, 2019 · 3 comments
Open

MMR Storage Optimization #2873

antiochp opened this issue Jun 5, 2019 · 3 comments

Comments

@antiochp
Copy link
Member

antiochp commented Jun 5, 2019

There is a relatively minor change that we can make to our internal MMR implementation that would reduce the storage requirements significantly.

Currently we maintain a data file and a hash file.
The data file stores all the leaf data - the actual entries in the MMR, for example kernels in the kernel MMR.
The hash file stores the trees/mountains of hashes up to the "peaks". Each leaf hash and each parent hash.

In the trivial case of 2 leaves and a single parent we would store elements 1 and 2 in the data file and the hashes for positions 1, 2 and 3 in the hash file.

  3
 / \
1   2

We could skip the leaves in the hash file, trading off reduced storage requirements for the added cost of needing to regenerate the hash of any leaf position.
It would not be unreasonable to need to do this for single leaves though.

The interesting thing is given an MMR of height n, removing the leaves produces an MMR of height n-1.

So in the example above, the MMR has height 2. We would not store the hashes for the leaves at positions 1 and 2 and just store the hash for the parent (also a peak) at position 3.
But if we represented this as an MMR of height 1, it would translate to position 1.

1

We would need to go back to the data file and hash an element if we required the hash for either leaf at position 1 or 2. But this is just a single additional hash operation.

We gain significant space savings by doing this (at least 50% of positions are leaf positions).

Today the kernel MMR hash file contains approx 1,200,000 hashes.
The hash file is approx 37MB in size.
Not storing leaf hashes we would save approx 19MB of local disk storage.


We can also take this further if we consider what we do with our MMRs.

There are basically 3 things we need to be able to do -

  1. Generate a new root hash when appending to an MMR.
  2. Support "rewind" by truncating right-most elements from the MMR.
  3. Provide a peer with MMR data during fast sync.

For the kernel MMR we do not need to provide the hash file to a peer (see #2743). The peer can rebuild the full hash file given only the underlying data file. This is because we do not prune the kernel MMR. For the output and rangeproof MMR we prune but we know which positions have been pruned, so we can work around this. Let us ignore (3) for now.

We can easily support (1) with only the hashes of the peaks.
Appending a single leaf will either add a new peak or update the rightmost peak(s).
No other hashes are necessary to calculate the new MMR root hash.

To support (2) we need some subset rightmost hashes, beneath the rightmost peaks in the MMR. Removing rightmost leaf positions from the MMR "rewinds" and results in previous positions becoming peaks of the MMR.
We could for example, use our pruning "horizon" and only store non-peak hashes after the horizon. This would allow us to support a rewind back to the horizon (7 days).
Everything earlier than this could be rolled up into the relevant peaks, with no additional hashes necessary.

For the kernel MMR (non-prunable) we could store the following -

  • kernels in leaf positions
  • peak hashes prior to horizon
  • rightmost subset of non-leaf hashes after the horizon

We need the set of peaks to determine the root hash.

  • Peaks prior to the horizon can be read directly.
  • Peaks after the horizon can be read directly if they are non-leaf positions.
  • If the rightmost peak is a leaf we can hash the kernel data at that position.

Hashing the root would be slightly more complex and slightly more expensive but we would save significant local storage as we would need to store a relatively small subset of the total MMR hashes.
We would save close to the full 37MB of the existing kernel MMR hash file.

The output and rangeproof MMRs are more complex as we can prune/compact data from the MMR (both data and hash files).
We maintain a list of pruned positions. We could maintain hashes for these pruned positions. A similar approach to that described for the kernel MMR would work if we accounted for these pruned subtrees.

@garyyu
Copy link
Contributor

garyyu commented Jun 6, 2019

👍 for the storage further saving idea.
but I think it's a premature optimization and not the current priority :-) do you think so? since these MMR hash files are not too big in the first several years.

I more like the faster calculation comparing to this storage saving at this stage.

@hunter1342

This comment has been minimized.

@hunter1342

This comment has been minimized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants