Support dumping to and restoring from dumpfiles #411
Oh very interesting about the UTXO spend rate. That feels like there should be a bunch of optimization opportunities.
I've been planning to implement compaction (#20), which would help with some of the overhead, but for a benchmark I think the fragmentation and overhead is a feature, not a problem. You want it to be as representative as possible. Same for the non-main tables. If you want to prune those, though, you could delete them. Table schemas are only validated when you call `open_table`.

Ya, I considered adding a dump/export method before (#100), but decided it wasn't a good idea. Your use case is a bit different than what I'd considered then, and I'll give it some more thought. But I think an export format has too large a design space and is best left to users. For example, some people would probably want it to export in plaintext, and for a backup format you might want a differential format that saves only the modified key-values.
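On the pruning point above, a minimal sketch of dropping an extraneous table, assuming a recent redb API (the table definition here is a made-up placeholder, not one of ord's tables):

```rust
use redb::{Database, TableDefinition};

// Hypothetical extraneous table; name and types are placeholders.
const EXTRA: TableDefinition<&str, u64> = TableDefinition::new("extraneous_table");

fn prune_extraneous(db: &Database) -> Result<(), redb::Error> {
    let write_txn = db.begin_write()?;
    // Dropping a table frees its pages; schemas of the remaining tables are
    // only checked when they are opened, so unrelated tables are unaffected.
    write_txn.delete_table(EXTRA)?;
    write_txn.commit()?;
    Ok(())
}
```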
We just landed a very nice optimization by @virtuallythere which exploits this.

This makes me think that maybe the thing to do is to let the redb format itself serve as the dump format, so users can quickly bootstrap an index from a recent database state. Some thoughts:
Nice, 2x is great!
About 25% is wasted, so about 1.3x larger than the raw btrees.
Free pages are already zeroed, although I might make this a feature flag since I think it has a small performance penalty.
It should already be deterministic, as long as you execute exactly the same operations in the same order (including any crashes and db repairs).
25% doesn't seem bad at all.
I might consider defaulting this to on. If the performance penalty is small, then I'd rather avoid having to think about the implications of the database potentially containing old data.
Nice. In that case, I think for our purposes we can definitely just use the database itself as the dump format. The process will be to insert the data we care about into a fresh database in a single transaction, and then compact it.

Do we need the compact step if all we've done is insert data into a new database using a single transaction?
Ya, I've left it on for exactly that reason. Nope, the only reason to compact is if you've done a lot of deletions. If you're just inserting data, it shouldn't cause fragmentation.
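As a sketch of that single-transaction copy, assuming a redb 1.x-style API (the table definition below is a placeholder; ord's real tables have their own key and value types):

```rust
use redb::{Database, ReadableTable, TableDefinition};

// Placeholder key/value types for illustration only.
const TABLE: TableDefinition<&[u8], &[u8]> =
    TableDefinition::new("OUTPOINT_TO_ORDINAL_RANGES");

fn copy_table(src: &Database, dst: &Database) -> Result<(), redb::Error> {
    let read_txn = src.begin_read()?;
    let src_table = read_txn.open_table(TABLE)?;

    let write_txn = dst.begin_write()?;
    {
        let mut dst_table = write_txn.open_table(TABLE)?;
        for entry in src_table.iter()? {
            let (key, value) = entry?;
            dst_table.insert(key.value(), value.value())?;
        }
    }
    // A single committed transaction that only inserts should leave the
    // destination database without the fragmentation caused by deletions.
    write_txn.commit()?;
    Ok(())
}
```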
We are doing a ton of deletions during the indexing process, so doing a compaction would likely be worth it.
Are they mixed with insertions? They will only cause fragmentation if you have transactions at the end of your indexing process that, on net, delete a lot of entries.
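If compaction does turn out to be worth it, a minimal sketch, assuming a redb release that provides `Database::compact`:

```rust
use redb::Database;
use std::error::Error;
use std::path::Path;

// Reclaim free pages after a deletion-heavy indexing run. Returns true if
// any space was actually reclaimed.
fn compact_index(path: &Path) -> Result<bool, Box<dyn Error>> {
    let mut db = Database::open(path)?;
    Ok(db.compact()?)
}
```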
We're working on speeding up `ord` indexing, and one issue we've run into is that it's very hard to build a synthetic benchmark which is representative of `ord`'s main workload.

An example, which is also quite interesting on its own, is that UTXOs are often destroyed very soon after they're created. @virtuallythere has been working on speeding up `ord`'s initial sync time, and he put together a table of the block ranges in which UTXOs are created and destroyed: 20% of UTXOs are spent in the same block in which they're created, 30% are spent within two blocks of being created, and 60% are spent within 18 blocks of being created.
So any synthetic benchmark we wrote would have to reproduce these kinds of patterns, which would be quite hard.
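For a sense of what reproducing just the spend-age pattern would involve, here is a rough sketch that samples a spend delay matching the percentages above; it assumes the `rand` crate, and the long-tail cutoff is arbitrary:

```rust
use rand::Rng; // assumes the `rand` crate

/// Sample how many blocks after creation a synthetic UTXO is spent,
/// roughly matching the figures above: 20% in the same block, 30% within
/// two blocks, 60% within 18 blocks, and the rest some time later.
fn sample_spend_delay<R: Rng>(rng: &mut R) -> u32 {
    let p: f64 = rng.gen();
    if p < 0.20 {
        0
    } else if p < 0.30 {
        rng.gen_range(1..=2)
    } else if p < 0.60 {
        rng.gen_range(3..=18)
    } else {
        rng.gen_range(19..=1_000) // arbitrary long-tail cutoff
    }
}
```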
The alternative to a synthetic benchmark would be to operate on real data, but then we have the challenge that we don't get to a fully representative workload until quite far into the sync process, due to variation in the patterns of on-chain transactions during different parts of Bitcoin's history, in particular, the long run of mostly empty blocks after its creation.
One approach is to sync the index up until a representative height, and then benchmark blocks from there. A few issues with that approach though:

- The index contains a number of tables (among them `OUTPOINT_TO_ORDINAL_RANGES` and `HEIGHT_TO_BLOCK_HASH`), which increase the size of the database
- The snapshot depends on the `redb` version and on the schemas of tables other than `OUTPOINT_TO_ORDINAL_RANGES`, which means we might need to create new snapshots for compatibility, even when the table we care about hasn't changed

As a separate concern, we'd eventually like to be able to distribute index snapshots that users can download, so they can start syncing their index from a recent, known-good state, and not have to sync from the beginning of time. Using `redb` database files for this purpose, in addition to the above issues, would introduce some other issues:
- `index.redb` files don't have a deterministic hash, so multiple developers wouldn't be able to independently build and sign snapshots
- `index.redb` files are much more complicated internally, and interface directly with much of the code of `redb`, so it seems possible that a maliciously constructed `index.redb` might be an attack vector

To deal with all of the above, we came up with the idea of writing code to dump the main index state to a dump file, and then being able to load it from that file. In comparison to the actual `index.redb`, the dumpfile would:
- Not depend on the version of `redb`, the database format, or on the schema of extraneous tables, to minimize the need to rebuild snapshots when `redb` changes, or when we add or remove extraneous tables, or change their schema

This might be useful for `redb` to provide, since it seems like it would be generally useful. In addition to our purposes, it could be used for:

- Migrating between versions of `redb`, without needing anything from the old version aside from dump and load from dump. (For in-process migration, you could dump and load from memory, or dump into an `io::Write`, with the new database loading from an `io::Read`, so there's no tempfile.)

A sketch of a possible design for such a dump file:
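For illustration only, one way such a layout could look, assuming a simple length-prefixed binary encoding (the magic bytes, field widths, and helper function are invented for this sketch, not taken from redb or from any concrete proposal):

```rust
use std::io::{self, Write};

// Illustrative layout, all details assumed:
//
//   magic:    8 bytes, e.g. b"REDBDUMP"
//   version:  u32 little-endian
//   per table:
//     name_len:    u32 LE, followed by `name_len` bytes of UTF-8 table name
//     entry_count: u64 LE
//     per entry:
//       key_len:   u32 LE, followed by the key bytes
//       value_len: u32 LE, followed by the value bytes

fn write_entry<W: Write>(out: &mut W, key: &[u8], value: &[u8]) -> io::Result<()> {
    out.write_all(&(key.len() as u32).to_le_bytes())?;
    out.write_all(key)?;
    out.write_all(&(value.len() as u32).to_le_bytes())?;
    out.write_all(value)?;
    Ok(())
}
```

A layout along these lines is independent of redb's internal page format, and its bytes are a deterministic function of the dumped entries, which is what the signing concern above needs.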
And the API, with a few variations, could be something like:
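As one hedged possibility (the trait, its names, and its signatures below are all assumptions; redb does not provide such an API):

```rust
use std::io::{Read, Write};
use std::path::Path;

/// Hypothetical extension API; names and signatures are assumptions.
pub trait Dump: Sized {
    type Error;

    /// Write the named tables (or every table, if `tables` is `None`) to `writer`.
    fn dump<W: Write>(&self, writer: W, tables: Option<&[&str]>) -> Result<(), Self::Error>;

    /// Create a new database at `path`, populated from a dump read from `reader`.
    fn restore<R: Read>(path: &Path, reader: R) -> Result<Self, Self::Error>;
}
```

Taking `io::Write` and `io::Read` rather than paths is what allows the in-process migration described above to work without a tempfile.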
A version which dumps all tables would also be useful, although for our purposes the ability to exclude tables is important.
We're actively working on benchmarking, so we'll probably start with some shitty, special-purpose dump functionality that only dumps the tables we care about, but having this directly supported by `redb` would be awesome, and probably useful to other people as well.