
Performance cliff with large database #344

Closed
casey opened this issue Sep 3, 2022 · 18 comments

Comments

@casey
Contributor

casey commented Sep 3, 2022

We found this a little while ago, see https://github.com/casey/ord/issues/371, but I was hoping to have more information for you before opening an issue here. However, I really haven't been able to figure it out, so here I am T_T

The mainnet ord database is around 50GiB, and syncs reasonably quickly. It processes one block at a time, without any parallelism, makes one commit per block, and takes around 36 hours to sync when connected to a fully synced bitcoin core node.

The main table is OUTPOINT_TO_ORDINAL_RANGES, which maps outputs (a TXID plus a u32 output index) to ordinal ranges. The encoding of the ranges is a little complicated, but conceptually each range is just a (u64, u64) giving the start and size of that range in the output.
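For concreteness, here's a rough sketch of the shape of that table; the table name, the flat little-endian encoding, and the use of the current redb API are illustrative assumptions rather than ord's actual code, which is more compact:

    use redb::TableDefinition;

    // Conceptually: outpoint bytes -> list of (start, size) ordinal ranges.
    const OUTPOINT_TO_ORDINAL_RANGES: TableDefinition<&[u8], &[u8]> =
        TableDefinition::new("outpoint_to_ordinal_ranges");

    // An outpoint is a 32-byte TXID plus a u32 output index: 36 key bytes.
    fn outpoint_key(txid: [u8; 32], vout: u32) -> [u8; 36] {
        let mut key = [0u8; 36];
        key[..32].copy_from_slice(&txid);
        key[32..].copy_from_slice(&vout.to_le_bytes());
        key
    }

    // Each ordinal range is conceptually (start, size), both u64.
    fn encode_ranges(ranges: &[(u64, u64)]) -> Vec<u8> {
        let mut value = Vec::with_capacity(ranges.len() * 16);
        for (start, size) in ranges {
            value.extend_from_slice(&start.to_le_bytes());
            value.extend_from_slice(&size.to_le_bytes());
        }
        value
    }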

Recently, we added a new table, OUTPOINT_TO_TXID, which required a rebuild of the database. While the rebuild was underway, we noticed that it was extremely slow: it was on track to take at least 9 days to sync, the database was already around 80 GiB, and it wasn't yet halfway synced. Each block was taking multiple seconds, as opposed to just a few milliseconds. We stopped the sync, disabled writes to the new table, and re-deployed, and the full sync was pretty quick again.

One thing that struck me was that the server has 64 GiB of RAM, so maybe the performance cliff happens when the database is larger than available RAM. We tried building the database on a machine with 1 GiB of RAM, but didn't notice a similar cliff there, so it might be that the degradation only happens when the database is large in absolute terms.

Do you have any suggestions for things we might do to gather more information, and to figure out whether this is a redb issue or we're just hitting the performance limits of the box we're on?

@cberner
Owner

cberner commented Sep 3, 2022

Hmm, interesting. Ya, a few ideas:

  • are you running it on Linux? If so, it'd be useful to collect a flamegraph of the perf of a couple insertions that are slow. I use the flamegraph crate (not cargo-flamegraph)
  • try printing the database stats to see if anything looks terribly wrong. If the height is more than log2(number of keys), something is very wrong; it should be around log base x, where x is roughly 2000 divided by the entry size (see the worked example after this list)
  • try using atop to see if disk bandwidth or CPU time is saturated. If disk bandwidth is low and CPU time is at 100%, there's probably a bug
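To make the height check concrete, taking ~135M keys and ~52 bytes per branch entry as example numbers (both come up later in this thread):

    x      ≈ 2000 / 52 ≈ 38
    height ≈ log_38(135,000,000) ≈ 5
    log2(135,000,000) ≈ 27   (a height anywhere near this would indicate a bug)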

@casey
Contributor Author

casey commented Sep 6, 2022

Here's the output of ord info on a fully synced mainnet database. The two main tables are block height -> hash and outpoint -> ordinal ranges; blocks indexed is the number of keys in the first table, outputs indexed is the number of keys in the second, and both look right.

blocks indexed: 752920
outputs indexed: 135512466
tree height: 7
free pages: 255940053
stored: 25.6 GiB
overhead: 1.38 GiB
fragmented: 20.63 GiB
index size: 49.25 GiB

Is there anything in this that jumps out at you as suspicious?

This is from an index that hasn't hit the performance cliff yet, so if there's a bug, for example in balancing, it might not be manifesting yet.

@cberner
Owner

cberner commented Sep 7, 2022

Nope, that all looks quite reasonable. Fragmented space being a little less than 50% is expected, since full pages split into two half-full pages.

It's interesting that your sync takes so long. My benchmark writes 1M keys in 1.4 secs, so 24+ hours for 135M keys seems kinda slow. How do I run this sync myself?

@casey
Contributor Author

casey commented Sep 7, 2022

The 135M keys are just the outputs that remain as of the current chain tip. During sync though, there will be a bunch of insertions and deletions, probably orders of magnitude more than 135M.

It's pretty easy to run! You'll need bitcoin core: https://bitcoincore.org/en/download/

Start and sync a bitcoin node by running bitcoind. Then, as the same user, run ord server; ord should automatically find and authenticate with bitcoind. bitcoind will take about 1-2 days to sync, and ord can run alongside it and sync at the same time.
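Roughly, the commands look something like this (exact flags are an assumption; ord picks up bitcoind's credentials automatically):

    # start bitcoind and let it sync (takes about 1-2 days)
    bitcoind -daemon

    # as the same user, in parallel; ord will find bitcoind and start indexing
    ord server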

@casey
Contributor Author

casey commented Sep 7, 2022

If it's reasonably fast, try removing the #[cfg(any())] attribute in index.rs so this loop runs (an empty any() never matches, so right now it compiles the loop out):

    #[cfg(any())]
    for input in &tx.input {
      outpoint_to_txid.insert(&serialize(&input.previous_output), &txid)?;
    }

On our production server, enabling this loop makes everything much slower.

@cberner
Owner

cberner commented Sep 7, 2022

When I run ord I get this error:

cberner@vermeer:(master)~/Code/ord$ cargo run --release -- server
    Finished release [optimized] target(s) in 0.10s
     Running `target/release/ord server`
error: Permission denied (os error 13)

Do I need to do something to grant it permission to talk to bitcoind, or use something else?

@cberner
Owner

cberner commented Sep 7, 2022

The 135M keys are just the outputs that remain as of the current chain tip. During sync though, there will be a bunch of insertions and deletions, probably orders of magnitude more than 135M.

Ah right, ok that makes sense. When bitcoind is sync'ing does the tx=10180481 part of each line that it prints indicate the number of transactions that have occurred up to that height?

@casey
Contributor Author

casey commented Sep 7, 2022

When bitcoind is sync'ing does the tx=10180481 part of each line that it prints indicate the number of transactions that have occurred up to that height?

Yup, I think so. My synced node just printed tx=762688384, which comports with this graph.

@veryordinally

veryordinally commented Sep 7, 2022

  • are you running it on Linux? If so, it'd be useful to collect a flamegraph of the perf of a couple insertions that are slow. I use the flamegraph crate (not cargo-flamegraph)

Here's the flamegraph from indexing a few hundred blocks from around block 150000 on the bitcoin main chain. Does this look reasonable to you in terms of redb behavior? I couldn't see anything that stood out - about 75% of index_ranges time is spent on database operations, roughly a third each on insertions, deletes, and commit, and the rest goes to getting blocks from bitcoin core - which seems reasonable.

@cberner
Owner

cberner commented Sep 8, 2022

Ya, that looks quite reasonable. Is that before or after the big slowdown?

@veryordinally

This is with the "big slowdown" changes backed out, so before the big slowdown.

@cberner
Owner

cberner commented Sep 8, 2022

If it's reasonably fast, try removing the #[cfg(any())] attribute in index.rs so this loop runs:

    #[cfg(any())]
    for input in &tx.input {
      outpoint_to_txid.insert(&serialize(&input.previous_output), &txid)?;
    }

On our production server, enabling this loop makes everything much slower.

I enabled that loop and also turned on info-level logs. I ran the sync up to block height 392844, and I think I see the slowdown you're talking about. Each transaction has gone from taking 1-10ms to more like 1000ms. However, it looks expected: the number of ordinal ranges being committed in each transaction has gone up to around a million!

[2022-09-08T14:35:39Z INFO  ord::index] Wrote 1146834 ordinal ranges in 1786ms

Could the problem be that around that block height people started using Bitcoin more, and so the ordinals are getting mixed around and so each transaction that gets spent includes a large number of ordinal ranges?

@veryordinally

veryordinally commented Sep 8, 2022

A million+ ranges per TX should be a rare exception. We will collect some more stats.

@casey
Contributor Author

casey commented Sep 8, 2022

Could the problem be that around that block height people started using Bitcoin more, and so the ordinals are getting mixed around and so each transaction that gets spent includes a large number of ordinal ranges?

That's definitely true, but I don't think it's the source of the slowdown. When this for loop:

for input in &tx.input {
  outpoint_to_txid.insert(&serialize(&input.previous_output), &txid)?;
}

…is enabled, it will perform one additional insert per spent output, which doesn't depend on the number of ordinal ranges moving per transaction. So enabling it shouldn't cause a slowdown that appears when the number of ordinal ranges moving per block gets large. Also, without it, the ordinals.com instance syncs reasonably quickly all the way to the chain tip.

Here's a graph of the full mainnet sync, without the slowdown:

[graph: commit time per block over the full mainnet sync]

So it definitely slows down as blocks get full, but the average stays around 200ms. Here's a zoomed-in view of the end:

[graph: zoomed-in view of commit times near the chain tip]

It's hard to tell, but it looks like the average stays around 200ms/commit.

@cberner
Owner

cberner commented Sep 9, 2022

Ah yes, ok I see. I added a counter for the number of outpoints, and it's only a few thousand per block.

@cberner
Owner

cberner commented Sep 10, 2022

Ok, so the good news is I don't think it's a bug in redb, but that's also the bad news =P

I hacked my benchmark to simulate your workload: it assumes each value is 12 bytes × 100 ordinal ranges = 1.2 KB, and after an initial load of millions of pairs, it writes transactions of 5000 outpoints.
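Not the actual benchmark code, but a minimal sketch of that workload's shape, assuming the current redb API (the table name and key encoding are arbitrary):

    use redb::{Database, Error, TableDefinition};

    const TABLE: TableDefinition<&[u8], &[u8]> = TableDefinition::new("bench");

    fn main() -> Result<(), Error> {
        let db = Database::create("bench.redb")?;
        // ~100 ordinal ranges * 12 bytes each = 1.2 KB per value (assumed shape)
        let value = vec![0u8; 1200];

        // Initial bulk load of N key/value pairs in a single transaction.
        let n: u64 = 8_000_000;
        let txn = db.begin_write()?;
        {
            let mut table = txn.open_table(TABLE)?;
            for i in 0..n {
                table.insert(i.to_be_bytes().as_slice(), value.as_slice())?;
            }
        }
        txn.commit()?;

        // Then simulate blocks: 100 transactions of 5000 outpoint inserts each,
        // one commit per "block".
        for block in 0..100u64 {
            let txn = db.begin_write()?;
            {
                let mut table = txn.open_table(TABLE)?;
                for i in 0..5000u64 {
                    let key = (n + block * 5000 + i).to_be_bytes();
                    table.insert(key.as_slice(), value.as_slice())?;
                }
            }
            txn.commit()?;
        }
        Ok(())
    }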

I tested against lmdb, with varying sizes for the initial load, and redb performs on par with lmdb (redb seems to scale better actually). However, both of them start to get slow around 8M entries, which is probably around the time that it no longer fits in memory (I have 32GB, but it looks like the OS is only using ~14GB for buff/cache according to free):

1M
redb: Bulk loaded 1000000 items in 4704ms
redb: Wrote 100 individual items in 886ms
redb: Wrote 100 x 5000 items in 13178ms
redb: Removed 500000 items in 2804ms
lmdb-rkv: Bulk loaded 1000000 items in 4892ms
lmdb-rkv: Wrote 100 individual items in 876ms
lmdb-rkv: Wrote 100 x 5000 items in 7687ms
lmdb-rkv: Removed 500000 items in 2618ms

2M
redb: Bulk loaded 2000000 items in 10707ms
redb: Wrote 100 individual items in 912ms
redb: Wrote 100 x 5000 items in 15037ms
redb: Removed 1000000 items in 6295ms
lmdb-rkv: Bulk loaded 2000000 items in 14506ms
lmdb-rkv: Wrote 100 individual items in 971ms
lmdb-rkv: Wrote 100 x 5000 items in 8659ms
lmdb-rkv: Removed 1000000 items in 6173ms

4M
redb: Bulk loaded 4000000 items in 25327ms
redb: Wrote 100 individual items in 898ms
redb: Wrote 100 x 5000 items in 14864ms
redb: Removed 2000000 items in 20850ms
lmdb-rkv: Bulk loaded 4000000 items in 36919ms
lmdb-rkv: Wrote 100 individual items in 869ms
lmdb-rkv: Wrote 100 x 5000 items in 8482ms
lmdb-rkv: Removed 2000000 items in 40537ms

8M
redb: Bulk loaded 8000000 items in 71191ms
redb: Wrote 100 individual items in 1077ms
redb: Wrote 100 x 5000 items in 34156ms
redb: Removed 4000000 items in 510961ms
lmdb-rkv: Bulk loaded 8000000 items in 152380ms
lmdb-rkv: Wrote 100 individual items in 933ms
lmdb-rkv: Wrote 100 x 5000 items in 35490ms
lmdb-rkv: Removed 4000000 items in 515297ms

16M
redb: Bulk loaded 16000000 items in 1029296ms
redb: Wrote 100 individual items in 962ms
redb: Wrote 100 x 5000 items in 117697ms
lmdb-rkv: Bulk loaded 16000000 items in 1267875ms
lmdb-rkv: Wrote 100 individual items in 1042ms
lmdb-rkv: Wrote 100 x 5000 items in 118925ms

There's a bunch of weird stuff here, like remove is much slower than insert, which doesn't really make sense to me. Will need to look into it further.

@cberner
Owner

cberner commented Sep 10, 2022

Ok, actually I think the insert performance does make sense. I tested throughput on my disk (Samsung 980 Pro) with fio and got about 30k IOPS:

    fio --filename ./junk.dat --eta-newline=5s --name write-test --size=1g --io_size=1g --blocksize=4k --ioengine=libaio --fsync=1000000000000 --iodepth=1 --direct=1 --numjobs=1 --runtime=20 --fallocate=none --rw=randwrite

We can estimate the insert speed by assuming every write goes directly to disk:

  • assuming the branch pages are half full, each key takes up 52 bytes (key, offset, child ptr, and child checksum [zeros are stored even for 2PC]), so the tree order will be around 40. The exact value isn't that important; since this is a b-tree, it'll only change the height by +/- 1
  • using that we get a tree height of around 5
  • and now we calculate the total insert time: 16M * 5 / IOPS = 2,666 secs, whereas the measured time is 1,029 secs -- presumably faster due to the page cache
  • similarly, the per-transaction time is expected to be around 833ms, whereas the measured time is around 3 secs -- presumably slower due to overhead, such as zeroing out freed pages (both estimates are restated below)
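Restating that arithmetic in one place (2000 bytes being roughly half of a 4 KiB page, per the half-full assumption above):

    branching factor ≈ 2000 bytes ÷ 52 bytes/entry                       ≈ 38
    tree height      ≈ log_38(16,000,000)                                ≈ 5
    bulk load        ≈ 16,000,000 inserts × 5 page writes ÷ 30,000 IOPS  ≈ 2,667 s
    per transaction  ≈ 5,000 inserts × 5 page writes ÷ 30,000 IOPS       ≈ 0.83 s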

@casey
Contributor Author

casey commented Sep 11, 2022

Thanks for looking into this!

Ahh, damn, I was hoping for an ez fix bug in redb 😭

I tried to simulate the database running out of memory by building an index larger than 1 GiB on a machine with 1 GiB of RAM, but didn't run into a cliff. However, that was a small database, so the performance difference may simply have been too small to notice.
