
Performance cliff with large database #344

Closed
casey opened this issue Sep 3, 2022 · 18 comments

Comments

@casey
Contributor

casey commented Sep 3, 2022

We found this a little while ago, see https://github.com/casey/ord/issues/371, but I was hoping to have more information for you before opening an issue here. However, I really haven't been able to figure it out, so here I am T_T

The mainnet ord database is around 50GiB, and syncs reasonably quickly. It processes one block at a time, without any parallelism, makes one commit per block, and takes around 36 hours to sync when connected to a fully synced bitcoin core node.

The main table is OUTPOINT_TO_ORDINAL_RANGES, which maps outputs (a TXID plus a u32 output index) to ordinal ranges. The encoding of the ranges is a little complicated, but conceptually each range is just a (u64, u64) giving the start and size of that range in the output.
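For concreteness, here's a rough sketch of the shape of that table; the table name, the flat little-endian encoding, and the use of the current redb API are illustrative assumptions rather than ord's actual code, which is more compact:

    use redb::TableDefinition;

    // Conceptually: outpoint bytes -> list of (start, size) ordinal ranges.
    const OUTPOINT_TO_ORDINAL_RANGES: TableDefinition<&[u8], &[u8]> =
        TableDefinition::new("outpoint_to_ordinal_ranges");

    // An outpoint is a 32-byte TXID plus a u32 output index: 36 key bytes.
    fn outpoint_key(txid: [u8; 32], vout: u32) -> [u8; 36] {
        let mut key = [0u8; 36];
        key[..32].copy_from_slice(&txid);
        key[32..].copy_from_slice(&vout.to_le_bytes());
        key
    }

    // Each ordinal range is conceptually (start, size), both u64.
    fn encode_ranges(ranges: &[(u64, u64)]) -> Vec<u8> {
        let mut value = Vec::with_capacity(ranges.len() * 16);
        for (start, size) in ranges {
            value.extend_from_slice(&start.to_le_bytes());
            value.extend_from_slice(&size.to_le_bytes());
        }
        value
    }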

Recently, we added a new table, OUTPOINT_TO_TXID, which required a rebuild of the database. While the rebuild was underway, we noticed that it was extremely slow: it was on track to take at least 9 days to sync, the database was already around 80 GiB, and it wasn't yet halfway synced. Each block was taking multiple seconds, as opposed to just a few milliseconds. We stopped the sync, disabled writes to the new table, and re-deployed, and the full sync was pretty quick again.

One thing that struck me was that the server has 64 GiB of RAM, so maybe the performance cliff happens when the database is larger than available RAM. We tried building the database on a machine with 1 GiB of RAM, but didn't notice a similar cliff there, so it might be that the degradation only happens when the database is large in absolute terms.

Do you have any suggestions for things we might do to gather more information, and to figure out whether this is a redb issue or we're just hitting the performance limits of the box we're on?

@cberner
Owner

cberner commented Sep 3, 2022

Hmm, interesting. Ya, a few ideas:

  • are you running it on Linux? If so, it'd be useful to collect a flamegraph of the perf of a couple insertions that are slow. I use the flamegraph crate (not cargo-flamegraph)
  • try printing the database stats to see if anything looks terribly wrong. If the height is more than log2(number of keys), something is very wrong; it should be around log base x, where x is roughly 2000 divided by the entry size (see the worked example after this list)
  • try using atop to see if disk bandwidth or CPU time is saturated. If disk bandwidth is low and CPU time is at 100%, there's probably a bug
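To make the height check concrete, taking ~135M keys and ~52 bytes per branch entry as example numbers (both come up later in this thread):

    x      ≈ 2000 / 52 ≈ 38
    height ≈ log_38(135,000,000) ≈ 5
    log2(135,000,000) ≈ 27   (a height anywhere near this would indicate a bug)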

@casey
Contributor Author

casey commented Sep 6, 2022

Here's the output of ord info on a fully synced mainnet database. The two main tables are block height -> hash and outpoint -> ordinal ranges; blocks indexed is the number of keys in the first table, outputs indexed is the number of keys in the second, and both look right.

blocks indexed: 752920
outputs indexed: 135512466
tree height: 7
free pages: 255940053
stored: 25.6 GiB
overhead: 1.38 GiB
fragmented: 20.63 GiB
index size: 49.25 GiB

Is there anything in this that jumps out at you as suspicious?

This is from an index that hasn't hit the performance cliff yet, so if there's a bug, for example in balancing, it might not be manifesting yet.

@cberner
Owner

cberner commented Sep 7, 2022

Nope, that all looks quite reasonable. Fragmented space being a little less than 50% is expected, since full pages split into two half-full pages.

It's interesting that your sync takes so long. My benchmark writes 1M keys in 1.4 secs, so 24+ hours for 135M keys seems kinda slow. How do I run this sync myself?

@casey
Contributor Author

casey commented Sep 7, 2022

The 135M keys are just the outputs that remain as of the current chain tip. During sync though, there will be a bunch of insertions and deletions, probably orders of magnitude more than 135M.

It's pretty easy to run! You'll need bitcoin core: https://bitcoincore.org/en/download/

Start and sync a bitcoin node by running bitcoind. Then, as the same user, run ord server; ord should automatically find and authenticate with bitcoind. bitcoind will take about 1-2 days to sync, and ord can run alongside it and sync at the same time.
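Roughly, the commands look something like this (exact flags are an assumption; ord picks up bitcoind's credentials automatically):

    # start bitcoind and let it sync (takes about 1-2 days)
    bitcoind -daemon

    # as the same user, in parallel; ord will find bitcoind and start indexing
    ord server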

@casey
Contributor Author

casey commented Sep 7, 2022

If it's reasonably fast, try removing the #[cfg(any())] attribute in index.rs so this loop runs (an empty any() never matches, so right now it compiles the loop out):

    #[cfg(any())]
    for input in &tx.input {
      outpoint_to_txid.insert(&serialize(&input.previous_output), &txid)?;
    }

On our production server, enabling this loop makes everything much slower.

@cberner
Owner

cberner commented Sep 7, 2022

When I run ord I get this error:

cberner@vermeer:(master)~/Code/ord$ cargo run --release -- server
    Finished release [optimized] target(s) in 0.10s
     Running `target/release/ord server`
error: Permission denied (os error 13)

Do I need to do something to grant it permission to talk to bitcoind, or use something else?

@cberner
Owner

cberner commented Sep 7, 2022

The 135M keys are just the outputs that remain as of the current chain tip. During sync though, there will be a bunch of insertions and deletions, probably orders of magnitude more than 135M.

Ah right, ok that makes sense. When bitcoind is sync'ing does the tx=10180481 part of each line that it prints indicate the number of transactions that have occurred up to that height?

@casey
Contributor Author

casey commented Sep 7, 2022

When bitcoind is sync'ing does the tx=10180481 part of each line that it prints indicate the number of transactions that have occurred up to that height?

Yup, I think so. My synced node just printed tx=762688384, which comports with this graph.

@veryordinally

veryordinally commented Sep 7, 2022

  • are you running it on Linux? If so, it'd be useful to collect a flamegraph of the perf of a couple insertions that are slow. I use the flamegraph crate (not cargo-flamegraph)

Here's the flamegraph from indexing a few hundred blocks from around block 150000 on the bitcoin main chain. Does this look reasonable to you in terms of redb behavior? I couldn't see anything that stood out - about 75% of index_ranges time is spent on database operations, roughly a third each on insertions, deletes, and commit, and the rest goes to getting blocks from bitcoin core - which seems reasonable.

@cberner
Owner

cberner commented Sep 8, 2022

Ya, that looks quite reasonable. Is that before or after the big slowdown?

@veryordinally

This is with the "big slowdown" changes backed out, so before the big slowdown.

@cberner
Owner

cberner commented Sep 8, 2022

If it's reasonably fast, try removing the #[cfg(any())] attribute in index.rs so this loop runs:

    #[cfg(any())]
    for input in &tx.input {
      outpoint_to_txid.insert(&serialize(&input.previous_output), &txid)?;
    }

On our production server, enabling this loop makes everything much slower.

I enabled that loop and also turned on info-level logs. I ran the sync up to block height 392844, and I think I see the slowdown you're talking about. Each transaction has gone from taking 1-10ms to more like 1000ms. However, it looks expected: the number of ordinal ranges being committed in each transaction has gone up to around a million!

[2022-09-08T14:35:39Z INFO  ord::index] Wrote 1146834 ordinal ranges in 1786ms

Could the problem be that around that block height people started using Bitcoin more, and so the ordinals are getting mixed around and so each transaction that gets spent includes a large number of ordinal ranges?

@veryordinally

veryordinally commented Sep 8, 2022

A million+ ranges per TX should be a rare exception. We will collect some more stats.

@casey
Contributor Author

casey commented Sep 8, 2022

Could the problem be that around that block height people started using Bitcoin more, and so the ordinals are getting mixed around and so each transaction that gets spent includes a large number of ordinal ranges?

That's definitely true, but I don't think it's the source of the slowdown. When this for loop:

for input in &tx.input {
  outpoint_to_txid.insert(&serialize(&input.previous_output), &txid)?;
}

…is enabled, it will perform one additional insert per spent output, which doesn't depend on the number of ordinal ranges moving per transaction. So enabling it shouldn't cause a slowdown that appears when the number of ordinal ranges moving per block gets large. Also, without it, the ordinals.com instance syncs reasonably quickly all the way to the chain tip.

Here's a graph of the full mainnet sync, without the slowdown:

[graph: commit time per block over the full mainnet sync]

So it definitely slows down as blocks get full, but the average stays around 200ms. Here's a zoomed-in view of the end:

[graph: zoomed-in view of commit times near the chain tip]

It's hard to tell, but it looks like the average stays around 200ms/commit.

@cberner
Owner

cberner commented Sep 9, 2022

Ah yes, ok I see. I added a counter for the number of outpoints, and it's only a few thousand per block.

@cberner
Owner

cberner commented Sep 10, 2022

Ok, so the good news is I don't think it's a bug in redb, but that's also the bad news =P

I hacked my benchmark to simulate your workload: it assumes each value is 12 bytes × 100 ordinal ranges = 1.2 KB, and after an initial load of millions of pairs, it writes transactions of 5000 outpoints.
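Not the actual benchmark code, but a minimal sketch of that workload's shape, assuming the current redb API (the table name and key encoding are arbitrary):

    use redb::{Database, Error, TableDefinition};

    const TABLE: TableDefinition<&[u8], &[u8]> = TableDefinition::new("bench");

    fn main() -> Result<(), Error> {
        let db = Database::create("bench.redb")?;
        // ~100 ordinal ranges * 12 bytes each = 1.2 KB per value (assumed shape)
        let value = vec![0u8; 1200];

        // Initial bulk load of N key/value pairs in a single transaction.
        let n: u64 = 8_000_000;
        let txn = db.begin_write()?;
        {
            let mut table = txn.open_table(TABLE)?;
            for i in 0..n {
                table.insert(i.to_be_bytes().as_slice(), value.as_slice())?;
            }
        }
        txn.commit()?;

        // Then simulate blocks: 100 transactions of 5000 outpoint inserts each,
        // one commit per "block".
        for block in 0..100u64 {
            let txn = db.begin_write()?;
            {
                let mut table = txn.open_table(TABLE)?;
                for i in 0..5000u64 {
                    let key = (n + block * 5000 + i).to_be_bytes();
                    table.insert(key.as_slice(), value.as_slice())?;
                }
            }
            txn.commit()?;
        }
        Ok(())
    }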

I tested against lmdb, with varying sizes for the initial load, and redb performs on par with lmdb (redb seems to scale better actually). However, both of them start to get slow around 8M entries, which is probably around the time that it no longer fits in memory (I have 32GB, but it looks like the OS is only using ~14GB for buff/cache according to free):

1M
redb: Bulk loaded 1000000 items in 4704ms
redb: Wrote 100 individual items in 886ms
redb: Wrote 100 x 5000 items in 13178ms
redb: Removed 500000 items in 2804ms
lmdb-rkv: Bulk loaded 1000000 items in 4892ms
lmdb-rkv: Wrote 100 individual items in 876ms
lmdb-rkv: Wrote 100 x 5000 items in 7687ms
lmdb-rkv: Removed 500000 items in 2618ms

2M
redb: Bulk loaded 2000000 items in 10707ms
redb: Wrote 100 individual items in 912ms
redb: Wrote 100 x 5000 items in 15037ms
redb: Removed 1000000 items in 6295ms
lmdb-rkv: Bulk loaded 2000000 items in 14506ms
lmdb-rkv: Wrote 100 individual items in 971ms
lmdb-rkv: Wrote 100 x 5000 items in 8659ms
lmdb-rkv: Removed 1000000 items in 6173ms

4M
redb: Bulk loaded 4000000 items in 25327ms
redb: Wrote 100 individual items in 898ms
redb: Wrote 100 x 5000 items in 14864ms
redb: Removed 2000000 items in 20850ms
lmdb-rkv: Bulk loaded 4000000 items in 36919ms
lmdb-rkv: Wrote 100 individual items in 869ms
lmdb-rkv: Wrote 100 x 5000 items in 8482ms
lmdb-rkv: Removed 2000000 items in 40537ms

8M
redb: Bulk loaded 8000000 items in 71191ms
redb: Wrote 100 individual items in 1077ms
redb: Wrote 100 x 5000 items in 34156ms
redb: Removed 4000000 items in 510961ms
lmdb-rkv: Bulk loaded 8000000 items in 152380ms
lmdb-rkv: Wrote 100 individual items in 933ms
lmdb-rkv: Wrote 100 x 5000 items in 35490ms
lmdb-rkv: Removed 4000000 items in 515297ms

16M
redb: Bulk loaded 16000000 items in 1029296ms
redb: Wrote 100 individual items in 962ms
redb: Wrote 100 x 5000 items in 117697ms
lmdb-rkv: Bulk loaded 16000000 items in 1267875ms
lmdb-rkv: Wrote 100 individual items in 1042ms
lmdb-rkv: Wrote 100 x 5000 items in 118925ms

There's a bunch of weird stuff here, like remove is much slower than insert, which doesn't really make sense to me. Will need to look into it further.

@cberner
Owner

cberner commented Sep 10, 2022

Ok, actually I think the insert performance does make sense. I tested throughput on my disk (Samsung 980 Pro) with fio and got about 30k IOPS:

    fio --filename ./junk.dat --eta-newline=5s --name write-test --size=1g --io_size=1g --blocksize=4k --ioengine=libaio --fsync=1000000000000 --iodepth=1 --direct=1 --numjobs=1 --runtime=20 --fallocate=none --rw=randwrite

We can estimate the insert speed by assuming every write goes directly to disk:

  • assuming the branch pages are half full, each key takes up 52 bytes (key, offset, child ptr, and child checksum [zeros are stored even for 2PC]), so the tree order will be around 40. The exact value isn't that important; since this is a b-tree, it'll only change the height by +/- 1
  • using that we get a tree height of around 5
  • and now we calculate the total insert time: 16M * 5 / IOPS = 2,666 secs, whereas the measured time is 1,029 secs -- presumably faster due to the page cache
  • similarly, the per-transaction time is expected to be around 833ms, whereas the measured time is around 3 secs -- presumably slower due to overhead, such as zeroing out freed pages (both estimates are restated below)
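Restating that arithmetic in one place (2000 bytes being roughly half of a 4 KiB page, per the half-full assumption above):

    branching factor ≈ 2000 bytes ÷ 52 bytes/entry                       ≈ 38
    tree height      ≈ log_38(16,000,000)                                ≈ 5
    bulk load        ≈ 16,000,000 inserts × 5 page writes ÷ 30,000 IOPS  ≈ 2,667 s
    per transaction  ≈ 5,000 inserts × 5 page writes ÷ 30,000 IOPS       ≈ 0.83 s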

@casey
Contributor Author

casey commented Sep 11, 2022

Thanks for looking into this!

Ahh, damn, I was hoping for an ez fix bug in redb 😭

I tried to simulate the database running out of memory by building an index larger than 1 GiB on a machine with 1 GiB of RAM, but didn't run into a cliff. However, that was a small database, so the performance difference may simply have been too small to notice.
