
RocksDB massively exceeds memory limits. Potential memory leak #3216

Closed
christian-esken opened this issue Dec 1, 2017 · 43 comments

@christian-esken

christian-esken commented Dec 1, 2017

I have a RocksDB database of 400GB. It uses 40GB RSS and keeps growing, eating up all the swap. Finally it eats up all memory. The RocksDB version is 5.8.6.

Expected behavior

According to https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB the RocksDB instance should only use about 8GB - I have done calculations with a worst-case scenario.

Index size: 4.61 GB
Bloom filters: 3.9 GB
Memtables: 10 * 128MB = 1.28 GB
Block cache: 0.5 GB
Blocks pinned by iterators: 14 MB
Sum: approximately 10.3 GB

I also added memory not mentioned on the wiki, namely:
Memtables pinned by iterators: 1.28 GB (estimation: presuming they pin some older memtables)
Compaction (I understand it could double the index and bloom sizes): 4.61 GB + 3.9 GB
Sum: approximately 9.8 GB

All together, taking the worst case scenario, this is 20 GB.

Actual behavior

RocksDB takes far more memory than expected and eventually eats up all memory. Even shortly after start it requires 33GB RSS.

Detail: swap is filled before buffers/cache are dropped. As you can see below, the swap is full (7GB) while there is still a lot of data cached (14GB), so I guess RocksDB is pinning data in memory. The memory loss happens while reading from the DB via a prefix iterator (seek, next); when we only write (on average 100MB/s to SSD) we do not lose any memory.

dbdir # free -m
              total        used        free      shared  buff/cache   available
Mem:          64306       49196         293          12       14816       14440
Swap:          7710        7710           0

RocksDB is used embedded in Java via rocksdbjni. The Java part requires less than 1GB of heap. Memory usage of the process, as taken from top:
VIRT 0.450t
RES  0.048t
SHR  0.012t

I have run pmap on the process, and summed up RSS for the *.sst files: The sum for *.sst is 34GB.

Steps to reproduce the behavior

We can reproduce the behavior in our application quickly. It never happens if we only write. As soon as we allow clients to read, the memory loss starts. The only operations on this DB are put, prefix-iterator seek() + next(), and deleteRange.
One important observation: we use iterators for a short time, then close them (to avoid resources being held by the iterator) and create a new iterator. If we choose a shorter time before discarding an iterator, we create more iterators and the memory loss is faster.

@christian-esken
Author

Additional info: I tried deactivating the block cache via table_options.setNoBlockCache(true) and found that this does not change anything. The process still eats up memory.
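
For readers who want to see the shape of that configuration, here is a minimal RocksJava sketch of running without a block cache (the path and surrounding setup are illustrative, not the original application's code):

    import org.rocksdb.BlockBasedTableConfig;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    // Sketch only: block-based tables configured with no block cache at all.
    RocksDB.loadLibrary();
    BlockBasedTableConfig tableOptions = new BlockBasedTableConfig()
            .setNoBlockCache(true);              // disable the LRU block cache
    Options options = new Options()
            .setCreateIfMissing(true)
            .setTableFormatConfig(tableOptions); // attach table options to the DB options
    try (RocksDB db = RocksDB.open(options, "/tmp/rocksdb-no-block-cache")) {
        db.put("key".getBytes(), "value".getBytes());
    } catch (RocksDBException e) {
        e.printStackTrace();
    }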

@ajkr
Contributor

ajkr commented Dec 1, 2017

Have you tried https://github.com/facebook/rocksdb/blob/master/include/rocksdb/utilities/memory_util.h ? I am not sure if it gives a complete picture but may be worth trying.
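
(For later readers on the Java side: a wrapper for this utility, org.rocksdb.MemoryUtil, was added to rocksdbjni in releases after this thread was opened. A minimal sketch, assuming a newer rocksdbjni, an already-open db handle, and that the enum constants match your version:)

    import java.util.Collections;
    import java.util.Map;
    import org.rocksdb.Cache;
    import org.rocksdb.MemoryUsageType;
    import org.rocksdb.MemoryUtil;
    import org.rocksdb.RocksDB;

    // db: an already-open RocksDB handle; the cache set may be empty if no shared block cache is used.
    Map<MemoryUsageType, Long> usage =
            MemoryUtil.getApproximateMemoryUsageByType(
                    Collections.singletonList(db), Collections.<Cache>emptySet());
    System.out.println("memtables:     " + usage.get(MemoryUsageType.kMemTableTotal));
    System.out.println("table readers: " + usage.get(MemoryUsageType.kTableReadersTotal));
    System.out.println("block cache:   " + usage.get(MemoryUsageType.kCacheTotal));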

@christian-esken
Author

christian-esken commented Dec 3, 2017

I saw that, but could not find it in the Java layer. Did I miss it? I also tried to get memory usage estimates via calls to db.getProperty("rocksdb.estimate-table-readers-mem") and similar, but only get back 0 or low values like 230. I can only guess that the latter is a bug or omission in the RocksDB Java layer. Here is an output; I am always getting the same values.

"rocksdb.estimate-table-readers-mem" => 0
"rocksdb.size-all-mem-tables" => 696
"rocksdb.cur-size-all-mem-tables" => 696

I am pretty sure my code uses getProperty() correctly, as other values vary. For example "rocksdb.num-running-compactions" is typically 0 or 1.

I might try to work around that. Can I open the same DB in read-only mode with a C++ program while the Java program has the DB open read-write? Would I then get the memory statistics, or won't they be supplied in read-only mode? (Update: I now presume that the statistics are per client and thus won't tell me much.)

@christian-esken
Author

christian-esken commented Dec 4, 2017

I found out how to properly get the memory statistics in the Java layer by digging into the JNI code. I will gather statistics with it and report back.
Edit: If someone stumbles across this and also has a problem with memory statistics: the solution was to use the overloaded getProperty() that lets you specify the correct column family.
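
A minimal sketch of that overload, assuming a handle to the column family that actually holds the data; the property names are the ones already used above:

    import org.rocksdb.ColumnFamilyHandle;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    // The no-argument getProperty() overload reports the default column family only,
    // which is why the values above looked near-zero; pass the CF handle explicitly.
    static void dumpMemoryProperties(RocksDB db, ColumnFamilyHandle cf) throws RocksDBException {
        System.out.println("table readers: " + db.getProperty(cf, "rocksdb.estimate-table-readers-mem"));
        System.out.println("memtables:     " + db.getProperty(cf, "rocksdb.size-all-mem-tables"));
        System.out.println("cur memtables: " + db.getProperty(cf, "rocksdb.cur-size-all-mem-tables"));
    }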

@christian-esken
Author

"rocksdb.size-all-mem-tables" => 1234436376 (1,2GB)
"rocksdb.cur-size-all-mem-tables" => 89890656 (0.1GB)
Pinned memtables is (IMO) the difference: => 1234436376 - 89890656 = 1.144.545.720 (1,1GB)

The values above are no surprise. Everything fine with memtables. :-)

@christian-esken
Author

"rocksdb.estimate-table-readers-mem" => 35045385959 (35GB)
"rocksdb.estimate-num-keys" => 9659039471 (9659 Million = 9,7 Billion)

These estimates are surprising to me.

  • An estimate of nearly 10 billion keys? My hand-calculated worst-case estimate was 3 billion (600GB DB with 200 bytes per value = 3 billion keys).
  • 35GB for "estimate-table-readers-mem"? I wonder what for. Even with 10 billion keys I calculate only half of that (17.6 GB) for index (5.7GB) and bloom (11.9GB).

@christian-esken
Author

I am wondering whether the key estimates could be very wrong - both mine and the one from RocksDB. Mine is based on DB size and average UNCOMPRESSED value size, so the actual number of keys could be massively higher.

@christian-esken
Author

christian-esken commented Dec 6, 2017

Here is a graph on the memory usage of this RocksDB based service.

  • Red = Used
  • Yellow = Cached
  • Green = Free
  • Swap is not in the graph, but it usually is eaten up within 1h after start of the service

[Graph: memory on rocksdb prefix-seek and deleterange]

Timeline

  • 11/30 16:00 Start of the service, configured to use iterators for a short amount of time => many iterators are created; I see a lot of memory loss
  • 12/1 Some manual restarts. At 16:00 started the service configured to use iterators for a longer amount of time => fewer iterators are created; I see less memory loss
  • 12/3 and after: two more automatic restarts, as memory ran out

I will give the service more headroom by adding memory to the machine next week. After that I will post information whether memory consumption stops at some point or if it is a real leak.

@yiwu-arbug
Contributor

yiwu-arbug commented Dec 7, 2017

Since the table readers are using a large amount of memory, have you tried setting cache_index_and_filter_blocks=true, which makes the memory used for index and filters bounded by the block cache size?

The rocksdb.estimate-num-keys can be very wrong. It uses a simple formula (total keys - 2 * total deletes) and doesn't take duplicated keys into consideration.
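
A sketch of the cache_index_and_filter_blocks suggestion above in RocksJava terms (the cache size is illustrative; setBlockCacheSize() is the 5.x-era API used elsewhere in this thread):

    import org.rocksdb.BlockBasedTableConfig;
    import org.rocksdb.Options;

    // Index and filter blocks compete for the block cache instead of living
    // unbounded on the native heap; pinning keeps L0 metadata resident.
    BlockBasedTableConfig tableOptions = new BlockBasedTableConfig()
            .setCacheIndexAndFilterBlocks(true)
            .setPinL0FilterAndIndexBlocksInCache(true)
            .setBlockCacheSize(2L * 1024 * 1024 * 1024);   // e.g. a 2 GiB cap
    Options options = new Options().setTableFormatConfig(tableOptions);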

@christian-esken
Author

christian-esken commented Dec 7, 2017

cache_index_and_filter_blocks is currently false. For details read the rest of this comment.

I used cache_index_and_filter_blocks=true until migrating to RocksDB 5.7.2. After the upgrade to 5.7.2 our read latency increased from 30ms to 5 minutes (yes: MINUTES!). I never found the root cause of the read problems and presumed there was a regression for our use case in 5.7.2. The issue was fully reproducible on the same DB by switching back and forth between 5.4.5 and 5.7.2 (5.4.5 => OK, 5.7.2 => issue).

This could be fixed by using cache_index_and_filter_blocks=false. I haven't yet dared to enable it again.

@christian-esken
Author

christian-esken commented Dec 7, 2017

About the memory usage: I have shut down all other processes on the machine. RocksDB kept eating more memory and then stopped at 57GB RSS (the DB files on disk are 784GB). The swap is not completely filled, which could be a good sign - possibly the allocations are not endless.

# free -m
              total        used        free      shared  buff/cache   available
Mem:          64306       48694         323           2       15288       14954
Swap:          7710         751        6959

I also did a breakdown of the memory consumption of the process. I ran pmap -X and grouped the mappings. Resident in RAM (RSS) are:

  • 16.78GB mmap'ed *.sst files
  • 15.23GB HEAP (this should be the native heap, but what exactly is in it? We also have plenty of anonymous allocations)
  • 30.39GB anonymous allocations
    • 14GB should be the Java heap (this is not 100% confirmed, but highly likely correct)
    • 16GB (the rest) should be native RocksDB allocations

Summed up, this should be roughly 16GB + 15.23GB + 16.78GB = 48GB for RocksDB.

ANON           : size    47,61GB, rss    30,39GB, rssReferenced    30,00GB, anonymous    30,39GB, swap     5,37MB, locked         0B
FILE           : size     2,15MB, rss    32,77kB, rssReferenced    32,77kB, anonymous         0B, swap         0B, locked         0B
FILE_ROCKS_SST : size   800,58GB, rss    16,78GB, rssReferenced    16,42GB, anonymous         0B, swap         0B, locked         0B
HEAP           : size    15,25GB, rss    15,23GB, rssReferenced    15,10GB, anonymous    15,23GB, swap         0B, locked         0B
JAVA           : size     8,19kB, rss     4,10kB, rssReferenced     4,10kB, anonymous     4,10kB, swap         0B, locked         0B
LIBRARY_JAR    : size     4,54MB, rss         0B, rssReferenced         0B, anonymous         0B, swap         0B, locked         0B
LIBRARY_SO     : size    68,74MB, rss     9,90MB, rssReferenced     9,76MB, anonymous   851,97kB, swap   327,68kB, locked         0B
STACK          : size   139,26kB, rss    12,29kB, rssReferenced    12,29kB, anonymous    12,29kB, swap    24,58kB, locked         0B
VMISC          : size    20,48kB, rss     4,10kB, rssReferenced     4,10kB, anonymous         0B, swap         0B, locked         0B

@siying
Contributor

siying commented Dec 7, 2017

How large are your index and bloom filter blocks and how large is your block cache?

@yiwu-arbug
Contributor

From the symptoms you describe, it seems you have huge index or bloom filter blocks in your DB. With cache_index_and_filter_blocks=false, the index and filter blocks are loaded on the heap, eating memory; with cache_index_and_filter_blocks=true, the huge blocks can be evicted from the cache, contributing to large read latency when they are loaded again.

You can find out the index and filter sizes by running sst_dump --file=<sst_file> --command=noop --show_properties on the sst files in your DB.

@christian-esken
Author

christian-esken commented Dec 8, 2017

This sst_dump is from an old *.sst file which is still present in the DB and uses large values (200 bytes).

du -sh 023043.sst
268M	023043.sst
./sst_dump --file=023043.sst --command=noop --show_properties
from [] to []
Process 023043.sst
Sst file format: block-based
Table Properties:
------------------------------
  # data blocks: 271525
  # entries: 4845731
  raw key size: 155063392
  raw average key size: 32.000000
  raw value size: 973971030
  raw average value size: 200.995687
  data block size: 268436030
  index block size: 10532611
  filter block size: 7950387
  (estimated) table size: 286919028
  filter policy name: rocksdb.BuiltinBloomFilter
  column family ID: 1
  column family name: cache
  comparator name: leveldb.BytewiseComparator
  merge operator name: nullptr
  property collectors names: []
  SST file compression algo: Snappy
  creation time: 1509103180
  time stamp of earliest key: 0
  # deleted keys: 0
  # merge operands: 0
Raw user collected properties
------------------------------
  # rocksdb.block.based.table.index.type: 0x00000000
  # rocksdb.block.based.table.prefix.filtering: 0x31
  # rocksdb.block.based.table.whole.key.filtering: 0x31
  # rocksdb.deleted.keys: 0x00
  # rocksdb.merge.operands: 0x00

@christian-esken
Author

christian-esken commented Dec 8, 2017

This sst_dump is from a new *.sst file which is still present in the DB (Level L0) and uses small values (26 bytes).

du -sh 690256.sst 
122M	690256.sst
./sst_dump --file=690256.sst --command=noop --show_properties
from [] to []
Process 690256.sst
Sst file format: block-based
Table Properties:
------------------------------
  # data blocks: 15092
  # entries: 2678161
  raw key size: 85701152
  raw average key size: 32.000000
  raw value size: 69396830
  raw average value size: 25.912120
  data block size: 123085798
  index block size: 577665
  filter block size: 4017349
  (estimated) table size: 127680812
  filter policy name: rocksdb.BuiltinBloomFilter
  column family ID: 1
  column family name: cache
  comparator name: leveldb.BytewiseComparator
  merge operator name: nullptr
  property collectors names: []
  SST file compression algo: NoCompression
  creation time: 1512734971
  time stamp of earliest key: 1512734901
  # deleted keys: 0
  # merge operands: 0
Raw user collected properties
------------------------------
  # rocksdb.block.based.table.index.type: 0x00000000
  # rocksdb.block.based.table.prefix.filtering: 0x31
  # rocksdb.block.based.table.whole.key.filtering: 0x31
  # rocksdb.deleted.keys: 0x00
  # rocksdb.merge.operands: 0x00

@christian-esken
Author

christian-esken commented Dec 8, 2017

I see that filter + index use a "sensible" amount of space in the sst files, as shown below. My new data format fares better (smaller values, 26 vs 200 bytes; 8KB vs 4KB blocks). This is on-disk usage, though - how big would the data be in RAM? It cannot be translated literally; IIRC the filter block is compressed, while the index block is not. I guess that 10% of the whole SST table size should be more than enough for filters and index, according to my earlier calculation and the sst_dump output.

                        023043.sst (old values)   Percentage of the sum   690256.sst (new values)   Percentage of the sum
data_block_size         268436030                 93.5581135455401        123085798                 96.4011710702466
index_block_size        10532611                   3.67093499285101       577665                     0.452428983612667
filter_block_size       7950387                    2.77095146160888       4017349                    3.14639994614069
Sum of Filter + Index   18482998                   6.4418864544599        4595014                    3.59882892975336
Sum of the 3 blocks     286919028                 100                     127680812                 100

@christian-esken
Author

christian-esken commented Dec 8, 2017

The machine with the service in question was upgraded from 64GB to 256GB today, and now after an hour the process takes 178GB RSS.

@christian-esken
Author

christian-esken commented Dec 8, 2017

I do not see over-large indexes or filters. For example, for 690256.sst the sizes are slightly higher than calculated, but only by up to 20%:

  • Index
    • Index expected size: 15092 (data blocks) * 32 (avg key size) = 482944
    • Index actual size from sst_dump statistics: 577665 (20% bigger than expected)
  • Bloom filters (6 bits per entry):
    • Bloom filters expected size: 2678161 (entries) * 6 (bloom bits) / 8 = 3570881
    • Bloom filters actual size from sst_dump statistics: 4017349 (12.5% bigger)

Also, from the statistics I do not see much pinned:
MemtablesSize = 157458080; // 157MB
MemtableUnflushedSize = 34534368; // 34MB
MemtablePinnedSize = 122923712; // 123MB
EstimatedKeyCount = 15872519822; // 15.8 billion
BloomAndFilterSizeEstimation = 45643281266; // 45GB

@christian-esken
Author

christian-esken commented Dec 8, 2017

@siying

  • The block cache is disabled, to make size calculations easier.
  • Index and bloom filter blocks:
    • For in-memory usage there is the RocksDB estimation of 45GB.
    • For sizes inside the *.sst files I added two examples (search for 023043.sst and 690256.sst).

@christian-esken
Author

For comparison, I started to run two differently configured instances. I will check after the weekend how they perform.

  • One with no block cache: table_options.setNoBlockCache(true);
2017/12/08-11:11:48.747656 7f3cf31f6700            table_factory options:   flush_block_policy_factory: FlushBlockBySizePolicyFactory (0x7f3ced495490)
  cache_index_and_filter_blocks: 0
  cache_index_and_filter_blocks_with_high_priority: 0
  pin_l0_filter_and_index_blocks_in_cache: 0
  index_type: 0
  hash_index_allow_collision: 1
  checksum: 1
  no_block_cache: 1
  block_cache: (nil)
  block_cache_compressed: (nil)
  persistent_cache: (nil)
  block_size: 8192
  block_size_deviation: 10
  block_restart_interval: 16
  index_block_restart_interval: 1
  filter_policy: rocksdb.BuiltinBloomFilter
  whole_key_filtering: 1
  format_version: 0
  • One with block cache:
            table_options
                .setPinL0FilterAndIndexBlocksInCache(true)
                .setCacheNumShardBits(globals.getInt("rocks.table.cacheNumShardBits", 6))
                .setNoBlockCache(false);
            table_options.setCacheIndexAndFilterBlocks(true);
            table_options.setBlockCacheSize(blockCacheSize); // 2.5 GiB
2017/12/08-14:53:36.977174 7f7bb5115700            table_factory options:   flush_block_policy_factory: FlushBlockBySizePolicyFactory (0x7f7bad36dd60)
  cache_index_and_filter_blocks: 1
  cache_index_and_filter_blocks_with_high_priority: 0
  pin_l0_filter_and_index_blocks_in_cache: 1
  index_type: 0
  hash_index_allow_collision: 1
  checksum: 1
  no_block_cache: 0
  block_cache: 0x7f7bad36d750
  block_cache_name: LRUCache
  block_cache_options:
    capacity : 1610612736
    num_shard_bits : 6
    strict_capacity_limit : 0
    high_pri_pool_ratio: 0.000
  block_cache_compressed: (nil)
  persistent_cache: (nil)
  block_size: 8192
  block_size_deviation: 10
  block_restart_interval: 16
  index_block_restart_interval: 1
  filter_policy: rocksdb.BuiltinBloomFilter
  whole_key_filtering: 1
  format_version: 0

@christian-esken
Author

The over-the-weekend test is done. The differences are stark. The amount of OS buffer/cache is nearly identical, but the used and free values differ heavily:

  • Used: 12277 vs 51232
  • Free: 14240 vs 904

With block cache:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:         225586       12277       14240          17      199068      211739
Swap:          7710          74        7636

Without block cache:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:         257842       51232         904          55      205705      204747
Swap:          7710         272        7438

@christian-esken
Author

christian-esken commented Dec 11, 2017

My pmap analysis shows:

  • ANON
    • With block cache: using less ANON (10,76GB). This is expected, as the Cache should be ANON allocations.
    • Without block cache: using 47,67GB ANON. This is acceptable for a DB of 1,2TB
  • FILE_ROCKS_SST
    • With block cache: using 18,72GB rss / rssReferenced. This is higher than expected, because there are no (application) reads/seeks on the DB.
    • Without block cache: using 112,74GB rss / rssReferenced. This is very high.

With block cache:

                   size       rss        rssReferenced  anonymous  swap  locked
ANON               55,63GB    10,76GB    10,60GB        10,76GB    0B    0B
FILE                2,15MB   397,31kB   397,31kB        0B         0B    0B
FILE_ROCKS_SST    809,85GB    18,72GB    18,72GB        0B         0B    0B
HEAP              143,84MB   136,54MB   136,54MB        136,54MB   0B    0B
JAVA                8,19kB     8,19kB     8,19kB        4,10kB     0B    0B
LIBRARY_JAR         4,55MB   798,72kB   790,53kB        0B         0B    0B
LIBRARY_SO         68,74MB    16,78MB    16,45MB        1,18MB     0B    0B
STACK             139,26kB    36,86kB    12,29kB        36,86kB    0B    0B
VMISC              20,48kB     4,10kB     4,10kB        0B         0B    0B

Without block cache:

                   size       rss        rssReferenced  anonymous  swap  locked
ANON               79,57GB    49,18GB    47,67GB        49,18GB    0B    0B
FILE                2,15MB   434,18kB   434,18kB        0B         0B    0B
FILE_ROCKS_SST   1260,80GB   112,74GB   112,74GB        0B         0B    0B
HEAP                2,16GB     2,16GB     2,14GB        2,16GB     0B    0B
JAVA                8,19kB     8,19kB     8,19kB        4,10kB     0B    0B
LIBRARY_JAR         4,54MB     2,29MB     2,29MB        0B         0B    0B
LIBRARY_SO         68,74MB    22,18MB    21,82MB        1,18MB     0B    0B
STACK             139,26kB    40,96kB    12,29kB        40,96kB    0B    0B
VMISC              20,48kB     4,10kB     4,10kB        0B         0B    0B

@christian-esken
Author

Summary

ANON allocations look fine. They seem to be in line with the RocksDB memory usage wiki page. FILE_ROCKS_SST is the SST files mapped into memory, and I do see an issue here: mmap-ing the files is not a problem in itself, but a lot of it is resident in RAM (see the rss column). I assume that any file block that was read at least ONCE stays in RAM forever, the reason being that it sits in the OS cache.

Questions

  • Is my assumption right that every file block that was read at least once stays in RAM?
  • Why would the OS (Linux) prefer swapping out other applications instead of dropping blocks from its own cache? This is my biggest concern, and I have no answer to it. /proc/sys/vm/swappiness is 0, so Linux should really try its best not to swap.
  • Is there a way to "unreference" cold blocks so that the OS could reclaim that memory more easily, for example by unmapping parts of the file? Does RocksDB do anything like this?

@christian-esken
Author

christian-esken commented Dec 21, 2017

I observed the behavior for a week, dug deep into Linux memory management and gained insight into the issue. Answering my own questions now:

  • Is my assumption right that every file block that was read at least once stays in RAM?
    => The answer is: yes (until the OS discards it from its page cache)
  • Why would the OS (Linux) prefer swapping out other applications instead of dropping blocks from its own cache? /proc/sys/vm/swappiness is 0, so Linux should really try its best not to swap.
    => The answer is: cgroups can have a different swappiness, and newer Debian versions make use of that.
  • Is there a way to "unreference" cold blocks so that the OS could reclaim that memory more easily, for example by unmapping parts of the file? Does RocksDB do anything like this?
    => The answer is: munmap() could do it, but it would likely be a big effort in RocksDB and not worth it.

@christian-esken
Author

Swap usage: the mystery of swap being used even though there is free memory is solved. Debian 9 puts services in a cgroup, and that cgroup does not follow /proc/sys/vm/swappiness. If others have similar issues, here is how you find your service/cgroup-specific settings:

# cat /proc/1052/cgroup | grep memory
4:memory:/system.slice/my-svc.service

# cat  /sys/fs/cgroup/memory/system.slice/my-svc.service/memory.swappiness
60

# cat  /proc/sys/vm/swappiness
0

We see that the default swappiness diverges from the swappiness of the cgroup.

@christian-esken
Author

Memory leakage: there may still be a memory leak, as I see that about 10GB of memory is "lost" over the course of 6 days. I need to follow this over a longer period and see where, and if, memory usage settles.
[Graph: rocksdb memory usage on prefix-seek]

@christian-esken
Author

christian-esken commented Dec 21, 2017

Summarizing a long ticket

  • Swap is used because swappiness is (re)set by Debian 9 from 0 to 60 in a cgroup.
    => "Watch out for cgroups". I hope this insight will help other people out there. :-)
  • Memory usage is far more than expected and calculated. I think it is too high, and I will keep observing it.
  • There may be a memory leak. If there is a leak, it is likely in RocksDB (native memory) and not in the (Java) application on top of it, as the Java heap is limited. I will keep observing it.

I am now closing this ticket, as the first aspect ("swap usage") is not a RocksDB issue. If I have more insight about memory consumption or leaks I will open a new ticket. Thanks for the help @ajkr and @yiwu-arbug

@jianwei1216

Hi, nice to meet you.

Did you resolve this problem?

@christian-esken
Author

I still see high memory usage, but I am not sure it is a true leak. I am still in the process of finding the root cause and proof. The most likely cause of the high memory usage in my case is iterators. In my opinion iterators hold far more memory than documented in the section "Blocks pinned by iterators" on the page https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB.

If you want a possible answer, this is what can contribute to the problem of high memory usage.

  • Close your iterators as soon as possible. Otherwise memory usage can explode.
  • Don't pool iterators, unless you understood the effects e.g. from LSM tree versioning.

A more detailed description follows below.

Effectively, iterators likely hold:

  • the blocks and memtables they currently pin, and
  • a reference to the complete LSM tree version (and thus all of its SSTables) from the time the iterator was created.

The latter can make your memory usage explode. If you use an iterator pool, there are 2 issues:

  • Iterators keep open all SSTables of the LSM tree version at the time the iterator was created. In the worst case a single iterator references a completely different set of SSTables. This means bloom filters and indexes may be held in duplicate; 10 iterators could mean 10 times the size.
  • An iterator from a pool may NOT see your most recent data, as the newer data is in a newer LSM tree version.
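
A minimal RocksJava illustration of the "close iterators as soon as possible" advice: RocksIterator is AutoCloseable, so a short-lived try-with-resources scan releases the pinned blocks and the referenced LSM version as soon as the scan finishes (this sketch assumes a prefix extractor is configured; the prefix bytes and method name are illustrative):

    import org.rocksdb.ReadOptions;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksIterator;

    // Short-lived prefix scan: the iterator (and the SSTable version it pins)
    // is released at the end of the try block instead of lingering in a pool.
    static void scanPrefix(RocksDB db, byte[] prefix) {
        try (ReadOptions readOptions = new ReadOptions().setPrefixSameAsStart(true);
             RocksIterator it = db.newIterator(readOptions)) {
            for (it.seek(prefix); it.isValid(); it.next()) {
                byte[] key = it.key();
                byte[] value = it.value();
                // ... process key/value ...
            }
        }
    }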

@iFA88

iFA88 commented Aug 18, 2020

Hey @christian-esken, thanks for your very detailed debugging, it is very interesting. I face a similar issue when I use my DB with many CFs and 1+ trillion keys. Several times my DB memory usage increased by 20-23 GB (all the free memory I have) within a few hours, very randomly.
My issue is like what you describe: very slow memory growth, day after day.
I have dumped the memory of the process, and I have several 256MB memory regions filled with zeroes.
pmap returns these lines for them:

         Address Perm   Offset Device     Inode    Size     Rss     Pss Referenced Anonymous ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked Mapping
    7f2130681000 rw-s 00000000  00:14 275866516       4       4       4          4         0              0              0               0    0       0      0 KaE2af (deleted)
    7f2130682000 rw-s 00000000  00:14 275866515       4       0       0          0         0              0              0               0    0       0      0 uoVzy9 (deleted)
    7f2130683000 rw-s 00000000  00:14 275866514       4       0       0          0         0              0              0               0    0       0      0 Li57V3 (deleted)
    7f2130684000 rw-s 00000000  00:14 275866513       4       0       0          0         0              0              0               0    0       0      0 Nw6GjY (deleted)
    7f2130685000 rw-s 00000000  00:14 275866512       4       4       4          4         0              0              0               0    0       0      0 2k4gHS (deleted)
    7f2130686000 rw-s 00000000  00:14 275866511       4       4       4          4         0              0              0               0    0       0      0 Fc7R4M (deleted)
    7f2130687000 rw-s 00000000  00:14 275866510       4       4       4          4         0              0              0               0    0       0      0 dZdusH (deleted)
    7f2130688000 rw-s 00000000  00:14 275866509       4       4       4          4         0              0              0               0    0       0      0 KZn7PB (deleted)
    7f2130689000 rw-s 00000000  00:14 275866508       4       4       4          4         0              0              0               0    0       0      0 wSvLdw (deleted)
    7f213068a000 rw-s 00000000  00:14 275866507       4       4       4          4         0              0              0               0    0       0      0 p0GqBq (deleted)
    7f213068b000 rw-s 00000000 103:08   3407899       4       4       0          4         0              0              0               0    0       0      0 pym-6301-pj6joa5c (deleted)
    7f213068c000 rw-p 00000000  00:00         0    1816    1816    1409       1748      1816              0              0               0    0       0      0 
    7f2130852000 rwxp 00000000  00:00         0       4       4       0          0         4              0              0               0    0       0      0 
    7f2130853000 rw-s 00000000  00:14 237344868       4       4       0          4         0              0              0               0    0       0      0 g3ghBt (deleted)
    7f2130854000 rw-s 00000000  00:14 237344867       4       4       4          4         0              0              0               0    0       0      0 LpPrSA (deleted)
    7f2130855000 rw-s 00000000  00:14 237344866       4       4       4          4         0              0              0               0    0       0      0 M7ZC9H (deleted)
    7f2130856000 rw-s 00000000  00:14 237344865       4       4       4          4         0              0              0               0    0       0      0 VXOOqP (deleted)
    7f2130857000 rw-s 00000000  00:14 237344864       4       4       0          4         0              0              0               0    0       0      0 75b1HW (deleted)
    7f2130858000 rw-s 00000000  00:14 237344863       4       4       4          4         0              0              0               0    0       0      0 7QReZ3 (deleted)

I think that was used by the compiler, but I don't know why it is not freed... Most of my CFs have the option no_block_cache=true.
I use the Python layer for RocksDB (version: 6.8.1).

Regarding your suggestions: I iterate very rarely, and iterators are closed after use.

@linas

linas commented Apr 11, 2021

FYI, RAM usage seems to be directly correlated with the number of *.sst files. RAM usage can be clamped by limiting the number of open *.sst files, i.e. by setting the max_open_files option to a suitable number. I am seeing approx 1GB RAM for every 15 sst files. RAM usage appears to be about 3x disk usage.

Details:

  • Ubuntu focal 20.04 + rocksdb 5.17
  • My app is in C++.
  • I'm rocks-naive, so I am using mostly default settings: no tuning, no special options, nothing sophisticated. Just blunt read-write of mostly short and some long records (Zipf distribution, short being most common). Almost no iterators used.
  • My database is about 500 MB (half a GB) containing about a million records.
  • Without resource constraints, it takes about 15 minutes to shoot up to about 60GB disk usage, 180GB RAM usage and about 1700 sst files.
  • Closing the app triggers compaction, and the database compacts back down to its original size.
  • When options.max_open_files = -1 (the rocks default) then the number of sst files shoots up to just under the ulimit -n setting (1024 is the bash default), after which I get silent data corruption. Closing the app, and re-opening it again seems to give a database with the pre-corruption data in it.
  • Setting ulimit -n 4096 allows more open sst files, and vast amounts of RAM and disk to be consumed.
  • Setting options.max_open_files = 312 seems to limit the number of sst files to about 170, limits disk usage to about 6.6GB, and RAM usage to about 20GB.

The root cause of the large explosion in the number of sst files appears to be the app performing rapid edits of some large values. The values are in the 50KB range, and they are being updated hundreds of times a second. This appears to cause rocks to write out tens of MBytes/sec of log records, which turns into the GB/minute or so increase in file storage space. I don't understand why compaction isn't mashing these back down to a manageable size.

The fact that RAM is 3x the disk usage suggests that each sst file is memory-mapped three times!?

My app also has a postgres backend, so I can compare. In synthetic benchmarks, rocks is 2x or 3x faster than postgres. Yayy! That's a lot! For the production data runs, it's about 10x slower. Booo! I don't understand the root cause of this, either.
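
A hedged sketch of the clamp described above: in RocksJava it is Options.setMaxOpenFiles() (the C++ equivalent is the Options::max_open_files field), and the value 312 is just the one quoted in this comment, not a general recommendation:

    import org.rocksdb.Options;

    // Cap the number of table readers RocksDB keeps open; -1 (the default)
    // means "keep every SST file open", which also keeps its index/filter
    // blocks and its file descriptor alive.
    Options options = new Options()
            .setCreateIfMissing(true)
            .setMaxOpenFiles(312);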

@mrambacher
Contributor

What is the symptom of the "silent corruption"? When you close and re-open the DB, is the original data there or is it still "corrupted"?

I am not that familiar with RocksDB 5.17 so I have no idea if anything has been fixed in this area since that time.

@linas

linas commented Apr 12, 2021

What is the symptom of the "silent corruption"?

Data that I expect to be present goes missing, as if the entire record were absent. To catch it in the act, I also wrote a log of everything written, as an ASCII file. I can see that a write occurs for that key, with the correct data, whereas a later read returns nothing. At this point the app is check-stopped, and I can manually verify that the expected key-value pair really is missing.

When you close and re-open the DB, is the original data there or is it still "corrupted"?

When I exit the app and restart, I can see that the original write had gone through, as it now holds the expected data. I cannot close and reopen without exiting the app, because despite closing there are 980 file descriptors that remain open to the 980 sst files, and ulimit prevents the open from succeeding with a "too many open files" error. My guess is that rocksdb hits the ulimit open-files limit, doesn't notice, and stops returning data to Get() and iterators. Then, upon close, it forgets that it has all these other files open and leaves them open.

A more elegant response would have been to run a compaction cycle and reduce the number of open files, but perhaps it is trying to avoid error amplification, and just refusing to do anything more.

I hadn't noticed the RAM usage until I'd increased ulimit -n 4096 after which I ran out of RAM (on a 256 GB box ... while read-modify-writing a 500MB db... Wow!)

@linas

linas commented Apr 13, 2021

The leaking file descriptor problem persists in rocksdb version 6.19 (github master branch as of yesterday). After closing rocksdb, lsof -p pid | grep sst | wc shows that there are 71 file descriptors still open. Reopening and closing again causes compaction to run, which removes many of these files; lsof -p pid | grep sst | grep deleted | wc shows that there are 67 open file descriptors to sst files that have been deleted. Leaving the application idle overnight (12 hours) does not change things; those descriptors appear to be permanently lost.

@siying
Contributor

siying commented Apr 13, 2021

Any chance you have something holding old files back from being deleted, e.g. very old iterators not destroyed, some scheduled compactions that never ran, etc.? Some DB properties to check:
"rocksdb.num-live-versions" - old version objects that might hold old files
"rocksdb.total-sst-files-size" and "rocksdb.live-sst-files-size". The latter is the live SST files in the latest LSM-tree, while the former also includes stale files held by something.

@linas

linas commented Apr 13, 2021

@siying Wow! Thank you! I'm coding in C++ and made the apparently faulty and foolish assumption that iterators were smart pointers that self-delete when they go out of block scope. Looking at <rocksdb/db.h> I see that is not the case. Yikes! Thank you, yes, you are right: all my ram/disk problems are almost surely due to the failure to delete iterators. I will remeasure and report findings shortly... Done. Wow! All RAM and file usage problems appear to be completely resolved!

It's been a decade since I last coded in a C++ project that required an explicit delete; it never even occurred to me. Based on my 100% newbie error, I humbly suggest that future versions of the C++ bindings to rocks convert to using smart pointers for iterators. In C++ it's not hard: a few lines of code in the header files. Memory usage of a smart pointer is about 3 long words. CPU overhead is low: under the covers, they increment a counter when created and decrement it when the pointer goes out of scope. The only issue is how to make this change in a backwards-compatible way: all the old users will be calling delete, which will confuse the shared-pointer control block when it also goes to delete. Hmm.

@lcoon815

lcoon815 commented Oct 4, 2021

@linas How to set options.max_open_files = 300 ? I am facing error-> put error: IO error: While open a file for appending: /root/pmem1/db/001021.sst: Too many open files

Would like to try your method if it can solve my problem. Thanks.

@linas

linas commented Oct 4, 2021

@linas How to set options.max_open_files = 300 ? I am facing error-> put error: IO error: While open a file for appending: /root/pmem1/db/001021.sst: Too many open files

Try ulimit -a at the bash prompt. Also man getrlimit

@lcoon815

lcoon815 commented Oct 4, 2021

@linas How to set options.max_open_files = 300 ? I am facing error-> put error: IO error: While open a file for appending: /root/pmem1/db/001021.sst: Too many open files

Try ulimit -a at the bash prompt. Also man getrlimit

I try to increase the limit: ulimit -n 16384. But it seems like the benchmark.sh can keep running until the disk is full (501GB used), then it errors out.
put error: IO error: No space left on deviceWhile appending to file: /data/pmem0/db/001049.sst: No space left on device

@lcoon815

lcoon815 commented Oct 5, 2021

@linas Could you guide me how to set options.max_open_files = 300?

@siying
Contributor

siying commented Oct 5, 2021

@linas How to set options.max_open_files = 300 ? I am facing error-> put error: IO error: While open a file for appending: /root/pmem1/db/001021.sst: Too many open files

Try ulimit -a at the bash prompt. Also man getrlimit

I try to increase the limit: ulimit -n 16384. But it seems like the benchmark.sh can keep running until the disk is full (501GB used), then it errors out. put error: IO error: No space left on deviceWhile appending to file: /data/pmem0/db/001049.sst: No space left on device

Default benchmark.sh appears to work on a data set of 1.5TB. With some compaction styles it can go as high as 3TB. If you are OK with running smaller size, you can overwrite number of keys with environment variable NUM_KEYS. Even our official benchmark now runs with much smaller number of keys: https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks

@lcoon815

lcoon815 commented Oct 5, 2021

@linas How to set options.max_open_files = 300 ? I am facing error-> put error: IO error: While open a file for appending: /root/pmem1/db/001021.sst: Too many open files

Try ulimit -a at the bash prompt. Also man getrlimit

I try to increase the limit: ulimit -n 16384. But it seems like the benchmark.sh can keep running until the disk is full (501GB used), then it errors out. put error: IO error: No space left on deviceWhile appending to file: /data/pmem0/db/001049.sst: No space left on device

Default benchmark.sh appears to work on a data set of 1.5TB. With some compaction styles it can go as high as 3TB. If you are OK with running smaller size, you can overwrite number of keys with environment variable NUM_KEYS. Even our official benchmark now runs with much smaller number of keys: https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks

@siying Okay, I will try to reduce NUM_KEYS. Do you know how to calculate how many keys the disk can hold given its space?

@Robin3910

Robin3910 commented Mar 13, 2024

I want to know whether RocksDB itself closes its iterators. I see the memory grow without limit and never decrease. I don't use any iterators in my code, only the Get() interface, so I guess RocksDB creates lots of iterators internally for Get() and does not close them.
