RocksDB massively exceeds memory limits. Potential memory leak #3216
Additional info: I tried deactivating the block cache via table_options.setNoBlockCache(true) and found that this does not change anything. The process still eats up memory.
Have you tried https://github.com/facebook/rocksdb/blob/master/include/rocksdb/utilities/memory_util.h? I am not sure if it gives a complete picture, but it may be worth trying.
I saw that but could not find it on the Java layer. Did I miss it? I also already tried to get memory usage estimates via calls to db.getProperty("rocksdb.estimate-table-readers-mem") and similar, but I only get back 0 or low values like 230. I can only guess that the latter is a bug or omission in the RocksJava layer. Here is an output; I am always getting the same values.
I am pretty sure my code uses getProperty() correctly, as other values vary. I might try to work around that: can I open the same DB in read-only mode with a C++ program while the Java program has the DB open read/write, and then get the memory statistics, or won't they be supplied in read-only mode? (Update: I now presume that the statistics are per client and thus won't tell me much.)
I found how to properly get the memory statistics on the Java layer by digging into the JNI code. I will gather statistics with it and report back.
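For reference, a minimal RocksJava sketch of this kind of property polling (an illustration, not the exact code used here; it assumes an already-open RocksDB handle named db, and the property names are the standard ones discussed in this thread):

```java
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class MemoryStats {
    // Prints the memory-related DB properties discussed in this thread.
    // Assumes "db" was opened elsewhere via RocksDB.open(...).
    static void printMemoryStats(RocksDB db) throws RocksDBException {
        String[] properties = {
            "rocksdb.size-all-mem-tables",        // active + unflushed + pinned memtables
            "rocksdb.cur-size-all-mem-tables",    // active + unflushed memtables
            "rocksdb.estimate-table-readers-mem", // index/filter memory held by table readers
            "rocksdb.estimate-num-keys"           // rough key count (can be far off, as noted below)
        };
        for (String p : properties) {
            System.out.println(p + " => " + db.getProperty(p));
        }
    }
}
```

When column families are in use, newer RocksJava versions also offer a per-column-family overload of getProperty for a per-CF breakdown.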
"rocksdb.size-all-mem-tables" => 1234436376 (1,2GB) The values above are no surprise. Everything fine with memtables. :-) |
"rocksdb.estimate-table-readers-mem" => 35045385959 (35GB) These estimates are surprising to me.
I am wondering whether the key-count estimates could be very wrong - both mine and the one from RocksDB. Mine is based on DB size and average UNCOMPRESSED value size, so the actual number of keys could be massively higher.
Here is a graph of the memory usage of this RocksDB-based service over time (timeline graph not reproduced here).
I will give the service more headroom by adding memory to the machine next week. After that I will post whether memory consumption stops at some point or whether it is a real leak.
Since the table readers are using a large amount of memory, have you tried setting cache_index_and_filter_blocks=true, which makes the memory used for index and filter blocks bounded by the block cache size? rocksdb.estimate-num-keys can be very wrong. It uses a simple formula (total keys - 2 * total deletes) and doesn't take duplicated keys into consideration.
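For the Java layer used in this issue, a hedged sketch of what that suggestion could look like (RocksJava method names; the 512 MB cache is only an example value mirroring the 0.5 GB block cache from the original report):

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

public class BoundedIndexFilterOptions {
    // Builds Options so that index and filter blocks are charged to, and bounded by, the block cache.
    static Options build() {
        BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
                .setBlockCacheSize(512L * 1024 * 1024)   // 0.5 GB block cache (example value)
                .setCacheIndexAndFilterBlocks(true);     // index/filter memory now bounded by the cache
        return new Options()
                .setCreateIfMissing(true)
                .setTableFormatConfig(tableConfig);
    }
}
```

Later releases also expose pin_l0_filter_and_index_blocks_in_cache, which can reduce the re-load latency described in the next comment.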
cache_index_and_filter_blocks is currently false; for details read the rest of this comment. I used cache_index_and_filter_blocks=true until migrating to RocksDB 5.7.2. After the upgrade to 5.7.2 our read latency increased from 30ms to 5 minutes (yes: MINUTES!). I never found the root cause of the read problems and presumed there was a regression for our use case in 5.7.2. The issue was fully reproducible on the same DB by switching back and forth between 5.4.5 and 5.7.2 (5.4.5 => OK, 5.7.2 => issue). It could be fixed by using cache_index_and_filter_blocks=false. I haven't yet dared to enable it again.
About the memory usage: I have shut down all other processes on the machine. RocksDB began eating more memory and then stopped at 57GB RSS (DB-Files on disk are 784GB). The SWAP is not completely filled, and that could be a good sign - possibly the allocations are not endless.
I also did a breakdown of memory consumption for the process. I ran pmap -X and grouped the memory consumption. Resident in RAM (RES) are:
Summarized, this should come to 16 GB + 15.23 GB + 16.78 GB = 48 GB for RocksDB.
How large are your index and bloom filter blocks and how large is your block cache?
From the symptoms you experience, it seems like you have huge index or bloom filter blocks in your DB. With cache_index_and_filter_blocks=false, the index and filter blocks are loaded on the heap, eating memory; with cache_index_and_filter_blocks=true, the huge blocks can be evicted from the cache, contributing to large read latency when they are loaded again. You can find out the index and filter sizes by running sst_dump.
This sst_dump is from an old *.sst file which is still present in the DB and uses large values (200 bytes).
This sst_dump is from a new *.sst file which is still present in the DB (Level L0) and uses small values (26 bytes).
I see that filter + index seem to use a "sensible" amount in the sst file, as shown below. My new data format fares better (smaller values, 26 vs 200 bytes, and 8KB vs 4KB blocks). This is file usage, but how big would the data be in RAM? It cannot be translated literally; IIRC the filter block is compressed, while the index block is not. I guess that 10% of the whole SST table size should be more than enough for filters and index, according to my former calculation and the sst_dump.
The machine with the service in question was upgraded from 64GB to 256GB today, and now after an hour the process takes 178GB RSS.
I do not see over-large indexes or filters. For example, for 690256.sst the size is slightly higher than calculated, but only by 20%:
Also, from the statistics I see not much pinned:
For comparison, I started running two differently configured instances. I will check after the weekend how they are performing.
The over-the-weekend test is done. The differences are striking. The amount of "OS buffered/cached" memory is nearly identical, but the used and free values vary heavily:
With block cache:
Without block cache:
My pmap analysis shows:
With block cache:
Without block cache:
Summary: ANON allocations look fine; they seem to be in line with the RocksDB memory-usage wiki page. FILE_ROCKS_SST are the SST files mapped into memory, and I see an issue here. mmap-ing the files is not an issue in itself, but a lot of it is resident in RAM (as seen in the rss column). I assume that any file block read at least ONCE stays in RAM forever because it sits in the OS cache. Questions:
I observed the behavior for a week, dug deep into Linux memory management and gained insight into the issue. Answering my own questions now:
Swap usage: The mystery of swap being used even though there is free memory is solved. Debian 9 puts services in a cgroup, and that cgroup does not follow /proc/sys/vm/swappiness. If others have similar issues, here is how you find your service/cgroup-specific settings:
We see that the default swappiness diverges from the swappiness of the cgroup.
Summarizing a long ticket
I am now closing this ticket, as the first aspect ("swap usage") is not a RocksDB issue. If I gain more insight into memory consumption or leaks, I will open a new ticket. Thanks for the help @ajkr and @yiwu-arbug
Hi, nice to meet you. Did you resolve this problem?
I still see high memory usage, but I am not sure if it is a true leak. I am still in the process of finding the root cause and proof. The most likely cause of high memory usage in my case is iterators. In my opinion iterators hold way more memory than documented in the section "Blocks pinned by iterators" on the page https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB. If you want a possible answer, this is what can contribute to the problem of high memory usage.
A more detailed description follows below. Effectively, iterators likely hold:
The latter could make your memory usage explode. If you use an Iterator pool, there are 2 issues:
Hey @christian-esken, thanks for your very detailed debugging, it is very interesting. I face a similar issue when I use my DB with many CFs and 1+ trillion keys. Several times my DB memory usage has increased by 20-23 GB (all the free memory I have) within a few hours, very randomly.
I think that was used by the compiler, but I don't know why it is not freed... Most of my CFs have the option. For your suggestions:
FYI, RAM usage seems to be directly correlated with the number of sst files. Details:
The root cause of the large explosion of the number of sst files is unclear. The fact that RAM is 3x the disk usage suggests that each sst file is memory-mapped three times!? My app also has a postgres backend, so I can compare. In synthetic benchmarks, rocks is 2x or 3x faster than postgres. Yayy! That's a lot! For the production data runs, it's about 10x slower. Booo! I don't understand the root cause of this, either.
What is the symptom of the "silent corruption"? When you close and re-open the DB, is the original data there or is it still "corrupted"? I am not that familiar with RocksDB 5.17, so I have no idea if anything has been fixed in this area since that time.
Data that I expect to be present goes missing, as if the entire record is absent. To catch it in the act, I also wrote a log of everything written to the DB as an ASCII file. I can see that a write occurs to that key, with the correct data, whereas a later read returns nothing. At this point the app is check-stopped, and I can manually verify that the expected key-value pair really is missing.
When I exit the app and restart, I can see that the original write had gone through, as the key now holds the expected data. I cannot close and reopen without exiting the app, because, despite closing, there are 980 file descriptors that remain open to the 980 sst files. A more elegant response would have been to run a compaction cycle and reduce the number of open files, but perhaps it is trying to avoid error amplification, and just refusing to do anything more. I hadn't noticed the RAM usage until I'd increased max_open_files.
The leaking file descriptor problem persists into rocksdb version 6.19, github master branch as of yesterday. After closing rocksdb, the file descriptors remain open.
Any chance you have something holding old files from being deleted, e.g. very old iterators not destroyed, some scheduled compactions not being run, etc.? Some DB properties to check:
@siying Wow! Thank you! I'm coding in C++ and made the apparently faulty and foolish assumption that iterators were smart pointers that self-deleted when they went out of block scope. It's been a decade since I last coded in a C++ project that required an explicit delete.
@linas How do I set options.max_open_files = 300? I am facing this error: put error: IO error: While open a file for appending: /root/pmem1/db/001021.sst: Too many open files. I would like to try your method if it can solve my problem. Thanks.
Try
I tried to increase the limit: ulimit -n 16384. But it seems like benchmark.sh keeps running until the disk is full (501GB used), then it errors out.
@linas Could you guide me on how to set options.max_open_files = 300?
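For reference, a sketch (not necessarily what the previous posters did): when the DB is opened from your own application code, the limit is a single option. In RocksJava it is setMaxOpenFiles, mirroring the C++ field options.max_open_files referenced above; benchmark.sh/db_bench is configured through its own flags instead, which are not shown here. The DB path below is a placeholder.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class OpenWithFileLimit {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options()
                     .setCreateIfMissing(true)
                     .setMaxOpenFiles(300);               // cap how many SST files RocksDB keeps open
             RocksDB db = RocksDB.open(options, "/tmp/example-db")) {
            // ... use the DB; RocksDB closes and re-opens table files as needed under this cap
        }
    }
}
```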
The default benchmark.sh appears to work on a data set of 1.5TB. With some compaction styles it can go as high as 3TB. If you are OK with running a smaller size, you can override the number of keys with the environment variable NUM_KEYS. Even our official benchmark now runs with a much smaller number of keys: https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks
@siying Okay, I will try to reduce NUM_KEYS. Do you know how to calculate what number of keys the disk can hold given its space?
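As a rough, unofficial rule of thumb (my assumption, not something from the RocksDB docs): the number of keys a disk can hold is approximately usable_disk_bytes / (key_size + value_size), and it is wise to target well under the full disk (for example half of it), because compaction temporarily needs extra space and space amplification varies by compaction style. For instance, with 16-byte keys and 100-byte values, a 400 GB budget corresponds to roughly 400e9 / 116, about 3.4 billion keys, before headroom.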
I want to know whether rocksdb itself closes its iterators or not. I found that memory grows without limit and never decreases. I don't use any iterators in my code, only the Get() interface, so I guess rocksdb uses lots of iterators internally for Get() and does not close them.
I have a rocksdb of 400GB size. It uses 40GB RSS and keeps growing, eating up all the Swap. Finally it eats up all memory. The rocksdb version is 5.8.6.
Expected behavior
According to https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB the RocksDB instance should only use about 8GB - I have done calculations with a worst-case scenario.
Index size: 4.61 GB
Bloom filters: 3.9 GB
Memtables: 10 * 128 MB = 1.28 GB
Block cache: 0.5 GB
Blocks pinned by iterators: 14 MB
Sum: approximately 10.3 GB
I also added memory not mentioned on the Wiki, namely:
Memtables pinned by iterators: 1.28 GB (estimation: presuming they pin some older memtables)
Compaction (I understand that it could double index and bloom filter memory): 4.61 GB + 3.9 GB
Sum: approximately 9.8 GB
All together, taking the worst-case scenario, this is 20 GB.
Actual behavior
RocksDB takes way more memory than expected and finally eats up all memory. The memory consumption is far above expectations: Even shortly after start it requires 33GB RSS.
Detail: Swap is filled before buffers/cache. As you see below the swap is full (7GB) but there is still lots of data cached (14GB), so I guess RocksDB is pinning data in memory. Memory loss happens during reading from the DB via prefix iterator (seek, next), because when just writing (on the average 100MB/s to SSD) we do not lose any memory.
dbdir # free -m
total used free shared buff/cache available
Mem: 64306 49196 293 12 14816 14440
Swap: 7710 7710 0
RocksDB is being used embedded in Java via rocksdbjni. The Java part requires less than 1GB as Heap. Memory usage of that process, as taken from top:
VIRT 0.450t
RES 0.048t
SHR 0.012t
I have run pmap on the process, and summed up RSS for the *.sst files: The sum for *.sst is 34GB.
Steps to reproduce the behavior
We can reproduce the behavior in our application quickly. It never happens if we just write. As soon as we allow clients to read, the memory loss happens. The only operations on this DB are put, prefix-iterator seek() + next(), and deleteRange.
One important observation: we use iterators for a short time, then close them to avoid resource locking by the iterator, and create a new iterator. If we choose a shorter time until discarding an iterator, we create more iterators and the memory loss is faster.
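To make that pattern concrete, here is a minimal sketch of a short-lived prefix scan in RocksJava (assumptions: a hypothetical helper, and a db handle opened elsewhere with a matching fixed-length prefix extractor, e.g. options.useFixedLengthPrefixExtractor(prefix.length); RocksIterator is AutoCloseable, so try-with-resources releases whatever blocks the iterator was pinning):

```java
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;

public class PrefixScan {
    // Iterates all entries sharing the given prefix, then releases the iterator immediately.
    // Assumes the DB was opened with a fixed-length prefix extractor of prefix.length bytes.
    static void scanPrefix(RocksDB db, byte[] prefix) {
        try (ReadOptions readOptions = new ReadOptions().setPrefixSameAsStart(true);
             RocksIterator it = db.newIterator(readOptions)) {
            for (it.seek(prefix); it.isValid(); it.next()) {
                byte[] key = it.key();
                byte[] value = it.value();
                // ... process key/value ...
            }
        } // closing the iterator here releases the blocks and memtable versions it was holding
    }
}
```

This only illustrates the closing discipline; as the report above notes, even short-lived iterators were observed to correlate with memory growth, so closing alone may not be the full story.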