
RocksDB is not freeing up space after a delete. #8041

Closed
ilkoiliev951 opened this issue Mar 8, 2021 · 7 comments

@ilkoiliev951


Hello,

Most of our services use a Kafka store, which, as you know, uses RocksDB under the hood. We try to delete outdated and wrongly formatted records every 6 hours in order to free up space. Even though the record gets deleted from RocksDB (a tombstone is added and the record is no longer available), we see no change in disk usage.

I suppose that a compaction needs to be triggered in order to compact away the deleted records. However, as far as I know, a leveled compaction is triggered only when the number of L0 files reaches level0_file_num_compaction_trigger. Because my service consumes almost no data (on the dev environment), I believe that a compaction cannot be triggered and therefore the "deleted" records remain.

Please note that we are using only the default RocksDB configuration. I also noticed that when I use options.setDeleteObsoleteFilesPeriodMicros() in a custom RocksDB config, the size of the local store drops dramatically. However, I am not sure what that method does exactly. I also read that there is an option for periodic compaction.

Any help would be appreciated. Thank you in advance.
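For context, the custom config mentioned above is applied through Kafka Streams' RocksDBConfigSetter hook. A minimal sketch of that wiring, assuming the standard Kafka Streams and RocksJava APIs (the 10-minute period is purely illustrative):

```java
import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.Options;

// Registered via the Streams property "rocksdb.config.setter"
// (StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG).
public class CustomRocksDBConfig implements RocksDBConfigSetter {
    @Override
    public void setConfig(final String storeName, final Options options,
                          final Map<String, Object> configs) {
        // Scan for and delete obsolete files every 10 minutes instead of
        // the default of 6 hours. Note this only reclaims files that are
        // already obsolete (e.g. after a compaction); it does not trigger
        // compactions by itself.
        options.setDeleteObsoleteFilesPeriodMicros(10L * 60 * 1_000_000L);
    }
}
```

This explains why setting the option appeared to shrink the store: obsolete files were simply being collected sooner.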

@riversand963
Contributor

riversand963 commented Mar 17, 2021

Have you considered deletion-triggered compaction? A simple code example is below.

Options options;
options.create_if_missing = true;
// NewCompactOnDeletionCollectorFactory(sliding_window_size, deletion_trigger, deletion_ratio):
// mark an SST file for compaction when 90 of any 100 consecutive entries
// are deletions, or when at least half of the file's entries are deletions.
options.table_properties_collector_factories.emplace_back(
    NewCompactOnDeletionCollectorFactory(100, 90, /*deletion_ratio=*/0.5));
// DestroyAndReopen, Put, Delete, Flush, etc. are helpers from RocksDB's
// internal test harness (DBTestBase); in application code, call the
// corresponding methods on a DB instance.
DestroyAndReopen(options);
for (int i = 0; i < 100; ++i) {
  ASSERT_OK(Put("key" + std::to_string(i), "value"));
}
ASSERT_OK(Flush());
for (int i = 0; i < 50; ++i) {
  ASSERT_OK(Delete("key" + std::to_string(i)));
}
// The second flush produces an SST file dominated by tombstones, which the
// collector marks for compaction; wait for that compaction to finish.
ASSERT_OK(Flush());
ASSERT_OK(dbfull()->TEST_WaitForCompact());
ASSERT_EQ(1, NumTableFilesAtLevel(1));

By the way, some clarification is needed:

> Even though the record gets deleted from RocksDB (a tombstone gets added and the record is no longer available)

> my service is consuming almost no data.

These two sound contradictory, since a deletion still writes a tombstone to your DB, and those writes can trigger compaction.

You can also look at TTL compaction which compacts data to the bottommost level even if there is no write to trigger compaction (https://github.com/facebook/rocksdb/blob/6.18.fb/include/rocksdb/advanced_options.h#L721).
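Since the service configures RocksDB from Java, the TTL knob would look roughly like this. This is a hedged sketch: it assumes your rocksdbjni version exposes setTtl on org.rocksdb.Options (check the javadoc for your release), and the 24-hour value is illustrative:

```java
import org.rocksdb.Options;

final Options options = new Options();
options.setCreateIfMissing(true);
// Files whose data is older than the TTL become eligible for compaction
// even when no new writes would otherwise trigger one. Value in seconds;
// 0 disables the feature.
options.setTtl(24L * 60 * 60);
```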

@ilkoiliev951
Author

Alright, I am going to explain in more detail. What I mean by "the record is deleted" is that it is no longer available in the local Kafka KeyValueStore when we try to retrieve it. By "my service is consuming almost no data" I mean that my local instance consumes almost no data from the Kafka broker. However, this is becoming a huge problem in our production environment, where some of the stores exceed 2 GB.

The main problem is that the old SST files do not get deleted. When I checked, I saw that my Java application still holds file handles to the old SST files. I explicitly closed all the iterators; however, the SST files remain. After setting the RocksDB log level to DEBUG, I saw that a compaction did happen. Is it possible that the open Kafka KeyValue store still holds references to the old SST files, thereby preventing them from being deleted?

Is there a way to implement a Java-based deletion-triggered compaction?

@riversand963
Contributor

Do you see lines like the following for those files in the LOG?

2021/03/22-13:10:38.661781 7f58421fe700 [le/delete_scheduler.cc:77] Deleted file /tmp/rocksdbtest-148062/dbstress/000185.sst immediately, rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000

or something like

[JOB 34] Delete /tmp/rocksdbtest-148062/dbstress/000185.sst type=2 number=..

@ilkoiliev951
Author

ilkoiliev951 commented Mar 22, 2021

Yes, but only for the deletion of the MANIFEST file. I see no jobs scheduled for the deletion of SST files.

@riversand963
Contributor

If the files are not in use, the compaction job will try to delete them once compaction finishes. Something must still be holding references to the files...

@linas

linas commented Apr 13, 2021

FYI, there appears to be a file descriptor leak; I describe this in issues #3216 and #4112. On Unix/Linux, you can delete a file, but as long as a process holds an open file descriptor to it, its disk blocks will not be freed. The OS assumes the process will continue to access the file, so the blocks cannot be reclaimed even though the directory entry is removed (so ls -la no longer shows the file).

You can view deleted-but-still-open files with lsof -p <pid> | grep deleted; in my case, I see hundreds or more of these.

Resolved: see my own comment #8041 (comment) immediately below. There was a file descriptor leak because my own code had an iterator leak. Failing to delete iterators leaves deleted files whose descriptors remain open (eating disk space) and which remain memory-mapped (eating RAM).

You can "watch this happen": run RocksDB for a while and wait until lsof -p <pid> | grep sst | wc gets large. Force a compaction (I can do this by closing the database, reopening it, and closing it again, without exiting the app). Verify that lsof -p <pid> | grep sst | grep deleted | wc is a large number, while the actual number of undeleted SST files on disk is small: find your-rocks.rdb | grep sst | wc stays "reasonable" for your dataset.

Now run df to look at your filesystem. Then exit the app that is using RocksDB and run df again. Notice that dozens or hundreds of GBytes of disk space are now free, while find your-rocks.rdb | grep sst | wc has not changed at all. All of that freed disk space comes from the deleted-but-still-mapped SST files.

@linas

linas commented Apr 13, 2021

Update: after this comment (#3216 (comment)) I came to realize that I had made a complete newbie mistake in my C++ code: RocksDB iterators are NOT smart pointers that delete themselves when they go out of scope. They must be explicitly deleted! Upon fixing this complete-newbie mistake, all my disk and RAM usage problems went away. Wow!

I suggest that anyone else reading this take a close look at their iterators, and review <rocksdb/db.h> for anything else that needs an explicit delete.
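The same pitfall exists on the Java side of this thread: RocksJava objects wrap native handles and must be closed explicitly, since garbage collection does not promptly release them. A sketch of the safe pattern using try-with-resources (the database path is illustrative):

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

public class IteratorCleanup {
    public static void main(final String[] args) throws RocksDBException {
        // Options, RocksDB, and RocksIterator all implement AutoCloseable;
        // try-with-resources closes them in reverse order, so the iterator
        // is closed before the DB, releasing its pin on obsolete SST files.
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/iter-demo");
             RocksIterator it = db.newIterator()) {
            for (it.seekToFirst(); it.isValid(); it.next()) {
                // ... read it.key() / it.value() ...
            }
        } // all native handles released here
    }
}
```

A leaked RocksIterator keeps the SST files it references alive exactly as described above, even after compaction has deleted them from the directory.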
