-
Notifications
You must be signed in to change notification settings - Fork 6.3k
Atomic flush
RocksDB supports atomic flush of multiple column families if the DB option atomic_flush
is set to true
. The execution result of flushing multiple column families is written to the MANIFEST with 'all-or-nothing' guarantee (logically). With atomic flush, either all or no memtables of the column families of interest are persisted to SST files and added to the database.
This can be desirable if data in multiple column families must be consistent with each other. For example, imagine there is one metadata column family meta_cf
, and a data column family data_cf
. Every time we write a new record to data_cf
, we also write its metadata to meta_cf
. meta_cf
and data_cf
must be flushed atomically. Database becomes inconsistent if one of them is persisted but the other is not. Atomic flush provides a good guarantee. Suppose at a certain time, kv1 exists in the memtables of meta_cf
and kv2 exists in the memtables of data_cf
. After atomically flushing these two column families, both kv1 and kv2 are persistent if the flush succeeds. Otherwise neither of them exist in the database.
Since atomic flush also goes through the write_thread
, it is guaranteed that no flush can occur in the middle of write batch.
It's easy to enable/disable atomic flush as a DB option. To open the DB with atomic flush enabled,
Options options;
... // Set other options
options.atomic_flush = true;
DBOptions db_opts(options);
DB* db = nullptr;
Status s = DB::Open(db_opts, dbname, column_families, &handles, &db);
Currently the user is responsible for specifying the column families to flush atomically in the case of manual flush.
w_opts.disable_wal = true;
db->Put(w_opts, cf_handle1, key1, value1);
db->Put(w_opts, cf_handle2, key2, value2);
FlushOptions flush_opts;
Status s = db->Flush(flush_opts, {cf_handle1, cf_handle2});
In the case automatic flushes triggered internally by RocksDB, we currently flush all column families in the database for simplicity.
Contents
- RocksDB Wiki
- Overview
- RocksDB FAQ
- Terminology
- Requirements
- Contributors' Guide
- Release Methodology
- RocksDB Users and Use Cases
- RocksDB Public Communication and Information Channels
-
Basic Operations
- Iterator
- Prefix seek
- SeekForPrev
- Tailing Iterator
- Compaction Filter
- Multi Column Family Iterator (Experimental)
- Read-Modify-Write (Merge) Operator
- Column Families
- Creating and Ingesting SST files
- Single Delete
- Low Priority Write
- Time to Live (TTL) Support
- Transactions
- Snapshot
- DeleteRange
- Atomic flush
- Read-only and Secondary instances
- Approximate Size
- User-defined Timestamp
- Wide Columns
- BlobDB
- Online Verification
- Options
- MemTable
- Journal
- Cache
- Write Buffer Manager
- Compaction
- SST File Formats
- IO
- Compression
- Full File Checksum and Checksum Handoff
- Background Error Handling
- Huge Page TLB Support
- Tiered Storage (Experimental)
- Logging and Monitoring
- Known Issues
- Troubleshooting Guide
- Tests
- Tools / Utilities
-
Implementation Details
- Delete Stale Files
- Partitioned Index/Filters
- WritePrepared-Transactions
- WriteUnprepared-Transactions
- How we keep track of live SST files
- How we index SST
- Merge Operator Implementation
- RocksDB Repairer
- Write Batch With Index
- Two Phase Commit
- Iterator's Implementation
- Simulation Cache
- [To Be Deprecated] Persistent Read Cache
- DeleteRange Implementation
- unordered_write
- Extending RocksDB
- RocksJava
- Lua
- Performance
- Projects Being Developed
- Misc