Atomic flush

RocksDB supports atomic flush of multiple column families when the DB option atomic_flush is set to true. The result of flushing multiple column families is written to the MANIFEST with a (logical) all-or-nothing guarantee: either all of the memtables of the column families of interest are persisted to SST files and added to the database, or none of them are.

This is desirable when data in multiple column families must stay consistent with each other. For example, imagine there is a metadata column family meta_cf and a data column family data_cf. Every time we write a new record to data_cf, we also write its metadata to meta_cf. meta_cf and data_cf must be flushed atomically, because the database becomes inconsistent if one of them is persisted but the other is not. Atomic flush rules out that partial outcome. Suppose that at a certain time kv1 exists in the memtables of meta_cf and kv2 exists in the memtables of data_cf. After atomically flushing these two column families, both kv1 and kv2 are persistent if the flush succeeds. Otherwise, neither of them exists in the database.
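The all-or-nothing behavior in the meta_cf/data_cf example can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyDb, ToyColumnFamily, AtomicFlush and DemoAllOrNothing are hypothetical names invented for illustration, the "manifest" is an in-memory map, and none of this reflects how RocksDB implements flush internally. It only shows the idea of preparing every column family's output first and installing all results in one step.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model only: these types are hypothetical and are NOT RocksDB classes.
using KvMap = std::map<std::string, std::string>;

struct ToyColumnFamily {
  std::string name;
  KvMap memtable;   // unflushed writes
  bool fail_write;  // simulate an I/O error while writing this CF's file
};

struct ToyDb {
  // Stand-in for the MANIFEST: column family name -> persisted contents.
  std::map<std::string, KvMap> manifest;
};

// Flush every listed column family, or none of them.
bool AtomicFlush(ToyDb& db, const std::vector<ToyColumnFamily*>& cfs) {
  // Phase 1: prepare one output per column family. Nothing is visible yet.
  std::map<std::string, KvMap> pending;
  for (const ToyColumnFamily* cf : cfs) {
    if (cf->fail_write) {
      return false;  // discard all pending outputs; memtables are untouched
    }
    pending[cf->name] = cf->memtable;
  }
  // Phase 2: install all results together, then drop the flushed memtables.
  for (const auto& entry : pending) {
    KvMap& persisted = db.manifest[entry.first];
    for (const auto& kv : entry.second) {
      persisted[kv.first] = kv.second;
    }
  }
  for (ToyColumnFamily* cf : cfs) {
    cf->memtable.clear();
  }
  return true;
}

// Walks through the failure case and then the success case.
void DemoAllOrNothing() {
  ToyDb db;
  ToyColumnFamily meta_cf{"meta_cf", {{"k1", "v1"}}, false};
  ToyColumnFamily data_cf{"data_cf", {{"k2", "v2"}}, true};  // will fail

  // Failure: neither k1 nor k2 becomes persistent.
  assert(!AtomicFlush(db, {&meta_cf, &data_cf}));
  assert(db.manifest.empty());

  // Success: both k1 and k2 become persistent together.
  data_cf.fail_write = false;
  assert(AtomicFlush(db, {&meta_cf, &data_cf}));
  assert(db.manifest["meta_cf"].count("k1") == 1);
  assert(db.manifest["data_cf"].count("k2") == 1);
}
```

In the failure case meta_cf's output was already prepared when data_cf failed, yet nothing from meta_cf becomes visible. That is the property the example above relies on.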

Since atomic flush also goes through the write_thread, it is guaranteed that no flush can occur in the middle of a write batch.

Note that it is not necessary to use the atomic flush option if the WAL is always enabled. When the WAL is enabled, a single WAL file captures writes to all column families; hence, the database recovered by replaying the WAL in the crash-recovery path is guaranteed to be consistent across all column families.

It's easy to enable/disable atomic flush as a DB option. To open the DB with atomic flush enabled:

Options options;
... // Set other options
options.atomic_flush = true;
DBOptions db_opts(options);
DB* db = nullptr;
Status s = DB::Open(db_opts, dbname, column_families, &handles, &db);

For auto-triggered flush, RocksDB atomically flushes ALL column families.

For manual flush, the application has to specify the list of column families to flush atomically in DB::Flush():

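WriteOptions w_opts;
w_opts.disableWAL = true;  // writes below live only in memtables until flushed
db->Put(w_opts, cf_handle1, key1, value1);
db->Put(w_opts, cf_handle2, key2, value2);
FlushOptions flush_opts;
// Flush both column families together; check s.ok() afterwards.
Status s = db->Flush(flush_opts, {cf_handle1, cf_handle2});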
