diff --git a/content/issue-1/minilsm.md b/content/issue-1/minilsm.md
new file mode 100644
index 0000000..455f264
--- /dev/null
+++ b/content/issue-1/minilsm.md
@@ -0,0 +1,250 @@
+MiniLSM is a work-in-progress tutorial series for building a storage engine. Based on the experience of building
+[AgateDB](https://github.com/tikv/agatedb) and [RisingWave](https://github.com/risingwavelabs/risingwave), MiniLSM
+introduces you to the basics of an LSM-tree storage engine with a step-by-step tutorial.
+
+The starter code is accessible on [GitHub](https://github.com/skyzh/mini-lsm), and the tutorial is available on
+[GitHub Pages](https://skyzh.github.io/mini-lsm/).
+
+```urlpreview
+https://github.com/skyzh/mini-lsm
+```
+
+## What is LSM, and Why LSM?
+
+A log-structured merge tree (LSM-tree) is a data structure for maintaining key-value pairs. It is widely used in
+distributed database systems like [TiDB](https://www.pingcap.com) and [CockroachDB](https://www.cockroachlabs.com) as
+their underlying storage engine. [RocksDB](http://rocksdb.org), based on [LevelDB](https://github.com/google/leveldb),
+is an implementation of an LSM-tree storage engine. It provides a wide range of key-value access functionality and is
+used in many production systems.
+
+An LSM-tree storage engine usually views keys and values as byte arrays encoded by users. For example, TiDB encodes a row
+in a table [in the following way](https://docs.pingcap.com/tidb/dev/tidb-computing#mapping-table-data-to-key-value):
+
+```
+| row_id | col1 | col2 | col3 |
+| 0      | 1    | 2    | 3    |
+
+key: table1_record00000000
+value: encoded([0, 1, 2, 3])
+```
+
+where the row is encoded into a byte string, so that the LSM engine can easily process it.
+
+Generally speaking, an LSM tree is an append-friendly data structure. It is easier to understand by comparing it with other
+key-value data structures such as RB-Tree and B-Tree. For RB-Tree and B-Tree, all data operations are in-place. That is to
+say, when you update the value corresponding to a key, the value is overwritten at its original memory or disk
+space. Writes in an LSM tree (insertions, updates, and deletions), by contrast, are performed in an append-only way. We
+never change data already written to disk. An LSM-tree engine always writes changes to new files (called
+sorted string tables, a.k.a. SSTs) and then merges existing files (via compaction).
+
+This architectural design makes LSM trees easy to work with.
+
+1. Data is immutable on persistent storage, which makes it easier to offload background tasks (compaction)
+   to remote servers. It is also feasible to directly store and serve data from cloud-native storage systems like S3.
+2. An LSM tree can balance read, write, and space amplification by changing the compaction algorithm. The data
+   structure itself is highly versatile and can be optimized for different workloads.
+
+In this tutorial, we will learn how to build an LSM-tree-based storage engine in the Rust programming language.
+
+## Overview of LSM
+
+An LSM storage engine generally contains 3 parts:
+
+1. A write-ahead log to persist temporary data for recovery.
+2. SSTs on disk for maintaining a tree structure.
+3. Memtables in memory for batching small writes.
+
+The storage engine generally provides the following interfaces:
+
+* `Put(key, value)`: store a key-value pair in the LSM tree.
+* `Delete(key)`: remove a key and its corresponding value.
+* `Get(key)`: get the value corresponding to a key.
+* `Scan(range)`: get a range of key-value pairs.
+
+To ensure persistence,
+
+* `Sync()`: ensure all the operations before `sync` are persisted to the disk.
+
+Some engines choose to combine `Put` and `Delete` into a single operation called `WriteBatch`, which accepts a batch
+of key-value pairs.
+
+In this tutorial, we assume the LSM tree uses a leveled compaction algorithm, which is commonly used in real-world
+systems.
+
+## Write Flow
+
+![](https://skyzh.github.io/mini-lsm/figures/lsm-tutorial/00-lsm-write-flow.svg)
+
+The write flow of LSM contains 4 steps:
+
+1. Write the key-value pair to the write-ahead log, so that it can be recovered after the storage engine crashes.
+2. Write the key-value pair to the memtable. After (1) and (2) complete, we can notify the user that the write operation
+   is completed.
+3. When a memtable is full, we flush it to disk as an SST file in the background.
+4. We compact some files in some level into lower levels to maintain a good shape for the LSM tree, so that read
+   amplification is low.
+
+## Read Flow
+
+![](https://skyzh.github.io/mini-lsm/figures/lsm-tutorial/00-lsm-read-flow.svg)
+
+When we want to read a key,
+
+1. We first probe all the memtables from latest to oldest.
+2. If the key is not found, we then search the entire LSM tree containing SSTs to find the data.
+
+## Tutorial Overview
+
+![](https://skyzh.github.io/mini-lsm/figures/lsm-tutorial/00-lsm-tutorial-overview.svg)
+
+### Day 1 - Block Encoding
+
+A block is the minimum read unit in LSM. It is generally 4KB in size, similar to database pages. In each block, we
+store a sequence of sorted key-value pairs. In day 1, we will explore how to use the `bytes` library to do manual data
+encoding. Also, you will learn how to write a RocksDB-style iterator in Rust.
+
+**What you will implement:**
+
+```rust
+impl BlockBuilder {
+    /// Adds a key-value pair to the block. Returns false when the block is full.
+    #[must_use]
+    pub fn add(&mut self, key: &[u8], value: &[u8]) -> bool {
+        unimplemented!()
+    }
+
+    /// Finalize the block.
+    pub fn build(self) -> Block {}
+}
+
+impl BlockIterator {
+    /// Returns the key of the current entry.
+    pub fn key(&self) -> &[u8] {}
+
+    /// Returns the value of the current entry.
+    pub fn value(&self) -> &[u8] {}
+
+    /// Move to the next key in the block.
+    pub fn next(&mut self) {}
+
+    /// Seek to the first key that is >= `key`.
+    pub fn seek_to_key(&mut self, key: &[u8]) {}
+}
+```
+
+### Day 2 - Table Encoding
+
+SST (sorted string table) is composed of data blocks and index blocks stored on the disk. Usually, data blocks are
+lazily loaded -- they will not be loaded into memory until a user requests them. Index blocks can also be loaded
+on demand, but in this tutorial, we make the simplifying assumption that all SST index blocks (meta blocks) fit in memory.
+Generally, an SST file is 256MB in size. In day 2, we will implement SST encoding and learn how to compose RocksDB-style
+iterators together in Rust.
+
+**What you will implement:**
+
+```rust
+impl SsTableBuilder {
+    /// Adds a key-value pair to SSTable
+    pub fn add(&mut self, key: &[u8], value: &[u8]) {}
+
+    /// Builds the SSTable and writes it to the given path. No need to actually write to disk until
+    /// chapter 4 block cache.
+    pub fn build(self, ...) -> Result<SsTable> {}
+}
+
+impl StorageIterator for SsTableIterator {
+    fn value(&self) -> &[u8] {}
+
+    fn key(&self) -> &[u8] {}
+
+    fn is_valid(&self) -> bool {}
+
+    fn next(&mut self) -> Result<()> {}
+}
+```
+
+### Day 3 - Mem Table and Merge Iterators
+
+In this tutorial, we use [crossbeam-skiplist](https://docs.rs/crossbeam-skiplist) as the implementation of the memtable. You
+will learn how to use a self-referential struct in Rust to erase the lifetime from a Rust-style iterator object, so that
+it can be easily composed with other RocksDB-style iterators in the system.
+
+Also, we will see how to implement high-performance merge iterators with correct error handling using Rust's `BinaryHeap`.
+We intentionally avoid using dynamic dispatch (i.e. `Box<dyn StorageIterator>`) to achieve higher performance. You will need to
+implement two different merge iterators: a two-merge iterator to merge different types of iterators, and a multi-way
+merge iterator to merge multiple iterators of the same type. You will gain a better understanding of lifetimes and the trait
+system in Rust.
+
+**What you will implement:**
+
+```rust
+/// Merge multiple iterators of the same type. If the same key occurs multiple times in some
+/// iterators, prefer the one with smaller index.
+pub struct MergeIterator<I: StorageIterator> {
+    iters: BinaryHeap<HeapWrapper<I>>,
+    current: HeapWrapper<I>,
+}
+
+/// Merges two iterators of different types into one. If the two iterators have the same key, only
+/// produce the key once and prefer the entry from A.
+pub struct TwoMergeIterator<A: StorageIterator, B: StorageIterator> {
+    a: A,
+    b: B,
+}
+
+/// A basic mem-table based on crossbeam-skiplist
+pub struct MemTable {
+    map: crossbeam_skiplist::SkipMap<Bytes, Bytes>,
+}
+```
+
+### Day 4 - Storage Engine and Block Cache
+
+In this part, you will need to compose everything you have implemented in previous parts to build a fully functional
+storage engine. You will explore how surprisingly easy it is to maintain object references and ownership in a
+multi-threaded program with the help of the borrow checker.
+
+
+**What you will implement:**
+
+```rust
+impl LsmStorage {
+    /// Get a key from the storage. In day 7, this can be further optimized by using a bloom filter.
+    pub fn get(&self, key: &[u8]) -> Result<Option<Bytes>> {}
+
+    /// Put a key-value pair into the storage by writing into the current memtable.
+    pub fn put(&self, key: &[u8], value: &[u8]) -> Result<()> {}
+
+    /// Remove a key from the storage by writing an empty value.
+    pub fn delete(&self, _key: &[u8]) -> Result<()> {}
+
+    /// Persist data to disk.
+    ///
+    /// In day 3: flush the current memtable to disk as L0 SST.
+    /// In day 6: call `fsync` on WAL.
+    pub fn sync(&self) -> Result<()> {}
+
+    /// Create an iterator over a range of keys.
+    pub fn scan(
+        &self,
+        _lower: Bound<&[u8]>,
+        _upper: Bound<&[u8]>,
+    ) -> Result<FusedIterator<LsmIterator>> {}
+}
+```
+
+### Day 5 - Compaction
+
+We now have an LSM structure, and we can merge some files in the background to clean up garbage data and improve
+read efficiency. We will use RocksDB's leveled compaction algorithm to do the compaction.
+
+### Day 6 - Recovery
+
+A storage engine should provide durability, which means that data won't be lost after the engine crashes. In an LSM-tree, we
+achieve durability by using a WAL (write-ahead log). In day 6, you will modify the write path to persist in-memory
+data to disk, and implement the recovery process on startup.
+
+### Day 7 - Optimizations
+
+Two key optimizations are widely adopted in today's LSM-tree engines -- prefix key compression and bloom filters. We
+will implement the new key encoding and the bloom filter on the last day.
diff --git a/content/issue-1/zine.toml b/content/issue-1/zine.toml
index 4089b62..ed44025 100644
--- a/content/issue-1/zine.toml
+++ b/content/issue-1/zine.toml
@@ -22,6 +22,15 @@ pub_date = "2022-12-24"
 publish = false
 featured = true
 
+[[article]]
+file = "minilsm.md"
+title = "MiniLSM: A Tutorial of Building Storage Engine in a Week using Rust"
+author = "skyzh"
+topic = ["tutorial"]
+pub_date = "2022-12-30"
+publish = false
+featured = true
+
 [[article]]
 file = "contribute-to-rustc.md"
 title = "How I contribute to Rust Compiler"
diff --git a/static/avatar/skyzh.jpeg b/static/avatar/skyzh.jpeg
new file mode 100644
index 0000000..c115c42
Binary files /dev/null and b/static/avatar/skyzh.jpeg differ
diff --git a/zine-data.json b/zine-data.json
new file mode 100644
index 0000000..9e49427
--- /dev/null
+++ b/zine-data.json
@@ -0,0 +1,9 @@
+{
+  "urlPreviews": {
+    "https://github.com/skyzh/mini-lsm": [
+      "skyzh/mini-lsm: A tutorial of building an LSM-Tree storage engine in a week! (WIP)",
+      "A tutorial of building an LSM-Tree storage engine in a week! (WIP) - GitHub",
+      "https://opengraph.githubassets.com/83c82d82a7ab5be32dc4bf96da23f2d066f84fc56aadcfc8e3e866e1472171a9/skyzh/mini-lsm"
+    ]
+  }
+}
\ No newline at end of file
diff --git a/zine.toml b/zine.toml
index abf3956..e0dafd1 100644
--- a/zine.toml
+++ b/zine.toml
@@ -21,6 +21,7 @@ The author of [Rust Search Extension](https://github.com/huhu/rust-search-extens
 lcrossman = { name = "LCrossman", editor = true }
 macko76 = { editor = true }
 zhanghandong = { editor = true }
+skyzh = { avatar = "/static/avatar/skyzh.jpeg" }
 yukang = { name = "Yukang", bio = """
 - [Github](https://github.com/chenyukang)
 - [Blog](https://catcoding.me)
@@ -46,3 +47,4 @@ announcement = {}
 game = { name = "Game development", description = "Topic of rust game development." }
 pr-demystifying = { name = "PR Demystifying", description = "Topic of PR demystifying in Rust community."}
 optimization = {}
+tutorial = {}