docs: add doc about the local disk based WAL #140

Merged: 4 commits, Oct 18, 2024
140 changes: 140 additions & 0 deletions content/cn/docs/design/wal_on_disk.md
@@ -0,0 +1,140 @@
---
title: "基于本地磁盘的 WAL"
---

## 架构

本节将介绍基于本地磁盘的单机版 WAL(Write-Ahead Log,以下简称日志)的实现。在此实现中,日志按 region 级别进行管理。

```
            ┌────────────────────────────┐
            │          HoraeDB           │
            │                            │
            │ ┌────────────────────────┐ │
            │ │          WAL           │ │         ┌────────────────────────┐
            │ │                        │ │         │                        │
            │ │         ......         │ │         │      File System       │
            │ │                        │ │         │                        │
            │ │ ┌────────────────────┐ │ │ manage  │ ┌────────────────────┐ │
Write  ─────┼─┼─►       Region       ├─┼─┼─────────┼─►     Region Dir     │ │
            │ │ │                    │ │ │         │ │                    │ │
Read   ─────┼─┼─►   ┌────────────┐   │ │ │  mmap   │ │ ┌────────────────┐ │ │
            │ │ │   │ Segment 0  ├───┼─┼─┼─────────┼─┼─► Segment File 0 │ │ │
            │ │ │   └────────────┘   │ │ │         │ │ └────────────────┘ │ │
Delete ─────┼─┼─►   ┌────────────┐   │ │ │  mmap   │ │ ┌────────────────┐ │ │
            │ │ │   │ Segment 1  ├───┼─┼─┼─────────┼─┼─► Segment File 1 │ │ │
            │ │ │   └────────────┘   │ │ │         │ │ └────────────────┘ │ │
            │ │ │   ┌────────────┐   │ │ │  mmap   │ │ ┌────────────────┐ │ │
            │ │ │   │ Segment 2  ├───┼─┼─┼─────────┼─┼─► Segment File 2 │ │ │
            │ │ │   └────────────┘   │ │ │         │ │ └────────────────┘ │ │
            │ │ │       ......       │ │ │         │ │       ......       │ │
            │ │ └────────────────────┘ │ │         │ └────────────────────┘ │
            │ │         ......         │ │         │         ......         │
            │ └────────────────────────┘ │         └────────────────────────┘
            └────────────────────────────┘
```

## 数据模型

### 文件路径

每个 region 都拥有一个目录,用于管理该 region 的所有 segment。目录名为 region 的 ID。每个 segment 的命名方式为 `seg_<id>`,ID 从 0 开始递增。

### Segment 的格式

一个 region 中所有表的日志都存储在 segments 中,并按照 sequence number 从小到大排列。segment 文件的结构如下:

```
   Segment0            Segment1
┌────────────┐      ┌────────────┐
│ Magic Num  │      │ Magic Num  │
├────────────┤      ├────────────┤
│   Record   │      │   Record   │
├────────────┤      ├────────────┤
│   Record   │      │   Record   │
├────────────┤      ├────────────┤   ....
│   Record   │      │   Record   │
├────────────┤      ├────────────┤
│    ...     │      │    ...     │
│            │      │            │
└────────────┘      └────────────┘
    seg_0               seg_1
```

在内存中,每个 segment 还会存储一些额外的信息以供读写和删除操作使用:

```rust
use std::collections::HashMap;

pub struct Segment {
    /// A hashmap storing both min and max sequence numbers of records within
    /// this segment for each `TableId`.
    table_ranges: HashMap<TableId, (SequenceNumber, SequenceNumber)>,

    /// Positions (start and end offsets) of the records within the segment.
    record_position: Vec<Position>,

    // ... (other fields omitted)
}
```

### 日志格式

segment 中的日志格式如下:

```
+---------+--------+------------+--------------+--------------+-------+
| version |  crc   |  table id  | sequence num | value length | value |
|  (u8)   | (u32)  |   (u64)    |    (u64)     |    (u32)     |(bytes)|
+---------+--------+------------+--------------+--------------+-------+
```

字段说明:

1. `version`:日志版本号。

2. `crc`:用于确保数据一致性。计算从 table id 到该记录结束的 CRC 校验值。

3. `table id`:表的唯一标识符。

4. `sequence num`:记录的序列号。

5. `value length`:value 的字节长度。

6. `value`:通用日志格式中的值。

日志中不存储 region ID,因为可以通过文件路径获取该信息。

## 主要流程

### 打开 Wal

1. 识别 Wal 目录下的所有 region 目录。

2. 在每个 region 目录下,识别所有 segment 文件。

3. 打开每个 segment 文件,遍历其中的所有日志,记录其中每个日志开始和结束的偏移量和每个 `TableId` 在该 segment 中的最小和最大序列号,然后关闭文件。

4. 如果不存在 region 目录或目录下没有任何 segment 文件,则自动创建相应的目录和文件。

### 读日志

1. 根据 segment 的元数据,确定本次读取操作涉及的所有 segment。
2. 按照 id 从小到大的顺序,依次打开这些 segment,将原始字节解码为日志。

### 写日志

1. 将待写入的日志序列化为字节数据,追加到 id 最大的 segment 文件中。
2. 每个 segment 创建时预分配固定大小的 64MB,不会动态改变。当预分配的空间用完后,创建一个新的 segment,并切换到新的 segment 继续追加。

3. 每次追加后不会立即调用 flush;默认情况下,每写入十次或在 segment 文件关闭时才执行 flush。

4. 在内存中更新 segment 的元数据 `table_ranges`。

### 删除日志

假设需要将 id 为 `table_id` 的表中,序列号小于等于 `seq_num` 的日志标记为删除:

1. 在内存中更新相关 segment 的 `table_ranges` 字段,将该表的最小序列号更新为 seq_num + 1。

2. 如果修改后,该表在此 segment 中的最小序列号大于最大序列号,则从 `table_ranges` 中删除该表。

3. 如果一个 segment 的 `table_ranges` 为空,且不是 id 最大的 segment,则删除该 segment 文件。
142 changes: 142 additions & 0 deletions content/en/docs/design/wal_on_disk.md
@@ -0,0 +1,142 @@
---
title: "WAL on Disk"
---

## Architecture

This section describes the implementation of the standalone Write-Ahead Log (WAL, hereafter "the log") based on the local disk. In this implementation, logs are managed per region.

```
            ┌────────────────────────────┐
            │          HoraeDB           │
            │                            │
            │ ┌────────────────────────┐ │
            │ │          WAL           │ │         ┌────────────────────────┐
            │ │                        │ │         │                        │
            │ │         ......         │ │         │      File System       │
            │ │                        │ │         │                        │
            │ │ ┌────────────────────┐ │ │ manage  │ ┌────────────────────┐ │
Write  ─────┼─┼─►       Region       ├─┼─┼─────────┼─►     Region Dir     │ │
            │ │ │                    │ │ │         │ │                    │ │
Read   ─────┼─┼─►   ┌────────────┐   │ │ │  mmap   │ │ ┌────────────────┐ │ │
            │ │ │   │ Segment 0  ├───┼─┼─┼─────────┼─┼─► Segment File 0 │ │ │
            │ │ │   └────────────┘   │ │ │         │ │ └────────────────┘ │ │
Delete ─────┼─┼─►   ┌────────────┐   │ │ │  mmap   │ │ ┌────────────────┐ │ │
            │ │ │   │ Segment 1  ├───┼─┼─┼─────────┼─┼─► Segment File 1 │ │ │
            │ │ │   └────────────┘   │ │ │         │ │ └────────────────┘ │ │
            │ │ │   ┌────────────┐   │ │ │  mmap   │ │ ┌────────────────┐ │ │
            │ │ │   │ Segment 2  ├───┼─┼─┼─────────┼─┼─► Segment File 2 │ │ │
            │ │ │   └────────────┘   │ │ │         │ │ └────────────────┘ │ │
            │ │ │       ......       │ │ │         │ │       ......       │ │
            │ │ └────────────────────┘ │ │         │ └────────────────────┘ │
            │ │         ......         │ │         │         ......         │
            │ └────────────────────────┘ │         └────────────────────────┘
            └────────────────────────────┘
```

## Data Model

### File Paths

Each region has its own directory to manage all segments for that region. The directory is named after the region's ID. Each segment is named using the format `seg_<id>`, with IDs starting from 0 and incrementing.
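
The naming scheme above can be sketched as a pair of small helpers. This is only an illustration: the function names, the `wal_dir` parameter, and the use of `u64` for region and segment IDs are assumptions, not taken from the HoraeDB source.

```rust
use std::path::{Path, PathBuf};

/// Builds the path of a segment file: `<wal_dir>/<region_id>/seg_<segment_id>`.
/// `wal_dir`, `region_id` and `segment_id` are illustrative names (assumptions).
pub fn segment_path(wal_dir: &Path, region_id: u64, segment_id: u64) -> PathBuf {
    wal_dir
        .join(region_id.to_string())
        .join(format!("seg_{}", segment_id))
}

/// Parses a segment ID back out of a file name such as `seg_12`.
/// Returns `None` for names that do not follow the `seg_<id>` pattern.
pub fn parse_segment_id(file_name: &str) -> Option<u64> {
    file_name.strip_prefix("seg_")?.parse().ok()
}
```

For example, `segment_path(Path::new("wal"), 7, 0)` yields `wal/7/seg_0`, and `parse_segment_id("seg_12")` yields `Some(12)`. Because the region ID is part of the path, it never has to be stored in the records themselves.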

### Segment Format

Logs for all tables within a region are stored in segments, arranged in ascending order of sequence numbers. The structure of the segment files is as follows:

```
   Segment0            Segment1
┌────────────┐      ┌────────────┐
│ Magic Num  │      │ Magic Num  │
├────────────┤      ├────────────┤
│   Record   │      │   Record   │
├────────────┤      ├────────────┤
│   Record   │      │   Record   │
├────────────┤      ├────────────┤   ....
│   Record   │      │   Record   │
├────────────┤      ├────────────┤
│    ...     │      │    ...     │
│            │      │            │
└────────────┘      └────────────┘
    seg_0               seg_1
```

In memory, each segment stores additional information used for read, write, and delete operations:

```rust
use std::collections::HashMap;

pub struct Segment {
    /// A hashmap storing both min and max sequence numbers of records within
    /// this segment for each `TableId`.
    table_ranges: HashMap<TableId, (SequenceNumber, SequenceNumber)>,

    /// Positions (start and end offsets) of the records within the segment.
    record_position: Vec<Position>,

    // ... (other fields omitted)
}
```

### Log Format

The log format within a segment is as follows:

```
+---------+--------+------------+--------------+--------------+-------+
| version |  crc   |  table id  | sequence num | value length | value |
|  (u8)   | (u32)  |   (u64)    |    (u64)     |    (u32)     |(bytes)|
+---------+--------+------------+--------------+--------------+-------+
```

Field Descriptions:

1. `version`: Log version number.

2. `crc`: A CRC checksum used to detect corrupted data. It is computed over the bytes from `table id` through the end of the record.

3. `table id`: The unique identifier of the table.

4. `sequence num`: The sequence number of the record.

5. `value length`: The byte length of the value.

6. `value`: The payload bytes of the record, that is, the value in the general log format.

The region ID is not stored in the log because it can be obtained from the file path.
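
As a minimal sketch of this layout, the following encodes and decodes a single record. Several details are assumptions because the document does not state them: little-endian byte order, the CRC-32 (IEEE) variant, and the `VERSION` value. The function names are hypothetical and do not come from the HoraeDB source.

```rust
use std::convert::TryInto;

/// Bitwise CRC-32 (IEEE polynomial, reflected). Assumption: the document does
/// not say which CRC variant the WAL actually uses.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 == 1 { (crc >> 1) ^ 0xEDB8_8320 } else { crc >> 1 };
        }
    }
    !crc
}

const VERSION: u8 = 0; // hypothetical version value
const HEADER_LEN: usize = 1 + 4 + 8 + 8 + 4; // version, crc, table id, seq, length

/// Encodes one record as: version | crc | table id | sequence num | value length | value.
pub fn encode_record(table_id: u64, seq: u64, value: &[u8]) -> Vec<u8> {
    // Bytes covered by the checksum: table id through the end of the value.
    let mut body = Vec::with_capacity(20 + value.len());
    body.extend_from_slice(&table_id.to_le_bytes());
    body.extend_from_slice(&seq.to_le_bytes());
    body.extend_from_slice(&(value.len() as u32).to_le_bytes());
    body.extend_from_slice(value);

    let mut out = Vec::with_capacity(5 + body.len());
    out.push(VERSION);
    out.extend_from_slice(&crc32(&body).to_le_bytes());
    out.extend_from_slice(&body);
    out
}

/// Decodes one record from the start of `buf`. Returns
/// `(table id, sequence num, value, bytes consumed)`, or `None` if the buffer
/// is truncated or the checksum does not match.
pub fn decode_record(buf: &[u8]) -> Option<(u64, u64, &[u8], usize)> {
    if buf.len() < HEADER_LEN {
        return None;
    }
    let crc = u32::from_le_bytes(buf[1..5].try_into().ok()?);
    let table_id = u64::from_le_bytes(buf[5..13].try_into().ok()?);
    let seq = u64::from_le_bytes(buf[13..21].try_into().ok()?);
    let len = u32::from_le_bytes(buf[21..25].try_into().ok()?) as usize;
    let end = HEADER_LEN.checked_add(len)?;
    if buf.len() < end || crc32(&buf[5..end]) != crc {
        return None;
    }
    Some((table_id, seq, &buf[HEADER_LEN..end], end))
}
```

The "bytes consumed" value returned by `decode_record` is what lets a reader walk a segment record by record, since records are variable-length and stored back to back.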

## Main Processes

### Opening the WAL

1. Identify all region directories under the WAL directory.

2. In each region directory, identify all segment files.

3. Open each segment file, traverse all logs within it, record the start and end offsets of each log, and record the minimum and maximum sequence numbers of each `TableId` in the segment, then close the file.

4. If there is no region directory or there are no segment files under the directory, automatically create the corresponding directory and files.
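
Step 3 can be sketched as a single pass that rebuilds the in-memory metadata of one segment. To keep the example free of file I/O, it assumes the records have already been decoded into `(table id, sequence number, encoded length)` tuples; `scan_segment`, `SegmentMeta`, the layout of `Position`, and `data_start` (the offset just past the magic number) are hypothetical names for illustration.

```rust
use std::collections::HashMap;

type TableId = u64;
type SequenceNumber = u64;

/// Start and end byte offsets of one record (assumed layout of `Position`).
#[derive(Debug, PartialEq)]
pub struct Position {
    pub start: u64,
    pub end: u64,
}

#[derive(Debug, Default)]
pub struct SegmentMeta {
    pub table_ranges: HashMap<TableId, (SequenceNumber, SequenceNumber)>,
    pub record_position: Vec<Position>,
}

/// Rebuilds segment metadata from records given in file order.
/// Each item is (table id, sequence number, encoded record length in bytes).
pub fn scan_segment(
    data_start: u64,
    records: impl IntoIterator<Item = (TableId, SequenceNumber, u64)>,
) -> SegmentMeta {
    let mut meta = SegmentMeta::default();
    let mut offset = data_start;
    for (table_id, seq, len) in records {
        // Record where this log starts and ends within the segment.
        meta.record_position.push(Position { start: offset, end: offset + len });
        offset += len;
        // Track the min and max sequence number seen for each table.
        meta.table_ranges
            .entry(table_id)
            .and_modify(|(min, max)| {
                *min = (*min).min(seq);
                *max = (*max).max(seq);
            })
            .or_insert((seq, seq));
    }
    meta
}
```

In this sketch the cost of opening the WAL is proportional to the total number of records on disk, since every record header is visited once.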

### Reading Logs

1. Based on the metadata of the segments, determine all segments involved in the current read operation.

2. Open these segments in order of their IDs from smallest to largest, and decode the raw bytes into logs.
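
A minimal sketch of step 1, assuming a read is expressed as a table ID plus an inclusive sequence-number range. The `segments_to_read` function and the `SegmentMeta` type are illustrative, not the actual HoraeDB API.

```rust
use std::collections::HashMap;

type TableId = u64;
type SequenceNumber = u64;

pub struct SegmentMeta {
    pub id: u64,
    pub table_ranges: HashMap<TableId, (SequenceNumber, SequenceNumber)>,
}

/// Returns the IDs of the segments that may hold records of `table_id` with a
/// sequence number in `[start, end]`, sorted from smallest to largest ID.
pub fn segments_to_read(
    segments: &[SegmentMeta],
    table_id: TableId,
    start: SequenceNumber,
    end: SequenceNumber,
) -> Vec<u64> {
    let mut ids: Vec<u64> = segments
        .iter()
        .filter(|seg| {
            seg.table_ranges
                .get(&table_id)
                // Two closed intervals overlap iff each starts before the other ends.
                .map_or(false, |&(min, max)| min <= end && start <= max)
        })
        .map(|seg| seg.id)
        .collect();
    ids.sort_unstable();
    ids
}
```

Segments whose `table_ranges` has no entry for the table, or whose range does not overlap the request, are never opened.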

### Writing Logs

1. Serialize the logs to be written into byte data and append them to the segment file with the largest ID.

2. Each segment is pre-allocated with a fixed size of 64 MB when it is created, and this size never changes afterwards. When the pre-allocated space is used up, a new segment is created, and appending continues in the new segment.

3. After each append, `flush` is not called immediately; by default, `flush` is performed every ten writes or when the segment file is closed.

4. Update the segment's metadata `table_ranges` in memory.
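
The write path can be modeled in memory as follows. This is a simplified sketch, not the real implementation: it tracks only byte counts instead of writing through mmap, it ignores the space taken by the magic number, it assumes a single record always fits in an empty segment, and `RegionWriter` with all its fields is a hypothetical name. `flush_count` stands in for actual flush calls.

```rust
use std::collections::HashMap;

type TableId = u64;
type SequenceNumber = u64;

/// In-memory stand-in for one segment file (no real I/O in this sketch).
pub struct Segment {
    pub id: u64,
    pub used: usize,
    pub table_ranges: HashMap<TableId, (SequenceNumber, SequenceNumber)>,
}

impl Segment {
    fn new(id: u64) -> Self {
        Segment { id, used: 0, table_ranges: HashMap::new() }
    }
}

pub struct RegionWriter {
    pub segment_size: usize, // 64 MB in the description above
    pub flush_every: usize,  // ten writes by default
    pub segments: Vec<Segment>,
    pub unflushed_writes: usize,
    pub flush_count: usize,
}

impl RegionWriter {
    pub fn new(segment_size: usize, flush_every: usize) -> Self {
        RegionWriter {
            segment_size,
            flush_every,
            segments: vec![Segment::new(0)],
            unflushed_writes: 0,
            flush_count: 0,
        }
    }

    /// Stands in for flushing the memory-mapped file to disk.
    fn flush(&mut self) {
        self.unflushed_writes = 0;
        self.flush_count += 1;
    }

    /// Appends a record of `record_len` encoded bytes to the newest segment.
    pub fn append(&mut self, table_id: TableId, seq: SequenceNumber, record_len: usize) {
        let (last_id, last_used) = {
            let last = self.segments.last().expect("at least one segment");
            (last.id, last.used)
        };
        // Roll over when the pre-allocated space is exhausted; closing the old
        // segment triggers a flush.
        if last_used + record_len > self.segment_size {
            self.flush();
            self.segments.push(Segment::new(last_id + 1));
        }
        let seg = self.segments.last_mut().expect("at least one segment");
        seg.used += record_len;
        // Keep the per-table min/max sequence numbers up to date.
        seg.table_ranges
            .entry(table_id)
            .and_modify(|(min, max)| {
                *min = (*min).min(seq);
                *max = (*max).max(seq);
            })
            .or_insert((seq, seq));
        // Flush only every `flush_every` writes rather than after each append.
        self.unflushed_writes += 1;
        if self.unflushed_writes >= self.flush_every {
            self.flush();
        }
    }
}
```

Batching flushes trades durability for throughput: in this model, up to `flush_every - 1` of the most recent appends may not yet have been flushed at any given moment.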

### Deleting Logs

Suppose the logs of the table with ID `table_id` whose sequence numbers are less than or equal to `seq_num` need to be marked as deleted:

1. Update the `table_ranges` field of the relevant segments in memory, updating the minimum sequence number of the table to `seq_num + 1`.

2. If after modification, the minimum sequence number of the table in this segment is greater than the maximum sequence number, remove the table from `table_ranges`.

3. If a segment's `table_ranges` is empty and it is not the segment with the largest ID, delete the segment file.
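
The three steps can be sketched as one function over the in-memory metadata. Since step 1 sets the new minimum to `seq_num + 1`, this sketch treats `seq_num` itself as deleted. The `mark_deleted` name is hypothetical, and instead of removing files the function returns the IDs of the segments whose files could be deleted.

```rust
use std::collections::HashMap;

type TableId = u64;
type SequenceNumber = u64;

pub struct Segment {
    pub id: u64,
    pub table_ranges: HashMap<TableId, (SequenceNumber, SequenceNumber)>,
}

/// Marks the logs of `table_id` up to and including `seq_num` as deleted and
/// returns the IDs of the segments dropped from `segments` as a result.
pub fn mark_deleted(
    segments: &mut Vec<Segment>,
    table_id: TableId,
    seq_num: SequenceNumber,
) -> Vec<u64> {
    let max_id = match segments.iter().map(|s| s.id).max() {
        Some(id) => id,
        None => return Vec::new(),
    };
    for seg in segments.iter_mut() {
        // Steps 1 and 2 combined: raising the minimum to `seq_num + 1` would
        // exceed the maximum exactly when `max <= seq_num`.
        let emptied = match seg.table_ranges.get_mut(&table_id) {
            Some((min, max)) => {
                if *max <= seq_num {
                    true
                } else {
                    *min = (*min).max(seq_num + 1);
                    false
                }
            }
            None => false,
        };
        if emptied {
            seg.table_ranges.remove(&table_id);
        }
    }
    // Step 3: drop segments that no table references, except the newest one,
    // which is still the append target.
    let mut removed = Vec::new();
    segments.retain(|seg| {
        let deletable = seg.table_ranges.is_empty() && seg.id != max_id;
        if deletable {
            removed.push(seg.id);
        }
        !deletable
    });
    removed
}
```

Note that deletion is purely a metadata operation until a whole segment becomes unreferenced: individual records are never rewritten or removed from a segment file.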