- Author(s): Wish
Currently TiFlash may easily meet write stalls when ingesting snapshots. The ingestions come from:
- BR / Lightning Importing: Ingest SST
- Adding / Removing TiFlash replica: ApplySnapshot
Additionally, TiKV is introducing Big Regions in the Dynamic Region RFC. Ingesting a big snapshot from the big region will also be problematic.
This RFC improves the process of ingesting snapshots.
Improvements come from a thorough study of what's wrong in the current code base, which will be covered in this section.
TiFlash will ingest "data snapshots" in the form of SST files in these cases:
-
SET TIFLASH REPLICA [N]
: Raft Snapshots are sent to TiFlash for each region. Each snapshot contains everything so far in the region. We call this process ApplySnapshot.Note: When TiFlash learner falls behind a lot, ApplySnapshot will also happen, since incremental raft logs are compacted and cannot used to send to TiFlash.
-
Using BR / Lightning to restore data into a table with TiFlash replica: SST files are sent to TiFlash in a Raft Message "Ingest SST". We call this process IngestSST.
The IngestSST operation provided by RocksDB (and thus TiFlash) is to bulk load data without erasing existing data, so that we need to keep this behavior.
This process is bounded by the following rate limits:
- PD schedule threshold
- TiKV Snapshot generation thread size
- TiKV Snapshot write rate limit
- TiFlash Snapshot decode thread size
- TiFlash Snapshot write rate limit
- Hard limit: snapshot-worker is single threaded.
Next, we will analyze the bottleneck of this process.
In this case, the speed is limited by PD schedule threshold.
pending-region is low, means TiFlash ingests snapshots quickly:
After relaxing all soft-coded restrictions by adjusting configurations, ingestion will be ultimately blocked by the single-threaded snapshot worker write stall caused by segment too large.
"Snapshot Flush Duration" is the time of each task spent in snapshot worker:
The major reason for slow Snapshot Flush is foreground segment split:
Typical event timeline happened in one segment during the workload:
- (Segment is split out)
- Delta is large, Prepare merge delta
- snapshot-worker: Ingest × N
- snapshot-worker: Write stall...
- Apply merge delta
- Prepare split (started by apply merge delta)
- snapshot-worker: Ingest × N
- snapshot-worker: Write stall...
- Apply split
- (Back to 3)
Note, when snapshots are applied slower using default config, there will be no write stalls. However the overall write amplification will be larger because "merge delta" operation is likely to be triggered for every ingested snapshots.
Appendix: Adjusted configurations for TiKV and TiFlash:
tikv:
import.num-threads: 16
raftstore.snap-generator-pool-size: 12
server.snap-max-write-bytes-per-sec: 1GiB
tiflash-learner:
import.num-threads: 16
raftstore.apply-low-priority-pool-size: 40
raftstore.snap-handle-pool-size: 40
server.snap-max-write-bytes-per-sec: 1GiB
Unlike ApplySnapshot, IngestSST is bounded by different rate limits:
- BR / Lightning configuration / bandwidth
- TiKV Importer thread size (default=8)
- TiFlash Importer thread size (default=4)
- TiFlash low-priority Apply thread size (default=1)
We can see that the performance is limited by the low-priority Apply thread.
Typical event timeline happened in one segment when using TiDB Lightning:
- (Segment is split out)
- Delta is large, Prepare merge delta
- Ingest
- Apply merge delta
- Delta is large, Prepare merge delta
- Apply merge delta
- Ingest
- Delta is large, Prepare merge delta
- Apply merge delta
- Prepare split (started by apply merge delta)
- Ingest
- Apply split
TiFlash only lags 1min for 100GB data (1 TiKV : 1 TiFlash) in this case. The write amplification is also reduced from 1.64 to 1.02, because repeated merge delta is greatly reduced (500%). Typical event timeline:
- (Segment is split out)
- Delta is large, Prepare merge delta
- Ingest × N
- Apply merge delta
- Prepare split (started by apply merge delta)
- Ingest × N
- Apply split
The new bottleneck in this case is the TiFlash write stall due to delta too large and segment too large, multiple threads need to wait for the same segment to merge delta or split.
-
To improve ApplySnapshot performance, we need to ensure that ingesting snapshots finishes instantly and will not easily cause foreground write stalls. In this way, the single-threaded apply worker will not be blocked. Write stall should happen only when background workers are too busy.
-
To improve IngestSST performance, in addition to the configuration change, we can further reduce write stalls for hot segments. Otherwise the multi-threaded pool will be easily drained out quickly, waiting for the single segment.
This further improvement will be covered in future RFCs, not in this RFC.
Low Priority Apply Pool size will be increased from 1 to, for example, 4 or 8, which is chosen automatically according to the number of CPU cores that TiFlash is allowed to use, so that IngestSST can be parallelized.
During BR restore, TiKV utilizes full resources and causes business impacts. At this moment TiFlash should also utilize more resources in order to speed up the restore and minimize the impact time.
Parallelized ingestion also reduces the overall write amplification.
Consider the case that we are applying a snapshot to a segment. The segment may contain overlapped data:
Previously, the snapshot is ingested into the delta layer as a ColumnFileBig. In this RFC, we will do the following instead:
-
Write DeleteRange to the target ingestion range, as what we did previously.
-
Logical-split the segment according to the range of the snapshot:
-
Replace the segment with the snapshot by placing the snapshot DTFile into the stable directly, then clear the delta layer:
Remember that, for ApplySnapshot, we always clear all data in the range, so that substituting everything in the segment with the snapshot content will not cause correctness issues.
The snapshots may be very small or even empty. For these small snapshots (<16MiB), we will still ingest them into the delta layer, without doing logical split or segment replacement.
After change 2 is applied, there will be many small-sized segments, e.g. around 96MB or 0. To maintain better DeltaTree structure, we can merge these segments into one in the background.
-
Currently segment merge only takes two segments at a time. In this RFC, merging operation will try to take as many segments as possible to reduce the write amplification.
-
Currently segment merge happens in the background task thread, triggered by write or ingest thread. However, as these small segments do not notably affect performance, we can defer it to later. In this RFC, we will change it to be triggered by the GC thread. This could also be helpful to reduce repeated segment split & merge.
When snapshot size is larger than segment size, for example, consider it as 10GiB, which may happen when dynamic region is enabled. In change 2, we will have a standalone segment of size 10GiB first, and then it will be split and become smaller segments. During this process, write stall happens as it exceeds the max segment size.
We can write into multiple DTFiles sequentially, cut by the size of segment size, when transforming from row to column. In this way, the newly formed segment will not be too large and will not cause write stalls.
None
See background section for current bottlenecks.
None