From b029e0585a6dc628342d1ae0e0e38311fdb78afc Mon Sep 17 00:00:00 2001
From: Yingchun Lai
Date: Tue, 6 Feb 2024 14:24:34 +0800
Subject: [PATCH] Update usage-scenario docs (#75)

---
 _docs/en/administration/usage-scenario.md |  79 +++++++++++++-
 _docs/zh/administration/usage-scenario.md | 126 +++++++++++-----------
 2 files changed, 139 insertions(+), 66 deletions(-)

diff --git a/_docs/en/administration/usage-scenario.md b/_docs/en/administration/usage-scenario.md
index dc0d47fc..6e46c9fc 100644
--- a/_docs/en/administration/usage-scenario.md
+++ b/_docs/en/administration/usage-scenario.md
@@ -2,4 +2,81 @@
permalink: administration/usage-scenario
---
-TRANSLATING
+> Since 1.8.1, Pegasus supports the Usage Scenario function.

# Principle

The Usage Scenario function allows specifying a Pegasus table's _usage scenario_. By optimizing the underlying RocksDB options for different scenarios, better read and write performance can be achieved.

RocksDB adopts the LSM-tree storage architecture, in which [Compaction](https://github.com/facebook/rocksdb/wiki/Compaction) has a significant impact on read and write performance. Pegasus adopts the _Classic Leveled_ compaction algorithm, whose principle is described in [Leveled-Compaction](https://github.com/facebook/rocksdb/wiki/Leveled-Compaction).

RocksDB is a highly configurable engine: the various flush and compaction behaviors can be adjusted through options, and some options can be modified at runtime. Here are several key options:
> (The option descriptions are taken from the RocksDB source code)
> * write_buffer_size: Amount of data to build up in memory before converting to a sorted on-disk file.
> * level0_file_num_compaction_trigger: Number of files to trigger level-0 compaction. A value <0 means that level-0 compaction will not be triggered by number of files at all.
> * level0_slowdown_writes_trigger: Soft limit on number of level-0 files. We start slowing down writes at this point. A value <0 means that no writing slow down will be triggered by number of files in level-0.
> * level0_stop_writes_trigger: Maximum number of level-0 files. We stop writes at this point.
> * max_bytes_for_level_base: Control maximum total data size for a level. max_bytes_for_level_base is the max total for level-1.
> * max_bytes_for_level_multiplier: Maximum number of bytes for level L can be calculated as (max_bytes_for_level_base) * (max_bytes_for_level_multiplier ^ (L-1)).
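To make these options concrete, here is a minimal sketch of how they map onto `rocksdb::Options`, with the per-level target sizes from the last formula worked out in the comments. The numeric values are illustrative assumptions, not Pegasus defaults:

```cpp
#include <rocksdb/options.h>

int main() {
  rocksdb::Options opts;
  // Illustrative values only -- not Pegasus defaults.
  opts.write_buffer_size = 64 << 20;            // flush a memtable once it holds ~64 MB
  opts.level0_file_num_compaction_trigger = 4;  // start L0->L1 compaction at 4 L0 files
  opts.level0_slowdown_writes_trigger = 20;     // throttle writes at 20 L0 files
  opts.level0_stop_writes_trigger = 36;         // reject writes at 36 L0 files
  opts.max_bytes_for_level_base = 256 << 20;    // level-1 target size: 256 MB
  opts.max_bytes_for_level_multiplier = 10;     // each level 10x the previous one
  // Target size of level L = max_bytes_for_level_base * multiplier^(L-1):
  //   level-1: 256 MB, level-2: 2.5 GB, level-3: 25 GB, ...
  return 0;
}
```

With these example values, writes are throttled once level-0 holds 20 files and rejected at 36, independent of the per-level size targets.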
When providing read and write services, Pegasus needs to take these factors into account:
* The faster the write operations, the faster memtables fill up and are flushed, generating new sstable files on level-0
* As sstable files accumulate on level-0, compaction operations are triggered and propagate from lower to higher levels
* The more compaction operations there are, the more CPU and disk IO they consume, which in turn affects the performance of read and write operations
* If the level-0 to level-1 compaction is slower than data writing, files accumulate on level-0 until the `level0_slowdown_writes_trigger` threshold is reached, causing write latency to increase sharply, and possibly even the `level0_stop_writes_trigger` threshold, causing write operations to fail and harming system stability and service availability
* It is difficult to satisfy both high-throughput and low-latency requirements for reads and writes simultaneously; a trade-off is needed:
  * The faster compaction proceeds, the fewer files accumulate on level-0 and the fewer files a read operation has to consult, so read performance is higher
  * But the faster compaction proceeds, the greater the write amplification and the higher the CPU and disk IO load, which also hurts read and write performance

Fortunately, RocksDB provides some solutions to this issue, for example in the [RocksDB-FAQ](https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ):

> Q: What's the fastest way to load data into RocksDB?
>
> A: A fast way to direct insert data to the DB:
>
> 1. using single writer thread and insert in sorted order
> 2. batch hundreds of keys into one write batch
> 3. use vector memtable
> 4. make sure options.max_background_flushes is at least 4
> 5. before inserting the data, disable automatic compaction, set options.level0_file_num_compaction_trigger, options.level0_slowdown_writes_trigger and options.level0_stop_writes_trigger to very large value. After inserting all the data, issue a manual compaction.
>
> 3-5 will be automatically done if you call Options::PrepareForBulkLoad() to your option
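For reference, the following sketch shows roughly what step 5 of that FAQ answer looks like against the RocksDB C++ API; the "very large" trigger values here are placeholder assumptions:

```cpp
#include <rocksdb/options.h>

rocksdb::Options bulk_load_options() {
  rocksdb::Options opts;
  // Step 5 of the FAQ done by hand: disable automatic compaction and
  // push the level-0 triggers out of reach ("very large" placeholders).
  opts.disable_auto_compactions = true;
  opts.level0_file_num_compaction_trigger = 1 << 30;
  opts.level0_slowdown_writes_trigger = 1 << 30;
  opts.level0_stop_writes_trigger = 1 << 30;
  // Alternatively, opts.PrepareForBulkLoad() applies steps 3-5 in one call.
  return opts;
}
```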
Pegasus's solution is to set different RocksDB options for different usage scenarios, adjusting the behavior of RocksDB to provide better read and write performance. Specifically:
* Use the [Table environment variable](table-env) `rocksdb.usage_scenario` to specify the usage scenario
* When a table's replicas detect that this environment variable has changed, they modify the RocksDB options according to the new usage scenario
> Refer to the `set_usage_scenario()` function in [src/server/pegasus_server_impl.cpp](https://github.com/apache/incubator-pegasus/blob/master/src/server/pegasus_server_impl.cpp) to see which options are modified.

# Supported scenarios

Currently, Pegasus supports three scenarios:
* normal: Normal scenario, balancing reads and writes. This is the default scenario for tables; it applies no special optimization for writes and is suitable for most read-write balanced applications
* prefer_write: Write-heavy, read-light scenario. Mainly increases `write_buffer_size` and `level0_file_num_compaction_trigger` to slow down memtable flushes and level-0 to level-1 compactions
* bulk_load: Scenario for loading data in bulk (note: this is not [bulk-load](https://pegasus.apache.org/zh/2020/02/18/bulk-load-design.html)). It applies the optimization mentioned in the RocksDB-FAQ above and disables compaction, so all newly written data accumulates on level-0, which is not read-friendly. Therefore, the `bulk_load` scenario is usually used together with [Manual Compact](manual-compact): after the data loading completes, perform a _Manual Compact_ to quickly collect garbage and fully sort the data for better read performance, then restore the `normal` scenario. A typical bulk data import process (see the sketch after this list):
  * Set the table's `rocksdb.usage_scenario` to `bulk_load`
  * Load the data: in the `bulk_load` scenario the write TPS is higher and the traffic is more stable
  * Run _Manual Compact_: this consumes a significant amount of CPU and disk IO resources and may affect the cluster's read and write performance
  * Reset the table's `rocksdb.usage_scenario` to `normal`
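Putting the steps together, a complete import might look like the following session. This is only a sketch: the table name `temp` and the meta server list are examples, and the `pegasus_manual_compact.sh` invocation and its flags are assumptions to be verified against the [Manual Compact](manual-compact) docs:

```
# 1. switch the table into the bulk_load scenario (in the Pegasus shell)
>>> use temp
>>> set_app_envs rocksdb.usage_scenario bulk_load

# 2. run your data loading jobs ...

# 3. trigger a manual compaction (example meta list; flags assumed)
$ ./scripts/pegasus_manual_compact.sh -c 127.0.0.1:34601,127.0.0.1:34602 -a temp

# 4. switch the table back to the normal scenario
>>> use temp
>>> set_app_envs rocksdb.usage_scenario normal
```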
# How to use it?

## Through shell tools

Use the [set_app_envs](/docs/tools/shell/#set_app_envs) command in the shell tools, for example, to set table `temp` to the `bulk_load` scenario:
```
>>> use temp
>>> set_app_envs rocksdb.usage_scenario bulk_load
```

> Table environment variables don't take effect immediately; it takes tens of seconds (depending on the `[replication]config_sync_interval_ms` option) for them to take effect on all replicas.

## Through an assisted script

Pegasus provides an assisted script [scripts/pegasus_set_usage_scenario.sh](https://github.com/apache/incubator-pegasus/blob/master/scripts/pegasus_set_usage_scenario.sh) to set the environment variable conveniently. Usage:
```
$ ./scripts/pegasus_set_usage_scenario.sh
This tool is for set usage scenario of specified table(app).
USAGE: ./scripts/pegasus_set_usage_scenario.sh
```

This script sets the table environment variable through the shell tool commands and then checks whether it has taken effect on all replicas; the operation is considered complete only when all replicas have applied it.

diff --git a/_docs/zh/administration/usage-scenario.md b/_docs/zh/administration/usage-scenario.md
index 191218e3..536b9f87 100644
--- a/_docs/zh/administration/usage-scenario.md
+++ b/_docs/zh/administration/usage-scenario.md
@@ -2,85 +2,81 @@
permalink: administration/usage-scenario
---
-注：Usage Scenario功能从v1.8.1版本开始支持。
+> 从 1.8.1 版本开始，Pegasus 支持了 Usage Scenario 功能。

# 原理

-Usage Scenario功能，是指对于Pegasus的表，可以指定其使用场景。针对不同的场景，通过优化底层RocksDB的配置，以获得更好的读写性能。
-
-我们知道，RocksDB的LSM-Tree设计具有明显的写放大效应。数据先是写入到memtable中，当memtable满了，就会flush到sstable中，并放在level-0层。当level-0层的文件个数达到预设的限制时，就会触发compaction操作，将level-0层的所有文件merge到level-1层。同理，当level-1层的数据量达到预设的限制时，也会触发level-1层的compaction，将挑选一些文件merge到level-2层。这样，数据会从低层往高层逐层上移。
-
-RocksDB是一个可配置性很强的引擎，上面的各种行为都可以通过配置参数来调节，并且有很多参数都是可以动态修改的。这里给出几个比较关键的配置参数：
-* write_buffer_size：memtable的大小限制，配置得越小，写入同样数据量产生的sstable文件数就越多。
-* level0_file_num_compaction_trigger：level-0层的文件个数限制，当达到这个限制时，就会触发compaction。
-* level0_slowdown_writes_trigger：当level-0层的文件个数超过这个值的时候，就会触发slowdown-writes行为，通过主动提升写操作延迟，来降低写入速度。
-* level0_stop_writes_trigger：当level-0层的文件个数超过这个值的时候，就会触发stop-writes行为，拒绝写入新的数据。
-* max_bytes_for_level_base：level-1层的数据量限制，当达到这个限制时，就会挑选一些文件merge到level-2层。
-* max_bytes_for_level_multiplier：数据量随层数的增长因子，譬如如果设为10，表示level-2的数据量限制就是level-1的10倍，level-3的数据量限制也是level-2的10倍。
-
-Pegasus在提供读写服务的时候，需要考虑这些因素：
-* 当写数据的速度较大时，memtable很快就会写满，就会不断flush memtable产生新的sstable文件；
-* 新sstable文件的产生就会触发compaction，并从低层向高层逐层蔓延；
-* compaction需要耗费大量的CPU和IO，造成**机器的CPU和IO负载居高不下**，影响读写操作的性能；
-* 如果compaction速度赶不上数据写入的速度，level-0的文件数就会越堆越多，最终达到`level0_slowdown_writes_trigger`的限制，**使写操作的延迟陡增**；甚至进一步达到`level0_stop_writes_trigger`的限制，**使写操作失败**，影响系统的稳定性和服务的可用性。
-* **读写需求很难同时得到满足**，二者需要权衡。compaction进行得越快，level-0的文件数维持得越少，读的时候需要读取的文件个数就越少，读性能就越高；但是compaction越快，带来的写放大效应就越大，CPU和IO的负载就越重，也会影响读写性能。
-
-所幸的是，RocksDB针对这个问题也给出了一些解决方案，譬如在[RocksDB-FAQ](https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ)中给出的方案：
-```
-Q: What's the fastest way to load data into RocksDB?
-
-A: A fast way to direct insert data to the DB:
- 1. using single writer thread and insert in sorted order
- 2. batch hundreds of keys into one write batch
- 3. use vector memtable
- 4. make sure options.max_background_flushes is at least 4
- 5. before inserting the data, disable automatic compaction, set options.level0_file_num_compaction_trigger,
- options.level0_slowdown_writes_trigger and options.level0_stop_writes_trigger to very large. After inserting all the
- data, issue a manual compaction.
-
-3-5 will be automatically done if you call Options::PrepareForBulkLoad() to your option
-```
-而我们的思路正是：通过针对不同业务场景，设置不同的RocksDB参数，调节RocksDB的行为，以提供更好的读写性能。具体来说：
-* 通过[Table环境变量](table-env)设置`rocksdb.usage_scenario`来指定当前的业务场景。
-* Replica在检测到该环境变量发生变化时，就会根据业务场景，动态修改RocksDB的配置参数。具体设置了哪些参数，请参见[src/server/pegasus_server_impl.cpp](https://github.com/apache/incubator-pegasus/blob/master/src/server/pegasus_server_impl.cpp)中的`set_usage_scenario()`方法。
+Usage Scenario 功能，是指对 Pegasus 表指定 _使用场景_。针对不同的场景，通过优化底层 RocksDB 的配置，以获得更好的读写性能。
+
+RocksDB 采用了 LSM tree 存储架构，其中的 [Compaction](https://github.com/facebook/rocksdb/wiki/Compaction) 会较大地影响读写性能。Pegasus 采用了 Classic Leveled 算法，它的 compaction 原理可参考 [Leveled-Compaction](https://github.com/facebook/rocksdb/wiki/Leveled-Compaction)。
+
+RocksDB 是一个可配置性很强的引擎，各种 flush 操作和 compaction 操作都可以通过配置调节，并且有部分配置是可以运行时修改的。这里给出几个比较关键的配置：
> (配置说明源自 RocksDB 源码)
> * write_buffer_size: Amount of data to build up in memory before converting to a sorted on-disk file.
> * level0_file_num_compaction_trigger: Number of files to trigger level-0 compaction. A value <0 means that level-0 compaction will not be triggered by number of files at all.
> * level0_slowdown_writes_trigger: Soft limit on number of level-0 files. We start slowing down writes at this point. A value <0 means that no writing slow down will be triggered by number of files in level-0.
> * level0_stop_writes_trigger: Maximum number of level-0 files. We stop writes at this point.
> * max_bytes_for_level_base: Control maximum total data size for a level. max_bytes_for_level_base is the max total for level-1.
> * max_bytes_for_level_multiplier: Maximum number of bytes for level L can be calculated as (max_bytes_for_level_base) * (max_bytes_for_level_multiplier ^ (L-1)).
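为了更直观地理解这些配置的作用，下面给出一段将它们映射到 `rocksdb::Options` 的最小示例，并在注释中按上面的公式演算了各层的目标大小。其中的数值仅为示例假设，并非 Pegasus 的默认配置：

```cpp
#include <rocksdb/options.h>

int main() {
  rocksdb::Options opts;
  // 以下数值仅为示例，并非 Pegasus 的默认配置。
  opts.write_buffer_size = 64 << 20;            // memtable 约 64 MB 时 flush
  opts.level0_file_num_compaction_trigger = 4;  // level-0 达到 4 个文件时触发 compaction
  opts.level0_slowdown_writes_trigger = 20;     // level-0 达到 20 个文件时减慢写入
  opts.level0_stop_writes_trigger = 36;         // level-0 达到 36 个文件时拒绝写入
  opts.max_bytes_for_level_base = 256 << 20;    // level-1 目标大小：256 MB
  opts.max_bytes_for_level_multiplier = 10;     // 每层是上一层的 10 倍
  // 第 L 层的目标大小 = max_bytes_for_level_base * multiplier^(L-1)：
  //   level-1：256 MB，level-2：2.5 GB，level-3：25 GB，……
  return 0;
}
```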
Pegasus 在提供读写服务时，需要考虑这些因素：
* 写操作越快，memtable 就会越快写满，也就会越快地 flush 产生 level-0 上新的 sstable 文件
* 随着 level-0 上 sstable 文件的累积，compaction 操作被触发，并从低层向高层逐层蔓延
* Compaction 操作越多，耗费的 CPU 和磁盘 IO 负载越高，从而影响读写操作的性能
* 如果 level-0 到 level-1 的 compaction 操作速度低于数据写入的速度，level-0 的文件数就会越积越多，最终达到 `level0_slowdown_writes_trigger` 阈值，使写操作的延迟陡增，甚至进一步达到 `level0_stop_writes_trigger` 阈值，使写操作失败，影响系统的稳定性和服务的可用性
* 高吞吐且低延迟的读写需求很难同时得到满足，二者需要权衡：
  * Compaction 操作进行得越快，level-0 累积的文件数越少，读操作需要读取的文件个数就越少，读性能就越高
  * 但是 compaction 越快，带来的写放大就越大，CPU 和磁盘 IO 负载就越高，也会影响读写性能

所幸的是，RocksDB 针对这个问题也给出了一些解决方案，例如在 [RocksDB-FAQ](https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ) 中给出的方案：

> Q: What's the fastest way to load data into RocksDB?
>
> A: A fast way to direct insert data to the DB:
>
> 1. using single writer thread and insert in sorted order
> 2. batch hundreds of keys into one write batch
> 3. use vector memtable
> 4. make sure options.max_background_flushes is at least 4
> 5. before inserting the data, disable automatic compaction, set options.level0_file_num_compaction_trigger, options.level0_slowdown_writes_trigger and options.level0_stop_writes_trigger to very large value. After inserting all the data, issue a manual compaction.
>
> 3-5 will be automatically done if you call Options::PrepareForBulkLoad() to your option
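作为参考，上述 FAQ 中第 5 步对应的 RocksDB C++ API 调用大致如下，其中“很大的值”使用的是示意用的占位数值：

```cpp
#include <rocksdb/options.h>

rocksdb::Options bulk_load_options() {
  rocksdb::Options opts;
  // 手动完成 FAQ 中的第 5 步：禁用自动 compaction，
  // 并把 level-0 的各个阈值调到不会触达的“很大的值”（占位数值）。
  opts.disable_auto_compactions = true;
  opts.level0_file_num_compaction_trigger = 1 << 30;
  opts.level0_slowdown_writes_trigger = 1 << 30;
  opts.level0_stop_writes_trigger = 1 << 30;
  // 也可以直接调用 opts.PrepareForBulkLoad()，一次性完成第 3-5 步。
  return opts;
}
```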
+Pegasus 的解决方案是，针对不同的应用场景设置不同的 RocksDB 参数，调节 RocksDB 的行为，以提供更好的读写性能。具体来说：
+* 通过 [Table 环境变量](table-env) 设置 `rocksdb.usage_scenario` 来指定对应的应用场景
+* 各表的 replica 在检测到该环境变量发生变化时，就会根据新的应用场景修改 RocksDB 的配置参数
+> 具体设置了哪些参数，可参考 [src/server/pegasus_server_impl.cpp](https://github.com/apache/incubator-pegasus/blob/master/src/server/pegasus_server_impl.cpp) 中的 `set_usage_scenario()` 方法。

# 支持场景

-目前支持三种场景：
-* normal：正常场景，读写兼顾。这也是表的默认场景，该场景不会对写进行特别的优化，适合大部分读多写少或者读写均衡的应用。
-* prefer_write：写较多的场景。主要是增大`write_buffer_size`以降低sstable的产生速度。
-* bulk_load：灌数据场景。应用上面RocksDB-FAQ中提到的优化，避免compaction过程。因为Bulk load模式停止compaction，所以写入的数据都会堆放在level-0层，对读不友好。因此，Bulk load模式通常与[Manual Compact功能](manual-compact)配合使用，在数据加载完成后进行一次Manual Compact，以去除垃圾数据、提升读性能（参见后面的[应用示例](#应用示例)）。另外，当不需要加载数据时，应当恢复为Normal模式。典型的灌数据流程：
- * 设置表的Usage Scenario模式为bulk_load
- * 灌数据：在bulk load模式下灌数据的QPS会更高，流量更稳定
- * 执行Manual Compact：该过程消耗大量的CPU和IO，可能对集群读写性能有影响
- * 恢复表的Usage Scenario模式为normal
-
-# 如何设置
-## 通过shell设置
-通过shell的[set_app_envs命令](/overview/shell#set_app_envs)来设置，譬如设置temp表为bulk_load模式：
+目前 Pegasus 支持三种场景：
+* normal：正常场景，读写兼顾。这也是表的默认场景，该场景不会对写进行特别的优化，适合大部分读写均衡的应用
+* prefer_write：写多读少的场景。主要是增大 `write_buffer_size` 和 `level0_file_num_compaction_trigger`，以降低 memtable 的 flush 操作和 level-0 到 level-1 的 compaction 操作的速度
+* bulk_load：批量导入数据的场景（注意这不是 [bulk-load](https://pegasus.apache.org/zh/2020/02/18/bulk-load-design.html)）。使用上面 RocksDB-FAQ 中提到的优化，禁用 compaction 操作。此时，所有新写入的数据都会堆积在 level-0 层，对读不友好。因此，`bulk_load` 场景通常与 [Manual Compact 功能](manual-compact) 结合使用：在数据导入完成后进行一次 Manual Compact，以快速进行垃圾回收和全局排序来提升读性能，然后恢复为 `normal` 场景。典型的批量数据导入流程：
+ * 设置表的 `rocksdb.usage_scenario` 为 `bulk_load`
+ * 导入数据：在 `bulk_load` 场景下数据写入 TPS 会更高，流量更稳定
+ * 执行 Manual Compact：该过程会消耗大量的 CPU 和磁盘 IO 资源，可能对集群读写性能有影响
+ * 恢复表的 `rocksdb.usage_scenario` 为 `normal`
+
+# 如何使用
+
+## 通过 shell 工具设置
+
+通过 shell 工具的 [set_app_envs](/docs/tools/shell/#set_app_envs) 命令来设置，例如设置 temp 表为 bulk_load 场景：
```
>>> use temp
>>> set_app_envs rocksdb.usage_scenario bulk_load
```
-Table环境变量不会立即生效，大约需要等几十秒后才能在所有replica上生效。
+> Table 的环境变量不会立即生效，大约需要等几十秒（取决于配置项 `[replication]config_sync_interval_ms`）后才能在所有 replica 上生效。

-## 通过脚本设置
-我们提供了一个脚本工具[scripts/pegasus_set_usage_scenario.sh](https://github.com/apache/incubator-pegasus/blob/master/scripts/pegasus_set_usage_scenario.sh)来方便地设置，用法：
+## 通过辅助脚本设置
+
+Pegasus 提供了一个辅助脚本 [scripts/pegasus_set_usage_scenario.sh](https://github.com/apache/incubator-pegasus/blob/master/scripts/pegasus_set_usage_scenario.sh) 来方便地设置环境变量，用法：
```
-$ ./scripts/pegasus_set_usage_scenario.sh
+$ ./scripts/pegasus_set_usage_scenario.sh
This tool is for set usage scenario of specified table(app).
USAGE: ./scripts/pegasus_set_usage_scenario.sh
```
-该工具会调用shell命令设置Table环境变量，然后还会检测是否在所有replica上都已经生效，只有所有都生效了才算执行完成。
-
-## 应用示例
-
-bulk_load模式通常用于灌数据，但是在灌数据过程中因为消耗大量的CPU和IO，对读性能会产生较大影响，造成读延迟陡增、超时率升高等。如果业务对读性能要求比较苛刻，可以考虑**读写分离的双集群方案**。
-
-假设两个集群分别为A和B，最初线上流量访问A集群，灌数据流程：
-* 第一步：设置B模式为bulk_load -> 灌数据至B -> Manual Compact B -> 设置B模式为normal -> 切线上流量至B
-* 第二步：设置A模式为bulk_load -> 灌数据至A -> Manual Compact A -> 设置A模式为normal -> 切线上流量至A
-
-关于如何Manual Compact，请参考[Manual-Compact功能](manual-compact)。
+该脚本会调用 shell 工具中设置 Table 环境变量的命令，然后检测是否在所有 replica 上都已经生效，只有在所有 replica 上都生效了才算执行完成。