diff --git a/docs/design/2019-11-14-tidb-builtin-diagnostics-zh_CN.md b/docs/design/2019-11-14-tidb-builtin-diagnostics-zh_CN.md new file mode 100644 index 0000000000000..6d612ff9622e9 --- /dev/null +++ b/docs/design/2019-11-14-tidb-builtin-diagnostics-zh_CN.md @@ -0,0 +1,710 @@ +# RFC: TiDB 内置 SQL 诊断 + +## Summary + +目前 TiDB 获取诊断信息主要依赖外部工具(perf/iosnoop/iotop/vmstat/sar/...)、监控系统(Prometheus/Grafana)、日志文件、HTTP API 和 TiDB 提供的系统表。分散的工具链和繁杂的获取方式导致 TiDB 的集群的使用门槛高、运维难度大、不能提前发现问题以及遇到问题不能及时排查、诊断和恢复集群等。 + +本提案提出一种新的方法,在 TiDB 中内置获取诊断信息的功能,并将诊断信息使用系统表的形式对外暴露,使用户可以使用 SQL 的方式进行查询。 + +## Motivation + +本提案主要解决 TiDB 在获取诊断信息过程中的以下问题: + +- 工具链分散,需要在不同工具之间来回切换,且部分 Linux 发行版未内置相应工具或内置工具的版本不一致。 +- 信息获取方式不一致,比如有 SQL、HTTP、导出监控、登录各个节点查看日志等。 +- TiDB 集群组件较多,对不同组件的监控进行对比和关联低效且繁琐。 +- TiDB 没有集中日志管理组件,没有高效的手段对整个集群的日志进行过滤、检索、分析、聚合。 +- 系统表只包含当前节点信息,不能体现整个集群的状态,如:SLOW_QUERY, PROCESSLIST, STATEMENTS_SUMMARY。 + +在通过提供多维度集群级别系统表和集群诊断规则框架之后,提高全集群信息查询、状态获取、日志检索、一键巡检、故障诊断几个使用场景中的效率,并为后续异常预警功能提供基础数据。 + +## Detailed Design + +### 系统整体概览 + +本提案的实现分为四层: + +- L1: 最底层在各个节点实现信息采集模块,包括 TiDB/TiKV/PD 监控信息、硬件信息、内核中记录的网络 IO、磁盘 IO 信息、CPU 使用率、内存使用率等。 +- L2: 第二层通过调用底层信息采集模块并通过对外服务接口(HTTP API/gRPC Service)向上层提供数据,使 TiDB 可以获取当前节点采集到的信息。 +- L3: 第三层由 TiDB 拉取各个节点的信息进行聚合和汇总,并以系统表的形式对上层提供数据。 +- L4: 第四层实现诊断框架,诊断框架通过查询系统表获取整个集群的状态,并根据诊断规则得到诊断结果。 + +如下从信息采集到使用使用诊断规则对采集的信息进行分析的数据流向图: + +``` ++-L1--------------+ +-L3-----+ +| +-------------+ | | | +| | Metrics | | | | +| +-------------+ | | | +| +-------------+ | | | +| | Disk IO | +---L2:gRPC-->+ | +| +-------------+ | | | +| +-------------+ | | TiDB | +| | Network IO | | | | +| +-------------+ | | | +| +-------------+ | | | +| | Hardware | +---L2:HTTP-->+ | +| +-------------+ | | | +| +-------------+ | | | +| | System Info | | | | +| +-------------+ | | | ++-----------------+ +---+----+ + | + +---infoschema---+ + | + v ++-L4---------------+---------------------+ +| | +| Diagnosis Framework | +| | +| +---------+ +---------+ +---------+ | +| | rule1 | | rule2 | | rule3 | | +| +---------+ +---------+ +---------+ | ++----------------------------------------+ +``` + +### 系统信息收集 + +TiDB/TiKV/PD 三个组件都需要实现系统信息采集模块,其中 TiDB/PD 使用 Golang 实现并复用逻辑,TiKV 需要使用 Rust 单独实现。 + +#### 节点硬件信息 + +各个节点需要获取的硬件信息包括: + +- CPU 信息:物理核心数、逻辑核心数量、NUMA 信息、CPU 频率、CPU 供应商、L1/L2/L3 缓存大小 +- 网卡信息:网卡设备名、网卡是否启用、生产厂商、型号、带宽、驱动版本、接口队列数(可选) +- 磁盘信息:磁盘名、磁盘容量、磁盘使用量、磁盘分区、挂载信息 +- USB 设备列表 +- 内存信息 + +#### 节点系统信息 + +各个节点需要获取的系统信息包括: + +- CPU 使用率、1/5/15 分钟负载 +- 内存:Total/Free/Available/Buffers/Cached/Active/Inactive/Swap +- 磁盘 IO: + - tps: 该设备每秒的传输次数 + - rrqm/s: 每秒这个设备相关的读取请求有多少被 Merge + - wrqm/s: 每秒这个设备相关的写入请求有多少被 Merge + - r/s: 每秒从设备读取的数据量 + - w/s: 每秒从设备写入的数据量 + - rsec/s: 每秒读取的扇区数 + - wsec/s: 每秒写取的扇区数 + - avgrq-sz: 平均请求扇区的大小 + - avgqu-sz: 是平均请求队列的长度 + - await: 每一个IO请求的处理的平均时间(单位是微秒毫秒) + - svctm: 表示平均每次设备I/O操作的服务时间(以毫秒为单位) + - %util: 在统计时间内所有处理IO时间,除以总共统计时间 +- 网络 IO + - IFACE:LAN接口 + - rxpck/s:每秒钟接收的数据包 + - txpck/s:每秒钟发送的数据包 + - rxbyt/s:每秒钟接收的字节数 + - txbyt/s:每秒钟发送的字节数 + - rxcmp/s:每秒钟接收的压缩数据包 + - txcmp/s:每秒钟发送的压缩数据包 + - rxmcst/s:每秒钟接收的多播数据包 +- 常用的系统配置:sysctl -a + +#### 节点配置信息 + +所有节点都包含当前节点的生效配置,不需要额外的步骤既可拿到配置信息。 + +#### 节点日志信息 + +TiDB/TiKV/PD 产生的日志都保存在各自的节点上,并且 TiDB 集群部署过程中没有部署额外的日志收集组件,所以在日志检索中有以下问题: + +- 日志分布在各个节点,需要单独登陆到每一个节点使用关键字进行搜索 +- 日志文件会每天 rotate,所以在单个节点也需要对多个日志文件进行搜索 +- 没有简单的方式对多个节点的日志按照时间排序整合到同一个文件 + +本提案提供以下两种思路来解决以上问题: + +- 引入第三方日志收集组件对所有节点的日志进行收集 + - 优势:统一的日志管理,日志可以长时间保存,并易于检索,并且多个组件的日志可以按照时间排序归并 + - 劣势:增加集群运维难度,第三方组件不容易与 TiDB 内部 SQL 集成;日志收集工具会收集全量日志,收集过程占用各个系统资源(磁盘 IO、网络 IO) +- 各个节点提供日志服务,TiDB 通过各个节点的接口将谓词下推到日志检索接口,直接对各个节点返回的日志进行归并 + - 
优势:不引入三方组件,谓词下推后只返回过滤后的日志,能轻易的与 TiDB SQL 进行集成,并能复用 SQL 引擎的过滤、聚合等 + - 劣势:如果节点日志删除后,不能检索到对应日志 + +根据以上的优劣势分析,本提案使用第二种方案,即各个节点提供日志搜索接口,TiDB 将日志搜索的 SQL 中谓词下推到各个节点,日志搜索接口的语义为:搜索本地日志文件,并使用谓词进行过滤,匹配的结果返回。 + +- `start_time`: 日志检索的开始时间(unix 时间戳,单位毫秒),如果没有该谓词,则默认为 0。 +- `end_time`: 日志检索的开始时间(unix 时间戳,单位毫秒),如果没有该谓词,则默认为 `int64::MAX`。 +- `pattern`: 如 SELECT * FROM tidb_cluster_log WHERE pattern LIKE "%gc%" 中的 %gc% 即为过滤的关键字 +- `level`: 日志等级,可以选为 DEBUG/INFO/WARN/WARNING/TRACE/CRITICAL/ERROR +- `limit`: 返回日志的条数,如果没有指定,则限制为 64k 条,防止日质量太大占用大量网络 + +#### 节点性能采样数据 + +当前 TiDB 集群中,发现有性能瓶颈时,需要快速定位问题。火焰图 (Flame Graph)是由 Brendan Gregg 发明的,与其他的 trace 和 profiling 方法不同的是,Flame Graph 以一个全局的视野来看待时间分布,它从底部往顶部,列出所有可能的调用栈。其他的呈现方法,一般只能列出单一的调用栈或者非层次化的时间分布。 + +目前 TiKV 和 TiDB 获取火焰图的方式不同,并且都需要依赖外部工具。 + +- TiKV 获取火焰图 + + ``` + perf record -F 99 -p proc_pid -g -- sleep 60 + perf script > out.perf + /opt/FlameGraph/stackcollapse-perf.pl out.perf > out.folded + /opt/FlameGraph/flamegraph.pl out.folded > cpu.svg + ``` + +- TiDB 获取火焰图 + + ``` + curl http://127.0.0.1:10080/debug/pprof/profile > cpu.pprof + go tool pprof -svg cpu.svn cpu.pprof + ``` + +目前存在的两个主要问题: + +- 生产环境中不一定包含对应的外部工具(perf/flamegraph.pl/go) +- TiKV 和 TiDB 没有统一的方式 + +为了解决以上两个问题,本提案将获取火焰图的方法内置到 TiDB 中,统一使用 SQL 触发采样并将采样数据转换为火焰图作为查询结果显示,一方面降低对外部工具的依赖,同时也极大的提升效率。各个节点实现采样数据采集功能并提供采样接口,对上层输出指定格式的采样数据。暂定输出为 `[pprof](github.com/google/pprof)` 定义的 ProtoBuf 格式。 + +采样数据获取方式: + +- TiDB/PD: 使用 Golang Runtime 内置的采样数据获取接口 +- TiKV: 使用 `[pprof-rs](github.com/tikv/pprof-rs)` 库采集采样数据 + +#### 节点监控信息 + +监控信息主要是各个组件内部定义的监控指标。目前 TiDB/TiKV/PD 都会提供 `/metrics` HTTP API,然后通过部署的 Prometheus 组件定时(默认配置 15s)的拉取集群各个节点的监控指标。并且部署了 Grafana 组件用于从 Prometheus 拉取监控数据,进行可视化展示。 + +监控信息不同于实时获取的系统信息,监控数据是一个时序数据。包含各个节点在各个时间点的数据,对于排查问题和诊断问题有非常重要的用途,所以监控信息的保存和查询对于本提案实现 TiDB 内置 SQL 诊断非常重要。为了能够在 TiDB 内使用 SQL 查询监控数据,目前有以下备选方案: + +- 使用 Prometheus client 和 PromQL 查询 Prometheus server 的数据 + - 优势:有现成解决方案,只需要将 Prometheus server 的地址注册到 TiDB 即可,实现简单 + - 劣势:增强了 TiDB 对 Prometheus 的依赖,为后续完全移除 Prometheus 增加了困难 +- 将最近一段时间内(暂定 1 天)的监控数据保存到 PD,从 PD 中查询监控数据 + - 优势:该方案不依赖 Prometheus server,为后续移除 Prometheus 组件有一定帮助 + - 劣势:需要实现时序保存逻辑,并实现对应的查询引擎,实现难度和工作量大 + +本提案倾向于方案二,虽然实现难度更大,但是对后续的工作有帮助。为了解决实现 PromQL 和时序数据保存的实现难度大和周期长的问题,将这个功能分为三个阶段实现(第三阶段视具体情况是否实现): + +1. PD 中添加 `remote-metrics-storage` 配置,暂时配置为 Prometheus Server 的地址。PD 作为 proxy,将请求转移到 Prometheus 上执行,主要有以下考量: + - 后续 PD 实现查询接口实现自举,TiDB 不需要做其他改动 + - 用户不使用 TiDB 部署的 Prometheus 而使用自建的监控服务,依然可以使用 SQL 查询监控信息以及诊断框架 +2. 将 Prometheus 时序数据保存和查询相应的模块抽离出来,并嵌入到 PD 中 +3. PD 内部实现自己的时序保存与查询(目前 CockroachDB 的方案) + +##### PD 性能分析 + +PD 目前主要承载 TiDB 集群的调度和 TSO 服务,其中: + +1. TSO 获取仅对 Leader 内存中的一个原子变量进行累加 +2. 调度生成的 Operator 和 OperatorStep 仅保存在内存中,根据 Region 的心跳信息更新内存中的状态 + +由以上信息可以得出在 PD 上新增监控功能对 PD 的性能影响在绝大部分情况下可以忽略不计。 + +### 系统信息获取 + +由于 TiDB/TiKV/PD 组件之前已经可以通过 HTTP API 对外暴露部分系统信息,并且 PD 主要通过 HTTP API 对外提供服务,所以本提案的部分接口会复用已有逻辑,使用 HTTP API 从各个组件获取数据,比如配置信息获取。 + +由于 TiKV 后续计划完全移除 HTTP API,所以除了已有接口复用之外,不再额外添加新的 HTTP API,所有日志检索、硬件信息、系统信息获取统一定义 gRPC Service,各个组件实现对应的 Service 并在启动过程中注册到 gRPC Server 中。 + +#### gRPC Service 定义 + +```proto +// Diagnostics service for TiDB cluster components. 
+service Diagnostics { + // Searchs log in the target node + rpc search_log(SearchLogRequest) returns (SearchLogResponse) {}; + // Retrieves server info in the target node + rpc server_info(ServerInfoRequest) returns (ServerInfoResponse) {}; +} + +enum LogLevel { + Debug = 0; + Info = 1; + Warn = 2; + Trace = 3; + Critical = 4; + Error = 5; +} + +message SearchLogRequest { + int64 start_time = 1; + int64 end_time = 2; + LogLevel level = 3; + string pattern = 4; + int64 limit = 5; +} + +message SearchLogResponse { + repeated LogMessage messages = 1; +} + +message LogMessage { + int64 time = 1; + LogLevel level = 2; + string message = 3; +} + +enum ServerInfoType { + All = 0; + HardwareInfo = 1; + SystemInfo = 2; + LoadInfo = 3; +} + +message ServerInfoRequest { + ServerInfoType tp = 1; +} + +message ServerInfoItem { + // cpu, memory, disk, network ... + string tp = 1; + // eg. network: lo1/eth0, cpu: core1/core2, disk: sda1/sda2 + string name = 2; + string key = 3; + string value = 4; +} + +message ServerInfoResponse { + repeated ServerInfoItem items = 1; +} +``` + +#### 可复用的 HTTP API + +目前 TiDB/TiKV/PD 包含部分可复用 HTTP API,本提案暂不将对应接口迁移至 gRPC Service,迁移工作由后续其他计划完成。所有 HTTP API 需要以 JSON 格式返回数据,以下是提案中可能用到的 HTTP API 列表: + +- 获取配置信息 + - PD: /pd/api/v1/config + - TiDB/TiKV: /config +- 性能采样接口: TiDB/PD 包含以下所有接口,TiKV 暂时只包含 CPU 性能采样接口 + - CPU: /debug/pprof/profile + - Memory: /debug/pprof/heap + - Allocs: /debug/pprof/allocs + - Mutex: /debug/pprof/mutex + - Block: /debug/pprof/block + +### 集群信息系统表 + +每个 TiDB 实例均可以通过前两层提供的 HTTP API 或 gRPC Service 访问其他节点的信息,从而实现集群的 Global View。本提案中通过新建一系列相关系统表将采集到的集群信息向上层提供数据,上层包括不限于: + +- 终端用户:用户直接通过 SQL 查询获取集群信息排查问题 +- 运维系统:TiDB 的使用环境比较多样,客户可以通过 SQL 获取集群信息将 TiDB 集成到自己的运维系统中 +- 生态工具:外部工具通过 SQL 拿到集群信息实现功能定制,比如 `[sqltop](https://github.com/ngaut/sqltop)` 可以直接通过集群 `events_statements_summary_by_digest` 获取整个集群的 SQL 采样信息 + +#### 集群拓扑系统表 + +要为 TiDB 实例提供一个 **Global View**,首先需要为 TiDB 实例提供一个拓扑系统表,可以从拓扑系统表中获取各个节点的 HTTP API Address 和 gRPC Service Address,从而方便的构造出各个远程 API 的 Endpoint,进一步获取目标节点采集的信息。 + +本提案实现完成可以通过 SQL 查询以下结果: + +``` +mysql> use information_schema; +Database changed + +mysql> desc TIDB_CLUSTER_INFO; ++----------------+---------------------+------+------+---------+-------+ +| Field | Type | Null | Key | Default | Extra | ++----------------+---------------------+------+------+---------+-------+ +| TYPE | varchar(64) | YES | | NULL | | +| ADDRESS | varchar(64) | YES | | NULL | | +| STATUS_ADDRESS | varchar(64) | YES | | NULL | | +| VERSION | varchar(64) | YES | | NULL | | +| GIT_HASH | varchar(64) | YES | | NULL | | ++----------------+---------------------+------+------+---------+-------+ +5 rows in set (0.00 sec) + +mysql> select TYPE, ADDRESS, STATUS_ADDRESS,VERSION from TIDB_CLUSTER_INFO; ++------+-----------------+-----------------+-----------------------------------------------+ +| TYPE | ADDRESS | STATUS_ADDRESS | VERSION | ++------+-----------------+-----------------+-----------------------------------------------+ +| tidb | 127.0.0.1:4000 | 127.0.0.1:10080 | 5.7.25-TiDB-v4.0.0-alpha-793-g79eef48a3-dirty | +| pd | 127.0.0.1:2379 | 127.0.0.1:2379 | 4.0.0-alpha | +| tikv | 127.0.0.1:20160 | 127.0.0.1:20180 | 4.0.0-alpha | ++------+-----------------+-----------------+-----------------------------------------------+ +3 rows in set (0.00 sec) +``` + +#### 监控信息系统表 + +由于监控指标会随着程序的迭代添加和删除监控指标,对于同一个监控指标,可能有不同的表达式获取监控不同维度的信息。鉴于以上两个需求,需要设计一个有弹性的监控系统表框架,本提案暂时才采取以下方案:将表达式映射为 `metrics_schema` 数据库中的系统表,表达式与系统表的关系可以通过以下方式关联: + +- 定义在配置文件 + + ``` + # tidb.toml + [metrics_schema] + 
qps = `sum(rate(tidb_server_query_total[$STEP])) by (result)` + memory_usage = `process_resident_memory_bytes{job="tidb"}` + goroutines = `rate(go_gc_duration_seconds_sum{job="tidb"}[$STEP])` + ``` + +- HTTP API 注入 + + ``` + curl -XPOST http://host:port/metrics_schema?name=distsql_duration&expr=`histogram_quantile(0.999, + sum(rate(tidb_distsql_handle_query_duration_seconds_bucket[$STEP])) by (le, type))` + ``` + +- 特殊 SQL 命令 + + ``` + mysql> admin metrics_schema add parse_duration `histogram_quantile(0.95, sum(rate(tidb_session_parse_duration_seconds_bucket[$STEP])) by (le, sql_type))` + ``` + +- 从文件中加载 + + ``` + mysql> admin metrics_schema load external_metrics.txt + #external_metrics.txt + execution_duration = `histogram_quantile(0.95, sum(rate(tidb_session_execute_duration_seconds_bucket[$STEP])) by (le, sql_type))` + pd_client_cmd_ops = `sum(rate(pd_client_cmd_handle_cmds_duration_seconds_count{type!="tso"}[$STEP])) by (type)` + ``` + +添加以上表之后就可以在 `metrics_schema` 库中查看对应的表: + +``` +mysql> use metrics_schema; +Database changed + +mysql> show tables; ++-------------------------------------+ +| Tables_in_metrics_schema | ++-------------------------------------+ +| qps | +| memory_usage | +| goroutines | +| distsql_duration | +| parse_duration | +| execution_duration | +| pd_client_cmd_ops | ++-------------------------------------+ +7 rows in set (0.00 sec) +``` + +表达式映射到系统表时字段的确定方式主要取决与表达式执行结果的数据。以表达式 `sum(rate(pd_client_cmd_handle_cmds_duration_seconds_count{type!="tso"}[1m]offset 0)) by (type)` 为例,查询的结果为: + + +| Element | Value | +|---------|-------| +| {type="update_gc_safe_point"} | 0 | +| {type="wait"} | 2.910521666666667 | +| {type="get_all_stores"} | 0 | +| {type="get_prev_region"} | 0 | +| {type="get_region"} | 0 | +| {type="get_region_byid"} | 0 | +| {type="scan_regions"} | 0 | +| {type="tso_async_wait"} | 2.910521666666667 | +| {type="get_operator"} | 0 | +| {type="get_store"} | 0 | +| {type="scatter_region"} | 0 | + +映射为表结构以及查询结果为: + +``` +mysql> desc pd_client_cmd_ops; ++------------+-------------+------+-----+-------------------+-------+ +| Field | Type | Null | Key | Default | Extra | ++------------+-------------+------+-----+-------------------+-------+ +| address | varchar(32) | YES | | NULL | | +| type | varchar(32) | YES | | NULL | | +| value | float | YES | | NULL | | +| interval | int | YES | | 60 | | +| start_time | int | YES | | CURRENT_TIMESTAMP | | +| end_time | int | YES | | | | +| end_time | int | YES | | | | +| step | int | YES | | | | ++------------+-------------+------+-----+-------------------+-------+ +3 rows in set (0.02 sec) + +mysql> select address, type, value from pd_client_cmd_ops; ++------------------+----------------------+---------+ +| address | type | value | ++------------------+----------------------+---------+ +| 172.16.5.33:2379 | update_gc_safe_point | 0 | +| 172.16.5.33:2379 | wait | 2.91052 | +| 172.16.5.33:2379 | get_all_stores | 0 | +| 172.16.5.33:2379 | get_prev_region | 0 | +| 172.16.5.33:2379 | get_region | 0 | +| 172.16.5.33:2379 | get_region_byid | 0 | +| 172.16.5.33:2379 | scan_regions | 0 | +| 172.16.5.33:2379 | tso_async_wait | 2.91052 | +| 172.16.5.33:2379 | get_operator | 0 | +| 172.16.5.33:2379 | get_store | 0 | +| 172.16.5.33:2379 | scatter_region | 0 | ++------------------+----------------------+---------+ +11 rows in set (0.00 sec) + +mysql> select address, type, value from pd_client_cmd_ops where start_time='2019-11-14 10:00:00' and end_time='2019-11-14 10:05:00'; ++------------------+----------------------+---------+ +| 
address | type | value | ++------------------+----------------------+---------+ +| 172.16.5.33:2379 | update_gc_safe_point | 0 | +| 172.16.5.33:2379 | wait | 0.82052 | +| 172.16.5.33:2379 | get_all_stores | 0 | +| 172.16.5.33:2379 | get_prev_region | 0 | +| 172.16.5.33:2379 | get_region | 0 | +| 172.16.5.33:2379 | get_region_byid | 0 | +| 172.16.5.33:2379 | scan_regions | 0 | +| 172.16.5.33:2379 | tso_async_wait | 0.82052 | +| 172.16.5.33:2379 | get_operator | 0 | +| 172.16.5.33:2379 | get_store | 0 | +| 172.16.5.33:2379 | scatter_region | 0 | ++------------------+----------------------+---------+ +11 rows in set (0.00 sec) +``` + +对于多个 label 的 PromQL 就会有多个列的数据,可以方便的使用已有的 SQL 执行引擎对数据过滤、聚合得到期望的结果。 + +#### 节点性能剖析系统表 + +通过各个节点的 `/debug/pprof/profile` 拿到对应节点性能采样数据,然后对采样数据进行聚合,最终使用 SQL 查询结果的方式向用户输出性能剖析结果。由于 SQL 查询结果不能以 svg 的格式输出,所以需要解决输出内容展示的问题。 + +火焰图快速定位问题的核心点是: + +- 提供全局视野 +- 展示全部调用路径 +- 层次化展示 + +本提案提出的解决方案聚焦在解决核心问题的点上,而未拘泥于是图形展示形式。最终的方案为:对采样数据进行聚合,并将所有的调用路径使用树形结构逐行进行展示。 + +解决方案是通过以下方式契合三个核心点: + +- 提供全局视野:对每一个聚合结果使用单独的一列展示在全局的使用比例,可以方便过滤排序 +- 展示全部调用路径:将所有的调用路径都作为查询结果,并使用单独的列对各个调用路径的子树进行编号,可以方便的通过过滤只查看某一个子树 +- 层次化展示:使用树形结构展示堆栈,使用单独的列记录栈的深度,可以方便的对不同栈的深度进行过滤 + +本提案需要实现以下性能剖析表: + + +| 表名 | 描述 | +|------|-----| +| tidb_profile_cpu | TiDB CPU 火焰图 | +| tikv_profile_cpu | TiKV CPU 火焰图 | +| tidb_profile_block | TiDB 阻塞情况火焰图 | +| tidb_profile_memory | TiDB 内存对象火焰图 | +| tidb_profile_allocs | 内存分配火焰图 | +| tidb_profile_mutex | 锁的争用情况火焰图 | +| tidb_profile_goroutines | 系统中已有的 goroutines,排查 goroutine 泄漏、阻塞 | + +#### 内存表全局化 + +目前 `slow_query`/`events_statements_summary_by_digest`/`processlist` 只包含单节点数据,本提案通过添加以下三张集群级别系统表使任何一个 TiDB 实例可以查看整个集群的信息: + +| 表名 | 描述 | +|------|-----| +| tidb_cluster_slow_query | 所有 TiDB 节点的 slow_query 表数据 | +| tidb_cluster_statements_summary | 所有 TiDB 节点的 statements summary 表数据 | +| tidb_cluster_processlist | 所有 TiDB 节点的 processlist 表数据 | + +#### 所有节点的配置信息 + +对于一个大集群,通过 HTTP API 去每一个节点获取配置的方式较为繁琐和低效,本提案提供全集群配置信息系统表,简化整个集群配置信息的获取、过滤、聚合。 + +如下示例是实现本提案后的预期结果: + +``` +mysql> use information_schema; +Database changed + +mysql> select * from tidb_cluster_config where `key` like 'log%'; ++------+-----------------+-----------------------------+---------------+ +| TYPE | ADDRESS | KEY | VALUE | ++------+-----------------+-----------------------------+---------------+ +| pd | 127.0.0.1:2379 | log-file | | +| pd | 127.0.0.1:2379 | log-level | | +| pd | 127.0.0.1:2379 | log.development | false | +| pd | 127.0.0.1:2379 | log.disable-caller | false | +| pd | 127.0.0.1:2379 | log.disable-error-verbose | true | +| pd | 127.0.0.1:2379 | log.disable-stacktrace | false | +| pd | 127.0.0.1:2379 | log.disable-timestamp | false | +| pd | 127.0.0.1:2379 | log.file.filename | | +| pd | 127.0.0.1:2379 | log.file.log-rotate | true | +| pd | 127.0.0.1:2379 | log.file.max-backups | 0 | +| pd | 127.0.0.1:2379 | log.file.max-days | 0 | +| pd | 127.0.0.1:2379 | log.file.max-size | 0 | +| pd | 127.0.0.1:2379 | log.format | text | +| pd | 127.0.0.1:2379 | log.level | | +| pd | 127.0.0.1:2379 | log.sampling | | +| tidb | 127.0.0.1:4000 | log.disable-error-stack | | +| tidb | 127.0.0.1:4000 | log.disable-timestamp | | +| tidb | 127.0.0.1:4000 | log.enable-error-stack | | +| tidb | 127.0.0.1:4000 | log.enable-timestamp | | +| tidb | 127.0.0.1:4000 | log.expensive-threshold | 10000 | +| tidb | 127.0.0.1:4000 | log.file.filename | | +| tidb | 127.0.0.1:4000 | log.file.max-backups | 0 | +| tidb | 127.0.0.1:4000 | log.file.max-days | 0 | +| tidb | 127.0.0.1:4000 | log.file.max-size | 300 | +| tidb | 127.0.0.1:4000 | 
log.format | text | +| tidb | 127.0.0.1:4000 | log.level | info | +| tidb | 127.0.0.1:4000 | log.query-log-max-len | 4096 | +| tidb | 127.0.0.1:4000 | log.record-plan-in-slow-log | 1 | +| tidb | 127.0.0.1:4000 | log.slow-query-file | tidb-slow.log | +| tidb | 127.0.0.1:4000 | log.slow-threshold | 300 | +| tikv | 127.0.0.1:20160 | log-file | | +| tikv | 127.0.0.1:20160 | log-level | info | +| tikv | 127.0.0.1:20160 | log-rotation-timespan | 1d | ++------+-----------------+-----------------------------+---------------+ +33 rows in set (0.00 sec) + +mysql> select * from tidb_cluster_config where type='tikv' and `key` like 'raftdb.wal%'; ++------+-----------------+---------------------------+--------+ +| TYPE | ADDRESS | KEY | VALUE | ++------+-----------------+---------------------------+--------+ +| tikv | 127.0.0.1:20160 | raftdb.wal-bytes-per-sync | 512KiB | +| tikv | 127.0.0.1:20160 | raftdb.wal-dir | | +| tikv | 127.0.0.1:20160 | raftdb.wal-recovery-mode | 2 | +| tikv | 127.0.0.1:20160 | raftdb.wal-size-limit | 0KiB | +| tikv | 127.0.0.1:20160 | raftdb.wal-ttl-seconds | 0 | ++------+-----------------+---------------------------+--------+ +5 rows in set (0.01 sec) +``` + +#### 节点硬件/系统/负载信息系统表 + +根据 `gRPC Service` 的协议定义,每一个 `ServerInfoItem` 包含信息的名字以及对应的键值对,在向用户展示时,需要添加节点的类型以及节点地址。 + +``` +mysql> use information_schema; +Database changed + +mysql> select * from tidb_cluster_hardware ++------+-----------------+----------+----------+-------------+--------+ +| TYPE | ADDRESS | HW_TYPE | HW_NAME | KEY | VALUE | ++------+-----------------+----------+----------+-------------+--------+ +| tikv | 127.0.0.1:20160 | cpu | cpu-1 | frequency | 3.3GHz | +| tikv | 127.0.0.1:20160 | cpu | cpu-2 | frequency | 3.6GHz | +| tikv | 127.0.0.1:20160 | cpu | cpu-1 | core | 40 | +| tikv | 127.0.0.1:20160 | cpu | cpu-2 | core | 48 | +| tikv | 127.0.0.1:20160 | cpu | cpu-1 | vcore | 80 | +| tikv | 127.0.0.1:20160 | cpu | cpu-2 | vcore | 96 | +| tikv | 127.0.0.1:20160 | network | memory | capacity | 256GB | +| tikv | 127.0.0.1:20160 | network | lo0 | bandwidth | 10000M | +| tikv | 127.0.0.1:20160 | network | eth0 | bandwidth | 1000M | +| tikv | 127.0.0.1:20160 | disk | /dev/sda | capacity | 4096GB | ++------+-----------------+----------+----------+-------------+--------+ +10 rows in set (0.01 sec) + +mysql> select * from tidb_cluster_systeminfo ++------+-----------------+----------+--------------+--------+ +| TYPE | ADDRESS | MODULE | KEY | VALUE | ++------+-----------------+----------+--------------+--------+ +| tikv | 127.0.0.1:20160 | sysctl | ktrace.state | 0 | +| tikv | 127.0.0.1:20160 | sysctl | hw.byteorder | 1234 | +| ... | ++------+-----------------+----------+--------------+--------+ +20 rows in set (0.01 sec) + +mysql> select * from tidb_cluster_load ++------+-----------------+----------+-------------+--------+ +| TYPE | ADDRESS | MODULE | KEY | VALUE | ++------+-----------------+----------+-------------+--------+ +| tikv | 127.0.0.1:20160 | network | rsec/s | 1000Kb | +| ... 
| ++------+-----------------+----------+-------------+--------+ +100 rows in set (0.01 sec) +``` + +#### 全链路日志系统表 + +当前日志搜索需要登陆多台机器分别进行检索,并且没有简单的办法对多个机器的检索结果按照时间全排序。本提案新建一个 `tidb_cluster_log` 系统表用于提供全链路日志,简化通过日志排查问题的方式以及提高效率。实现方式为:通过 gRPC Diagnosis Service 的 `search_log` 接口,将日志过滤的谓词下推到各个节点,并最终按照时间进行归并。 + +如下示例是实现本提案后的预期结果: + +``` +mysql> use information_schema; +Database changed + +mysql> desc tidb_cluster_log; ++---------+-------------+------+------+---------+-------+ +| Field | Type | Null | Key | Default | Extra | ++---------+-------------+------+------+---------+-------+ +| type | varchar(16) | YES | | NULL | | +| address | varchar(32) | YES | | NULL | | +| time | varchar(32) | YES | | NULL | | +| level | varchar(8) | YES | | NULL | | +| message | text | YES | | NULL | | ++---------+-------------+------+------+---------+-------+ +5 rows in set (0.00 sec) + +mysql> select * from tidb_cluster_log where content like '%412134239937495042%'; -- 查询 TSO 为 412134239937495042 全链路日志 ++------+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| TYPE | ADDRESS | LEVEL | CONTENT | ++------+------------------------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:501.60574ms txnStartTS:412134239937495042 region_id:180 store_addr:10.9.82.29:20160 kv_process_ms:416 scan_total_write:340807 scan_processed_write:340806 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:698.095216ms txnStartTS:412134239937495042 region_id:88 store_addr:10.9.1.128:20160 kv_process_ms:583 scan_total_write:491123 scan_processed_write:491122 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.529574387s txnStartTS:412134239937495042 region_id:112 store_addr:10.9.1.128:20160 kv_process_ms:945 scan_total_write:831931 scan_processed_write:831930 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.55722114s txnStartTS:412134239937495042 region_id:100 store_addr:10.9.82.29:20160 kv_process_ms:1000 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | 
[coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.608597018s txnStartTS:412134239937495042 region_id:96 store_addr:10.9.137.171:20160 kv_process_ms:1048 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.614233631s txnStartTS:412134239937495042 region_id:92 store_addr:10.9.137.171:20160 kv_process_ms:1000 scan_total_write:831931 scan_processed_write:831930 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.67587146s txnStartTS:412134239937495042 region_id:116 store_addr:10.9.137.171:20160 kv_process_ms:950 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.693188495s txnStartTS:412134239937495042 region_id:108 store_addr:10.9.1.128:20160 kv_process_ms:949 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.693383633s txnStartTS:412134239937495042 region_id:120 store_addr:10.9.1.128:20160 kv_process_ms:951 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.731990066s txnStartTS:412134239937495042 region_id:128 store_addr:10.9.82.29:20160 kv_process_ms:1035 scan_total_write:831931 scan_processed_write:831930 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.744524732s txnStartTS:412134239937495042 region_id:104 store_addr:10.9.137.171:20160 kv_process_ms:1030 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.786915459s txnStartTS:412134239937495042 region_id:132 store_addr:10.9.82.29:20160 kv_process_ms:1014 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.786978732s txnStartTS:412134239937495042 region_id:124 store_addr:10.9.82.29:20160 kv_process_ms:1002 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tikv | 10.9.82.29:20180 | WARN | [tracker.rs:150] [slow-query] [internal_key_skipped_count=831928] [internal_delete_skipped_count=0] [block_cache_hit_count=17] [block_read_count=1810] [block_read_byte=114945337] [scan_first_range="Some(start: 74800000000000002B5F728000000000130A96 end: 74800000000000002B5F728000000000196372)"] [scan_ranges=1] [scan_iter_processed=831928] [scan_iter_ops=831930] [scan_is_desc=false] [tag=select] [table_id=43] [txn_start_ts=412134239937495042] [wait_time=1ms] [total_process_time=1.001s] [peer_id=ipv4:10.9.120.251:47968] [region_id=100] | +| tikv | 10.9.82.29:20180 | WARN | 
[tracker.rs:150] [slow-query] [internal_key_skipped_count=831928] [internal_delete_skipped_count=0] [block_cache_hit_count=19] [block_read_count=1793] [block_read_byte=96014381] [scan_first_range="Some(start: 74800000000000002B5F728000000000393526 end: 74800000000000002B5F7280000000003F97A6)"] [scan_ranges=1] [scan_iter_processed=831928] [scan_iter_ops=831930] [scan_is_desc=false] [tag=select] [table_id=43] [txn_start_ts=412134239937495042] [wait_time=1ms] [total_process_time=1.002s] [peer_id=ipv4:10.9.120.251:47994] [region_id=124] | +| tikv | 10.9.82.29:20180 | WARN | [tracker.rs:150] [slow-query] [internal_key_skipped_count=831928] [internal_delete_skipped_count=0] [block_cache_hit_count=17] [block_read_count=1811] [block_read_byte=96620574] [scan_first_range="Some(start: 74800000000000002B5F72800000000045F083 end: 74800000000000002B5F7280000000004C51E4)"] [scan_ranges=1] [scan_iter_processed=831928] [scan_iter_ops=831930] [scan_is_desc=false] [tag=select] [table_id=43] [txn_start_ts=412134239937495042] [wait_time=1ms] [total_process_time=1.014s] [peer_id=ipv4:10.9.120.251:47998] [region_id=132] | +| tikv | 10.9.137.171:20180 | WARN | [tracker.rs:150] [slow-query] [internal_key_skipped_count=831928] [internal_delete_skipped_count=0] [block_cache_hit_count=17] [block_read_count=1779] [block_read_byte=95095959] [scan_first_range="Some(start: 74800000000000002B5F7280000000004C51E4 end: 74800000000000002B5F72800000000052B456)"] [scan_ranges=1] [scan_iter_processed=831928] [scan_iter_ops=831930] [scan_is_desc=false] [tag=select] [table_id=43] [txn_start_ts=412134239937495042] [wait_time=2ms] [total_process_time=1.025s] [peer_id=ipv4:10.9.120.251:34926] [region_id=136] | +| tikv | 10.9.137.171:20180 | WARN | [tracker.rs:150] [slow-query] [internal_key_skipped_count=831928] [internal_delete_skipped_count=0] [block_cache_hit_count=15] [block_read_count=1793] [block_read_byte=114024055] [scan_first_range="Some(start: 74800000000000002B5F728000000000196372 end: 74800000000000002B5F7280000000001FC628)"] [scan_ranges=1] [scan_iter_processed=831928] [scan_iter_ops=831930] [scan_is_desc=false] [tag=select] [table_id=43] [txn_start_ts=412134239937495042] [wait_time=2ms] [total_process_time=1.03s] [peer_id=ipv4:10.9.120.251:34954] [region_id=104] | +| tikv | 10.9.82.29:20180 | WARN | [tracker.rs:150] [slow-query] [internal_key_skipped_count=831930] [internal_delete_skipped_count=0] [block_cache_hit_count=18] [block_read_count=1796] [block_read_byte=96116255] [scan_first_range="Some(start: 74800000000000002B5F7280000000003F97A6 end: 74800000000000002B5F72800000000045F083)"] [scan_ranges=1] [scan_iter_processed=831930] [scan_iter_ops=831932] [scan_is_desc=false] [tag=select] [table_id=43] [txn_start_ts=412134239937495042] [wait_time=1ms] [total_process_time=1.035s] [peer_id=ipv4:10.9.120.251:47996] [region_id=128] | +| tikv | 10.9.137.171:20180 | WARN | [tracker.rs:150] [slow-query] [internal_key_skipped_count=831928] [internal_delete_skipped_count=0] [block_cache_hit_count=15] [block_read_count=1792] [block_read_byte=113958562] [scan_first_range="Some(start: 74800000000000002B5F7280000000000CB1BA end: 74800000000000002B5F728000000000130A96)"] [scan_ranges=1] [scan_iter_processed=831928] [scan_iter_ops=831930] [scan_is_desc=false] [tag=select] [table_id=43] [txn_start_ts=412134239937495042] [wait_time=1ms] [total_process_time=1.048s] [peer_id=ipv4:10.9.120.251:34924] [region_id=96] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.841528722s 
txnStartTS:412134239937495042 region_id:140 store_addr:10.9.137.171:20160 kv_process_ms:991 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.410650751s txnStartTS:412134239937495042 region_id:144 store_addr:10.9.82.29:20160 kv_process_ms:1000 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.930478221s txnStartTS:412134239937495042 region_id:136 store_addr:10.9.137.171:20160 kv_process_ms:1025 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.26929792s txnStartTS:412134239937495042 region_id:148 store_addr:10.9.82.29:20160 kv_process_ms:901 scan_total_write:831931 scan_processed_write:831930 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.116672983s txnStartTS:412134239937495042 region_id:152 store_addr:10.9.82.29:20160 kv_process_ms:828 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.642668083s txnStartTS:412134239937495042 region_id:156 store_addr:10.9.1.128:20160 kv_process_ms:888 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.537375971s txnStartTS:412134239937495042 region_id:168 store_addr:10.9.137.171:20160 kv_process_ms:728 scan_total_write:831931 scan_processed_write:831930 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.602765417s txnStartTS:412134239937495042 region_id:164 store_addr:10.9.82.29:20160 kv_process_ms:871 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.583965975s txnStartTS:412134239937495042 region_id:172 store_addr:10.9.1.128:20160 kv_process_ms:933 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.712528952s txnStartTS:412134239937495042 region_id:160 store_addr:10.9.1.128:20160 kv_process_ms:959 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.664343044s txnStartTS:412134239937495042 region_id:220 store_addr:10.9.1.128:20160 kv_process_ms:976 scan_total_write:865647 scan_processed_write:865646 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | 
INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.713342373s txnStartTS:412134239937495042 region_id:176 store_addr:10.9.1.128:20160 kv_process_ms:950 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | ++------+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +31 rows in set (0.01 sec) + +mysql> select * from tidb_cluster_log where type='pd' and content like '%scheduler%'; -- 查询 PD 的调度日志 + +mysql> select * from tidb_cluster_log where type='tidb' and content like '%ddl%'; -- 查询 TiDB 的 DDL 日志 +``` + +### 集群诊断 + +在当前的集群拓扑下,各个组件分散,数据源和数据格式异构,不便于通过程序化的手段进行集群诊断,所以需要人工进行问题诊断。通过前面几层提供的数据系统表,每一个 TiDB 节点都有了一个稳定的全集群 Global View,所以可以在这个基础上实现一个问题诊断框架。通过定义诊断规则能够快速发现集群的已有问题和潜在问题。 + +**诊断规则定义**:诊断规则是通过读入各个系统表的数据,并通过检测异常数据发现问题的逻辑。 + +诊断规则可以分为三个层次: + +- 发现潜在问题:比如通过判断磁盘容量和磁盘使用量的比例发现磁盘容量不足 +- 发现已有问题:比如通过查看负载情况,发现 Coprocessor 的线程池已经跑满 +- 给出修复建议:比如通过分析磁盘 IO 发现延迟过高,可以给出更换磁盘的建议 + +本提案主要负责实现诊断框架和部分诊断规则,更多的诊断规则需要根据使用经验逐步沉淀,最终形成一个专家系统,降低使用门槛和运维难度。后续内容不详细探讨具体某条的诊断规则,主要聚焦诊断框架的实现。 + +#### 诊断框架设计 + +诊断框架的设计需要考虑多种用户使用场景,包括不限于: + +- 用户选择固定版本后,不会轻易升级 TiDB 集群版本 +- 用户自定义诊断规则 +- 不重启集群加载新的诊断规则 +- 诊断框架需要能方便的与已有运维系统集成 +- 用户可能会屏蔽部分诊断,比如用户预期是一个异构系统,那么会屏蔽异构诊断规则 +- ... + +需要实现一个支持规则热加载的诊断系统,目前有以下备选方案: + +- Golang Plugin:使用不同的插件来定义诊断规则,并且加载到 TiDB 的进程中 + - 优势:使用 Golang 开发,开发门槛低 + - 劣势:版本管理容易出错,需要和宿主 TiDB 使用同样的版本编译插件 +- 内嵌 Lua:在运行时或启动过程中加载 Lua 脚本,脚本从 TiDB 读取系统表数据,并根据诊断规则判断并反馈结果 + - 优势:Lua 是一个完全依赖宿主的语言,语法简单,容易与宿主集成 + - 劣势:依赖另一个脚本语言 +- Shell Script:Shell 具备流程控制功能,所以可以用 Shell 定义诊断规则 + - 优势:易于编写、加载和执行,对 TiDB 内部无侵入,只需要外部 Shell 执行对应 SQL 即可 + - 劣势:需要在安装 mysql client 的机器上运行 + +本提案暂时采用第三种方案,使用 Shell 编写诊断规则。对 TiDB 没有侵入,同时也为后续实现更好的方案提供扩展性。 diff --git a/docs/design/2019-11-14-tidb-builtin-diagnostics.md b/docs/design/2019-11-14-tidb-builtin-diagnostics.md new file mode 100644 index 0000000000000..98a2e7757858e --- /dev/null +++ b/docs/design/2019-11-14-tidb-builtin-diagnostics.md @@ -0,0 +1,703 @@ +# RFC: TiDB Built-in SQL Diagnostics + +## Summary + +Currently, TiDB obtains diagnostic information mainly relying on external tools (perf/iosnoop/iotop/iostat/vmstat/sar/...), monitoring systems (Prometheus/Grafana), log files, HTTP APIs, and system tables provided by TiDB. The decentralized toolchains and cumbersome acquisition methods lead to high barriers to the use of TiDB clusters, difficulty in operation and maintenance, failure to detect problems in advance, and failure to timely investigate,diagnose, and recover clusters. + +This proposal proposes a new method of acquiring diagnostic information in TiDB for exposure in system tables so that users can query diagnostic information using SQL statements. + +## Motivation + +This proposal mainly solves the following problems in TiDB's process of obtaining diagnostic information: + +- The toolchains are scattered, so TiDB needs to switch back and forth between different tools. 
In addition, some Linux distributions don't have the corresponding built-in tools, and even when they do, the tool versions may not be as expected.
- The information acquisition methods are inconsistent, for example, SQL, HTTP, exported monitoring, and viewing logs by logging in to each node.
- There are many TiDB cluster components, and comparing and correlating the monitoring information of different components is inefficient and cumbersome.
- TiDB does not have a centralized log management component, so there is no efficient way to filter, retrieve, analyze, and aggregate the logs of the entire cluster.
- The system tables only contain the information of the current node and do not reflect the state of the entire cluster, for example: SLOW_QUERY, PROCESSLIST, STATEMENTS_SUMMARY.

The efficiency of cluster-scale information query, state acquisition, log retrieval, one-click inspection, and fault diagnosis can be improved after the multi-dimensional cluster-level system tables and the cluster diagnostic rule framework are provided. They also supply the basic data for a subsequent anomaly alerting feature.

## Detailed design

### System overview

The implementation of this proposal is divided into four layers:

- L1: The lowest layer implements the information collection module on each node, including TiDB/TiKV/PD monitoring information, hardware information, network IO and disk IO recorded in the kernel, CPU usage, memory usage, etc.
- L2: The second layer calls the underlying information collection module and provides the data collected on the current node to the upper layer through external service interfaces (HTTP API/gRPC Service).
- L3: In the third layer, TiDB pulls the information from each node for aggregation and summarization, and provides the data to the upper layer in the form of system tables.
- L4: The fourth layer implements the diagnostic framework, which obtains the status of the entire cluster by querying the system tables and produces diagnostic results according to the diagnostic rules.

The following chart shows the data flow from information collection to analysis using the diagnostic rules:

```
+-L1--------------+             +-L3-----+
| +-------------+ |             |        |
| |   Metrics   | |             |        |
| +-------------+ |             |        |
| +-------------+ |             |        |
| |   Disk IO   | +---L2:gRPC-->+        |
| +-------------+ |             |        |
| +-------------+ |             |  TiDB  |
| | Network IO  | |             |        |
| +-------------+ |             |        |
| +-------------+ |             |        |
| |  Hardware   | +---L2:HTTP-->+        |
| +-------------+ |             |        |
| +-------------+ |             |        |
| | System Info | |             |        |
| +-------------+ |             |        |
+-----------------+             +---+----+
                                    |
                            +---infoschema---+
                                    |
                                    v
+-L4---------------+---------------------+
|                                        |
|          Diagnosis Framework           |
|                                        |
| +---------+  +---------+  +---------+  |
| |  rule1  |  |  rule2  |  |  rule3  |  |
| +---------+  +---------+  +---------+  |
+----------------------------------------+
```

### System information collection

The system information collection module needs to be implemented for all of the TiDB/TiKV/PD components. TiDB and PD are implemented in Golang and can share the collection logic; TiKV needs a separate implementation in Rust.
#### Node hardware information

The hardware information that each node needs to obtain includes:

- CPU information: physical core number, logical core number, NUMA information, CPU frequency, CPU vendor, L1/L2/L3 cache size
- NIC information: NIC device name, NIC enabled status, manufacturer, model, bandwidth, driver version, number of interface queues (optional)
- Disk information: disk name, disk capacity, disk usage, disk partition, mount information
- USB device list
- Memory information

#### Node system information

The system information that each node needs to obtain includes:

- CPU usage, loads in the last 1/5/15 minutes
- Memory: Total/Free/Available/Buffers/Cached/Active/Inactive/Swap
- Disk IO:
    - tps: number of transfers per second that were issued to the device
    - rrqm/s: number of read requests merged per second that were queued to the device
    - wrqm/s: number of write requests merged per second that were queued to the device
    - r/s: number (after merges) of read requests completed per second for the device
    - w/s: number (after merges) of write requests completed per second for the device
    - rsec/s: number of sectors read from the device per second
    - wsec/s: number of sectors written to the device per second
    - avgrq-sz: average size (in sectors) of the requests that were issued to the device
    - avgqu-sz: average queue length of the requests that were issued to the device
    - await: average time (in milliseconds) for I/O requests issued to the device to be served
    - svctm: average service time (in milliseconds) for I/O requests that were issued to the device
    - %util: percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device)
- Network IO:
    - IFACE: name of the network interface for which statistics are reported
    - rxpck/s: total number of packets received per second
    - txpck/s: total number of packets transmitted per second
    - rxkB/s: total number of kilobytes received per second
    - txkB/s: total number of kilobytes transmitted per second
    - rxcmp/s: number of compressed packets received per second
    - txcmp/s: number of compressed packets transmitted per second
    - rxmcst/s: number of multicast packets received per second
- System configuration: `sysctl -a`
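Most of the hardware and system information listed above can be collected in pure Go for TiDB/PD. The following is a minimal sketch of the collection module, assuming the `gopsutil` library (a common choice for this purpose in Go programs); the `infoItem` struct mirrors the `ServerInfoItem` message of the gRPC service defined later in this document:

```go
package diagnostics

import (
	"fmt"

	"github.com/shirou/gopsutil/cpu"
	"github.com/shirou/gopsutil/disk"
	"github.com/shirou/gopsutil/load"
	"github.com/shirou/gopsutil/mem"
)

// infoItem mirrors the ServerInfoItem message in the gRPC service definition.
type infoItem struct {
	tp, name, key, value string
}

// collectServerInfo gathers a subset of the hardware/system information
// listed above and flattens it into key/value items.
func collectServerInfo() ([]infoItem, error) {
	var items []infoItem

	// CPU: vendor and frequency per logical CPU.
	cpus, err := cpu.Info()
	if err != nil {
		return nil, err
	}
	for _, c := range cpus {
		name := fmt.Sprintf("cpu-%d", c.CPU)
		items = append(items,
			infoItem{"cpu", name, "vendor", c.VendorID},
			infoItem{"cpu", name, "frequency", fmt.Sprintf("%.0fMHz", c.Mhz)},
		)
	}

	// Memory: total and available bytes.
	vm, err := mem.VirtualMemory()
	if err != nil {
		return nil, err
	}
	items = append(items,
		infoItem{"memory", "virtual", "total", fmt.Sprintf("%d", vm.Total)},
		infoItem{"memory", "virtual", "available", fmt.Sprintf("%d", vm.Available)},
	)

	// Load: 1-minute load average (5/15 minutes are analogous).
	avg, err := load.Avg()
	if err != nil {
		return nil, err
	}
	items = append(items, infoItem{"load", "system", "load1", fmt.Sprintf("%.2f", avg.Load1)})

	// Disk: capacity and usage per mounted partition.
	parts, err := disk.Partitions(false)
	if err != nil {
		return nil, err
	}
	for _, p := range parts {
		usage, err := disk.Usage(p.Mountpoint)
		if err != nil {
			continue // skip unreadable mount points
		}
		items = append(items,
			infoItem{"disk", p.Device, "capacity", fmt.Sprintf("%d", usage.Total)},
			infoItem{"disk", p.Device, "used", fmt.Sprintf("%d", usage.Used)},
		)
	}
	return items, nil
}
```

TiKV would implement the equivalent logic in Rust behind the same gRPC interface.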
#### Node configuration information

All nodes contain the effective configuration of the current node, and no additional steps are required to get the configuration information.

#### Node log information

Currently, the logs generated by TiDB/TiKV/PD are saved on their respective nodes, and no additional log collection component is deployed during TiDB cluster deployment, so log retrieval has the following problems:

- Logs are distributed on the nodes, so you need to log in to each node and search it using keywords.
- Log files are rotated every day, so even on a single node we need to search among multiple log files.
- There is no easy way to combine the logs of multiple nodes into a single file sorted by time.

This proposal provides the following two solutions to the above problems:

- Introduce a third-party log collection component to collect the logs of all nodes.
    - Advantages: with a unified log management mechanism, logs can be saved for a long time and are easy to retrieve, and the logs of multiple components can be sorted by time.
    - Disadvantages: third-party components are not easy to integrate with the TiDB SQL engine and may increase the difficulty of cluster operation and maintenance; the log collection tool collects logs in full, so the collection process takes up system resources (disk IO, network IO).
- Let each node provide a log service. TiDB pushes the predicates down to the log retrieval interface of each node and directly merges the logs returned by the nodes.
    - Advantages: no third-party component is introduced, and only the logs filtered by the pushdown predicates are returned; the implementation can easily be integrated with TiDB SQL and reuse SQL engine functions such as filtering and aggregation.
    - Disadvantages: if log files have been deleted on some nodes, the corresponding logs cannot be retrieved.

This proposal adopts the second solution after weighing the above advantages and disadvantages. That is, each node provides a log search service, and TiDB pushes the predicates in the log search SQL down to each node. The semantics of the log search service is: search the local log files, filter them using the predicates, and return the matched results. A sketch of this per-node filter logic is given after the predicate list below.

The following are the predicates that the log interface needs to process:

- `start_time`: the start time of the log retrieval (Unix timestamp, in milliseconds). If there is no such predicate, the default is 0.
- `end_time`: the end time of the log retrieval (Unix timestamp, in milliseconds). If there is no such predicate, the default is `int64::MAX`.
- `pattern`: the filter pattern determined by the keyword. For example, in `SELECT * FROM tidb_cluster_log WHERE pattern LIKE "%gc%"`, `%gc%` is the filter keyword.
- `level`: the log level; one of DEBUG/INFO/WARN/WARNING/TRACE/CRITICAL/ERROR.
- `limit`: the maximum number of log items to return, which prevents the result from being too large and occupying too much network bandwidth. If not specified, the default limit is 64k items.
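The following is a minimal sketch of the per-node log search semantics described above. The unified log line format and the `parseLogLine` helper are assumptions for illustration, and the SQL LIKE pattern is reduced to a substring match for brevity:

```go
package diagnostics

import (
	"bufio"
	"os"
	"strings"
)

// LogItem is one parsed log line; Time is a Unix timestamp in milliseconds.
type LogItem struct {
	Time    int64
	Level   string
	Message string
}

// SearchLogRequest mirrors the gRPC request defined later in this document.
// The caller fills Limit with the 64k default when it is unspecified.
type SearchLogRequest struct {
	StartTime, EndTime int64
	Level              string
	Pattern            string
	Limit              int
}

// searchLogFile scans one local log file and returns the items matching the
// pushdown predicates. parseLogLine is a hypothetical helper that extracts
// the timestamp and level from the unified log header.
func searchLogFile(path string, req SearchLogRequest, parseLogLine func(string) (LogItem, bool)) ([]LogItem, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var matched []LogItem
	scanner := bufio.NewScanner(f)
	for scanner.Scan() && len(matched) < req.Limit {
		item, ok := parseLogLine(scanner.Text())
		if !ok {
			continue // skip lines without a log header (e.g. stack traces)
		}
		// Apply the pushdown predicates: time range, level, keyword pattern.
		if item.Time < req.StartTime || item.Time > req.EndTime {
			continue
		}
		if req.Level != "" && item.Level != req.Level {
			continue
		}
		// A LIKE pattern such as %gc% is reduced to a substring match here.
		if req.Pattern != "" && !strings.Contains(item.Message, strings.Trim(req.Pattern, "%")) {
			continue
		}
		matched = append(matched, item)
	}
	return matched, scanner.Err()
}
```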
#### Node performance sampling data

In a TiDB cluster, when a performance bottleneck occurs, we usually need a way to quickly locate the problem. The Flame Graph was invented by Brendan Gregg. Unlike other trace and profiling methods, a flame graph looks at the time distribution in a global view, listing all possible call stacks from bottom to top. Other rendering methods generally only list a single call stack or a non-hierarchical time distribution.

TiKV and TiDB currently have different ways of obtaining a flame graph, and all of them rely on external tools.

- TiKV retrieves the flame graph via:

    ```
    perf record -F 99 -p proc_pid -g -- sleep 60
    perf script > out.perf
    /opt/FlameGraph/stackcollapse-perf.pl out.perf > out.folded
    /opt/FlameGraph/flamegraph.pl out.folded > cpu.svg
    ```

- TiDB retrieves the flame graph via:

    ```
    curl http://127.0.0.1:10080/debug/pprof/profile > cpu.pprof
    go tool pprof -svg cpu.pprof > cpu.svg
    ```

There are two main problems currently:

- The production environment does not necessarily contain the corresponding external tools (perf/flamegraph.pl/go).
- There is no unified way for TiKV and TiDB.

In order to solve the above two problems, this proposal builds flame graph generation into TiDB, so that both TiDB and TiKV can use SQL to trigger sampling and have the sampled data converted into a flame-graph-like query result. This reduces the dependency on external tools and at the same time greatly improves efficiency. Each node implements the sampling data acquisition function and provides a sampling interface, through which sampling data in a specified format is output to the upper layer. The tentative output is the ProtoBuf format defined by [pprof](https://github.com/google/pprof).

Sampling data acquisition methods:

- TiDB/PD: use the sampling data acquisition interface built into the Golang runtime
- TiKV: collect sampling data using the [pprof-rs](https://github.com/tikv/pprof-rs) library

#### Node monitoring information

Monitoring information mainly includes the monitoring metrics defined internally by each component. At present, TiDB/TiKV/PD all provide the `/metrics` HTTP API, through which the deployed Prometheus component periodically (15s interval by default) pulls the monitoring metrics of each node of the cluster, and a Grafana component is deployed to pull the monitoring data from Prometheus for visualization.

Monitoring information is different from the system information acquired in real time: monitoring data is time-series data which contains the data of each node at each point in time. It is very useful for troubleshooting and diagnosing problems, so how monitoring information is saved and queried is very important for this proposal. In order to be able to use SQL to query monitoring data in TiDB, there are currently the following options:

- Use the Prometheus client and PromQL to query the data from the Prometheus server.
    - Advantages: there is a ready-made solution; just register the address of the Prometheus server to TiDB, which is simple to implement.
    - Disadvantages: TiDB relies more heavily on Prometheus, which increases the difficulty of completely removing Prometheus later.
- Save the monitoring data of the most recent period (tentatively 1 day) to PD and query the monitoring data from PD.
    - Advantages: this solution does not depend on the Prometheus server, which helps with the subsequent removal of Prometheus.
    - Disadvantages: requires implementing the time-series storage logic and the corresponding query engine in PD; the workload and difficulty are high.

This proposal opts for the second solution. Although it is more difficult to implement, it benefits the follow-up work. To cope with the implementation difficulty and the long development cycle, the function is divided into three stages (whether the third stage is implemented depends on the specific situation):

1. Add a `remote-metrics-storage` configuration to PD and temporarily configure it as the address of the Prometheus server. PD acts as a proxy and forwards the requests to Prometheus for execution (a proxy sketch is given at the end of this section). The main considerations are as follows:

    - PD will later implement the query interface itself and become self-contained, and TiDB will not need any further changes.
    - Users who use a self-built monitoring service instead of the Prometheus deployed with TiDB can still use SQL to query monitoring information and the diagnostic framework.

2. Extract the modules that Prometheus uses for persisting and querying time-series data and embed them into PD.
3. PD internally implements its own time-series storage and query module (this is currently CockroachDB's solution).

##### PD performance analysis

PD mainly handles the scheduling and TSO services for TiDB clusters, where:

1. Fetching a TSO only increments an atomic variable in the Leader's memory.
2. The Operator and OperatorStep generated by scheduling are only stored in memory, and the in-memory state is updated according to the heartbeat information of the Regions.

From the above information, it can be concluded that in most cases the performance impact of the new monitoring function on PD is negligible.
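As a sketch of the stage-1 proxy, the snippet below forwards a PromQL range query to the address configured in `remote-metrics-storage`. It assumes the `prometheus/client_golang` HTTP API client; the real PD would expose the result through its own service interface:

```go
package metrics

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// queryRemoteMetrics forwards a PromQL range query to the address configured
// in `remote-metrics-storage` (a Prometheus server in stage 1). In later
// stages the same interface would be served by PD's own storage engine.
func queryRemoteMetrics(ctx context.Context, addr, promQL string, start, end time.Time, step time.Duration) (model.Value, error) {
	client, err := api.NewClient(api.Config{Address: addr})
	if err != nil {
		return nil, err
	}
	papi := promv1.NewAPI(client)
	// Warnings returned by the server are ignored in this sketch.
	value, _, err := papi.QueryRange(ctx, promQL, promv1.Range{
		Start: start,
		End:   end,
		Step:  step,
	})
	return value, err
}
```

The matrix result returned by `QueryRange` can then be flattened into `(address, label..., value, time)` rows for the monitoring system tables described later.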
### Retrieve system information

Since the TiDB/TiKV/PD components already expose some system information through the HTTP API, and PD mainly provides external services through the HTTP API, some interfaces of this proposal reuse the existing logic and use the HTTP API to obtain data from the components, for example, the configuration information.

However, because TiKV plans to completely remove its HTTP API in the future, only the existing interfaces are reused for TiKV, and no new HTTP APIs are added. Instead, a unified gRPC service is defined for log retrieval, hardware information, and system information acquisition. Each component implements the corresponding service and registers it to the gRPC server during startup.

#### gRPC service definition

```proto
// Diagnostics service for TiDB cluster components.
service Diagnostics {
    // Searches logs in the target node
    rpc search_log(SearchLogRequest) returns (SearchLogResponse) {};
    // Retrieves server info in the target node
    rpc server_info(ServerInfoRequest) returns (ServerInfoResponse) {};
}

enum LogLevel {
    Debug = 0;
    Info = 1;
    Warn = 2;
    Trace = 3;
    Critical = 4;
    Error = 5;
}

message SearchLogRequest {
    int64 start_time = 1;
    int64 end_time = 2;
    LogLevel level = 3;
    string pattern = 4;
    int64 limit = 5;
}

message SearchLogResponse {
    repeated LogMessage messages = 1;
}

message LogMessage {
    int64 time = 1;
    LogLevel level = 2;
    string message = 3;
}

enum ServerInfoType {
    All = 0;
    HardwareInfo = 1;
    SystemInfo = 2;
    LoadInfo = 3;
}

message ServerInfoRequest {
    ServerInfoType tp = 1;
}

message ServerInfoItem {
    // cpu, memory, disk, network ...
    string tp = 1;
    // eg. network: lo1/eth0, cpu: core1/core2, disk: sda1/sda2
    string name = 2;
    string key = 3;
    string value = 4;
}

message ServerInfoResponse {
    repeated ServerInfoItem items = 1;
}
```
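For illustration, once the service above is compiled by protoc, a TiDB node can query a peer's Diagnostics service as sketched below. The `diagnosticspb` import path and the stub names are assumptions based on standard protoc-gen-go output:

```go
package diagnostics

import (
	"context"
	"time"

	"google.golang.org/grpc"

	// Hypothetical package generated by protoc from the service above.
	pb "example.com/diagnostics/diagnosticspb"
)

// searchRemoteLog asks one node's Diagnostics service for filtered logs.
func searchRemoteLog(ctx context.Context, addr string, req *pb.SearchLogRequest) (*pb.SearchLogResponse, error) {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	// grpc.WithInsecure is used for brevity; production code would secure the link.
	conn, err := grpc.DialContext(ctx, addr, grpc.WithInsecure())
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	client := pb.NewDiagnosticsClient(conn)
	// The predicates (time range, level, pattern, limit) are evaluated remotely,
	// so only matching log entries cross the network.
	return client.SearchLog(ctx, req)
}
```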
#### Reusable HTTP API

Currently, TiDB/TiKV/PD include some reusable HTTP APIs. This proposal does not migrate the corresponding interfaces to the gRPC service; the migration will be completed by subsequent plans. All HTTP APIs need to return data in JSON format. The following is a list of the HTTP APIs that may be used in this proposal:

- Retrieve configuration information
    - PD: /pd/api/v1/config
    - TiDB/TiKV: /config
- Performance sampling interfaces: TiDB and PD contain all of the following interfaces, while TiKV temporarily only contains the CPU performance sampling interface
    - CPU: /debug/pprof/profile
    - Memory: /debug/pprof/heap
    - Allocs: /debug/pprof/allocs
    - Mutex: /debug/pprof/mutex
    - Block: /debug/pprof/block

#### Cluster information system tables

Each TiDB instance can access the information of other nodes through the HTTP API or gRPC service provided by the first two layers, which gives every TiDB instance a Global View of the cluster. In this proposal, the collected cluster information is provided to the upper layer by creating a series of related system tables. The upper layer includes but is not limited to:

- End users: users can obtain the cluster information directly through SQL queries to troubleshoot problems.
- Operation and maintenance systems: the ability to obtain cluster information through SQL makes it easier for users to integrate TiDB into their own operation and maintenance systems.
- Eco-system tools: external tools can obtain the cluster information through SQL to implement customized functions. For example, [sqltop](https://github.com/ngaut/sqltop) can directly obtain the SQL sampling information of the entire cluster through the cluster-level `events_statements_summary_by_digest` table.

#### Cluster topology system table

To realize the **Global View** for a TiDB instance, we first need to provide a topology system table from which the HTTP API address and gRPC service address of each node can be obtained. With these addresses, the endpoints of the remote APIs can easily be constructed to acquire the information collected by the target nodes.

After this proposal is implemented, the following results can be queried through SQL:

```
mysql> use information_schema;
Database changed

mysql> desc TIDB_CLUSTER_INFO;
+----------------+---------------------+------+------+---------+-------+
| Field          | Type                | Null | Key  | Default | Extra |
+----------------+---------------------+------+------+---------+-------+
| TYPE           | varchar(64)         | YES  |      | NULL    |       |
| ADDRESS        | varchar(64)         | YES  |      | NULL    |       |
| STATUS_ADDRESS | varchar(64)         | YES  |      | NULL    |       |
| VERSION        | varchar(64)         | YES  |      | NULL    |       |
| GIT_HASH       | varchar(64)         | YES  |      | NULL    |       |
+----------------+---------------------+------+------+---------+-------+
5 rows in set (0.00 sec)

mysql> select TYPE, ADDRESS, STATUS_ADDRESS, VERSION from TIDB_CLUSTER_INFO;
+------+-----------------+-----------------+-----------------------------------------------+
| TYPE | ADDRESS         | STATUS_ADDRESS  | VERSION                                       |
+------+-----------------+-----------------+-----------------------------------------------+
| tidb | 127.0.0.1:4000  | 127.0.0.1:10080 | 5.7.25-TiDB-v4.0.0-alpha-793-g79eef48a3-dirty |
| pd   | 127.0.0.1:2379  | 127.0.0.1:2379  | 4.0.0-alpha                                   |
| tikv | 127.0.0.1:20160 | 127.0.0.1:20180 | 4.0.0-alpha                                   |
+------+-----------------+-----------------+-----------------------------------------------+
3 rows in set (0.00 sec)
```
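As a usage illustration, an external tool can derive every node's status endpoint from this table with plain `database/sql` code. This is a sketch under the assumption that the tool connects to TiDB through the MySQL protocol:

```go
package topology

import (
	"database/sql"

	_ "github.com/go-sql-driver/mysql" // MySQL-protocol driver used to reach TiDB
)

// statusEndpoints reads the topology table and returns (type, status address)
// pairs, from which HTTP API endpoints such as http://<status_address>/config
// or gRPC targets can be constructed.
func statusEndpoints(db *sql.DB) ([][2]string, error) {
	rows, err := db.Query(
		"SELECT TYPE, STATUS_ADDRESS FROM information_schema.TIDB_CLUSTER_INFO")
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var endpoints [][2]string
	for rows.Next() {
		var tp, statusAddr string
		if err := rows.Scan(&tp, &statusAddr); err != nil {
			return nil, err
		}
		endpoints = append(endpoints, [2]string{tp, statusAddr})
	}
	return endpoints, rows.Err()
}
```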
The relationship between expressions and system tables can be mapped in the following ways:
+
+- Define in the configuration file
+
+    ```
+    # tidb.toml
+    [metrics_schema]
+    qps = `sum(rate(tidb_server_query_total[$INTERVAL] offset $OFFSET_TIME)) by (result)`
+    memory_usage = `process_resident_memory_bytes{job="tidb"}`
+    goroutines = `rate(go_gc_duration_seconds_sum{job="tidb"}[$INTERVAL] offset $OFFSET_TIME)`
+    ```
+
+- Inject via HTTP API
+
+    ```
+    curl -XPOST http://host:port/metrics_schema?name=distsql_duration&expr=`histogram_quantile(0.999,
+    sum(rate(tidb_distsql_handle_query_duration_seconds_bucket[$INTERVAL] offset $OFFSET_TIME)) by (le, type))`
+    ```
+
+- Use special SQL commands
+
+    ```
+    mysql> admin metrics_schema add parse_duration `histogram_quantile(0.95, sum(rate(tidb_session_parse_duration_seconds_bucket[$INTERVAL] offset $OFFSET_TIME)) by (le, sql_type))`
+    ```
+
+- Load from file
+
+    ```
+    mysql> admin metrics_schema load external_metrics.txt
+    #external_metrics.txt
+    execution_duration = `histogram_quantile(0.95, sum(rate(tidb_session_execute_duration_seconds_bucket[$INTERVAL] offset $OFFSET_TIME)) by (le, sql_type))`
+    pd_client_cmd_ops = `sum(rate(pd_client_cmd_handle_cmds_duration_seconds_count{type!="tso"}[$INTERVAL] offset $OFFSET_TIME)) by (type)`
+    ```
+
+After the above mappings are defined, the following tables can be viewed in the `metrics_schema` database:
+
+```
+mysql> use metrics_schema;
+Database changed
+
+mysql> show tables;
++-------------------------------------+
+| Tables_in_metrics_schema            |
++-------------------------------------+
+| qps                                 |
+| memory_usage                        |
+| goroutines                          |
+| distsql_duration                    |
+| parse_duration                      |
+| execution_duration                  |
+| pd_client_cmd_ops                   |
++-------------------------------------+
+7 rows in set (0.00 sec)
+```
+
+The fields of a mapped system table are determined mainly by the result data of executing the corresponding expression.
Take the expression `sum(rate(pd_client_cmd_handle_cmds_duration_seconds_count{type!="tso"}[1m] offset 0)) by (type)` as an example; the result of the query is:
+
+| Element | Value |
+|---------|-------|
+| {type="update_gc_safe_point"} | 0 |
+| {type="wait"} | 2.910521666666667 |
+| {type="get_all_stores"} | 0 |
+| {type="get_prev_region"} | 0 |
+| {type="get_region"} | 0 |
+| {type="get_region_byid"} | 0 |
+| {type="scan_regions"} | 0 |
+| {type="tso_async_wait"} | 2.910521666666667 |
+| {type="get_operator"} | 0 |
+| {type="get_store"} | 0 |
+| {type="scatter_region"} | 0 |
+
+The query results are mapped to the system table as follows:
+
+```
+mysql> desc pd_client_cmd_ops;
++------------+-------------+------+-----+-------------------+-------+
+| Field      | Type        | Null | Key | Default           | Extra |
++------------+-------------+------+-----+-------------------+-------+
+| address    | varchar(32) | YES  |     | NULL              |       |
+| type       | varchar(32) | YES  |     | NULL              |       |
+| value      | float       | YES  |     | NULL              |       |
+| interval   | int         | YES  |     | 60                |       |
+| start_time | int         | YES  |     | CURRENT_TIMESTAMP |       |
++------------+-------------+------+-----+-------------------+-------+
+5 rows in set (0.02 sec)
+
+mysql> select address, type, value from pd_client_cmd_ops;
++------------------+----------------------+---------+
+| address          | type                 | value   |
++------------------+----------------------+---------+
+| 172.16.5.33:2379 | update_gc_safe_point | 0       |
+| 172.16.5.33:2379 | wait                 | 2.91052 |
+| 172.16.5.33:2379 | get_all_stores       | 0       |
+| 172.16.5.33:2379 | get_prev_region      | 0       |
+| 172.16.5.33:2379 | get_region           | 0       |
+| 172.16.5.33:2379 | get_region_byid      | 0       |
+| 172.16.5.33:2379 | scan_regions         | 0       |
+| 172.16.5.33:2379 | tso_async_wait       | 2.91052 |
+| 172.16.5.33:2379 | get_operator         | 0       |
+| 172.16.5.33:2379 | get_store            | 0       |
+| 172.16.5.33:2379 | scatter_region       | 0       |
++------------------+----------------------+---------+
+11 rows in set (0.00 sec)
+
+mysql> select address, type, value from pd_client_cmd_ops where start_time='2019-11-14 10:00:00';
++------------------+----------------------+---------+
+| address          | type                 | value   |
++------------------+----------------------+---------+
+| 172.16.5.33:2379 | update_gc_safe_point | 0       |
+| 172.16.5.33:2379 | wait                 | 0.82052 |
+| 172.16.5.33:2379 | get_all_stores       | 0       |
+| 172.16.5.33:2379 | get_prev_region      | 0       |
+| 172.16.5.33:2379 | get_region           | 0       |
+| 172.16.5.33:2379 | get_region_byid      | 0       |
+| 172.16.5.33:2379 | scan_regions         | 0       |
+| 172.16.5.33:2379 | tso_async_wait       | 0.82052 |
+| 172.16.5.33:2379 | get_operator         | 0       |
+| 172.16.5.33:2379 | get_store            | 0       |
+| 172.16.5.33:2379 | scatter_region       | 0       |
++------------------+----------------------+---------+
+11 rows in set (0.00 sec)
+```
+
+PromQL query results with multiple labels are mapped to multiple columns, which can be easily filtered and aggregated by the existing SQL execution engine.
+
+#### Performance profiling system table
+
+Node performance sampling data is obtained through the `/debug/pprof/profile` interface of each node, and the aggregated profiling results are output to the user as SQL query results. Since SQL query results cannot be rendered in SVG format, we need to solve the problem of how to display them.
+
+Our proposed solution is to aggregate the sampled data and display all call paths line by line in a tree structure, in the same way a flame graph organizes them.
The core ideas and the corresponding implementations are described below:
+
+- Provide a global view: a separate column shows the global usage proportion of each aggregated result, which makes filtering and sorting easy
+- Show all call paths: all call paths are included in the query results, and a separate column numbers the subtree of each call path, so a single subtree can easily be inspected by filtering
+- Hierarchical display: the call stacks are displayed as a tree structure, and a separate column records the depth of each stack, which makes it convenient to filter stacks at different depths
+
+This proposal needs to implement the following performance profiling tables:
+
+| Table | Description |
+|------|-----|
+| tidb_profile_cpu | TiDB CPU flame graph |
+| tikv_profile_cpu | TiKV CPU flame graph |
+| tidb_profile_block | Stack traces that led to blocking on synchronization primitives |
+| tidb_profile_memory | A sampling of memory allocations of live objects |
+| tidb_profile_allocs | A sampling of all past memory allocations |
+| tidb_profile_mutex | Stack traces of holders of contended mutexes |
+| tidb_profile_goroutines | Stack traces of all current goroutines |
+
+#### Globalized memory system tables
+
+Currently, the `slow_query`/`events_statements_summary_by_digest`/`processlist` memory tables only contain the data of a single node. This proposal allows any TiDB instance to view the information of the entire cluster by adding the following three cluster-level system tables:
+
+| Table Name | Description |
+|------|-----|
+| tidb_cluster_slow_query | slow_query table data for all TiDB nodes |
+| tidb_cluster_statements_summary | statements summary table data for all TiDB nodes |
+| tidb_cluster_processlist | processlist table data for all TiDB nodes |
+
+#### Configuration information of all nodes
+
+For a large cluster, obtaining the configuration node by node through the HTTP API is cumbersome and inefficient. This proposal provides a full-cluster configuration information system table, which simplifies the acquisition, filtering, and aggregation of the configuration information of the entire cluster.
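+
+As a sketch of how such a table could be populated (not the actual implementation), the following Go snippet fans out HTTP `/config` requests to each node and flattens the nested JSON responses into `(type, address, key, value)` rows. The endpoint list and the `flatten` helper are hypothetical; in TiDB the addresses would come from the cluster topology system table:
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"net/http"
+)
+
+// configRow mirrors one row of the proposed tidb_cluster_config table.
+type configRow struct {
+	Type, Address, Key, Value string
+}
+
+// flatten turns nested JSON config into dotted keys, so that
+// {"log": {"level": "info"}} becomes the row key "log.level".
+func flatten(prefix string, v interface{}, tp, addr string, out *[]configRow) {
+	switch x := v.(type) {
+	case map[string]interface{}:
+		for k, child := range x {
+			key := k
+			if prefix != "" {
+				key = prefix + "." + k
+			}
+			flatten(key, child, tp, addr, out)
+		}
+	default:
+		// Scalars (and, for simplicity, arrays) are stored as strings.
+		*out = append(*out, configRow{tp, addr, prefix, fmt.Sprint(x)})
+	}
+}
+
+func main() {
+	// Hypothetical status endpoints, hard-coded here only for illustration.
+	endpoints := map[string]string{
+		"tidb": "127.0.0.1:10080",
+		"tikv": "127.0.0.1:20180",
+	}
+	var rows []configRow
+	for tp, addr := range endpoints {
+		resp, err := http.Get("http://" + addr + "/config")
+		if err != nil {
+			continue // an unreachable node should not fail the whole query
+		}
+		var cfg map[string]interface{}
+		if json.NewDecoder(resp.Body).Decode(&cfg) == nil {
+			flatten("", cfg, tp, addr, &rows)
+		}
+		resp.Body.Close()
+	}
+	for _, r := range rows {
+		fmt.Printf("%s\t%s\t%s\t%s\n", r.Type, r.Address, r.Key, r.Value)
+	}
+}
+```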
+
+See the following example for some expected results of this proposal:
+
+```
+mysql> use information_schema;
+Database changed
+
+mysql> select * from tidb_cluster_config where `key` like 'log%';
++------+-----------------+-----------------------------+---------------+
+| TYPE | ADDRESS         | KEY                         | VALUE         |
++------+-----------------+-----------------------------+---------------+
+| pd   | 127.0.0.1:2379  | log-file                    |               |
+| pd   | 127.0.0.1:2379  | log-level                   |               |
+| pd   | 127.0.0.1:2379  | log.development             | false         |
+| pd   | 127.0.0.1:2379  | log.disable-caller          | false         |
+| pd   | 127.0.0.1:2379  | log.disable-error-verbose   | true          |
+| pd   | 127.0.0.1:2379  | log.disable-stacktrace      | false         |
+| pd   | 127.0.0.1:2379  | log.disable-timestamp       | false         |
+| pd   | 127.0.0.1:2379  | log.file.filename           |               |
+| pd   | 127.0.0.1:2379  | log.file.log-rotate         | true          |
+| pd   | 127.0.0.1:2379  | log.file.max-backups        | 0             |
+| pd   | 127.0.0.1:2379  | log.file.max-days           | 0             |
+| pd   | 127.0.0.1:2379  | log.file.max-size           | 0             |
+| pd   | 127.0.0.1:2379  | log.format                  | text          |
+| pd   | 127.0.0.1:2379  | log.level                   |               |
+| pd   | 127.0.0.1:2379  | log.sampling                |               |
+| tidb | 127.0.0.1:4000  | log.disable-error-stack     |               |
+| tidb | 127.0.0.1:4000  | log.disable-timestamp       |               |
+| tidb | 127.0.0.1:4000  | log.enable-error-stack      |               |
+| tidb | 127.0.0.1:4000  | log.enable-timestamp        |               |
+| tidb | 127.0.0.1:4000  | log.expensive-threshold     | 10000         |
+| tidb | 127.0.0.1:4000  | log.file.filename           |               |
+| tidb | 127.0.0.1:4000  | log.file.max-backups        | 0             |
+| tidb | 127.0.0.1:4000  | log.file.max-days           | 0             |
+| tidb | 127.0.0.1:4000  | log.file.max-size           | 300           |
+| tidb | 127.0.0.1:4000  | log.format                  | text          |
+| tidb | 127.0.0.1:4000  | log.level                   | info          |
+| tidb | 127.0.0.1:4000  | log.query-log-max-len       | 4096          |
+| tidb | 127.0.0.1:4000  | log.record-plan-in-slow-log | 1             |
+| tidb | 127.0.0.1:4000  | log.slow-query-file         | tidb-slow.log |
+| tidb | 127.0.0.1:4000  | log.slow-threshold          | 300           |
+| tikv | 127.0.0.1:20160 | log-file                    |               |
+| tikv | 127.0.0.1:20160 | log-level                   | info          |
+| tikv | 127.0.0.1:20160 | log-rotation-timespan       | 1d            |
++------+-----------------+-----------------------------+---------------+
+33 rows in set (0.00 sec)
+
+mysql> select * from tidb_cluster_config where type='tikv' and `key` like 'raftdb.wal%';
++------+-----------------+---------------------------+--------+
+| TYPE | ADDRESS         | KEY                       | VALUE  |
++------+-----------------+---------------------------+--------+
+| tikv | 127.0.0.1:20160 | raftdb.wal-bytes-per-sync | 512KiB |
+| tikv | 127.0.0.1:20160 | raftdb.wal-dir            |        |
+| tikv | 127.0.0.1:20160 | raftdb.wal-recovery-mode  | 2      |
+| tikv | 127.0.0.1:20160 | raftdb.wal-size-limit     | 0KiB   |
+| tikv | 127.0.0.1:20160 | raftdb.wal-ttl-seconds    | 0      |
++------+-----------------+---------------------------+--------+
+5 rows in set (0.01 sec)
+```
+
+#### Node hardware/system/load information system tables
+
+According to the definition of the gRPC service protocol, each `ServerInfoItem` contains the name of the information and the corresponding key-value pair. When presented to the user, the type of the node and the node address are added as extra columns.
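+
+A minimal sketch of this presentation step, assuming the `ServerInfoItem` message defined in the gRPC service above (the `toRows` helper is illustrative only, not part of the proposal):
+
+```go
+package main
+
+import "fmt"
+
+// ServerInfoItem mirrors the message from the gRPC service definition above.
+type ServerInfoItem struct {
+	Tp    string // cpu, memory, disk, network ...
+	Name  string // e.g. eth0, cpu-1, /dev/sda
+	Key   string
+	Value string
+}
+
+// toRows prepends the node type and address (known from the topology
+// table) to every ServerInfoItem, producing tidb_cluster_hardware rows.
+func toRows(nodeType, nodeAddr string, items []ServerInfoItem) [][]string {
+	rows := make([][]string, 0, len(items))
+	for _, it := range items {
+		rows = append(rows, []string{nodeType, nodeAddr, it.Tp, it.Name, it.Key, it.Value})
+	}
+	return rows
+}
+
+func main() {
+	items := []ServerInfoItem{
+		{Tp: "cpu", Name: "cpu-1", Key: "frequency", Value: "3.3GHz"},
+		{Tp: "disk", Name: "/dev/sda", Key: "capacity", Value: "4096GB"},
+	}
+	for _, row := range toRows("tikv", "127.0.0.1:20160", items) {
+		fmt.Println(row)
+	}
+}
+```
+
+The expected query results are as follows: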
+
+```
+mysql> use information_schema;
+Database changed
+
+mysql> select * from tidb_cluster_hardware;
++------+-----------------+----------+----------+-------------+--------+
+| TYPE | ADDRESS         | HW_TYPE  | HW_NAME  | KEY         | VALUE  |
++------+-----------------+----------+----------+-------------+--------+
+| tikv | 127.0.0.1:20160 | cpu      | cpu-1    | frequency   | 3.3GHz |
+| tikv | 127.0.0.1:20160 | cpu      | cpu-2    | frequency   | 3.6GHz |
+| tikv | 127.0.0.1:20160 | cpu      | cpu-1    | core        | 40     |
+| tikv | 127.0.0.1:20160 | cpu      | cpu-2    | core        | 48     |
+| tikv | 127.0.0.1:20160 | cpu      | cpu-1    | vcore       | 80     |
+| tikv | 127.0.0.1:20160 | cpu      | cpu-2    | vcore       | 96     |
+| tikv | 127.0.0.1:20160 | memory   | memory   | capacity    | 256GB  |
+| tikv | 127.0.0.1:20160 | network  | lo0      | bandwidth   | 10000M |
+| tikv | 127.0.0.1:20160 | network  | eth0     | bandwidth   | 1000M  |
+| tikv | 127.0.0.1:20160 | disk     | /dev/sda | capacity    | 4096GB |
++------+-----------------+----------+----------+-------------+--------+
+10 rows in set (0.01 sec)
+
+mysql> select * from tidb_cluster_systeminfo;
++------+-----------------+----------+--------------+--------+
+| TYPE | ADDRESS         | MODULE   | KEY          | VALUE  |
++------+-----------------+----------+--------------+--------+
+| tikv | 127.0.0.1:20160 | sysctl   | ktrace.state | 0      |
+| tikv | 127.0.0.1:20160 | sysctl   | hw.byteorder | 1234   |
+| ... |
++------+-----------------+----------+--------------+--------+
+20 rows in set (0.01 sec)
+
+mysql> select * from tidb_cluster_load;
++------+-----------------+----------+-------------+--------+
+| TYPE | ADDRESS         | MODULE   | KEY         | VALUE  |
++------+-----------------+----------+-------------+--------+
+| tikv | 127.0.0.1:20160 | network  | rsec/s      | 1000Kb |
+| ... |
++------+-----------------+----------+-------------+--------+
+100 rows in set (0.01 sec)
+```
+
+#### Full-chain log system table
+
+To search logs today, users need to log in to multiple machines and retrieve the logs from each of them separately, and there is no easy way to sort the retrieval results of multiple machines by time. This proposal creates a new `tidb_cluster_log` system table to provide full-chain logs, thereby simplifying log-based troubleshooting and improving its efficiency. This is achieved by pushing the log-filtering predicates down to each node through the `search_log` interface of the gRPC Diagnostics service; the filtered logs are eventually merged by time.
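+
+The time-ordered merge can be sketched in Go as a k-way merge over the per-node result streams using `container/heap`; the types below are simplified stand-ins for the gRPC messages, not the actual implementation:
+
+```go
+package main
+
+import (
+	"container/heap"
+	"fmt"
+)
+
+// logEntry is a simplified LogMessage plus its origin node.
+type logEntry struct {
+	node string
+	time int64 // unix milliseconds, as in SearchLogRequest
+	msg  string
+}
+
+// stream holds the (already filtered) logs returned by one node's search_log RPC.
+type stream struct {
+	entries []logEntry
+	pos     int
+}
+
+type mergeHeap []*stream
+
+func (h mergeHeap) Len() int { return len(h) }
+func (h mergeHeap) Less(i, j int) bool {
+	return h[i].entries[h[i].pos].time < h[j].entries[h[j].pos].time
+}
+func (h mergeHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
+func (h *mergeHeap) Push(x interface{}) { *h = append(*h, x.(*stream)) }
+func (h *mergeHeap) Pop() interface{} {
+	old := *h
+	n := len(old)
+	x := old[n-1]
+	*h = old[:n-1]
+	return x
+}
+
+// mergeByTime performs a k-way merge so that logs from all nodes
+// come out in global time order.
+func mergeByTime(streams []*stream) []logEntry {
+	h := &mergeHeap{}
+	for _, s := range streams {
+		if len(s.entries) > 0 {
+			heap.Push(h, s)
+		}
+	}
+	var out []logEntry
+	for h.Len() > 0 {
+		s := heap.Pop(h).(*stream)
+		out = append(out, s.entries[s.pos])
+		s.pos++
+		if s.pos < len(s.entries) {
+			heap.Push(h, s)
+		}
+	}
+	return out
+}
+
+func main() {
+	tidb := &stream{entries: []logEntry{{"tidb", 100, "a"}, {"tidb", 300, "c"}}}
+	tikv := &stream{entries: []logEntry{{"tikv", 200, "b"}}}
+	for _, e := range mergeByTime([]*stream{tidb, tikv}) {
+		fmt.Println(e.node, e.time, e.msg)
+	}
+}
+```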
+
+The following example shows the expected results of this proposal:
+
+```
+mysql> use information_schema;
+Database changed
+
+mysql> desc tidb_cluster_log;
++---------+-------------+------+------+---------+-------+
+| Field   | Type        | Null | Key  | Default | Extra |
++---------+-------------+------+------+---------+-------+
+| type    | varchar(16) | YES  |      | NULL    |       |
+| address | varchar(32) | YES  |      | NULL    |       |
+| time    | varchar(32) | YES  |      | NULL    |       |
+| level   | varchar(8)  | YES  |      | NULL    |       |
+| message | text        | YES  |      | NULL    |       |
++---------+-------------+------+------+---------+-------+
+5 rows in set (0.00 sec)
+
+mysql> select * from tidb_cluster_log where message like '%412134239937495042%'; -- Query the full-chain log related to TSO 412134239937495042
++------+--------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------+
+| TYPE | ADDRESS | LEVEL | MESSAGE |
++------+--------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------+
+| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:501.60574ms txnStartTS:412134239937495042 region_id:180 store_addr:10.9.82.29:20160 kv_process_ms:416 scan_total_write:340807 scan_processed_write:340806 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] |
+| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:698.095216ms txnStartTS:412134239937495042 region_id:88 store_addr:10.9.1.128:20160 kv_process_ms:583 scan_total_write:491123 scan_processed_write:491122 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] |
+| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.529574387s txnStartTS:412134239937495042 region_id:112 store_addr:10.9.1.128:20160 kv_process_ms:945 scan_total_write:831931 scan_processed_write:831930 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] |
+| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.55722114s txnStartTS:412134239937495042 region_id:100 store_addr:10.9.82.29:20160 kv_process_ms:1000 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] |
+| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.608597018s txnStartTS:412134239937495042 region_id:96 store_addr:10.9.137.171:20160 kv_process_ms:1048 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0
scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.614233631s txnStartTS:412134239937495042 region_id:92 store_addr:10.9.137.171:20160 kv_process_ms:1000 scan_total_write:831931 scan_processed_write:831930 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.67587146s txnStartTS:412134239937495042 region_id:116 store_addr:10.9.137.171:20160 kv_process_ms:950 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.693188495s txnStartTS:412134239937495042 region_id:108 store_addr:10.9.1.128:20160 kv_process_ms:949 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.693383633s txnStartTS:412134239937495042 region_id:120 store_addr:10.9.1.128:20160 kv_process_ms:951 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.731990066s txnStartTS:412134239937495042 region_id:128 store_addr:10.9.82.29:20160 kv_process_ms:1035 scan_total_write:831931 scan_processed_write:831930 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.744524732s txnStartTS:412134239937495042 region_id:104 store_addr:10.9.137.171:20160 kv_process_ms:1030 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.786915459s txnStartTS:412134239937495042 region_id:132 store_addr:10.9.82.29:20160 kv_process_ms:1014 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.786978732s txnStartTS:412134239937495042 region_id:124 store_addr:10.9.82.29:20160 kv_process_ms:1002 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tikv | 10.9.82.29:20180 | WARN | [tracker.rs:150] [slow-query] [internal_key_skipped_count=831928] [internal_delete_skipped_count=0] [block_cache_hit_count=17] [block_read_count=1810] [block_read_byte=114945337] [scan_first_range="Some(start: 74800000000000002B5F728000000000130A96 end: 74800000000000002B5F728000000000196372)"] [scan_ranges=1] [scan_iter_processed=831928] [scan_iter_ops=831930] [scan_is_desc=false] [tag=select] [table_id=43] [txn_start_ts=412134239937495042] [wait_time=1ms] [total_process_time=1.001s] [peer_id=ipv4:10.9.120.251:47968] [region_id=100] | +| tikv | 10.9.82.29:20180 | WARN | [tracker.rs:150] [slow-query] [internal_key_skipped_count=831928] [internal_delete_skipped_count=0] [block_cache_hit_count=19] [block_read_count=1793] [block_read_byte=96014381] [scan_first_range="Some(start: 
74800000000000002B5F728000000000393526 end: 74800000000000002B5F7280000000003F97A6)"] [scan_ranges=1] [scan_iter_processed=831928] [scan_iter_ops=831930] [scan_is_desc=false] [tag=select] [table_id=43] [txn_start_ts=412134239937495042] [wait_time=1ms] [total_process_time=1.002s] [peer_id=ipv4:10.9.120.251:47994] [region_id=124] | +| tikv | 10.9.82.29:20180 | WARN | [tracker.rs:150] [slow-query] [internal_key_skipped_count=831928] [internal_delete_skipped_count=0] [block_cache_hit_count=17] [block_read_count=1811] [block_read_byte=96620574] [scan_first_range="Some(start: 74800000000000002B5F72800000000045F083 end: 74800000000000002B5F7280000000004C51E4)"] [scan_ranges=1] [scan_iter_processed=831928] [scan_iter_ops=831930] [scan_is_desc=false] [tag=select] [table_id=43] [txn_start_ts=412134239937495042] [wait_time=1ms] [total_process_time=1.014s] [peer_id=ipv4:10.9.120.251:47998] [region_id=132] | +| tikv | 10.9.137.171:20180 | WARN | [tracker.rs:150] [slow-query] [internal_key_skipped_count=831928] [internal_delete_skipped_count=0] [block_cache_hit_count=17] [block_read_count=1779] [block_read_byte=95095959] [scan_first_range="Some(start: 74800000000000002B5F7280000000004C51E4 end: 74800000000000002B5F72800000000052B456)"] [scan_ranges=1] [scan_iter_processed=831928] [scan_iter_ops=831930] [scan_is_desc=false] [tag=select] [table_id=43] [txn_start_ts=412134239937495042] [wait_time=2ms] [total_process_time=1.025s] [peer_id=ipv4:10.9.120.251:34926] [region_id=136] | +| tikv | 10.9.137.171:20180 | WARN | [tracker.rs:150] [slow-query] [internal_key_skipped_count=831928] [internal_delete_skipped_count=0] [block_cache_hit_count=15] [block_read_count=1793] [block_read_byte=114024055] [scan_first_range="Some(start: 74800000000000002B5F728000000000196372 end: 74800000000000002B5F7280000000001FC628)"] [scan_ranges=1] [scan_iter_processed=831928] [scan_iter_ops=831930] [scan_is_desc=false] [tag=select] [table_id=43] [txn_start_ts=412134239937495042] [wait_time=2ms] [total_process_time=1.03s] [peer_id=ipv4:10.9.120.251:34954] [region_id=104] | +| tikv | 10.9.82.29:20180 | WARN | [tracker.rs:150] [slow-query] [internal_key_skipped_count=831930] [internal_delete_skipped_count=0] [block_cache_hit_count=18] [block_read_count=1796] [block_read_byte=96116255] [scan_first_range="Some(start: 74800000000000002B5F7280000000003F97A6 end: 74800000000000002B5F72800000000045F083)"] [scan_ranges=1] [scan_iter_processed=831930] [scan_iter_ops=831932] [scan_is_desc=false] [tag=select] [table_id=43] [txn_start_ts=412134239937495042] [wait_time=1ms] [total_process_time=1.035s] [peer_id=ipv4:10.9.120.251:47996] [region_id=128] | +| tikv | 10.9.137.171:20180 | WARN | [tracker.rs:150] [slow-query] [internal_key_skipped_count=831928] [internal_delete_skipped_count=0] [block_cache_hit_count=15] [block_read_count=1792] [block_read_byte=113958562] [scan_first_range="Some(start: 74800000000000002B5F7280000000000CB1BA end: 74800000000000002B5F728000000000130A96)"] [scan_ranges=1] [scan_iter_processed=831928] [scan_iter_ops=831930] [scan_is_desc=false] [tag=select] [table_id=43] [txn_start_ts=412134239937495042] [wait_time=1ms] [total_process_time=1.048s] [peer_id=ipv4:10.9.120.251:34924] [region_id=96] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.841528722s txnStartTS:412134239937495042 region_id:140 store_addr:10.9.137.171:20160 kv_process_ms:991 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | 
+| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.410650751s txnStartTS:412134239937495042 region_id:144 store_addr:10.9.82.29:20160 kv_process_ms:1000 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.930478221s txnStartTS:412134239937495042 region_id:136 store_addr:10.9.137.171:20160 kv_process_ms:1025 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.26929792s txnStartTS:412134239937495042 region_id:148 store_addr:10.9.82.29:20160 kv_process_ms:901 scan_total_write:831931 scan_processed_write:831930 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.116672983s txnStartTS:412134239937495042 region_id:152 store_addr:10.9.82.29:20160 kv_process_ms:828 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.642668083s txnStartTS:412134239937495042 region_id:156 store_addr:10.9.1.128:20160 kv_process_ms:888 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.537375971s txnStartTS:412134239937495042 region_id:168 store_addr:10.9.137.171:20160 kv_process_ms:728 scan_total_write:831931 scan_processed_write:831930 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.602765417s txnStartTS:412134239937495042 region_id:164 store_addr:10.9.82.29:20160 kv_process_ms:871 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.583965975s txnStartTS:412134239937495042 region_id:172 store_addr:10.9.1.128:20160 kv_process_ms:933 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.712528952s txnStartTS:412134239937495042 region_id:160 store_addr:10.9.1.128:20160 kv_process_ms:959 scan_total_write:831929 scan_processed_write:831928 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.664343044s txnStartTS:412134239937495042 region_id:220 store_addr:10.9.1.128:20160 kv_process_ms:976 scan_total_write:865647 scan_processed_write:865646 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | +| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:1.713342373s txnStartTS:412134239937495042 region_id:176 store_addr:10.9.1.128:20160 kv_process_ms:950 scan_total_write:831929 scan_processed_write:831928 
scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] |
++------+--------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------+
+31 rows in set (0.01 sec)
+
+mysql> select * from tidb_cluster_log where type='pd' and message like '%scheduler%'; -- Query scheduler logs of PD
+
+mysql> select * from tidb_cluster_log where type='tidb' and message like '%ddl%'; -- Query DDL logs of TiDB
+```
+
+### Cluster diagnostics
+
+In the current cluster topology, the components are dispersed, and the data sources and data formats are heterogeneous, so it is not convenient to perform cluster diagnosis programmatically and manual diagnosis is required. With the system tables provided by the previous layers, each TiDB node has a stable Global View of the full cluster. Based on this, a problem diagnosis framework can be implemented: by defining diagnostic rules, existing and potential problems in the cluster can be discovered quickly.
+
+**Diagnostic rule definition**: diagnostic rules are the logic for finding problems by reading data from the various system tables and detecting abnormal data.
+
+Diagnostic rules can be divided into three levels:
+
+- Discover potential problems: for example, identify insufficient disk capacity from the ratio of disk usage to disk capacity
+- Locate existing problems: for example, determine from the load information that the Coprocessor thread pool has become the bottleneck
+- Provide fix suggestions: for example, after observing high latency through disk IO analysis, recommend replacing the disk
+
+This proposal mainly focuses on implementing the diagnostic framework and some of the diagnostic rules. More diagnostic rules need to be accumulated gradually based on experience, with the ultimate goal of becoming an expert system that lowers the bar for using and operating TiDB. The following content does not detail specific diagnostic rules; it focuses on the implementation of the diagnostic framework.
+
+#### Diagnostic framework design
+
+A variety of user scenarios must be considered in the diagnostic framework design, including but not limited to:
+
+- After settling on a fixed version, users won't easily upgrade the TiDB cluster
+- Users may define their own diagnostic rules
+- New diagnostic rules should be loadable without restarting the cluster
+- The diagnostic framework needs to be easily integrated with existing operation and maintenance systems
+- Users may want to disable certain diagnostic rules; for example, a user who deliberately runs a heterogeneous system will disable the rules related to heterogeneity
+- ...
+
+Therefore, implementing a diagnostic system that supports hot rule loading is necessary.
Currently there are the following options:
+
+- Golang Plugin: use plugins to define diagnostic rules and load them into the TiDB process
+  - Advantages: low development threshold, since plugins are written in Golang
+  - Disadvantages: version management is error-prone, and a plugin must be compiled with the same Golang version as the host TiDB
+- Embedded Lua: load Lua scripts at runtime or during startup; the scripts read system table data from TiDB, evaluate it against the diagnostic rules, and provide feedback
+  - Advantages: Lua is a simple, fully host-dependent scripting language that is easy to integrate with the host
+  - Disadvantages: relies on another scripting language
+- Shell Script: Shell supports flow control, so diagnostic rules can be defined in Shell
+  - Advantages: easy to write, load, and execute
+  - Disadvantages: needs to run on a machine where the MySQL client is installed
+
+This proposal temporarily adopts the third option and writes diagnostic rules in Shell. This approach is not intrusive to TiDB, and it leaves room for better solutions in subsequent implementations.
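+
+For illustration only, a diagnostic rule under this scheme might look like the following sketch. The `tidb_cluster_hardware` table and its columns follow the system tables proposed above, while the `used-percent` key, the connection parameters, and the 80% threshold are hypothetical:
+
+```bash
+#!/bin/bash
+# Hypothetical diagnostic rule: warn when any disk on any node is more
+# than 80% full. The `used-percent` key is an assumed metric name, not
+# a value guaranteed by this proposal.
+mysql -h 127.0.0.1 -P 4000 -u root -N -B -e "
+  SELECT type, address, hw_name, value
+  FROM information_schema.tidb_cluster_hardware
+  WHERE hw_type = 'disk' AND \`key\` = 'used-percent'
+" | while read -r type address name value; do
+  usage=${value%\%}              # strip a trailing '%', if present
+  if [ "${usage%.*}" -gt 80 ]; then
+    echo "WARNING: $type $address disk $name usage is ${usage}%"
+  fi
+done
+```
+
+Such scripts can be versioned and distributed independently of the TiDB binary, which is the scalability this option aims for.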