Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: tidb built-in sql diagnostics #13481

Merged
merged 16 commits into from
Nov 28, 2019
Merged

proposal: tidb built-in sql diagnostics #13481

merged 16 commits into from
Nov 28, 2019

Conversation

lonng
Copy link
Contributor

@lonng lonng commented Nov 15, 2019

Signed-off-by: Lonng heng@lonng.org

Summary

Currently, TiDB diagnostic information acquisition relies mainly on external tools (perf/iosnoop/iotop/iostat/vmstat/sar/...), monitoring systems (Prometheus/Grafana), log files, HTTP APIs, and system tables provided by TiDB. The decentralized toolchains and cumbersome acquisition methods lead to high barriers to the use of TiDB clusters, difficulty in operation and maintenance, failure to detect problems in advance, and failure to timely investigate, diagnose, and recover clusters.
This proposal proposes a new method of acquiring diagnostic information in TiDB and exposing diagnostic information by the system tables so that users can query using SQL.

Motivation

This proposal mainly solves the following problems in TiDB's process of obtaining diagnostic information:

  • The toolchains are scattered, it needs to switch back and forth between different tools, and some Linux distributions do not have built-in corresponding tools or built-in tools don't have versions as expected.
  • The information acquisition methods are inconsistent, such as SQL, HTTP, export monitoring, login to each node to view logs, and so on.
  • There are many TiDB cluster components, and the correlation monitoring information between different components is inefficient and cumbersome.
  • TiDB does not have centralized log management components, and there is no efficient ways to filter, retrieve, analyze, and aggregate logs of the entire cluster.
  • The system table only contains the current node information, and does not reflect the state of the entire cluster, such as: SLOW_QUERY, PROCESSLIST, STATEMENTS_SUMMARY.

The efficiency of the cluster-based information query, state acquisition, log retrieval, one-click inspection, and fault diagnosis will be improved after the multi-dimensional cluster-level system table and the cluster's diagnostic rule framework is provided. And provide basic data for the subsequent abnormal early warning function.

Signed-off-by: Lonng <heng@lonng.org>
@codecov
Copy link

codecov bot commented Nov 15, 2019

Codecov Report

Merging #13481 into master will not change coverage.
The diff coverage is n/a.

@@             Coverage Diff             @@
##             master     #13481   +/-   ##
===========================================
  Coverage   80.1103%   80.1103%           
===========================================
  Files           474        474           
  Lines        117257     117257           
===========================================
  Hits          93935      93935           
  Misses        15948      15948           
  Partials       7374       7374

@dcalvin dcalvin self-requested a review November 15, 2019 06:47
Signed-off-by: Lonng <heng@lonng.org>
Signed-off-by: Lonng <heng@lonng.org>
Signed-off-by: Lonng <heng@lonng.org>
Signed-off-by: Lonng <heng@lonng.org>
Signed-off-by: Lonng <heng@lonng.org>
| 33 | pd | pd-0 | 127.0.0.1:2379 | log.format | text |
| 34 | pd | pd-0 | 127.0.0.1:2379 | log.level | |
| 35 | pd | pd-0 | 127.0.0.1:2379 | log.sampling | <nil> |
| 114 | tidb | tidb-0 | 127.0.0.1:4000 | log.disable-error-stack | <nil> |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tidb-0? can we name a TiDB instance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we cannot, it's a temporary name generated by TIDB_CLUSTER_INFO. Do we need to remote it?

Signed-off-by: Lonng <heng@lonng.org>

#### 监控信息系统表

由于监控指标会随着程序的迭代添加和删除监控指标,对于同一个监控指标,可能有不同的表达式获取监控不同维度的信息。鉴于以上两个需求,需要设计一个有弹性的监控系统表框架,本提案暂时才采取以下方案:将表达式映射为 `metrics_schema` 数据库中的系统表,表达式与系统表的关系可以通过以下方式关联:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also need to provide a mechanism that checking the query is valid.

docs/design/2019-11-14-tidb-builtin-diagnostics-zh_CN.md Outdated Show resolved Hide resolved
| 126 | tidb | tidb-0 | 127.0.0.1:4000 | log.record-plan-in-slow-log | 1 |
| 127 | tidb | tidb-0 | 127.0.0.1:4000 | log.slow-query-file | tidb-slow.log |
| 128 | tidb | tidb-0 | 127.0.0.1:4000 | log.slow-threshold | 300 |
| 213 | tikv | tikv-0 | 127.0.0.1:20160 | log-file | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of the tikv name suffix, store id?

Copy link
Contributor Author

@lonng lonng Nov 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has been removed and I have not updated this part yet.
See: #13587

+------+------+--------+-----------------+-----------------------------+---------------+
| ID | TYPE | NAME | ADDRESS | KEY | VALUE |
+------+------+--------+-----------------+-----------------------------+---------------+
| 21 | pd | pd-0 | 127.0.0.1:2379 | log-file | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of ID?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has been removed and I have not updated this part yet.
See: #13587

docs/design/2019-11-14-tidb-builtin-diagnostics-zh_CN.md Outdated Show resolved Hide resolved
- 劣势:增加集群运维难度,第三方组件不容易与 TiDB 内部 SQL 集成;日志收集工具会收集全量日志,收集过程占用各个系统资源(磁盘 IO、网络 IO)
- 各个节点提供日志服务,TiDB 通过各个节点的接口将谓词下推到日志检索接口,直接对各个节点返回的日志进行归并
- 优势:不引入三方组件,谓词下推后只返回过滤后的日志,能轻易的与 TiDB SQL 进行集成,并能复用 SQL 引擎的过滤、聚合等
- 劣势:如果节点日志删除后,不能检索到对应日志
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be resolved by allowing TiDB to query other logging systems. Making the log storage pluggable. And it can also be applied to the monitoring system. If users want more persistent monitoring/logging data, they can replace the builtin storage engines for this kind of data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We choose a lightweight way for the time being.


#### 节点配置信息

所有节点都包含当前节点的生效配置,不需要额外的步骤既可拿到配置信息。
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
所有节点都包含当前节点的生效配置,不需要额外的步骤既可拿到配置信息
所有节点都包含当前节点的生效配置,不需要额外的步骤即可拿到配置信息

- rxcmp/s:每秒钟接收的压缩数据包
- txcmp/s:每秒钟发送的压缩数据包
- rxmcst/s:每秒钟接收的多播数据包
- 常用的系统配置:sysctl -a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ulimit -a

1. PD 中添加 `remote-metrics-storage` 配置,暂时配置为 Prometheus Server 的地址。PD 作为 proxy,将请求转移到 Prometheus 上执行,主要有以下考量:
- 后续 PD 实现查询接口实现自举,TiDB 不需要做其他改动
- 用户不使用 TiDB 部署的 Prometheus 而使用自建的监控服务,依然可以使用 SQL 查询监控信息以及诊断框架
2. 将 Prometheus 时序数据保存和查询相应的模块抽离出来,并嵌入到 PD 中
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to embed Prometheus into PD, how about shim TiKV as the storage of metrics via https://github.com/bragfoo/TiPrometheus?


- 使用 Prometheus client 和 PromQL 查询 Prometheus server 的数据
- 优势:有现成解决方案,只需要将 Prometheus server 的地址注册到 TiDB 即可,实现简单
- 劣势:增强了 TiDB 对 Prometheus 的依赖,为后续完全移除 Prometheus 增加了困难
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious about how to integrate the alert and graphing solutions from community, like alertmanager and grafana after we remove Prometheus.

lonng and others added 4 commits November 28, 2019 12:29
Signed-off-by: Lonng <heng@lonng.org>
Co-Authored-By: Tennix <tennix@users.noreply.github.com>
Signed-off-by: Lonng <heng@lonng.org>
Copy link
Contributor

@winkyao winkyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sre-bot sre-bot added the status/can-merge Indicates a PR has been approved by a committer. label Nov 28, 2019
@lonng lonng removed the status/can-merge Indicates a PR has been approved by a committer. label Nov 28, 2019
@sre-bot
Copy link
Contributor

sre-bot commented Nov 28, 2019

/run-all-tests

Copy link
Member

@bb7133 bb7133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@lonng lonng merged commit 08f0072 into pingcap:master Nov 28, 2019
@lonng lonng deleted the sql-diagnositics branch November 28, 2019 06:45
+---------+-------------+------+------+---------+-------+
5 rows in set (0.00 sec)

mysql> select * from tidb_cluster_log where content like '%412134239937495042%'; -- 查询 TSO 为 412134239937495042 全链路日志
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mysql> -- 查询 TSO 为 412134239937495042 全链路日
mysql> select * from tidb_cluster_log where content like '%412134239937495042%'; 

may look better

+------+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| TYPE | ADDRESS | LEVEL | CONTENT |
+------+------------------------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:501.60574ms txnStartTS:412134239937495042 region_id:180 store_addr:10.9.82.29:20160 kv_process_ms:416 scan_total_write:340807 scan_processed_write:340806 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the log is too verbose here, can you simply them and only list a few?

Copy link
Member

@siddontang siddontang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

XiaTianliang pushed a commit to XiaTianliang/tidb that referenced this pull request Dec 21, 2019
Signed-off-by: Lonng <heng@lonng.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.