-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[clickhouse] Long running QA tests for replicated cluster #6953
Labels
Comments
karencfv
added
clickhouse
Related to the ClickHouse metrics DBMS
Testing & Analysis
Tests & Analyzers
labels
Oct 30, 2024
20 tasks
karencfv
added a commit
that referenced
this issue
Nov 6, 2024
## Overview This commit implements a new clickhouse-admin endpoint to retrieve and parse information from `[system.distributed_ddl_queue](https://clickhouse.com/docs/en/operations/system-tables/distributed_ddl_queue)`. ## Purpose As part of [stage 1](https://rfd.shared.oxide.computer/rfd/0468) of rolling out replicated ClickHouse, we'll be needing to monitor the installed but not visible replicated cluster on our dogfood rack. This endpoint will aid in said monitoring by providing information about distributed ddl queries. ## Testing ```console $ cargo run --bin=clickhouse-admin-server -- run -c ./smf/clickhouse-admin-server/config.toml -a [::1]:8888 -l [::1]:22001 -b /Users/karcar/src/omicron/out/clickhouse/clickhouse Compiling omicron-clickhouse-admin v0.1.0 (/Users/karcar/src/omicron/clickhouse-admin) Finished `dev` profile [unoptimized + debuginfo] target(s) in 3.79s Running `target/debug/clickhouse-admin-server run -c ./smf/clickhouse-admin-server/config.toml -a '[::1]:8888' -l '[::1]:22001' -b /Users/karcar/src/omicron/out/clickhouse/clickhouse` note: configured to log to "/dev/stdout" {"msg":"listening","v":0,"name":"clickhouse-admin-server","level":30,"time":"2024-11-04T07:29:41.746887Z","hostname":"ixchel","pid":55286,"local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:197"} {"msg":"accepted connection","v":0,"name":"clickhouse-admin-server","level":30,"time":"2024-11-04T07:29:55.674044Z","hostname":"ixchel","pid":55286,"local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:1105","remote_addr":"[::1]:59735"} {"msg":"Retrieved data from `system.distributed_ddl_queue`","v":0,"name":"clickhouse-admin-server","level":30,"time":"2024-11-04T07:29:56.148241Z","hostname":"ixchel","pid":55286,"component":"ClickhouseCli","file":"clickhouse-admin/types/src/lib.rs:1018","output":"\"{\\\"entry\\\":\\\"query-0000000001\\\",\\\"entry_version\\\":5,\\\"initiator_host\\\":\\\"ixchel\\\",\\\"initiator_port\\\":22001,\\\"cluster\\\":\\\"oximeter_cluster\\\",\\\"query\\\":\\\"CREATE DATABASE IF NOT EXISTS db1 UUID '701a3dd3-10f0-4f5d-b5b2-0ad11bcf2b17' ON CLUSTER oximeter_cluster\\\",\\\"settings\\\":{\\\"load_balancing\\\":\\\"random\\\"},\\\"query_create_time\\\":\\\"2024-11-01 16:17:08\\\",\\\"host\\\":\\\"::1\\\",\\\"port\\\":22001,\\\"status\\\":\\\"Finished\\\",\\\"exception_code\\\":0,\\\"exception_text\\\":\\\"\\\",\\\"query_finish_time\\\":\\\"2024-11-01 16:17:08\\\",\\\"query_duration_ms\\\":\\\"3\\\"}\\n{\\\"entry\\\":\\\"query-0000000001\\\",\\\"entry_version\\\":5,\\\"initiator_host\\\":\\\"ixchel\\\",\\\"initiator_port\\\":22001,\\\"cluster\\\":\\\"oximeter_cluster\\\",\\\"query\\\":\\\"CREATE DATABASE IF NOT EXISTS db1 UUID '701a3dd3-10f0-4f5d-b5b2-0ad11bcf2b17' ON CLUSTER oximeter_cluster\\\",\\\"settings\\\":{\\\"load_balancing\\\":\\\"random\\\"},\\\"query_create_time\\\":\\\"2024-11-01 16:17:08\\\",\\\"host\\\":\\\"::1\\\",\\\"port\\\":22002,\\\"status\\\":\\\"Finished\\\",\\\"exception_code\\\":0,\\\"exception_text\\\":\\\"\\\",\\\"query_finish_time\\\":\\\"2024-11-01 16:17:08\\\",\\\"query_duration_ms\\\":\\\"4\\\"}\\n{\\\"entry\\\":\\\"query-0000000000\\\",\\\"entry_version\\\":5,\\\"initiator_host\\\":\\\"ixchel\\\",\\\"initiator_port\\\":22001,\\\"cluster\\\":\\\"oximeter_cluster\\\",\\\"query\\\":\\\"CREATE DATABASE IF NOT EXISTS db1 UUID 'a49757e4-179e-42bd-866f-93ac43136e2d' ON CLUSTER oximeter_cluster\\\",\\\"settings\\\":{\\\"load_balancing\\\":\\\"random\\\",\\\"output_format_json_quote_64bit_integers\\\":\\\"0\\\"},\\\"query_create_time\\\":\\\"2024-11-01 16:16:45\\\",\\\"host\\\":\\\"::1\\\",\\\"port\\\":22002,\\\"status\\\":\\\"Finished\\\",\\\"exception_code\\\":0,\\\"exception_text\\\":\\\"\\\",\\\"query_finish_time\\\":\\\"2024-11-01 16:16:45\\\",\\\"query_duration_ms\\\":\\\"4\\\"}\\n{\\\"entry\\\":\\\"query-0000000000\\\",\\\"entry_version\\\":5,\\\"initiator_host\\\":\\\"ixchel\\\",\\\"initiator_port\\\":22001,\\\"cluster\\\":\\\"oximeter_cluster\\\",\\\"query\\\":\\\"CREATE DATABASE IF NOT EXISTS db1 UUID 'a49757e4-179e-42bd-866f-93ac43136e2d' ON CLUSTER oximeter_cluster\\\",\\\"settings\\\":{\\\"load_balancing\\\":\\\"random\\\",\\\"output_format_json_quote_64bit_integers\\\":\\\"0\\\"},\\\"query_create_time\\\":\\\"2024-11-01 16:16:45\\\",\\\"host\\\":\\\"::1\\\",\\\"port\\\":22001,\\\"status\\\":\\\"Finished\\\",\\\"exception_code\\\":0,\\\"exception_text\\\":\\\"\\\",\\\"query_finish_time\\\":\\\"2024-11-01 16:16:45\\\",\\\"query_duration_ms\\\":\\\"4\\\"}\\n\""} {"msg":"request completed","v":0,"name":"clickhouse-admin-server","level":30,"time":"2024-11-04T07:29:56.148459Z","hostname":"ixchel","pid":55286,"uri":"/distributed-ddl-queue","method":"GET","req_id":"fb46b182-2573-4daa-a791-118dad20a3c3","remote_addr":"[::1]:59735","local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:950","latency_us":473916,"response_code":"200"} ``` ```console $ curl http://[::1]:8888/distributed-ddl-queue [{"entry":"query-0000000001","entry_version":5,"initiator_host":"ixchel","initiator_port":22001,"cluster":"oximeter_cluster","query":"CREATE DATABASE IF NOT EXISTS db1 UUID '701a3dd3-10f0-4f5d-b5b2-0ad11bcf2b17' ON CLUSTER oximeter_cluster","settings":{"load_balancing":"random"},"query_create_time":"2024-11-01 16:17:08","host":"::1","port":22001,"status":"Finished","exception_code":0,"exception_text":"","query_finish_time":"2024-11-01 16:17:08","query_duration_ms":"3"},{"entry":"query-0000000001","entry_version":5,"initiator_host":"ixchel","initiator_port":22001,"cluster":"oximeter_cluster","query":"CREATE DATABASE IF NOT EXISTS db1 UUID '701a3dd3-10f0-4f5d-b5b2-0ad11bcf2b17' ON CLUSTER oximeter_cluster","settings":{"load_balancing":"random"},"query_create_time":"2024-11-01 16:17:08","host":"::1","port":22002,"status":"Finished","exception_code":0,"exception_text":"","query_finish_time":"2024-11-01 16:17:08","query_duration_ms":"4"},{"entry":"query-0000000000","entry_version":5,"initiator_host":"ixchel","initiator_port":22001,"cluster":"oximeter_cluster","query":"CREATE DATABASE IF NOT EXISTS db1 UUID 'a49757e4-179e-42bd-866f-93ac43136e2d' ON CLUSTER oximeter_cluster","settings":{"load_balancing":"random","output_format_json_quote_64bit_integers":"0"},"query_create_time":"2024-11-01 16:16:45","host":"::1","port":22002,"status":"Finished","exception_code":0,"exception_text":"","query_finish_time":"2024-11-01 16:16:45","query_duration_ms":"4"},{"entry":"query-0000000000","entry_version":5,"initiator_host":"ixchel","initiator_port":22001,"cluster":"oximeter_cluster","query":"CREATE DATABASE IF NOT EXISTS db1 UUID 'a49757e4-179e-42bd-866f-93ac43136e2d' ON CLUSTER oximeter_cluster","settings":{"load_balancing":"random","output_format_json_quote_64bit_integers":"0"},"query_create_time":"2024-11-01 16:16:45","host":"::1","port":22001,"status":"Finished","exception_code":0,"exception_text":"","query_finish_time":"2024-11-01 16:16:45","query_duration_ms":"4"}] ``` ### Caveat I have purposely not written an integration test as the port situation is getting out of hand 😅 I'm working on a better implementation of running these integration tests that involves less hardcoded ports. I will include an integration test for this endpoint in that PR Related: #6953
karencfv
added a commit
that referenced
this issue
Nov 19, 2024
…ic_log` tables (#7100) ## Overview In order to reliably roll out a replicated ClickHouse cluster, we'll be running a set of long running testing (stage 1 of RFD [#468](https://rfd.shared.oxide.computer/rfd/0468#_stage_1)). ClickHouse provides a `system` database with several tables with information about the system. For monitoring purposes, the [`system.asynchronous_metric_log`](https://clickhouse.com/docs/en/operations/system-tables/asynchronous_metric_log) and [`system.metric_log`](https://clickhouse.com/docs/en/operations/system-tables/metric_log) tables are particularly useful. With them we can retrieve information about queries per second, CPU usage, memory usage etc. Full lists of available metrics [here](https://clickhouse.com/docs/en/operations/system-tables/metrics) and [here](https://clickhouse.com/docs/en/operations/system-tables/asynchronous_metrics) During our long running testing I'd like to give these tables a TTL of 30 days. Once we are confident the system is stable and we roll out the cluster to all racks, we can reduce TTL to 7 or 14 days. This PR will only enable the tables themselves. There will be follow up PRs to actually retrieve the data we'll be monitoring ## Manual testing ### Queries per second ```console oxz_clickhouse_eecd32cc-ebf2-4196-912f-5bb440b104a0.local :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(ProfileEvent_Query) FROM system.metric_log WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400 GROUP BY t ORDER BY t WITH FILL STEP 60 SETTINGS date_time_output_format = 'iso' SELECT toStartOfInterval(event_time, toIntervalSecond(60)) AS t, avg(ProfileEvent_Query) FROM system.metric_log WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400)) GROUP BY t ORDER BY t ASC WITH FILL STEP 60 SETTINGS date_time_output_format = 'iso' Query id: 1b91946b-fe8b-4074-bc94-f071f72f55f5 ┌────────────────────t─┬─avg(ProfileEvent_Query)─┐ │ 2024-11-18T06:40:00Z │ 1.3571428571428572 │ │ 2024-11-18T06:41:00Z │ 1.3666666666666667 │ │ 2024-11-18T06:42:00Z │ 1.3666666666666667 │ │ 2024-11-18T06:43:00Z │ 1.3666666666666667 │ │ 2024-11-18T06:44:00Z │ 1.3666666666666667 │ ``` ### Disk usage ```console oxz_clickhouse_eecd32cc-ebf2-4196-912f-5bb440b104a0.local :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(value) FROM system.asynchronous_metric_log WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400 AND metric = 'DiskUsed_default' GROUP BY t ORDER BY t WITH FILL STEP 60 SETTINGS date_time_output_format = 'iso' SELECT toStartOfInterval(event_time, toIntervalSecond(60)) AS t, avg(value) FROM system.asynchronous_metric_log WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400)) AND (metric = 'DiskUsed_default') GROUP BY t ORDER BY t ASC WITH FILL STEP 60 SETTINGS date_time_output_format = 'iso' Query id: bcf9cd9b-7fea-4aea-866b-d69e60a7c0b6 ┌────────────────────t─┬─────────avg(value)─┐ │ 2024-11-18T06:42:00Z │ 860941425.7777778 │ │ 2024-11-18T06:43:00Z │ 865134523.7333333 │ │ 2024-11-18T06:44:00Z │ 871888896 │ │ 2024-11-18T06:45:00Z │ 874408891.7333333 │ │ 2024-11-18T06:46:00Z │ 878761984 │ │ 2024-11-18T06:47:00Z │ 881646933.3333334 │ │ 2024-11-18T06:48:00Z │ 883998788.2666667 │ ``` Related #6953
karencfv
added a commit
that referenced
this issue
Nov 26, 2024
## Overview This commit adds an endpoint to retrieve timeseries from the `system` database. For the time being we will only add support for the `metric_log` and `asynchronous_metric_log` tables. This endpoint is still a bit bare bones, but will be a good start to begin monitoring the ClickHouse servers. ## Examples ### Queries per second ```console $ curl "http://[::1]:8888/timeseries/metric_log/ProfileEvent_Query/avg" | jq % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 2883 100 2883 0 0 5912 0 --:--:-- --:--:-- --:--:-- 5907 [ { "time": "2024-11-20T04:09:00Z", "value": 0.0 }, { "time": "2024-11-20T04:10:00Z", "value": 0.0 }, { "time": "2024-11-20T04:11:00Z", "value": 0.06666666666666667 } ] ``` ### Disk usage ```console $ curl "http://[::1]:8888/timeseries/asynchronous_metric_log/DiskUsed_default/avg?interval=120&time_range=3600×tamp_format=unix_epoch" | jq % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 1786 100 1786 0 0 3728 0 --:--:-- --:--:-- --:--:-- 3728 [ { "time": "1732513320", "value": 120491427254.85716 }, { "time": "1732513440", "value": 120382774033.06668 }, { "time": "1732513560", "value": 120364752622.93332 } ] ``` Related: #6953
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
Overview
As part of the work to roll out replicated ClickHouse we'll be needing some long running testing to ensure stability of the replicated cluster. Specifically, we'll be wanting to know how stable the replicated cluster is when left alone for a while under load (i.e., days, weeks).
We'll need to monitor the system and answer the following questions periodically during a long period of time (a month or so?):
Implementation
We'll probably want to use clickhouse-admin to extract information from the system. There is a clickhouse-admin binary already installed in each of the clickhouse-{server|keeper} nodes.
To retrieve the information we need, we can leverage the following native ClickHouse tooling:
system.opentelemetry_span_log
system.replication_queue
andsystem.distributed_ddl_queue
system.system.user_processes
Altinity has a pretty cool ClickHouse stress test suite. We can probably use it for running stress tests, or take inspiration from it to create our own stress tests.
Relevant links
Tasks
system.metric_log
andsystem.asynchronous_metric_log
tables #7100The text was updated successfully, but these errors were encountered: