
[clickhouse] Long running QA tests for replicated cluster #6953

Open
3 of 8 tasks
Tracked by #5999
karencfv opened this issue Oct 30, 2024 · 0 comments
karencfv commented Oct 30, 2024

Overview

As part of the work to roll out replicated ClickHouse we'll need some long-running testing to ensure the stability of the replicated cluster. Specifically, we want to know how stable the replicated cluster is when left alone under load for a while (i.e., days, weeks).

We'll need to monitor the system and answer the following questions periodically over a long period of time (a month or so?):

  • Is data consistent across all replicas? (a spot-check sketch follows this list)
  • Is query performance acceptable under load?
  • Are the queue lengths acceptable under load?
  • Do queue lengths grow over time or are they consistent depending on the load?
  • <more?>
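One coarse way to spot-check consistency is to compare row counts for the same table across every replica. This is just a sketch: oximeter.measurements_f64 stands in for whichever table we end up checking, and oximeter_cluster is the cluster name that shows up elsewhere in this issue.

```sql
-- Compare row counts for one table across all replicas in the cluster;
-- counts that diverge and never converge point at a replication problem.
SELECT
    hostName() AS replica,
    count() AS row_count
FROM clusterAllReplicas('oximeter_cluster', oximeter.measurements_f64)
GROUP BY replica
ORDER BY replica
```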

Implementation

We'll probably want to use clickhouse-admin to extract information from the system. There is a clickhouse-admin binary already installed on each of the clickhouse-{server|keeper} nodes.

To retrieve the information we need, we can leverage native ClickHouse tooling, in particular the system tables.
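For example, something along these lines (a sketch using the standard system.replicas columns, not tied to any particular clickhouse-admin endpoint) would give us queue lengths and replica lag on each server:

```sql
-- Per-table replica health on this server: replication queue depth,
-- pending inserts/merges, and how far behind the replica is.
SELECT
    database,
    table,
    is_readonly,
    absolute_delay,
    queue_size,
    inserts_in_queue,
    merges_in_queue
FROM system.replicas
ORDER BY queue_size DESC
```

Sampling queue_size and absolute_delay periodically would tell us whether the queues grow over time or stay bounded under load.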

Altinity has a pretty cool ClickHouse stress test suite. We can probably use it for running stress tests, or take inspiration from it to create our own stress tests.

Relevant links

Tasks

@karencfv added the clickhouse (Related to the ClickHouse metrics DBMS) and Testing & Analysis (Tests & Analyzers) labels Oct 30, 2024
@karencfv self-assigned this Oct 30, 2024
karencfv added a commit that referenced this issue Nov 6, 2024
## Overview

This commit implements a new clickhouse-admin endpoint to retrieve and
parse information from
[`system.distributed_ddl_queue`](https://clickhouse.com/docs/en/operations/system-tables/distributed_ddl_queue).

## Purpose

As part of [stage 1](https://rfd.shared.oxide.computer/rfd/0468) of
rolling out replicated ClickHouse, we'll need to monitor the replicated
cluster that is installed (but not yet visible) on our dogfood rack. This
endpoint will aid in that monitoring by providing information about
distributed DDL queries.
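For reference, the data the endpoint returns corresponds roughly to a query
like the one below; the exact column list and output format used by
clickhouse-admin are assumptions based on the JSON in the testing output.

```sql
-- Distributed DDL entries as seen by this server, newest first.
SELECT
    entry,
    initiator_host,
    cluster,
    query,
    host,
    port,
    status,
    exception_code,
    query_duration_ms
FROM system.distributed_ddl_queue
ORDER BY entry DESC
FORMAT JSONEachRow
```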

## Testing

```console
$ cargo run --bin=clickhouse-admin-server -- run -c ./smf/clickhouse-admin-server/config.toml -a [::1]:8888 -l [::1]:22001 -b /Users/karcar/src/omicron/out/clickhouse/clickhouse
   Compiling omicron-clickhouse-admin v0.1.0 (/Users/karcar/src/omicron/clickhouse-admin)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 3.79s
     Running `target/debug/clickhouse-admin-server run -c ./smf/clickhouse-admin-server/config.toml -a '[::1]:8888' -l '[::1]:22001' -b /Users/karcar/src/omicron/out/clickhouse/clickhouse`
note: configured to log to "/dev/stdout"
{"msg":"listening","v":0,"name":"clickhouse-admin-server","level":30,"time":"2024-11-04T07:29:41.746887Z","hostname":"ixchel","pid":55286,"local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:197"}
{"msg":"accepted connection","v":0,"name":"clickhouse-admin-server","level":30,"time":"2024-11-04T07:29:55.674044Z","hostname":"ixchel","pid":55286,"local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:1105","remote_addr":"[::1]:59735"}
{"msg":"Retrieved data from `system.distributed_ddl_queue`","v":0,"name":"clickhouse-admin-server","level":30,"time":"2024-11-04T07:29:56.148241Z","hostname":"ixchel","pid":55286,"component":"ClickhouseCli","file":"clickhouse-admin/types/src/lib.rs:1018","output":"\"{\\\"entry\\\":\\\"query-0000000001\\\",\\\"entry_version\\\":5,\\\"initiator_host\\\":\\\"ixchel\\\",\\\"initiator_port\\\":22001,\\\"cluster\\\":\\\"oximeter_cluster\\\",\\\"query\\\":\\\"CREATE DATABASE IF NOT EXISTS db1 UUID '701a3dd3-10f0-4f5d-b5b2-0ad11bcf2b17' ON CLUSTER oximeter_cluster\\\",\\\"settings\\\":{\\\"load_balancing\\\":\\\"random\\\"},\\\"query_create_time\\\":\\\"2024-11-01 16:17:08\\\",\\\"host\\\":\\\"::1\\\",\\\"port\\\":22001,\\\"status\\\":\\\"Finished\\\",\\\"exception_code\\\":0,\\\"exception_text\\\":\\\"\\\",\\\"query_finish_time\\\":\\\"2024-11-01 16:17:08\\\",\\\"query_duration_ms\\\":\\\"3\\\"}\\n{\\\"entry\\\":\\\"query-0000000001\\\",\\\"entry_version\\\":5,\\\"initiator_host\\\":\\\"ixchel\\\",\\\"initiator_port\\\":22001,\\\"cluster\\\":\\\"oximeter_cluster\\\",\\\"query\\\":\\\"CREATE DATABASE IF NOT EXISTS db1 UUID '701a3dd3-10f0-4f5d-b5b2-0ad11bcf2b17' ON CLUSTER oximeter_cluster\\\",\\\"settings\\\":{\\\"load_balancing\\\":\\\"random\\\"},\\\"query_create_time\\\":\\\"2024-11-01 16:17:08\\\",\\\"host\\\":\\\"::1\\\",\\\"port\\\":22002,\\\"status\\\":\\\"Finished\\\",\\\"exception_code\\\":0,\\\"exception_text\\\":\\\"\\\",\\\"query_finish_time\\\":\\\"2024-11-01 16:17:08\\\",\\\"query_duration_ms\\\":\\\"4\\\"}\\n{\\\"entry\\\":\\\"query-0000000000\\\",\\\"entry_version\\\":5,\\\"initiator_host\\\":\\\"ixchel\\\",\\\"initiator_port\\\":22001,\\\"cluster\\\":\\\"oximeter_cluster\\\",\\\"query\\\":\\\"CREATE DATABASE IF NOT EXISTS db1 UUID 'a49757e4-179e-42bd-866f-93ac43136e2d' ON CLUSTER oximeter_cluster\\\",\\\"settings\\\":{\\\"load_balancing\\\":\\\"random\\\",\\\"output_format_json_quote_64bit_integers\\\":\\\"0\\\"},\\\"query_create_time\\\":\\\"2024-11-01 16:16:45\\\",\\\"host\\\":\\\"::1\\\",\\\"port\\\":22002,\\\"status\\\":\\\"Finished\\\",\\\"exception_code\\\":0,\\\"exception_text\\\":\\\"\\\",\\\"query_finish_time\\\":\\\"2024-11-01 16:16:45\\\",\\\"query_duration_ms\\\":\\\"4\\\"}\\n{\\\"entry\\\":\\\"query-0000000000\\\",\\\"entry_version\\\":5,\\\"initiator_host\\\":\\\"ixchel\\\",\\\"initiator_port\\\":22001,\\\"cluster\\\":\\\"oximeter_cluster\\\",\\\"query\\\":\\\"CREATE DATABASE IF NOT EXISTS db1 UUID 'a49757e4-179e-42bd-866f-93ac43136e2d' ON CLUSTER oximeter_cluster\\\",\\\"settings\\\":{\\\"load_balancing\\\":\\\"random\\\",\\\"output_format_json_quote_64bit_integers\\\":\\\"0\\\"},\\\"query_create_time\\\":\\\"2024-11-01 16:16:45\\\",\\\"host\\\":\\\"::1\\\",\\\"port\\\":22001,\\\"status\\\":\\\"Finished\\\",\\\"exception_code\\\":0,\\\"exception_text\\\":\\\"\\\",\\\"query_finish_time\\\":\\\"2024-11-01 16:16:45\\\",\\\"query_duration_ms\\\":\\\"4\\\"}\\n\""}
{"msg":"request completed","v":0,"name":"clickhouse-admin-server","level":30,"time":"2024-11-04T07:29:56.148459Z","hostname":"ixchel","pid":55286,"uri":"/distributed-ddl-queue","method":"GET","req_id":"fb46b182-2573-4daa-a791-118dad20a3c3","remote_addr":"[::1]:59735","local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:950","latency_us":473916,"response_code":"200"}
```

```console
$ curl http://[::1]:8888/distributed-ddl-queue
[{"entry":"query-0000000001","entry_version":5,"initiator_host":"ixchel","initiator_port":22001,"cluster":"oximeter_cluster","query":"CREATE DATABASE IF NOT EXISTS db1 UUID '701a3dd3-10f0-4f5d-b5b2-0ad11bcf2b17' ON CLUSTER oximeter_cluster","settings":{"load_balancing":"random"},"query_create_time":"2024-11-01 16:17:08","host":"::1","port":22001,"status":"Finished","exception_code":0,"exception_text":"","query_finish_time":"2024-11-01 16:17:08","query_duration_ms":"3"},{"entry":"query-0000000001","entry_version":5,"initiator_host":"ixchel","initiator_port":22001,"cluster":"oximeter_cluster","query":"CREATE DATABASE IF NOT EXISTS db1 UUID '701a3dd3-10f0-4f5d-b5b2-0ad11bcf2b17' ON CLUSTER oximeter_cluster","settings":{"load_balancing":"random"},"query_create_time":"2024-11-01 16:17:08","host":"::1","port":22002,"status":"Finished","exception_code":0,"exception_text":"","query_finish_time":"2024-11-01 16:17:08","query_duration_ms":"4"},{"entry":"query-0000000000","entry_version":5,"initiator_host":"ixchel","initiator_port":22001,"cluster":"oximeter_cluster","query":"CREATE DATABASE IF NOT EXISTS db1 UUID 'a49757e4-179e-42bd-866f-93ac43136e2d' ON CLUSTER oximeter_cluster","settings":{"load_balancing":"random","output_format_json_quote_64bit_integers":"0"},"query_create_time":"2024-11-01 16:16:45","host":"::1","port":22002,"status":"Finished","exception_code":0,"exception_text":"","query_finish_time":"2024-11-01 16:16:45","query_duration_ms":"4"},{"entry":"query-0000000000","entry_version":5,"initiator_host":"ixchel","initiator_port":22001,"cluster":"oximeter_cluster","query":"CREATE DATABASE IF NOT EXISTS db1 UUID 'a49757e4-179e-42bd-866f-93ac43136e2d' ON CLUSTER oximeter_cluster","settings":{"load_balancing":"random","output_format_json_quote_64bit_integers":"0"},"query_create_time":"2024-11-01 16:16:45","host":"::1","port":22001,"status":"Finished","exception_code":0,"exception_text":"","query_finish_time":"2024-11-01 16:16:45","query_duration_ms":"4"}]
```

### Caveat

I have purposely not written an integration test, as the port situation
is getting out of hand 😅 I'm working on a better way of running these
integration tests that involves fewer hardcoded ports. I will include an
integration test for this endpoint in that PR.

Related: #6953
karencfv added a commit that referenced this issue Nov 19, 2024
…ic_log` tables (#7100)

## Overview

In order to reliably roll out a replicated ClickHouse cluster, we'll be
running a set of long-running tests (stage 1 of RFD
[468](https://rfd.shared.oxide.computer/rfd/0468#_stage_1)).

ClickHouse provides a `system` database with several tables containing
information about the system. For monitoring purposes, the
[`system.asynchronous_metric_log`](https://clickhouse.com/docs/en/operations/system-tables/asynchronous_metric_log)
and
[`system.metric_log`](https://clickhouse.com/docs/en/operations/system-tables/metric_log)
tables are particularly useful. With them we can retrieve information
about queries per second, CPU usage, memory usage, etc. Full lists of the
available metrics are
[here](https://clickhouse.com/docs/en/operations/system-tables/metrics)
and
[here](https://clickhouse.com/docs/en/operations/system-tables/asynchronous_metrics).
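As a quick sanity check on a running server (assuming only the documented
`metric` and `value` columns), the current values behind these logs can be
listed directly, e.g.:

```sql
-- Point-in-time values; the *_log tables above keep the history of these.
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric LIKE 'Disk%'
ORDER BY metric
```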

During our long-running testing I'd like to give these tables a TTL of
30 days. Once we are confident the system is stable and we roll out the
cluster to all racks, we can reduce the TTL to 7 or 14 days.
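For illustration only (this PR may well set the TTL through server
configuration rather than SQL), a 30-day TTL on one of these tables can be
expressed as:

```sql
-- Drop metric_log rows older than 30 days; asynchronous_metric_log would get
-- the same treatment. A TTL applied via ALTER is only a sketch of the intent:
-- it does not survive the table being recreated from server config.
ALTER TABLE system.metric_log
    MODIFY TTL event_date + INTERVAL 30 DAY
```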

This PR only enables the tables themselves. There will be follow-up
PRs to actually retrieve the data we'll be monitoring.

## Manual testing

### Queries per second

```console
oxz_clickhouse_eecd32cc-ebf2-4196-912f-5bb440b104a0.local :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(ProfileEvent_Query)
FROM system.metric_log
WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400
GROUP BY t
ORDER BY t WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

SELECT
    toStartOfInterval(event_time, toIntervalSecond(60)) AS t,
    avg(ProfileEvent_Query)
FROM system.metric_log
WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400))
GROUP BY t
ORDER BY t ASC WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

Query id: 1b91946b-fe8b-4074-bc94-f071f72f55f5

┌────────────────────t─┬─avg(ProfileEvent_Query)─┐
│ 2024-11-18T06:40:00Z │      1.3571428571428572 │
│ 2024-11-18T06:41:00Z │      1.3666666666666667 │
│ 2024-11-18T06:42:00Z │      1.3666666666666667 │
│ 2024-11-18T06:43:00Z │      1.3666666666666667 │
│ 2024-11-18T06:44:00Z │      1.3666666666666667 │
```

### Disk usage

```console
oxz_clickhouse_eecd32cc-ebf2-4196-912f-5bb440b104a0.local :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(value)
FROM system.asynchronous_metric_log
WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400
AND metric = 'DiskUsed_default'
GROUP BY t
ORDER BY t WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

SELECT
    toStartOfInterval(event_time, toIntervalSecond(60)) AS t,
    avg(value)
FROM system.asynchronous_metric_log
WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400)) AND (metric = 'DiskUsed_default')
GROUP BY t
ORDER BY t ASC WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

Query id: bcf9cd9b-7fea-4aea-866b-d69e60a7c0b6

┌────────────────────t─┬─────────avg(value)─┐
│ 2024-11-18T06:42:00Z │  860941425.7777778 │
│ 2024-11-18T06:43:00Z │  865134523.7333333 │
│ 2024-11-18T06:44:00Z │          871888896 │
│ 2024-11-18T06:45:00Z │  874408891.7333333 │
│ 2024-11-18T06:46:00Z │          878761984 │
│ 2024-11-18T06:47:00Z │  881646933.3333334 │
│ 2024-11-18T06:48:00Z │  883998788.2666667 │

```

Related: #6953
karencfv added a commit that referenced this issue Nov 26, 2024
## Overview

This commit adds an endpoint to retrieve timeseries from the `system`
database. For the time being we only add support for the
`metric_log` and `asynchronous_metric_log` tables.

This endpoint is still a bit bare-bones, but it is a good starting point
for monitoring the ClickHouse servers.

## Examples

### Queries per second

```console
$ curl "http://[::1]:8888/timeseries/metric_log/ProfileEvent_Query/avg" | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2883  100  2883    0     0   5912      0 --:--:-- --:--:-- --:--:--  5907
[
  {
    "time": "2024-11-20T04:09:00Z",
    "value": 0.0
  },
  {
    "time": "2024-11-20T04:10:00Z",
    "value": 0.0
  },
  {
    "time": "2024-11-20T04:11:00Z",
    "value": 0.06666666666666667
  }
]
```

### Disk usage

```console
$ curl "http://[::1]:8888/timeseries/asynchronous_metric_log/DiskUsed_default/avg?interval=120&time_range=3600&timestamp_format=unix_epoch" | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1786  100  1786    0     0   3728      0 --:--:-- --:--:-- --:--:--  3728
[
  {
    "time": "1732513320",
    "value": 120491427254.85716
  },
  {
    "time": "1732513440",
    "value": 120382774033.06668
  },
  {
    "time": "1732513560",
    "value": 120364752622.93332
  }
]
```

Related: #6953