Replication lag #864
Comments
We see in our metrics that the lag can sometimes grow to 5583808.
Hi @tellienne, are you able to check the CPU and memory usage of the sink connector?
Judging by the k8s metrics, CPU and memory usage have remained virtually unchanged since the first start. At the time of the snapshot the connector uses 4 GB of memory, and after the snapshot the usage does not decrease. When replication starts to slow down there is no change in CPU or memory usage. The debezium_postgres_connector_metrics_TotalNumberOfEventsSeen metric shows that 92,000 events were transmitted in the last 7 hours. What other information can we provide?
Can you check the
We checked it during the last lag.
Could this be causing the lag? Are you running in single-threaded or multi-threaded mode? If it's multi-threaded, you might want to increase the buffer flush time.
We use the default settings. As I understand it, by default it works in multi-threaded mode.
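For anyone following along, a minimal sketch of what the tuning suggested above might look like in the lightweight connector's config.yml; `buffer.flush.time.ms` and `thread.pool.size` are assumed property names (borrowed from the Kafka sink connector's settings) and may differ in v2.3.0-lt:

```yaml
# Hypothetical excerpt from config.yml -- the property names below are
# assumptions based on the "buffer flush time" suggestion above, not
# confirmed for v2.3.0-lt.
buffer.flush.time.ms: "2000"   # flush buffered batches to ClickHouse less frequently
thread.pool.size: "4"          # worker threads used when running multi-threaded
```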
We added it to the config, but we again see a lag. In half an hour we see about 35,000 events and the lag is already 30 minutes, judging by the heartbeat table. There seems to be no effect.
Thanks for the ideas and your help :) The status endpoint only shows output without an LSN.
UPD: we updated the connector to version 2.3.1 and the problem with the offset in single-threaded mode was fixed.
It looks like switching to single.threaded solved the problem. Over the past few hours, the clickhouse_sink_debezium_lag metric has not shown more than 120,000 ms.
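For anyone hitting the same symptom, the change described above would look roughly like this in config.yml (a sketch, assuming the key is set directly in the connector's YAML config):

```yaml
# Sketch: run the sink in single-threaded mode, which resolved the lag here,
# combined with the upgrade to 2.3.1-lt that fixed the offset handling.
single.threaded: "true"
```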
Hello!
Thank you for your work, we really like the connector-lightweight.
Recently we noticed that replication at some point begins to lag, although there are not many events in the database.
After the snapshot, replication works fine for a couple of hours, but then a lag begins and continues to grow.
We have a heartbeat table into which Debezium inserts a value every 15 seconds.
From this table it is clearly visible when the lag has begun.
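A heartbeat like this is typically driven by Debezium's heartbeat properties; a rough sketch of the corresponding config.yml fragment, assuming a hypothetical public.debezium_heartbeat table (the 15-second interval matches the description above):

```yaml
# Sketch of the Debezium heartbeat settings; the table and column names are assumptions.
heartbeat.interval.ms: "15000"   # emit a heartbeat event every 15 seconds
heartbeat.action.query: "INSERT INTO public.debezium_heartbeat (ts) VALUES (now())"
```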
Our connector is deployed in k8s, v2.3.0-lt.
At this moment the k8s node has enough CPU and RAM, and so do ClickHouse and Postgres.
Do you have any idea why replication becomes so slow?
Also, at this moment insertions into the altinity_sink_connector.replica_source_info table in ClickHouse become slow: several seconds may pass between a row being deleted and a new one appearing in it.