Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TiCDC owner gets stuck with unknown status #3331

Closed
amyangfei opened this issue Nov 8, 2021 · 4 comments
Closed

TiCDC owner gets stuck with unknown status #3331

amyangfei opened this issue Nov 8, 2021 · 4 comments
Assignees
Labels

Comments

@amyangfei
Copy link
Contributor

amyangfei commented Nov 8, 2021

What did you do?

The same scenario in #3329

  1. Setup a ticdc cluster, create a changefeed with command cdc cli changefeed create -c test-cf --sink-uri="kafka://172.18.0.2:9092/cdc-test?kafka-version=2.7.0&max-batch-size=1&max-message-bytes=5000"
  2. Execute a SQL in upstream TiDB with a wide row change, whose data size is larger than 5000
  3. TiCDC owner will resume the changefeed periodically.

What did you expect to see?

No response

What did you see instead?

image

Timezone in Grafana is +08:00

From the metrics, we found TiCDC owner is missing during [22:25:30 +00:00, 22:38:30 +00:00] and [23:05:30 +00:00, 23:19:30 +00:00].

[2021/11/04 23:03:55.933 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=35.616862ms]
[2021/11/04 23:04:00.035 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=31.255724ms]
[2021/11/04 23:04:04.135 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=38.156251ms]
[2021/11/04 23:04:08.233 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=35.016215ms]
[2021/11/04 23:04:10.835 +00:00] [ERROR] [feed_state_manager.go:240] ["processor report an error"] [changefeedID=multi-table-config-part-0-kafka] [captureID=5755ab41-8a1d-412b-94c6-7da0501a426c] [error="{\"addr\":\"test-hightouch-ticdc-6.test-hightouch-ticdc-peer.tidb-test-hightouch.svc.us-east-1a.test-hightouch.tidb.musta.ch:8301\",\"code\":\"CDC:ErrJSONCodecRowTooLarge\",\"message\":\"[CDC:ErrJSONCodecRowTooLarge]json codec single row too large\"}"]
[2021/11/04 23:04:17.816 +00:00] [ERROR] [feed_state_manager.go:240] ["processor report an error"] [changefeedID=multi-table-config-part-0-kafka] [captureID=5755ab41-8a1d-412b-94c6-7da0501a426c] [error="{\"addr\":\"test-hightouch-ticdc-6.test-hightouch-ticdc-peer.tidb-test-hightouch.svc.us-east-1a.test-hightouch.tidb.musta.ch:8301\",\"code\":\"CDC:ErrJSONCodecRowTooLarge\",\"message\":\"[CDC:ErrJSONCodecRowTooLarge]json codec single row too large\"}"]
[2021/11/04 23:04:23.970 +00:00] [ERROR] [feed_state_manager.go:240] ["processor report an error"] [changefeedID=multi-table-config-part-0-kafka] [captureID=5755ab41-8a1d-412b-94c6-7da0501a426c] [error="{\"addr\":\"test-hightouch-ticdc-6.test-hightouch-ticdc-peer.tidb-test-hightouch.svc.us-east-1a.test-hightouch.tidb.musta.ch:8301\",\"code\":\"CDC:ErrJSONCodecRowTooLarge\",\"message\":\"[CDC:ErrJSONCodecRowTooLarge]json codec single row too large\"}"]
[2021/11/04 23:04:24.016 +00:00] [INFO] [changefeed.go:272] ["close changefeed"] [changefeed=multi-table-config-part-0-kafka] [info="{\"sink-uri\":\"***\",\"opts\":{\"max-message-bytes\":\"972800\"},\"create-time\":\"2021-11-04T22:31:22.779535576Z\",\"start-ts\":428884644922392816,\"target-ts\":0,\"admin-job-type\":0,\"sort-engine\":\"unified\",\"sort-dir\":\"\",\"config\":{\"case-sensitive\":true,\"enable-old-value\":false,\"force-replicate\":false,\"check-gc-safe-point\":true,\"filter\":{\"rules\":[\".*\],\"ignore-txn-start-ts\":[1]},\"mounter\":{\"worker-num\":16},\"sink\":{\"dispatchers\":[{\"matcher\":[\"*.*\"],\"dispatcher\":\"default\"}],\"protocol\":\"default\"},\"cyclic-replication\":{\"enable\":false,\"replica-id\":1,\"filter-replica-ids\":[2,3],\"id-buckets\":0,\"sync-ddl\":true},\"scheduler\":{\"type\":\"table-number\",\"polling-time\":-1}},\"state\":\"normal\",\"history\":[1636067050867,1636067057882,1636067063981],\"error\":{\"addr\":\"test-hightouch-ticdc-6.test-hightouch-ticdc-peer.tidb-test-hightouch.svc.us-east-1a.test-hightouch.tidb.musta.ch:8301\",\"code\":\"CDC:ErrJSONCodecRowTooLarge\",\"message\":\"[CDC:ErrJSONCodecRowTooLarge]json codec single row too large\"},\"sync-point-enabled\":false,\"sync-point-interval\":600000000000,\"creator-version\":\"v5.2.0-master\"}"]
[2021/11/04 23:04:26.632 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=32.651435ms]
[2021/11/04 23:04:34.733 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=34.659275ms]
[2021/11/04 23:06:50.532 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=33.931342ms]
[2021/11/04 23:07:41.433 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=32.475703ms]
[2021/11/04 23:07:57.732 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=36.234958ms]
[2021/11/04 23:09:12.733 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=36.707648ms]
[2021/11/04 23:10:21.632 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=31.537643ms]
[2021/11/04 23:10:29.471 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=56.766232ms]
[2021/11/04 23:10:31.733 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=33.514194ms]
[2021/11/04 23:10:51.933 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=35.797585ms]
[2021/11/04 23:11:06.133 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=34.833724ms]
[2021/11/04 23:11:16.435 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=34.338258ms]
[2021/11/04 23:12:09.133 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=35.788984ms]
[2021/11/04 23:13:26.136 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=37.057709ms]
[2021/11/04 23:13:30.234 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=31.124091ms]
[2021/11/04 23:14:14.933 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=36.370516ms]
[2021/11/04 23:14:41.432 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=31.461365ms]
[2021/11/04 23:16:12.740 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=44.281307ms]
[2021/11/04 23:16:39.472 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=57.602113ms]
[2021/11/04 23:17:29.749 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=50.275141ms]
[2021/11/04 23:18:20.435 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=32.337227ms]
[2021/11/04 23:18:48.739 +00:00] [WARN] [pd.go:150] ["get timestamp too slow"] ["cost time"=38.372653ms]
[2021/11/04 23:19:32.869 +00:00] [INFO] [gc_manager.go:100] ["update gc safe point success"] [gcSafePointTs=428885065532445069]
[2021/11/04 23:19:32.901 +00:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=multi-table-config-part-5-kafka] [capture=test-hightouch-ticdc-7.test-hightouch-ticdc-peer.tidb-test-hightouch.svc.us-east-1a.test-hightouch.tidb.musta.ch:8301] [count=1] [qps=0] [ddl=0]
[2021/11/04 23:19:32.925 +00:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=single-table-config-BillVersionItemPaymentPricingResponse-kafka] [capture=test-hightouch-ticdc-7.test-hightouch-ticdc-peer.tidb-test-hightouch.svc.us-east-1a.test-hightouch.tidb.musta.ch:8301] [count=0] [qps=0] [ddl=0]
[2021/11/04 23:19:32.950 +00:00] [INFO] [changefeed.go:196] ["initialize changefeed"] [changefeed=multi-table-config-part-0-kafka] [info="{\"sink-uri\":\"***\",\"opts\":{\"max-message-bytes\":\"972800\"},\"create-time\":\"2021-11-04T22:31:22.779535576Z\",\"start-ts\":428884644922392816,\"target-ts\":0,\"admin-job-type\":0,\"sort-engine\":\"unified\",\"sort-dir\":\"\",\"config\":{\"case-sensitive\":true,\"enable-old-value\":false,\"force-replicate\":false,\"check-gc-safe-point\":true,\"filter\":{\"rules\":[\".*\"],\"ignore-txn-start-ts\":[1]},\"mounter\":{\"worker-num\":16},\"sink\":{\"dispatchers\":[{\"matcher\":[\"*.*\"],\"dispatcher\":\"default\"}],\"protocol\":\"default\"},\"cyclic-replication\":{\"enable\":false,\"replica-id\":1,\"filter-replica-ids\":[2,3],\"id-buckets\":0,\"sync-ddl\":true},\"scheduler\":{\"type\":\"table-number\",\"polling-time\":-1}},\"state\":\"normal\",\"history\":[],\"error\":{\"addr\":\"test-hightouch-ticdc-6.test-hightouch-ticdc-peer.tidb-test-hightouch.svc.us-east-1a.test-hightouch.tidb.musta.ch:8301\",\"code\":\"CDC:ErrJSONCodecRowTooLarge\",\"message\":\"[CDC:ErrJSONCodecRowTooLarge]json codec single row too large\"},\"sync-point-enabled\":false,\"sync-point-interval\":600000000000,\"creator-version\":\"v5.2.0-master\"}"] ["checkpoint ts"=428885065532445070]
[2021/11/04 23:19:33.393 +00:00] [INFO] [kafka.go:436] ["Starting kafka sarama producer ..."] [config="{\"PartitionNum\":2,\"ReplicationFactor\":3,\"Version\":\"2.3.1\",\"MaxMessageBytes\":972800,\"Compression\":\"none\",\"ClientID\":\"\",\"Credential\":{\"ca-path\":\"\",\"cert-path\":\"\",\"key-path\":\"\",\"cert-allowed-cn\":null},\"SaslScram\":{\"sasl-user\":\"\",\"sasl-password\":\"\",\"sasl-mechanism\":\"\"},\"TopicPreProcess\":true}"]
[2021/11/04 23:19:33.418 +00:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=multi-table-config-part-2-kafka] [capture=test-hightouch-ticdc-7.test-hightouch-ticdc-peer.tidb-test-hightouch.svc.us-east-1a.test-hightouch.tidb.musta.ch:8301] [count=11906] [qps=9] [ddl=0]
[2021/11/04 23:19:33.439 +00:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=multi-table-config-part-1-kafka] [capture=test-hightouch-ticdc-7.test-hightouch-ticdc-peer.tidb-test-hightouch.svc.us-east-1a.test-hightouch.tidb.musta.ch:8301] [count=8824] [qps=7] [ddl=0]
[2021/11/04 23:19:33.538 +00:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=multi-table-config-part-6-kafka] [capture=test-hightouch-ticdc-7.test-hightouch-ticdc-peer.tidb-test-hightouch.svc.us-east-1a.test-hightouch.tidb.musta.ch:8301] [count=25447] [qps=19] [ddl=0]
[2021/11/04 23:19:33.538 +00:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=multi-table-config-part-3-kafka] [capture=test-hightouch-ticdc-7.test-hightouch-ticdc-peer.tidb-test-hightouch.svc.us-east-1a.test-hightouch.tidb.musta.ch:8301] [count=39407] [qps=32] [ddl=0]
[2021/11/04 23:19:33.539 +00:00] [INFO] [statistics.go:154] ["sink replication status"] [name=MQ] [changefeed=multi-table-config-part-4-kafka] [capture=test-hightouch-ticdc-7.test-hightouch-ticdc-peer.tidb-test-hightouch.svc.us-east-1a.test-hightouch.tidb.musta.ch:8301] [count=749] [qps=0] [ddl=0]
[2021/11/04 23:19:36.231 +00:00] [INFO] [kafka.go:402] ["get partition number of topic"] [topic=ticdc-hightouch-cluster-topic-2] [partition_num=2]
[2021/11/04 23:19:36.248 +00:00] [INFO] [changefeed.go:196] ["initialize changefeed"] [changefeed=multi-table-config-part-8-kafka]

More logs can be found in ONCALL-4004 (internal only).

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

v5.2.1

TiCDC version (execute cdc version):

release-5.2@pingcap/ticdc@f38b575

@amyangfei amyangfei added type/bug The issue is confirmed as a bug. area/ticdc Issues or PRs related to TiCDC. labels Nov 8, 2021
@amyangfei
Copy link
Contributor Author

It is sure that Tick function of owner blocked more than 10 minutes which caused the owner missing. https://github.com/pingcap/ticdc/blob/f38b5757ed86c00e7d510cc849f93f924352b119/cdc/owner/owner.go#L107-L145

Suspicion:

  1. The close of Kafka sink could be blocked.

@overvenus
Copy link
Member

Both https://github.com/pingcap/ticdc/pull/3665 and https://github.com/pingcap/ticdc/pull/3598 mitigate this issue, change to severity/moderate

@nongfushanquan
Copy link
Contributor

/unassign @overvenus
/assign @asddongmen

@ti-chi-bot ti-chi-bot assigned asddongmen and unassigned overvenus Sep 22, 2022
@asddongmen
Copy link
Contributor

fix by #5186

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants