TiCDC stability issue when PD cluster is not stable #4251

truong-hua · 2022-01-09T05:26:58Z

What did you do?

I deploy TiDB on the 3 servers that the stability is not good for testing and verifying how TiCDC behave in case of shaking network issue or other physical server related issues (slow disk, overloaded CPU, lacking of RAM...). To be clearer, our cluster usually have problems and the healthcheckers will detecting it and restart the failed/dead container. All component like PD, TiDB and TiKV behave great in that case but it's not for TiCDC which is a very important component that we expect the latency is always fast.

After our PD cluster is down and recovered (it's usually down few times per day), TiCDC usually get stuck with may be up to few hours (some time more than a day) and then it will back to normal. I checked log and there is a lot of error of loading safe point from PD like below, but I'm definitely sure that the PD has been recovered and working normally (downtime usually in just few minutes):

[2022/01/09 05:13:37.650 +00:00] [ERROR] [kv.go:241] ["fail to load safepoint from pd"] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15\ngithub.com/tikv/client-go/v2/tikv.(*EtcdSafePointKV).Get\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/safepoint.go:130\ngithub.com/tikv/client-go/v2/tikv.loadSafePoint\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/safepoint.go:165\ngithub.com/tikv/client-go/v2/tikv.(*KVStore).runSafePointChecker\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/kv.go:234\nruntime.goexit\n\truntime/asm_amd64.s:1371"]
[2022/01/09 05:15:03.710 +00:00] [ERROR] [kv.go:241] ["fail to load safepoint from pd"] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15\ngithub.com/tikv/client-go/v2/tikv.(*EtcdSafePointKV).Get\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/safepoint.go:130\ngithub.com/tikv/client-go/v2/tikv.loadSafePoint\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/safepoint.go:165\ngithub.com/tikv/client-go/v2/tikv.(*KVStore).runSafePointChecker\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/kv.go:234\nruntime.goexit\n\truntime/asm_amd64.s:1371"]
[2022/01/09 05:15:19.734 +00:00] [ERROR] [kv.go:241] ["fail to load safepoint from pd"] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15\ngithub.com/tikv/client-go/v2/tikv.(*EtcdSafePointKV).Get\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/safepoint.go:130\ngithub.com/tikv/client-go/v2/tikv.loadSafePoint\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/safepoint.go:165\ngithub.com/tikv/client-go/v2/tikv.(*KVStore).runSafePointChecker\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/kv.go:234\nruntime.goexit\n\truntime/asm_amd64.s:1371"]
[2022/01/09 05:15:52.650 +00:00] [INFO] [reactor_state.go:63] ["remote capture offline"] [capture-id=f9ec6645-bb48-45d2-b939-145ec56f1abf]
[2022/01/09 05:16:12.594 +00:00] [INFO] [reactor_state.go:74] ["remote capture online"] [capture-id=14cca246-7ed3-472b-9c89-6b3449d7f1ef] [info="{\"id\":\"14cca246-7ed3-472b-9c89-6b3449d7f1ef\",\"address\":\"10.0.7.92:8300\",\"version\":\"v5.2.1\"}"]
[2022/01/09 05:16:47.525 +00:00] [ERROR] [kv.go:241] ["fail to load safepoint from pd"] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15\ngithub.com/tikv/client-go/v2/tikv.(*EtcdSafePointKV).Get\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/safepoint.go:130\ngithub.com/tikv/client-go/v2/tikv.loadSafePoint\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/safepoint.go:165\ngithub.com/tikv/client-go/v2/tikv.(*KVStore).runSafePointChecker\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/kv.go:234\nruntime.goexit\n\truntime/asm_amd64.s:1371"]

Other command that requires data from PD is still find, like changefeed list.

What did you expect to see?

TiCDC will quickly recover from PD failed state to maintain the good latency.

FYI: I'm using Pulsar as sink, not sure it's related. All I just one is TiCDC should try to recover for any connection/network related issue asap to maintain the best latency or configurable as we are a Change Data Capture.

What did you see instead?

TiCDC is hang with a lot of error "fail to load safepoint from pd" even the PD is recovered (other component like TiDB, TiKV is still working find).

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

Release Version: v5.2.1
Edition: Community
Git Commit Hash: cd8fb24c5f7ebd9d479ed228bb41848bd5e97445
Git Branch: heads/refs/tags/v5.2.1
UTC Build Time: 2021-09-08 02:32:56
GoVersion: go1.16.4
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false

TiCDC version (execute cdc version):

Release Version: v5.2.1
Git Commit Hash: 81c22b1c1b2041e2806160d8c7e1105a70815ff5
Git Branch: heads/refs/tags/v5.2.1
UTC Build Time: 2021-09-09 12:00:16
Go Version: go version go1.16.4 linux/amd64
Failpoint Build: false

The text was updated successfully, but these errors were encountered:

asddongmen · 2022-01-12T10:10:20Z

It may cause by this issue: #3615, and it will be fix in the newer version.

asddongmen · 2022-07-22T00:20:41Z

I close it since it nerver happens again in TiCDC v6.1.0, please fell free to reopen it, if you see it again.

truong-hua added area/ticdc Issues or PRs related to TiCDC. type/bug The issue is confirmed as a bug. labels Jan 9, 2022

asddongmen added component/kv-client TiKV kv log client component. component/replica-model Replication model component. and removed component/kv-client TiKV kv log client component. labels Jan 11, 2022

Tammyxia added the severity/moderate label Jan 12, 2022

maxshuang mentioned this issue Jan 13, 2022

Tracking Issues and Feature for Correctness and Stability #3844

Open

25 tasks

nongfushanquan assigned asddongmen Apr 26, 2022

asddongmen closed this as completed Jul 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TiCDC stability issue when PD cluster is not stable #4251

TiCDC stability issue when PD cluster is not stable #4251

truong-hua commented Jan 9, 2022 •

edited

Loading

asddongmen commented Jan 12, 2022

asddongmen commented Jul 22, 2022 •

edited

Loading

TiCDC stability issue when PD cluster is not stable #4251

TiCDC stability issue when PD cluster is not stable #4251

Comments

truong-hua commented Jan 9, 2022 • edited Loading

What did you do?

What did you expect to see?

What did you see instead?

Versions of the cluster

asddongmen commented Jan 12, 2022

asddongmen commented Jul 22, 2022 • edited Loading

truong-hua commented Jan 9, 2022 •

edited

Loading

asddongmen commented Jul 22, 2022 •

edited

Loading