
TiCDC stability issue when PD cluster is not stable #4251

Closed
Tracked by #3844
truong-hua opened this issue Jan 9, 2022 · 2 comments
Assignees
Labels
area/ticdc (Issues or PRs related to TiCDC), component/replica-model (Replication model component), severity/moderate, type/bug (The issue is confirmed as a bug)

Comments

@truong-hua

truong-hua commented Jan 9, 2022

What did you do?

I deployed TiDB on three servers with poor stability in order to test and verify how TiCDC behaves during flaky network conditions and other physical server issues (slow disks, overloaded CPUs, lack of RAM, etc.). To be clearer, our cluster regularly has problems; health checkers detect them and restart the failed/dead container. All components such as PD, TiDB, and TiKV handle this well, but TiCDC does not, even though it is a very important component for which we expect latency to stay low.

After our PD cluster goes down and recovers (it usually goes down a few times per day), TiCDC often gets stuck for up to a few hours (sometimes more than a day) before returning to normal. I checked the logs and found many errors about loading the safepoint from PD, like the ones below, even though I am certain PD had already recovered and was working normally (the downtime is usually only a few minutes):

[2022/01/09 05:13:37.650 +00:00] [ERROR] [kv.go:241] ["fail to load safepoint from pd"] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15\ngithub.com/tikv/client-go/v2/tikv.(*EtcdSafePointKV).Get\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/safepoint.go:130\ngithub.com/tikv/client-go/v2/tikv.loadSafePoint\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/safepoint.go:165\ngithub.com/tikv/client-go/v2/tikv.(*KVStore).runSafePointChecker\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/kv.go:234\nruntime.goexit\n\truntime/asm_amd64.s:1371"]
[2022/01/09 05:15:03.710 +00:00] [ERROR] [kv.go:241] ["fail to load safepoint from pd"] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15\ngithub.com/tikv/client-go/v2/tikv.(*EtcdSafePointKV).Get\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/safepoint.go:130\ngithub.com/tikv/client-go/v2/tikv.loadSafePoint\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/safepoint.go:165\ngithub.com/tikv/client-go/v2/tikv.(*KVStore).runSafePointChecker\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/kv.go:234\nruntime.goexit\n\truntime/asm_amd64.s:1371"]
[2022/01/09 05:15:19.734 +00:00] [ERROR] [kv.go:241] ["fail to load safepoint from pd"] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15\ngithub.com/tikv/client-go/v2/tikv.(*EtcdSafePointKV).Get\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/safepoint.go:130\ngithub.com/tikv/client-go/v2/tikv.loadSafePoint\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/safepoint.go:165\ngithub.com/tikv/client-go/v2/tikv.(*KVStore).runSafePointChecker\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/kv.go:234\nruntime.goexit\n\truntime/asm_amd64.s:1371"]
[2022/01/09 05:15:52.650 +00:00] [INFO] [reactor_state.go:63] ["remote capture offline"] [capture-id=f9ec6645-bb48-45d2-b939-145ec56f1abf]
[2022/01/09 05:16:12.594 +00:00] [INFO] [reactor_state.go:74] ["remote capture online"] [capture-id=14cca246-7ed3-472b-9c89-6b3449d7f1ef] [info="{\"id\":\"14cca246-7ed3-472b-9c89-6b3449d7f1ef\",\"address\":\"10.0.7.92:8300\",\"version\":\"v5.2.1\"}"]
[2022/01/09 05:16:47.525 +00:00] [ERROR] [kv.go:241] ["fail to load safepoint from pd"] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15\ngithub.com/tikv/client-go/v2/tikv.(*EtcdSafePointKV).Get\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/safepoint.go:130\ngithub.com/tikv/client-go/v2/tikv.loadSafePoint\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/safepoint.go:165\ngithub.com/tikv/client-go/v2/tikv.(*KVStore).runSafePointChecker\n\tgithub.com/tikv/client-go/v2@v2.0.0-20210617115813-8d4847a86878/tikv/kv.go:234\nruntime.goexit\n\truntime/asm_amd64.s:1371"]
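
For context on the errors above: the stack trace points at the TiKV client's periodic safepoint check, which reads a key from PD's embedded etcd with a per-attempt timeout, so a slow or unreachable PD keeps surfacing as "context deadline exceeded". Below is a minimal Go sketch of that polling pattern, assuming an etcd clientv3 client; the endpoint, key name, and interval are illustrative and this is not TiCDC's actual code.

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Assumption: PD's client URL; replace with your own endpoints.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://10.0.7.92:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Poll the safepoint key the way the stack trace suggests:
	// runSafePointChecker -> loadSafePoint -> EtcdSafePointKV.Get.
	// Key name and interval here are illustrative.
	const safePointKey = "/tidb/store/gcworker/saved_safe_point"
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		// Each attempt has its own deadline, so a slow or unreachable PD
		// shows up as "context deadline exceeded", as in the logs above.
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		resp, err := cli.Get(ctx, safePointKey)
		cancel()
		if err != nil {
			log.Printf("fail to load safepoint from pd: %v", err)
			continue
		}
		if len(resp.Kvs) > 0 {
			log.Printf("loaded safepoint: %s", resp.Kvs[0].Value)
		}
	}
}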

Other commands that require data from PD still work fine, such as cdc cli changefeed list.

What did you expect to see?

TiCDC should recover quickly once PD comes back, so that replication latency stays low.

FYI: I'm using Pulsar as the sink; I'm not sure whether that is related. All I want is for TiCDC to recover from any connection/network-related issue as soon as possible (or at least make this configurable), to keep latency as low as possible, since this is Change Data Capture.

What did you see instead?

TiCDC hangs with many "fail to load safepoint from pd" errors even after PD has recovered (other components like TiDB and TiKV keep working fine).

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

Release Version: v5.2.1
Edition: Community
Git Commit Hash: cd8fb24c5f7ebd9d479ed228bb41848bd5e97445
Git Branch: heads/refs/tags/v5.2.1
UTC Build Time: 2021-09-08 02:32:56
GoVersion: go1.16.4
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false

TiCDC version (execute cdc version):

Release Version: v5.2.1
Git Commit Hash: 81c22b1c1b2041e2806160d8c7e1105a70815ff5
Git Branch: heads/refs/tags/v5.2.1
UTC Build Time: 2021-09-09 12:00:16
Go Version: go version go1.16.4 linux/amd64
Failpoint Build: false
truong-hua added the area/ticdc and type/bug labels on Jan 9, 2022
asddongmen added the component/kv-client and component/replica-model labels and removed component/kv-client on Jan 11, 2022
@asddongmen
Contributor

It may be caused by this issue: #3615, and it will be fixed in a newer version.

@asddongmen
Contributor

asddongmen commented Jul 22, 2022

I'm closing this since it never happens again in TiCDC v6.1.0. Please feel free to reopen it if you see it again.
