TiCDC stability issue when PD cluster is not stable #4251
Labels: area/ticdc, component/replica-model, severity/moderate, type/bug
What did you do?
I deployed TiDB on 3 servers whose stability is poor on purpose, to test and verify how TiCDC behaves under flaky network conditions and other physical server issues (slow disk, overloaded CPU, lack of RAM, ...). To be clearer: our cluster regularly runs into problems, and our health checkers detect them and restart the failed/dead containers. All the other components (PD, TiDB, TiKV) handle this well, but TiCDC does not, and it is a very important component for which we expect latency to always stay low.
After our PD cluster goes down and recovers (it typically goes down a few times per day), TiCDC usually gets stuck, sometimes for a few hours (occasionally more than a day), before going back to normal. I checked the logs and they contain many errors about loading the safe point from PD, like the ones below, but I'm certain that PD has recovered and is working normally (the downtime usually lasts only a few minutes):
Other commands that require data from PD still work fine, e.g. `changefeed list` (see the example below).
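For reference, this is the kind of query that keeps working while the safepoint errors occur; the PD address here is a placeholder, not our real endpoint:

```shell
# Listing changefeeds through the TiCDC CLI still succeeds,
# which shows PD is reachable and serving requests again.
cdc cli changefeed list --pd=http://127.0.0.1:2379
```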
What did you expect to see?
TiCDC should quickly recover from the PD-failure state to maintain good latency.
FYI: I'm using Pulsar as the sink; I'm not sure whether that is related. All I want is for TiCDC to retry and recover from any connection/network-related issue as soon as possible (or for this behavior to be configurable), so that it maintains the best possible latency, since that is the whole point of Change Data Capture.
What did you see instead?
TiCDC hangs with a lot of "fail to load safepoint from pd" errors even after PD has recovered (other components like TiDB and TiKV keep working fine).
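For reference, this is roughly how I confirm that PD itself is healthy again after such an incident; the address is a placeholder for one of our PD endpoints:

```shell
# PD exposes a health endpoint on its client port; all members
# report healthy shortly after the restart. The same information
# is also available via pd-ctl's `health` command.
curl http://127.0.0.1:2379/pd/api/v1/health
```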
Versions of the cluster
Upstream TiDB cluster version (execute `SELECT tidb_version();` in a MySQL client):

TiCDC version (execute `cdc version`):