puller(ticdc): fix wrong update splitting behavior after table scheduling (#11296) #11303

Conversation

ti-chi-bot
Member

This is an automated cherry-pick of #11296

What problem does this PR solve?

Issue Number: close #11219

What is changed and how it works?

  1. There are two cdc nodes, A and B, and B started before A, so thresholdTSB < thresholdTSA;
  2. The sync task of table t is first scheduled to node A;
  3. Table t has an update event whose commitTS is smaller than thresholdTSA and larger than thresholdTSB, so the update event is split into a delete event and an insert event on node A;
  4. The delete event and the insert event cannot be sent to the downstream atomically. If the table sync task is scheduled to node B after the delete event has been sent downstream but before the insert event is sent, the update event is received by node B again;
  5. The update event is not split by node B because its commitTS is larger than thresholdTSB, so node B just sends a plain update SQL to the downstream, which causes data inconsistency (see the sketch after this list);
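
A minimal sketch (in Go, not the actual TiCDC code) of the per-node split decision that produces the inconsistency above; `updateEvent`, `shouldSplit`, and the concrete thresholdTS values are illustrative assumptions:

```go
package main

import "fmt"

type updateEvent struct {
	commitTS uint64
}

// shouldSplit mirrors the per-node decision: an update whose commitTS is
// below the node's thresholdTS is split into delete + insert so it can be
// replayed idempotently.
func shouldSplit(e updateEvent, thresholdTS uint64) bool {
	return e.commitTS < thresholdTS
}

func main() {
	const (
		thresholdTSB = 100 // node B started earlier, so its threshold is smaller
		thresholdTSA = 200
	)
	e := updateEvent{commitTS: 150} // thresholdTSB < commitTS < thresholdTSA

	fmt.Println("node A splits:", shouldSplit(e, thresholdTSA)) // true  -> delete + insert
	fmt.Println("node B splits:", shouldSplit(e, thresholdTSB)) // false -> plain UPDATE, inconsistent after scheduling
}
```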

Another thing to notice is that, after scheduling, node B will send some events to the downstream that were already sent by node A, so node B must send these events in an idempotent way.
Previously, this was handled by fetching a replicateTS in the sink module when the sink started and splitting the events whose commitTS was smaller than replicateTS. But that mechanism was removed in #11030, so we need to handle this case in the puller as well.

In this PR, instead of maintaining a separate thresholdTS in sourcemanager, we get the replicateTS from the sink when the puller needs to decide whether to split an update event.
Since the puller module starts working before the sink module, replicateTS defaults to MaxUint64, which means all update events are split. After the sink starts working, replicateTS is set to the correct value. A sketch of this decision follows.
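
A hedged sketch of that check, assuming hypothetical names (`sinkManager`, `GetReplicateTS`); it only illustrates the default-to-MaxUint64 behavior described above, not the actual TiCDC interfaces:

```go
package puller

import "math"

// sinkManager is an assumed interface for asking the sink for its replicateTS.
type sinkManager interface {
	// GetReplicateTS returns the sink's replicateTS once the sink has started,
	// and false before that.
	GetReplicateTS() (uint64, bool)
}

// shouldSplitUpdate decides whether an update event must be split into
// delete + insert before being emitted.
func shouldSplitUpdate(commitTS uint64, sink sinkManager) bool {
	// Before the sink starts, fall back to MaxUint64 so every update event is
	// split; replaying delete + insert downstream is idempotent, so this is safe.
	replicateTS := uint64(math.MaxUint64)
	if ts, ok := sink.GetReplicateTS(); ok {
		replicateTS = ts
	}
	return commitTS < replicateTS
}
```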

The last thing to notice: when the sink restarts because of some error, it may send some events downstream that were already sent before the restart. These events also need to be sent in an idempotent way, but they are already in the sorter, so simply restarting the sink cannot guarantee this. Therefore this PR forbids restarting the sink and restarts the whole changefeed when an error occurs, as sketched below.
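
A rough sketch of that error-handling choice, with hypothetical names (`runSinkUntilError`, `notifyChangefeedFailed`); it only illustrates escalating the error instead of restarting the sink in place:

```go
package sink

import "log"

// errorReporter is an assumed hook for asking the owner to restart the
// whole changefeed.
type errorReporter interface {
	notifyChangefeedFailed(err error)
}

func runSinkUntilError(runSink func() error, reporter errorReporter) {
	if err := runSink(); err != nil {
		// Do NOT restart the sink here: events already buffered in the sorter
		// would be replayed without being split again, breaking idempotence.
		log.Printf("sink failed, escalating to changefeed restart: %v", err)
		reporter.notifyChangefeedFailed(err)
	}
}
```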

Check List

Tests

  • Manual test (add detailed scripts or steps below)
  1. deploy a cluster with three cdc nodes;
  2. occasionally kill nodes while running the workload and check whether the data stays consistent;

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

None

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot ti-chi-bot added lgtm release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. type/cherry-pick-for-release-6.5 This PR is cherry-picked to release-6.5 from a source PR. labels Jun 13, 2024
Contributor

ti-chi-bot bot commented Jun 13, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sdojjy for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot added the cherry-pick-approved Cherry pick PR approved by release team. label Jun 13, 2024
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed do-not-merge/cherry-pick-not-approved size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 13, 2024
Contributor

ti-chi-bot bot commented Jun 13, 2024

@ti-chi-bot: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cdc-integration-kafka-test | 631a748 | link | true | /test cdc-integration-kafka-test |
| pull-cdc-integration-mysql-test | 631a748 | link | true | /test cdc-integration-mysql-test |
| jenkins-ticdc/verify | 631a748 | link | true | /test verify |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@lidezhu
Collaborator

lidezhu commented Jun 13, 2024

Already fixed in #11282

@lidezhu lidezhu closed this Jun 13, 2024
@lidezhu lidezhu deleted the cherry-pick-11296-to-release-6.5 branch June 13, 2024 07:30