puller(ticdc): fix wrong update splitting behavior after table scheduling (#11296) #11303

Conversation

ti-chi-bot
Member

This is an automated cherry-pick of #11296

What problem does this PR solve?

Issue Number: close #11219

What is changed and how it works?

  1. There are two cdc nodes, A and B, and B started before A, so thresholdTSB < thresholdTSA;
  2. The sync task of table t is first scheduled to node A;
  3. Table t has an update event whose commitTS is smaller than thresholdTSA and larger than thresholdTSB, so the update event is split into a delete event and an insert event on node A;
  4. The delete event and the insert event cannot be sent to the downstream atomically. If the table sync task is scheduled to node B after the delete event has been sent downstream but before the insert event is sent, the update event is received by node B again;
  5. The update event is not split by node B because its commitTS is larger than thresholdTSB, so node B just sends a plain update SQL to the downstream, which causes data inconsistency (see the sketch after this list);
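
A minimal sketch (in Go, not the actual TiCDC code) of the per-node split decision that produces the inconsistency above; `updateEvent`, `shouldSplit`, and the concrete thresholdTS values are illustrative assumptions:

```go
package main

import "fmt"

type updateEvent struct {
	commitTS uint64
}

// shouldSplit mirrors the per-node decision: an update whose commitTS is
// below the node's thresholdTS is split into delete + insert so it can be
// replayed idempotently.
func shouldSplit(e updateEvent, thresholdTS uint64) bool {
	return e.commitTS < thresholdTS
}

func main() {
	const (
		thresholdTSB = 100 // node B started earlier, so its threshold is smaller
		thresholdTSA = 200
	)
	e := updateEvent{commitTS: 150} // thresholdTSB < commitTS < thresholdTSA

	fmt.Println("node A splits:", shouldSplit(e, thresholdTSA)) // true  -> delete + insert
	fmt.Println("node B splits:", shouldSplit(e, thresholdTSB)) // false -> plain UPDATE, inconsistent after scheduling
}
```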

Another thing to notice is that, after scheduling, node B will send some events to the downstream that were already sent by node A, so node B must send these events in an idempotent way.
Previously, this was handled by fetching a replicateTS in the sink module when the sink started and splitting the events whose commitTS was smaller than replicateTS. But that mechanism was removed in #11030, so we need to handle this case in the puller as well.

In this PR, instead of maintaining a separate thresholdTS in sourcemanager, we get the replicateTS from the sink when the puller needs to decide whether to split an update event.
Since the puller module starts working before the sink module, replicateTS defaults to MaxUint64, which means all update events are split. After the sink starts working, replicateTS is set to the correct value. A sketch of this decision follows.
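
A hedged sketch of that check, assuming hypothetical names (`sinkManager`, `GetReplicateTS`); it only illustrates the default-to-MaxUint64 behavior described above, not the actual TiCDC interfaces:

```go
package puller

import "math"

// sinkManager is an assumed interface for asking the sink for its replicateTS.
type sinkManager interface {
	// GetReplicateTS returns the sink's replicateTS once the sink has started,
	// and false before that.
	GetReplicateTS() (uint64, bool)
}

// shouldSplitUpdate decides whether an update event must be split into
// delete + insert before being emitted.
func shouldSplitUpdate(commitTS uint64, sink sinkManager) bool {
	// Before the sink starts, fall back to MaxUint64 so every update event is
	// split; replaying delete + insert downstream is idempotent, so this is safe.
	replicateTS := uint64(math.MaxUint64)
	if ts, ok := sink.GetReplicateTS(); ok {
		replicateTS = ts
	}
	return commitTS < replicateTS
}
```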

The last thing to notice: when the sink restarts because of some error, it may send some events downstream that were already sent before the restart. These events also need to be sent in an idempotent way, but they are already in the sorter, so simply restarting the sink cannot guarantee this. Therefore this PR forbids restarting the sink and restarts the whole changefeed when an error occurs, as sketched below.
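
A rough sketch of that error-handling choice, with hypothetical names (`runSinkUntilError`, `notifyChangefeedFailed`); it only illustrates escalating the error instead of restarting the sink in place:

```go
package sink

import "log"

// errorReporter is an assumed hook for asking the owner to restart the
// whole changefeed.
type errorReporter interface {
	notifyChangefeedFailed(err error)
}

func runSinkUntilError(runSink func() error, reporter errorReporter) {
	if err := runSink(); err != nil {
		// Do NOT restart the sink here: events already buffered in the sorter
		// would be replayed without being split again, breaking idempotence.
		log.Printf("sink failed, escalating to changefeed restart: %v", err)
		reporter.notifyChangefeedFailed(err)
	}
}
```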

Check List

Tests

  • Manual test (add detailed scripts or steps below)
  1. deploy a cluster with three cdc nodes;
  2. occasionally kill nodes while running the workload and check whether the data stays consistent;

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

None

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot ti-chi-bot added lgtm release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. type/cherry-pick-for-release-6.5 This PR is cherry-picked to release-6.5 from a source PR. labels Jun 13, 2024
Contributor

ti-chi-bot bot commented Jun 13, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sdojjy for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot added the cherry-pick-approved Cherry pick PR approved by release team. label Jun 13, 2024
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed do-not-merge/cherry-pick-not-approved size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 13, 2024
Contributor

ti-chi-bot bot commented Jun 13, 2024

@ti-chi-bot: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cdc-integration-kafka-test | 631a748 | link | true | /test cdc-integration-kafka-test |
| pull-cdc-integration-mysql-test | 631a748 | link | true | /test cdc-integration-mysql-test |
| jenkins-ticdc/verify | 631a748 | link | true | /test verify |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@lidezhu
Collaborator

lidezhu commented Jun 13, 2024

Already fixed in #11282

@lidezhu lidezhu closed this Jun 13, 2024
@lidezhu lidezhu deleted the cherry-pick-11296-to-release-6.5 branch June 13, 2024 07:30