ddl: eliminate ingest step for add index with local engine #47982

Merged
merged 8 commits into from
Oct 30, 2023

Conversation

tangenta
Contributor

@tangenta tangenta commented Oct 25, 2023

What problem does this PR solve?

Issue Number: close #47981

Problem Summary:

Previously, when tidb_enable_dist_task was enabled, the ingest step (step 3) was separate from the read-index step (step 1). As a result, if a TiDB instance crashed during the ingest step, the index data on its local disk was lost. Because the disttask framework does not support moving a step backward (for example, from step 3 to step 1), the lost index data was never re-scanned, and data inconsistency eventually occurred.

What is changed and how it works?

Merge the ingest step into the read-index step by calling Flush every time a subtask finishes.

Thus, a subtask is not marked as succeeded if ingest fails. replaceDeadNodesIfAny re-distributes such running subtasks to another TiDB instance once the lease expires.
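
A minimal sketch of the idea, not the actual TiDB code: the types and functions below (localEngine, runSubtask, runWithFailover) are hypothetical names invented for illustration. It only shows why flushing inside the subtask matters: a subtask that fails during ingest never reports success, so the same row range can be re-run on another node instead of being silently lost.

```go
package main

import (
	"errors"
	"fmt"
)

// localEngine is a hypothetical stand-in for a node's local ingest engine.
type localEngine struct {
	buffered []string        // index entries on local disk, not yet ingested
	store    map[string]bool // shared storage holding durably ingested entries
	crash    bool            // when true, Flush fails and local data is lost
}

// Flush ingests all buffered entries into the shared store.
func (e *localEngine) Flush() error {
	if e.crash {
		e.buffered = nil // a crashed node loses its local index data
		return errors.New("node crashed during ingest")
	}
	for _, kv := range e.buffered {
		e.store[kv] = true
	}
	e.buffered = nil
	return nil
}

// runSubtask scans rows, buffers index entries, and flushes before returning.
// The subtask counts as succeeded only when this returns nil.
func runSubtask(e *localEngine, rows []string) error {
	for _, r := range rows {
		e.buffered = append(e.buffered, "idx/"+r)
	}
	return e.Flush()
}

// runWithFailover tries each node in turn until the subtask succeeds.
// This is a simplified picture of re-dispatching a running subtask;
// it does not model leases or the real scheduler.
func runWithFailover(nodes []*localEngine, rows []string) (int, error) {
	for i, n := range nodes {
		if err := runSubtask(n, rows); err == nil {
			return i, nil
		}
	}
	return -1, errors.New("all nodes failed")
}

func main() {
	store := map[string]bool{}
	nodes := []*localEngine{
		{store: store, crash: true}, // first node dies while ingesting
		{store: store},
	}
	node, err := runWithFailover(nodes, []string{"r1", "r2", "r3"})
	fmt.Println("succeeded on node", node, "err:", err, "ingested:", len(store))
}
```

With a separate ingest step, the first node's read-index subtask would already have been marked as succeeded before the crash, so nothing would trigger a re-scan of its rows.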

Check List

Tests

  • Unit test
  • Integration test

Before:

Running Suite: ddl Suite
========================
Random Seed: 1698295051
Will run 1 of 4 specs

[2023/10/26 12:37:37.494 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:40.736 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:43.988 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:47.244 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:50.520 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:53.767 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:57.027 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:38:00.713 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:38:04.186 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:38:07.424 +08:00] [INFO] [disttask_test.go:103] ["log found"] [log="[\"[2023/10/26 12:38:03.578 +08:00] [Info] [backend.go:364] [\\\"import start\\\"] [engineTag=<import-and-reset>] [engineUUID=818fb05b-4542-5beb-907a-21cb49c99be8] [retryCnt=0]\\n\"]"]
[2023/10/26 12:38:07.424 +08:00] [INFO] [disttask_test.go:124] ["inject fault"] [chaosParams="{\"name\":\"\",\"faultType\":\"kill\",\"selector\":\"tidb(ddl-owner)\",\"selectorPolicy\":\"\",\"faultDuration\":60000000000,\"Spec\":null,\"SelectorPeersList\":null,\"Pitr\":null,\"TiCDC\":null,\"checkConfig\":{\"balanceCheck\":null,\"raftLogLagCheck\":null,\"raftLogGcCheck\":null},\"repeatExecTimes\":0}"]
[2023/10/26 12:38:07.665 +08:00] [INFO] [db.go:103] ["ADMIN SHOW DDL"]
[2023/10/26 12:38:07.724 +08:00] [INFO] [opts.go:34] ["Chaos opts: {map[type:kill] [map[selectorPeers:[tc-tidb-0]]] 1m0s parallelly  0s}"]
[2023/10/26 12:38:07.724 +08:00] [INFO] [run.go:81] ["tcType: *k8s.TiDBCluster"]
[2023/10/26 12:38:07.724 +08:00] [INFO] [chaos.go:297] ["init chaos"] [selector:="{\"selectorPeers\":[\"tc-tidb-0\"]}"] ["fault type:"=kill]
[2023/10/26 12:38:07.975 +08:00] [INFO] [chaos.go:203] ["fault will last for"] [duration=1m0s]
[2023/10/26 12:38:07.975 +08:00] [INFO] [chaos.go:64] ["Run chaos"] [name=kill] [selectors="[testbed-tangenta-test-g27gc/tc-tidb-0]"] [selectorsRetainPolicy(selectors)="[testbed-tangenta-test-g27gc/tc-tidb-0]"] [targetSelectors="[nil]"] [targetSelectorsRetainPolicy(targetSelectors)="[nil]"] [experimentSpec="ContainerKillExperimentSpec{Scheduler: <nil>}"]
[mysql] 2023/10/26 12:38:08 packets.go:37: unexpected EOF
[2023/10/26 12:39:08.042 +08:00] [INFO] [chaos.go:216] ["chaosDo finish since fault duration reaches"]
[2023/10/26 12:39:08.042 +08:00] [INFO] [chaos.go:88] ["Clean chaos"] [name=kill] [chaosId="ns=testbed-tangenta-test-g27gc,kind=container-kill,name=container-kill-dxiyyvan,spec=&k8s.ChaosIdentifier{Namespace:\"testbed-tangenta-test-g27gc\", Name:\"container-kill-dxiyyvan\", Spec:ContainerKillExperimentSpec{Scheduler: <nil>}}"]
STEP: Start One Test
STEP: End One Test
• Failure [113.580 seconds]
disttask-add-index
/home/tangenta/endless/pkg/util/dsl.go:29
  run add index test
  /home/tangenta/endless/testcase/ddl/disttask_test.go:57
    fail on ingest #fail_on_ingest# [It]
    /home/tangenta/endless/pkg/util/dsl.go:61

    Expected
        <*mysql.MySQLError | 0xc000a8db00>: {
            Number: 8223,
            SQLState: [72, 89, 48, 48, 48],
            Message: "data inconsistency in table: sbtest1, index: idx, handle: 428304, index-values:\"\" != record-values:\"handle: 428304, values: [KindString 92149430868-57178916270-87020426646-90156921857-46807764443-77432155857-65114616205-78384108897-94777493229-87970275195]\"",
        }
    to be nil

After:

Running Suite: ddl Suite
========================
Random Seed: 1698295332
Will run 1 of 4 specs

[2023/10/26 12:42:18.608 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:21.859 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:25.121 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:28.386 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:31.774 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:35.029 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:38.284 +08:00] [INFO] [disttask_test.go:103] ["log found"] [log="[\"[2023/10/26 12:42:36.659 +08:00] [Info] [backend.go:364] [\\\"import start\\\"] [engineTag=<import-and-reset>] [engineUUID=462b4eef-7a5c-5d2f-b4d3-35fd1b503f75] [retryCnt=0]\\n\"]"]
[2023/10/26 12:42:38.284 +08:00] [INFO] [disttask_test.go:124] ["inject fault"] [chaosParams="{\"name\":\"\",\"faultType\":\"kill\",\"selector\":\"tidb(ddl-owner)\",\"selectorPolicy\":\"\",\"faultDuration\":60000000000,\"Spec\":null,\"SelectorPeersList\":null,\"Pitr\":null,\"TiCDC\":null,\"checkConfig\":{\"balanceCheck\":null,\"raftLogLagCheck\":null,\"raftLogGcCheck\":null},\"repeatExecTimes\":0}"]
[2023/10/26 12:42:38.532 +08:00] [INFO] [db.go:103] ["ADMIN SHOW DDL"]
[2023/10/26 12:42:38.588 +08:00] [INFO] [opts.go:34] ["Chaos opts: {map[type:kill] [map[selectorPeers:[tc-tidb-0]]] 1m0s parallelly  0s}"]
[2023/10/26 12:42:38.588 +08:00] [INFO] [run.go:81] ["tcType: *k8s.TiDBCluster"]
[2023/10/26 12:42:38.588 +08:00] [INFO] [chaos.go:297] ["init chaos"] [selector:="{\"selectorPeers\":[\"tc-tidb-0\"]}"] ["fault type:"=kill]
[2023/10/26 12:42:38.864 +08:00] [INFO] [chaos.go:203] ["fault will last for"] [duration=1m0s]
[2023/10/26 12:42:38.864 +08:00] [INFO] [chaos.go:64] ["Run chaos"] [name=kill] [selectors="[testbed-tangenta-test-g27gc/tc-tidb-0]"] [selectorsRetainPolicy(selectors)="[testbed-tangenta-test-g27gc/tc-tidb-0]"] [targetSelectors="[nil]"] [targetSelectorsRetainPolicy(targetSelectors)="[nil]"] [experimentSpec="ContainerKillExperimentSpec{Scheduler: <nil>}"]
[mysql] 2023/10/26 12:42:39 packets.go:37: unexpected EOF
[2023/10/26 12:43:38.937 +08:00] [INFO] [chaos.go:216] ["chaosDo finish since fault duration reaches"]
[2023/10/26 12:43:38.937 +08:00] [INFO] [chaos.go:88] ["Clean chaos"] [name=kill] [chaosId="ns=testbed-tangenta-test-g27gc,kind=container-kill,name=container-kill-mwvzyuvu,spec=&k8s.ChaosIdentifier{Namespace:\"testbed-tangenta-test-g27gc\", Name:\"container-kill-mwvzyuvu\", Spec:ContainerKillExperimentSpec{Scheduler: <nil>}}"]
• [SLOW TEST:95.758 seconds]
disttask-add-index
/home/tangenta/endless/pkg/util/dsl.go:29
  run add index test
  /home/tangenta/endless/testcase/ddl/disttask_test.go:57
    fail on ingest #fail_on_ingest#
    /home/tangenta/endless/pkg/util/dsl.go:61
------------------------------
SSS
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-triage-completed release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 25, 2023
@tiprow

tiprow bot commented Oct 25, 2023

Hi @tangenta. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ti-chi-bot ti-chi-bot bot added needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. and removed do-not-merge/needs-triage-completed labels Oct 25, 2023
@codecov

codecov bot commented Oct 25, 2023

Codecov Report

Merging #47982 (f6fe674) into master (f9f6bb3) will increase coverage by 1.2063%.
The diff coverage is 71.4285%.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #47982        +/-   ##
================================================
+ Coverage   71.5801%   72.7864%   +1.2062%     
================================================
  Files          1401       1420        +19     
  Lines        405938     411589      +5651     
================================================
+ Hits         290571     299581      +9010     
+ Misses        95569      93144      -2425     
+ Partials      19798      18864       -934     
Flag Coverage Δ
integration 42.2823% <0.0000%> (?)
unit 71.5981% <71.4285%> (+0.0180%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 54.0503% <ø> (ø)
parser ∅ <ø> (∅)
br 48.6466% <ø> (-4.2527%) ⬇️

Contributor

@ywqzzy ywqzzy left a comment


Rest LGTM.
Maybe refine backfilling_dispatcher_test a bit.

pkg/ddl/backfilling_dispatcher.go (resolved)
@tangenta
Contributor Author

/retest

@tiprow

tiprow bot commented Oct 26, 2023

@tangenta: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest


@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Oct 26, 2023
@zimulala
Contributor

zimulala commented Oct 26, 2023

Merge the ingest step to read-index step. Thus, the subtask will not be marked as succeed if ingest failed.

  1. Does this fix apply only when "EnableDistTask" is true? The issue does not seem to explain this.
  2. Why can we delete the "ingest step" without causing an error?

@tangenta
Contributor Author

@zimulala I've updated the PR description. PTAL

@@ -20,7 +20,7 @@ import "github.com/pingcap/tidb/pkg/disttask/framework/proto"
// the initial step is StepInit(-1)
// steps are processed in the following order:
// - local sort:
-// StepInit -> StepReadIndex -> StepWriteAndIngest -> StepDone
+// StepInit -> StepReadIndex -> StepDone
Contributor


Suggested change
-// StepInit -> StepReadIndex -> StepDone
+// StepInit -> StepReadIndexAndIngest -> StepDone

Contributor Author


The name is not suitable for global sort:

// - global sort:
// StepInit -> StepReadIndexAndIngest -> StepMergeSort -> StepWriteAndIngest -> StepDone
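
To make the two step orders concrete, here is a rough sketch of a step-transition function. It keeps the name StepReadIndex used in the diff. StepInit(-1) and the step numbers 1 (read-index) and 3 (ingest) come from this PR's text; the values for StepMergeSort and StepDone, and the helpers nextStep and path, are assumptions made only for illustration and may not match the real constants or API.

```go
package main

import "fmt"

// Step is an illustrative step identifier.
type Step int

const (
	StepInit           Step = -1 // stated in the source comment
	StepReadIndex      Step = 1  // "step 1" in the PR description
	StepMergeSort      Step = 2  // assumed value
	StepWriteAndIngest Step = 3  // "step 3" in the PR description
	StepDone           Step = -2 // assumed value
)

// nextStep returns the step that follows cur.
// For local sort, ingest now happens inside the read-index step,
// so the task goes straight from StepReadIndex to StepDone.
func nextStep(cur Step, globalSort bool) Step {
	switch cur {
	case StepInit:
		return StepReadIndex
	case StepReadIndex:
		if globalSort {
			return StepMergeSort
		}
		return StepDone
	case StepMergeSort:
		return StepWriteAndIngest
	default:
		return StepDone
	}
}

// path lists every step visited from StepInit to StepDone.
func path(globalSort bool) []Step {
	steps := []Step{StepInit}
	for cur := StepInit; cur != StepDone; {
		cur = nextStep(cur, globalSort)
		steps = append(steps, cur)
	}
	return steps
}

func main() {
	fmt.Println("local sort: ", path(false))
	fmt.Println("global sort:", path(true))
}
```

Because the global sort path still ends with a separate StepWriteAndIngest, renaming the first step to StepReadIndexAndIngest would be misleading there, which is the objection raised in this thread.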

@ti-chi-bot

ti-chi-bot bot commented Oct 27, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wjhuang2016, ywqzzy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added approved lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Oct 27, 2023
@ti-chi-bot

ti-chi-bot bot commented Oct 27, 2023

[LGTM Timeline notifier]

Timeline:

  • 2023-10-26 06:12:18.383211997 +0000 UTC m=+2501535.970322140: ☑️ agreed by ywqzzy.
  • 2023-10-27 07:36:27.896598342 +0000 UTC m=+2592985.483708487: ☑️ agreed by wjhuang2016.

@tangenta
Contributor Author

/retest

@tiprow

tiprow bot commented Oct 28, 2023

@tangenta: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest


@tangenta
Contributor Author

/retest

@tiprow

tiprow bot commented Oct 30, 2023

@tangenta: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest


@wjhuang2016
Member

/retest

@wuhuizuo wuhuizuo added the priority/P0 The issue has P0 priority. label Oct 30, 2023
@wuhuizuo
Contributor

I added the P0 label to speed up merging.

@wjhuang2016
Member

/retest

1 similar comment
@ywqzzy
Contributor

ywqzzy commented Oct 30, 2023

/retest

@tiprow

tiprow bot commented Oct 30, 2023

@ywqzzy: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest


@ti-chi-bot ti-chi-bot bot merged commit fd3b2cc into pingcap:master Oct 30, 2023
12 of 16 checks passed
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #48099.

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Oct 30, 2023
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Labels
approved lgtm needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. priority/P0 The issue has P0 priority. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Add index failover in ingest step cause data inconsistency
6 participants