ddl: eliminate ingest step for add index with local engine #47982

tangenta · 2023-10-25T09:34:20Z

What problem does this PR solve?

Issue Number: close #47981

Problem Summary:

Previously, when tidb_enable_dist_task was enabled, the ingest step (step 3) was separate from the read-index step (step 1). As a result, if a TiDB instance crashed during the ingest step, the index data on its local disk was lost. Because the disttask framework does not support moving a step backward (for example, from step 3 back to step 1), the lost index data was never re-scanned, and data inconsistency occurred.

What is changed and how it works?

Merge the ingest step into the read-index step by calling Flush every time a subtask finishes.

Thus, a subtask will not be marked as succeeded if ingest fails. replaceDeadNodesIfAny re-distributes these running subtasks to another TiDB instance when the lease expires.
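
A minimal sketch of this idea, not the actual TiDB code: the subtask flushes (ingests) its data before it reports success, so a failed or interrupted ingest leaves it in the running state, where it can be handed to another node once the owner's lease expires. All names here (Subtask, Engine, memEngine, reassignExpired) are hypothetical and chosen for illustration only; the real logic lives in the read-index executor and in replaceDeadNodesIfAny.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a hypothetical subtask state.
type State string

const (
	Running   State = "running"
	Succeeded State = "succeeded"
)

// Engine is a hypothetical stand-in for the local ingest engine.
type Engine interface {
	Write(kv string) error
	Flush() error // imports locally buffered data into the storage layer
}

// Subtask is a hypothetical record of one unit of read-index work.
type Subtask struct {
	Owner string
	State State
}

// Run writes the scanned index KVs and flushes them before reporting success.
// On any error the subtask stays Running, so it remains eligible for re-distribution.
func (s *Subtask) Run(e Engine, kvs []string) error {
	s.State = Running
	for _, kv := range kvs {
		if err := e.Write(kv); err != nil {
			return err
		}
	}
	if err := e.Flush(); err != nil {
		return err
	}
	s.State = Succeeded
	return nil
}

// reassignExpired hands Running subtasks whose owner lease has expired
// (or is missing) to liveNode and returns how many were moved.
// Lease expiry times and now are plain integers to keep the sketch small.
func reassignExpired(tasks []*Subtask, leaseExpiry map[string]int64, now int64, liveNode string) int {
	moved := 0
	for _, t := range tasks {
		if t.State != Running {
			continue
		}
		if exp, ok := leaseExpiry[t.Owner]; !ok || now > exp {
			t.Owner = liveNode
			moved++
		}
	}
	return moved
}

// memEngine is an in-memory Engine that can simulate an ingest failure.
type memEngine struct {
	buf, ingested []string
	failFlush     bool
}

func (m *memEngine) Write(kv string) error {
	m.buf = append(m.buf, kv)
	return nil
}

func (m *memEngine) Flush() error {
	if m.failFlush {
		return errors.New("ingest failed")
	}
	m.ingested = append(m.ingested, m.buf...)
	m.buf = nil
	return nil
}

func main() {
	st := &Subtask{Owner: "tidb-0"}
	err := st.Run(&memEngine{failFlush: true}, []string{"k1", "k2"})
	fmt.Println(st.State, err != nil)

	moved := reassignExpired([]*Subtask{st}, map[string]int64{"tidb-0": 10}, 11, "tidb-1")
	fmt.Println(moved, st.Owner)
}
```

Because success is recorded only after Flush returns, there is no window in which a subtask is marked as done while its index data exists only on a local disk.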

Check List

Tests

Unit test
Integration test

Before:

Running Suite: ddl Suite
========================
Random Seed: 1698295051
Will run 1 of 4 specs

[2023/10/26 12:37:37.494 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:40.736 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:43.988 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:47.244 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:50.520 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:53.767 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:57.027 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:38:00.713 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:38:04.186 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:38:07.424 +08:00] [INFO] [disttask_test.go:103] ["log found"] [log="[\"[2023/10/26 12:38:03.578 +08:00] [Info] [backend.go:364] [\\\"import start\\\"] [engineTag=<import-and-reset>] [engineUUID=818fb05b-4542-5beb-907a-21cb49c99be8] [retryCnt=0]\\n\"]"]
[2023/10/26 12:38:07.424 +08:00] [INFO] [disttask_test.go:124] ["inject fault"] [chaosParams="{\"name\":\"\",\"faultType\":\"kill\",\"selector\":\"tidb(ddl-owner)\",\"selectorPolicy\":\"\",\"faultDuration\":60000000000,\"Spec\":null,\"SelectorPeersList\":null,\"Pitr\":null,\"TiCDC\":null,\"checkConfig\":{\"balanceCheck\":null,\"raftLogLagCheck\":null,\"raftLogGcCheck\":null},\"repeatExecTimes\":0}"]
[2023/10/26 12:38:07.665 +08:00] [INFO] [db.go:103] ["ADMIN SHOW DDL"]
[2023/10/26 12:38:07.724 +08:00] [INFO] [opts.go:34] ["Chaos opts: {map[type:kill] [map[selectorPeers:[tc-tidb-0]]] 1m0s parallelly  0s}"]
[2023/10/26 12:38:07.724 +08:00] [INFO] [run.go:81] ["tcType: *k8s.TiDBCluster"]
[2023/10/26 12:38:07.724 +08:00] [INFO] [chaos.go:297] ["init chaos"] [selector:="{\"selectorPeers\":[\"tc-tidb-0\"]}"] ["fault type:"=kill]
[2023/10/26 12:38:07.975 +08:00] [INFO] [chaos.go:203] ["fault will last for"] [duration=1m0s]
[2023/10/26 12:38:07.975 +08:00] [INFO] [chaos.go:64] ["Run chaos"] [name=kill] [selectors="[testbed-tangenta-test-g27gc/tc-tidb-0]"] [selectorsRetainPolicy(selectors)="[testbed-tangenta-test-g27gc/tc-tidb-0]"] [targetSelectors="[nil]"] [targetSelectorsRetainPolicy(targetSelectors)="[nil]"] [experimentSpec="ContainerKillExperimentSpec{Scheduler: <nil>}"]
[mysql] 2023/10/26 12:38:08 packets.go:37: unexpected EOF
[2023/10/26 12:39:08.042 +08:00] [INFO] [chaos.go:216] ["chaosDo finish since fault duration reaches"]
[2023/10/26 12:39:08.042 +08:00] [INFO] [chaos.go:88] ["Clean chaos"] [name=kill] [chaosId="ns=testbed-tangenta-test-g27gc,kind=container-kill,name=container-kill-dxiyyvan,spec=&k8s.ChaosIdentifier{Namespace:\"testbed-tangenta-test-g27gc\", Name:\"container-kill-dxiyyvan\", Spec:ContainerKillExperimentSpec{Scheduler: <nil>}}"]
STEP: Start One Test
STEP: End One Test
• Failure [113.580 seconds]
disttask-add-index
/home/tangenta/endless/pkg/util/dsl.go:29
  run add index test
  /home/tangenta/endless/testcase/ddl/disttask_test.go:57
    fail on ingest #fail_on_ingest# [It]
    /home/tangenta/endless/pkg/util/dsl.go:61

    Expected
        <*mysql.MySQLError | 0xc000a8db00>: {
            Number: 8223,
            SQLState: [72, 89, 48, 48, 48],
            Message: "data inconsistency in table: sbtest1, index: idx, handle: 428304, index-values:\"\" != record-values:\"handle: 428304, values: [KindString 92149430868-57178916270-87020426646-90156921857-46807764443-77432155857-65114616205-78384108897-94777493229-87970275195]\"",
        }
    to be nil

After:

Running Suite: ddl Suite
========================
Random Seed: 1698295332
Will run 1 of 4 specs

[2023/10/26 12:42:18.608 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:21.859 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:25.121 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:28.386 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:31.774 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:35.029 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:38.284 +08:00] [INFO] [disttask_test.go:103] ["log found"] [log="[\"[2023/10/26 12:42:36.659 +08:00] [Info] [backend.go:364] [\\\"import start\\\"] [engineTag=<import-and-reset>] [engineUUID=462b4eef-7a5c-5d2f-b4d3-35fd1b503f75] [retryCnt=0]\\n\"]"]
[2023/10/26 12:42:38.284 +08:00] [INFO] [disttask_test.go:124] ["inject fault"] [chaosParams="{\"name\":\"\",\"faultType\":\"kill\",\"selector\":\"tidb(ddl-owner)\",\"selectorPolicy\":\"\",\"faultDuration\":60000000000,\"Spec\":null,\"SelectorPeersList\":null,\"Pitr\":null,\"TiCDC\":null,\"checkConfig\":{\"balanceCheck\":null,\"raftLogLagCheck\":null,\"raftLogGcCheck\":null},\"repeatExecTimes\":0}"]
[2023/10/26 12:42:38.532 +08:00] [INFO] [db.go:103] ["ADMIN SHOW DDL"]
[2023/10/26 12:42:38.588 +08:00] [INFO] [opts.go:34] ["Chaos opts: {map[type:kill] [map[selectorPeers:[tc-tidb-0]]] 1m0s parallelly  0s}"]
[2023/10/26 12:42:38.588 +08:00] [INFO] [run.go:81] ["tcType: *k8s.TiDBCluster"]
[2023/10/26 12:42:38.588 +08:00] [INFO] [chaos.go:297] ["init chaos"] [selector:="{\"selectorPeers\":[\"tc-tidb-0\"]}"] ["fault type:"=kill]
[2023/10/26 12:42:38.864 +08:00] [INFO] [chaos.go:203] ["fault will last for"] [duration=1m0s]
[2023/10/26 12:42:38.864 +08:00] [INFO] [chaos.go:64] ["Run chaos"] [name=kill] [selectors="[testbed-tangenta-test-g27gc/tc-tidb-0]"] [selectorsRetainPolicy(selectors)="[testbed-tangenta-test-g27gc/tc-tidb-0]"] [targetSelectors="[nil]"] [targetSelectorsRetainPolicy(targetSelectors)="[nil]"] [experimentSpec="ContainerKillExperimentSpec{Scheduler: <nil>}"]
[mysql] 2023/10/26 12:42:39 packets.go:37: unexpected EOF
[2023/10/26 12:43:38.937 +08:00] [INFO] [chaos.go:216] ["chaosDo finish since fault duration reaches"]
[2023/10/26 12:43:38.937 +08:00] [INFO] [chaos.go:88] ["Clean chaos"] [name=kill] [chaosId="ns=testbed-tangenta-test-g27gc,kind=container-kill,name=container-kill-mwvzyuvu,spec=&k8s.ChaosIdentifier{Namespace:\"testbed-tangenta-test-g27gc\", Name:\"container-kill-mwvzyuvu\", Spec:ContainerKillExperimentSpec{Scheduler: <nil>}}"]
• [SLOW TEST:95.758 seconds]
disttask-add-index
/home/tangenta/endless/pkg/util/dsl.go:29
  run add index test
  /home/tangenta/endless/testcase/ddl/disttask_test.go:57
    fail on ingest #fail_on_ingest#
    /home/tangenta/endless/pkg/util/dsl.go:61
------------------------------
SSS

Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

tiprow · 2023-10-25T09:34:43Z

Hi @tangenta. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

codecov · 2023-10-25T09:53:59Z

Codecov Report

Merging #47982 (f6fe674) into master (f9f6bb3) will increase coverage by 1.2063%.
The diff coverage is 71.4285%.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #47982        +/-   ##
================================================
+ Coverage   71.5801%   72.7864%   +1.2062%     
================================================
  Files          1401       1420        +19     
  Lines        405938     411589      +5651     
================================================
+ Hits         290571     299581      +9010     
+ Misses        95569      93144      -2425     
+ Partials      19798      18864       -934

Flag          Coverage Δ
integration   42.2823% <0.0000%> (?)
unit          71.5981% <71.4285%> (+0.0180%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components    Coverage Δ
dumpling      54.0503% <ø> (ø)
parser        ∅ <ø> (∅)
br            48.6466% <ø> (-4.2527%) ⬇️

ywqzzy

Rest LGTM.
Maybe refine backfilling_dispatcher_test a bit.

pkg/ddl/backfilling_dispatcher.go

tangenta · 2023-10-26T04:46:47Z

/retest

tiprow · 2023-10-26T04:47:21Z

@tangenta: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pkg/ddl/backfilling_dispatcher.go

zimulala · 2023-10-26T14:17:00Z

Merge the ingest step to read-index step. Thus, the subtask will not be marked as succeed if ingest failed.

Does this fix apply only when "EnableDistTask" is true? The issue does not seem to explain this.
Why can we delete the "ingest step" without causing an error?

tangenta · 2023-10-26T15:31:47Z

@zimulala I've updated the PR description. PTAL

zimulala · 2023-10-27T03:13:12Z

pkg/ddl/backfilling_proto.go

@@ -20,7 +20,7 @@ import "github.com/pingcap/tidb/pkg/disttask/framework/proto"
 // the initial step is StepInit(-1)
 // steps are processed in the following order:
 // - local sort:
-// StepInit -> StepReadIndex -> StepWriteAndIngest -> StepDone
+// StepInit -> StepReadIndex -> StepDone


Suggested change

-// StepInit -> StepReadIndex -> StepDone
+// StepInit -> StepReadIndexAndIngest -> StepDone

tangenta · 2023-10-27

The name is not suitable for global sort:

// - global sort:
// StepInit -> StepReadIndexAndIngest -> StepMergeSort -> StepWriteAndIngest -> StepDone
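
The two step orders discussed in this thread can be sketched as a small transition function. This is an illustration under assumptions, not the code in pkg/ddl: the Step type and nextStep are hypothetical, and only StepInit = -1 and the numbering of read-index as step 1 and ingest as step 3 come from the PR text; the values used for StepMergeSort and StepDone are assumed.

```go
package main

import "fmt"

// Step is a hypothetical stand-in for the framework's step type.
type Step int64

const (
	StepInit           Step = -1 // stated in the source comment
	StepDone           Step = -2 // assumed value
	StepReadIndex      Step = 1  // "step 1" in the PR description
	StepMergeSort      Step = 2  // assumed value
	StepWriteAndIngest Step = 3  // "step 3" in the PR description
)

// nextStep returns the step that follows cur.
// Local sort:  StepInit -> StepReadIndex -> StepDone
// Global sort: StepInit -> StepReadIndex -> StepMergeSort -> StepWriteAndIngest -> StepDone
func nextStep(cur Step, globalSort bool) Step {
	switch cur {
	case StepInit:
		return StepReadIndex
	case StepReadIndex:
		if globalSort {
			return StepMergeSort
		}
		// With the local engine, ingest now happens inside read-index.
		return StepDone
	case StepMergeSort:
		return StepWriteAndIngest
	default:
		return StepDone
	}
}

func main() {
	for _, global := range []bool{false, true} {
		s := StepInit
		for s != StepDone {
			s = nextStep(s, global)
			fmt.Println("globalSort:", global, "step:", s)
		}
	}
}
```

The sketch shows why a single name is awkward: the same StepReadIndex step ingests data in the local-sort path but not in the global-sort path, where StepWriteAndIngest still exists.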

ti-chi-bot · 2023-10-27T07:36:24Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wjhuang2016, ywqzzy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [wjhuang2016]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2023-10-27T07:36:28Z

[LGTM Timeline notifier]

Timeline:

2023-10-26 06:12:18.383211997 +0000 UTC m=+2501535.970322140: ☑️ agreed by ywqzzy.
2023-10-27 07:36:27.896598342 +0000 UTC m=+2592985.483708487: ☑️ agreed by wjhuang2016.

tangenta · 2023-10-28T19:05:33Z

/retest

tiprow · 2023-10-28T19:06:04Z

@tangenta: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tangenta · 2023-10-30T03:46:36Z

/retest

tiprow · 2023-10-30T03:46:57Z

@tangenta: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wjhuang2016 · 2023-10-30T07:44:34Z

/retest

wuhuizuo · 2023-10-30T08:19:28Z

I added the P0 label to speed up merging.

wjhuang2016 · 2023-10-30T08:54:04Z

/retest

ywqzzy · 2023-10-30T09:19:59Z

/retest

tiprow · 2023-10-30T09:20:22Z

@ywqzzy: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ti-chi-bot · 2023-10-30T09:44:48Z

In response to a cherrypick label: new pull request created to branch release-7.5: #48099.

ddl: eliminate ingest step for add index with local engine

976fa0e

ti-chi-bot bot added the do-not-merge/needs-triage-completed, release-note-none (a PR that doesn't merit a release note), and size/L (a PR that changes 100-499 lines, ignoring generated files) labels Oct 25, 2023

update bazel

1f0503d

ti-chi-bot bot added the needs-cherry-pick-release-7.5 label (should cherry-pick this PR to the release-7.5 branch) and removed the do-not-merge/needs-triage-completed label Oct 25, 2023

ywqzzy reviewed Oct 26, 2023

pkg/ddl/backfilling_dispatcher.go (resolved)

tangenta added 2 commits October 26, 2023 10:45

fix build and unit test

3d01cea

enlarge subtask range

f059a15

update comment

3be9b60

ywqzzy approved these changes Oct 26, 2023

ti-chi-bot bot added the needs-1-more-lgtm label (the PR needs 1 more LGTM) Oct 26, 2023

wjhuang2016 reviewed Oct 26, 2023

pkg/ddl/backfilling_dispatcher.go (outdated, resolved)

only change subtask size for local ingest

4e089b9

zimulala reviewed Oct 27, 2023

change regionBatch from 20 to 100

5e5c28f

wjhuang2016 approved these changes Oct 27, 2023

ti-chi-bot bot added the approved and lgtm labels and removed the needs-1-more-lgtm label Oct 27, 2023

wuhuizuo added the priority/P0 label (the issue has P0 priority) Oct 30, 2023

Merge remote-tracking branch 'upstream/master' into HEAD

f6fe674

ti-chi-bot bot merged commit fd3b2cc into pingcap:master Oct 30, 2023
12 of 16 checks passed

ti-chi-bot mentioned this pull request Oct 30, 2023

ddl: eliminate ingest step for add index with local engine (#47982) #48099

Merged

13 tasks

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Oct 30, 2023

This is an automated cherry-pick of pingcap#47982

c65b65d

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>

ywqzzy mentioned this pull request Oct 30, 2023

Enhance distributed task execution framework #46258

Closed

76 tasks

wuhuizuo mentioned this pull request Oct 30, 2023

Revert "chore(prow/config): vip for a pr in pingcap/tidb" ti-community-infra/configs#984

Merged

ti-chi-bot bot pushed a commit that referenced this pull request Oct 31, 2023

ddl: eliminate ingest step for add index with local engine (#47982) (#…

1702710

…48099) close #47981
