Pipelined restore. #266

YuJuncen · 2020-05-08T09:55:38Z

What problem does this PR solve?

Currently, the full restore workflow is:

First, create all tables, collect their new table IDs, and build rewrite rules from them.
Then, use that metadata to split regions and ingest SST files.

That is fine when there are few tables, but as the number of tables grows, the time spent on creating tables grows with it. F1 DDL is slow: even though TiDB has put a lot of effort into optimizing it, creating a table takes about 2 seconds and cannot be parallelized.

What is changed and how it works?

I pipelined the workflow: we no longer wait until all tables have been created (i.e. we can restore data and create tables at the same time).
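
A minimal sketch of the idea, not the code in this PR: the DDL stage emits each table on a channel as soon as it exists, and restore workers start on it while later tables are still being created. The names `table`, `createTables` and `restoreTables` are hypothetical, and the real CREATE TABLE, region split and SST ingest steps are reduced to comments.

```go
package main

import (
	"fmt"
	"sync"
)

// table is a hypothetical stand-in for a created table and its rewrite rules.
type table struct {
	name  string
	newID int64
}

// createTables runs the slow, serial DDL stage and emits each table as soon
// as it exists, instead of returning the whole slice at the end.
func createTables(names []string) <-chan table {
	out := make(chan table)
	go func() {
		defer close(out)
		for i, name := range names {
			// A real implementation would execute CREATE TABLE here.
			out <- table{name: name, newID: int64(100 + i)}
		}
	}()
	return out
}

// restoreTables consumes tables while later ones are still being created.
// It returns the names it restored, in completion order.
func restoreTables(in <-chan table, workers int) []string {
	var (
		mu       sync.Mutex
		restored []string
		wg       sync.WaitGroup
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range in {
				// A real implementation would split regions and ingest SSTs here.
				mu.Lock()
				restored = append(restored, t.name)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return restored
}

func main() {
	done := restoreTables(createTables([]string{"t1", "t2", "t3"}), 2)
	fmt.Println(len(done), "tables restored")
}
```

With this shape the DDL time overlaps with the restore time instead of being added to it, which is where the saving in the test below comes from.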

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)

We tested this on a 3-TiKV cluster with a workload of 500+ tables, each with 2000k records:

Test Object	Time Spent
Original	46 mins (27 mins restore + 19 mins DDL)
Pipelined	32 mins
Pipelined + periodic batch sending	25 mins
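
The last row adds a batcher that also sends on a timer. The sketch below is only an assumption about the general shape of such a batcher, not the one in pkg/restore/batcher.go: it flushes when enough items are pending or when a tick arrives. The `batcher` type, its fields and `run` are hypothetical, and `tick` is a plain channel (it would normally be a `time.Ticker`'s `C`) so the example stays deterministic.

```go
package main

import "fmt"

// batcher is a hypothetical sketch: it buffers items and flushes either when
// the buffer reaches threshold or when a tick arrives.
type batcher struct {
	threshold int
	pending   []string
	sent      [][]string
}

// flush moves the pending items into one sent batch; empty flushes are skipped.
func (b *batcher) flush() {
	if len(b.pending) == 0 {
		return
	}
	b.sent = append(b.sent, b.pending)
	b.pending = nil
}

// run consumes items until the items channel is closed, flushing on size,
// on tick, and once more at the end.
func (b *batcher) run(items <-chan string, tick <-chan struct{}) {
	for {
		select {
		case it, ok := <-items:
			if !ok {
				b.flush()
				return
			}
			b.pending = append(b.pending, it)
			if len(b.pending) >= b.threshold {
				b.flush()
			}
		case <-tick:
			b.flush()
		}
	}
}

func main() {
	items, tick := make(chan string), make(chan struct{})
	go func() {
		items <- "a"
		tick <- struct{}{} // timer fires before the batch is full
		items <- "b"
		items <- "c"
		close(items)
	}()
	b := &batcher{threshold: 2}
	b.run(items, tick)
	fmt.Println(b.sent) // prints "[[a] [b c]]"
}
```

The tick keeps a half-full batch from waiting on slow table creation, so the restore side stays busy.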

There is also an internal test report.

Side effects

Increased code complexity
Breaking backward compatibility (restoring TiFlash nodes after a failed restore is not supported yet)

Related changes

Need to cherry-pick to the release branch
Need to update the documentation

Release Note

Boost the restore speed by pipelining the restore process.

We use `select` instead of `for range`, so we can send an error when the context is cancelled.
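
A minimal sketch of that pattern, assuming an error channel fed by workers. `waitErr`, `cancelledEarly` and `workerFailed` are hypothetical names used for illustration, not functions from this PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// waitErr polls errCh with select rather than `for err := range errCh`.
// A plain range loop would block until the channel is closed, even if the
// context had already been cancelled.
func waitErr(ctx context.Context, errCh <-chan error) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case err, ok := <-errCh:
			if !ok {
				return nil // channel closed: all workers finished
			}
			if err != nil {
				return err
			}
		}
	}
}

// cancelledEarly shows the cancellation path: nothing is ever sent on the
// channel, yet waitErr still returns.
func cancelledEarly() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return waitErr(ctx, make(chan error))
}

// workerFailed shows the error path: one worker reports a failure.
func workerFailed() error {
	errCh := make(chan error, 1)
	errCh <- errors.New("ingest failed")
	return waitErr(context.Background(), errCh)
}

func main() {
	fmt.Println(cancelledEarly()) // prints "context canceled"
	fmt.Println(workerFailed())   // prints "ingest failed"
}
```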

codecov · 2020-05-09T09:18:02Z

Codecov Report

Merging #266 into master will decrease coverage by 1.99%.
The diff coverage is 71.07%.

@@            Coverage Diff             @@
##           master     #266      +/-   ##
==========================================
- Coverage   74.18%   72.18%   -2.00%     
==========================================
  Files          50       50              
  Lines        6007     5616     -391     
==========================================
- Hits         4456     4054     -402     
- Misses       1044     1078      +34     
+ Partials      507      484      -23

Impacted Files	Coverage Δ
pkg/restore/client.go	`66.78% <48.83%> (-10.94%)`	⬇️
pkg/restore/pipeline_items.go	`62.66% <62.66%> (ø)`
pkg/task/restore.go	`58.78% <73.33%> (-7.35%)`	⬇️
pkg/restore/util.go	`78.20% <82.41%> (-3.41%)`	⬇️
pkg/restore/batcher.go	`88.82% <88.82%> (ø)`
pkg/restore/range.go	`82.60% <100.00%> (ø)`
pkg/backup/push.go	`64.51% <0.00%> (-4.84%)`	⬇️
pkg/restore/split_client.go	`56.60% <0.00%> (-4.16%)`	⬇️
... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 49448c8...0ecbcd6.

5kbpers · 2020-05-09T10:37:39Z

Can region splitting and scattering be pipelined?

YuJuncen · 2020-05-11T07:03:22Z

> Can region splitting and scattering be pipelined?

That is probably doable with a small change; let me check.

kennytm · 2020-05-11T07:20:00Z

Perhaps let's defer the pipelined scattering to the next PR and get this PR merged first.

YuJuncen · 2020-05-11T08:00:44Z

> Perhaps let's defer the pipelined scattering to the next PR and get this PR merged first.

OK. Pipelining scattering seems to need changes only in restore/pipeline_items.go, so it would be a small PR.

kennytm

Rest LGTM

kennytm · 2020-06-15T09:25:33Z

pkg/restore/batcher.go

+		if err := b.manager.Leave(ctx, drainResult.BlankTablesAfterSend); err != nil {
+			log.Error("encountering error when leaving recover mode, we can go on but some regions may stick on restore mode",
+				append(
+					ZapRanges(ranges),
+					ZapTables(tbs),
+					zap.Error(err))...,
+			)
+		}


Move this defer to after Enter() is called.
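
A sketch of what the suggestion amounts to, under simplified assumptions: the `manager` type below is a hypothetical stand-in that only records calls, and its `Enter`/`Leave` take no arguments, unlike the `Leave(ctx, tables)` call in the diff above. The point is only the ordering: the deferred Leave is registered after Enter has succeeded, so a failed Enter never triggers a Leave.

```go
package main

import (
	"errors"
	"fmt"
)

// manager is a hypothetical stand-in for the PR's context manager; it only
// records which calls were made.
type manager struct {
	failEnter bool
	calls     []string
}

func (m *manager) Enter() error {
	m.calls = append(m.calls, "Enter")
	if m.failEnter {
		return errors.New("enter failed")
	}
	return nil
}

func (m *manager) Leave() error {
	m.calls = append(m.calls, "Leave")
	return nil
}

// send registers the deferred Leave only once Enter has succeeded.
func send(m *manager) error {
	if err := m.Enter(); err != nil {
		return err
	}
	defer func() {
		if err := m.Leave(); err != nil {
			// Non-fatal in this sketch: report and carry on.
			fmt.Println("leave failed:", err)
		}
	}()
	m.calls = append(m.calls, "restore")
	return nil
}

func main() {
	ok, bad := &manager{}, &manager{failEnter: true}
	_ = send(ok)
	_ = send(bad)
	fmt.Println(ok.calls, bad.calls) // prints "[Enter restore Leave] [Enter]"
}
```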

YuJuncen · 2020-06-15T10:00:20Z

/run-all-tests

YuJuncen · 2020-06-15T10:21:43Z

/run-all-tests

3pointer · 2020-06-15T13:24:01Z

/run-all-tests

YuJuncen · 2020-06-15T14:09:08Z

/run-all-tests

ti-srebot · 2020-06-15T14:42:38Z

cherry pick to release-3.1 failed

ti-srebot · 2020-06-15T14:42:52Z

cherry pick to release-4.0 failed

* restore: add pipelined CreateTable.
* restore: add pipelined ValidateFileRanges.
* restore: pipelining restore process.
* restore, task: use batching when pipelining.
* restore: batcher split by range(instead of table).
* restore,task: new way to for polling errCh.
  We use select instead of for range, so we can send error when context cancelled.
* restore, task: pipelining checksum.
* restore, task: cancel parallel DDL request.
* restore: restore will now send batch periodly.
* restore: refactor batcher.
* restore: add tests on batcher.
* restore, task: make linter happy.
* *: add dep to multierr.
* task: adjust to new function sig.
* task, restore: close updateCh until all task finish.
* task, restore: pipelined restore supports parition.
* backup: always wait worker to finish.
* backup, task: skip checksum when needed.
* *: make linter happy.
* restore: move batcher test to restore_test package.
* Apply suggestions from code review
  Co-authored-by: kennytm <kennytm@gmail.com>
* restore, task: remove context on struct types.
* restore: batcher auto commit can be disabled now.
* restore, task: fix typos.
* recover: fix a bug about removing tiflash.
* restore: MapTableToFiles issues Error log when key range not match.
* *: merge master.
* restore: fix test to match new change of master.
* Apply suggestions from code review
* restore: merge two progresses.
* restore: fix a bug.
  that is, when table is too big or batch size is too low, we will fail to restore the head part of this table.
* restore: extract batcher to another file
* task: don't return imediately when files is empty.
* restore,task: do some refactor
  We move `splitPrepareWork` into a struct named `ContextManager`, so that we can batchly set placement rules on online restore.
* restore: fix a shaming bug... :|
* task,restore: panic on file broken
* restore: record tiflash count to disk when removed
* restore,task: simplify some code,
* task,restore: fix a bug.
  The bug causes, when a singal table is splt into multi part of batches, it sometimes fail to checksum.
* restore: some factory and fix
  1. make the batcher worker has two send style
  2. make functions for debuging tables and ranges
  3. rewrite a test case to adapt the new batcher
* tests: try to fix CI
* tests: try to fix CI, again
* Apply suggestions from code review
  Co-authored-by: 3pointer <qdlc2010@gmail.com>
* restore: change some log levels
* restore: merge joiner of sendWorker into messagebox
  ... and, some small changes:
  - don't send sending request if here is one.
  - the method of how a batcher is send move to log level debug
* restore,task: run RemoveRestoreLabels at restore post work
* task: adapt the remove-tiflash flag
* restore,task: fetch new placement rules each time
* Apply suggestions from code review
  Co-authored-by: kennytm <kennytm@gmail.com>
* restore,task: run Leave always, and modify some log level
* restore: fix a bug that may cause checksum time incorrect
* restore: don't Leave if never Enter

Co-authored-by: kennytm <kennytm@gmail.com>
Co-authored-by: 3pointer <qdlc2010@gmail.com>
Co-authored-by: 3pointer <luancheng@pingcap.com>

* Pipelined restore. (#266)
* restore: fix a package name error
  seems the rename package PR isn't cherry-picked to release-3.0. So I move batcher_test to restore(from restore_test)

Co-authored-by: kennytm <kennytm@gmail.com>
Co-authored-by: 3pointer <qdlc2010@gmail.com>
Co-authored-by: 3pointer <luancheng@pingcap.com>

Hillium added 14 commits April 28, 2020 14:50

restore: add pipelined CreateTable.

0b07b64

restore: add pipelined ValidateFileRanges.

061b669

restore: pipelining restore process.

d711ded

restore, task: use batching when pipelining.

f258d99

restore: batcher split by range(instead of table).

cefb696

restore,task: new way to for polling errCh.

7846324

We use select instead of for range, so we can send error when context cancelled.

restore, task: pipelining checksum.

c78a24b

restore, task: cancel parallel DDL request.

93b5942

restore: restore will now send batch periodly.

f3ec5ee

restore: refactor batcher.

bb31ec0

restore: add tests on batcher.

d54b99c

restore, task: make linter happy.

4c4a3d8

*: add dep to multierr.

3cffa5d

Merge branch 'master' of https://github.com/pingcap/br into pipelined…

965779a

…-restore

YuJuncen added the WIP label May 8, 2020

Hillium added 5 commits May 8, 2020 18:09

task: adjust to new function sig.

59e33b0

task, restore: close updateCh until all task finish.

44d52be

task, restore: pipelined restore supports parition.

a49e1f0

backup: always wait worker to finish.

84e267c

backup, task: skip checksum when needed.

7fab3c3

Hillium added 2 commits May 9, 2020 17:23

Merge branch 'master' of https://github.com/pingcap/br into pipelined…

39d2312

…-restore

*: make linter happy.

ac6f5be

YuJuncen added needs-cherry-pick-release-4.0 and removed WIP labels May 9, 2020

YuJuncen linked an issue May 11, 2020 that may be closed by this pull request

BR restore performance optimize #255

Closed

3 tasks

kennytm reviewed Jun 13, 2020

YuJuncen and others added 6 commits June 15, 2020 10:23

Apply suggestions from code review

4cbbff0

Co-authored-by: kennytm <kennytm@gmail.com>

restore,task: run Leave always, and modify some log level

460331f

Merge branch 'master' into pipelined-restore

b450532

restore: fix a bug that may cause checksum time incorrect

0ee5223

Merge branch 'pipelined-restore' of https://github.com/Yujuncen/br in…

7437950

…to pipelined-restore

Merge branch 'master' into pipelined-restore

8f96c30

YuJuncen mentioned this pull request Jun 15, 2020

[Meta] collection of log improvements #259

Open

6 tasks

YuJuncen requested a review from kennytm June 15, 2020 07:01

kennytm reviewed Jun 15, 2020

yujuncen added 2 commits June 15, 2020 17:30

restore: don't Leave if never Enter

77ab77f

Merge branch 'pipelined-restore' of https://github.com/Yujuncen/br in…

0ecbcd6

…to pipelined-restore

kennytm approved these changes Jun 15, 2020

YuJuncen merged commit d7a3060 into pingcap:master Jun 15, 2020

YuJuncen mentioned this pull request Jun 16, 2020

Cherry-pick Pipelined restore (#266) to 4.0 #356

Merged

YuJuncen mentioned this pull request Jun 16, 2020

Cherry-pick #266 to release-3.1 #357

Merged

This was referenced Jul 3, 2020

br don't work well when restoring region file to a new cluster #380

Closed

Cluster restore stuck after completing DDL jobs #348

Closed

YuJuncen mentioned this pull request Jul 14, 2020

Pipelining split and import #419

Closed

dveeden mentioned this pull request Jul 1, 2021

Dumpling v5.1.0 can NOT access Google GCS; Works in v5.0.0 #1302

Closed
