We use select instead of for range, so we can send error when context cancelled.
Codecov Report
@@            Coverage Diff             @@
##           master     #266      +/-   ##
==========================================
- Coverage   74.18%   72.18%   -2.00%
==========================================
  Files          50       50
  Lines        6007     5616     -391
==========================================
- Hits         4456     4054     -402
- Misses       1044     1078      +34
+ Partials      507      484      -23
Can region splitting and scattering be pipelined?
That would probably be OK with a small change; let me check.
Perhaps let's defer pipelined scattering to the next PR and get this PR merged first.
OK, pipelining scattering seems to just need a change
Rest LGTM
Co-authored-by: kennytm <kennytm@gmail.com>
…to pipelined-restore
Rest LGTM
    if err := b.manager.Leave(ctx, drainResult.BlankTablesAfterSend); err != nil {
        log.Error("encountering error when leaving recover mode, we can go on but some regions may stick on restore mode",
            append(
                ZapRanges(ranges),
                ZapTables(tbs),
                zap.Error(err))...,
        )
    }
Move this defer after Enter() is called.
/run-all-tests
cherry pick to release-3.1 failed
cherry pick to release-4.0 failed
* restore: add pipelined CreateTable.
* restore: add pipelined ValidateFileRanges.
* restore: pipelining restore process.
* restore, task: use batching when pipelining.
* restore: batcher split by range (instead of table).
* restore,task: new way of polling errCh.
  We use select instead of for range, so we can send an error when the context is cancelled.
* restore, task: pipelining checksum.
* restore, task: cancel parallel DDL request.
* restore: restore will now send batch periodically.
* restore: refactor batcher.
* restore: add tests on batcher.
* restore, task: make linter happy.
* *: add dep to multierr.
* task: adjust to new function sig.
* task, restore: close updateCh until all task finish.
* task, restore: pipelined restore supports partition.
* backup: always wait worker to finish.
* backup, task: skip checksum when needed.
* *: make linter happy.
* restore: move batcher test to restore_test package.
* Apply suggestions from code review
  Co-authored-by: kennytm <kennytm@gmail.com>
* restore, task: remove context on struct types.
* restore: batcher auto commit can be disabled now.
* restore, task: fix typos.
* recover: fix a bug about removing tiflash.
* restore: MapTableToFiles issues Error log when key range not match.
* *: merge master.
* restore: fix test to match new change of master.
* Apply suggestions from code review
* restore: merge two progresses.
* restore: fix a bug.
  That is, when a table is too big or the batch size is too low, we will fail to restore the head part of this table.
* restore: extract batcher to another file
* task: don't return immediately when files is empty.
* restore,task: do some refactor
  We move `splitPrepareWork` into a struct named `ContextManager`, so that we can set placement rules in batches on online restore.
* restore: fix a shaming bug... :|
* task,restore: panic on file broken
* restore: record tiflash count to disk when removed
* restore,task: simplify some code,
* task,restore: fix a bug.
  The bug causes checksum to sometimes fail when a single table is split into multiple batches.
* restore: some factory and fix
  1. make the batcher worker have two send styles
  2. make functions for debugging tables and ranges
  3. rewrite a test case to adapt to the new batcher
* tests: try to fix CI
* tests: try to fix CI, again
* Apply suggestions from code review
  Co-authored-by: 3pointer <qdlc2010@gmail.com>
* restore: change some log levels
* restore: merge joiner of sendWorker into messagebox
  ... and, some small changes:
  - don't send sending request if here is one.
  - the method of how a batcher is sent moves to log level debug
* restore,task: run RemoveRestoreLabels at restore post work
* task: adapt the remove-tiflash flag
* restore,task: fetch new placement rules each time
* Apply suggestions from code review
  Co-authored-by: kennytm <kennytm@gmail.com>
* restore,task: run Leave always, and modify some log level
* restore: fix a bug that may cause checksum time incorrect
* restore: don't Leave if never Enter
Co-authored-by: kennytm <kennytm@gmail.com>
Co-authored-by: 3pointer <qdlc2010@gmail.com>
Co-authored-by: 3pointer <luancheng@pingcap.com>
* Pipelined restore. (#266) [same commit list as above]
* restore: fix a package name error
  It seems the package-rename PR wasn't cherry-picked to release-3.0, so batcher_test is moved to restore (from restore_test).
Co-authored-by: kennytm <kennytm@gmail.com>
Co-authored-by: 3pointer <qdlc2010@gmail.com>
Co-authored-by: 3pointer <luancheng@pingcap.com>
What problem does this PR solve?
Currently, the full restore workflow is:
That is fine when there are only a few tables, but as the number of tables grows, the time spent creating tables grows considerably too. F1 DDL is slow: even though TiDB has made a great effort to optimize it, creating a table takes about 2 seconds, and table creation cannot be parallelized.
What is changed and how it works?
I pipelined the workflow; that is, we no longer wait until all tables are created (e.g. we can restore data and create tables at the same time).
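The idea can be sketched with channels. This is an illustration under assumptions, not the PR's implementation: `createTables` and `restoreTables` are hypothetical names, and table creation and restore are simulated with plain strings. The creation stage stays sequential (matching the note that DDL cannot be parallelized) but hands each table downstream as soon as it exists, so restore workers start before the last table has been created.

```go
package main

import (
	"fmt"
	"sync"
)

// createTables simulates the sequential DDL stage (hypothetical helper).
// Each table name is emitted as soon as that table is "created", rather
// than after the whole list has been processed.
func createTables(names []string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for _, name := range names {
			// A real implementation would run CREATE TABLE here.
			out <- name
		}
	}()
	return out
}

// restoreTables simulates the restore stage (hypothetical helper). Several
// workers consume tables as they arrive; the restored names are collected
// and returned once the creation stage has closed the channel.
func restoreTables(created <-chan string, workers int) []string {
	var (
		mu       sync.Mutex
		restored []string
		wg       sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for table := range created {
				// A real implementation would restore the table's files here.
				mu.Lock()
				restored = append(restored, table)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return restored
}

func main() {
	restored := restoreTables(createTables([]string{"t1", "t2", "t3"}), 2)
	// The order depends on scheduling, so only the count is printed.
	fmt.Println(len(restored), "tables restored")
}
```

The unbuffered channel means the creation stage only advances when a restore worker is ready to take the next table; a buffered channel would let creation run ahead of restore.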
Check List
Tests
We tested this on a 3-TiKV cluster, with a workload of 500+ tables and 2000k records per table:
There is also an internal test report.
Side effects
Related changes
Release Note