Skip to content
This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

restore: make split and restore pipelined. #427

Merged
merged 36 commits into from
Oct 9, 2020

Conversation

YuJuncen
Copy link
Collaborator

@YuJuncen YuJuncen commented Jul 16, 2020

Signed-off-by: Hillium maruruku@stu.csust.edu.cn

What problem does this PR solve?

Fix #419

What is changed and how it works?

We changed the interface of BatchSender, so it can run asynchronously:

type TableSink interface {
	EmitTables(tables ...CreatedTable)
	EmitError(error)
	Close()
}

// BatchSender is the abstract of how the batcher send a batch.
type BatchSender interface {
	// PutSink sets the sink of this sender, user to this interface promise
	// call this function at least once before first call to `RestoreBatch`.
	PutSink(sink TableSink)
	// RestoreBatch will send the restore request.
	RestoreBatch(ranges DrainResult)
	Close()
}

Then, we must change the call site of Batcher, and make another goroutine(known as contextCleaner) to accept tables that restored asynchronously, do the clean job, and send tables to next.

(TODO?: since the sender can be internally asynchronous, we can remove sendWorker and send requests directly at the call site. However the risk of this is that the channel size of BatchSender might block the caller, and we must pass a ctx to methods like Add since ContextManager need to be in the tree of structured concurrent too.)

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
    Internal test: real time 12m28s -> 9m5s (45000+ ranges)

Release Note

  • Speed up restore by pipelining (split & scatter) and import.

Hillium added 2 commits July 16, 2020 18:59
Signed-off-by: Hillium <maruruku@stu.csust.edu.cn>
Signed-off-by: Hillium <maruruku@stu.csust.edu.cn>
@YuJuncen
Copy link
Collaborator Author

/build

@YuJuncen
Copy link
Collaborator Author

/run-integration-test

Hillium and others added 8 commits July 16, 2020 22:55
Signed-off-by: Hillium <maruruku@stu.csust.edu.cn>
Signed-off-by: Hillium <maruruku@stu.csust.edu.cn>
Signed-off-by: Hillium <maruruku@stu.csust.edu.cn>
Signed-off-by: Hillium <maruruku@stu.csust.edu.cn>
remove reject stores sometime always grow too huge and provides little useful information, move it to DEBUG.
add a log when any file is fully restored, which is useful troubleshooting when some restore goes slow.

Signed-off-by: Hillium <maruruku@stu.csust.edu.cn>
Signed-off-by: Hillium <maruruku@stu.csust.edu.cn>
Signed-off-by: Hillium <maruruku@stu.csust.edu.cn>
Hillium added 2 commits July 20, 2020 17:26
Signed-off-by: Hillium <maruruku@stu.csust.edu.cn>
Signed-off-by: Hillium <maruruku@stu.csust.edu.cn>
pkg/restore/batcher.go Outdated Show resolved Hide resolved
…hen exit

Signed-off-by: Hillium <maruruku@stu.csust.edu.cn>
@YuJuncen
Copy link
Collaborator Author

YuJuncen commented Jul 20, 2020

/run-integration-test

"transport: Error while dialing dial tcp 127.0.0.1:20163: connect: connection refused"

@YuJuncen
Copy link
Collaborator Author

/run-integration-test

@YuJuncen
Copy link
Collaborator Author

Seems it's buggy, will it reproduce again?

[2020-07-21T03:31:23.703Z] [2020/07/21 11:31:23.613 +08:00] [INFO] [client.go:178] ["load backupmeta"] [databases=0] [jobs=3]

/run-integration-test

@kennytm
Copy link
Collaborator

kennytm commented Jul 22, 2020

It's green now.

pkg/restore/util.go Outdated Show resolved Hide resolved
Hillium and others added 2 commits September 18, 2020 12:31
Signed-off-by: Hillium <maruruku@stu.csust.edu.cn>
@YuJuncen
Copy link
Collaborator Author

/run-integration-test

@YuJuncen
Copy link
Collaborator Author

[2020-09-21T05:41:08.800Z] [2020/09/21 13:41:08.664 +08:00] [INFO] [collector.go:187] ["Full backup Failed summary : total backup ranges: 0, total success: 0, total failed: 0"] [BackupTS=419605212977168385]

/run-integration-test 🤔

@YuJuncen
Copy link
Collaborator Author

/run-integration-test

Copy link
Collaborator

@kennytm kennytm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-srebot ti-srebot added the status/LGT1 LGTM1 label Sep 27, 2020
@@ -338,9 +338,9 @@ func (importer *FileImporter) Import(
zap.Error(errIngest))
return errIngest
}
summary.CollectSuccessUnit(summary.TotalKV, 1, file.TotalKvs)
summary.CollectSuccessUnit(summary.TotalBytes, 1, file.TotalBytes)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔️

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How strange... 🤔

Hillium and others added 2 commits October 9, 2020 17:42
@3pointer
Copy link
Collaborator

3pointer commented Oct 9, 2020

LGTM

@ti-srebot ti-srebot removed the status/LGT1 LGTM1 label Oct 9, 2020
@ti-srebot ti-srebot added the status/LGT2 LGTM2 label Oct 9, 2020
@YuJuncen
Copy link
Collaborator Author

YuJuncen commented Oct 9, 2020

/merge

@ti-srebot
Copy link
Contributor

/run-all-tests

@ti-srebot ti-srebot merged commit ed2b143 into pingcap:master Oct 9, 2020
@ti-srebot
Copy link
Contributor

cherry pick to release-4.0 failed

YuJuncen added a commit to YuJuncen/br that referenced this pull request Oct 21, 2020
Signed-off-by: Hillium <maruruku@stu.csust.edu.cn>
YuJuncen added a commit that referenced this pull request Oct 23, 2020
*: remove in-struct context (#452)
* restore: make split and restore pipelined. (#427)
*: when cleaning up, use isolated context (#559)
* pd: disable location replacement (#555)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pipelining split and import
4 participants