
Support non-uniform batch size #114

Merged — 8 commits merged into kennytm/batching from lonng/dynamic-batch-size on Jan 14, 2019

Conversation

@lonng (Contributor) commented on Jan 9, 2019

What problem does this PR solve?

The import() step does not run concurrently. If multiple batches finish at nearly the same time, their imports end up being executed serially.

What is changed and how it works?

curBatchSize = batchSize / int64(batchSizeScale-curEngineID)
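
For illustration only, a minimal Go sketch of the formula above with hypothetical values (batchSize = 100 GiB, batchSizeScale = 5). This is not the code merged by this PR, which ultimately uses the Beta-function-based formula discussed in the review below; it only shows how the per-engine sizes become non-uniform so the batches stop finishing at the same time:

package main

import "fmt"

func main() {
	const gib = int64(1) << 30
	batchSize := 100 * gib     // hypothetical base batch size
	batchSizeScale := int64(5) // hypothetical scale factor
	for curEngineID := int64(0); curEngineID < batchSizeScale; curEngineID++ {
		// formula quoted in the PR description
		curBatchSize := batchSize / (batchSizeScale - curEngineID)
		fmt.Printf("engine %d: %d GiB\n", curEngineID, curBatchSize/gib)
	}
}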

Check List

Tests

Code changes

Side effects

Related changes

@sre-bot commented on Jan 9, 2019

Hi contributor, thanks for your PR.

This patch needs to be approved by one of the admins. They should reply with "/ok-to-test" to accept this PR and run the tests automatically.

@lonng (Contributor, Author) commented on Jan 9, 2019

/run-all-tests

@kennytm (Collaborator) commented on Jan 9, 2019

Perhaps use #113 as the base branch to reduce the diff?

@lonng changed the base branch from master to kennytm/batching on Jan 9, 2019 03:44
@lonng (Contributor, Author) commented on Jan 9, 2019

/run-all-tests

@lonng (Contributor, Author) commented on Jan 9, 2019

/run-all-tests

@kennytm (Collaborator) left a comment

Nitpick: this is not a "dynamic" batch size, just a "non-uniform" batch size. I expected "dynamic" to mean we would adjust the batch size at runtime 🙃

lightning/mydump/region.go — review comments outdated, resolved
tests/checkpoint_engines/run.sh — review comments outdated, resolved
@lonng changed the title from "Dynamic batch size" to "Deterministic dynamic batch size" on Jan 9, 2019
@lonng added the status/DNM, priority/normal, and type/enhancement labels on Jan 9, 2019
@lonng changed the title from "Deterministic dynamic batch size" to "Non-uniform batch size" on Jan 9, 2019
@lonng (Contributor, Author) commented on Jan 9, 2019

/run-all-tests

@kennytm (Collaborator) commented on Jan 9, 2019

/run-all-tests

1 similar comment
@lonng (Contributor, Author) commented on Jan 10, 2019

/run-all-tests

@lonng added the priority/important and Should Update Docs labels and removed the priority/normal label on Jan 10, 2019
@lonng changed the title from "Non-uniform batch size" to "Support non-uniform batch size" on Jan 10, 2019
* Use the exact result of 1/Beta(N, R) instead of an approximation
* When the number of engines is small and the total engine size of the
  first (table-concurrency) batches exceeds the table size, the last batch
  was truncated, disrupting the pipeline. Now in this case we reduce the
  batch size to avoid the disruption.
@lonng added the status/PTAL label on Jan 11, 2019
@lonng removed the status/DNM label on Jan 11, 2019
@kennytm (Collaborator) commented on Jan 12, 2019

/run-all-tests

@kennytm (Collaborator) commented on Jan 12, 2019

PTAL @csuzhangxc @GregoryIan

Since both @lonng and I participated in this PR, we need somebody else to review the PR 🙃.

If approved, this PR will be merged into #113, and then #113 will be merged immediately into master.

Summary of the changes:

  1. Employed the non-uniform batch size algorithm outlined in the new section in RFC 3, with the default ratio R = 0.75.
  2. Increased the default batch size B₁ from 10 GiB to 100 GiB.
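
For reference, the resulting defaults might look roughly like this in the tidb-lightning TOML config; the key names batch-size and batch-import-ratio come from this PR, while the section placement and the byte spelling of 100 GiB are assumptions:

[mydumper]
# base batch size B₁ = 100 GiB, in bytes
batch-size = 107_374_182_400
# ratio R of the "import" step duration to the "write" step duration
batch-import-ratio = 0.75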

@IANTHEREAL (Collaborator) commented

OK, I will review it tomorrow or the day after tomorrow.

# Lightning will slightly increase the size of the first few batches to properly distribute
# resources. The scale up is controlled by this parameter, which expresses the speed ratio between
# the "import" and "write" steps. If "import" is faster, the batch size anomaly is smaller, and
# zero means uniform batch size. This value should be in the range (0 <= batch-import-ratio < 1).
A collaborator commented:

How do we get the import speed? Should we provide a constant value?

A collaborator replied:

I've expanded the comment a bit. This can be calculated as (import duration / write duration) of a single small table (e.g. ~1 GB).

I do suspect the ratio is not a constant. It could be affected by the table structure, for instance. But for the 3 tables we've tested, the ratio does approach this value. We could optimize the calculation later.
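
As a purely hypothetical worked example of that measurement: if writing a ~1 GB sample table takes 40 s and importing it takes 30 s, then batch-import-ratio ≈ 30 / 40 = 0.75, which is the default chosen in this PR.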

@IANTHEREAL (Collaborator) commented on Jan 14, 2019:

I just mean giving the import duration a reference value. But what is the import duration, and how does a user obtain it? The user would have to do a trial run and then find it in the log.

A member commented:

Will we give users a series of different recommended values for different deployments later?

@kennytm (Collaborator) commented on Jan 14, 2019:

We can explain how to choose the best value in the docs and to OPS. But we don't expect users to change these values unless they want to do some heavy optimization.

invBetaNR := math.Exp(logGammaNPlusR - logGammaN - logGammaR) // 1/B(N, R) = Γ(N+R)/Γ(N)Γ(R)
for {
	if n <= 0 || n > tableConcurrency {
		n = tableConcurrency
A collaborator commented:

Should we check whether the user has set an unreasonable table concurrency?

A collaborator added:

In addition, it may be better to compute n from the given table concurrency.

A collaborator replied:

We can't directly set n = tableConcurrency, as this may produce engine sizes that are too large or too small.

Example of too large: an 8T table with table-conc = 8, forcing each batch to be ~1T and putting pressure on the importer.
Example of too small: a 200G table with table-conc = 8, forcing each batch to be ~25G, making the data sent to TiKV less sorted and increasing the compaction cost.

@IANTHEREAL (Collaborator) commented on Jan 14, 2019:

I mean we compute n from the given batchImportRatio and batchSize. What would happen if batchSize is unreasonable (e.g. too small) and the table concurrency is also small? Will the algorithm degenerate into nearly sequential imports?

// ≲ N/(1-R)
//
// We use a simple brute force search since the search space is extremely small.
ratio := totalDataFileSize * (1 - batchImportRatio) / batchSize
A collaborator commented:

Why not compute N here directly, and just reduce the batch size?

A collaborator replied:

Because there's no simple formula to solve for N in X = N - 1/Beta(N, R) 😅.

A collaborator added:

N lies in a limited range of values, so maybe we could use a heuristic.
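
For reference, a self-contained Go sketch of the brute-force search discussed in this thread, pieced together from the two snippets quoted above. The function name, parameter list, and example values are assumptions rather than the exact code in lightning/mydump/region.go; the search looks for the smallest N with N - 1/Beta(N, R) >= Total*(1-R)/B₁, capping N at the table concurrency and shrinking the base batch size when needed:

package main

import (
	"fmt"
	"math"
)

// searchEngineCount brute-forces the engine count n and, when n full batches
// would overshoot the table, shrinks the base batch size so the last batch
// is not truncated.
func searchEngineCount(totalSize, batchSize, importRatio, tableConcurrency float64) (n, curBatchSize float64) {
	curBatchSize = batchSize
	ratio := totalSize * (1 - importRatio) / batchSize
	n = math.Ceil(ratio)

	logGammaNPlusR, _ := math.Lgamma(n + importRatio)
	logGammaN, _ := math.Lgamma(n)
	logGammaR, _ := math.Lgamma(importRatio)
	invBetaNR := math.Exp(logGammaNPlusR - logGammaN - logGammaR) // 1/B(N, R) = Γ(N+R)/Γ(N)Γ(R)

	for {
		if n <= 0 || n > tableConcurrency {
			// not enough engines available: fall back to the concurrency limit
			n = tableConcurrency
			break
		}
		realRatio := n - invBetaNR
		if realRatio >= ratio {
			// n engines at the configured batch size would overshoot the table,
			// so reduce the batch size to keep the pipeline smooth
			curBatchSize = totalSize * (1 - importRatio) / realRatio
			break
		}
		invBetaNR *= 1 + importRatio/n // Γ(X+1) = X·Γ(X)
		n += 1.0
	}
	return n, curBatchSize
}

func main() {
	// hypothetical input: a 1 TiB table, B₁ = 100 GiB, R = 0.75, table-concurrency = 6
	const gib = float64(1 << 30)
	n, b := searchEngineCount(1024*gib, 100*gib, 0.75, 6)
	fmt.Printf("engines: %.0f, adjusted base batch size: %.1f GiB\n", n, b/gib)
}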

@IANTHEREAL (Collaborator) commented

LGTM

@IANTHEREAL added the status/LGT1 label and removed the status/PTAL label on Jan 14, 2019
@csuzhangxc (Member) commented

LGTM

@csuzhangxc added the status/LGT2 label and removed the status/LGT1 label on Jan 14, 2019
@kennytm added the Should Update Ansible label on Jan 14, 2019
@kennytm (Collaborator) commented on Jan 14, 2019

/run-all-tests

@kennytm merged commit 8ba38ed into kennytm/batching on Jan 14, 2019
kennytm added a commit that referenced this pull request on Jan 14, 2019:
* config,restore: introduced `[mydumper] batch-size`

Removed `[tikv-importer] batch-size` to avoid confusion.

Removed `[mydumper] min-region-size` since it is useless now.

* restore,mydump: pre-allocate engine IDs

* restore: separate table checkpoints and engine checkpoints

* importer: stop exposing the UUID

* checkpoints: make checkpoint diff understand 1 table = many engines

* checkpoints: make file checkpoints recognize multiple engines

* checkpoints: migrated MySQL-based checkpoint to multi-engine as well

* restore: adapt restore workflow for multi-engine

* tests: added test case for multi-engine

* *: fixed code

* *: addressed comments

* *: addressed comments

* Support non-uniform batch size (#114)

* mydump: non-uniform batch size

* *: make the `batch-size-scale` configurable

* *: implemented the optimized non-uniform strategy

* tests: due to change of strategy, checkpoint_engines count becomes 4 again

* mydump/region: slightly adjust the batch size computation

* Use the exact result of 1/Beta(N, R) instead of an approximation
* When the number of engines is small and the total engine size of the
  first (table-concurrency) batches exceeds the table size, the last batch
  was truncated, disrupting the pipeline. Now in this case we reduce the
  batch size to avoid the disruption.

* restore: log the SQL size and KV size of each engine for debugging

* config: change default batch size and ratio given experiment result

* config: added more explanation about batch-import-ratio

Co-authored-by: Lonng <chris@lonng.org>
@lonng deleted the lonng/dynamic-batch-size branch on Mar 5, 2019 07:50
@kennytm removed the Should Update Docs label on Mar 11, 2019
@kennytm removed the Should Update Ansible label on May 2, 2019
Labels: status/LGT2, type/enhancement
5 participants