
Support non-uniform batch size #114

Merged — 8 commits merged into kennytm/batching from lonng/dynamic-batch-size on Jan 14, 2019

Conversation

@lonng (Contributor) commented on Jan 9, 2019

What problem does this PR solve?

The import() step does not run concurrently. If multiple batches finish at nearly the same time, their imports end up being executed serially.

What is changed and how it works?

curBatchSize = batchSize / int64(batchSizeScale-curEngineID)
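
For illustration only, a minimal Go sketch of the formula above with hypothetical values (batchSize = 100 GiB, batchSizeScale = 5). This is not the code merged by this PR, which ultimately uses the Beta-function-based formula discussed in the review below; it only shows how the per-engine sizes become non-uniform so the batches stop finishing at the same time:

package main

import "fmt"

func main() {
	const gib = int64(1) << 30
	batchSize := 100 * gib     // hypothetical base batch size
	batchSizeScale := int64(5) // hypothetical scale factor
	for curEngineID := int64(0); curEngineID < batchSizeScale; curEngineID++ {
		// formula quoted in the PR description
		curBatchSize := batchSize / (batchSizeScale - curEngineID)
		fmt.Printf("engine %d: %d GiB\n", curEngineID, curBatchSize/gib)
	}
}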

Check List

Tests

Code changes

Side effects

Related changes

@sre-bot commented on Jan 9, 2019

Hi contributor, thanks for your PR.

This patch needs to be approved by one of the admins. They should reply with "/ok-to-test" to accept this PR and run the tests automatically.

@lonng (Contributor, Author) commented on Jan 9, 2019

/run-all-tests

@kennytm (Collaborator) commented on Jan 9, 2019

Perhaps use #113 as the base branch to reduce the diff?

@lonng changed the base branch from master to kennytm/batching on Jan 9, 2019 03:44
@lonng (Contributor, Author) commented on Jan 9, 2019

/run-all-tests

@lonng (Contributor, Author) commented on Jan 9, 2019

/run-all-tests

@kennytm (Collaborator) left a comment

Nitpick: this is not a "dynamic" batch size, just a "non-uniform" batch size. I expected "dynamic" to mean we would adjust the batch size at runtime 🙃

lightning/mydump/region.go — review comments outdated, resolved
tests/checkpoint_engines/run.sh — review comments outdated, resolved
@lonng changed the title from "Dynamic batch size" to "Deterministic dynamic batch size" on Jan 9, 2019
@lonng added the status/DNM, priority/normal, and type/enhancement labels on Jan 9, 2019
@lonng changed the title from "Deterministic dynamic batch size" to "Non-uniform batch size" on Jan 9, 2019
@lonng (Contributor, Author) commented on Jan 9, 2019

/run-all-tests

@kennytm (Collaborator) commented on Jan 9, 2019

/run-all-tests

1 similar comment
@lonng (Contributor, Author) commented on Jan 10, 2019

/run-all-tests

@lonng added the priority/important and Should Update Docs labels and removed the priority/normal label on Jan 10, 2019
@lonng changed the title from "Non-uniform batch size" to "Support non-uniform batch size" on Jan 10, 2019
* Use the exact result of 1/Beta(N, R) instead of an approximation
* When the number of engines is small and the total engine size of the
  first (table-concurrency) batches exceeds the table size, the last batch
  was truncated, disrupting the pipeline. Now in this case we reduce the
  batch size to avoid the disruption.
@lonng added the status/PTAL label on Jan 11, 2019
@lonng removed the status/DNM label on Jan 11, 2019
@kennytm (Collaborator) commented on Jan 12, 2019

/run-all-tests

@kennytm (Collaborator) commented on Jan 12, 2019

PTAL @csuzhangxc @GregoryIan

Since both @lonng and I participated in this PR, we need somebody else to review the PR 🙃.

If approved, this PR will be merged into #113, and then #113 will be merged immediately into master.

Summary of the changes:

  1. Employed the non-uniform batch size algorithm outlined in the new section in RFC 3, with the default ratio R = 0.75.
  2. Increased the default batch size B₁ from 10 GiB to 100 GiB.
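
For reference, the resulting defaults might look roughly like this in the tidb-lightning TOML config; the key names batch-size and batch-import-ratio come from this PR, while the section placement and the byte spelling of 100 GiB are assumptions:

[mydumper]
# base batch size B₁ = 100 GiB, in bytes
batch-size = 107_374_182_400
# ratio R of the "import" step duration to the "write" step duration
batch-import-ratio = 0.75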

@IANTHEREAL (Collaborator) commented

OK, I will review it tomorrow or the day after tomorrow.

# Lightning will slightly increase the size of the first few batches to properly distribute
# resources. The scale up is controlled by this parameter, which expresses the speed ratio between
# the "import" and "write" steps. If "import" is faster, the batch size anomaly is smaller, and
# zero means uniform batch size. This value should be in the range (0 <= batch-import-ratio < 1).
A collaborator commented:

How do we get the import speed? Should we provide a constant value?

A collaborator replied:

I've expanded the comment a bit. This can be calculated as (import duration / write duration) of a single small table (e.g. ~1 GB).

I do suspect the ratio is not a constant. It could be affected by the table structure, for instance. But for the 3 tables we've tested, the ratio does approach this value. We could optimize the calculation later.
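
As a purely hypothetical worked example of that measurement: if writing a ~1 GB sample table takes 40 s and importing it takes 30 s, then batch-import-ratio ≈ 30 / 40 = 0.75, which is the default chosen in this PR.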

@IANTHEREAL (Collaborator) commented on Jan 14, 2019:

I just mean giving the import duration a reference value. But what is the import duration, and how does a user obtain it? The user would have to do a trial run and then find it in the log.

A member commented:

Will we give users a series of different recommended values for different deployments later?

@kennytm (Collaborator) commented on Jan 14, 2019:

We can explain how to choose the best value in the docs and to OPS. But we don't expect users to change these values unless they want to do some heavy optimization.

invBetaNR := math.Exp(logGammaNPlusR - logGammaN - logGammaR) // 1/B(N, R) = Γ(N+R)/Γ(N)Γ(R)
for {
	if n <= 0 || n > tableConcurrency {
		n = tableConcurrency
A collaborator commented:

Should we check whether the user has set an unreasonable table concurrency?

A collaborator added:

In addition, it may be better to compute n from the given table concurrency.

A collaborator replied:

We can't directly set n = tableConcurrency, as this may produce engine sizes that are too large or too small.

Example of too large: an 8T table with table-conc = 8, forcing each batch to be ~1T and putting pressure on the importer.
Example of too small: a 200G table with table-conc = 8, forcing each batch to be ~25G, making the data sent to TiKV less sorted and increasing the compaction cost.

@IANTHEREAL (Collaborator) commented on Jan 14, 2019:

I mean we compute n from the given batchImportRatio and batchSize. What would happen if batchSize is unreasonable (e.g. too small) and the table concurrency is also small? Will the algorithm degenerate into nearly sequential imports?

// ≲ N/(1-R)
//
// We use a simple brute force search since the search space is extremely small.
ratio := totalDataFileSize * (1 - batchImportRatio) / batchSize
A collaborator commented:

Why not compute N here directly, and just reduce the batch size?

A collaborator replied:

Because there's no simple formula to solve for N in X = N - 1/Beta(N, R) 😅.

A collaborator added:

N lies in a limited range of values, so maybe we could use a heuristic.
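
For reference, a self-contained Go sketch of the brute-force search discussed in this thread, pieced together from the two snippets quoted above. The function name, parameter list, and example values are assumptions rather than the exact code in lightning/mydump/region.go; the search looks for the smallest N with N - 1/Beta(N, R) >= Total*(1-R)/B₁, capping N at the table concurrency and shrinking the base batch size when needed:

package main

import (
	"fmt"
	"math"
)

// searchEngineCount brute-forces the engine count n and, when n full batches
// would overshoot the table, shrinks the base batch size so the last batch
// is not truncated.
func searchEngineCount(totalSize, batchSize, importRatio, tableConcurrency float64) (n, curBatchSize float64) {
	curBatchSize = batchSize
	ratio := totalSize * (1 - importRatio) / batchSize
	n = math.Ceil(ratio)

	logGammaNPlusR, _ := math.Lgamma(n + importRatio)
	logGammaN, _ := math.Lgamma(n)
	logGammaR, _ := math.Lgamma(importRatio)
	invBetaNR := math.Exp(logGammaNPlusR - logGammaN - logGammaR) // 1/B(N, R) = Γ(N+R)/Γ(N)Γ(R)

	for {
		if n <= 0 || n > tableConcurrency {
			// not enough engines available: fall back to the concurrency limit
			n = tableConcurrency
			break
		}
		realRatio := n - invBetaNR
		if realRatio >= ratio {
			// n engines at the configured batch size would overshoot the table,
			// so reduce the batch size to keep the pipeline smooth
			curBatchSize = totalSize * (1 - importRatio) / realRatio
			break
		}
		invBetaNR *= 1 + importRatio/n // Γ(X+1) = X·Γ(X)
		n += 1.0
	}
	return n, curBatchSize
}

func main() {
	// hypothetical input: a 1 TiB table, B₁ = 100 GiB, R = 0.75, table-concurrency = 6
	const gib = float64(1 << 30)
	n, b := searchEngineCount(1024*gib, 100*gib, 0.75, 6)
	fmt.Printf("engines: %.0f, adjusted base batch size: %.1f GiB\n", n, b/gib)
}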

@IANTHEREAL (Collaborator) commented

LGTM

@IANTHEREAL added the status/LGT1 label and removed the status/PTAL label on Jan 14, 2019
@csuzhangxc (Member) commented

LGTM

@csuzhangxc added the status/LGT2 label and removed the status/LGT1 label on Jan 14, 2019
@kennytm added the Should Update Ansible label on Jan 14, 2019
@kennytm (Collaborator) commented on Jan 14, 2019

/run-all-tests

@kennytm merged commit 8ba38ed into kennytm/batching on Jan 14, 2019
kennytm added a commit that referenced this pull request on Jan 14, 2019:
* config,restore: introduced `[mydumper] batch-size`

Removed `[tikv-importer] batch-size` to avoid confusion.

Removed `[mydumper] min-region-size` since it is useless now.

* restore,mydump: pre-allocate engine IDs

* restore: separate table checkpoints and engine checkpoints

* importer: stop exposing the UUID

* checkpoints: make checkpoint diff understand 1 table = many engines

* checkpoints: make file checkpoints recognize multiple engines

* checkpoints: migrated MySQL-based checkpoint to multi-engine as well

* restore: adapt restore workflow for multi-engine

* tests: added test case for multi-engine

* *: fixed code

* *: addressed comments

* *: addressed comments

* Support non-uniform batch size (#114)

* mydump: non-uniform batch size

* *: make the `batch-size-scale` configurable

* *: implemented the optimized non-uniform strategy

* tests: due to change of strategy, checkpoint_engines count becomes 4 again

* mydump/region: slightly adjust the batch size computation

* Use the exact result of 1/Beta(N, R) instead of an approximation
* When the number of engines is small and the total engine size of the
  first (table-concurrency) batches exceeds the table size, the last batch
  was truncated, disrupting the pipeline. Now in this case we reduce the
  batch size to avoid the disruption.

* restore: log the SQL size and KV size of each engine for debugging

* config: change default batch size and ratio given experiment result

* config: added more explanation about batch-import-ratio

Co-authored-by: Lonng <chris@lonng.org>
@lonng deleted the lonng/dynamic-batch-size branch on Mar 5, 2019 07:50
@kennytm removed the Should Update Docs label on Mar 11, 2019
@kennytm removed the Should Update Ansible label on May 2, 2019
Labels: status/LGT2, type/enhancement
5 participants