
Pre-split region before it becomes a write hotspot to avoid the TiKV 'server is busy' error #16573

Closed
tiancaiamao opened this issue Apr 18, 2020 · 3 comments · Fixed by #16920

@tiancaiamao
Contributor

This issue was found in the large transaction test case.

https://github.com/pingcap/tidb-test/pull/1021/files#diff-424b95097a5cbed11ebaebcc55adea53R24

Prepare a table:

create table if not exists sbtest2 (
	id int(11) not null primary key,
	k int(11) not null,
	c char(120) not null default '',
	pad char(255) not null default '',
	index i_k(k));

Prepare some data:

create table if not exists sbtest1 (
	id int(11) not null primary key,
	k int(11) not null,
	c char(120) not null default '',
	pad char(255) not null default '');

load data local infile 'data.txt' into table sbtest1;

The data file is generated here: https://github.com/pingcap/tidb-test/pull/1021/files#diff-720b9eda508e29a7e8e7f6f7725bd8e9R23
It contains about 30,000,000 records, and the value range for k is small (random 0~512).

Then execute:

insert into sbtest2 select * from sbtest1;

During the 2PC (two-phase commit) prewrite, there are a lot of mutations on the index regions.
Because the k range is small, nearly all the write workload lands on a single region, and that region becomes a hotspot.
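A minimal sketch of how this skew can be seen: group the prewrite mutation keys by the region that covers them and count per region. The `region` struct, `locate`, and `countByRegion` below are hypothetical simplifications for illustration, not TiDB's actual types:

```go
package main

import (
	"fmt"
	"sort"
)

// region is a simplified stand-in for a TiKV region: it owns the key range
// from its startKey up to the next region's startKey. These names are
// hypothetical and do not match TiDB's internal types.
type region struct {
	id       uint64
	startKey string
}

// locate returns the id of the region covering key. It assumes regions are
// sorted by startKey and the first region's startKey is "" (covers everything).
func locate(regions []region, key string) uint64 {
	i := sort.Search(len(regions), func(i int) bool { return regions[i].startKey > key })
	return regions[i-1].id
}

// countByRegion tallies how many mutation keys fall into each region.
func countByRegion(regions []region, keys []string) map[uint64]int {
	counts := make(map[uint64]int)
	for _, k := range keys {
		counts[locate(regions, k)]++
	}
	return counts
}

func main() {
	regions := []region{
		{id: 1, startKey: ""},
		{id: 2, startKey: "i_k_100"},
		{id: 3, startKey: "i_k_200"},
	}
	// Index keys for a k column whose values cluster in a small range:
	// every key sorts before "i_k_100", so one region takes all the writes.
	var keys []string
	for i := 0; i < 1000; i++ {
		keys = append(keys, fmt.Sprintf("i_k_%03d", i%50))
	}
	counts := countByRegion(regions, keys)
	fmt.Println(counts[1], counts[2], counts[3]) // 1000 0 0
}
```

Because nearly all index keys encode into one narrow range, a single region's count dominates the total, which is exactly the hotspot described above.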

What follows is region splitting: TiDB gets a large number of RegionMiss errors, along with ServerIsBusy errors.

[2020/04/11 00:01:19.159 +08:00] [WARN] [backoff.go:309] ["serverBusy backoffer.maxSleep 82000ms is exceeded, errors:\nserver is busy, ctx: region ID: 6774, meta: id:6774 start_key:\"t\\200\\000\\000\\000\\000\\000\\001%_r\\200\\000\\000\\000\\000'\\333\\212\" end_key:\"t\\200\\000\\000\\000\\000\\000\\001%_r\\200\\000\\000\\000\\000.HP\" region_epoch:<conf_ver:1 version:973 > peers:<id:6775 store_id:1 > , peer: id:6775 store_id:1 , addr: 192.168.123.10:20160, idx: 0 at 2020-04-11T00:01:04.396402881+08:00\nserver is busy, ctx: region ID: 6774, meta: id:6774 start_key:\"t\\200\\000\\000\\000\\000\\000\\001%_r\\200\\000\\000\\000\\000'\\333\\212\" end_key:\"t\\200\\000\\000\\000\\000\\000\\001%_r\\200\\000\\000\\000\\000.HP\" region_epoch:<conf_ver:1 version:973 > peers:<id:6775 store_id:1 > , peer: id:6775 store_id:1 , addr: 192.168.123.10:20160, idx: 0 at 2020-04-11T00:01:13.271017372+08:00\nserver is busy, ctx: region ID: 6774, meta: id:6774 start_key:\"t\\200\\000\\000\\000\\000\\000\\001%_r\\200\\000\\000\\000\\000'\\333\\212\" end_key:\"t\\200\\000\\000\\000\\000\\000\\001%_r\\200\\000\\000\\000\\000.HP\" region_epoch:<conf_ver:1 version:973 > peers:<id:6775 store_id:1 > , peer: id:6775 store_id:1 , addr: 192.168.123.10:20160, idx: 0 at 2020-04-11T00:01:19.159637792+08:00"]
[2020/04/11 00:01:19.159 +08:00] [WARN] [backoff.go:309] ["serverBusy backoffer.maxSleep 82000ms is exceeded, errors:\nserver is busy, ctx: region ID: 7034, meta: id:7034 start_key:\"t\\200\\000\\000\\000\\000\\000\\001%_r\\200\\000\\000\\000\\0009\\332f\" end_key:\"t\\200\\000\\000\\000\\000\\000\\001%_r\\200\\000\\000\\000\\000>\\377\\263\" region_epoch:<conf_ver:1 version:975 > peers:<id:7035 store_id:1 > , peer: id:7035 store_id:1 , addr: 192.168.123.10:20160, idx: 0 at 2020-04-11T00:01:00.725425161+08:00\nserver is busy, ctx: region ID: 7034, meta: id:7034 start_key:\"t\\200\\000\\000\\000\\000\\000\\001%_r\\200\\000\\000\\000\\0009\\332f\" end_key:\"t\\200\\000\\000\\000\\000\\000\\001%_r\\200\\000\\000\\000\\000>\\377\\263\" region_epoch:<conf_ver:1 version:975 > peers:<id:7035 store_id:1 > , peer: id:7035 store_id:1 , addr: 192.168.123.10:20160, idx: 0 at 2020-04-11T00:01:10.502528266+08:00\nserver is busy, ctx: region ID: 7034, meta: id:7034 start_key:\"t\\200\\000\\000\\000\\000\\000\\001%_r\\200\\000\\000\\000\\0009\\332f\" end_key:\"t\\200\\000\\000\\000\\000\\000\\001%_r\\200\\000\\000\\000\\000>\\377\\263\" region_epoch:<conf_ver:1 version:975 > peers:<id:7035 store_id:1 > , peer: id:7035 store_id:1 , addr: 192.168.123.10:20160, idx: 0 at 2020-04-11T00:01:19.159771134+08:00"]
[2020/04/11 00:01:19.161 +08:00] [WARN] [region_request.go:370] ["tikv reports `ServerIsBusy` retry later"] [reason="scheduler is busy"] [ctx="region ID: 6826, meta: id:6826 start_key:\"t\\200\\000\\000\\000\\000\\000\\001%_r\\200\\000\\000\\000\\001T\\261\\202\" end_key:\"t\\200\\000\\000\\000\\000\\000\\001%_r\\200\\000\\000\\000\\001[\\036I\" region_epoch:<conf_ver:1 version:999 > peers:<id:6827 store_id:1 > , peer: id:6827 store_id:1 , addr: 192.168.123.10:20160, idx: 0"]
[2020/04/11 00:01:19.161 +08:00] [WARN] [region_request.go:370] ["tikv reports `ServerIsBusy` retry later"] [reason="scheduler is busy"] [ctx="region ID: 7056, meta: id:7056 start_key:\"t\\200\\000\\000\\000\\000\\000\\001%_r\\200\\000\\000\\000\\000\\271!?\" end_key:\"t\\200\\000\\000\\000\\000\\000\\001%_r\\200\\000\\000\\000\\000\\276F\\215\" region_epoch:<conf_ver:1 version:986 > peers:<id:7057 store_id:1 > , peer: id:7057 store_id:1 , addr: 192.168.123.10:20160, idx: 0"]

[screenshot]

All the KV-related metrics become abnormal because clients can't locate the correct region:

[screenshot]

TiKV is overloaded when the write workload concentrates on a single region:

[screenshot]

[screenshot]

A related problem is the concurrency rate limit in #15794; a higher concurrency rate makes the situation worse.

Development Task

During 2PC, if we detect a lot of mutations on a single region, that region is likely to become a hotspot. Maybe we can pre-split the region to avoid the TiKV 'server is busy' error.
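One way the pre-split could work, sketched under the assumption that split keys are sampled from the batch's own sorted mutation keys. `pickSplitKeys` and its threshold are hypothetical illustrations, not the actual implementation in #16920:

```go
package main

import (
	"fmt"
	"sort"
)

// pickSplitKeys sketches the pre-split idea: if a batch holds more than
// threshold mutations destined for one region, return n-1 evenly spaced
// keys from the sorted mutation keys, so the hot range can be split into
// n roughly equal regions before the prewrite traffic lands on it.
// This is a hypothetical helper, not TiDB's real API.
func pickSplitKeys(keys []string, threshold, n int) []string {
	if len(keys) <= threshold || n < 2 {
		return nil // small batch: no pre-split needed
	}
	sorted := append([]string(nil), keys...)
	sort.Strings(sorted)
	splits := make([]string, 0, n-1)
	for i := 1; i < n; i++ {
		// Evenly spaced sample points divide the mutations into n buckets.
		splits = append(splits, sorted[i*len(sorted)/n])
	}
	return splits
}

func main() {
	var keys []string
	for i := 0; i < 1000; i++ {
		keys = append(keys, fmt.Sprintf("i_k_%04d", i))
	}
	fmt.Println(pickSplitKeys(keys, 100, 4)) // [i_k_0250 i_k_0500 i_k_0750]
}
```

The returned keys would then be handed to something like the `SplittableStore` interface used by `splitTableRegion` below; sampling from the actual mutations (rather than splitting the key space uniformly) keeps each resulting region's write load roughly equal even when the values are skewed.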

The 2PC prewrite code is here:

req := c.buildPrewriteRequest(batch, txnSize)

Some code related to splitting a table region:

func splitTableRegion(store kv.SplittableStore, tbInfo *model.TableInfo, scatter bool) {
It is used by the SPLIT REGION statement: https://pingcap.com/docs/stable/reference/sql/statements/split-region/

@tiancaiamao tiancaiamao added the type/enhancement The issue or PR belongs to an enhancement. label Apr 18, 2020
@tiancaiamao tiancaiamao added this to the v4.0.0-ga milestone Apr 18, 2020
@tiancaiamao
Contributor Author

Would you like to take a look, @nrc?
/cc @AndreMouche @youjiali1995

@nrc
Contributor

nrc commented Apr 18, 2020

Would you like to take a look, @nrc?

Sure, I can look into it Thursday or Friday next week.

@zhangjinpeng87
Contributor

@tiancaiamao Is it also beneficial when loading data with a transaction size of 1 MB but a concurrency of 100?
