Find a self-adaptive way to control the rate limit in the 2PC writing #15794

Open

tiancaiamao opened this issue Mar 28, 2020 · 1 comment

Labels: sig/transaction, type/enhancement, type/performance

tiancaiamao (Contributor) commented:

Development Task

Problem description:

In the prewrite and commit phases, the key-value mutations are split into batches and sent to TiKV.
The requests for each region are sent concurrently.

The peak throughput of the TiKV server is limited, so we have to set a rate limit.
There is a TODO about it in the code:

	// Set rateLim here for the large transaction.
	// If the rate limit is too high, tikv will report service is busy.
	// If the rate limit is too low, we can't fully utilize the tikv's throughput.
	// TODO: Find a self-adaptive way to control the rate limit here.
	if rateLim > 32 {
		rateLim = 32
	}
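
For reference, here is a minimal sketch of what this limit controls, assuming a plain counting-semaphore style limiter; the names (batchLimiter, acquire, release) are illustrative and do not mirror the actual TiDB implementation:

	// Minimal sketch of a counting-semaphore limiter: rateLim bounds how many
	// batch requests may be in flight at the same time.
	package main

	import (
		"fmt"
		"sync"
	)

	type batchLimiter struct {
		tokens chan struct{}
	}

	func newBatchLimiter(limit int) *batchLimiter {
		return &batchLimiter{tokens: make(chan struct{}, limit)}
	}

	// acquire blocks until a slot is free, so at most `limit` batches run concurrently.
	func (l *batchLimiter) acquire() { l.tokens <- struct{}{} }
	func (l *batchLimiter) release() { <-l.tokens }

	func main() {
		lim := newBatchLimiter(32) // the hard-coded upper bound from the snippet above
		var wg sync.WaitGroup
		for i := 0; i < 100; i++ {
			lim.acquire()
			wg.Add(1)
			go func(batch int) {
				defer wg.Done()
				defer lim.release()
				fmt.Println("sending batch", batch) // stand-in for one prewrite/commit RPC
			}(i)
		}
		wg.Wait()
	}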

During the test, I found the value of this rate limit is crucial.
If it's too large, TiKV reports "Server is busy", and when that happens it can make the whole cluster unavailable. When TiDB gets "Server is busy" from TiKV, it drops the region cache, and then those metrics become abnormal:

[image: screenshot of the abnormal metrics]

There are a lot of error logs in the TiDB server like this:

[2020/03/27 07:40:42.529 +08:00] [INFO] [region_cache.go:369] ["invalidate current region, because others failed on same store"] [region=1180] [store=192.168.123.10:20160]
[2020/03/27 07:40:42.530 +08:00] [INFO] [region_cache.go:369] ["invalidate current region, because others failed on same store"] [region=1168] [store=192.168.123.10:20160]
[2020/03/27 07:40:42.530 +08:00] [INFO] [region_cache.go:369] ["invalidate current region, because others failed on same store"] [region=1276] [store=192.168.123.10:20160]
[2020/03/27 07:40:42.530 +08:00] [INFO] [region_cache.go:369] ["invalidate current region, because others failed on same store"] [region=1286] [store=192.168.123.10:20160]
[2020/03/27 07:40:42.530 +08:00] [INFO] [region_cache.go:369] ["invalidate current region, because others failed on same store"] [region=1202] [store=192.168.123.10:20160]
tidb.log:[2020/03/27 21:29:18.819 +08:00] [WARN] [backoff.go:309] ["tikvRPC backoffer.maxSleep 20000ms is exceeded, errors:\nsend tikv request error: context deadline exceeded, ctx: region ID: 1510, meta: id:1510 start_key:\"t\\200\\000\\000\\000\\000\\000\\000y_r\\200\\000\\000\\000\\000{4\\200\" end_key:\"t\\200\\000\\000\\000\\000\\000\\000y_r\\200\\000\\000\\000\\000\\211\\332\\200\" region_epoch:<conf_ver:1 version:343 > peers:<id:1511 store_id:1 > , peer: id:1511 store_id:1 , addr: 192.168.123.10:20160, idx: 0, try next peer later at 2020-03-27T21:28:56.973809386+08:00\nepoch_not_match:<>  at 2020-03-27T21:28:58.309201246+08:00\nsend tikv request error: context deadline exceeded, ctx: region ID: 1510, meta: id:1510 start_key:\"t\\200\\000\\000\\000\\000\\000\\000y_r\\200\\000\\000\\000\\000{4\\200\" end_key:\"t\\200\\000\\000\\000\\000\\000\\000y_r\\200\\000\\000\\000\\000\\211\\332\\200\" region_epoch:<conf_ver:1 version:343 > peers:<id:1511 store_id:1 > , peer: id:1511 store_id:1 , addr: 192.168.123.10:20160, idx: 0, try next peer later at 2020-03-27T21:29:18.819898485+08:00"]

Finally, the schema loading process may fail, and then every read request fails.
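
One possible direction (only a sketch, not the concrete proposal of this issue) is an AIMD-style adjustment: shrink the limit multiplicatively whenever TiKV returns "Server is busy", and grow it additively while requests succeed, so the limit settles near TiKV's actual capacity. The names below (adaptiveLimit, onResult) are hypothetical:

	// Hypothetical AIMD-style policy for the 2PC batch concurrency limit.
	package main

	import "fmt"

	type adaptiveLimit struct {
		cur, min, max int
	}

	// onResult adjusts the limit after each batch: multiplicative decrease on a
	// "Server is busy" error, additive increase on success.
	func (a *adaptiveLimit) onResult(serverBusy bool) int {
		if serverBusy {
			a.cur /= 2
			if a.cur < a.min {
				a.cur = a.min
			}
		} else {
			a.cur++
			if a.cur > a.max {
				a.cur = a.max
			}
		}
		return a.cur
	}

	func main() {
		lim := adaptiveLimit{cur: 32, min: 1, max: 128}
		fmt.Println(lim.onResult(true))  // 16 after a busy error
		fmt.Println(lim.onResult(false)) // 17 after a subsequent success
	}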

tiancaiamao added the type/enhancement label on Mar 28, 2020

tiancaiamao (Contributor, Author) commented:

Please also keep an eye on this issue. @youjiali1995 @lysu
