
The speed of dumping data from TiDB to S3 drops from 2+ Gbit/s to 200 Mbit/s #223

Closed
cyliu0 opened this issue Dec 26, 2020 · 3 comments
Labels
priority/P2 Medium priority issue

Comments


cyliu0 commented Dec 26, 2020

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.
    Dump 4.2 TB of TPC-C data from TiDB to S3 with Dumpling.

  2. What did you expect to see?
    A stable dump speed.

  3. What did you see instead?
    The speed drops from 2+ Gbit/s to 200 Mbit/s.

  4. Versions of the cluster

    • Dumpling version (run dumpling -V):

      # ./bin/dumpling -V
      Release version: v4.0.9-2-g34903c7
      Git commit hash: 34903c7475cfbfc715c03e0a4e5029c640f33234
      Git branch:      master
      Build timestamp: 2020-12-25 12:35:36Z
      Go version:      go version go1.15.6 linux/amd64
      
    • Source database version (execute SELECT version(); in a MySQL client):

      TiDB v4.0.9
      
    • Other interesting information (system version, hardware config, etc):

      The Dumpling host is an AWS EC2 c5.9xlarge instance running Ubuntu.

A 60-second flame graph captured during the 200 Mbit/s period: flame-graph.tar.gz

  1. Operation logs
    • Please upload dumpling.log for Dumpling if possible
root@tidb-lightning-ln57s:~/dumpling# ./bin/dumpling -h xxxx -P 4000 -u root -p 'xxxx' -o "s3://xxxxx" -t 32 -r 20000 -F 256M | tee dumpling.log
Release version: v4.0.9-2-g34903c7
Git commit hash: 34903c7475cfbfc715c03e0a4e5029c640f33234
Git branch:      master
Build timestamp: 2020-12-25 12:35:36Z
Go version:      go version go1.15.6 linux/amd64
[2020/12/25 14:47:55.485 +08:00] [INFO] [config.go:598] ["detect server type"] [type=TiDB]
[2020/12/25 14:47:55.485 +08:00] [INFO] [config.go:617] ["detect server version"] [version=4.0.9]
{"level":"warn","ts":"2020-12-25T14:48:05.702+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"xxxxxxxxxxxxxxxxxxxxxxxxxxxx","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection closed"}
[2020/12/25 14:48:05.702 +08:00] [WARN] [dump.go:699] ["meet error while check whether fetched pd addr and TiDB belongs to one cluster"] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20200917111840-a15ef68f753d/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20200917111840-a15ef68f753d/juju_adaptor.go:15\ngithub.com/pingcap/dumpling/v4/export.getPdDDLIDs\n\tgithub.com/pingcap/dumpling/v4/export/util.go:24\ngithub.com/pingcap/dumpling/v4/export.checkSameCluster\n\tgithub.com/pingcap/dumpling/v4/export/util.go:46\ngithub.com/pingcap/dumpling/v4/export.tidbSetPDClientForGC\n\tgithub.com/pingcap/dumpling/v4/export/dump.go:697\ngithub.com/pingcap/dumpling/v4/export.runSteps\n\tgithub.com/pingcap/dumpling/v4/export/dump.go:607\ngithub.com/pingcap/dumpling/v4/export.NewDumper\n\tgithub.com/pingcap/dumpling/v4/export/dump.go:55\nmain.main\n\tcommand-line-arguments/main.go:65\nruntime.main\n\truntime/proc.go:204\nruntime.goexit\n\truntime/asm_amd64.s:1374"] [pdAddrs="[db-pd-0.db-pd-peer.tidb1341928121892343808.svc:2379,db-pd-1.db-pd-peer.tidb1341928121892343808.svc:2379,db-pd-2.db-pd-peer.tidb1341928121892343808.svc:2379]"]
[2020/12/25 14:48:05.704 +08:00] [WARN] [dump.go:745] ["If the amount of data to dump is large, criteria: (data more than 60GB or dumped time more than 10 minutes)\nyou'd better adjust the tikv_gc_life_time to avoid export failure due to TiDB GC during the dump process.\nBefore dumping: run sql `update mysql.tidb set VARIABLE_VALUE = '720h' where VARIABLE_NAME = 'tikv_gc_life_time';` in tidb.\nAfter dumping: run sql `update mysql.tidb set VARIABLE_VALUE = '10m' where VARIABLE_NAME = 'tikv_gc_life_time';` in tidb.\n"]
[2020/12/25 14:48:06.779 +08:00] [WARN] [dump.go:419] ["no data to dump"] [schema=test] [table=test]
[2020/12/25 14:48:06.780 +08:00] [INFO] [dump.go:337] ["get estimated rows count"] [estimateCount=10000]
[2020/12/25 14:48:19.540 +08:00] [INFO] [dump.go:337] ["get estimated rows count"] [estimateCount=1800000000]
[2020/12/25 15:18:12.837 +08:00] [INFO] [dump.go:337] ["get estimated rows count"] [estimateCount=600000]
[2020/12/25 15:18:17.244 +08:00] [INFO] [dump.go:337] ["get estimated rows count"] [estimateCount=1800000000]
[2020/12/25 15:26:48.026 +08:00] [INFO] [dump.go:337] ["get estimated rows count"] [estimateCount=100000]
[2020/12/25 15:26:50.009 +08:00] [INFO] [dump.go:337] ["get estimated rows count"] [estimateCount=540000000]
[2020/12/25 15:28:59.700 +08:00] [INFO] [dump.go:337] ["get estimated rows count"] [estimateCount=10000]
[2020/12/25 15:29:02.412 +08:00] [INFO] [dump.go:337] ["get estimated rows count"] [estimateCount=1800000000]
[2020/12/25 15:37:28.753 +08:00] [INFO] [dump.go:337] ["get estimated rows count"] [estimateCount=6000000000]
[2020/12/25 16:40:00.101 +08:00] [INFO] [dump.go:337] ["get estimated rows count"] [estimateCount=60000]
[2020/12/26 08:27:25.005 +08:00] [INFO] [collector.go:212] ["backup Success summary: total backup ranges: 597051, total success: 597051, total failed: 0, total take(backup time): 17h39m18.282210068s, total take(real time): 17h39m18.282233836s, total size(MB): 4649601.89, avg speed(MB/s): 73.15, total rows: 29940972129"]
[2020/12/26 08:27:25.123 +08:00] [INFO] [main.go:78] ["dump data successfully, dumpling will exit now"]
@lichunzhu (Contributor)

[screenshot of the dumped data files]
After checking the dumped data, nearly 1.5 TB of it was dumped sequentially. That's why the speed of dumping data from TiDB to S3 drops from 2+ Gbit/s to 200 Mbit/s.

In a local test, I inserted 200 records without running ANALYZE and got a similarly wrong estimated count:
[2020/12/28 14:59:34.669 +08:00] [INFO] [dump.go:337] ["get estimated rows count"] [estimateCount=10000]
Since the parameter is -r 20000 and the estimated count (10000) is below it, Dumpling treats the table as small and dumps it sequentially, in a single chunk.

In a nutshell, an incorrect estimated row count causes this result; a sketch of the decision follows.
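To make the failure mode concrete, here is a minimal Go sketch of the splitting decision described above. The names (shouldConcurrentDump, rowsPerChunk) are hypothetical illustrations, not Dumpling's actual internals; the assumption, taken from this comment, is that a table is split into concurrent chunks only when its estimated row count exceeds the -r value.

    package main

    import "fmt"

    // shouldConcurrentDump reports whether a table would be split into
    // concurrent chunks. estimatedRows comes from the database's statistics
    // (the estimateCount in the logs above); rowsPerChunk is the -r flag.
    func shouldConcurrentDump(estimatedRows, rowsPerChunk uint64) bool {
        return rowsPerChunk > 0 && estimatedRows > rowsPerChunk
    }

    func main() {
        // The local repro: stale statistics report 10000 rows and -r is 20000,
        // so the table is dumped in a single chunk on one connection even if
        // it actually holds billions of rows.
        fmt.Println(shouldConcurrentDump(10000, 20000)) // false -> sequential dump
    }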

@IANTHEREAL (Collaborator)

Can we close it now?


lichunzhu commented Jul 22, 2021

TABLESAMPLE and table regions will solve this problem for TiDB v3.0+.
But for TiDB < v3.0 and MySQL, running SELECT COUNT(*) instead of using the estimated count would cost too much, so we won't add further support for this.
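As a practical aside implied by the "without running ANALYZE" repro above (a sketch, not a fix proposed in this thread): running ANALYZE TABLE before the dump refreshes the statistics that feed the estimate, so the chunk-splitting check sees the real table size. The DSN and table name below are placeholders.

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/go-sql-driver/mysql" // MySQL-protocol driver; also speaks to TiDB
    )

    func main() {
        // Placeholder DSN: adjust host, port, and credentials for your cluster.
        db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/test")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Refresh statistics so the estimated row count used for chunk
        // splitting reflects the real table size. Placeholder table name.
        if _, err := db.Exec("ANALYZE TABLE test.test"); err != nil {
            log.Fatal(err)
        }
        log.Println("statistics refreshed; re-run dumpling with -r for a concurrent dump")
    }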
