Lightning: The time to import data into Lightning significantly extends with the increase in the number of databases #55054

Closed
shaoxiqian opened this issue Jul 30, 2024 · 1 comment · Fixed by #55230
Assignees
Labels
component/lightning This issue is related to Lightning of TiDB. type/enhancement The issue or PR belongs to an enhancement.

Comments

@shaoxiqian
Contributor

Lightning version: nightly
Cluster version: nightly
Cluster topology: 1 TiDB (16c48g) + 6 TiKV (24c48g + 1500G)

Lightning config (TOML):

[lightning]
level = "info"
check-requirements = false
status-addr = ':8289'
index-concurrency = 64
table-concurrency = 64
io-concurrency = 32
region-concurrency = 64

[tikv-importer]
backend = "local"
incremental-import = true
sorted-kv-dir = "/tiup/sort"
range-concurrency = 64

[tidb]
# Information of the target cluster
port = 4000
user = "root"
password = ""
host = "tidb-1-peer"
status-port = 10080
pd-addr = "pd-peer:2379"
build-stats-concurrency = 20
distsql-scan-concurrency = 15
index-serial-scan-concurrency = 20
checksum-table-concurrency = 2

[mydumper]
no-schema = true
data-source-dir = 's3://xxx/xxx/50k-195000?access-key=xxx&secret-access-key=xxx&endpoint=http://xxx.com&force-path-style=false&region=Beijing&provider=ks'
[mydumper.csv]
header = false

[checkpoint]
# Whether to enable checkpoints.
enable = true
driver = "file"

[post-restore]
checksum = false
analyze = false

[conflict]
strategy = "replace"

Time to import data into the cluster:

2000 databases take 8-9 minutes.

5000 databases take 1-2.5 hours: ["the whole procedure completed"] [takeTime=2h16m33.498780422s] []
10000 databases take 7-8 hours, and the import failed: ["the whole procedure failed"] [takeTime=7h56m44.007541687s]

Each database has 5 tables, with 3 rows per table.
Import time grows much faster than the number of databases: going from 2000 to 10000 databases (5x) raises the import time from 8-9 minutes to 7-8 hours (roughly 50x), even though each table holds only 3 rows.

@shaoxiqian shaoxiqian added the type/bug The issue is confirmed as a bug. label Jul 30, 2024
@shaoxiqian shaoxiqian changed the title Lightning:The time to import data into Lightning significantly extends with the increase in the number of libraries Lightning:The time to import data into Lightning significantly extends with the increase in the number of databases Jul 30, 2024
@jebter jebter added severity/major component/lightning This issue is related to Lightning of TiDB. feature/developing the related feature is in development labels Jul 31, 2024
@D3Hunter
Contributor

D3Hunter commented Aug 6, 2024

Since #42136, table info is stored as part of the checkpoint. In the 5k-database case from this issue, the checkpoint file is about 71M, and it becomes the bottleneck because the data size of each table is quite small.

As a workaround, we can disable checkpoints or use the mysql driver (in that case we only save the changed content).
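
For readers who want to try these two workarounds, here is a minimal sketch of the [checkpoint] section. The `enable` and `driver` keys appear in the config posted in this issue. The note about `dsn` reflects my understanding of Lightning's documented checkpoint options and is an assumption, so check it against the configuration reference for your version.

```toml
# Option A: disable checkpoints entirely.
# Trade-off: an interrupted import cannot resume from where it stopped.
# [checkpoint]
# enable = false

# Option B: keep checkpoints, but store them via the mysql driver,
# which (per the comment above) saves only the changed content.
[checkpoint]
enable = true
driver = "mysql"
# Assumption: with no dsn set, checkpoints are stored in the cluster from
# the [tidb] section. A dsn can point at another MySQL-compatible server.
# dsn = "USER:PASS@tcp(HOST:PORT)/"
```

Only one of the two options should be active at a time, since a TOML file cannot define the [checkpoint] table twice.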

Putting the checkpoint file on a faster disk would help a little.
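
If the file driver is kept, a sketch of relocating the checkpoint file might look like the following. The `dsn` key is not set in the original config, and the path is a hypothetical placeholder. To my understanding, the file driver treats `dsn` as the path of the checkpoint file, so treat this as an assumption to verify.

```toml
[checkpoint]
enable = true
driver = "file"
# Hypothetical path on faster storage (for example a local NVMe mount);
# this path is illustrative and does not come from the issue.
dsn = "/fast-disk/tidb_lightning_checkpoint.pb"
```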

@D3Hunter D3Hunter added type/enhancement The issue or PR belongs to an enhancement. and removed type/bug The issue is confirmed as a bug. severity/major feature/developing the related feature is in development labels Aug 7, 2024
ti-chi-bot bot pushed a commit that referenced this issue Aug 8, 2024