PS-3410 : LP #1570114: Long running ALTER TABLE ADD INDEX causes sema… #3143
b11e341 to 088dd77
Submitted a Jenkins job with --big-test on the 5.7 pipeline: https://ps57.cd.percona.com/job/percona-server-5.7-pipeline/923/
Can you show the point where the INSERT thread tried to acquire the index latch in S mode?
Also, should 5.6 be fixed too?
No, bulk load (WL#7277) was added only in 5.7.
Thread 26 (Thread 0x7ffff0bf9700 (LWP 15407)):
948 const bool check = !index->is_committed();
088dd77 to 7ebaef8
I am still having trouble wrapping my head around why it is safe not to X-latch the index lock here, given the comment at row_ins_sec_index_entry_low:
/* Ensure that we acquire index->lock when inserting into an
index with index->online_status == ONLINE_INDEX_COMPLETE, but
could still be subject to rollback_inplace_alter_table().
This prevents a concurrent change of index->online_status.
and upstream fix at f2f7d43
I am not removing the index->lock acquisition here (in row_ins_sec_index_entry_low()); it is still there. As the comment says, index->lock is necessary when index->online_status is read or changed, and the bulk load index build doesn't change index->online_status at all (from BtrBulk::init() to BtrBulk::finish()). Please check.
OK, looks good; only the testcase question remains.
…phore wait > 600 assertion
Problem:
A long-running ALTER TABLE ADD INDEX with concurrent inserts causes semaphore waits and
eventually crashes the server.
To see this problem you need:
1. A table with lots of data. ADD INDEX should take significant time and create many pages.
2. A compressed table. This is because CPU time is spent in compress() while the mtr already holds index->lock.
The more time each mtr spends, the longer the INSERT waits, which makes the crash more likely.
3. Concurrent inserts while the ALTER is running. The inserts should happen specifically after the read phase
of the ALTER and after the bulk load index build (bottom-up build) has started.
The entire bulk load process holds index->lock in X mode for the whole duration of the bottom-up index build. The index->lock is held across mtrs (because many pages are created during the index build).
An example: the Page 1 mtr latches index->lock in X mode. When the page is full, a sibling page is created.
The sibling Page 2 mtr also acquires index->lock in X mode.
Recursive X latching is allowed within the same thread. Now the Page 1 mtr commits, but index->lock is still held by the Page 2 mtr. When Page 2 is full, another sibling page is created, and the sibling Page 3 mtr acquires index->lock in X mode. Then the Page 2 mtr commits. This goes on and on, and the same happens with pages at non-root levels.
Essentially, the time index->lock is held is directly proportional to the number of pages/mtrs created. Compressed tables make each mtr take a bit longer because of the time spent in compress().
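The chaining described above can be illustrated with a small toy model. This is only a sketch: `RecursiveXLatch` and `free_windows_during_build` are hypothetical names invented for illustration, not InnoDB types or functions, and the latch is modeled as a plain recursion counter owned by a single thread.

```cpp
#include <cassert>
#include <cstddef>

// Toy recursive X latch: the owning thread may acquire it repeatedly, and it
// only becomes free once every acquisition has been released.
// Hypothetical type for illustration; this is not InnoDB's rw_lock_t.
struct RecursiveXLatch {
    int depth = 0;
    void x_lock() { ++depth; }
    void x_unlock() {
        assert(depth > 0);
        --depth;
    }
    bool is_free() const { return depth == 0; }
};

// Simulates a bottom-up build of n_pages sibling pages. Each page's mtr
// X-latches the index lock when it starts and releases it on commit, but the
// next sibling's mtr always starts before the previous one commits.
// Returns how many times the latch was free between two mtrs.
std::size_t free_windows_during_build(std::size_t n_pages) {
    RecursiveXLatch index_lock;
    std::size_t free_windows = 0;
    if (n_pages == 0) {
        return 0;
    }
    index_lock.x_lock();  // mtr for page 1 starts
    for (std::size_t page = 2; page <= n_pages; ++page) {
        index_lock.x_lock();    // sibling page's mtr starts (recursive X)
        index_lock.x_unlock();  // previous page's mtr commits
        if (index_lock.is_free()) {
            ++free_windows;     // a waiter could have been granted S here
        }
    }
    index_lock.x_unlock();  // last page's mtr commits
    assert(index_lock.is_free());
    return free_windows;
}
```

In this model the function returns 0 for any page count, which mirrors the point of the description: a thread waiting for index->lock in S mode gets no opportunity until the last mtr of the build commits.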
At this stage a concurrent INSERT arrives. Since there is concurrent DDL and the index is uncommitted, this insert should go to the online ALTER log. It tries to acquire index->lock in S mode.
The bulk load index build already holds index->lock in X mode and is not going to release it until the build is over.
The INSERT thread keeps waiting, and when its wait for index->lock exceeds 600 seconds, the server crashes.
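To make the proportionality concrete, here is a rough back-of-the-envelope model. The 600-second limit comes from the description above; the function names, the per-mtr timings, and the assumption that the latch is held for exactly pages times per-mtr time are made up for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Wait limit from the description above (wait > 600 seconds), in microseconds.
constexpr std::uint64_t kFatalWaitMicros = 600ULL * 1000 * 1000;

// Rough model: the X latch is held for n_pages * micros_per_mtr in total.
// Both inputs are illustrative assumptions, not measured values.
std::uint64_t hold_micros(std::uint64_t n_pages, std::uint64_t micros_per_mtr) {
    // Guard against overflow of the 64-bit product.
    assert(micros_per_mtr == 0 || n_pages <= UINT64_MAX / micros_per_mtr);
    return n_pages * micros_per_mtr;
}

// True if an INSERT arriving arrival_micros after the build started would
// have to wait longer than the limit before the build releases the latch.
bool insert_wait_exceeds_limit(std::uint64_t n_pages,
                               std::uint64_t micros_per_mtr,
                               std::uint64_t arrival_micros) {
    const std::uint64_t total = hold_micros(n_pages, micros_per_mtr);
    if (arrival_micros >= total) {
        return false;  // the build is already over, nothing to wait for
    }
    return total - arrival_micros > kFatalWaitMicros;
}
```

For example, with one million pages at a hypothetical 1000 microseconds per mtr, the latch is held for about 1000 seconds, so an INSERT arriving 100 seconds into the build would wait roughly 900 seconds and cross the limit. At a hypothetical 200 microseconds per mtr the whole build takes about 200 seconds and the limit is never reached, which is why slower, compressed-page mtrs make the crash easier to hit.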
Fix:
The INSERT thread acquires index->lock to check the index online status. During the bulk load index build there is no concurrent insert into, or read of, the index being built, so the build does not need to acquire index->lock at all.
Bulk load index build is also used to create indexes in table rebuild cases, for example DROP COLUMN and ADD COLUMN. The indexes on the intermediate table (#sql-ib..) are built using bulk load insert. Concurrent DMLs at this stage do not acquire index->lock on that table, so acquiring index->lock on the intermediate table, which is not visible to anyone else, doesn't block concurrent DMLs.
Ideally we could remove all index->lock X acquisitions in the bulk load index build path. We play safe and remove the acquisitions only in the case of uncommitted indexes. The other path (bulk load used during rebuild) is not affected anyway.
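The shape of the fix can be sketched as a conditional latch. This is a minimal illustration based only on the description above, not the actual patch: `Index`, `bulk_page_begin`, `bulk_page_commit`, and `insert_can_s_latch` are hypothetical names, and the latch is again reduced to a counter.

```cpp
#include <cassert>

// Hypothetical stand-in for an index object; only the committed flag and an
// X-latch recursion counter are modeled.
struct Index {
    bool committed = false;
    int x_latch_depth = 0;
    bool is_committed() const { return committed; }
};

// Starts the mtr for one page of the bulk build. The X latch is skipped for
// uncommitted indexes (ALTER TABLE ADD INDEX) and kept for committed ones
// (the rebuild path). Returns true if the latch was taken.
bool bulk_page_begin(Index& index) {
    if (!index.is_committed()) {
        return false;  // uncommitted index: do not touch index->lock
    }
    ++index.x_latch_depth;
    return true;
}

// Commits the page mtr and releases the latch if this mtr had taken it.
void bulk_page_commit(Index& index, bool latched) {
    if (latched) {
        assert(index.x_latch_depth > 0);
        --index.x_latch_depth;
    }
}

// An INSERT can obtain the latch in S mode only while no X latch is held.
bool insert_can_s_latch(const Index& index) {
    return index.x_latch_depth == 0;
}
```

With an uncommitted index, `bulk_page_begin` takes no latch, so a concurrent INSERT that needs index->lock in S mode to check the online status is never blocked by the build. With a committed index, the old latching behavior is preserved.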