-
Notifications
You must be signed in to change notification settings - Fork 480
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge fixes for PS-9144 and PS-9214 into 8.0 release tree. #5385
Conversation
…oad as PS-9092 https://perconadev.atlassian.net/browse/PS-9144 Problem: -------- ALTER TABLE which rebuilds InnoDB table using INPLACE algorithm might sometimes lead to row loss if concurrent purge happens on the table being ALTERed. Analysis: --------- New implementation of parallel ALTER TABLE INPLACE in InnoDB was introduced in MySQL 8.0.27. Its code is used for online table rebuild even in a single-thread case. This implementation iterates over all the rows in the table, in general case, handling different subtrees of a B-tree in different threads. This iteration over table rows needs to be paused, from time to time, to commit InnoDB MTR/ release page latches it holds. This is necessary to give a way to concurrent actions on the B-tree scanned or before flushing rows of new version of table from in-memory buffer to the B-tree. In order to resume iteration after such pause persistent cursor position saved before pause is restored. The cause of the problem described above lies in PCursor::restore_position() method. This method used for two purposes in the code: 1) To resume iteration after it was paused in the scenario described above. 2) To initialize cursor when a thread starts iteration through subtree it was assigned to process. In scenario 2) we restore cursor to the record which has not been processed yet. If the record, which position was saved originally when subtrees/ranges to process were assigned to threads, has been purged meanwhile, the cursor will be restored to preceding record. And the cursor needs to be moved to the next record, so we don't start processing our subtree from the record belonging to a different thread. PCursor::restore_position() handles this by detecting situation when saved record was purged and moving to the next record in this case. *** For example, let us assume that we have a page with the following contents - [inf, a, b, c, d, sup], and which is split between 2 threads in the following way : range (inf, b] is assigned to the 1st thread, while range [c, sup) to the 2nd thread. When 2nd thread initializes cursor at thread start it tries to restore savepoint pointing to record 'c'. In case record 'c' was purged meanwhile, the restored cursor will point to record 'b' instead. Since this record is supposed to be processed by thread 1, we need to adjust cursors position to point to record 'd', so 2nd thread can start processing from it. *** However, in scenario 1) we actually restore cursor to the record which has been processed already and from which will be doing step to the next record right after restore. So iterating to the next record if saved record was purged like it is done in PCursor::restore_position(), and which is necessary in case 2), leads to double step forward, resulting in our scan missing record! *** Let us look at example for this case. Let us assume that we have a thread which is in the middle of processing of page with the following contents [inf, a, b, c, d, e, sup]. While processing record 'c' thread decides to flush rows of new version of table from in-memory buffer, so it pauses its iteration and calls PCursor::savepoint(). The latter makes savepoint which internally corresponds to previous record in the page - 'b' and commits mini-transaction/releases latches. While latches are released record 'b', which is marked as deleted, is purged from the page by one of the purge threads. When iteration resumes, PCursor::resume() calls restore_position(), the latter restores cursor to point to record 'a', detects the fact that cursor was not restored to original record and moves cursor to the next record on the page - record 'c'. Code in PCursor::resume() also tries to compensate the fact that savepoint() call saved position of previous record, by moving cursor to the next record on the page once again. Thus cursor is positioned on record 'd' as result. Finally, the code which does record processing continues iteration and moves to the next record on the page - record 'e', as it assumes that record being under the cursor has been processed already (which would have been true if no purge happened and the cursor would have been restored to record 'c' as result of savepoint()/resume()). Hence record 'd' is in effect omitted from iteration/results. *** Fix: ---- This patch solves the problem by using different logic for restore of cursor position in these two cases. For case 1) we simply restore position which was saved using btr_pcur_t::restore_position(). If the record to which cursor is supposed to point has been purged meanwhile, this method will point the cursor to preceding record. Then the calling code will iterate to next record (i.e. successor of purged record) after restore. For case 2) we keep pre-fix behavior and correct cursor position after restoring if the record which position has been saved originally has been purged by moving to the next record in subtree to be processed. PCursor::restore_position() method which implements handling of this case has been renamed to PCursor::restore_position_for_range() and greately simplified.
…primary key (only index). https://perconadev.atlassian.net/browse/PS-9214 Problem: -------- ALTER TABLE with rebuilds InnoDB table using INPLACE algorithm occasionally might fail with unwarranted duplicate primary key error if there are concurrent insertions into the table, even though these insertions do not icause any PK conflict. Analysis: --------- New implementation of parallel ALTER TABLE INPLACE in InnoDB was introduced in MySQL 8.0.27. Its code is used for online table rebuild even in a single-thread case. This implementation iterates over all the rows in the table, in general case, handling different subtrees of a B-tree in different threads. This iteration over table rows needs to be paused, from time to time, to commit InnoDB MTR/ release page latches it holds. This is necessary to give a way to concurrent actions on the B-tree scanned or before flushing rows of new version of table from in-memory buffer to the B-tree. In order to resume iteration after such pause persistent cursor position saved before pause is restored. The cause of the problem described above lies in how PCursor::savepoint() and PCursor::resume() methods perform this saving and restore of cursor position. Instead of storing position of current record pointed by cursor savepoint() stores the position of record that precedes it, and then resume() does restore in two steps - 1) restores position of this preceding record (or its closest precedessor if it was purged meanwhile) and then 2) moves one step forward assuming that will get to the record at which cursor pointed originally. Such approach makes sense when we try to save/restore cursor pointing to page's supremum record before switching to a new page, as it allows to avoid problems with records being skipped when the next page is merged into the current one while latches are released (so records which we have not scanned yet are moved over supremum record, but not over the record which originally preceded supremum). *** Let us take look at an example. Let us assume that we have two pages p1 : [inf, a, b, c, sup] and the next one p2 : [inf, d, e, f, sup]. Out thread which is one of the parallel ALTER TABLE worker threads has completed scan of p1, so its cursor positioned on p1:'sup' record. Now it needs to switch to page p2, but also give a way to threads concurrently updating the table. So it needs to make cursor savepoint, commit mini-transaction and release the latches. If naive approach to making savepoint is used and we simply do btr_pcur_t::store_position()/restore_position() with the cursor positioned on p1 : 'sup' record, then the following might happen: concurrent purge on page p2 might delete some record from it (e.g. 'f') and decide to merge of this page into the page p1. If this happens while latches are released this merge would go through and and resulting in page p1 with the following contents p1 : [inf, a, b, c, d, e, sup]. Savepoint for p1 : 'sup' won't be invalidated (one can say that savepoints for sup and inf are not safe against concurrent merges in this respect) and after restoration of cursor the iteration will continue, on the next page, skipping records 'd' and 'e'. With non-naive approach implemented at the moment, PCursor::savepoint() does a step back before calling btr_pcur_t::store_position(), so for cursor positioned on p1 : 'sup' it is actually position corresponding to p1 : 'c' what is saved. If the merge happens when latches are released, we still get p1 : [inf, a, b, c, d, e, sup] and the savepoint is not invalidated. PCursor::resume() calls btr_pcur_t::restore_position() gets cursor pointing to p1 : 'c' as result, and then it tries to compensate for step-back in PCursor::savepoint() and moves cursor one step forward making it to point to p1 : 'd'. Code which does scanning detects the situation that we savepoint()/resume() resulted in jump from supremum record to user record and resume iteration from p1 : 'd' without skipping any records. *** However, it is not necessary and becomes problematic we try to save/restore cursor pointing to user record from within record processing callback. In this case record which position we are trying to save when savepoint() method is called can be considered already processed as corresponding value already will be inserted into output buffer soon after restore. When a concurrent insert adds a new record between the record which position we have inteded to save by calling savepoint() and its precedessor which position this call stored internally, the later call to resume() will position cursor at this newly inserted record. This will lead to the resumed scan revisiting original record once again. As result the code will attempt to add this original record into the output buffer one more time and get duplicate key error. *** Let us take a look at an example once again. Let us assume that parallel ALTER TABLE thread is scanning page p1 with the following contents - p1 : [inf, a, b, c, d, sup]. It has processed record 'c' (so cursor points to p1: 'c') and decides that it needs take savepoint/ commit mini-transaction and release latches in order to flush in-memory buffer with new versions of the records. PCursor::savepoint() does a step back and calls btr_pcur_t::store_position() which saves position corresponding to p1 : 'b'. After the latches are released concurrent insert might happen into the page adding record 'b1', between records 'b' and 'c', resulting in page looking like p1: [inf, a, b, b1, c, d, sup]. After that PCursor::resume() will call restore_position() which will restore cursor pointing to p1 : 'b' and then will try to compensate for step-back in savepoint() by moving cursor one step forward to p1 : 'b1'. The scanning code will continue its iteration using this cursor, by moving to the next record - p1 :'c' and trying to process it once again, resulting in duplicate key error. *** Fix: --- This patch solves the problem by adjusting PCursors::savepoint()/resume() logic not to do this step back on save/step forward on restore if we are trying to save/restore cursor pointing to a user record (in which case it is not necessary). This process still used when we are trying to save/ restore cursor pointing to page infimum (where it is useful). It also adds some comments explaining how this code works and a few debug asserts enforcing its invariants.
@dlenev Thanks for updating the commit messages. I will take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both the fixes looks good to me and easy to understand with the new examples.
Thank you. I just left only one small question about resetting the variable.
Jenkins results look OK to me: https://ps80.cd.percona.com/view/8.0%20parallel%20MTR/job/percona-server-8.0-pipeline-parallel-mtr/677/#showFailuresLink |
See individual commits for proper change descriptions.