Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge fixes for PS-9144 and PS-9214 into 8.0 release tree. #5385

Merged
merged 2 commits into from
Sep 3, 2024

Conversation

dlenev
Copy link
Contributor

@dlenev dlenev commented Aug 27, 2024

See individual commits for proper change descriptions.

…oad as PS-9092

https://perconadev.atlassian.net/browse/PS-9144

Problem:
--------
ALTER TABLE which rebuilds InnoDB table using INPLACE algorithm might sometimes
lead to row loss if concurrent purge happens on the table being ALTERed.

Analysis:
---------
New implementation of parallel ALTER TABLE INPLACE in InnoDB was introduced in
MySQL 8.0.27. Its code is used for online table rebuild even in a single-thread
case.
This implementation iterates over all the rows in the table, in general case,
handling different subtrees of a B-tree in different threads. This iteration
over table rows needs to be paused, from time to time, to commit InnoDB MTR/
release page latches it holds. This is necessary to give a way to concurrent
actions on the B-tree scanned or before flushing rows of new version of table
from in-memory buffer to the B-tree. In order to resume iteration after such
pause persistent cursor position saved before pause is restored.

The cause of the problem described above lies in PCursor::restore_position()
method. This method used for two purposes in the code:

1)  To resume iteration after it was paused in the scenario described
    above.
2)  To initialize cursor when a thread starts iteration through subtree
    it was assigned to process.

In scenario 2) we restore cursor to the record which has not been
processed yet. If the record, which position was saved originally when
subtrees/ranges to process were assigned to threads, has been purged
meanwhile, the cursor will be restored to preceding record. And the
cursor needs to be moved to the next record, so we don't start processing
our subtree from the record belonging to a different thread.
PCursor::restore_position() handles this by detecting situation when saved
record was purged and moving to the next record in this case.

***

For example, let us assume that we have a page with the following contents -
[inf, a, b, c, d, sup], and which is split between 2 threads in the following
way : range (inf, b] is assigned to the 1st thread, while range [c, sup) to
the 2nd thread. When 2nd thread initializes cursor at thread start it tries
to restore savepoint pointing to record 'c'. In case record 'c' was purged
meanwhile, the restored cursor will point to record 'b' instead. Since this
record is supposed to be processed by thread 1, we need to adjust cursors
position to point to record 'd', so 2nd thread can start processing from it.

***

However, in scenario 1) we actually restore cursor to the record which
has been processed already and from which will be doing step to the next
record right after restore. So iterating to the next record if saved record
was purged like it is done in PCursor::restore_position(), and which is
necessary in case 2), leads to double step forward, resulting in our scan
missing record!

***

Let us look at example for this case. Let us assume that we have a thread
which is in the middle of processing of page with the following contents
[inf, a, b, c, d, e, sup]. While processing record 'c' thread decides to
flush rows of new version of table from in-memory buffer, so it pauses its
iteration and calls PCursor::savepoint(). The latter makes savepoint which
internally corresponds to previous record in the page - 'b' and commits
mini-transaction/releases latches. While latches are released record 'b',
which is marked as deleted, is purged from the page by one of the purge
threads. When iteration resumes, PCursor::resume() calls restore_position(),
the latter restores cursor to point to record 'a', detects the fact that
cursor was not restored to original record and moves cursor to the next
record on the page - record 'c'. Code in PCursor::resume() also tries to
compensate the fact that savepoint() call saved position of previous
record, by moving cursor to the next record on the page once again.
Thus cursor is positioned on record 'd' as result. Finally, the code
which does record processing continues iteration and moves to the next
record on the page - record 'e', as it assumes that record being under
the cursor has been processed already (which would have been true if
no purge happened and the cursor would have been restored to record 'c'
as result of savepoint()/resume()).  Hence record 'd' is in effect
omitted from iteration/results.

***

Fix:
----
This patch solves the problem by using different logic for restore of
cursor position in these two cases.

For case 1) we simply restore position which was saved using
btr_pcur_t::restore_position(). If the record to which cursor is
supposed to point has been purged meanwhile, this method will point
the cursor to preceding record. Then the calling code will iterate to
next record (i.e. successor of purged record) after restore.

For case 2) we keep pre-fix behavior and correct cursor position
after restoring if the record which position has been saved originally
has been purged by moving to the next record in subtree to be processed.
PCursor::restore_position() method which implements handling of this
case has been renamed to PCursor::restore_position_for_range() and
greately simplified.
…primary key (only index).

https://perconadev.atlassian.net/browse/PS-9214

Problem:
--------
ALTER TABLE with rebuilds InnoDB table using INPLACE algorithm occasionally
might fail with unwarranted duplicate primary key error if there are
concurrent insertions into the table, even though these insertions do not
icause any PK conflict.

Analysis:
---------
New implementation of parallel ALTER TABLE INPLACE in InnoDB was introduced in
MySQL 8.0.27. Its code is used for online table rebuild even in a single-thread
case.
This implementation iterates over all the rows in the table, in general case,
handling different subtrees of a B-tree in different threads. This iteration
over table rows needs to be paused, from time to time, to commit InnoDB MTR/
release page latches it holds. This is necessary to give a way to concurrent
actions on the B-tree scanned or before flushing rows of new version of table
from in-memory buffer to the B-tree. In order to resume iteration after such
pause persistent cursor position saved before pause is restored.

The cause of the problem described above lies in how PCursor::savepoint()
and PCursor::resume() methods perform this saving and restore of cursor
position.

Instead of storing position of current record pointed by cursor savepoint()
stores the position of record that precedes it, and then resume() does
restore in two steps - 1) restores position of this preceding record
(or its closest precedessor if it was purged meanwhile) and then 2) moves
one step forward assuming that will get to the record at which cursor
pointed originally.

Such approach makes sense when we try to save/restore cursor pointing to
page's supremum record before switching to a new page, as it allows to
avoid problems with records being skipped when the next page is merged into
the current one while latches are released (so records which we have not
scanned yet are moved over supremum record, but not over the record which
originally preceded supremum).

***

Let us take look at an example. Let us assume that we have two pages
p1 : [inf, a, b, c, sup] and the next one p2 : [inf, d, e, f, sup].
Out thread which is one of the parallel ALTER TABLE worker threads
has completed scan of p1, so its cursor positioned on p1:'sup' record.
Now it needs to switch to page p2, but also give a way to threads
concurrently updating the table. So it needs to make cursor savepoint,
commit mini-transaction and release the latches.

If naive approach to making savepoint is used and we simply do
btr_pcur_t::store_position()/restore_position() with the cursor
positioned on p1 : 'sup' record, then the following might happen:
concurrent purge on page p2 might delete some record from it
(e.g. 'f') and decide to merge of this page into the page p1.
If this happens while latches are released this merge would go through
and and resulting in page p1 with the following contents
p1 : [inf, a, b, c, d, e, sup]. Savepoint for p1 : 'sup' won't
be invalidated (one can say that savepoints for sup and inf are not
safe against concurrent merges in this respect) and after restoration
of cursor the iteration will continue, on the next page, skipping
records 'd' and 'e'.

With non-naive approach implemented at the moment, PCursor::savepoint()
does a step back before calling btr_pcur_t::store_position(), so for
cursor positioned on p1 : 'sup' it is actually position corresponding
to p1 : 'c' what is saved. If the merge happens when latches are
released, we still get p1 : [inf, a, b, c, d, e, sup] and the savepoint
is not invalidated. PCursor::resume() calls btr_pcur_t::restore_position()
gets cursor pointing to p1 : 'c' as result, and then it tries to
compensate for step-back in PCursor::savepoint() and moves cursor
one step forward making it to point to p1 : 'd'. Code which does
scanning detects the situation that we savepoint()/resume() resulted
in jump from supremum record to user record and resume iteration
from p1 : 'd' without skipping any records.

***

However, it is not necessary and becomes problematic we try to save/restore
cursor pointing to user record from within record processing callback.
In this case record which position we are trying to save when savepoint()
method is called can be considered already processed as corresponding value
already will be inserted into output buffer soon after restore. When a
concurrent insert adds a new record between the record which position we
have inteded to save by calling savepoint() and its precedessor which
position this call stored internally, the later call to resume() will
position cursor at this newly inserted record. This will lead to the
resumed scan revisiting original record once again. As result the code
will attempt to add this original record into the output buffer one more
time and get duplicate key error.

***

Let us take a look at an example once again. Let us assume that
parallel ALTER TABLE thread is scanning page p1 with the following
contents - p1 : [inf, a, b, c, d, sup]. It has processed record 'c'
(so cursor points to p1: 'c') and decides that it needs take savepoint/
commit mini-transaction and release latches in order to flush in-memory
buffer with new versions of the records. PCursor::savepoint() does a
step back and calls btr_pcur_t::store_position() which saves position
corresponding to p1 : 'b'. After the latches are released concurrent
insert might happen into the page adding record 'b1', between records
'b' and 'c', resulting in page looking like p1: [inf, a, b, b1, c, d, sup].
After that PCursor::resume() will call restore_position() which will
restore cursor pointing to p1 : 'b' and then will try to compensate for
step-back in savepoint() by moving cursor one step forward to p1 : 'b1'.
The scanning code will continue its iteration using this cursor, by
moving to the next record - p1 :'c' and trying to process it once again,
resulting in duplicate key error.

***

Fix:
---
This patch solves the problem by adjusting PCursors::savepoint()/resume()
logic not to do this step back on save/step forward on restore if we are
trying to save/restore cursor pointing to a user record (in which case it
is not necessary). This process still used when we are trying to save/
restore cursor pointing to page infimum (where it is useful).

It also adds some comments explaining how this code works and a few
debug asserts enforcing its invariants.
@satya-bodapati
Copy link
Contributor

@dlenev Thanks for updating the commit messages. I will take a look.

Copy link
Contributor

@satya-bodapati satya-bodapati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the fixes looks good to me and easy to understand with the new examples.

Thank you. I just left only one small question about resetting the variable.

@dlenev
Copy link
Contributor Author

dlenev commented Sep 3, 2024

@dlenev dlenev merged commit fac2624 into percona:release-8.0.39-30 Sep 3, 2024
25 of 26 checks passed
@dlenev dlenev deleted the ps-8.0-9144-9214 branch September 3, 2024 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants