Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge fixes for PS-9144 and PS-9214 into 8.0 release tree. #5385

Merged
merged 2 commits into from
Sep 3, 2024

Commits on Aug 27, 2024

  1. PS-9144 : Missing rows after ALGORITHM=INPLACE ALTER under same workl…

    …oad as PS-9092
    
    https://perconadev.atlassian.net/browse/PS-9144
    
    Problem:
    --------
    ALTER TABLE which rebuilds InnoDB table using INPLACE algorithm might sometimes
    lead to row loss if concurrent purge happens on the table being ALTERed.
    
    Analysis:
    ---------
    New implementation of parallel ALTER TABLE INPLACE in InnoDB was introduced in
    MySQL 8.0.27. Its code is used for online table rebuild even in a single-thread
    case.
    This implementation iterates over all the rows in the table, in general case,
    handling different subtrees of a B-tree in different threads. This iteration
    over table rows needs to be paused, from time to time, to commit InnoDB MTR/
    release page latches it holds. This is necessary to give a way to concurrent
    actions on the B-tree scanned or before flushing rows of new version of table
    from in-memory buffer to the B-tree. In order to resume iteration after such
    pause persistent cursor position saved before pause is restored.
    
    The cause of the problem described above lies in PCursor::restore_position()
    method. This method used for two purposes in the code:
    
    1)  To resume iteration after it was paused in the scenario described
        above.
    2)  To initialize cursor when a thread starts iteration through subtree
        it was assigned to process.
    
    In scenario 2) we restore cursor to the record which has not been
    processed yet. If the record, which position was saved originally when
    subtrees/ranges to process were assigned to threads, has been purged
    meanwhile, the cursor will be restored to preceding record. And the
    cursor needs to be moved to the next record, so we don't start processing
    our subtree from the record belonging to a different thread.
    PCursor::restore_position() handles this by detecting situation when saved
    record was purged and moving to the next record in this case.
    
    ***
    
    For example, let us assume that we have a page with the following contents -
    [inf, a, b, c, d, sup], and which is split between 2 threads in the following
    way : range (inf, b] is assigned to the 1st thread, while range [c, sup) to
    the 2nd thread. When 2nd thread initializes cursor at thread start it tries
    to restore savepoint pointing to record 'c'. In case record 'c' was purged
    meanwhile, the restored cursor will point to record 'b' instead. Since this
    record is supposed to be processed by thread 1, we need to adjust cursors
    position to point to record 'd', so 2nd thread can start processing from it.
    
    ***
    
    However, in scenario 1) we actually restore cursor to the record which
    has been processed already and from which will be doing step to the next
    record right after restore. So iterating to the next record if saved record
    was purged like it is done in PCursor::restore_position(), and which is
    necessary in case 2), leads to double step forward, resulting in our scan
    missing record!
    
    ***
    
    Let us look at example for this case. Let us assume that we have a thread
    which is in the middle of processing of page with the following contents
    [inf, a, b, c, d, e, sup]. While processing record 'c' thread decides to
    flush rows of new version of table from in-memory buffer, so it pauses its
    iteration and calls PCursor::savepoint(). The latter makes savepoint which
    internally corresponds to previous record in the page - 'b' and commits
    mini-transaction/releases latches. While latches are released record 'b',
    which is marked as deleted, is purged from the page by one of the purge
    threads. When iteration resumes, PCursor::resume() calls restore_position(),
    the latter restores cursor to point to record 'a', detects the fact that
    cursor was not restored to original record and moves cursor to the next
    record on the page - record 'c'. Code in PCursor::resume() also tries to
    compensate the fact that savepoint() call saved position of previous
    record, by moving cursor to the next record on the page once again.
    Thus cursor is positioned on record 'd' as result. Finally, the code
    which does record processing continues iteration and moves to the next
    record on the page - record 'e', as it assumes that record being under
    the cursor has been processed already (which would have been true if
    no purge happened and the cursor would have been restored to record 'c'
    as result of savepoint()/resume()).  Hence record 'd' is in effect
    omitted from iteration/results.
    
    ***
    
    Fix:
    ----
    This patch solves the problem by using different logic for restore of
    cursor position in these two cases.
    
    For case 1) we simply restore position which was saved using
    btr_pcur_t::restore_position(). If the record to which cursor is
    supposed to point has been purged meanwhile, this method will point
    the cursor to preceding record. Then the calling code will iterate to
    next record (i.e. successor of purged record) after restore.
    
    For case 2) we keep pre-fix behavior and correct cursor position
    after restoring if the record which position has been saved originally
    has been purged by moving to the next record in subtree to be processed.
    PCursor::restore_position() method which implements handling of this
    case has been renamed to PCursor::restore_position_for_range() and
    greately simplified.
    dlenev committed Aug 27, 2024
    Configuration menu
    Copy the full SHA
    d9d50ad View commit details
    Browse the repository at this point in the history
  2. PS-9214 : Alter table online results in "duplicate key" error on the …

    …primary key (only index).
    
    https://perconadev.atlassian.net/browse/PS-9214
    
    Problem:
    --------
    ALTER TABLE with rebuilds InnoDB table using INPLACE algorithm occasionally
    might fail with unwarranted duplicate primary key error if there are
    concurrent insertions into the table, even though these insertions do not
    icause any PK conflict.
    
    Analysis:
    ---------
    New implementation of parallel ALTER TABLE INPLACE in InnoDB was introduced in
    MySQL 8.0.27. Its code is used for online table rebuild even in a single-thread
    case.
    This implementation iterates over all the rows in the table, in general case,
    handling different subtrees of a B-tree in different threads. This iteration
    over table rows needs to be paused, from time to time, to commit InnoDB MTR/
    release page latches it holds. This is necessary to give a way to concurrent
    actions on the B-tree scanned or before flushing rows of new version of table
    from in-memory buffer to the B-tree. In order to resume iteration after such
    pause persistent cursor position saved before pause is restored.
    
    The cause of the problem described above lies in how PCursor::savepoint()
    and PCursor::resume() methods perform this saving and restore of cursor
    position.
    
    Instead of storing position of current record pointed by cursor savepoint()
    stores the position of record that precedes it, and then resume() does
    restore in two steps - 1) restores position of this preceding record
    (or its closest precedessor if it was purged meanwhile) and then 2) moves
    one step forward assuming that will get to the record at which cursor
    pointed originally.
    
    Such approach makes sense when we try to save/restore cursor pointing to
    page's supremum record before switching to a new page, as it allows to
    avoid problems with records being skipped when the next page is merged into
    the current one while latches are released (so records which we have not
    scanned yet are moved over supremum record, but not over the record which
    originally preceded supremum).
    
    ***
    
    Let us take look at an example. Let us assume that we have two pages
    p1 : [inf, a, b, c, sup] and the next one p2 : [inf, d, e, f, sup].
    Out thread which is one of the parallel ALTER TABLE worker threads
    has completed scan of p1, so its cursor positioned on p1:'sup' record.
    Now it needs to switch to page p2, but also give a way to threads
    concurrently updating the table. So it needs to make cursor savepoint,
    commit mini-transaction and release the latches.
    
    If naive approach to making savepoint is used and we simply do
    btr_pcur_t::store_position()/restore_position() with the cursor
    positioned on p1 : 'sup' record, then the following might happen:
    concurrent purge on page p2 might delete some record from it
    (e.g. 'f') and decide to merge of this page into the page p1.
    If this happens while latches are released this merge would go through
    and and resulting in page p1 with the following contents
    p1 : [inf, a, b, c, d, e, sup]. Savepoint for p1 : 'sup' won't
    be invalidated (one can say that savepoints for sup and inf are not
    safe against concurrent merges in this respect) and after restoration
    of cursor the iteration will continue, on the next page, skipping
    records 'd' and 'e'.
    
    With non-naive approach implemented at the moment, PCursor::savepoint()
    does a step back before calling btr_pcur_t::store_position(), so for
    cursor positioned on p1 : 'sup' it is actually position corresponding
    to p1 : 'c' what is saved. If the merge happens when latches are
    released, we still get p1 : [inf, a, b, c, d, e, sup] and the savepoint
    is not invalidated. PCursor::resume() calls btr_pcur_t::restore_position()
    gets cursor pointing to p1 : 'c' as result, and then it tries to
    compensate for step-back in PCursor::savepoint() and moves cursor
    one step forward making it to point to p1 : 'd'. Code which does
    scanning detects the situation that we savepoint()/resume() resulted
    in jump from supremum record to user record and resume iteration
    from p1 : 'd' without skipping any records.
    
    ***
    
    However, it is not necessary and becomes problematic we try to save/restore
    cursor pointing to user record from within record processing callback.
    In this case record which position we are trying to save when savepoint()
    method is called can be considered already processed as corresponding value
    already will be inserted into output buffer soon after restore. When a
    concurrent insert adds a new record between the record which position we
    have inteded to save by calling savepoint() and its precedessor which
    position this call stored internally, the later call to resume() will
    position cursor at this newly inserted record. This will lead to the
    resumed scan revisiting original record once again. As result the code
    will attempt to add this original record into the output buffer one more
    time and get duplicate key error.
    
    ***
    
    Let us take a look at an example once again. Let us assume that
    parallel ALTER TABLE thread is scanning page p1 with the following
    contents - p1 : [inf, a, b, c, d, sup]. It has processed record 'c'
    (so cursor points to p1: 'c') and decides that it needs take savepoint/
    commit mini-transaction and release latches in order to flush in-memory
    buffer with new versions of the records. PCursor::savepoint() does a
    step back and calls btr_pcur_t::store_position() which saves position
    corresponding to p1 : 'b'. After the latches are released concurrent
    insert might happen into the page adding record 'b1', between records
    'b' and 'c', resulting in page looking like p1: [inf, a, b, b1, c, d, sup].
    After that PCursor::resume() will call restore_position() which will
    restore cursor pointing to p1 : 'b' and then will try to compensate for
    step-back in savepoint() by moving cursor one step forward to p1 : 'b1'.
    The scanning code will continue its iteration using this cursor, by
    moving to the next record - p1 :'c' and trying to process it once again,
    resulting in duplicate key error.
    
    ***
    
    Fix:
    ---
    This patch solves the problem by adjusting PCursors::savepoint()/resume()
    logic not to do this step back on save/step forward on restore if we are
    trying to save/restore cursor pointing to a user record (in which case it
    is not necessary). This process still used when we are trying to save/
    restore cursor pointing to page infimum (where it is useful).
    
    It also adds some comments explaining how this code works and a few
    debug asserts enforcing its invariants.
    dlenev committed Aug 27, 2024
    Configuration menu
    Copy the full SHA
    e331f9f View commit details
    Browse the repository at this point in the history