You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Question: Why do we need SELECT...FOR UPDATE in ghostferry?
In this ghostferry issue, the question of why we use SELECT...FOR UPDATE in ghostferry (instead of plain SELECT) was raised. It is a great question. Here is the relevant text from the issue and what I plan to reply as a possible scenario where without FOR UPDATE, we will end up with data corruption.
My assumption is that this is to keep the data consistent between the reading and the row verification. Beyond verification, it should be safe to operate without the FOR UPDATE (which also improves performance, as we don't require the round-trips for the transaction, which is always rolled back anyways), because any modifications on the source DB overlapping with a row read should be "fixed" by the binlog-writer anyways.
Can you confirm my assumption is correct?
As @shuhaowu pointed out, SELECT...FOR UPDATE is needed for maintaining the data correctness, not just for the row verification. Let's take a look at a possible sequence of actions that will introduce a data corruption with plain SELECT. We will use a similar notation as introduced in another ghostferry issue. We denote the database row values with r0 and r1, where r0 represents one set of values and r1 represents a different set of values. We define a "payload" as either an argument that an action is given (e.g. INSERT v1 INTO ...) or a result that an action receives (e.g. SELECT FROM ... -> v1).
Step
Actor
Action
Payload
Source
Target
1
Application
INSERT (SOURCE)
r1
r1
nil
2
DataIterator
BEGIN
N/A
r1
nil
3
DataIterator
SELECT
r1
r1
nil
4
Application
UPDATE (SOURCE)
r0
r0
nil
5
BinlogWriter
UPDATE
r1->r0
r0
nil
6
BatchWriter
BEGIN
N/A
r0
nil
7
BatchWriter
INSERT IGNORE
r1
r0
r1
8
BatchWriter
ROLLBACK
N/A
r0
r1
9
DataIterator
ROLLBACK
N/A
r0
r1
As we can see in step 4 above, the application is allowed to update the row selected in step 3 because FOR UPDATE is not enforced. A binlog UPDATE event is then created based on step 4. However, it silently fails to execute step 5 to update the target DB because it must perform r1->r0 and the old row value r1 doesn't exist in the target DB yet. Step 7 then writes the old stale row value r1 (obtained in step 3) to the target DB. At this point, the target DB contains r1, which is different from what is in the source DB, namely r0. If we have a verifier enabled, we will detect the data inconsistency and fail the ghostferry run.
The text was updated successfully, but these errors were encountered:
Question: Why do we need
SELECT...FOR UPDATE
in ghostferry?In this ghostferry issue, the question of why we use
SELECT...FOR UPDATE
in ghostferry (instead of plainSELECT
) was raised. It is a great question. Here is the relevant text from the issue and what I plan to reply as a possible scenario where withoutFOR UPDATE
, we will end up with data corruption.As @shuhaowu pointed out,
SELECT...FOR UPDATE
is needed for maintaining the data correctness, not just for the row verification. Let's take a look at a possible sequence of actions that will introduce a data corruption with plainSELECT
. We will use a similar notation as introduced in another ghostferry issue. We denote the database row values with r0 and r1, where r0 represents one set of values and r1 represents a different set of values. We define a "payload" as either an argument that an action is given (e.g. INSERT v1 INTO ...) or a result that an action receives (e.g. SELECT FROM ... -> v1).As we can see in step 4 above, the application is allowed to update the row selected in step 3 because
FOR UPDATE
is not enforced. A binlog UPDATE event is then created based on step 4. However, it silently fails to execute step 5 to update the target DB because it must perform r1->r0 and the old row value r1 doesn't exist in the target DB yet. Step 7 then writes the old stale row value r1 (obtained in step 3) to the target DB. At this point, the target DB contains r1, which is different from what is in the source DB, namely r0. If we have a verifier enabled, we will detect the data inconsistency and fail the ghostferry run.The text was updated successfully, but these errors were encountered: