Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why do we need SELECT...FOR UPDATE in ghostferry #1

Closed
xliang6 opened this issue Apr 26, 2020 · 0 comments
Closed

Why do we need SELECT...FOR UPDATE in ghostferry #1

xliang6 opened this issue Apr 26, 2020 · 0 comments

Comments

@xliang6
Copy link
Owner

xliang6 commented Apr 26, 2020

Question: Why do we need SELECT...FOR UPDATE in ghostferry?

In this ghostferry issue, the question of why we use SELECT...FOR UPDATE in ghostferry (instead of plain SELECT) was raised. It is a great question. Here is the relevant text from the issue and what I plan to reply as a possible scenario where without FOR UPDATE, we will end up with data corruption.

My assumption is that this is to keep the data consistent between the reading and the row verification. Beyond verification, it should be safe to operate without the FOR UPDATE (which also improves performance, as we don't require the round-trips for the transaction, which is always rolled back anyways), because any modifications on the source DB overlapping with a row read should be "fixed" by the binlog-writer anyways.
Can you confirm my assumption is correct?

As @shuhaowu pointed out, SELECT...FOR UPDATE is needed for maintaining the data correctness, not just for the row verification. Let's take a look at a possible sequence of actions that will introduce a data corruption with plain SELECT. We will use a similar notation as introduced in another ghostferry issue. We denote the database row values with r0 and r1, where r0 represents one set of values and r1 represents a different set of values. We define a "payload" as either an argument that an action is given (e.g. INSERT v1 INTO ...) or a result that an action receives (e.g. SELECT FROM ... -> v1).

Step Actor Action Payload Source Target
1 Application INSERT (SOURCE) r1 r1 nil
2 DataIterator BEGIN N/A r1 nil
3 DataIterator SELECT r1 r1 nil
4 Application UPDATE (SOURCE) r0 r0 nil
5 BinlogWriter UPDATE r1->r0 r0 nil
6 BatchWriter BEGIN N/A r0 nil
7 BatchWriter INSERT IGNORE r1 r0 r1
8 BatchWriter ROLLBACK N/A r0 r1
9 DataIterator ROLLBACK N/A r0 r1

As we can see in step 4 above, the application is allowed to update the row selected in step 3 because FOR UPDATE is not enforced. A binlog UPDATE event is then created based on step 4. However, it silently fails to execute step 5 to update the target DB because it must perform r1->r0 and the old row value r1 doesn't exist in the target DB yet. Step 7 then writes the old stale row value r1 (obtained in step 3) to the target DB. At this point, the target DB contains r1, which is different from what is in the source DB, namely r0. If we have a verifier enabled, we will detect the data inconsistency and fail the ghostferry run.

@xliang6 xliang6 closed this as completed Apr 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant