Why do we need `SELECT...FOR UPDATE` in ghostferry #1

xliang6 · 2020-04-26T20:19:46Z

Question: Why do we need `SELECT...FOR UPDATE` in ghostferry?

In this ghostferry issue, the question of why we use SELECT...FOR UPDATE in ghostferry (instead of plain SELECT) was raised. It is a great question. Here is the relevant text from the issue and what I plan to reply as a possible scenario where without FOR UPDATE, we will end up with data corruption.

My assumption is that this is to keep the data consistent between the reading and the row verification. Beyond verification, it should be safe to operate without the FOR UPDATE (which also improves performance, as we don't require the round-trips for the transaction, which is always rolled back anyways), because any modifications on the source DB overlapping with a row read should be "fixed" by the binlog-writer anyways.
Can you confirm my assumption is correct?

As @shuhaowu pointed out, SELECT...FOR UPDATE is needed for maintaining the data correctness, not just for the row verification. Let's take a look at a possible sequence of actions that will introduce a data corruption with plain SELECT. We will use a similar notation as introduced in another ghostferry issue. We denote the database row values with r0 and r1, where r0 represents one set of values and r1 represents a different set of values. We define a "payload" as either an argument that an action is given (e.g. INSERT v1 INTO ...) or a result that an action receives (e.g. SELECT FROM ... -> v1).

Step	Actor	Action	Payload	Source	Target
1	Application	INSERT (SOURCE)	r1	r1	nil
2	DataIterator	BEGIN	N/A	r1	nil
3	DataIterator	SELECT	r1	r1	nil
4	Application	UPDATE (SOURCE)	r0	r0	nil
5	BinlogWriter	UPDATE	r1->r0	r0	nil
6	BatchWriter	BEGIN	N/A	r0	nil
7	BatchWriter	INSERT IGNORE	r1	r0	r1
8	BatchWriter	ROLLBACK	N/A	r0	r1
9	DataIterator	ROLLBACK	N/A	r0	r1

As we can see in step 4 above, the application is allowed to update the row selected in step 3 because FOR UPDATE is not enforced. A binlog UPDATE event is then created based on step 4. However, it silently fails to execute step 5 to update the target DB because it must perform r1->r0 and the old row value r1 doesn't exist in the target DB yet. Step 7 then writes the old stale row value r1 (obtained in step 3) to the target DB. At this point, the target DB contains r1, which is different from what is in the source DB, namely r0. If we have a verifier enabled, we will detect the data inconsistency and fail the ghostferry run.

The text was updated successfully, but these errors were encountered:

xliang6 closed this as completed Apr 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why do we need `SELECT...FOR UPDATE` in ghostferry #1

Why do we need `SELECT...FOR UPDATE` in ghostferry #1

xliang6 commented Apr 26, 2020 •

edited

Loading

Why do we need SELECT...FOR UPDATE in ghostferry #1

Why do we need SELECT...FOR UPDATE in ghostferry #1

Comments

xliang6 commented Apr 26, 2020 • edited Loading

Question: Why do we need SELECT...FOR UPDATE in ghostferry?

Why do we need `SELECT...FOR UPDATE` in ghostferry #1

Why do we need `SELECT...FOR UPDATE` in ghostferry #1

xliang6 commented Apr 26, 2020 •

edited

Loading

Question: Why do we need `SELECT...FOR UPDATE` in ghostferry?