Replication, delete_mode not keep and key amnesia #1813

Closed
martinsumner opened this issue Mar 7, 2022 · 5 comments
Scenario

There are two clusters: ClusterA and ClusterB. Both clusters are configured with the default delete_mode (reap after a 3s timeout).
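
For reference, a minimal sketch of the setting in question (values illustrative): delete_mode is a riak_kv setting taking keep, immediate, or a reap timeout in milliseconds, 3000 being the default.

```erlang
%% advanced.config (sketch): delete_mode governs what happens to a tombstone
%% once a delete has been acknowledged.
[
 {riak_kv, [
     {delete_mode, 3000}    %% reap the tombstone after 3s (the default)
     %% {delete_mode, keep}      -> keep tombstones indefinitely
     %% {delete_mode, immediate} -> reap as soon as the delete completes
 ]}
].
```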

An object key (K1) belongs to a Preflist of {VnodeA1, VnodeA2, VnodeA3} on ClusterA, and {VnodeB1, VnodeB2, VnodeB3} on ClusterB.

An object is created with key K1 on ClusterA (say with VnodeA1 acting as coordinator). The creation is replicated to ClusterB.

The object is then deleted on ClusterB (say with VnodeB1 acting as coordinator). The deletion is replicated to ClusterA.

For some reason, on one cluster (say ClusterA) the reap does not occur on the timeout.

Replication Goes Wrong

  1. There is now an object on ClusterA, a tombstone with a vector clock like [{VnodeA1, {1, TS1}}, {VnodeB1, {1, TS2}}], that does not exist on ClusterB.
  2. nextgenrepl full-sync now runs with ClusterA as the src, and ClusterB as the snk.
  3. The aae_exchange discovers that A > B for K1, as K1 is not_found on B. It puts a reference on the replication queue.
  4. The fetch, which is run via riak_kv_get_fsm, pulls across the tombstone object. But because it prompts a read, the read's final action spots that this is a tombstone and so triggers a maybe_delete, which causes the tombstone to be reaped from ClusterA.
  5. The push of the tombstone to ClusterB is performed as expected on VnodeB2 and VnodeB3. However, on VnodeB1 there is key amnesia - this vnode previously coordinated the write, but has no memory of the key in its backend because of the reap.
  6. Due to key amnesia, the object is written with an updated vector clock.
  7. The riak_client:push function which has replicated the object prompts a GET of the object after the asis PUT. This GET is intended to prompt any delete actions (e.g. it is expected to reap). However, in this case it prompts a read-repair instead (there can only be one final action), because the tombstone's vclock at VnodeB1 differs from that at VnodeB2 and VnodeB3 due to the new actor epoch generated by key amnesia.
  8. This means that the object has been reaped from ClusterA, but exists on ClusterB, following read repair, with a vector clock like [{VnodeA1, {1, TS1}}, {VnodeB1, {1, TS2}}, {VnodeB1.1, {1, TS3}}].
  9. Now if full-sync runs from B to A, the same thing occurs in reverse. The net effect is that the tombstone is reaped from ClusterB, but re-added to ClusterA with an expanded vector clock due to the new actor epoch (triggered by the object update history from VnodeA1), e.g. [{VnodeA1, {1, TS1}}, {VnodeB1, {1, TS2}}, {VnodeB1.1, {1, TS3}}, {VnodeA1.1, {1, TS4}}].

... this object can then loop around indefinitely, forever increasing the size of the vector clock.
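
To make the loop concrete, here is a small self-contained sketch (plain Erlang, not riak_kv code; the vnode names, epochs and timestamps are invented stand-ins) of a clock gaining one actor entry per full-sync pass because the coordinating vnode has reaped its history of the key:

```erlang
-module(vclock_amnesia_sketch).
-export([run/1]).

%% One full-sync pass: the coordinating vnode on the sink has reaped the key,
%% so key amnesia adds a fresh actor epoch (e.g. VnodeB1.1) to the clock.
pass({Vnode, Epoch}, Clock) ->
    Clock ++ [{{Vnode, Epoch}, {1, erlang:monotonic_time()}}].

%% Alternate the coordinating vnode: pass 1 hits VnodeB1, pass 2 VnodeA1, ...
coord(I) when I rem 2 =:= 1 -> {vnodeB1, (I + 1) div 2};
coord(I)                    -> {vnodeA1, I div 2}.

%% Run N full-sync passes starting from the tombstone clock in step 1.
run(N) ->
    Start = [{{vnodeA1, 0}, {1, ts1}}, {{vnodeB1, 0}, {1, ts2}}],
    lists:foldl(fun pass/2, Start, [coord(I) || I <- lists:seq(1, N)]).
```

vclock_amnesia_sketch:run(4). returns a six-entry clock, and each further pass adds another entry - the unbounded growth described above.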

@martinsumner

Initial thoughts are that the best solution is to change riak_kv_get_fsm so as not to trigger a delete on fetch. This represents a change in behaviour relative to riak_repl (riak_repl was push rather than pull-based, and so did not use fetch), which is why we have not seen this before.

This should prevent the cycle.
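
As a rough sketch only (this is not the real riak_kv_get_fsm code, and the object and request representations below are invented), the shape of the change would be to make the tombstone final action conditional on whether the GET is a replication fetch:

```erlang
-module(get_final_action_sketch).
-export([final_action/2]).

%% The object is modelled here as a map with an is_tombstone flag; the real
%% code inspects object metadata. The second argument distinguishes a normal
%% client GET from a nextgenrepl fetch.
-spec final_action(#{is_tombstone := boolean()}, client_get | repl_fetch) ->
          maybe_delete | none.
final_action(#{is_tombstone := true}, client_get) ->
    maybe_delete;  %% a normal read of a tombstone still prompts the reap path
final_action(#{is_tombstone := true}, repl_fetch) ->
    none;          %% proposed: a fetch leaves the src tombstone in place
final_action(_Object, _ReqType) ->
    none.
```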

There is perhaps also a question about whether it is right to consider new_actor_epoch on tombstones when delete_mode is not keep. Should key amnesia be the expectation here with replicated tombstones due to automatic reaping?

I think the safest thing is to leave key amnesia as it was, as it appears to be only this replication scenario that can trigger this cycle.

martinsumner commented Mar 8, 2022

Changing the fetch behaviour to avoid the prompted reap on the src cluster may cause other problems. In particular, consider the same situation, but this time there have been no previous coordinated PUTs on either cluster (the object originates from ClusterC).

There could now be a rotation, whereby the tombstone is on A. Full-sync prompts A -> B. There is no amnesia on B, hence no read repair is required, and so the GET after the PUSH on B will now prompt the reap. However, the fetch from A no longer prompts a reap ... so we are just rotating tombstones again.
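
A tiny model of that rotation (plain Erlang, not riak code; the state is just where the tombstone currently exists), assuming the fetch no longer reaps on the src and there is no amnesia on the snk:

```erlang
-module(rotation_sketch).
-export([full_sync/2]).

%% One full-sync pass from Src to Snk: the fetch reads but does not reap the
%% Src tombstone, the push writes it to Snk, and the GET after the push reaps
%% it from Snk again (the vclocks agree, so no read repair gets in the way).
full_sync(Src, Snk) ->
    State0    = #{Src => tombstone, Snk => absent},
    AfterPush = State0#{Snk := tombstone},
    AfterReap = AfterPush#{Snk := absent},
    %% Net effect: the state is unchanged, so the next pass simply repeats it.
    AfterReap =:= State0.
```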

@martinsumner

There is a demonstration of the original problem in this test.

@martinsumner

The kv679 behaviour, i.e. handling key amnesia in all delete_modes, is implicitly tested (in that the default non-keep mode is used in all the tests). So I don't think it would be unsafe to change this behaviour.

The alternative is to end the rule in riak_kv_get_fsm of allowing only a single final action: if a repair is prompted, and the repaired object is a tombstone, it should also prompt a delete (which is really a reap request).
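
A hypothetical sketch of that alternative (the names and shapes are illustrative, not riak_kv's): return a list of final actions rather than a single one, so a read repair of a tombstone can also queue the reap:

```erlang
-module(final_actions_sketch).
-export([final_actions/2]).

%% Decide the final actions for a GET, given whether a read repair is needed
%% and whether the (merged) object is a tombstone.
-spec final_actions(NeedsRepair :: boolean(), IsTombstone :: boolean()) ->
          [read_repair | reap].
final_actions(true,  true)  -> [read_repair, reap];  %% repair and prompt the reap
final_actions(true,  false) -> [read_repair];
final_actions(false, true)  -> [reap];
final_actions(false, false) -> [].
```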

@martinsumner

#1814
