Replication, delete_mode not keep and key amnesia #1813
Initial thoughts are that the best solution is to change the riak_kv_get_fsm so as not to trigger the delete on a fetch. This represents a change from the current behaviour, and should prevent the cycle. There is perhaps also a question about whether it is right to consider new_actor_epoch on tombstones when delete_mode is not keep. Should key amnesia be the expectation here with replicated tombstones, given automatic reaping? I think the safest thing is to leave key amnesia as it was, as it appears to be only this replication scenario that can trigger this cycle.
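To make that first option concrete, here is a minimal sketch of the decision, not riak_kv's actual code: the `replication_fetch` option and the `final_action/2` function are hypothetical names for illustration.

```erlang
-module(fetch_no_reap_sketch).
-export([final_action/2]).

%% Hypothetical sketch: decide the single final action after a GET.
%% Options is the proplist of request options; IsTombstone says whether
%% the merged object is a tombstone.
final_action(Options, IsTombstone) ->
    Fetch = proplists:get_bool(replication_fetch, Options),
    case {Fetch, IsTombstone} of
        {true, _} ->
            %% A fetch on behalf of replication never prompts the delete,
            %% so the source cluster keeps its tombstone and the cycle of
            %% reap -> amnesia -> re-replication is broken.
            none;
        {false, true} ->
            %% A normal client GET of a tombstone prompts the delete
            %% (i.e. the reap), as today.
            delete;
        {false, false} ->
            none
    end.
```

For example, `final_action([replication_fetch], true)` would return `none`, whereas a plain client GET of the same tombstone, `final_action([], true)`, still returns `delete`.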
Changing the fetch behaviour to avoid the prompted reap on the source cluster may cause other problems. In particular, consider the same situation, but where there have been no previous coordinated PUTs on either cluster (the object originates from a third cluster, ClusterC). There could now be a rotation whereby the tombstone is on A. Full-sync prompts A -> B. There is no amnesia on B, hence no read repair is required, and so the GET after the PUSH on B will now prompt the reap. However, the fetch from A no longer prompts a reap ... so we are just rotating tombstones again.
There is a demonstration of the original problem in this test.
The kv679 behaviour, of handling key amnesia in all delete_modes, is implicitly tested (in that the default non-keep mode is used in all the tests), so I think it is potentially unsafe to change this behaviour. The alternative is to end the single-final-action-only rule in riak_kv_get_fsm: if a repair is prompted, and the repaired object is a tombstone, the FSM should also prompt a delete (which is really a reap request) — see the sketch below.
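A sketch of that alternative, again with hypothetical names rather than the real riak_kv_get_fsm internals: the FSM returns a list of final actions instead of at most one.

```erlang
-module(final_actions_sketch).
-export([final_actions/2]).

%% Hypothetical sketch of ending the single-final-action rule: return a
%% list of final actions rather than one. RepairNeeded says whether
%% read-repair was prompted; IsTombstone says whether the (repaired)
%% object is a tombstone.
final_actions(RepairNeeded, IsTombstone) ->
    case {RepairNeeded, IsTombstone} of
        {true, true} ->
            %% Repair the divergent vclocks, then still prompt the delete
            %% (really a reap request) for the repaired tombstone, so the
            %% reap is not lost to the repair.
            [read_repair, delete];
        {true, false} ->
            [read_repair];
        {false, true} ->
            [delete];
        {false, false} ->
            []
    end.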
Scenario
There are two clusters: ClusterA and ClusterB. Both clusters are configured with the standard delete_mode (reaping on a 3s timeout; see the configuration note after this scenario).
An object key (K1) belongs to a Preflist of {VnodeA1, VnodeA2, VnodeA3} on ClusterA, and {VnodeB1, VnodeB2, VnodeB3} on ClusterB.
An object is created with that key K1 on ClusterA (with, say, VnodeA1 acting as coordinator). The creation is replicated.
The object is deleted on ClusterB (with, say, VnodeB1 acting as coordinator). The deletion is replicated.
For some reason, on one cluster (ClusterA say) the reap does not occur on the timeout.
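For reference, the delete_mode assumed in this scenario is the riak_kv default, an integer timeout in milliseconds, as it would appear in an advanced.config excerpt (keep and immediate are the other modes):

```erlang
%% advanced.config excerpt: delete_mode as assumed in this scenario.
[{riak_kv, [
    {delete_mode, 3000}   %% ms before the tombstone is reaped;
                          %% alternatives are keep | immediate
]}].
```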
Replication Goes Wrong
A full-sync comparison is run, and discovers a tombstone on ClusterA, with the vclock
[{VnodeA1, {1, TS1}}, {VnodeB1, {1, TS2}}]
, that does not exist on ClusterB (the reap has already occurred there). The fetch of the tombstone from ClusterA for replication prompts the reap on ClusterA (the delete is triggered on fetch).
The tombstone is pushed to ClusterB. The coordinating vnode VnodeB1 has no memory of the key, so key amnesia adds a new actor epoch to the vclock:
[{VnodeA1, {1, TS1}}, {VnodeB1, {1, TS2}}, {VnodeB1.1, {1, TS3}}]
A GET is prompted on ClusterB following the PUT. This GET is intended to prompt any delete actions (e.g. it is expected to reap). However, in this case it prompts a read-repair instead (there can only be one final action), because the tombstone's vclock differs at VnodeB1 from VnodeB2 and VnodeB3 due to the new actor epoch generated by key amnesia.
The tombstone now exists on ClusterB but not on ClusterA (where the fetch prompted the reap), so the next full-sync replicates it back, and key amnesia on ClusterA extends the vclock again:
[{VnodeA1, {1, TS1}}, {VnodeB1, {1, TS2}}, {VnodeB1.1, {1, TS3}}, {VnodeA1.1, {1, TS4}}]
... this object can then loop around indefinitely, forever increasing the size of the vector clock.
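To illustrate the unbounded growth, here is a standalone sketch using plain Erlang terms (not the real vclock module): each full-sync round trip adds one new-epoch actor on each side, so the clock grows by two entries per cycle.

```erlang
-module(vclock_growth_sketch).
-export([after_rounds/1]).

%% Illustrative only: start from the replicated tombstone's clock and
%% append the two new-epoch actors (ClusterB then ClusterA) that key
%% amnesia would add on each round trip.
after_rounds(N) ->
    Seed = [{vnode_a1, {1, ts1}}, {vnode_b1, {1, ts2}}],
    lists:foldl(
        fun(I, VClock) ->
            Suffix = integer_to_list(I),
            VClock ++
                [{list_to_atom("vnode_b1_" ++ Suffix), {1, ts}},
                 {list_to_atom("vnode_a1_" ++ Suffix), {1, ts}}]
        end,
        Seed,
        lists:seq(1, N)).
```

`after_rounds(1)` reproduces the four-entry clock above; `after_rounds(100)` yields 202 entries, with nothing to bound the growth.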