Replication, delete_mode not keep and key amnesia #1813

Closed
martinsumner opened this issue Mar 7, 2022 · 5 comments
Scenario

There are two clusters: ClusterA and ClusterB. Both clusters are configured with the default delete_mode (reap after a 3s timeout).
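
For reference, a minimal sketch of the setting in question (values illustrative): delete_mode is a riak_kv setting taking keep, immediate, or a reap timeout in milliseconds, 3000 being the default.

```erlang
%% advanced.config (sketch): delete_mode governs what happens to a tombstone
%% once a delete has been acknowledged.
[
 {riak_kv, [
     {delete_mode, 3000}    %% reap the tombstone after 3s (the default)
     %% {delete_mode, keep}      -> keep tombstones indefinitely
     %% {delete_mode, immediate} -> reap as soon as the delete completes
 ]}
].
```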

An object key (K1) belongs to a Preflist of {VnodeA1, VnodeA2, VnodeA3} on ClusterA, and {VnodeB1, VnodeB2, VnodeB3} on ClusterB.

An object is created with key K1 on ClusterA (say with VnodeA1 acting as coordinator). The creation is replicated to ClusterB.

The object is then deleted on ClusterB (say with VnodeB1 acting as coordinator). The deletion is replicated to ClusterA.

For some reason, on one cluster (say ClusterA) the reap does not occur on the timeout.

Replication Goes Wrong

  1. There is now an object on ClusterA, a tombstone with a vector clock like [{VnodeA1, {1, TS1}}, {VnodeB1, {1, TS2}}], that does not exist on ClusterB.
  2. nextgenrepl full-sync now runs with ClusterA as the src, and ClusterB as the snk.
  3. The aae_exchange discovers that A > B for K1, as K1 is not_found on B. It puts a reference on the replication queue.
  4. The fetch, which is run via riak_kv_get_fsm, pulls across the tombstone object. But because it prompts a read, the read's final action spots that this is a tombstone and so triggers a maybe_delete, which causes the tombstone to be reaped from ClusterA.
  5. The push of the tombstone to ClusterB is performed as expected on VnodeB2 and VnodeB3. However, on VnodeB1 there is key amnesia - this vnode previously coordinated the write, but has no memory of the key in its backend because of the reap.
  6. Due to key amnesia, the object is written with an updated vector clock.
  7. The riak_client:push function which has replicated the object prompts a GET of the object after the asis PUT. This GET is intended to prompt any delete actions (e.g. it is expected to reap). However, in this case it prompts a read-repair instead (there can only be one final action), because the tombstone's vclock at VnodeB1 differs from that at VnodeB2 and VnodeB3 due to the new actor epoch generated by key amnesia.
  8. This means that the object has been reaped from ClusterA, but exists on ClusterB, following read repair, with a vector clock like [{VnodeA1, {1, TS1}}, {VnodeB1, {1, TS2}}, {VnodeB1.1, {1, TS3}}].
  9. Now if full-sync runs from B to A, the same thing occurs in reverse. The net effect is that the tombstone is reaped from ClusterB, but re-added to ClusterA with an expanded vector clock due to the new actor epoch (triggered by the object update history from VnodeA1), e.g. [{VnodeA1, {1, TS1}}, {VnodeB1, {1, TS2}}, {VnodeB1.1, {1, TS3}}, {VnodeA1.1, {1, TS4}}].

... this object can then loop around indefinitely, forever increasing the size of the vector clock.
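
To make the loop concrete, here is a small self-contained sketch (plain Erlang, not riak_kv code; the vnode names, epochs and timestamps are invented stand-ins) of a clock gaining one actor entry per full-sync pass because the coordinating vnode has reaped its history of the key:

```erlang
-module(vclock_amnesia_sketch).
-export([run/1]).

%% One full-sync pass: the coordinating vnode on the sink has reaped the key,
%% so key amnesia adds a fresh actor epoch (e.g. VnodeB1.1) to the clock.
pass({Vnode, Epoch}, Clock) ->
    Clock ++ [{{Vnode, Epoch}, {1, erlang:monotonic_time()}}].

%% Alternate the coordinating vnode: pass 1 hits VnodeB1, pass 2 VnodeA1, ...
coord(I) when I rem 2 =:= 1 -> {vnodeB1, (I + 1) div 2};
coord(I)                    -> {vnodeA1, I div 2}.

%% Run N full-sync passes starting from the tombstone clock in step 1.
run(N) ->
    Start = [{{vnodeA1, 0}, {1, ts1}}, {{vnodeB1, 0}, {1, ts2}}],
    lists:foldl(fun pass/2, Start, [coord(I) || I <- lists:seq(1, N)]).
```

vclock_amnesia_sketch:run(4). returns a six-entry clock, and each further pass adds another entry - the unbounded growth described above.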

@martinsumner

Initial thoughts are that the best solution is to change riak_kv_get_fsm so as not to trigger a delete on fetch. This represents a change in behaviour relative to riak_repl (riak_repl was push rather than pull-based, and so did not use fetch), which is why we have not seen this before.

This should prevent the cycle.
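
As a rough sketch only (this is not the real riak_kv_get_fsm code, and the object and request representations below are invented), the shape of the change would be to make the tombstone final action conditional on whether the GET is a replication fetch:

```erlang
-module(get_final_action_sketch).
-export([final_action/2]).

%% The object is modelled here as a map with an is_tombstone flag; the real
%% code inspects object metadata. The second argument distinguishes a normal
%% client GET from a nextgenrepl fetch.
-spec final_action(#{is_tombstone := boolean()}, client_get | repl_fetch) ->
          maybe_delete | none.
final_action(#{is_tombstone := true}, client_get) ->
    maybe_delete;  %% a normal read of a tombstone still prompts the reap path
final_action(#{is_tombstone := true}, repl_fetch) ->
    none;          %% proposed: a fetch leaves the src tombstone in place
final_action(_Object, _ReqType) ->
    none.
```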

There is perhaps also a question about whether it is right to consider new_actor_epoch on tombstones when delete_mode is not keep. Should key amnesia be the expectation here with replicated tombstones due to automatic reaping?

I think the safest thing is to leave key amnesia as it was, as it appears to be only this replication scenario that can trigger this cycle.

martinsumner commented Mar 8, 2022

Changing the fetch behaviour to avoid the prompted reap on the src cluster may cause other problems. In particular, consider the same situation, but this time there have been no previous coordinated PUTs on either cluster (the object originates from ClusterC).

There could now be a rotation, whereby the tombstone is on A. Full-sync prompts A -> B. There is no amnesia on B, hence no read repair is required, and so the GET after the PUSH on B will now prompt the reap. However, the fetch from A no longer prompts a reap ... so we are just rotating tombstones again.
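
A tiny model of that rotation (plain Erlang, not riak code; the state is just where the tombstone currently exists), assuming the fetch no longer reaps on the src and there is no amnesia on the snk:

```erlang
-module(rotation_sketch).
-export([full_sync/2]).

%% One full-sync pass from Src to Snk: the fetch reads but does not reap the
%% Src tombstone, the push writes it to Snk, and the GET after the push reaps
%% it from Snk again (the vclocks agree, so no read repair gets in the way).
full_sync(Src, Snk) ->
    State0    = #{Src => tombstone, Snk => absent},
    AfterPush = State0#{Snk := tombstone},
    AfterReap = AfterPush#{Snk := absent},
    %% Net effect: the state is unchanged, so the next pass simply repeats it.
    AfterReap =:= State0.
```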

@martinsumner

There is a demonstration of the original problem in this test.

@martinsumner

The kv679 behaviour, i.e. handling key amnesia in all delete_modes, is implicitly tested (in that the default non-keep mode is used in all the tests). So I don't think it would be unsafe to change this behaviour.

The alternative is to end the rule in riak_kv_get_fsm of allowing only a single final action: if a repair is prompted, and the repaired object is a tombstone, it should also prompt a delete (which is really a reap request).
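
A hypothetical sketch of that alternative (the names and shapes are illustrative, not riak_kv's): return a list of final actions rather than a single one, so a read repair of a tombstone can also queue the reap:

```erlang
-module(final_actions_sketch).
-export([final_actions/2]).

%% Decide the final actions for a GET, given whether a read repair is needed
%% and whether the (merged) object is a tombstone.
-spec final_actions(NeedsRepair :: boolean(), IsTombstone :: boolean()) ->
          [read_repair | reap].
final_actions(true,  true)  -> [read_repair, reap];  %% repair and prompt the reap
final_actions(true,  false) -> [read_repair];
final_actions(false, true)  -> [reap];
final_actions(false, false) -> [].
```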

@martinsumner

#1814
