
TictacAAE fullsync - need for further safety measures #1775

Closed
martinsumner opened this issue Nov 10, 2020 · 3 comments

@martinsumner Contributor

In part related to #1765. When testing intra-cluster AAE with very large vnodes and large deltas (either genuine deltas, or false deltas while awaiting cache repairs), the behaviour of Tictac AAE was neither efficient nor sufficiently conservative.

Some of the measures related to intra-cluster AAE will also lead to improvements in inter-cluster AAE (such as active pruning of the AAE runner queue). However, there remains a risk that inter-cluster AAE behaviour may not be predictable, conservative or efficient in the face of large deltas with large stores.

The following changes are proposed:

  • Update the NextGenREPL doc to make explicit the potential need for operator intervention, using the aae_fold repl_keys_range query to resolve large deltas rather than waiting for them to be resolved via fullsync (see the first sketch after this list). Currently this fullsync method is optimised for reconciliation (i.e. confirming no deltas) and for resolving small deltas; deltas of <10K may take an unexpected length of time to resolve.

  • Update the NextGenREPL doc to make explicit that the participate_in_coverage option should be used when recovering a crashed node, if using TictacAAE fullsync. Disabling coverage participation for a crashed node until its tree caches have been rebuilt will prevent excessive work due to false deltas before the cache rebuilds have completed.

  • Allow max_results and scan_timeout on AAE exchanges prompted by inter-cluster AAE to be configurable, just as they are for intra-cluster AAE.

  • Back off further in the riak_kv_ttaaefs_manager gen_server when a previous exchange completes but ends in the state waiting_all_results (the state is returned to the gen_server via the reply function), given that this as a final state indicates that the exchange did not complete (probably due to a timeout). See the second sketch after this list.

  • Change the default configurations to be more friendly to repair than reconciliation (e.g. in large clusters reconciling every 15 minutes per node is fine, but repairing at that frequency is likely to lead to backlogs).
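
As a rough illustration of the operator intervention described in the first item, the sketch below shows how a repl_keys_range aae_fold might be prompted from the console on the source cluster. The exact query tuple, the bucket name, the queue name (cluster_b) and the use of epoch seconds for the modified range are assumptions for illustration, not confirmed API details.

```erlang
%% A minimal sketch, assuming the repl_keys_range query form described in the
%% NextGenREPL documentation. <<"bucketX">> and cluster_b are placeholders.
{ok, C} = riak:local_client(),
Now = os:system_time(second),
riak_client:aae_fold({repl_keys_range,
                      <<"bucketX">>,            %% bucket with the known delta
                      all,                      %% no key-range constraint
                      {date, Now - 3600, Now},  %% last-modified range (epoch seconds assumed)
                      cluster_b},               %% replication queue to push references onto
                     C).
```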
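
The backoff proposal can be pictured with the hypothetical helper below; this is not the riak_kv_ttaaefs_manager implementation, and the module, function and macro names are invented for illustration only.

```erlang
%% Hypothetical sketch of the proposed backoff: a final state of
%% waiting_all_results means the exchange did not complete (probably a
%% timeout), so the pause before the next exchange is extended rather
%% than reset to the normal cadence.
-module(ttaaefs_backoff_sketch).
-export([next_wait/2]).

-define(BASE_WAIT, timer:minutes(15)).
-define(MAX_WAIT, timer:hours(4)).

%% CurrentWait and the returned wait are in milliseconds.
next_wait(waiting_all_results, CurrentWait) ->
    min(CurrentWait * 2, ?MAX_WAIT);
next_wait(_FinalState, _CurrentWait) ->
    ?BASE_WAIT.
```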

@martinsumner Contributor Author

After further investigation, some issues have arisen.

  1. There is a fundamental bug in the per-bucket full-sync. There are conflicting views in Riak between the external and internal clients on the format of a modified range - {date, Low, High} or {Low, High}. This isn't handled correctly in riak_kv_ttaaefs_manager, and so the modified range was being ignored on the local client (see the sketch at the end of this comment).

  2. The behaviour of fetch_clocks for clock_compare differs between full fullsync (e.g. nval-based) and partial fullsync (e.g. bucket-based). This difference isn't an immediate problem, but it perhaps increases the potential for confusion in the long term. Currently nval fetch_clocks will use the aae_runner for each vnode query, and attempt to re-write (and potentially correct) the segments in the tree cache as it runs through the query. The per-bucket alternative uses the AF3_QUEUE riak_core node_worker_pool, and does not re-write/repair segments.

The nextgenrepl_ttaaefs_manual riak_test has been extended to catch (1) - basho/riak_test#1352. A fix has been tested, as well as a switch to control the behaviour for (2) - #1776.
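
To make the format mismatch in (1) concrete, the sketch below normalises both forms of the modified range before a fetch_clocks query is built. It is illustrative only, not the fix that was tested; the module and function names are invented.

```erlang
%% Illustrative only: the external client presents a modified range as
%% {date, Low, High} while internal code expected {Low, High}; accepting
%% both forms avoids the range being silently ignored on the local client.
-module(modrange_sketch).
-export([normalise/1]).

normalise(all) ->
    all;
normalise({date, Low, High}) when is_integer(Low), is_integer(High) ->
    {Low, High};
normalise({Low, High}) when is_integer(Low), is_integer(High) ->
    {Low, High}.
```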

@martinsumner Contributor Author

If the intention is not to repair tree caches as part of fetch_clocks when running full ttaaefs fullsync, then it becomes easier to consider supporting a schedule that includes hour and day syncs.

In an hour or day sync, a complete comparison would be made between the AAE tree caches, but the hour sync would assume any delta was in the last hour of changes (by modified date); likewise, the day sync would assume any delta is in the last day of changes.

These fetch_clocks queries can then be run with a last-modified-date range, and on very large stores this will be much faster than a full nval compare.
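
A minimal sketch of how such a schedule might derive the last-modified-date range for its fetch_clocks queries follows; the module, function and atom names, and the use of epoch seconds, are assumptions for illustration.

```erlang
%% Sketch only: derive a modified range from the sync type, assuming any
%% delta lies within the last hour or day of changes.
-module(sync_range_sketch).
-export([modified_range/1]).

modified_range(hour_sync) ->
    Now = os:system_time(second),
    {date, Now - 3600, Now};
modified_range(day_sync) ->
    Now = os:system_time(second),
    {date, Now - 86400, Now};
modified_range(all_sync) ->
    %% a full sync places no constraint on modified date
    all.
```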

@martinsumner Contributor Author

#1776
