
Mas i981 resethashtreetokens #1712

Merged

Conversation

martinsumner
Contributor

This is mitigation for the broader problems referred to in basho/riak#981.

This change is tested in:
https://github.com/nhs-riak/riak_test/blob/mas-i981-resethashtreetokens/tests/verify_aae_resettoken.erl

The idea is that in some circumstances we want to temporarily ensure that writes aren't blocked by AAE hashtree_token depletion. This might be to prove that token depletion isn't the cause of slow writes, or because coordinated AAE tree rebuilds are needed to mitigate some other issue.
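The mechanism being mitigated can be pictured with a toy model. The following is an illustrative Python sketch, not riak's Erlang implementation (in riak the pool lives in the vnode process, and `VnodeTokenPool`, `handle_write` and `reset` are invented names): each vnode holds a token count, each write consumes a token, and a depleted pool means the write is held up behind hashtree work before the pool refills. Raising the token range makes depletion effectively impossible for the duration of the mitigation.

```python
# Illustrative model only -- NOT riak's actual implementation.
import random

class VnodeTokenPool:
    def __init__(self, min_tokens, max_tokens):
        self.min_tokens = min_tokens
        self.max_tokens = max_tokens
        self.tokens = self._refill()

    def _refill(self):
        # Refill to a value somewhere in the configured range.
        return random.randint(self.min_tokens, self.max_tokens)

    def handle_write(self):
        """Return True if the write proceeded without being blocked."""
        if self.tokens > 0:
            self.tokens -= 1
            return True
        # Pool depleted: in riak this is where the write would wait
        # behind hashtree work; here we just refill and note the block.
        self.tokens = self._refill() - 1
        return False

    def reset(self, new_min, new_max):
        # Analogue of riak_kv_util:reset_hashtree_tokens(Min, Max).
        self.min_tokens = new_min
        self.max_tokens = new_max
        self.tokens = self._refill()

pool = VnodeTokenPool(90, 100)
blocked = sum(not pool.handle_write() for _ in range(1000))
pool.reset(200000, 250000)   # raise the cap, as in the mitigation
blocked_after = sum(not pool.handle_write() for _ in range(1000))
print(blocked > 0, blocked_after == 0)  # prints: True True
```

With the small range, some of the 1000 writes hit an empty pool; with the raised range, none do.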

This can be managed through riak attach:

Get the current min and max token count across the cluster:

{Min, Max} = riak_kv_util:return_hashtree_tokens().

Set the max token count to be in a range between very large numbers on each vnode:

riak_kv_util:reset_hashtree_tokens(200000, 250000).

After completing any related work, reset back to the original range:

riak_kv_util:reset_hashtree_tokens(Min, Max).

In a healthy cluster this works cluster-wide from any single node; there is no need to run the commands on each node.

Simple utility to report and reset the AAE hashtree tokens
Currently we are unable to pinpoint delays or to understand potential issues.
@Bob-The-Marauder

The code looks good and the functionality makes sense. An ideal additional feature would be an automated option that, when enabled, acts as follows:

AAE hashtree clearing begins:

  1. See how many tokens there are in the pool.
  2. Use the tools here to set token value for the pool to a large number capped at a sensible value.
  3. Perform full AAE hashtree clearing process.
  4. Finish clearing, write out AAE queue.
  5. Reset hashtree tokens back to the value in 1.

This would allow users who regularly encounter basho/riak#981 to avoid the problem without needing to write creative scripts that invoke the above controls every time AAE needs to clear the hashtrees.
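As a sketch, the proposed five-step sequence (whether automated inside riak or in an external script) could look like the following. This is hypothetical: `get_hashtree_tokens`, `set_hashtree_tokens` and `clear_hashtrees` are placeholder names for the real operations (the `riak_kv_util` calls from this PR plus the hashtree clearing process); only the control flow is the point.

```python
# Hypothetical wrapper for the five steps above -- placeholder API,
# not real riak code.
TOKEN_CAP = 250_000  # "capped at a sensible value" (step 2)

def clear_with_raised_tokens(cluster, temp_tokens=200_000):
    saved = cluster.get_hashtree_tokens()             # step 1
    cluster.set_hashtree_tokens(min(temp_tokens, TOKEN_CAP),
                                TOKEN_CAP)            # step 2
    try:
        cluster.clear_hashtrees()                     # steps 3 and 4
    finally:
        cluster.set_hashtree_tokens(*saved)           # step 5

class FakeCluster:
    """Stand-in cluster, used only to exercise the control flow."""
    def __init__(self):
        self.tokens = (90, 100)
        self.cleared = False
    def get_hashtree_tokens(self):
        return self.tokens
    def set_hashtree_tokens(self, lo, hi):
        self.tokens = (lo, hi)
    def clear_hashtrees(self):
        self.cleared = True

c = FakeCluster()
clear_with_raised_tokens(c)
print(c.cleared, c.tokens)  # prints: True (90, 100)
```

The `try/finally` mirrors step 5: the saved token range is restored even if the clearing step fails part-way through.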

@martinsumner
Contributor Author

@Bob-The-Marauder

Automated coordination between the hashtree clearing/rebuild process and the AAE token pool (which is on the process dictionary of the vnode process, not the hashtree process) would, I think, be quite hard. There is significant potential for race (or deadlock) conditions if these two processes try to coordinate activity. Resolving these would require new states to be defined (on the kv_index_hashtree process), and then consideration of how all messages should be handled in those new states. Lots of the sort of risky, hard-to-test work that one would prefer to avoid.

I think effort would be better spent on the root causes (e.g. the need for co-ordinated rebuilds due to AAE/TTL conflict, the need for separate AAE stores, the long time spent clearing trees), than on automating the mitigation.

@Bob-The-Marauder

That makes sense. So far I am only aware of two companies that have run into this issue, and one of them simply opted to turn off AAE (even though they were not using TTL). The tools above provide enough functionality for a workaround to be added to a custom AAE script, automating the mitigation. That should be good enough for now. Unless you have any further edits, +1 from me.

@martinsumner martinsumner merged commit 70889c5 into basho:develop-2.9 Sep 2, 2019
@martinsumner martinsumner deleted the mas-i981-resethashtreetokens branch September 2, 2019 08:52