modify RenewPeriodic to retry failed Renew until TTL elapses #912
We've come across an interesting condition while operating our Consul cluster: during a leader election, all of our locks were being TTL'd out. Digging in a bit further, it appears that during a leader election the renew call fails and RenewPeriodic returns. Once the TTL elapses, the sessions are invalidated and the locks are lost.
I've modified RenewPeriodic to keep retrying after an error until the TTL passes. This should allow our sessions and their locks to survive a leader election, so long as the election takes less than TTL/2 to finish.
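To make the intended behavior concrete, here is a minimal, self-contained sketch of the retry-until-TTL idea. It is not the diff in this PR and does not use the Consul API: `renewUntilTTL`, `renewFunc`, `errNoLeader`, `demoTTL`, and `demoRetryWait` are hypothetical names for illustration, and the short retry interval after a failure is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// renewFunc is a hypothetical stand-in for a single session renew call.
type renewFunc func() error

// errNoLeader simulates the error a renew might see during a leader election.
var errNoLeader = errors.New("no cluster leader")

// Short demo values so the example finishes quickly (assumptions, not defaults).
const (
	demoTTL       = 100 * time.Millisecond
	demoRetryWait = 10 * time.Millisecond
)

// renewUntilTTL renews every ttl/2. After a failed renew it retries every
// retryWait, and gives up only once more than ttl has passed since the last
// successful renew. It returns nil when done is closed.
func renewUntilTTL(ttl, retryWait time.Duration, renew renewFunc, done <-chan struct{}) error {
	lastSuccess := time.Now()
	wait := ttl / 2
	for {
		select {
		case <-done:
			return nil
		case <-time.After(wait):
		}
		if err := renew(); err != nil {
			if time.Since(lastSuccess) > ttl {
				// The session has presumably expired; stop retrying.
				return fmt.Errorf("renew failed past TTL: %w", err)
			}
			wait = retryWait // retry sooner while the cluster recovers
			continue
		}
		lastSuccess = time.Now()
		wait = ttl / 2
	}
}

func main() {
	// Simulated election: the first three renews fail, then the leader is back.
	calls := 0
	done := make(chan struct{})
	renew := func() error {
		calls++
		switch {
		case calls <= 3:
			return errNoLeader
		case calls == 5:
			close(done) // stop the loop after two successful renews
		}
		return nil
	}
	err := renewUntilTTL(demoTTL, demoRetryWait, renew, done)
	fmt.Println("calls:", calls, "err:", err)
}
```

With this shape, a short run of failures is absorbed by retries, while a renew that keeps failing for longer than the TTL still surfaces an error to the caller.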
Here is an example set of logs that show what is happening:
Why an apparently healthy cluster lost communication and forced an election is a separate question; I have not figured out the root cause of that yet.
Thanks!