-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lease/lessor: recheck if exprired lease is revoked #10693
Conversation
Codecov Report
@@ Coverage Diff @@
## master #10693 +/- ##
==========================================
- Coverage 62.47% 62.04% -0.44%
==========================================
Files 392 392
Lines 37158 37197 +39
==========================================
- Hits 23215 23078 -137
- Misses 12381 12544 +163
- Partials 1562 1575 +13
Continue to review full report at Codecov.
|
cc @jpbetz |
The way logic is split between Setting Something like this might be more readable and less confusing:
For "clean", I don't know if we really need any more than a |
Sorry, I’m not thinking straight today. Ignore my above comment about binary search of leaseHeap, that won’t work. It might be that adjusting the expire time like suggested in the PR is the most efficient solution. If we do that, heap.Fix() is more efficient than Pop/Push. |
Another approach would be to peek and pop from the heap until a non-expired lease is found and to put all the expired leases in a slice separate from the heap. This avoids most (all?) of the complications of having to look up all the expired leases from the heap. |
@jpbetz Actually the lease is expired checked by |
My thinking with
The more I think about it, the less certain I am that we need any backoff on retry at this point. We might not even need |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just ramping up on the lease code. Inserted couple thoughts.
lease/lessor.go
Outdated
// Candidate expirations are caught up, reinsert this item | ||
// and no need to revoke (nothing is expiry) | ||
return l, false, false | ||
} | ||
// if the lease is actually expired, add to the removal list. If it is not expired, we can ignore it because another entry will have been inserted into the heap | ||
|
||
heap.Pop(&le.leaseHeap) // O(log N) | ||
|
||
// recheck if revoke is complete after retry interval | ||
item.time = now.Add(expiredleaseRetryInterval).UnixNano() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this fix the original issue. My concerns:
-
Not sure how much overhead this is adding. Pushing back to heap is O(logn), this is an additional cost for every expiring lease. Eventually we might want to benchmark the performance.
-
After push back, we have mismatched information on
item.time
between leaseMap and leaseHeap. I do not think it is causing any issue now, but it might be confusing to have the same lease ID with different time. Alternatively, can we push lease back to heap without modifying the time, but simply delay the push, as opposed to push back right away. There is another potential benefit of delayed push: potentially we can check if the lease is revoked successfully at the end of delay, therefore avoid pushing back to heap in case lease is revoked successfully. Conceptually something like:
select {
case <-l.revokec: // lease is revoked successfully
case <-time.After(expiredleaseRetryInterval): // push lease back to heap (no need to modify lease's time)
}
Just some preliminary thoughts, not sure how hard it is to properly implement (I am aware that the current solution is very straight forward and simple).
master:
the branch:
|
If keepalive still |
Benchmark result looks good. Thanks! I am afraid my previous comments on |
@jingyih
|
@nolouch @jingyih Examples with go's new feature --benchtime Nx b.N = 1
b.N = 2
b.N = 10
|
lease/lessor.go
Outdated
heap.Pop(&le.leaseHeap) // O(log N) | ||
// recheck if revoke is complete after retry interval | ||
item.time = now.Add(le.expiredLeaseRetryInterval).UnixNano() | ||
le.leaseHeap.Push(item) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We find expired lease and push them to leaseHeap again before send them to expiredC.
So do these in revokeExpiredLeases
maybe better?
And we check lease is really expired in findExpiredLeases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emmm.... What if the lease actually expired but do not delete from the leaseMap
?
@nolouch @jingyih Line 603 in e899023
It seems like the reason for revoke failure.
The buffer size of Just some preliminary thoughts, I have not confirmed it, will do it later. |
Maybe we can fix this with another simple solution. In
@nolouch need some advice. |
@yuqitao
|
@j2gg0s I think your method also work. but same issue is you should hold all expired itme with same lease if the lease not expired. also if the array length grow, you need to iterator it , maybe also expensive. |
my mistake |
I use the failpoint test and works fine, PTAL again @jpbetz @jingyih. and setting
|
Is I ask this just for my own confusion about this test. Rechecking the lease can fix the problem revoking failure occurs occasionally. But if the reason for revoke failure is something which occurs frequently, eg some env problem, there will be too many retry leases. |
@yuqitao Thanks.
It's a memory operator, and I only maintain one item for one lease now, so I just delay some time to delete the item. If you worry about the duplicate |
It appears that the lease benchmarks are broken. Let's get #10710 reviewed and merged so we can do benchmarks properly. |
@jpbetz Lease checkpointing has similar behavior. Is it OK if the leases are already popped out of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will look at unit test later. Otherwise the code change looks good. Let's wait for the benchmarking to be fixed so that we can ensure there is no substantial performance regression.
PTAL @jingyih |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm after the fixes.
Signed-off-by: nolouch <nolouch@gmail.com>
lgtm Please note that the lease benchmark is fixed in another PR and the result comparison is posted at #10710 (comment).
|
Fix #10686:
This is an easy way to try to fix the issue, and I'm not sure if this is a good method. I need some suggestions about it and then add the test.