Leases wait for entries to be applied #13690
Codecov Report
@@ Coverage Diff @@
## main #13690 +/- ##
==========================================
- Coverage 72.68% 72.58% -0.10%
==========================================
Files 467 467
Lines 38278 38291 +13
==========================================
- Hits 27822 27795 -27
- Misses 8653 8685 +32
- Partials 1803 1811 +8
Verified. It works.
I found a similar issue, 6978, which was fixed by PR 7015 more than 5 years ago. It only fixed the HTTP entries. etcd didn't support gRPC at that time, and the fix wasn't ported to gRPC later. So I just ported the fix to gRPC as well. I will update the existing test cases TestV3LeaseRenewStress and TestV3LeaseTimeToLiveStress later.
I updated the integration test and the 3.6 changelog. This PR is ready for review. @serathius @spzala @ptabor
Just rebased this PR.
The
Just raised an issue to track this failure.
Resolved all comments. PTAL @serathius @ptabor @spzala
Just rebased this PR.
LGTM, based on the fact that this restores the behavior of the V2 API (both ApplyWait and the 1 second timeout).
I think we need @ptabor's opinion here. In #13868 (comment) he brought up that making
cc @endocrimes
Pasting my discussion with @endocrimes about implementing LeaseList.
danielle 5:38 PM
serathius 6:12 PM
In v2, revoked leases didn't go through raft; the leader just sent a SYNC request through raft that didn't specify which leases should be revoked. This means that each member could remove different leases depending on the time on the server. So all requests about leases still needed to go through the leader. I assume that when the V3 API was implemented this approach was blindly copied. The same problem affects other lease methods, see #13690.
In V3, revoking a lease goes through raft. The RevokeLease entry specifies which lease is removed, and all members need to agree through quorum. This means that it should be enough to use linearizableReadNotify (i.e. check that the member is up to date with the leader's apply index) and then return the local response without contacting the leader.
I'm not super familiar with the lease code and this is just based on quickly reading through the code. We should double check with @Piotr Tabor and @spzala.
Looking at #13690, I think that we should open an issue which describes this incorrect behavior in the V3 API and removes going through the leader.
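To make the idea in the pasted discussion concrete, here is a minimal Go sketch of serving a lease read locally after a read-index barrier. It is a toy model, not etcd code: `entry`, `member`, `applyUpTo` and `listLeases` are hypothetical names, the log is an in-memory slice, and "waiting for the apply loop" is simulated by applying the missing entries synchronously.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical replicated log record: a lease grant or a revoke.
type entry struct {
	revoke bool
	id     int64
}

// member is a toy cluster member with its own applied index and lease set.
type member struct {
	applied int // number of log entries applied so far
	leases  map[int64]bool
}

// applyUpTo applies log entries until the member has applied `index` entries.
// In a real server this work is done by a background apply loop.
func (m *member) applyUpTo(log []entry, index int) {
	for m.applied < index {
		e := log[m.applied]
		if e.revoke {
			delete(m.leases, e.id)
		} else {
			m.leases[e.id] = true
		}
		m.applied++
	}
}

// listLeases serves a read locally. readIndex stands for the leader's commit
// index at request time; the member first catches up to it (a stand-in for
// blocking until the apply loop gets there) and only then answers from
// its local state.
func (m *member) listLeases(log []entry, readIndex int) []int64 {
	m.applyUpTo(log, readIndex)
	ids := make([]int64, 0, len(m.leases))
	for id := range m.leases {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	log := []entry{{false, 1}, {false, 2}, {true, 1}} // grant 1, grant 2, revoke 1
	follower := &member{leases: map[int64]bool{}}
	follower.applyUpTo(log, 2) // the follower lags: the revoke is not applied yet

	fmt.Println("stale local view:", follower.listLeases(log, follower.applied))
	fmt.Println("after read-index barrier:", follower.listLeases(log, len(log)))
}
```

In this toy run the lagging follower still reports lease 1 until it catches up to the read index, after which only lease 2 remains. Because the revoke is an explicit log entry, every member that has applied it gives the same answer.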
What this PR tries to resolve is definitely a BUG, and I think it's different from what we discussed on the
The following test flake should be resolved by this PR.
@ptabor @serathius @spzala Could you take a look at this PR? Once it's merged, I can backport it to 3.5.
I talked with @ptabor and we agreed that the current lease implementation is bad and the fact that it is not linearizable is a big problem. This PR addresses some of the issues but not all of them (checking I will work on a reproduction and create a proposal to rewrite the API for v3.5.3. @endocrimes this will also cover your work on LeaseLeases and LeaseList.
We will use a different approach to fix the issue.
Sounds good to me. Can we move this one out of 3.5.3? I don't think it's a blocker for 3.5.3.
When etcdserver receives a LeaseRenew request, it may still be processing the LeaseGrantRequest for exactly the same leaseID. Accordingly, it may return TTL=0 to the client due to the "leaseID not found" error. So the leader should wait for the appliedID to be available before processing client requests.
I did another round, and I think this patch is the best we can get in the short term:
- It does NOT fix issues when the leader changes, and that's not the goal of this patch.
- The waitApplied index seems not to 'get stuck' as I initially thought when the member is disconnected, as ApplyWait() just waits until the current CommittedIndex (already received in the hardState) is applied. This process happens when all the needed entries are in the hard state, so the application process should eventually converge...
- The solution that changes reporting 'TTL' does not address the race with 'Grant' that this patch solves.
Thanks @ptabor for the feedback, which basically makes sense. Just one comment/question on your comment
As explained in #13915, checking
Based on @ptabor's comment, I think that this change makes sense and addresses one part of the problem I missed: the possibility that there is a LeaseRevoke request in the hard state that is waiting to be applied. With the need to cut v3.5.3 soon, I think this is the best fix we can prepare for now.
FYI integration-1-cpu workflow is failing for the third time. Possibly not related to this issue, however still very worrying as the error in the last one was
Possibly this is a result of enabling ETCD_VERIFY in tests.
I will take a deep dive into this tomorrow.
Yes, the failure should not be related to this PR. Previously all tests were green; I just rebased this PR. The tests are green again.
@ahrtr Can you backport the change to release-3.5?
Sure, will do today.
Fix issue 13675.
When etcdserver receives a LeaseRenew request, it may still be processing the LeaseGrantRequest for exactly the same leaseID. Accordingly, it may return TTL=0 to the client due to the "leaseID not found" error. Please see my analysis in issue 13675.
Before I go ahead and add an e2e or integration test, I'd like to get more feedback. cc @serathius @ptabor @spzala
@liberize, please kindly let me know whether this resolves your issue. Please clone my branch, build etcd binaries and verify the fix, thanks. I verified in my local environment, and confirmed that it works.