-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lessor: fix le.itemMap leak after lease id revoked #16035
Conversation
I've been trying to repro this a bit further with a testcase. It seems when you do a grant -> attach -> revoke, the keys attached are not freed from the itemMap. Is this the same leakage you've observed? just for reference on further discussion: func TestLessorGrantAttachRevokeDetach(t *testing.T) {
lg := zap.NewNop()
dir, be := NewTestBackend(t)
defer os.RemoveAll(dir)
defer be.Close()
le := newLessor(lg, be, clusterLatest(), LessorConfig{MinLeaseTTL: minLeaseTTL})
defer le.Stop()
le.SetRangeDeleter(func() TxnDelete { return newFakeDeleter(be) })
// grant a lease with long term (100 seconds) to
// avoid early termination during the test.
l, err := le.Grant(1, 100)
if err != nil {
t.Fatalf("could not grant lease for 100s ttl (%v)", err)
}
items := []LeaseItem{
{"foo"},
{"bar"},
}
if err := le.Attach(l.ID, items); err != nil {
t.Fatalf("failed to attach items to the lease: %v", err)
}
if err := le.Revoke(l.ID); err != nil {
t.Fatalf("failed to revoke lease: %v", err)
}
err = le.Detach(l.ID, items[0:1])
assert.Equal(t, ErrLeaseNotFound, err)
assert.Equal(t, 0, len(le.leaseMap))
assert.Equal(t, 2, len(le.itemMap)) <-- leak
} |
@tjungblu almost yes, when lease is revoked, the keys attached with it will delete forever, and the lessor.ItemMap will never delete them. Use my code, it will PASS just like: |
I assume the rangeDeleter uses this code below? https://github.com/etcd-io/etcd/blob/main/server/storage/mvcc/kvstore_txn.go#L304-L315 Looking from the other angle of the revoke, the lease in the rangeDeleter won't find the lease anymore because it was already deleted earlier? https://github.com/etcd-io/etcd/blob/main/server/lease/lessor.go#L330-L344 edit: looks like this ordering was changed in #14080 |
it seems not. in Revoke it delete leaseMap, but in the rangeDeleter, it call GetLease with LeaseItem. In GetLease, it find lease from itemMap. |
@qsyqian @qsyqian @tjungblu Please try to add a test case to intentionally reproduce the issue; and the apply the patch in this PR to fix the issue. |
server/lease/lessor_test.go
Outdated
func TestLessorGrantAttachRevokeDetach(t *testing.T) { | ||
lg := zap.NewNop() | ||
dir, be := NewTestBackend(t) | ||
defer os.RemoveAll(dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is unnecessary, because the data directory is created via t.TempDir()
, and it will be cleanuped automatically when the test finishes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
This fix needs to be backported to 3.4 and 3.5. Could you raise an issue to track this bug fix? We keep the issue open until all backport and changelog are done. We see many times that bug fixes were forgotten to be backported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thank you @qsyqian
FWIW, if you run the e2e test, it should reproduce the issue with the following error message. It does not fail the test and we could possibly use log assertion to guarantee detach in RangeDeleter succeeds with no error.
Trying to understand deeply about this issue in k8s etcd usage, why I don't see those lease not found error messages in production etcd clusters log scanning by past 1 day? == (We are using etcd 3.5.7). I expect event TTL expired will trigger this issue. Okay, release-3.5 lessor and release-3.4 is confirmed not impacted. Only main branch is impacted introduced by #14080. I can remove the backport tag if we are on the same page :p |
yes, I think you are right. In |
@serathius added two ut for kv store. Also add one |
server/lease/lessor.go
Outdated
@@ -572,6 +572,9 @@ func (le *lessor) Detach(id LeaseID, items []LeaseItem) error { | |||
|
|||
l := le.leaseMap[id] | |||
if l == nil { | |||
for _, it := range items { | |||
delete(le.itemMap, it) | |||
} | |||
return ErrLeaseNotFound |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thx for review. With your suggested change if the lease which attched a lot keys and spend a lot of time for
for _, key := range keys {
txn.DeleteRange([]byte(key), nil)
}
And in this time, le.Lookup()
or le.Leases()
will return the lease which just revoked.
I don't know whether it's right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And in this time, le.Lookup() or le.Leases() will return the lease which just revoked.
Yeah, that's a valid concern..
Maybe acquiring le.Lock() and defer the Unlock
and then txn.DeleteRange()
is the right choice. It could be slow when the lease is attached a lot keys but correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @chaochn47 . Indeed #14080 introduced the leak. The lease logic is tightly coupled with the mvcc (key space operation), it's really spaghetti code. When mvcc deletes a key, it will automatically detach the lease attached to the key. We should execute delete(le.leaseMap, id)
after removing all the keys.
Probably we should think about how to decouple them, and improve the readability and maintainability. This can be discussed separately.
Will send a PR to rollback the change.
Actually the deadlock fixed in #14080 doesn't likely happen because all lease operations (grant, revoke, checkpoint) are executed sequentially in the applying workflow. If anyone still thinks it's possible to run into a deadlock, please create a test to reproduce it firstly.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
ping @qsyqian |
2d1aaa9
to
d619fd9
Compare
@serathius @chaochn47 fix it in another way. add step to delete le.itemMap in le.Revoke() after txn.End(). |
9aefb51
to
563662f
Compare
cc @ahrtr @chaochn47 for re-review. |
e5e6d19
to
7453f3f
Compare
@ahrtr updated, remove changes in lessor.go and also change the commit message. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ahrtr, qsyqian The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
When lease is revoked, we should delete item in le.itemMap.
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.