a data corruption bug in revoking lease when upgrading cluster from v3.2 to v3.3/v3.4+ #11689
Comments
LeaseRevoke may fail to apply when authentication is enabled and upgrading cluster from etcd-3.2 to etcd-3.3
Since this only impacts clusters using both auth and leases, Kubernetes is not impacted, right?
@jpbetz It affects our Kubernetes cluster because our etcd cluster has auth enabled.
Ah
If authentication is enabled, there is a certain probability of encountering this issue in a Kubernetes cluster.
Lease revoke may come from a v3rpc user request [1] or internally from etcdserver [2]. Does this happen in both cases?
[1] etcd/etcdserver/api/v3rpc/lease.go Line 46 in 282cce7
[2] Line 854 in 282cce7
@jingyih Yes, this happens in both cases. But the internal lease revoke in etcdserver [2] triggers this bug more easily: once a lease expires, the lease revoke call is triggered, and it does not carry authentication information.
PR #11691 fixed this issue.
What happened:
Recently, our team (TencentCloud k8s team) encountered another serious data inconsistency bug when upgrading a cluster (3.2 -> 3.3). The number of keys differs from node to node, and the cluster stops working when you deploy or update a workload.
How to troubleshoot it:
We added debug logging and used a simple chaos monkey tool, and we successfully reproduced the problem. Data inconsistency in etcd is very hard to troubleshoot because of the lack of logs.
node A (3.2+, Leader)
node B (3.2+, Follower)
node C (3.3+, Follower)
Node C (3.3+) failed to apply the lease_revoke command (error: auth: user name is empty). This error keeps amplifying: the mvcc revisions diverge very quickly, txn commands fail to execute, and the data becomes corrupted.
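The divergence described above can be illustrated with a small, self-contained model. This is only a sketch under simplifying assumptions, not etcd's actual apply code: the `Node` class, the `checks_revoke_auth` flag, the log entry format, and the rule that each successful write bumps the revision by one are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


class ApplyError(Exception):
    """Raised when a node refuses to apply a log entry."""


@dataclass
class Node:
    name: str
    checks_revoke_auth: bool  # assumption: True models a 3.3+ member
    revision: int = 1
    keys: dict = field(default_factory=dict)  # key -> lease id

    def apply(self, entry):
        if entry["op"] == "put":
            self.keys[entry["key"]] = entry["lease"]
            self.revision += 1
        elif entry["op"] == "lease_revoke":
            # Modeled 3.3+ behaviour: a revoke without a user name is rejected.
            if self.checks_revoke_auth and not entry.get("user"):
                raise ApplyError("auth: user name is empty")
            attached = [k for k, l in self.keys.items() if l == entry["lease"]]
            for k in attached:
                del self.keys[k]
            if attached:
                self.revision += 1


def replay(nodes, log):
    """Apply the same log to every node and report (revision, key count)."""
    for entry in log:
        for node in nodes:
            try:
                node.apply(entry)
            except ApplyError:
                pass  # the entry is skipped on this node only
    return {n.name: (n.revision, len(n.keys)) for n in nodes}


nodes = [Node("A", False), Node("B", False), Node("C", True)]
log = [
    {"op": "put", "key": "k1", "lease": 1, "user": "root"},
    {"op": "lease_revoke", "lease": 1, "user": ""},  # no auth info attached
    {"op": "put", "key": "k2", "lease": 0, "user": "root"},
]
state = replay(nodes, log)
print(state)
```

In this model, nodes A and B end with the same revision and key count, while node C keeps the key that should have been deleted and falls one revision behind. Every later write widens that gap.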
How to fix it:
The upgrade documentation should describe this bug. Users must back up their data and be careful with the cluster upgrade; it is a high-risk operation.
We have submitted PR #11691 to release-3.2 to ensure that auth info is not nil.
Users who want to upgrade a cluster from 3.2 to 3.3/3.4 can first upgrade the cluster to the latest 3.2 version.
Do you have any better suggestions?
@jingyih @mitake
How this bug was introduced:
PR #8031 (protecting lease revoking with auth) limits the users who can revoke leases. If the user has not been granted write permission on the keys attached to the lease, the revoke request is denied.
3.0+/3.1+/3.2+ do not limit the users who can revoke leases, so the user name in their revoke requests is empty.
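The rule described above can be sketched as a single function. This is a hedged illustration of the behaviour as stated in this issue, not the code from PR #8031: the function name `can_revoke_lease`, its parameters, the `enforce` switch, and the "permission denied" reason string are hypothetical.

```python
def can_revoke_lease(user, attached_keys, write_perms, enforce=True):
    """Return (allowed, reason) for a lease revoke request.

    user          -- user name carried by the request ("" if none)
    attached_keys -- keys attached to the lease being revoked
    write_perms   -- hypothetical mapping: user name -> set of writable keys
    enforce       -- False models the pre-3.3 behaviour (no check at all)
    """
    if not enforce:
        return True, "no check"
    if not user:
        # The error reported on the 3.3+ node in this issue.
        return False, "auth: user name is empty"
    writable = write_perms.get(user, set())
    if any(key not in writable for key in attached_keys):
        return False, "permission denied"
    return True, "ok"


perms = {"alice": {"k1", "k2"}, "bob": {"k1"}}
print(can_revoke_lease("alice", ["k1", "k2"], perms))
print(can_revoke_lease("bob", ["k1", "k2"], perms))
print(can_revoke_lease("", ["k1"], perms))
print(can_revoke_lease("", ["k1"], perms, enforce=False))
```

The last two calls show the mismatch: the same request with an empty user name is accepted by a member that does not enforce the check and rejected by one that does.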
Impact:
It is possible to encounter this bug when authentication is enabled and the cluster is upgraded from v3.0/v3.1/v3.2 to v3.3/v3.4.