etcd server DB out of sync undetected #12535
@frans42 There are a number of failures in the logs suggesting that the nodes have a problem communicating with each other. Can you check your network? I understand I'm not answering your questions directly, but I would suggest looking at fixing the root cause first. |
Hi @agargi, thanks for looking at this. The ups and downs of individual members are expected and are the result of maintenance on the hosts that run the members. Maybe a bit more background is useful: the three member hosts also run ceph monitor and manager daemons. If there were a networking problem, ceph would tell me. The network outages you see are the result of firmware+OS updates that involved a shutdown of each host. In fact, I would argue that a noisy network should never be a root cause, as the whole point of a distributed database (KV store) is to survive noisy connections.

The problem I report here is something I also observed without any maintenance. It happens quite regularly, and I do not even have many IO requests on the etcd cluster. It is that the etcd members seem not to compare checksums of their respective databases as part of the health check. It seems that a member can lose a commit but still move to the latest raft index without noticing, leading to a silent corruption of the database.

I don't have a test etcd cluster at hand to reproduce this for debugging, unfortunately, and was hoping that some clue about the de-synchronisation is in the log or in the current state of the DB. I need to get the etcd cluster synchronised very soon. If there are any diagnostic commands I can run on the individual DBs or members, please let me know as soon as possible.

I would like to mention that I have found many reports of the same issue, a silent corruption of the combined data (members out of sync), which seems to be a long-standing issue with etcd. I'm somewhat surprised that a simple checksum test seems not to be part of the internal health check. I can detect the issue with the output of "etcdctl endpoint hashkv --cluster", so why can't the cluster?

Best regards, Frank |
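A minimal sketch of the cross-endpoint checksum comparison described above, using the clientv3 maintenance API rather than etcdctl; the endpoints are placeholders and the import path assumes a v3.4-era client:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3" // v3.4 import path; v3.5+ moved to go.etcd.io/etcd/client/v3
)

func main() {
	// Placeholder endpoints; replace with the real member client URLs.
	endpoints := []string{"https://etcd-ceph-01:2379", "https://etcd-ceph-02:2379", "https://etcd-ceph-03:2379"}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Ask every endpoint for the hash of its keyspace at its latest revision (rev=0),
	// which is roughly what `etcdctl endpoint hashkv --cluster` prints.
	hashes := map[string]uint32{}
	for _, ep := range endpoints {
		resp, err := cli.HashKV(ctx, ep, 0)
		if err != nil {
			fmt.Printf("%s: hashkv failed: %v\n", ep, err)
			continue
		}
		hashes[ep] = resp.Hash
		fmt.Printf("%s: hash=%d revision=%d\n", ep, resp.Hash, resp.Header.Revision)
	}

	// Any mismatch between reachable members indicates out-of-sync databases.
	var refEp string
	for ep, h := range hashes {
		if refEp == "" {
			refEp = ep
			continue
		}
		if h != hashes[refEp] {
			fmt.Println("WARNING: endpoint hashes differ, members are out of sync")
			return
		}
	}
	fmt.Println("all reachable endpoint hashes match")
}
```

On a busy cluster the members' latest revisions differ transiently, so a stricter check would hash all endpoints at the same revision; on a near-idle cluster like the one described here, a direct comparison should be meaningful.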
I would like to add some context to my previous comment. What I'm referring to when I say that the networking is not relevant here is a mismatch between observed and expected behaviour. I understand that networking problems could be the reason for the de-synchronisation of etcd, but this is actually not the point of this ticket. The point is that an etcd cluster in this broken state should not serve requests.

What is the expected behaviour? I will use the example of 3 member instances and only consider the case with all members up but the DB increasingly inconsistent.

Case A: All DBs identical: all members serve read-write requests.

Case B: One DB out of sync: the remaining 2 members in quorum serve read-write requests; the out-of-sync member goes into a health-err state and stops serving any requests. (In fact, I would expect a periodic re-sync attempt so it can join the quorum again.) Any operation with etcdctl should print a warning message and return with a special error code indicating that the operation was OK but the cluster is not. This can be done by OR-ing a bit into the usual return codes, allowing 2 pieces of information to be returned in one byte value.

Case C: All three DBs out of sync: all members go into a health-err state and every member stops serving requests.

My etcd cluster is clearly in case C and should stop serving requests. This is completely independent of how it got into this state. In fact, if etcd is not built for self-healing in case B, I would actually expect it to stop serving requests already when 1 member is down / 1 DB copy is out of sync, to protect the copies in quorum. Then I would not now be in a situation with 3 different copies where nobody knows any more which one is authoritative.

If what I describe in cases A-C is the expected behaviour, then it is not implemented and should be fixed. If cases A-C are not the expected behaviour, then please let me know what is, and how I can check for and ensure consistent copies of the DB instances.

Thanks and best regards, Frank |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
Hi, I added additional information as requested. Could someone please look at this? |
As long as you are using 'linearizable' reads, they should be served only if you have an active quorum. |
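As a reference for the distinction being made here, a minimal sketch contrasting the default linearizable read with a serializable read in clientv3; the endpoint and key are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3" // v3.4 import path
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Linearizable read (default): goes through the raft quorum, so a member
	// that has lost quorum cannot answer it.
	if resp, err := cli.Get(ctx, "foo"); err == nil {
		fmt.Println("linearizable read:", resp.Kvs)
	}

	// Serializable read: answered from the contacted member's local store,
	// so it can return stale data from an out-of-sync member.
	if resp, err := cli.Get(ctx, "foo", clientv3.WithSerializable()); err == nil {
		fmt.Println("serializable read:", resp.Kvs)
	}
}
```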
I think we are running into a similar issue. One or more of our members lose their sync. Our usage is essentially: create an election, then listen to the election on the same endpoint (see the sketch after this comment).
This is a graph from etcd_debugging_mvcc_current_revision where we can clearly see that the revision is not in sync anymore. The only log we can see around that time is this on endpoint1, which seems to be from when our leader-elector client tries to query this endpoint but it does not contain the same leases as the other 2 nodes.
|
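A minimal sketch of the create-an-election / listen-on-the-same-endpoint pattern referenced above, using etcd's concurrency package (concurrency.NewSession also comes up later in this thread); the endpoint, key prefix, and candidate value are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3" // v3.4 import path
	"go.etcd.io/etcd/clientv3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-1:2379"}, // placeholder: same endpoint for both roles
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// The session grants and keeps alive the lease backing the election.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		panic(err)
	}
	defer sess.Close()

	election := concurrency.NewElection(sess, "/leader-election/demo") // placeholder prefix

	// Campaign blocks until this candidate becomes the leader.
	go func() {
		if err := election.Campaign(context.Background(), "candidate-1"); err != nil {
			fmt.Println("campaign failed:", err)
		}
	}()

	// Observe streams the current leader value as it changes.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	for resp := range election.Observe(ctx) {
		if len(resp.Kvs) > 0 {
			fmt.Println("current leader value:", string(resp.Kvs[0].Value))
		}
	}
}
```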
Hi @jonaz, thanks for adding this data. This is exactly what I'm observing too. The members get out of sync and stay out of sync; it is not a temporary "catching up" as mentioned by @ptabor. Future commits can lead to members serving the same data again, but apart from the commit history being inconsistent, I also observe that the members tend to drift more and more out of sync (the frequency of failed uniform commits increases). |
I was debugging this now, trying some config flags and other things to see what was broken... and lo and behold, it started and was synced again. The difference was that it purged a snap file when it started. The node was online, but we had turned off etcd for about a week... We also found out that the service file could be tuned a little bit. We will try with the tuned service file and hope it works. |
The above did not help. We have 3 clusters with this problem. They are only used for leases and have very little traffic. All 3 get out of sync periodically, and we have to stop the entire cluster, scrap the data folder, and start them again to make them work. The symptom is that a lease is instant on one node but the other two lag 30-60 seconds behind, sometimes more. We are running 3.4.14. |
Hi @ptabor, I don't know what a serializable read would be or whether our access pattern qualifies. However, the members were not "lagging" in the usual sense of being in the process of synchronization. The members stayed out of sync for days. The only way to recover is to wipe the out-of-sync members as @jonaz writes above and re-deploy fresh instances. My impression is that the members actually believe they are in sync. That's why they don't even make an attempt to sync. However, the condition of being out of sync is extremely easy to detect by comparing the hashkv values, and I'm surprised that (1) this is not part of the internal cluster health check and (2) there is no etcdctl command to force a re-sync from a specific source instance. This would help a lot with fixing a broken cluster. |
It happened again now at 14:22 in one of our production clusters. Nothing happened to the VM. No vMotion at that time. No logs in etcd except for this, exactly when the revision started to get out of sync:
We restarted the out-of-sync node with ETCD_EXPERIMENTAL_INITIAL_CORRUPT_CHECK=true
A thing to note is that this cluster is only ever used for leader elections and nothing else. |
@frans42 are you also using leader election? Do you have custom software built for it like we do? We suspect that something we do in the client is causing this. |
Hi @jonaz, we use a standard set-up and don't do anything special. Leaders are elected by whatever etcd has implemented. It looks like the devs are not going to address this. I plan to move away from etcd; it simply is too unreliable with this simple health check and self-healing missing. Even if I purposefully corrupted the DB, it should detect that and, if possible, fix it. However, it doesn't even detect an accidental corruption, and with the lack of interest in fixing this, I cannot rely on it. The same applies to your situation. It is completely irrelevant how you manage to throw an etcd instance off. The etcd cluster should detect and fix the problem or err out. However, it simply continues to answer requests and happily distributes inconsistent data. That's not acceptable for a distributed database. If the idea of the devs is that users add a cron job for health checks themselves, then the least that should be available is a command to force a re-sync of an instance against the others without the pain of removing and re-adding an instance. |
@ptabor we saw a similar issue today in our company's internally deployed cluster. We may need to do some further investigation on this. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
We have not seen this lately. But we have also not seen major VM problems, which we think were the initial cause. We have also stopped calculating our own lease ID and now let concurrency.NewSession do it for us. |
@jonaz The AWS team faced the same data inconsistency problem as well. After log diving, it appeared that under this condition, the lessor […] (lines 308 to 332 at a905430).
Due to the kube-apiserver usage of etcd Txn (specifically the optimistic lock on ResourceVersion) like the following […], the revision difference will be amplified and cause more serious cascading failures like […]
From your log posted in the previous comment, the lease ID […] (lines 247 to 258 at a905430).
For this "bad" etcd cluster, we also dumped the db file, inspected the lease bucket, and found out there were multiple corrupted leases (a read-only inspection sketch follows this comment).
After editing the […]
Interestingly, the corrupted leases (lease ID < 0) are ignored when recovering the lessor from the db file, so the key-values associated with those corrupted leases will never be deleted (lines 770 to 776 at a905430).
/cc @wilsonwang371 @ptabor Does the above explanation make sense to you? |
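A minimal sketch of the kind of offline lease-bucket inspection described above, reading a copy of a member's db file with bbolt; the file path is a placeholder, and the bucket name "lease" plus the 8-byte big-endian lease-ID keys are assumptions about etcd's backend layout:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Placeholder path: typically a copy of <data-dir>/member/snap/db.
	// Inspect a copy, never the live file of a running member.
	db, err := bolt.Open("/tmp/etcd-db-copy", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("lease")) // assumed name of the lease bucket
		if b == nil {
			return fmt.Errorf("lease bucket not found")
		}
		return b.ForEach(func(k, v []byte) error {
			if len(k) != 8 {
				fmt.Printf("unexpected lease key length %d\n", len(k))
				return nil
			}
			// Lease IDs are assumed to be stored as big-endian int64 keys.
			id := int64(binary.BigEndian.Uint64(k))
			if id < 0 {
				fmt.Printf("corrupted lease: id=%d (negative)\n", id)
			} else {
				fmt.Printf("lease id=%d value-bytes=%d\n", id, len(v))
			}
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}
```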
@chaochn47 Thanks for digging into the issue. I would be interested in looking into this more deeply. It looks like this bug was first detected on v3.3, so it affects the last 3 releases. To make sure we fix it correctly, we should look into adding a test that can verify the fix across multiple branches. Could you prepare minimal reproduction steps or a script that would trigger the issue? |
@serathius Thanks for looking into this issue! Even though the existence of negative leaseIDs may be the result of a bit-flip corruption in the network (peer-to-peer raft entry replication) or in disk IO (lessor-to-backend persistence), I think we could still edit the clientv3 code to guard against it (lines 214 to 227 at a905430; a sketch of such a client-side guard follows below).
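A minimal sketch of the client-side validation suggested above: a hypothetical wrapper around clientv3's Grant that refuses non-positive lease IDs before they are attached to any key. It illustrates the idea only and is not the actual patch:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3" // v3.4 import path
)

// grantValidated wraps Lease.Grant and refuses to hand back a lease whose ID
// is not strictly positive, which would indicate a corrupted or bogus grant.
func grantValidated(ctx context.Context, cli *clientv3.Client, ttl int64) (clientv3.LeaseID, error) {
	resp, err := cli.Grant(ctx, ttl)
	if err != nil {
		return 0, err
	}
	if resp.ID <= 0 {
		return 0, fmt.Errorf("refusing lease with invalid ID %d", resp.ID)
	}
	return resp.ID, nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	id, err := grantValidated(ctx, cli, 60)
	if err != nil {
		panic(err)
	}
	// Attach a key to the validated lease.
	if _, err := cli.Put(ctx, "foo", "bar", clientv3.WithLease(id)); err != nil {
		panic(err)
	}
}
```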
Yeah, we will update on this issue very soon. |
Update: I just attached PR chaochn47#5 in my forked etcd repository with an integration test that reproduces the issue. PTAL, thanks!! The fix can be as easy as changing line 775 at a905430 to […] and then letting corrupted lease values fail […]. For a runtime data consistency check, we might need to start creating a […] |
Run a simulation script in a 1.21 Kubernetes cluster, followed by […]. The revision gap across the 3 etcd nodes became larger and larger. The […]
|
@chaochn47 Hi Chao, which version of etcd did you use? Did you use the latest etcd? There was a patch fixing this issue: #13505 |
@wilsonwang371 Hi Wilson, we are using etcd v3.4.18 (lines 351 to 363 at 72d3e38); xref https://github.com/etcd-io/etcd/blob/v3.4.18/mvcc/watchable_store.go#L351-L363. For my education, is the lease bucket data corruption a victim of deep copying during the boltdb re-mmap? How will reads impact the write path? |
No, actually I am not sure about this yet. The root cause of the DB-out-of-sync behaviour we observed internally is #13505 |
It seems like the root cause for etcd diverging is a lease with a negative ID. As such leases are not loaded during init, restarted members fail to revoke them. The question still remains: how were such leases created? Is this a bug in server lease ID generation, lease persisting, the client, or the application? I don't think it's disk data corruption, as flipping the sign bit is too specific. I will try to look through the etcd and Kubernetes lease ID handling code. |
We will report back to the community once we know how it happened.
From my understanding, Kubernetes just uses the clientv3 Grant API and caches the last used leaseID generated by the etcd server (lines 214 to 227 at a905430):
// leaseManager is used to manage leases requested from etcd. If a new write
// needs a lease that has similar expiration time to the previous one, the old
// lease will be reused to reduce the overhead of etcd, since lease operations
// are expensive. In the implementation, we only store one previous lease,
// since all the events have the same ttl.
type leaseManager struct {
client *clientv3.Client // etcd client used to grant leases
leaseMu sync.Mutex
prevLeaseID clientv3.LeaseID
prevLeaseExpirationTime time.Time
// The period of time in seconds and percent of TTL that each lease is
// reused. The minimum of them is used to avoid unreasonably large
// numbers.
leaseReuseDurationSeconds int64
leaseReuseDurationPercent float64
leaseMaxAttachedObjectCount int64
leaseAttachedObjectCount int64
} |
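To make the caching described above concrete, here is a simplified, hypothetical sketch of how such a leaseManager could decide between reusing the cached lease and granting a new one; it is not the actual Kubernetes implementation, and the reuse conditions are illustrative:

```go
package etcd3 // assumes the leaseManager struct quoted above is defined in this package

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3"
)

// getLease returns a lease ID to attach to the next write, reusing the
// previously granted lease while it still comfortably covers the requested
// TTL; otherwise it grants a fresh lease and caches it.
// (Simplified illustration; field names mirror the struct quoted above.)
func (l *leaseManager) getLease(ctx context.Context, ttl int64) (clientv3.LeaseID, error) {
	l.leaseMu.Lock()
	defer l.leaseMu.Unlock()

	now := time.Now()
	reuseWindow := time.Duration(l.leaseReuseDurationSeconds) * time.Second

	// Reuse the cached lease if it expires late enough to still cover the
	// requested TTL plus the reuse window, and it is not over-attached.
	if l.prevLeaseID != clientv3.NoLease &&
		l.prevLeaseExpirationTime.After(now.Add(time.Duration(ttl)*time.Second+reuseWindow)) &&
		l.leaseAttachedObjectCount < l.leaseMaxAttachedObjectCount {
		l.leaseAttachedObjectCount++
		return l.prevLeaseID, nil
	}

	// Otherwise grant a new lease, slightly longer than requested so that it
	// can be shared by subsequent writes, and cache it.
	resp, err := l.client.Grant(ctx, ttl+l.leaseReuseDurationSeconds)
	if err != nil {
		return clientv3.NoLease, err
	}
	l.prevLeaseID = resp.ID
	l.prevLeaseExpirationTime = now.Add(time.Duration(resp.TTL) * time.Second)
	l.leaseAttachedObjectCount = 1
	return resp.ID, nil
}
```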
I just got time to take a look at this issue. I see some […] |
I run 3 etcd server instances and I frequently observe that the databases get out of sync without the etcd cluster detecting a cluster health issue. Repeated get requests for a key will return different (versions of) values depending on which server the local proxy queries. The only way to detect this problem is to compare the hashkv of all endpoints. All other health checks return "healthy". A typical status check looks like this (using the v3 api):
I couldn't figure out how to check the actual cluster health with etcdctl. All I seem to be able to do is check endpoint health, but the implication "all endpoints healthy" -> "cluster is healthy" clearly does not hold.
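A minimal sketch of a cluster-level check that goes beyond per-endpoint health, comparing the revision and raft index each member reports via the clientv3 maintenance Status call; the endpoints and the divergence threshold are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3" // v3.4 import path
)

func main() {
	// Placeholder endpoints for the three members.
	endpoints := []string{"https://etcd-ceph-01:2379", "https://etcd-ceph-02:2379", "https://etcd-ceph-03:2379"}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	first := true
	var minRev, maxRev int64
	for _, ep := range endpoints {
		st, err := cli.Status(ctx, ep)
		if err != nil {
			fmt.Printf("%s: status failed: %v\n", ep, err)
			continue
		}
		rev := st.Header.Revision
		fmt.Printf("%s: revision=%d raftIndex=%d leader=%x\n", ep, rev, st.RaftIndex, st.Leader)
		if first || rev < minRev {
			minRev = rev
		}
		if first || rev > maxRev {
			maxRev = rev
		}
		first = false
	}

	// A small, shrinking gap is normal replication lag; a large or growing gap
	// suggests members have diverged and deserves a hashkv comparison.
	if maxRev-minRev > 10 { // illustrative threshold
		fmt.Println("WARNING: member revisions diverge; compare endpoint hashkv")
	} else {
		fmt.Println("member revisions are consistent within the threshold")
	}
}
```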
The cluster is running etcd 3.3.11 on CentOS 7.7 with the stock packages. I attached a number of files:
etcd.info.txt - some info collected according to instructions.
etcd-ceph-01.conf.txt, etcd-ceph-02.conf.txt, etcd-ceph-03.conf.txt - the 3 config files of the etcd members
etcd-gnosis.conf.txt - a config file for a client using the etcd proxy service
etcd.log - the result of grep -e "etcd:" /var/log/messages (the etcd log goes to syslog); this log should cover at least one occasion of loss of DB consistency.
I cannot attach the databases, because they contain credentials. However, I can - for some time - run commands on the databases and post results. I have a little bit of time before I need to synchronize the servers again.
In the meantime, I could use some help with recovering. This is the first time all 3 instances are different. Usually I have only 1 out-of-sync server and can get back to normal by removing and re-adding it. However, I now have 3 different instances and it is no longer trivial to decide which copy is the latest. Therefore, if you could help me with these questions, I would be grateful:
Thanks for your help.