Inconsistent data in an etcd cluster #9630

lbernail · 2018-04-25T22:16:47Z

We have a 3-node etcd cluster that we use as the backend for a Kubernetes cluster, and on one of the nodes the data is inconsistent with the others:

Member list

etcdctl member list
76c74df0105143e4, started, etcd1, https://172.30.171.85:2380, https://172.30.171.85:2379
b4a97ffa7975df71, started, etcd2, https://172.30.173.252:2380, https://172.30.173.252:2379
bba515b5b42ffb5c, started, etcd0, https://172.30.167.81:2380, https://172.30.167.81:2379

Status

etcdctl endpoint status
https://172.30.167.81:2379, bba515b5b42ffb5c, 3.2.18+git, 1.2 GB, false, 2, 3115003
https://172.30.171.85:2379, 76c74df0105143e4, 3.2.18+git, 1.2 GB, true, 2, 3115003
https://172.30.173.252:2379, b4a97ffa7975df71, 3.2.18+git, 851 MB, false, 2, 3115003

Data inconsistency
OK Node

etcdctl --endpoints https://172.30.167.81:2379 get --prefix --keys-only /registry/deployments/datadog/datadog-agent-kube-state-metrics --consistency="l"
/registry/deployments/datadog/datadog-agent-kube-state-metrics

Inconsistent Node: key is missing

etcdctl --endpoints https://172.30.173.252:2379 get --prefix --keys-only /registry/deployments/datadog/datadog-agent-kube-state-metrics --consistency="l"

Possible cause
We manage our cluster with Terraform and we upgraded it. The upgrade involved replacing the etcd instances, but we kept the data and WAL directories (on EBS drives on AWS), and the new nodes had the same IPs as the initial ones and the same etcd version. However, etcd was probably not shut down cleanly.

etcd version: We were using a custom build from the 3.2 branch because 3.2.19 had not been released yet and we needed this PR: #9570
Our etcd was built from this commit: https://github.com/roboll/etcd/commit/d45053c068950a5672a22d1192249313dbcbca26 with Go 1.10 (binary available here: https://github.com/roboll/etcd/releases/tag/v3.2.19-datadog). Even though this is not an official release, we believe this should not have happened.

We are keeping the cluster in this state to be able to diagnose what happened. We are happy to send more details.


gyuho · 2018-04-25T22:21:36Z

We manage our cluster with Terraform and we upgraded it. The upgrade involved replacing the etcd instances, but we kept the data and WAL directories (on EBS drives on AWS), and the new nodes had the same IPs as the initial ones and the same etcd version. However, etcd was probably not shut down cleanly.

Is that replaced node (new node) the one with different data size?

data and wal directories (on EBS drives on AWS)

What do you mean by "replacing" a node? Was it done via member remove and member add?

Also, have you ever run the defrag command? Maybe the db files on the other two nodes are still fragmented.

gyuho · 2018-04-25T22:22:08Z

Btw, the TLS fix was released yesterday in https://github.com/coreos/etcd/releases/tag/v3.2.19.

lbernail · 2018-04-25T22:33:31Z

Is that replaced node (new node) the one with different data size?

We replaced the 3 nodes at the same time. I know this is a very bad idea, and if the cluster had failed to come up when the 3 new nodes started, I think it would have made sense. But the cluster elected a leader successfully and started serving requests.

What do you mean by "replacing" a node? Was it done via member remove and member add?

No, Terraform just shut down the three nodes and created new instances with the same data disks and IP addresses.

I will run the defrag command.

(Yes, we plan to use the new release as soon as possible.)

gyuho · 2018-04-25T22:37:51Z

No, Terraform just shut down the three nodes and created new instances with the same data disks and IP addresses.

I see. So, before the replacement, were the data sizes consistent?

I will run the defrag command.

Yes, please try, and let us know if it still returns different numbers.

lbernail · 2018-04-25T22:45:20Z

I just did a defrag:

etcdctl endpoint status
https://172.30.167.81:2379, bba515b5b42ffb5c, 3.2.18+git, 336 MB, false, 2, 3125697
https://172.30.171.85:2379, 76c74df0105143e4, 3.2.18+git, 336 MB, true, 2, 3125697
https://172.30.173.252:2379, b4a97ffa7975df71, 3.2.18+git, 135 MB, false, 2, 3125697

So the data sizes are still different (and some keys are still missing when accessing the 3rd node).

Before the replacement the cluster was working fine. I did not look into etcd at the time, but I assume everything was OK.

gyuho · 2018-04-25T22:54:00Z

@lbernail As long as the Raft index (the last column in the endpoint status output) stays the same, I think this is the same issue as #7116 (comment) and #8009, which were resolved in v3.3 with the new bolt DB.
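A minimal sketch of that comparison in Python, assuming the comma-separated `endpoint status` output pasted earlier in this thread (endpoint, member ID, version, DB size, is-leader, Raft term, Raft index). `parse_status` and `summarize` are hypothetical helper names, not part of etcd or etcdctl.

```python
from typing import Dict, List

# Field order assumed from the `etcdctl endpoint status` output pasted in this thread.
FIELDS = ["endpoint", "member_id", "version", "db_size",
          "is_leader", "raft_term", "raft_index"]


def parse_status(output: str) -> List[Dict[str, str]]:
    """Parse comma-separated `endpoint status` lines into dicts."""
    rows = []
    for line in output.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != len(FIELDS):
            continue  # skip anything that is not a status row, e.g. the command line
        rows.append(dict(zip(FIELDS, parts)))
    return rows


def summarize(rows: List[Dict[str, str]]) -> Dict[str, bool]:
    """Report whether all members agree on Raft index and on DB size."""
    return {
        "same_raft_index": len({r["raft_index"] for r in rows}) == 1,
        "same_db_size": len({r["db_size"] for r in rows}) == 1,
    }


if __name__ == "__main__":
    # Post-defrag status lines copied from the comment above.
    sample = """\
https://172.30.167.81:2379, bba515b5b42ffb5c, 3.2.18+git, 336 MB, false, 2, 3125697
https://172.30.171.85:2379, 76c74df0105143e4, 3.2.18+git, 336 MB, true, 2, 3125697
https://172.30.173.252:2379, b4a97ffa7975df71, 3.2.18+git, 135 MB, false, 2, 3125697
"""
    print(summarize(parse_status(sample)))
```

Matching Raft indexes only show that the members have applied the same number of log entries; they say nothing about whether the stored keys match, which is why the revision check below is still needed.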

Can you also check

ETCDCTL_API=3 ./bin/etcdctl --endpoints EP1 get foo --write-out json --consistency="s"

for each node?

The header's revision should be the same across the cluster if the members have the same data.

lbernail · 2018-04-25T23:08:18Z

Thank you

Would the database size issue explain data inconsistency between nodes?

etcdctl --endpoints <inconsistent_node> get /registry/deployments/datadog/datadog-agent-kube-state-metrics --write-out json --consistency="s"
{"header":{"cluster_id":18088641622189011759,"member_id":13018076911647448945,"revision":1670966,"raft_term":2}}

etcdctl --endpoints <ok node 1> get /registry/deployments/datadog/datadog-agent-kube-state-metrics --write-out json --consistency="s"
{"header":{"cluster_id":18088641622189011759,"member_id":13521237326406089564,"revision":1447034,"raft_term":2},"kvs":[{"key":"L3JlZ2lzdHJ5L2RlcGxveW1lbnRzL2RhdGFkb2cvZGF0YWRvZy1hZ2VudC1rdWJlLXN0YXRlLW1ldHJpY3M=","create_revision":627026,"mod_revision":1141362,"version":397,"value":"xxx"}],"count":1}

etcdctl --endpoints <ok node 2> get /registry/deployments/datadog/datadog-agent-kube-state-metrics --write-out json --consistency="s"
{"header":{"cluster_id":18088641622189011759,"member_id":8558895310302168036,"revision":1447416,"raft_term":2},"kvs":[{"key":"L3JlZ2lzdHJ5L2RlcGxveW1lbnRzL2RhdGFkb2cvZGF0YWRvZy1hZ2VudC1rdWJlLXN0YXRlLW1ldHJpY3M=","create_revision":627026,"mod_revision":1141362,"version":397,"value":"xxx"}],"count":1}

gyuho · 2018-04-25T23:57:02Z

"revision":1670966,"raft_term":2}}

"revision":1447034,"raft_term":2}

"revision":1447416,"raft_term":2}

Strange. All revisions are different. Are they still receiving writes? Also, it seems the WAL files were not kept during the migration? The Raft term is only 2, which means there was only one leader election.
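For reference, the same comparison can be scripted. This is only a sketch, assuming the JSON shape printed by `--write-out json` in the comment above; the helper names are hypothetical, and the sample responses are the headers from this thread with the `kvs` payload left out.

```python
import json
from typing import Dict


def header_revisions(outputs: Dict[str, str]) -> Dict[str, int]:
    """Map each endpoint label to the header revision in its JSON response."""
    return {name: json.loads(raw)["header"]["revision"]
            for name, raw in outputs.items()}


def revisions_agree(revisions: Dict[str, int]) -> bool:
    """True when every endpoint reported the same revision."""
    return len(set(revisions.values())) <= 1


def key_present(raw: str) -> bool:
    """True when the response contains at least one key-value pair.

    The output above omits "count" entirely when nothing matched.
    """
    return json.loads(raw).get("count", 0) > 0


if __name__ == "__main__":
    outputs = {
        "inconsistent node": '{"header":{"cluster_id":18088641622189011759,'
                             '"member_id":13018076911647448945,'
                             '"revision":1670966,"raft_term":2}}',
        "ok node 1": '{"header":{"cluster_id":18088641622189011759,'
                     '"member_id":13521237326406089564,'
                     '"revision":1447034,"raft_term":2},"count":1}',
        "ok node 2": '{"header":{"cluster_id":18088641622189011759,'
                     '"member_id":8558895310302168036,'
                     '"revision":1447416,"raft_term":2},"count":1}',
    }
    revisions = header_revisions(outputs)
    print(revisions, "agree:", revisions_agree(revisions))
    for name, raw in outputs.items():
        print(name, "has key:", key_present(raw))
```

On a cluster that is still taking writes, revisions sampled at slightly different moments can differ a little, so a small gap alone does not prove divergence.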

lbernail · 2018-04-26T18:20:58Z

The Kubernetes cluster is still up (in a pretty bad state), so it is still reading from and writing to etcd.

The disks were supposed to be identical on the new instances and the raft term is low because the cluster had only been up for a few hours.

After more investigation, it turns out we had a disk issue: we use one EBS disk for data and one for the WAL, and when the disks were reattached they were swapped (the WAL disk was attached as the data disk and the data disk as the WAL disk). This is due to how NVMe drives are listed on new c5/m5 instances on Ubuntu, and we have just fixed it.
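One way to avoid depending on NVMe enumeration order is to resolve each EBS volume by its volume ID rather than by device name. The sketch below is illustrative and not necessarily the fix applied here. It assumes udev exposes EBS NVMe volumes under `/dev/disk/by-id/` with names of the form `nvme-Amazon_Elastic_Block_Store_vol<id>`, which should be verified on the target image. The function name and the volume IDs are hypothetical.

```python
import os
import re
from typing import Dict

# Assumed udev naming for EBS NVMe volumes; verify on the target image.
_EBS_LINK = re.compile(r"^nvme-Amazon_Elastic_Block_Store_(vol[0-9a-f]+)$")


def ebs_devices(by_id_dir: str = "/dev/disk/by-id") -> Dict[str, str]:
    """Return {volume_id: resolved device path} for whole-disk EBS links."""
    devices = {}
    for name in sorted(os.listdir(by_id_dir)):
        match = _EBS_LINK.match(name)
        if not match:
            continue  # partitions ("-part1") and non-EBS devices are skipped
        # The link name is assumed to omit the dash of "vol-0123..."; restore it.
        volume_id = "vol-" + match.group(1)[3:]
        devices[volume_id] = os.path.realpath(os.path.join(by_id_dir, name))
    return devices


if __name__ == "__main__":
    # Hypothetical volume IDs for the data and WAL disks.
    wanted = {"data": "vol-0123456789abcdef0", "wal": "vol-0fedcba9876543210"}
    found = ebs_devices() if os.path.isdir("/dev/disk/by-id") else {}
    for role, volume_id in wanted.items():
        print(role, volume_id, found.get(volume_id, "not attached"))
```

Mounting by the resolved path (or by filesystem label or UUID) keeps the data and WAL roles fixed regardless of the order in which the kernel lists the drives.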

So what actually happened is that we had a 3-node cluster, stopped the nodes, and restarted them in this state:

instance 1: data and WAL intact
instances 2 and 3: no data and no WAL

In this situation, we would have expected the cluster not to come up at all (which would make sense), but it came up in the bad state described above (serving inconsistent data depending on which node we reached).

xiang90 · 2018-04-26T18:32:36Z

we would have expected the cluster not to come up at all (which would make sense), but it came up in the bad state described above (serving inconsistent data depending on which node we reached)

You probably reused the old cluster token and the exact same configuration. etcd really cannot distinguish this setup from a fresh cluster setup. The two new nodes might have formed a new cluster by themselves and disrupted the old one.

Also, I think the one "old" node has some issues too, since its term went back to 2. Without more information, it is hard for us to debug this situation. If you can reproduce it reliably, we can look into it more.
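Because etcd cannot tell an emptied directory from a first boot, a guard in the provisioning layer is one option. The following is a sketch only, not an etcd feature. It assumes a dedicated `--wal-dir` holding `*.wal` files directly and the `member/snap/db` layout under `--data-dir`. The `preflight` function and its `expect_existing` flag (set by whoever knows the member has been bootstrapped before) are hypothetical.

```python
import glob
import os
from typing import List


def preflight(data_dir: str, wal_dir: str, expect_existing: bool) -> List[str]:
    """Return a list of problems; an empty list means it looks safe to start etcd."""
    has_db = os.path.isfile(os.path.join(data_dir, "member", "snap", "db"))
    has_wal = bool(glob.glob(os.path.join(wal_dir, "*.wal")))

    problems = []
    if expect_existing and not has_db:
        problems.append("no member/snap/db in data dir, but existing state was expected")
    if expect_existing and not has_wal:
        problems.append("no *.wal files in WAL dir, but existing state was expected")
    # State sitting in the "wrong" directory suggests the two disks were swapped.
    if glob.glob(os.path.join(data_dir, "*.wal")):
        problems.append("WAL files found in data dir: disks may be swapped")
    if os.path.isdir(os.path.join(wal_dir, "member")):
        problems.append("member/ found in WAL dir: disks may be swapped")
    return problems


if __name__ == "__main__":
    # Hypothetical mount points for the two EBS disks.
    issues = preflight("/var/lib/etcd/data", "/var/lib/etcd/wal", expect_existing=True)
    for issue in issues:
        print("refusing to start etcd:", issue)
```

With a check like this run before the etcd process, a member that was expected to carry state but came up with empty or swapped disks would stay down instead of bootstrapping as new.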

xiang90 closed this as completed Apr 26, 2018

njhill mentioned this issue Apr 3, 2019

Inconsistent Revisions Across Members ( v3.3.3 ) #10594

Closed

wswcfan mentioned this issue Feb 24, 2020

a data corruption bug in all etcd3 version when authentication is enabled #11651

Closed

sanjitp mentioned this issue Mar 16, 2022

Data inconsistency in etcd version 3.3.11 #13503

Closed
