Inconsistent data in an etcd cluster #9630

Closed
lbernail opened this issue Apr 25, 2018 · 10 comments

Comments

@lbernail

We have a 3-node etcd cluster that we used as a backend for a Kubernetes cluster, and on one of the nodes the data is inconsistent with the others:

Member list

etcdctl member list
76c74df0105143e4, started, etcd1, https://172.30.171.85:2380, https://172.30.171.85:2379
b4a97ffa7975df71, started, etcd2, https://172.30.173.252:2380, https://172.30.173.252:2379
bba515b5b42ffb5c, started, etcd0, https://172.30.167.81:2380, https://172.30.167.81:2379

Status

etcdctl endpoint status
https://172.30.167.81:2379, bba515b5b42ffb5c, 3.2.18+git, 1.2 GB, false, 2, 3115003
https://172.30.171.85:2379, 76c74df0105143e4, 3.2.18+git, 1.2 GB, true, 2, 3115003
https://172.30.173.252:2379, b4a97ffa7975df71, 3.2.18+git, 851 MB, false, 2, 3115003

Data inconsistency
OK Node

etcdctl --endpoints https://172.30.167.81:2379 get --prefix --keys-only /registry/deployments/datadog/datadog-agent-kube-state-metrics --consistency="l"
/registry/deployments/datadog/datadog-agent-kube-state-metrics

Inconsistent Node: key is missing

etcdctl --endpoints https://172.30.173.252:2379 get --prefix --keys-only /registry/deployments/datadog/datadog-agent-kube-state-metrics --consistency="l"
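
For anyone repeating this comparison over a larger prefix, the two listings can be diffed with a short script. This is only a minimal sketch: the helper names `listed_keys` and `missing_keys` are hypothetical, and the inputs are the outputs pasted above (one key on the OK node, nothing on the inconsistent node).

```python
def listed_keys(output: str) -> set:
    """Collect the non-empty lines of a `get --prefix --keys-only` listing."""
    return {line.strip() for line in output.splitlines() if line.strip()}


def missing_keys(reference_output: str, candidate_output: str) -> list:
    """Keys present in the reference listing but absent from the candidate."""
    return sorted(listed_keys(reference_output) - listed_keys(candidate_output))


# Outputs pasted from the two commands above.
OK_NODE_OUTPUT = "/registry/deployments/datadog/datadog-agent-kube-state-metrics\n\n"
INCONSISTENT_NODE_OUTPUT = ""

MISSING = missing_keys(OK_NODE_OUTPUT, INCONSISTENT_NODE_OUTPUT)
for key in MISSING:
    print("missing on inconsistent node:", key)
```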

Possible cause
We manage our cluster with terraform and we upgraded it. The upgrade involved replacing the etcd instances but we kept the data and wal directories (on EBS drives on AWS) and the new nodes had the same IP as the initial ones and the same etcd version. However etcd was probably not cleanly shut down.

etcd version: We were using a custom build from the 3.2 branch because 3.2.19 had not been released yet and we needed this PR: #9570
Our etcd was built from this commit: https://github.com/roboll/etcd/commit/d45053c068950a5672a22d1192249313dbcbca26 with go 1.10 (binary available here: https://github.com/roboll/etcd/releases/tag/v3.2.19-datadog). Even if this is not an official release we believe that this should not have happened.

We are keeping the cluster in this state to be able to diagnose what happened. We are happy to send more details.

@gyuho
Contributor

gyuho commented Apr 25, 2018

We manage our cluster with terraform and we upgraded it. The upgrade involved replacing the etcd instances but we kept the data and wal directories (on EBS drives on AWS) and the new nodes had the same IP as the initial ones and the same etcd version. However etcd was probably not cleanly shut down.

Is that replaced node (new node) the one with different data size?

data and wal directories (on EBS drives on AWS)

What do you mean by "replacing" a node? Was it done via member remove and member add?

Also, have you ever run the defrag command? Maybe the db files on the two other nodes are still fragmented.

@gyuho
Contributor

gyuho commented Apr 25, 2018

Btw, the TLS fix has been released with https://github.com/coreos/etcd/releases/tag/v3.2.19 yesterday.

@lbernail
Author

lbernail commented Apr 25, 2018

Is that replaced node (new node) the one with different data size?

We replaced the 3 nodes at the same time. I know this is a very bad idea, and if the cluster had failed to come up when the 3 new nodes started, I think it would have made sense. But the cluster elected a leader successfully and started serving requests.

What do you mean by "replacing" a node? Was it done via member remove and member add?

No, terraform just shut down the three nodes and created new instances with the same data disks and IP addresses.

I will run the defrag command.

(yes we plan to use the new release as soon as possible)

@gyuho
Contributor

gyuho commented Apr 25, 2018

No, terraform just shut down the three nodes and created new instances with the same data disks and IP addresses.

I see. So, before the replacement happened, were the data sizes consistent?

I will run the defrag command.

Yes, please try, and let us know if it still returns different numbers.

@lbernail
Author

lbernail commented Apr 25, 2018

I just did a defrag:

etcdctl endpoint status
https://172.30.167.81:2379, bba515b5b42ffb5c, 3.2.18+git, 336 MB, false, 2, 3125697
https://172.30.171.85:2379, 76c74df0105143e4, 3.2.18+git, 336 MB, true, 2, 3125697
https://172.30.173.252:2379, b4a97ffa7975df71, 3.2.18+git, 135 MB, false, 2, 3125697

So data sizes are still different (and some keys are still missing when accessing the 3rd node)

Before the replacement the cluster was working fine. I did not look into etcd at the time, but I assume everything was OK.
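
To compare these status lines by script rather than by eye, a minimal Python sketch follows. The column order (endpoint, member ID, version, DB size, is-leader, Raft term, Raft index) is an assumption about this etcdctl version's comma-separated output, and the names `EndpointStatus`, `parse_status_line` and `status_report` are hypothetical.

```python
from typing import NamedTuple


class EndpointStatus(NamedTuple):
    endpoint: str
    member_id: str
    version: str
    db_size: str
    is_leader: bool
    raft_term: int
    raft_index: int


def parse_status_line(line: str) -> EndpointStatus:
    """Parse one comma-separated line of `etcdctl endpoint status` output."""
    # Assumed column order: endpoint, ID, version, DB size,
    # is-leader, Raft term, Raft index.
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    endpoint, member_id, version, db_size, is_leader, term, index = fields
    return EndpointStatus(endpoint, member_id, version, db_size,
                          is_leader == "true", int(term), int(index))


def status_report(output: str) -> dict:
    """Group the values worth comparing across members."""
    statuses = [parse_status_line(line)
                for line in output.splitlines() if line.strip()]
    return {
        "leaders": [s.endpoint for s in statuses if s.is_leader],
        "raft_indexes": sorted({s.raft_index for s in statuses}),
        "db_sizes": {s.endpoint: s.db_size for s in statuses},
    }


# Output pasted from the post-defrag status above.
STATUS = """\
https://172.30.167.81:2379, bba515b5b42ffb5c, 3.2.18+git, 336 MB, false, 2, 3125697
https://172.30.171.85:2379, 76c74df0105143e4, 3.2.18+git, 336 MB, true, 2, 3125697
https://172.30.173.252:2379, b4a97ffa7975df71, 3.2.18+git, 135 MB, false, 2, 3125697
"""

REPORT = status_report(STATUS)
print(REPORT)
```

A single Raft index together with differing DB sizes is the pattern visible in the output above; the sketch only surfaces it and does not explain the cause.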

@gyuho
Contributor

gyuho commented Apr 25, 2018

@lbernail As long as the Raft index (the last column in the endpoint status output) stays the same, I think this is the same issue as #7116 (comment) and #8009, which were resolved in v3.3 with the new bolt DB.

Can you also check

ETCDCTL_API=3 ./bin/etcdctl --endpoints EP1 get foo --write-out json --consistency="s"

for each node?

The header's revision should be the same across the cluster if the nodes have the same data.

@lbernail
Author

Thank you

Would the database size issue explain data inconsistency between nodes?

etcdctl --endpoints <inconsistent_node> get /registry/deployments/datadog/datadog-agent-kube-state-metrics --write-out json --consistency="s"
{"header":{"cluster_id":18088641622189011759,"member_id":13018076911647448945,"revision":1670966,"raft_term":2}}
etcdctl --endpoints <ok node 1> get /registry/deployments/datadog/datadog-agent-kube-state-metrics --write-out json --consistency="s"
{"header":{"cluster_id":18088641622189011759,"member_id":13521237326406089564,"revision":1447034,"raft_term":2},"kvs":[{"key":"L3JlZ2lzdHJ5L2RlcGxveW1lbnRzL2RhdGFkb2cvZGF0YWRvZy1hZ2VudC1rdWJlLXN0YXRlLW1ldHJpY3M=","create_revision":627026,"mod_revision":1141362,"version":397,"value":"xxx"}],"count":1}
etcdctl --endpoints <ok node 2> get /registry/deployments/datadog/datadog-agent-kube-state-metrics --write-out json --consistency="s"
{"header":{"cluster_id":18088641622189011759,"member_id":8558895310302168036,"revision":1447416,"raft_term":2},"kvs":[{"key":"L3JlZ2lzdHJ5L2RlcGxveW1lbnRzL2RhdGFkb2cvZGF0YWRvZy1hZ2VudC1rdWJlLXN0YXRlLW1ldHJpY3M=","create_revision":627026,"mod_revision":1141362,"version":397,"value":"xxx"}],"count":1}
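
The three responses above can also be compared with a few lines of Python. This is a sketch rather than an official tool: `summarize_response` is a hypothetical helper, and the JSON strings are copied from the outputs above (with the value redacted as "xxx"). It relies on two details visible in the thread: the JSON header prints `member_id` in decimal while `member list` prints it in hexadecimal, and keys in the JSON output are base64-encoded.

```python
import base64
import json

# Responses pasted from the three commands above.
RESPONSES = {
    "inconsistent_node": '{"header":{"cluster_id":18088641622189011759,"member_id":13018076911647448945,"revision":1670966,"raft_term":2}}',
    "ok node 1": '{"header":{"cluster_id":18088641622189011759,"member_id":13521237326406089564,"revision":1447034,"raft_term":2},"kvs":[{"key":"L3JlZ2lzdHJ5L2RlcGxveW1lbnRzL2RhdGFkb2cvZGF0YWRvZy1hZ2VudC1rdWJlLXN0YXRlLW1ldHJpY3M=","create_revision":627026,"mod_revision":1141362,"version":397,"value":"xxx"}],"count":1}',
    "ok node 2": '{"header":{"cluster_id":18088641622189011759,"member_id":8558895310302168036,"revision":1447416,"raft_term":2},"kvs":[{"key":"L3JlZ2lzdHJ5L2RlcGxveW1lbnRzL2RhdGFkb2cvZGF0YWRvZy1hZ2VudC1rdWJlLXN0YXRlLW1ldHJpY3M=","create_revision":627026,"mod_revision":1141362,"version":397,"value":"xxx"}],"count":1}',
}


def summarize_response(raw: str) -> dict:
    """Extract the fields worth comparing from one `--write-out json` response."""
    doc = json.loads(raw)
    header = doc["header"]
    return {
        # Hex form, to match the IDs printed by `etcdctl member list`.
        "member_id_hex": format(header["member_id"], "x"),
        "revision": header["revision"],
        "raft_term": header["raft_term"],
        # Keys are base64-encoded in the JSON output; "kvs" is absent when
        # the key was not found.
        "keys": [base64.b64decode(kv["key"]).decode() for kv in doc.get("kvs", [])],
    }


SUMMARIES = {name: summarize_response(raw) for name, raw in RESPONSES.items()}
REVISIONS = {name: s["revision"] for name, s in SUMMARIES.items()}

for name, summary in SUMMARIES.items():
    print(name, summary)
print("distinct revisions:", sorted(set(REVISIONS.values())))
```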

@gyuho
Contributor

gyuho commented Apr 25, 2018

"revision":1670966,"raft_term":2}}

"revision":1447034,"raft_term":2}

"revision":1447416,"raft_term":2}

Strange. All revisions are different. Are they still receiving writes? Also, it seems like the WAL files were not kept during the migration? The Raft term is only 2, which means there was only one leader election.

@lbernail
Author

The kubernetes cluster is still up (in a pretty bad state) so it is still reading and writing to etcd.

The disks were supposed to be identical on the new instances and the raft term is low because the cluster had only been up for a few hours.

After more investigation it turns out we had a disk issue: we use an EBS disk for data and one for wal and when disks were reattached they were swapped (the wal disk was attached as the data disk and the data disk as the wal disk). This is due to how nvme drives are listed on new c5/m5 instances on ubuntu and we have just fixed it.

So what actually happened is we had a 3 node cluster, we stopped them and restarted them in this state:

  • instance 1: data and wal intact
  • instances 2 and 3: no data and no wal

In this situation, we would have expected the cluster to not come up at all (which makes sense) but it came up in the bad state described above (serving inconsistent data depending on the node we were reaching)
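
One way to catch a swap like this before etcd starts is a pre-start check on the two mount points. The sketch below is hypothetical and is not the fix described above (which was made in how the NVMe drives are listed). It assumes the usual layout where the data directory contains `member/snap` and a dedicated wal directory directly contains `*.wal` files; the function names are illustrative only.

```python
from pathlib import Path


def looks_like_wal_dir(path: Path) -> bool:
    """True if the directory directly contains at least one *.wal file."""
    return path.is_dir() and any(path.glob("*.wal"))


def looks_like_data_dir(path: Path) -> bool:
    """True if the directory contains member/snap (assumed data-dir layout)."""
    return (path / "member" / "snap").is_dir()


def check_mounts(data_dir: Path, wal_dir: Path) -> str:
    """Classify the two mount points as 'ok', 'swapped' or 'unexpected'."""
    if looks_like_data_dir(data_dir) and looks_like_wal_dir(wal_dir):
        return "ok"
    if looks_like_wal_dir(data_dir) and looks_like_data_dir(wal_dir):
        return "swapped"
    # Covers empty or partially populated directories, including a node
    # that has never been started.
    return "unexpected"
```

A wrapper script could refuse to start etcd on an existing member unless `check_mounts` returns "ok"; a brand-new member would need to be exempted, since its directories are legitimately empty.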

@xiang90
Contributor

xiang90 commented Apr 26, 2018

we would have expected the cluster to not come up at all (which makes sense) but it came up in the bad state described above (serving inconsistent data depending on the node we were reaching)

You probably reused the old cluster token and the exact same configuration. etcd really cannot distinguish this setup from a fresh new cluster setup. The two new nodes might form a new cluster by themselves and disrupt the old one.

I also think the one "old" node has some issues, since its term went back to 2. Without more information, it is hard for us to debug this situation. If you can reproduce it reliably, we can look into it more.
