Force new cluster causes etcd to panic when learner members are added #12285

Closed
galal-hussein opened this issue Sep 11, 2020 · 0 comments · Fixed by #12288

etcd version: v3.4.13

Issue:

The issue was seen in k3s (k3s-io/k3s#2131) when --cluster-reset is passed, which internally passes --force-new-cluster to the etcd node. If the --force-new-cluster flag is passed to a member of a cluster that previously had learner members added, etcd panics with the following error:

{"level":"info","ts":"2020-09-11T16:10:24.260Z","caller":"etcdmain/etcd.go:134","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
{"level":"info","ts":"2020-09-11T16:10:24.260Z","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["http://0.0.0.0:2380"]}
{"level":"info","ts":"2020-09-11T16:10:24.262Z","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["http://0.0.0.0:2379","http://0.0.0.0:4001"]}
{"level":"info","ts":"2020-09-11T16:10:24.263Z","caller":"embed/etcd.go:302","msg":"starting an etcd server","etcd-version":"3.4.13","git-sha":"ae9734ed2","go-version":"go1.12.17","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2,"member-initialized":true,"name":"etcd0","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":true,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://172.31.35.28:2380"],"listen-peer-urls":["http://0.0.0.0:2380"],"advertise-client-urls":["http://172.31.35.28:2379","http://172.31.35.28:4001"],"listen-client-urls":["http://0.0.0.0:2379","http://0.0.0.0:4001"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
2020-09-11 16:10:24.263932 W | pkg/fileutil: check file permission: directory "/var/lib/etcd" exist, but the permission is "drwxr-xr-x". The recommended permission is "-rwx------" to prevent possible unprivileged access to the data.
{"level":"info","ts":"2020-09-11T16:10:24.264Z","caller":"etcdserver/backend.go:80","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"118.408µs"}
{"level":"panic","ts":"2020-09-11T16:10:24.265Z","caller":"etcdserver/raft.go:704","msg":"unknown ConfChange Type","type":"ConfChangeAddLearnerNode","stacktrace":"go.etcd.io/etcd/etcdserver.getIDs\n\t/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdserver/raft.go:704\ngo.etcd.io/etcd/etcdserver.restartAsStandaloneNode\n\t/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdserver/raft.go:611\ngo.etcd.io/etcd/etcdserver.NewServer\n\t/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdserver/server.go:482\ngo.etcd.io/etcd/embed.StartEtcd\n\t/tmp/etcd-release-3.4.13/etcd/release/etcd/embed/etcd.go:214\ngo.etcd.io/etcd/etcdmain.startEtcd\n\t/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/etcd.go:302\ngo.etcd.io/etcd/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/etcd.go:144\ngo.etcd.io/etcd/etcdmain.Main\n\t/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/main.go:46\nmain.main\n\t/tmp/etcd-release-3.4.13/etcd/release/etcd/main.go:28\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"}
panic: unknown ConfChange Type

goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00013f550, 0xc000205180, 0x1, 0x1)
	/home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/go.uber.org/zap@v1.10.0/zapcore/entry.go:229 +0x546
go.uber.org/zap.(*Logger).Panic(0xc00007b860, 0x10a7a53, 0x17, 0xc000205180, 0x1, 0x1)
	/home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/go.uber.org/zap@v1.10.0/logger.go:225 +0x7f
go.etcd.io/etcd/etcdserver.getIDs(0xc00007b860, 0x0, 0xc00012c900, 0x7, 0x8, 0x0, 0x0, 0x0)
	/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdserver/raft.go:704 +0x582
go.etcd.io/etcd/etcdserver.restartAsStandaloneNode(0x7ffc2b9e6ddd, 0x5, 0x0, 0x0, 0x0, 0x0, 0xc00019e700, 0x2, 0x2, 0xc000125380, ...)
	/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdserver/raft.go:611 +0x255
go.etcd.io/etcd/etcdserver.NewServer(0x7ffc2b9e6ddd, 0x5, 0x0, 0x0, 0x0, 0x0, 0xc00019e700, 0x2, 0x2, 0xc000125380, ...)
	/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdserver/server.go:482 +0x122f
go.etcd.io/etcd/embed.StartEtcd(0xc000202000, 0xc00035c000, 0x0, 0x0)
	/tmp/etcd-release-3.4.13/etcd/release/etcd/embed/etcd.go:214 +0x988
go.etcd.io/etcd/etcdmain.startEtcd(0xc000202000, 0x10963d6, 0x6, 0xc000125701, 0x2)
	/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/etcd.go:302 +0x40
go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
	/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/etcd.go:144 +0x2ef9
go.etcd.io/etcd/etcdmain.Main()
	/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/main.go:46 +0x38
main.main()
	/tmp/etcd-release-3.4.13/etcd/release/etcd/main.go:28 +0x20

To reproduce

  • start a 1 member etcd cluster, for example:
docker run -d -v /var/lib/etcd:/var/lib/etcd -p 4001:4001 -p 2380:2380 -p 2379:2379  --name etcd quay.io/coreos/etcd:v3.4.13 etcd  --name etcd0  -advertise-client-urls http://172.31.35.28:2379,http://172.31.35.28:4001  -listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001  -initial-advertise-peer-urls http://172.31.35.28:2380  -listen-peer-urls http://0.0.0.0:2380  -initial-cluster-token etcd-cluster-1  --data-dir /var/lib/etcd -initial-cluster etcd0=http://172.31.35.28:2380 -initial-cluster-state new
  • add another member as a learner
etcdctl member add --learner etcd1 --peer-urls="http://172.31.34.59:2380"
  • start the other etcd node
docker run -d -v /var/lib/etcd:/var/lib/etcd -p 4001:4001 -p 2380:2380 -p 2379:2379  --name etcd quay.io/coreos/etcd:v3.4.13 etcd  --name etcd1  -advertise-client-urls http://172.31.34.59:2379,http://172.31.34.59:4001  --data-dir /var/lib/etcd -listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001  -initial-advertise-peer-urls http://172.31.34.59:2380  -listen-peer-urls http://0.0.0.0:2380  -initial-cluster-token etcd-cluster-1  -initial-cluster etcd0=http://172.31.35.28:2380,etcd1=http://172.31.34.59:2380  -initial-cluster-state existing
  • promote the learner node
etcdctl member promote 9b83f879a67d44eb
  • stop both etcd0 and etcd1, then start etcd0 with --force-new-cluster
docker run -d -v /var/lib/etcd:/var/lib/etcd -p 4001:4001 -p 2380:2380 -p 2379:2379  --name etcd quay.io/coreos/etcd:v3.4.13 etcd  --name etcd0  -advertise-client-urls http://172.31.35.28:2379,http://172.31.35.28:4001  -listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001  -initial-advertise-peer-urls http://172.31.35.28:2380  -listen-peer-urls http://0.0.0.0:2380  -initial-cluster-token etcd-cluster-1  --data-dir /var/lib/etcd -initial-cluster etcd0=http://172.31.35.28:2380 -initial-cluster-state new --force-new-cluster --logger=zap --log-level=debug

The error appears to come from this line:

lg.Panic("unknown ConfChange Type", zap.String("type", cc.Type.String()))

It appears that the getIDs() function doesn't handle the config change type "ConfChangeAddLearnerNode" when getting the IDs of members in a given snapshot.
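
To illustrate the missing case, here is a minimal, self-contained sketch of the ID-collection logic. It is not the etcd source or the actual patch: `ConfChange`, `ConfChangeType`, and `memberIDs` are simplified, hypothetical stand-ins for the raftpb types and for getIDs(), and treating a learner add the same way as a regular add is only one plausible way to handle it.

```go
package main

import (
	"fmt"
	"sort"
)

// ConfChangeType is a local stand-in for raftpb.ConfChangeType (assumption:
// only the four change types relevant to this issue are modeled).
type ConfChangeType int

const (
	ConfChangeAddNode ConfChangeType = iota
	ConfChangeRemoveNode
	ConfChangeUpdateNode
	ConfChangeAddLearnerNode
)

// ConfChange is a simplified, hypothetical stand-in for raftpb.ConfChange.
type ConfChange struct {
	Type   ConfChangeType
	NodeID uint64
}

// memberIDs is a hypothetical analogue of getIDs(): it starts from the member
// IDs recorded in a snapshot and replays later config changes on top of them.
// Unlike the panicking code path in the report, it has an explicit case for
// ConfChangeAddLearnerNode and returns an error for unknown types.
func memberIDs(snapshotIDs []uint64, changes []ConfChange) ([]uint64, error) {
	ids := make(map[uint64]bool)
	for _, id := range snapshotIDs {
		ids[id] = true
	}
	for _, cc := range changes {
		switch cc.Type {
		case ConfChangeAddNode, ConfChangeAddLearnerNode:
			// Assumption: a learner is tracked like any other added member.
			ids[cc.NodeID] = true
		case ConfChangeRemoveNode:
			delete(ids, cc.NodeID)
		case ConfChangeUpdateNode:
			// An update does not change the set of member IDs.
		default:
			return nil, fmt.Errorf("unknown ConfChange type %d", cc.Type)
		}
	}
	out := make([]uint64, 0, len(ids))
	for id := range ids {
		out = append(out, id)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out, nil
}

func main() {
	// Mirrors the reproduction: member 1 exists, member 2 joins as a learner
	// and is later promoted (assumed here to be recorded as an add-node change).
	changes := []ConfChange{
		{Type: ConfChangeAddLearnerNode, NodeID: 2},
		{Type: ConfChangeAddNode, NodeID: 2},
	}
	ids, err := memberIDs([]uint64{1}, changes)
	fmt.Println(ids, err)
}
```

Returning an error instead of panicking is a choice made for this sketch only; the code in the report calls lg.Panic in the default branch.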

galal-hussein added a commit to galal-hussein/etcd that referenced this issue Sep 11, 2020
To fix a panic that happens when force-new-cluster flag is passed to
etcd node if the cluster had learner nodes added from before

Fixes etcd-io#12285
galal-hussein added a commit to galal-hussein/etcd that referenced this issue Sep 11, 2020
To fix a panic that happens when trying to get ids of etcd members in
force new cluster mode, the issue happen if the cluster previously had
etcd learner nodes added to the cluster

Fixes etcd-io#12285
galal-hussein added a commit to galal-hussein/etcd that referenced this issue Sep 14, 2020
To fix a panic that happens when trying to get ids of etcd members in
force new cluster mode, the issue happen if the cluster previously had
etcd learner nodes added to the cluster

Fixes etcd-io#12285