etcd crashes when started with empty data #10254

Closed
DylanBLE opened this issue Nov 12, 2018 · 6 comments


DylanBLE commented Nov 12, 2018

I have an etcd cluster with 3 members (etcd0, etcd1, etcd2). One member (etcd2) had a disk issue, so I removed its data and restarted it. I expected it to automatically sync data from the other members and work again. Unfortunately, it kept restarting with the following logs:

2018-11-12 17:02:10.939877 I | rafthttp: peer f4de961432718866 became active
2018-11-12 17:02:10.939909 I | rafthttp: established a TCP streaming connection with peer f4de961432718866 (stream MsgApp v2 reader)
2018-11-12 17:02:10.940014 I | rafthttp: established a TCP streaming connection with peer f4de961432718866 (stream Message reader)
2018-11-12 17:02:10.940105 I | raft: 73655dee2479774f [term: 1] received a MsgHeartbeat message with higher term from 947e503bbb1de38 [term: 3]
2018-11-12 17:02:10.940123 I | raft: 73655dee2479774f became follower at term 3
2018-11-12 17:02:10.940157 C | raft: tocommit(2111) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(2111) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

goroutine 97 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc42010d240, 0xeccda6, 0x5d, 0xc420194660, 0x2, 0x2)
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc420256380, 0x83f)
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/log.go:191 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc420390000, 0x8, 0x73655dee2479774f, 0x947e503bbb1de38, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:1100 +0x54
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.stepFollower(0xc420390000, 0x8, 0x73655dee2479774f, 0x947e503bbb1de38, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)

The start command is

etcd --name=etcd2 --trusted-ca-file=/certs/etcd/ca.crt --cert-file=/certs/etcd/server.cert --key-file=/certs/etcd/server.key --listen-client-urls=https://0.0.0.0:2679 --advertise-client-urls=https://192.168.0.103:2679 --data-dir=/data/etcd2 --initial-advertise-peer-urls=http://192.168.0.103:2680 --listen-peer-urls=http://0.0.0.0:2680 --initial-cluster-token=licence-etcd-cluster --initial-cluster=etcd0=http://192.168.0.103:2480,etcd1=http://192.168.0.103:2580,etcd2=http://192.168.0.103:2680 --initial-cluster-state=existing

etcd version

etcd Version: 3.1.11
Git SHA: 960f4604b
Go Version: go1.8.5
Go OS/Arch: linux/amd64
@DylanBLE
Author

I know a snapshot restore can fix this problem, but it requires restoring on all members.
https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/recovery.md#restoring-a-cluster
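
Roughly, that restore path looks like this (a sketch only, assuming etcdctl v3; etcd0's client endpoint, the TLS flag values and the snapshot file name are placeholders, the member names and peer URLs are the ones from my start command above):

# take a snapshot from a healthy member (etcd0), with TLS flags as needed
ETCDCTL_API=3 etcdctl --endpoints=<etcd0-client-url> \
  --cacert /certs/etcd/ca.crt --cert /certs/etcd/server.cert --key /certs/etcd/server.key \
  snapshot save snapshot.db

# restore it into a fresh data dir on every member, e.g. for etcd2:
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --name etcd2 \
  --initial-cluster etcd0=http://192.168.0.103:2480,etcd1=http://192.168.0.103:2580,etcd2=http://192.168.0.103:2680 \
  --initial-cluster-token licence-etcd-cluster \
  --initial-advertise-peer-urls http://192.168.0.103:2680 \
  --data-dir /data/etcd2.restored

# repeat the restore with etcd0's and etcd1's name, peer URL and data dir, then restart all three members

That is why I would rather avoid it: every member has to be stopped and restored, not just the broken one.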

@DylanBLE
Author

One workaround is to stop one more member (etcd1), which makes the cluster unhealthy (with only etcd0 left). Then I started etcd2 and then etcd1, and the cluster recovered.
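
In sequence (sketched assuming each member runs as a systemd unit named etcd0/etcd1/etcd2; the unit names are hypothetical, in my setup the members were started with the command shown above):

systemctl stop etcd1     # only etcd0 is left, quorum is lost and there is no leader
systemctl start etcd2    # the empty member joins; etcd0 wins the new election and sends it a snapshot
systemctl start etcd1    # quorum is back and the cluster recovers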

@hexfusion
Contributor

@DylanBLE the proper way to handle a single member failure would be to remove the member (etcd2) and then add a new replacement. etcd reports the errors above because the loss of the data-dir was unexpected.

ref docs: replace-a-failed-machine
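
Roughly (a sketch only, not the exact commands from the docs; it assumes etcdctl v3 pointed at a healthy member, with <healthy-client-url> and <etcd2-member-id> as placeholders and TLS flags added as needed):

# find the failed member's ID
ETCDCTL_API=3 etcdctl --endpoints=<healthy-client-url> member list

# remove the failed member, then register a replacement with the same name and peer URL
ETCDCTL_API=3 etcdctl --endpoints=<healthy-client-url> member remove <etcd2-member-id>
ETCDCTL_API=3 etcdctl --endpoints=<healthy-client-url> member add etcd2 \
  --peer-urls=http://192.168.0.103:2680

# wipe /data/etcd2, then start etcd2 with the same flags as before, keeping --initial-cluster-state=existing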

@DylanBLE
Author

@hexfusion I see. I also found the reason why my workaround works. With only etcd0 left, the cluster had no leader. When etcd2 (with no data) joined, etcd0 started an election and sent out a snapshot at the same time.
etcd0 log shows:

2018-11-12 11:18:03.394802 I | rafthttp: start to send database snapshot [index: 520199, to 70fdfbbf67a24ce9]...
2018-11-12 11:18:03.417956 I | etcdserver: wrote database snapshot out [total bytes: 45056]
2018-11-12 11:18:04.364519 I | rafthttp: database snapshot [index: 520199, to: 70fdfbbf67a24ce9] sent out successfully

etcd2 accepted the snapshot and recovered the data.
etcd2 log shows:

2018-11-12 19:18:02.593816 I | raft: 70fdfbbf67a24ce9 [term: 1] received a MsgVote message with higher term from 8afebeb48743e686 [term: 125]                                                              
2018-11-12 19:18:02.593845 I | raft: 70fdfbbf67a24ce9 became follower at term 125                                                                                                                          
2018-11-12 19:18:02.593875 I | raft: 70fdfbbf67a24ce9 [logterm: 0, index: 0, vote: 0] cast MsgVote for 8afebeb48743e686 [logterm: 117, index: 520199] at term 125                                          
2018-11-12 19:18:02.622861 I | raft: raft.node: 70fdfbbf67a24ce9 elected leader 8afebeb48743e686 at term 125       
2018-11-12 19:18:03.397449 I | rafthttp: receiving database snapshot [index:520199, from 8afebeb48743e686] ...                                                                                             
2018-11-12 19:18:04.363794 I | snap: saved database snapshot to disk [total bytes: 45056]                                                                                                                  
2018-11-12 19:18:04.363828 I | rafthttp: received and saved database snapshot [index: 520199, from: 8afebeb48743e686] successfully                                                                         
2018-11-12 19:18:04.364077 I | raft: 70fdfbbf67a24ce9 [commit: 0, lastindex: 0, lastterm: 0] starts to restore snapshot [index: 520199, term: 117]                                                         
2018-11-12 19:18:04.364129 I | raft: log [committed=0, applied=0, unstable.offset=1, len(unstable.Entries)=0] starts to restore snapshot [index: 520199, term: 117]                                        
2018-11-12 19:18:04.364153 I | raft: 70fdfbbf67a24ce9 restored progress of f4a74d3a67559a2 [next = 520200, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]                    
2018-11-12 19:18:04.364165 I | raft: 70fdfbbf67a24ce9 restored progress of 70fdfbbf67a24ce9 [next = 520200, match = 520199, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]               
2018-11-12 19:18:04.364175 I | raft: 70fdfbbf67a24ce9 restored progress of 8afebeb48743e686 [next = 520200, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]   
2018-11-12 19:18:04.364183 I | raft: 70fdfbbf67a24ce9 [commit: 520199] restored snapshot [index: 520199, term: 117]                                                                                        
2018-11-12 19:18:04.364356 I | etcdserver: applying snapshot at index 0...      

@hexfusion
Contributor

@DylanBLE Sounds like we can close this then. Please let us know if you have any other issues.


zhanw15 commented Nov 5, 2020

I know a snapshot restore can fix this problem, but it requires restoring on all members.
https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/recovery.md#restoring-a-cluster

I think it's a nice plan. ETCDCTL_API=3 etcdctl move-leader xxxx (to change the leader) can also work, something like the sketch below.
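
A hedged example of that (<leader-client-url> and <target-member-id> are placeholders, and move-leader has to reach the current leader's endpoint; add TLS flags as needed):

# pick the transferee's ID from the member list
ETCDCTL_API=3 etcdctl --endpoints=<leader-client-url> member list

# transfer leadership to that member
ETCDCTL_API=3 etcdctl --endpoints=<leader-client-url> move-leader <target-member-id>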
