etcd crashes when started with empty data #10254

Closed
DylanBLE opened this issue Nov 12, 2018 · 6 comments


DylanBLE commented Nov 12, 2018

I have an etcd cluster with 3 members (etcd0, etcd1, etcd2). One member (etcd2) had a disk issue, so I removed its data and restarted it. I expected it to automatically sync data from the other members and work again. Unfortunately, it kept restarting with the following logs:

2018-11-12 17:02:10.939877 I | rafthttp: peer f4de961432718866 became active
2018-11-12 17:02:10.939909 I | rafthttp: established a TCP streaming connection with peer f4de961432718866 (stream MsgApp v2 reader)
2018-11-12 17:02:10.940014 I | rafthttp: established a TCP streaming connection with peer f4de961432718866 (stream Message reader)
2018-11-12 17:02:10.940105 I | raft: 73655dee2479774f [term: 1] received a MsgHeartbeat message with higher term from 947e503bbb1de38 [term: 3]
2018-11-12 17:02:10.940123 I | raft: 73655dee2479774f became follower at term 3
2018-11-12 17:02:10.940157 C | raft: tocommit(2111) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(2111) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

goroutine 97 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc42010d240, 0xeccda6, 0x5d, 0xc420194660, 0x2, 0x2)
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc420256380, 0x83f)
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/log.go:191 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc420390000, 0x8, 0x73655dee2479774f, 0x947e503bbb1de38, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:1100 +0x54
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.stepFollower(0xc420390000, 0x8, 0x73655dee2479774f, 0x947e503bbb1de38, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)

The start command is

etcd --name=etcd2 --trusted-ca-file=/certs/etcd/ca.crt --cert-file=/certs/etcd/server.cert --key-file=/certs/etcd/server.key --listen-client-urls=https://0.0.0.0:2679 --advertise-client-urls=https://192.168.0.103:2679 --data-dir=/data/etcd2 --initial-advertise-peer-urls=http://192.168.0.103:2680 --listen-peer-urls=http://0.0.0.0:2680 --initial-cluster-token=licence-etcd-cluster --initial-cluster=etcd0=http://192.168.0.103:2480,etcd1=http://192.168.0.103:2580,etcd2=http://192.168.0.103:2680 --initial-cluster-state=existing

etcd version

etcd Version: 3.1.11
Git SHA: 960f4604b
Go Version: go1.8.5
Go OS/Arch: linux/amd64
@DylanBLE
Author

I know a snapshot restore can fix this problem, but it requires restoring on all members.
https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/recovery.md#restoring-a-cluster
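
Roughly, that restore path looks like this (a sketch only, assuming etcdctl v3; etcd0's client endpoint, the TLS flag values and the snapshot file name are placeholders, the member names and peer URLs are the ones from my start command above):

# take a snapshot from a healthy member (etcd0), with TLS flags as needed
ETCDCTL_API=3 etcdctl --endpoints=<etcd0-client-url> \
  --cacert /certs/etcd/ca.crt --cert /certs/etcd/server.cert --key /certs/etcd/server.key \
  snapshot save snapshot.db

# restore it into a fresh data dir on every member, e.g. for etcd2:
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --name etcd2 \
  --initial-cluster etcd0=http://192.168.0.103:2480,etcd1=http://192.168.0.103:2580,etcd2=http://192.168.0.103:2680 \
  --initial-cluster-token licence-etcd-cluster \
  --initial-advertise-peer-urls http://192.168.0.103:2680 \
  --data-dir /data/etcd2.restored

# repeat the restore with etcd0's and etcd1's name, peer URL and data dir, then restart all three members

That is why I would rather avoid it: every member has to be stopped and restored, not just the broken one.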

@DylanBLE
Author

One workaround is to stop one more member (etcd1), which makes the cluster unhealthy (with only etcd0 left). Then I started etcd2 and then etcd1, and the cluster recovered.
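
In sequence (sketched assuming each member runs as a systemd unit named etcd0/etcd1/etcd2; the unit names are hypothetical, in my setup the members were started with the command shown above):

systemctl stop etcd1     # only etcd0 is left, quorum is lost and there is no leader
systemctl start etcd2    # the empty member joins; etcd0 wins the new election and sends it a snapshot
systemctl start etcd1    # quorum is back and the cluster recovers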

@hexfusion
Contributor

@DylanBLE the proper way to handle a single member failure would be to remove the member (etcd2) and then add a new replacement. etcd reports the errors above because the loss of the data-dir was unexpected.

ref docs: replace-a-failed-machine
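
Roughly (a sketch only, not the exact commands from the docs; it assumes etcdctl v3 pointed at a healthy member, with <healthy-client-url> and <etcd2-member-id> as placeholders and TLS flags added as needed):

# find the failed member's ID
ETCDCTL_API=3 etcdctl --endpoints=<healthy-client-url> member list

# remove the failed member, then register a replacement with the same name and peer URL
ETCDCTL_API=3 etcdctl --endpoints=<healthy-client-url> member remove <etcd2-member-id>
ETCDCTL_API=3 etcdctl --endpoints=<healthy-client-url> member add etcd2 \
  --peer-urls=http://192.168.0.103:2680

# wipe /data/etcd2, then start etcd2 with the same flags as before, keeping --initial-cluster-state=existing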

@DylanBLE
Author

@hexfusion I see. I also found the reason why my workaround works. With only etcd0 left, the cluster had no leader. When etcd2 (with no data) joined, etcd0 started an election and sent out a snapshot at the same time.
etcd0 log shows:

2018-11-12 11:18:03.394802 I | rafthttp: start to send database snapshot [index: 520199, to 70fdfbbf67a24ce9]...
2018-11-12 11:18:03.417956 I | etcdserver: wrote database snapshot out [total bytes: 45056]
2018-11-12 11:18:04.364519 I | rafthttp: database snapshot [index: 520199, to: 70fdfbbf67a24ce9] sent out successfully

etcd2 accepted the snapshot and recovered the data.
etcd2 log shows:

2018-11-12 19:18:02.593816 I | raft: 70fdfbbf67a24ce9 [term: 1] received a MsgVote message with higher term from 8afebeb48743e686 [term: 125]                                                              
2018-11-12 19:18:02.593845 I | raft: 70fdfbbf67a24ce9 became follower at term 125                                                                                                                          
2018-11-12 19:18:02.593875 I | raft: 70fdfbbf67a24ce9 [logterm: 0, index: 0, vote: 0] cast MsgVote for 8afebeb48743e686 [logterm: 117, index: 520199] at term 125                                          
2018-11-12 19:18:02.622861 I | raft: raft.node: 70fdfbbf67a24ce9 elected leader 8afebeb48743e686 at term 125       
2018-11-12 19:18:03.397449 I | rafthttp: receiving database snapshot [index:520199, from 8afebeb48743e686] ...                                                                                             
2018-11-12 19:18:04.363794 I | snap: saved database snapshot to disk [total bytes: 45056]                                                                                                                  
2018-11-12 19:18:04.363828 I | rafthttp: received and saved database snapshot [index: 520199, from: 8afebeb48743e686] successfully                                                                         
2018-11-12 19:18:04.364077 I | raft: 70fdfbbf67a24ce9 [commit: 0, lastindex: 0, lastterm: 0] starts to restore snapshot [index: 520199, term: 117]                                                         
2018-11-12 19:18:04.364129 I | raft: log [committed=0, applied=0, unstable.offset=1, len(unstable.Entries)=0] starts to restore snapshot [index: 520199, term: 117]                                        
2018-11-12 19:18:04.364153 I | raft: 70fdfbbf67a24ce9 restored progress of f4a74d3a67559a2 [next = 520200, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]                    
2018-11-12 19:18:04.364165 I | raft: 70fdfbbf67a24ce9 restored progress of 70fdfbbf67a24ce9 [next = 520200, match = 520199, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]               
2018-11-12 19:18:04.364175 I | raft: 70fdfbbf67a24ce9 restored progress of 8afebeb48743e686 [next = 520200, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]   
2018-11-12 19:18:04.364183 I | raft: 70fdfbbf67a24ce9 [commit: 520199] restored snapshot [index: 520199, term: 117]                                                                                        
2018-11-12 19:18:04.364356 I | etcdserver: applying snapshot at index 0...      

@hexfusion
Contributor

@DylanBLE Sounds like we can close this then. Please let us know if you have any other issues.


zhanw15 commented Nov 5, 2020

I know a snapshot restore can fix this problem, but it requires restoring on all members.
https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/recovery.md#restoring-a-cluster

I think it's a nice plan. ETCDCTL_API=3 etcdctl move-leader xxxx (to change the leader) can also work, something like the sketch below.
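
A hedged example of that (<leader-client-url> and <target-member-id> are placeholders, and move-leader has to reach the current leader's endpoint; add TLS flags as needed):

# pick the transferee's ID from the member list
ETCDCTL_API=3 etcdctl --endpoints=<leader-client-url> member list

# transfer leadership to that member
ETCDCTL_API=3 etcdctl --endpoints=<leader-client-url> move-leader <target-member-id>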
