This repository has been archived by the owner on Feb 9, 2024. It is now read-only.

etcd panics with corrupted raft log errors #1645

Closed
r0mant opened this issue Jun 1, 2020 · 3 comments
Labels
kind/bug · port/5.5 · port/6.1 · port/7.0 · priority/1 · support-load

Comments

r0mant (Contributor) commented Jun 1, 2020

Describe the bug

We've seen a few times where etcd fails to start on a master node with the following panic:

May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: tocommit(65072) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: panic: tocommit(65072) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: goroutine 135 [running]:
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc0001deb40, 0xfd977a, 0x5d, 0xc0001a82c0, 0x2, 0x2)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:83 +0x135
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc0001fa770, 0xfe30)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/log.go:191 +0x131
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc00028a500, 0x8, 0x3c157c6302242672, 0x6c1ddcb80fda6234, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/raft.go:1195 +0x54
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/raft.stepFollower(0xc00028a500, 0x8, 0x3c157c6302242672, 0x6c1ddcb80fda6234, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/raft.go:1141 +0x404
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/raft.(*raft).Step(0xc00028a500, 0x8, 0x3c157c6302242672, 0x6c1ddcb80fda6234, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/raft.go:869 +0x1376
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/raft.(*node).run(0xc000448060, 0xc00028a500)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/node.go:323 +0xf40
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: created by github.com/coreos/etcd/raft.RestartNode
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/node.go:223 +0x31c
May 21 16:29:42 dmitri-centos-node-2 systemd[1]: etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
May 21 16:29:42 dmitri-centos-node-2 systemd[1]: etcd.service: Failed with result 'exit-code'.
May 21 16:29:42 dmitri-centos-node-2 systemd[1]: Failed to start Etcd Service.
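
For context, the panic comes from the commitTo check in raft/log.go (raft/log.go:191 in the trace above): the leader's heartbeat carries a commit index (65072) that is beyond the follower's last log index (0, i.e. the follower's on-disk log is empty). A paraphrased sketch of that check, which may differ slightly between etcd versions:

```go
func (l *raftLog) commitTo(tocommit uint64) {
	// never decrease commit
	if l.committed < tocommit {
		if l.lastIndex() < tocommit {
			l.logger.Panicf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?", tocommit, l.lastIndex())
		}
		l.committed = tocommit
	}
}
```

So hitting this panic means the local raft log ends before an index the leader already considers committed.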

The only way to get out of this situation is to rebuild the faulty etcd member, as described in Red Hat's KB article: https://access.redhat.com/solutions/4145881.
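
The same procedure can also be driven from etcd's Go client instead of etcdctl. A minimal sketch, assuming a healthy endpoint is reachable; the endpoint, member name, and peer URL below are placeholders, and the faulty member's data directory still has to be wiped on disk before it rejoins:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Connect through a healthy member (placeholder endpoint).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://10.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Look up the faulty member's ID by name (placeholder name).
	resp, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	var faultyID uint64
	for _, m := range resp.Members {
		if m.Name == "faulty-node" {
			faultyID = m.ID
		}
	}

	// Remove the faulty member from the cluster...
	if _, err := cli.MemberRemove(ctx, faultyID); err != nil {
		log.Fatal(err)
	}
	// ...then, after wiping its data directory out-of-band, re-add it
	// with its peer URL (placeholder) so it rejoins with empty state
	// and resyncs a snapshot from the leader.
	if _, err := cli.MemberAdd(ctx, []string{"https://10.0.0.2:2380"}); err != nil {
		log.Fatal(err)
	}
}
```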

The issue was observed on etcd v3.3.11 (Gravity 5.5.40) and v3.3.20.

There are also related issues in the etcd GitHub issue tracker, though none of them seems to provide a resolution or RCA, e.g. etcd-io/etcd#10951 or etcd-io/etcd#10817 (which suggests the issue can be triggered by a reboot).

To Reproduce

Unclear; it's intermittent and seems to happen under a couple of circumstances:

  • In internal testing this was seen when adding a new node to the cluster: etcd panicked on the newly joined node.
  • In the field, possibly after the node was restarted?

We should try to reproduce it on 5.5.40.

Expected behavior

Etcd doesn't panic.

Logs

Environment (please complete the following information):

  • OS [e.g. Redhat 7.4]:
  • Gravity [e.g. 5.5.4]: 5.5.40
  • Platform [e.g. Vmware, AWS]:

Additional context

r0mant added the kind/bug, support-load, port/5.5, port/6.1, priority/1, and port/7.0 labels on Jun 1, 2020
knisbet (Contributor) commented Jun 2, 2020

I did a cursory glance at some related issues, and the ones I've found suggest that this can occur when the etcd files are legitimately missing or corrupted. So I think we'd need to take a closer look at the systems we've seen this occur on to check the state of the filesystem. It also used to be a big issue that certain hardware would lie about fsyncs being completed, so that's something else to consider, since one customer had it occur multiple times.
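
To illustrate the assumption that breaks in the lying-fsync case (a minimal sketch, not etcd's actual WAL code): etcd fsyncs WAL appends and trusts that a successful Sync means the entries reached stable storage. Hardware that acknowledges the sync without persisting can leave the log truncated after a crash or reboot, which is exactly the lastIndex-behind-commit state this panic reports.

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Append a record to a WAL-like file (placeholder name).
	f, err := os.OpenFile("member.wal", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0600)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := f.Write([]byte("raft entry\n")); err != nil {
		log.Fatal(err)
	}
	// Raft's persistence guarantee rests on Sync actually flushing to
	// stable media; storage that lies here silently voids that guarantee.
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
}
```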

References:

@hanfengyizu

I modified etcd/raft/raft.go to just ignore the panic, and that fixed the problem, but I'm not sure whether this is OK. @r0mant

```go
func (r *raft) handleHeartbeat(m pb.Message) {
	// Recover from (and discard) the "raft log corrupted" panic
	// raised by commitTo, so the process keeps running.
	defer func() {
		if e := recover(); e != nil {
			return
		}
	}()
	r.raftLog.commitTo(m.Commit)
	r.send(pb.Message{To: m.From, Type: pb.MsgHeartbeatResp, Context: m.Context})
}
```

knisbet (Contributor) commented Mar 19, 2021

Well, that panic AFAIK is telling you etcd is about to lose data and that some sort of file corruption has occurred, so recovering from and ignoring the panic is at best likely to lose data and at worst leaves things in an inconsistent state that spirals further out of control. I really wouldn't try to make a core logic change like that without talking to the etcd developers.
