This repository has been archived by the owner on Feb 9, 2024. It is now read-only.

etcd panics with corrupted raft log errors #1645

Closed
r0mant opened this issue Jun 1, 2020 · 3 comments
Labels
kind/bug · port/5.5 · port/6.1 · port/7.0 · priority/1 · support-load

Comments

r0mant (Contributor) commented Jun 1, 2020

Describe the bug

We've seen a few times where etcd fails to start on a master node with the following panic:

May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: tocommit(65072) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: panic: tocommit(65072) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: goroutine 135 [running]:
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc0001deb40, 0xfd977a, 0x5d, 0xc0001a82c0, 0x2, 0x2)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:83 +0x135
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc0001fa770, 0xfe30)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/log.go:191 +0x131
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc00028a500, 0x8, 0x3c157c6302242672, 0x6c1ddcb80fda6234, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/raft.go:1195 +0x54
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/raft.stepFollower(0xc00028a500, 0x8, 0x3c157c6302242672, 0x6c1ddcb80fda6234, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/raft.go:1141 +0x404
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/raft.(*raft).Step(0xc00028a500, 0x8, 0x3c157c6302242672, 0x6c1ddcb80fda6234, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/raft.go:869 +0x1376
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/raft.(*node).run(0xc000448060, 0xc00028a500)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/node.go:323 +0xf40
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: created by github.com/coreos/etcd/raft.RestartNode
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/node.go:223 +0x31c
May 21 16:29:42 dmitri-centos-node-2 systemd[1]: etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
May 21 16:29:42 dmitri-centos-node-2 systemd[1]: etcd.service: Failed with result 'exit-code'.
May 21 16:29:42 dmitri-centos-node-2 systemd[1]: Failed to start Etcd Service.
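
For context, the panic comes from the commitTo check in raft/log.go (raft/log.go:191 in the trace above): the leader's heartbeat carries a commit index (65072) that is beyond the follower's last log index (0, i.e. the follower's on-disk log is empty). A paraphrased sketch of that check, which may differ slightly between etcd versions:

```go
func (l *raftLog) commitTo(tocommit uint64) {
	// never decrease commit
	if l.committed < tocommit {
		if l.lastIndex() < tocommit {
			l.logger.Panicf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?", tocommit, l.lastIndex())
		}
		l.committed = tocommit
	}
}
```

So hitting this panic means the local raft log ends before an index the leader already considers committed.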

The only way to get out of this situation is to rebuild the faulty etcd member, as described in Red Hat's KB article: https://access.redhat.com/solutions/4145881.
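
The same procedure can also be driven from etcd's Go client instead of etcdctl. A minimal sketch, assuming a healthy endpoint is reachable; the endpoint, member name, and peer URL below are placeholders, and the faulty member's data directory still has to be wiped on disk before it rejoins:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Connect through a healthy member (placeholder endpoint).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://10.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Look up the faulty member's ID by name (placeholder name).
	resp, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	var faultyID uint64
	for _, m := range resp.Members {
		if m.Name == "faulty-node" {
			faultyID = m.ID
		}
	}

	// Remove the faulty member from the cluster...
	if _, err := cli.MemberRemove(ctx, faultyID); err != nil {
		log.Fatal(err)
	}
	// ...then, after wiping its data directory out-of-band, re-add it
	// with its peer URL (placeholder) so it rejoins with empty state
	// and resyncs a snapshot from the leader.
	if _, err := cli.MemberAdd(ctx, []string{"https://10.0.0.2:2380"}); err != nil {
		log.Fatal(err)
	}
}
```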

The issue was observed on etcd v3.3.11 (Gravity 5.5.40) and v3.3.20.

There are also related issues in the etcd GitHub issue tracker, though none of them seems to provide a resolution or RCA, e.g. etcd-io/etcd#10951 or etcd-io/etcd#10817 (which suggests the issue can be triggered by a reboot).

To Reproduce

Unclear; it's intermittent and seems to happen under a couple of circumstances:

  • In internal testing this was seen when adding a new node to the cluster: etcd panicked on the newly joined node.
  • In the field, possibly after the node was restarted?

We should try to reproduce it on 5.5.40.

Expected behavior

Etcd doesn't panic.

Logs

Environment (please complete the following information):

  • OS [e.g. Redhat 7.4]:
  • Gravity [e.g. 5.5.4]: 5.5.40
  • Platform [e.g. Vmware, AWS]:

Additional context

r0mant added the kind/bug, support-load, port/5.5, port/6.1, priority/1, and port/7.0 labels on Jun 1, 2020
knisbet (Contributor) commented Jun 2, 2020

I did a cursory glance at some related issues, and the ones I've found suggest that this can occur when the etcd files are legitimately missing or corrupted. So I think we'd need to take a closer look at the systems we've seen this occur on to check the state of the filesystem. It also used to be a big issue that certain hardware would lie about fsyncs being completed, so that's something else to consider, since one customer had it occur multiple times.
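
To illustrate the assumption that breaks in the lying-fsync case (a minimal sketch, not etcd's actual WAL code): etcd fsyncs WAL appends and trusts that a successful Sync means the entries reached stable storage. Hardware that acknowledges the sync without persisting can leave the log truncated after a crash or reboot, which is exactly the lastIndex-behind-commit state this panic reports.

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Append a record to a WAL-like file (placeholder name).
	f, err := os.OpenFile("member.wal", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0600)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := f.Write([]byte("raft entry\n")); err != nil {
		log.Fatal(err)
	}
	// Raft's persistence guarantee rests on Sync actually flushing to
	// stable media; storage that lies here silently voids that guarantee.
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
}
```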

References:

@hanfengyizu

I modified etcd/raft/raft.go to just ignore the panic, and that fixed the problem, but I'm not sure whether this is OK. @r0mant

```go
func (r *raft) handleHeartbeat(m pb.Message) {
	// Recover from (and discard) the "raft log corrupted" panic
	// raised by commitTo, so the process keeps running.
	defer func() {
		if e := recover(); e != nil {
			return
		}
	}()
	r.raftLog.commitTo(m.Commit)
	r.send(pb.Message{To: m.From, Type: pb.MsgHeartbeatResp, Context: m.Context})
}
```

knisbet (Contributor) commented Mar 19, 2021

Well, that panic AFAIK is telling you etcd is about to lose data and that some sort of file corruption has occurred, so recovering from and ignoring the panic is at best likely to lose data and at worst leaves things in an inconsistent state that spirals further out of control. I really wouldn't try to make a core logic change like that without talking to the etcd developers.
