Possible WAL corruption with loss of un-fsynced writes: max entry size limit exceeded #14098

Closed
aphyr opened this issue Jun 7, 2022 · 10 comments

Comments

@aphyr

aphyr commented Jun 7, 2022

What happened?

The lazyfs filesystem lets us simulate the effects of a power failure by losing writes which were not explicitly fsync'ed to disk. When we run etcd 3.5.3 on lazyfs, killing etcd and then losing un-fsynced writes can reliably put etcd into an unbootable state. Every time we try to start the node, it complains:

{"level":"fatal","ts":"2022-06-07T15:08:26.505-0400","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}

We're still sanding bugs off of lazyfs, so it's possible this might be an issue in the filesystem itself. That said, it might also point to a problem with how etcd writes WAL files, so I'd like to check and see if this looks plausible to y'all. I know there have been some issues with data file corruption on process crash in the past; this approach might help find more bugs like that!

I've attached a full test run from Jepsen, which includes tarballs of the data directories for each node. Take a look at n1/ as an example: 20220607T150758.000-0400.zip.

This happens both with and without --experimental-initial-corrupt-check.

What did you expect to happen?

I expect that etcd ought to start up without crashing, even if we lose un-fsynced writes.

How can we reproduce it (as minimally and precisely as possible)?

Check out https://github.com/jepsen-io/etcd at adfc820826a947625c94d836b4017b4eaac7064d, and run:

lein run test -w append --concurrency 2n --time-limit 300 --rate 500 --lazyfs --nemesis kill

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.3
Git SHA: 0452feec7
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.3
API version: 3.5

Etcd configuration (command line flags or environment variables)

etcd --enable-v2 --log-outputs stderr --logger zap --name n1 --listen-peer-urls http://192.168.122.101:2380 --listen-client-urls http://192.168.122.101:2379 --advertise-client-urls http://192.168.122.101:2379 --initial-cluster-state new --initial-advertise-peer-urls http://192.168.122.101:2380 --initial-cluster n1=http://192.168.122.101:2380,n2=http://192.168.122.102:2380,n3=http://192.168.122.103:2380,n4=http://192.168.122.104:2380,n5=http://192.168.122.105:2380 --snapshot-count 100 --experimental-initial-corrupt-check

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

root@n1:/opt/etcd# ./etcdctl member list -w table --endpoints=n1:2379
+------------------+---------+------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  | NAME |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+------+-----------------------------+-----------------------------+------------+
| 1153c9690d2b2284 | started |   n3 | http://192.168.122.103:2380 | http://192.168.122.103:2379 |      false |
| 4824313a421b2502 | started |   n5 | http://192.168.122.105:2380 | http://192.168.122.105:2379 |      false |
| 4d6e27d122507e9c | started |   n4 | http://192.168.122.104:2380 | http://192.168.122.104:2379 |      false |
| a1ffd5acd6a88a6a | started |   n2 | http://192.168.122.102:2380 | http://192.168.122.102:2379 |      false |
| afa39e55dee6dc2e | started |   n1 | http://192.168.122.101:2380 | http://192.168.122.101:2379 |      false |
+------------------+---------+------+-----------------------------+-----------------------------+------------+

root@n1:/opt/etcd# ./etcdctl --endpoints=http://192.168.122.105:2379,http://192.168.122.103:2379,http://192.168.122.104:2379,http://192.168.122.102:2379,http://192.168.122.101:2379 endpoint status -w table
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.122.105:2379 | 4824313a421b2502 |   3.5.3 |  4.2 MB |      true |      false |        20 |      60226 |              60226 |        |
| http://192.168.122.103:2379 | 1153c9690d2b2284 |   3.5.3 |  4.2 MB |     false |      false |        20 |      60226 |              60226 |        |
| http://192.168.122.104:2379 | 4d6e27d122507e9c |   3.5.3 |  4.2 MB |     false |      false |        20 |      60228 |              60228 |        |
| http://192.168.122.102:2379 | a1ffd5acd6a88a6a |   3.5.3 |  4.2 MB |     false |      false |        20 |      60228 |              60228 |        |
| http://192.168.122.101:2379 | afa39e55dee6dc2e |   3.5.3 |  4.2 MB |     false |      false |        20 |      60228 |              60228 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Relevant log output

{"level":"fatal","ts":"2022-06-07T15:13:18.464-0400","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
@aphyr aphyr added the type/bug label Jun 7, 2022
@ahrtr
Member

ahrtr commented Jun 12, 2022

Is this issue the same as #14102?

@aphyr
Author

aphyr commented Jun 13, 2022

Possibly! However, since they're different kinds of faults and yield different etcd crashes, I suspect they're different issues. I also wouldn't be surprised if #14102 turns out to encompass a half-dozen different issues, just based on the number of distinct ways I've seen it fail so far.

@ahrtr
Member

ahrtr commented Jun 14, 2022

This is also related to #14025.

@ahrtr
Member

ahrtr commented Jun 17, 2022

One WAL entry's size is 13563782407139376 bytes (about 13,563 TB), which is obviously not correct; see the log below.

2022/06/17 16:12:16 Failed reading WAL: wal: max entry size limit exceeded, recBytes: 13563782407139376, fileSize(64000000) - offset(196120) - padBytes(0) = entryLimit(63803880)

There are two possible reasons:

  1. It's a bug in the filesystem, which persisted a wrong value for the WAL entry size;
  2. It's caused by a bit flip somehow.
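
For readers unfamiliar with this error: the guard that trips here compares the record length the decoder just read against the bytes remaining in the 64 MB segment file, and bails out if the declared length cannot possibly fit. A minimal sketch of that check (illustrative names and signature, not etcd's actual decoder code), reproducing the log line above:

package main

import (
	"errors"
	"fmt"
)

// errMaxWALEntrySizeLimitExceeded mirrors the error text seen in the logs.
var errMaxWALEntrySizeLimitExceeded = errors.New("wal: max entry size limit exceeded")

// checkRecordSize rejects a record whose declared length cannot fit in the
// remainder of the segment file; a corrupted length field trips this guard.
func checkRecordSize(recBytes, fileSize, offset, padBytes int64) error {
	entryLimit := fileSize - offset - padBytes
	if recBytes > entryLimit {
		return fmt.Errorf("%w, recBytes: %d, fileSize(%d) - offset(%d) - padBytes(%d) = entryLimit(%d)",
			errMaxWALEntrySizeLimitExceeded, recBytes, fileSize, offset, padBytes, entryLimit)
	}
	return nil
}

func main() {
	// Values taken from the log line above.
	if err := checkRecordSize(13563782407139376, 64000000, 196120, 0); err != nil {
		fmt.Println("Failed reading WAL:", err)
	}
}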

@ahrtr
Member

ahrtr commented Jun 17, 2022

I think the best thing to do for now is to let etcd fail to start in this situation (data files corrupted, including the WAL file), and that is exactly the current behavior.

In the future, we may deliver a solution to recover the data files from a point in time.

@aphyr
Author

aphyr commented Jun 21, 2022

So we've traced this behavior to (we think) an issue with lazyfs: truncation filled files with ASCII '0' characters (0x30) rather than 0x00. Etcd's WAL reader scanned for 0x00 to determine the end of the file, and in this case got 0x30 and... maybe interpreted those as part of the size field?
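
A quick back-of-the-envelope check supports that reading. Assuming the WAL length field is a little-endian int64 whose low 56 bits hold the record length (an assumption about the framing on my part, not verified against the decoder), a run of 0x30 bytes decodes to exactly the recBytes value from the error above:

package main

import "fmt"

func main() {
	// Eight bytes of ASCII '0' (0x30) read as a little-endian 64-bit length field.
	raw := uint64(0x3030303030303030)
	// Keep only the low 56 bits, assuming the top byte encodes padding.
	recBytes := raw & 0x00FFFFFFFFFFFFFF
	fmt.Println(recBytes) // 13563782407139376 -- the recBytes value from the error
	// An all-0x00 tail decodes to length 0, which the reader treats as the end of the WAL.
}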

I'm not exactly sure what the correct behavior here is, filesystem-wise (perhaps @devzizu could chime in?), but for the time being we've replaced truncated bytes with 0x00, and that seems to have eliminated this particular crash.

Instead, we get a new kind of crash! Here's an example:

{"level":"panic","ts":"2022-06-21T13:24:29.359-0400","logger":"raft","caller":"etcdserver/zap_raft.go:101","msg":"tocommit(56444) is out of range [lastIndex(2894)]. Was the raft log corrupted, truncated, or lost?","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/zap_raft.go:101\ngo.etcd.io/etcd/raft/v3.(*raftLog).commitTo\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/log.go:237\ngo.etcd.io/etcd/raft/v3.(*raft).handleHeartbeat\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1508\ngo.etcd.io/etcd/raft/v3.stepFollower\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1434\ngo.etcd.io/etcd/raft/v3.(*raft).Step\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:975\ngo.etcd.io/etcd/raft/v3.(*node).run\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/node.go:356"}
{"level":"info","ts":"2022-06-21T13:24:29.359-0400","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"a1ffd5acd6a88a6a"}
panic: tocommit(56444) is out of range [lastIndex(2894)]. Was the raft log corrupted, truncated, or lost?

goroutine 167 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00021e480, 0x0, 0x0, 0x0)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*SugaredLogger).log(0xc00012c028, 0x4, 0x124ecb9, 0x5d, 0xc0012a4100, 0x2, 0x2, 0x0, 0x0, 0x0)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/sugar.go:227 +0x111
go.uber.org/zap.(*SugaredLogger).Panicf(...)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/sugar.go:159
go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf(0xc0000dc090, 0x124ecb9, 0x5d, 0xc0012a4100, 0x2, 0x2)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/zap_raft.go:101 +0x7d
go.etcd.io/etcd/raft/v3.(*raftLog).commitTo(0xc0001f6000, 0xdc7c)
	/go/src/go.etcd.io/etcd/release/etcd/raft/log.go:237 +0x135
go.etcd.io/etcd/raft/v3.(*raft).handleHeartbeat(0xc000716f20, 0x8, 0x4824313a421b2502, 0xa1ffd5acd6a88a6a, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1508 +0x54
go.etcd.io/etcd/raft/v3.stepFollower(0xc000716f20, 0x8, 0x4824313a421b2502, 0xa1ffd5acd6a88a6a, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1434 +0x478
go.etcd.io/etcd/raft/v3.(*raft).Step(0xc000716f20, 0x8, 0x4824313a421b2502, 0xa1ffd5acd6a88a6a, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:975 +0xa55
go.etcd.io/etcd/raft/v3.(*node).run(0xc000238180)
	/go/src/go.etcd.io/etcd/release/etcd/raft/node.go:356 +0x798
created by go.etcd.io/etcd/raft/v3.RestartNode
	/go/src/go.etcd.io/etcd/release/etcd/raft/node.go:244 +0x330
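
For anyone hitting the same panic: the leader's heartbeat asks this follower to consider entries committed up to index 56444, but after the lost un-fsynced writes the follower's log only reaches index 2894, so raft's commit guard panics rather than silently accept a hole in the log. A rough sketch of that guard (reconstructed from the panic message, not necessarily the raft package's exact code):

package main

import "fmt"

// commitTo advances the commit index, panicking if asked to commit past the
// last entry actually present in the local log.
func commitTo(commit, lastIndex, tocommit uint64) uint64 {
	if tocommit > commit {
		if lastIndex < tocommit {
			panic(fmt.Sprintf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?", tocommit, lastIndex))
		}
		commit = tocommit
	}
	return commit
}

func main() {
	// Values from the panic above: the node lost un-fsynced entries, so its
	// log ends far before the commit index the leader advertises.
	commitTo(0, 2894, 56444)
}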

@devzizu

devzizu commented Jun 22, 2022

Hey!

I'm not exactly sure what the correct behavior here is, filesystem-wise (perhaps @devzizu could chime in?), but for the time being we've replaced truncated bytes with 0x00, and that seems to have eliminated this particular crash.

That's right, any filesystem should return null bytes (0x00) on read operations for the truncated region of the file (when the truncate increases the size). My apologies for the LazyFS bug: as @aphyr said, I was writing 0x30 (ASCII '0') instead of 0x00, because that helped me with debugging at the time. I also thought it wouldn't be a huge deal, because I assumed applications relied on some kind of maximum readable offset.
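
For illustration, here's a tiny Go program demonstrating the semantics described above: extending a file with truncate must expose the new region as NUL bytes.

package main

import (
	"fmt"
	"os"
)

func main() {
	// Create a scratch file, write four bytes, then extend it to eight bytes.
	f, err := os.CreateTemp("", "truncate-demo")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	f.WriteString("data")
	f.Truncate(8)

	// The extended region must read back as 0x00, never 0x30.
	buf := make([]byte, 8)
	f.ReadAt(buf, 0)
	fmt.Printf("% x\n", buf) // 64 61 74 61 00 00 00 00
}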

Feel free to report bugs or ask me anything about LazyFS, it will be a pleasure to help!

@stale

stale bot commented Sep 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 21, 2022
@serathius
Member

Closing per #14098 (comment)

@SELVAKUMARU

SELVAKUMARU commented Oct 24, 2023

error starting etcd: wal: max entry size limit in sensu-go

What is the solution for this? Thank you.
