Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WAL not cleaned up on failed creation #10688

Closed
joshcc3 opened this issue Apr 28, 2019 · 0 comments · Fixed by #10689
Closed

WAL not cleaned up on failed creation #10688

joshcc3 opened this issue Apr 28, 2019 · 0 comments · Fixed by #10689

Comments

@joshcc3
Copy link
Contributor

joshcc3 commented Apr 28, 2019

I ran etcd (default config) with the data-dir located in a bind mounted directory in a container running on windows. This Fsync is not supported and on the first run it Panics without cleaning up the WAL dir. As a consequence, on the subsequent run the node enters 'restart' mode however since the peers haven't been added it isn't able to make 'progress' and choose a leader and times out trying to publish its information into the cluster.
While I dont think etcd needs to support this type of storage, I do think that getting to this stage is a bug. I'm interested in contributing to etcd and would be interested in tackling this issue.

On the first run (fails to fsync)

root@ea793ad78415:/workdir/etcd# rm -rf default.etcd
root@ea793ad78415:/workdir/etcd# ./build && ./bin/etcd --logger zap
{"level":"warn","ts":"2019-04-28T17:34:03.360Z","caller":"etcdmain/etcd.go:119","msg":"'data-dir' was empty; using default","data-dir":"default.etcd"}
{"level":"info","ts":"2019-04-28T17:34:03.361Z","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":"2019-04-28T17:34:03.361Z","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["http://localhost:2379"]}
{"level":"info","ts":"2019-04-28T17:34:03.362Z","caller":"embed/etcd.go:297","msg":"starting an etcd server","etcd-version":"3.3.0+git","git-sha":"efcc1088f","go-versio
n":"go1.11.4","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2,"member-initialized":false,"name":"default","data-dir":"default.etcd","wal-dir":""
,"wal-dir-dedicated":"","member-dir":"default.etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance
":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:2380"],"adv
ertise-client-urls":["http://localhost:2379"],"listen-client-urls":["http://localhost:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-clust
er":"default=http://localhost:2380","initial-cluster-state":"new","initial-cluster-token":"etcd-cluster","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt
-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","
discovery-proxy":""}
{"level":"info","ts":"2019-04-28T17:34:03.384Z","caller":"etcdserver/backend.go:79","msg":"opened backend db","path":"default.etcd/member/snap/db","took":"3.8817ms"}
{"level":"info","ts":"2019-04-28T17:34:03.403Z","caller":"wal/wal.go:252","msg":"closing WAL to release flock and retry directory renaming","from":"default.etcd/member/
wal.tmp","to":"default.etcd/member/wal"}
{"level":"warn","ts":"2019-04-28T17:34:03.426Z","caller":"wal/wal.go:202","msg":"failed to fsync the parent data directory file","parent-dir-path":"default.etcd/member"
,"dir-path":"default.etcd/member/wal","error":"sync default.etcd/member: invalid argument"}
{"level":"panic","ts":"2019-04-28T17:34:03.426Z","caller":"etcdserver/raft.go:432","msg":"failed to create WAL","error":"sync default.etcd/member: invalid argument","st
acktrace":"go.etcd.io/etcd/etcdserver.startNode\n\t/workdir/etcd/etcdserver/raft.go:432\ngo.etcd.io/etcd/etcdserver.NewServer\n\t/workdir/etcd/etcdserver/server.go:399\
ngo.etcd.io/etcd/embed.StartEtcd\n\t/workdir/etcd/embed/etcd.go:209\ngo.etcd.io/etcd/etcdmain.startEtcd\n\t/workdir/etcd/etcdmain/etcd.go:302\ngo.etcd.io/etcd/etcdmain.
startEtcdOrProxyV2\n\t/workdir/etcd/etcdmain/etcd.go:160\ngo.etcd.io/etcd/etcdmain.Main\n\t/workdir/etcd/etcdmain/main.go:46\nmain.main\n\t/workdir/etcd/main.go:28\nrun
time.main\n\t/usr/local/go/src/runtime/proc.go:201"}
panic: failed to create WAL

goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000084d10, 0xc00006cec0, 0x1, 0x1)
        /code/raft-go/pkg/mod/go.uber.org/zap@v1.9.1/zapcore/entry.go:229 +0x515
go.uber.org/zap.(*Logger).Panic(0xc00006f2c0, 0xf1cefb, 0x14, 0xc00006cec0, 0x1, 0x1)
        /code/raft-go/pkg/mod/go.uber.org/zap@v1.9.1/logger.go:225 +0x7f
go.etcd.io/etcd/etcdserver.startNode(0xf117a1, 0x7, 0x0, 0x0, 0x0, 0x0, 0xc00019cd80, 0x1, 0x1, 0xc00019cc00, ...)
        /workdir/etcd/etcdserver/raft.go:432 +0x33c
go.etcd.io/etcd/etcdserver.NewServer(0xf117a1, 0x7, 0x0, 0x0, 0x0, 0x0, 0xc00019cd80, 0x1, 0x1, 0xc00019cc00, ...)
        /workdir/etcd/etcdserver/server.go:399 +0x4187
go.etcd.io/etcd/embed.StartEtcd(0xc000288000, 0xc0003ec000, 0x0, 0x0)
        /workdir/etcd/embed/etcd.go:209 +0x959
go.etcd.io/etcd/etcdmain.startEtcd(0xc000288000, 0xf0e862, 0x3, 0xf0f801, 0x5)
        /workdir/etcd/etcdmain/etcd.go:302 +0x40
go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
        /workdir/etcd/etcdmain/etcd.go:160 +0x33a2
go.etcd.io/etcd/etcdmain.Main()
        /workdir/etcd/etcdmain/main.go:46 +0x37
main.main()
        /workdir/etcd/main.go:28 +0x20

Initial entries to bootstrap peers have not been added to the log but it still exists

root@ea793ad78415:/workdir/etcd# ls -lart default.etcd/member/wal
total 125000
-rwxr-xr-x 1 root root 64000000 Apr 28 17:34 0.tmp
-rwxr-xr-x 1 root root 64000000 Apr 28 17:34 0000000000000000-0000000000000000.wal
drwxrwxrwx 2 root root        0 Apr 28 17:34 .
drwxrwxrwx 2 root root        0 Apr 28 17:34 ..

Subsequent run where it times out trying to publish info:

root@ea793ad78415:/workdir/etcd# ./bin/etcd --logger zap
{"level":"warn","ts":"2019-04-28T17:34:15.203Z","caller":"etcdmain/etcd.go:119","msg":"'data-dir' was empty; using default","data-dir":"default.etcd"}
{"level":"info","ts":"2019-04-28T17:34:15.206Z","caller":"etcdmain/etcd.go:134","msg":"server has been already initialized","data-dir":"default.etcd","dir-type":"member
"}
{"level":"info","ts":"2019-04-28T17:34:15.206Z","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":"2019-04-28T17:34:15.207Z","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["http://localhost:2379"]}
{"level":"info","ts":"2019-04-28T17:34:15.209Z","caller":"embed/etcd.go:297","msg":"starting an etcd server","etcd-version":"3.3.0+git","git-sha":"efcc1088f","go-versio
n":"go1.11.4","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2,"member-initialized":true,"name":"default","data-dir":"default.etcd","wal-dir":"",
"wal-dir-dedicated":"","member-dir":"default.etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance"
:true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:2380"],"adve
rtise-client-urls":["http://localhost:2379"],"listen-client-urls":["http://localhost:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluste
r":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval
":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
{"level":"info","ts":"2019-04-28T17:34:15.224Z","caller":"etcdserver/backend.go:79","msg":"opened backend db","path":"default.etcd/member/snap/db","took":"2.4036ms"}
{"level":"info","ts":"2019-04-28T17:34:15.249Z","caller":"etcdserver/raft.go:498","msg":"restarting local member","cluster-id":"cdf818194e3a8c32","local-member-id":"8e9
e05c52164694d","commit-index":0}
{"level":"info","ts":"2019-04-28T17:34:15.250Z","caller":"raft/raft.go:712","msg":"8e9e05c52164694d became follower at term 0"}
{"level":"info","ts":"2019-04-28T17:34:15.250Z","caller":"raft/raft.go:389","msg":"newRaft 8e9e05c52164694d [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, la
stterm: 0]"}
{"level":"warn","ts":"2019-04-28T17:34:15.264Z","caller":"auth/store.go:1288","msg":"simple token is not cryptographically signed"}
{"level":"info","ts":"2019-04-28T17:34:15.273Z","caller":"etcdserver/quota.go:98","msg":"enabled backend quota with default value","quota-name":"v3-applier","quota-size
-bytes":2147483648,"quota-size":"2.1 GB"}
{"level":"info","ts":"2019-04-28T17:34:15.277Z","caller":"etcdserver/server.go:774","msg":"starting etcd server","local-member-id":"8e9e05c52164694d","local-server-vers
ion":"3.3.0+git","cluster-version":"to_be_decided"}
{"level":"info","ts":"2019-04-28T17:34:15.279Z","caller":"embed/etcd.go:239","msg":"now serving peer/client/metrics","local-member-id":"8e9e05c52164694d","initial-adver
tise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://localhost:2379"],"listen-client-urls":["http://l
ocalhost:2379"],"listen-metrics-urls":[]}
{"level":"info","ts":"2019-04-28T17:34:15.279Z","caller":"embed/etcd.go:574","msg":"serving peer traffic","address":"127.0.0.1:2380"}
{"level":"info","ts":"2019-04-28T17:34:15.280Z","caller":"etcdserver/server.go:662","msg":"starting initial election tick advance","election-ticks":10}
{"level":"warn","ts":"2019-04-28T17:34:22.280Z","caller":"etcdserver/server.go:1853","msg":"failed to publish local member to cluster through raft","local-member-id":"8
e9e05c52164694d","local-member-attributes":"{Name:default ClientURLs:[http://localhost:2379]}","request-path":"/0/members/8e9e05c52164694d/attributes","publish-timeout"
:"7s","error":"etcdserver: request timed out"}
^C

I think (one of?) 2 things could be done:

  • Delete the WAL dir and panic if anything after the rename fails so that it's not left in an inconsistent state after a panic.
  • If no entries are detected in a WAL on startup then add them in.

Let me know your thoughts?
I'm not too familiar with the code base yet but I think that's the cause?

joshcc3 pushed a commit to joshcc3/etcd that referenced this issue Apr 28, 2019
  delete <data-dir>/member/wal if any operation after the rename in
  wal.Create fails to avoid reading an inconsistent WAL on restart.

  Fixes etcd-io#10688
joshcc3 pushed a commit to joshcc3/etcd that referenced this issue Apr 29, 2019
  delete <data-dir>/member/wal if any operation after the rename in
  wal.Create fails to avoid reading an inconsistent WAL on restart.

  Fixes etcd-io#10688
joshcc3 pushed a commit to joshcc3/etcd that referenced this issue Apr 30, 2019
delete <data-dir>/member/wal if any operation after the rename in
wal.Create fails to avoid reading an inconsistent WAL on restart.

Fixes etcd-io#10688
joshcc3 pushed a commit to joshcc3/etcd that referenced this issue May 4, 2019
delete <data-dir>/member/wal if any operation after the rename in
wal.Create fails to avoid reading an inconsistent WAL on restart.

Fixes etcd-io#10688
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

1 participant