Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: cleanup wal directory if creation fails #10689

Merged
merged 3 commits into from
May 10, 2019
Merged

Conversation

joshcc3
Copy link
Contributor

@joshcc3 joshcc3 commented Apr 28, 2019

delete /member/wal if any operation after the rename in
wal.Create fails to avoid reading an inconsistent WAL on restart.

Fixes #10688

Please read https://github.com/etcd-io/etcd/blob/master/CONTRIBUTING.md#contribution-flow.

@codecov-io
Copy link

codecov-io commented Apr 28, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@caee28a). Click here to learn what that means.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #10689   +/-   ##
=========================================
  Coverage          ?   71.39%           
=========================================
  Files             ?      394           
  Lines             ?    36674           
  Branches          ?        0           
=========================================
  Hits              ?    26185           
  Misses            ?     8638           
  Partials          ?     1851
Impacted Files Coverage Δ
wal/wal.go 53.33% <0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update caee28a...f7f7e9c. Read the comment docs.

@joshcc3
Copy link
Contributor Author

joshcc3 commented Apr 29, 2019

I fixed the commit format error, but I don't think the test failures are related to my change?

@spzala
Copy link
Member

spzala commented Apr 29, 2019

I fixed the commit format error, but I don't think the test failures are related to my change?

@joshcc3 for commit title error, please make sure there is not empty space prefix.

@joshcc3
Copy link
Contributor Author

joshcc3 commented Apr 30, 2019

@spzala Looks good now I think?

wal/wal.go Outdated
@@ -195,6 +195,7 @@ func Create(lg *zap.Logger, dirpath string, metadata []byte) (*WAL, error) {
zap.Error(perr),
)
}
w.cleanupWAL(lg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a defer func to do this?

defer func() {
    if err!= nil {
        w.cleanupWAL(lg)
    }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we do not need to repeat the cleanup 3 times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@xiang90
Copy link
Contributor

xiang90 commented May 1, 2019

@joshcc3 I am not sure if we should remove the entire wal dir. it is not good for debugging purpose. how about doing a rename to wal.broken or something?

@joshcc3
Copy link
Contributor Author

joshcc3 commented May 2, 2019

@joshcc3 I am not sure if we should remove the entire wal dir. it is not good for debugging purpose. how about doing a rename to wal.broken or something?

Sure! I've added a timestamp in as well

@joshcc3
Copy link
Contributor Author

joshcc3 commented May 2, 2019

@xiang90 When I ran this by setting perr at the end of the wal.Create function I get the following errors on 2 seperate runs. The expected behavior is to Panic inside the startNode function.
Is there a way to prevent this?
(It runs successfully when the artifically introduced error is removed.)

First run with no default.etcd

2019-05-02 10:56:37.349656 E | etcdmain: error verifying flags, unknown logger option "zap[A". See 'etcd --help'.
[jcoutin@loowjcoutin2 etcd]$ ./bin/etcd --logger zap
{"level":"warn","ts":"2019-05-02T10:56:38.973+0100","caller":"etcdmain/etcd.go:119","msg":"'data-dir' was empty; using default","data-dir":"default.etcd"}
{"level":"info","ts":"2019-05-02T10:56:38.973+0100","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":"2019-05-02T10:56:38.974+0100","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["http://localhost:2379"]}
{"level":"info","ts":"2019-05-02T10:56:38.974+0100","caller":"embed/etcd.go:297","msg":"starting an etcd server","etcd-version":"3.3.0+git","git-sha":"4f270cf8e","go-version":"go1.12.4","go-os":"linux","go-arch":"amd64","max-cpu-set":8,"max-cpu-available":8,"member-initialized":false,"name":"default","data-dir":"default.etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"default.etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://localhost:2379"],"listen-client-urls":["http://localhost:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"default=http://localhost:2380","initial-cluster-state":"new","initial-cluster-token":"etcd-cluster","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
{"level":"info","ts":"2019-05-02T10:56:38.977+0100","caller":"etcdserver/backend.go:79","msg":"opened backend db","path":"default.etcd/member/snap/db","took":"2.643108ms"}
{"level":"info","ts":"2019-05-02T10:56:38.988+0100","caller":"etcdserver/raft.go:452","msg":"starting local member","local-member-id":"8e9e05c52164694d","cluster-id":"cdf818194e3a8c32"}
{"level":"info","ts":"2019-05-02T10:56:38.988+0100","caller":"raft/raft.go:712","msg":"8e9e05c52164694d became follower at term 0"}
{"level":"info","ts":"2019-05-02T10:56:38.988+0100","caller":"raft/raft.go:389","msg":"newRaft 8e9e05c52164694d [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]"}
{"level":"info","ts":"2019-05-02T10:56:38.988+0100","caller":"raft/raft.go:712","msg":"8e9e05c52164694d became follower at term 1"}
{"level":"warn","ts":"2019-05-02T10:56:39.006+0100","caller":"auth/store.go:1288","msg":"simple token is not cryptographically signed"}
{"level":"info","ts":"2019-05-02T10:56:39.008+0100","caller":"etcdserver/quota.go:98","msg":"enabled backend quota with default value","quota-name":"v3-applier","quota-size-bytes":2147483648,"quota-size":"2.1 GB"}
{"level":"info","ts":"2019-05-02T10:56:39.011+0100","caller":"etcdserver/server.go:774","msg":"starting etcd server","local-member-id":"8e9e05c52164694d","local-server-version":"3.3.0+git","cluster-version":"to_be_decided"}
{"level":"info","ts":"2019-05-02T10:56:39.011+0100","caller":"etcdserver/server.go:640","msg":"started as single-node; fast-forwarding election ticks","local-member-id":"8e9e05c52164694d","forward-ticks":9,"forward-duration":"900ms","election-ticks":10,"election-timeout":"1s"}
{"level":"fatal","ts":"2019-05-02T10:56:39.011+0100","caller":"etcdserver/raft.go:233","msg":"failed to save Raft hard state and entries","error":"seek default.etcd/member/wal.tmp/0000000000000000-0000000000000000.wal: file already closed","stacktrace":"go.etcd.io/etcd/etcdserver.(*raftNode).start.func1\n\t/data/home/jcoutin/tmp/etcd/etcdserver/raft.go:233"}

After the initial run with:

[jcoutin@loowjcoutin2 etcd]$ ls default.etcd/member/
snap  wal.broken.20190502.105638.988452

It fails with the same error consistently

[jcoutin@loowjcoutin2 etcd]$ ./bin/etcd --logger zap
{"level":"warn","ts":"2019-05-02T10:57:48.511+0100","caller":"etcdmain/etcd.go:119","msg":"'data-dir' was empty; using default","data-dir":"default.etcd"}
{"level":"info","ts":"2019-05-02T10:57:48.511+0100","caller":"etcdmain/etcd.go:134","msg":"server has been already initialized","data-dir":"default.etcd","dir-type":"member"}
{"level":"info","ts":"2019-05-02T10:57:48.511+0100","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":"2019-05-02T10:57:48.513+0100","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["http://localhost:2379"]}
{"level":"info","ts":"2019-05-02T10:57:48.514+0100","caller":"embed/etcd.go:297","msg":"starting an etcd server","etcd-version":"3.3.0+git","git-sha":"4f270cf8e","go-version":"go1.12.4","go-os":"linux","go-arch":"amd64","max-cpu-set":8,"max-cpu-available":8,"member-initialized":false,"name":"default","data-dir":"default.etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"default.etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://localhost:2379"],"listen-client-urls":["http://localhost:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"default=http://localhost:2380","initial-cluster-state":"new","initial-cluster-token":"etcd-cluster","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
{"level":"info","ts":"2019-05-02T10:57:48.514+0100","caller":"etcdserver/backend.go:79","msg":"opened backend db","path":"default.etcd/member/snap/db","took":"107.887µs"}
{"level":"info","ts":"2019-05-02T10:57:48.519+0100","caller":"etcdserver/raft.go:452","msg":"starting local member","local-member-id":"8e9e05c52164694d","cluster-id":"cdf818194e3a8c32"}
{"level":"info","ts":"2019-05-02T10:57:48.519+0100","caller":"raft/raft.go:712","msg":"8e9e05c52164694d became follower at term 0"}
{"level":"info","ts":"2019-05-02T10:57:48.519+0100","caller":"raft/raft.go:389","msg":"newRaft 8e9e05c52164694d [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]"}
{"level":"info","ts":"2019-05-02T10:57:48.519+0100","caller":"raft/raft.go:712","msg":"8e9e05c52164694d became follower at term 1"}
{"level":"warn","ts":"2019-05-02T10:57:48.528+0100","caller":"auth/store.go:1288","msg":"simple token is not cryptographically signed"}
{"level":"info","ts":"2019-05-02T10:57:48.529+0100","caller":"etcdserver/quota.go:98","msg":"enabled backend quota with default value","quota-name":"v3-applier","quota-size-bytes":2147483648,"quota-size":"2.1 GB"}
{"level":"info","ts":"2019-05-02T10:57:48.530+0100","caller":"etcdserver/server.go:774","msg":"starting etcd server","local-member-id":"8e9e05c52164694d","local-server-version":"3.3.0+git","cluster-version":"to_be_decided"}
{"level":"info","ts":"2019-05-02T10:57:48.530+0100","caller":"etcdserver/server.go:640","msg":"started as single-node; fast-forwarding election ticks","local-member-id":"8e9e05c52164694d","forward-ticks":9,"forward-duration":"900ms","election-ticks":10,"election-timeout":"1s"}
{"level":"fatal","ts":"2019-05-02T10:57:48.530+0100","caller":"etcdserver/server.go:816","msg":"failed to purge wal file","error":"open default.etcd/member/wal: no such file or directory","stacktrace":"go.etcd.io/etcd/etcdserver.(*EtcdServer).purgeFile\n\t/data/home/jcoutin/tmp/etcd/etcdserver/server.go:816\ngo.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach.func1\n\t/data/home/jcoutin/tmp/etcd/etcdserver/server.go:2426"}

@xiang90
Copy link
Contributor

xiang90 commented May 2, 2019

@joshcc3 probably just rename the wal dir is not good enough. etcd server probably checks the existence of data dir with other hints too.

@jingyih
Copy link
Contributor

jingyih commented May 2, 2019

I did a quick search. etcd server is checking the file extension.

haveWAL := wal.Exist(cfg.WALDir())

etcd/wal/util.go

Lines 30 to 36 in 616592d

func Exist(dir string) bool {
names, err := fileutil.ReadDir(dir, fileutil.WithExt(".wal"))
if err != nil {
return false
}
return len(names) != 0
}

delete <data-dir>/member/wal if any operation after the rename in
wal.Create fails to avoid reading an inconsistent WAL on restart.

Fixes etcd-io#10688
@joshcc3
Copy link
Contributor Author

joshcc3 commented May 4, 2019

@xiang90 Ignore my previous comment, there was an issue with my test - this change works as expected locally.
@jingyih Since the change renames the WAL directory to wal.broken.<timestamp>, the cfg.WALDir() in wal.Exist(cfg.WALDir()) will not match the expected wal and this should work as expected.

The failing builds seem to be caused by a some sort of network connectivity issue?
All of the failing ones, fail when trying to download requirements with:

go: sigs.k8s.io/yaml@v1.1.0: unrecognized import path "sigs.k8s.io/yaml" (https fetch: Get https://sigs.k8s.io/yaml?go-get=1: net/http: TLS handshake timeout)
go: error loading module requirements

@spzala
Copy link
Member

spzala commented May 4, 2019

@xiang90 Ignore my previous comment, there was an issue with my test - this change works as expected locally.
@jingyih Since the change renames the WAL directory to wal.broken., the cfg.WALDir() in wal.Exist(cfg.WALDir()) will not match the expected wal and this should work as expected.

The failing builds seem to be caused by a some sort of network connectivity issue?
All of the failing ones, fail when trying to download requirements with:

go: sigs.k8s.io/yaml@v1.1.0: unrecognized import path "sigs.k8s.io/yaml" (https fetch: Get https://sigs.k8s.io/yaml?go-get=1: net/http: TLS handshake timeout)
go: error loading module requirements

@joshcc3 the TLS handshake timeout is not related to your changes. It's created by another PR merge recently and the PR author is notified of it to investigate.

wal/wal.go Outdated Show resolved Hide resolved
wal/wal.go Outdated Show resolved Hide resolved
@@ -223,6 +230,30 @@ func Create(lg *zap.Logger, dirpath string, metadata []byte) (*WAL, error) {
return w, nil
}

func (w *WAL) cleanupWAL(lg *zap.Logger) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, wondering, how about adding a test in the https://github.com/etcd-io/etcd/blob/master/wal/wal_test.go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a good idea, however I'm not sure how to reproduce the case that I encountered (a failing fsync on a writable directory). Are there examples of platform specific tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spzala Any comments on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we just need to test if this func works correctly. basically just the rename part.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, something similar to the ones in the wral_test.go that are created for different wral functions.

Rename wal with '.suffix.<timestamp>' instead of delete it and call cleanup when perr in a 'defer'ed statement.
wal/wal.go Outdated
var err error
if err = w.Close(); err != nil {
if lg != nil {
lg.Panic("failed to closeup WAL during cleanup", zap.Error(err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my comments! Not trying to be picky but when you update the changes, I would suggest to use 'close WAL' instead of 'closeup WAL` in the log messages (here and on LOC #239)

wal/wal_test.go Outdated
pattern := fmt.Sprintf(`%s.broken\.[\d]{8}\.[\d]{6}\.[\d]{1,6}?`, filepath.Base(p))
match, _ := regexp.MatchString(pattern, fnames[0])
if !match {
t.Fatalf("match = false, expected true for %v with pattern %v", fnames[0], pattern)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use t.error to check test result here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xiang90, what is the merge process/requirements?

@xiang90
Copy link
Contributor

xiang90 commented May 10, 2019

lgtm

To add test coverage of wal cleanup.
@xiang90 xiang90 merged commit d8c8902 into etcd-io:master May 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

WAL not cleaned up on failed creation
5 participants