-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
raft: cleanup wal directory if creation fails #10689
Conversation
Codecov Report
@@ Coverage Diff @@
## master #10689 +/- ##
=========================================
Coverage ? 71.39%
=========================================
Files ? 394
Lines ? 36674
Branches ? 0
=========================================
Hits ? 26185
Misses ? 8638
Partials ? 1851
Continue to review full report at Codecov.
|
I fixed the commit format error, but I don't think the test failures are related to my change? |
@joshcc3 for commit title error, please make sure there is not empty space prefix. |
@spzala Looks good now I think? |
wal/wal.go
Outdated
@@ -195,6 +195,7 @@ func Create(lg *zap.Logger, dirpath string, metadata []byte) (*WAL, error) { | |||
zap.Error(perr), | |||
) | |||
} | |||
w.cleanupWAL(lg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a defer func to do this?
defer func() {
if err!= nil {
w.cleanupWAL(lg)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then we do not need to repeat the cleanup 3 times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
@joshcc3 I am not sure if we should remove the entire wal dir. it is not good for debugging purpose. how about doing a rename to wal.broken or something? |
Sure! I've added a timestamp in as well |
@xiang90 When I ran this by setting perr at the end of the wal.Create function I get the following errors on 2 seperate runs. The expected behavior is to Panic inside the startNode function. First run with no default.etcd
After the initial run with:
It fails with the same error consistently
|
@joshcc3 probably just rename the wal dir is not good enough. etcd server probably checks the existence of data dir with other hints too. |
delete <data-dir>/member/wal if any operation after the rename in wal.Create fails to avoid reading an inconsistent WAL on restart. Fixes etcd-io#10688
@xiang90 Ignore my previous comment, there was an issue with my test - this change works as expected locally. The failing builds seem to be caused by a some sort of network connectivity issue?
|
@joshcc3 the |
@@ -223,6 +230,30 @@ func Create(lg *zap.Logger, dirpath string, metadata []byte) (*WAL, error) { | |||
return w, nil | |||
} | |||
|
|||
func (w *WAL) cleanupWAL(lg *zap.Logger) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, wondering, how about adding a test in the https://github.com/etcd-io/etcd/blob/master/wal/wal_test.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be a good idea, however I'm not sure how to reproduce the case that I encountered (a failing fsync on a writable directory). Are there examples of platform specific tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@spzala Any comments on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we just need to test if this func works correctly. basically just the rename part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, something similar to the ones in the wral_test.go
that are created for different wral functions.
Rename wal with '.suffix.<timestamp>' instead of delete it and call cleanup when perr in a 'defer'ed statement.
wal/wal.go
Outdated
var err error | ||
if err = w.Close(); err != nil { | ||
if lg != nil { | ||
lg.Panic("failed to closeup WAL during cleanup", zap.Error(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing my comments! Not trying to be picky but when you update the changes, I would suggest to use 'close WAL' instead of 'closeup WAL` in the log messages (here and on LOC #239)
wal/wal_test.go
Outdated
pattern := fmt.Sprintf(`%s.broken\.[\d]{8}\.[\d]{6}\.[\d]{1,6}?`, filepath.Base(p)) | ||
match, _ := regexp.MatchString(pattern, fnames[0]) | ||
if !match { | ||
t.Fatalf("match = false, expected true for %v with pattern %v", fnames[0], pattern) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use t.error to check test result here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xiang90, what is the merge process/requirements?
lgtm |
To add test coverage of wal cleanup.
delete /member/wal if any operation after the rename in
wal.Create fails to avoid reading an inconsistent WAL on restart.
Fixes #10688
Please read https://github.com/etcd-io/etcd/blob/master/CONTRIBUTING.md#contribution-flow.