Problem: memiavl snapshot rewriting is not triggered #1034

Merged
merged 18 commits into from
May 25, 2023

Conversation

yihuang
Collaborator

@yihuang yihuang commented May 23, 2023

Solution:

  • add configs to trigger the snapshot rewriting.

PR Checklist:

  • Have you read the CONTRIBUTING.md?
  • Does your PR follow the C4 patch requirements?
  • Have you rebased your work on top of the latest master?
  • Have you checked your code compiles? (make)
  • Have you included tests for any non-trivial functionality?
  • Have you checked your code passes the unit tests? (make test)
  • Have you checked your code formatting is correct? (go fmt)
  • Have you checked your basic code style is fine? (golangci-lint run)
  • If you added any dependencies, have you checked they do not contain any known vulnerabilities? (go list -json -m all | nancy sleuth)
  • If your changes affect the client infrastructure, have you run the integration test?
  • If your changes affect public APIs, does your PR follow the C4 evolution of public contracts?
  • If your code changes public APIs, have you incremented the crate version numbers and documented your changes in the CHANGELOG.md?
  • If you are contributing for the first time, please read the agreement in CONTRIBUTING.md now and add a comment to this pull request stating that your PR is in accordance with the Developer's Certificate of Origin.

Thank you for your code, it's appreciated! :)

@yihuang yihuang requested a review from a team as a code owner May 23, 2023 09:01
@yihuang yihuang requested review from JayT106, calvinaco and mmsqe and removed request for a team May 23, 2023 09:01
Signed-off-by: yihuang <huang@crypto.com>
opts := rs.opts
opts.CreateIfMissing = true
opts.InitialStores = initialStores
opts.TargetVersion = uint32(version)

Check failure

Code scanning / gosec

Potential integer overflow by integer type conversion

Potential integer overflow by integer type conversion
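The gosec finding above concerns `uint32(version)`, which silently wraps when the value is negative or exceeds the `uint32` range. A minimal sketch of a checked conversion follows; `toUint32` is a hypothetical helper for illustration and is not taken from this PR.

```go
package main

import (
	"fmt"
	"math"
)

// toUint32 converts a version to uint32 and returns an error instead of
// silently wrapping when the value is negative or too large.
// Hypothetical helper, not part of the PR.
func toUint32(version int64) (uint32, error) {
	if version < 0 || version > math.MaxUint32 {
		return 0, fmt.Errorf("version %d out of uint32 range", version)
	}
	return uint32(version), nil
}

func main() {
	for _, v := range []int64{42, -1, math.MaxUint32 + 1} {
		n, err := toUint32(v)
		fmt.Println(n, err)
	}
}
```

Whether the extra check is worth it depends on whether callers can ever pass a version outside that range; the sketch only shows the shape of the guard.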
@codecov

codecov bot commented May 23, 2023

Codecov Report

Merging #1034 (b2d8885) into main (37a840a) will increase coverage by 24.11%.
The diff coverage is 54.85%.

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #1034       +/-   ##
===========================================
+ Coverage   22.43%   46.55%   +24.11%     
===========================================
  Files          50       82       +32     
  Lines        3013     7091     +4078     
===========================================
+ Hits          676     3301     +2625     
- Misses       2272     3443     +1171     
- Partials       65      347      +282     
Impacted Files Coverage Δ
app/config/config.go 100.00% <ø> (ø)
memiavl/tree.go 79.86% <ø> (ø)
memiavl/db.go 55.78% <51.31%> (ø)
memiavl/multitree.go 73.22% <61.53%> (ø)
app/memiavl.go 100.00% <100.00%> (ø)

... and 44 files with indirect coverage changes

@@ -372,18 +404,26 @@
return nil
}

// rewriteIfApplicable executes the snapshot rewrite strategy according to the current height
func (db *DB) rewriteIfApplicable(height int64) {
if height%int64(db.snapshotInterval) != 0 {

Check failure

Code scanning / gosec

Potential integer overflow by integer type conversion

Potential integer overflow by integer type conversion
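The quoted `rewriteIfApplicable` triggers a rewrite when the height is a multiple of the configured interval. The sketch below isolates that check as a pure function; treating an interval of 0 as "disabled" is an assumption of this sketch (it also avoids a division by zero), and `shouldRewrite` is a hypothetical name.

```go
package main

import "fmt"

// shouldRewrite reports whether a snapshot rewrite is due at the given height.
// Assumption of this sketch: interval == 0 means the feature is disabled.
func shouldRewrite(height int64, interval uint32) bool {
	if interval == 0 || height <= 0 {
		return false
	}
	// Widening uint32 to int64 cannot overflow.
	return height%int64(interval) == 0
}

func main() {
	for _, h := range []int64{999, 1000, 1001, 2000} {
		fmt.Println(h, shouldRewrite(h, 1000))
	}
}
```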
@yihuang yihuang requested a review from mmsqe May 24, 2023 02:48
Comment on lines +92 to +100
for key := range rs.stores {
store := rs.stores[key]
if store.GetStoreType() == types.StoreTypeIAVL {
rs.stores[key], err = rs.loadCommitStoreFromParams(rs.db, key, rs.storesParams[key])
if err != nil {
panic(fmt.Errorf("inconsistent store map, store %s not found", key.Name()))
}
}
}

Check failure

Code scanning / gosec

the value in the range statement should be _ unless copying a map: want: for key := range m

expected exactly 1 statement (either append, delete, or copying to another map) in a range with a map, got 2
Comment on lines +92 to +100
for key := range rs.stores {
store := rs.stores[key]
if store.GetStoreType() == types.StoreTypeIAVL {
rs.stores[key], err = rs.loadCommitStoreFromParams(rs.db, key, rs.storesParams[key])
if err != nil {
panic(fmt.Errorf("inconsistent store map, store %s not found", key.Name()))
}
}
}

Check warning

Code scanning / CodeQL

Iteration over map

Iteration over map may be a possible source of non-determinism
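Both scanners flag the same loop because Go randomizes map iteration order. A common way to make such a loop deterministic is to collect the keys, sort them, and iterate over the sorted slice. The sketch below uses a plain `map[string]int` as a stand-in for the store map; the names are illustrative only.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the keys of m in ascending order so that callers can
// iterate over the map deterministically.
func sortedKeys(m map[string]int) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Stand-in for the store map; names and values are made up.
	stores := map[string]int{"evm": 3, "bank": 1, "ibc": 2}
	for _, name := range sortedKeys(stores) {
		fmt.Println(name, stores[name])
	}
}
```

The range loop over the map contains a single `append`, which is the shape the gosec rule quoted above accepts.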
@yihuang yihuang enabled auto-merge May 24, 2023 06:50
@yihuang yihuang disabled auto-merge May 24, 2023 07:44
if initialVersion > 1 {
return int64(index) + int64(initialVersion) - 1
}
return int64(index)

Check failure

Code scanning / gosec

Potential integer overflow by integer type conversion

Potential integer overflow by integer type conversion
// walVersion converts wal index to version, reverse of walIndex
func walVersion(index uint64, initialVersion uint32) int64 {
if initialVersion > 1 {
return int64(index) + int64(initialVersion) - 1

Check failure

Code scanning / gosec

Potential integer overflow by integer type conversion

Potential integer overflow by integer type conversion
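The comment says `walVersion` is the reverse of `walIndex`, but `walIndex` is not shown in this thread. The sketch below copies `walVersion` as quoted and reconstructs `walIndex` as its algebraic inverse; that reconstruction is an assumption, not the PR's code.

```go
package main

import "fmt"

// walVersion converts a WAL index to a version, as quoted in the review.
func walVersion(index uint64, initialVersion uint32) int64 {
	if initialVersion > 1 {
		return int64(index) + int64(initialVersion) - 1
	}
	return int64(index)
}

// walIndex is inferred here as the inverse of walVersion.
// Assumption: the real function may differ in signature or checks.
func walIndex(version int64, initialVersion uint32) uint64 {
	if initialVersion > 1 {
		return uint64(version - int64(initialVersion) + 1)
	}
	return uint64(version)
}

func main() {
	const initialVersion = 100
	for _, index := range []uint64{1, 2, 50} {
		v := walVersion(index, initialVersion)
		fmt.Println(index, v, walIndex(v, initialVersion))
	}
}
```

With an initial version of 100, WAL index 1 maps to version 100 and back to index 1; with an initial version of 0 or 1, index and version are equal.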
@yihuang yihuang force-pushed the memiavl-snapshot-strategy branch from 12c2522 to 46b7a07 Compare May 24, 2023 09:13
@yihuang yihuang enabled auto-merge May 24, 2023 09:22
Collaborator

@mmsqe mmsqe left a comment

It seems the async WAL writing goroutine quits unexpectedly when the file is already closed.

@yihuang
Collaborator Author

yihuang commented May 25, 2023

It seems the async WAL writing goroutine quits unexpectedly when the file is already closed.

Where did you see that error?

@mmsqe
Collaborator

mmsqe commented May 25, 2023

It seems the async WAL writing goroutine quits unexpectedly when the file is already closed.

Where did you see that error?

I was running the IBC test:

1:57PM ERR CONSENSUS FAILURE!!! err="async wal writing goroutine quit unexpectedly: write /private/tmp/pytest-of-mavistan/pytest-1/ibc1/cronos_777-1/node0/data/memiavl.db/wal/00000000000000000076: file already closed" module=consensus server=node stack="goroutine 201 [running]:\nruntime/debug.Stack()\n\truntime/debug/stack.go:24 +0x65\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine.func2()\n\tgithub.com/tendermint/tendermint/consensus/state.go:732 +0x4c\npanic({0x10250eca0, 0xc000012720})\n\truntime/panic.go:884 +0x213\ngithub.com/crypto-org-chain/cronos/store/rootmulti.(*Store).Commit(0xc000c09e40)\n\tgithub.com/crypto-org-chain/cronos/store/rootmulti/store.go:88 +0x5bf\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).Commit(0xc000eafc00)\n\tgithub.com/cosmos/cosmos-sdk/baseapp/abci.go:313 +0x166\ngithub.com/tendermint/tendermint/abci/client.(*localClient).CommitSync(0xc001a21020)\n\tgithub.com/tendermint/tendermint/abci/client/local_client.go:264 +0xb6\ngithub.com/tendermint/tendermint/proxy.(*appConnConsensus).CommitSync(0xc000e8cb40?)\n\tgithub.com/tendermint/tendermint/proxy/app_conn.go:93 +0x22\ngithub.com/tendermint/tendermint/state.(*BlockExecutor).Commit(_, {{{0xb, 0x0}, {0xc000fb0b24, 0x7}}, {0xc000fb0b80, 0xc}, 0x1, 0x56, {{0xc0005dd580, ...}, ...}, ...}, ...)\n\tgithub.com/tendermint/tendermint/state/execution.go:228 +0x269\ngithub.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(_, {{{0xb, 0x0}, {0xc000fb0b24, 0x7}}, {0xc000fb0b80, 0xc}, 0x1, 0x56, {{0xc0005dd580, ...}, ...}, ...}, ...)\n\tgithub.com/tendermint/tendermint/state/execution.go:180 +0x6ee\ngithub.com/tendermint/tendermint/consensus.(*State).finalizeCommit(0xc000610e00, 0x56)\n\tgithub.com/tendermint/tendermint/consensus/state.go:1661 +0xafd\ngithub.com/tendermint/tendermint/consensus.(*State).tryFinalizeCommit(0xc000610e00, 0x56)\n\tgithub.com/tendermint/tendermint/consensus/state.go:1570 
+0x2ff\ngithub.com/tendermint/tendermint/consensus.(*State).enterCommit.func1()\n\tgithub.com/tendermint/tendermint/consensus/state.go:1505 +0xaa\ngithub.com/tendermint/tendermint/consensus.(*State).enterCommit(0xc000610e00, 0x56, 0x0)\n\tgithub.com/tendermint/tendermint/consensus/state.go:1543 +0xccf\ngithub.com/tendermint/tendermint/consensus.(*State).addVote(0xc000610e00, 0xc0006217c0, {0xc0013ce060, 0x28})\n\tgithub.com/tendermint/tendermint/consensus/state.go:2164 +0x18dc\ngithub.com/tendermint/tendermint/consensus.(*State).tryAddVote(0xc000610e00, 0xc0006217c0, {0xc0013ce060?, 0xc006bc9c08?})\n\tgithub.com/tendermint/tendermint/consensus/state.go:1962 +0x2c\ngithub.com/tendermint/tendermint/consensus.(*State).handleMsg(0xc000610e00, {{0x1039dc940, 0xc001b34778}, {0xc0013ce060, 0x28}})\n\tgithub.com/tendermint/tendermint/consensus/state.go:861 +0x3ff\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine(0xc000610e00, 0x0)\n\tgithub.com/tendermint/tendermint/consensus/state.go:768 +0x3f0\ncreated by github.com/tendermint/tendermint/consensus.(*State).OnStart\n\tgithub.com/tendermint/tendermint/consensus/state.go:379 +0x12d\n"


// truncate WAL until the earliest remaining snapshot
earliestVersion, err := firstSnapshotVersion(db.dir)
if err := db.wal.TruncateFront(uint64(earliestVersion + 1)); err != nil {

Check failure

Code scanning / gosec

Potential integer overflow by integer type conversion

Potential integer overflow by integer type conversion
if err != nil {
return nil, err
}

if err := mtree.CatchupWAL(wal, int64(opts.TargetVersion)); err != nil {
return nil, err
if opts.TargetVersion == 0 || int64(opts.TargetVersion) > mtree.Version() {

Check failure

Code scanning / gosec

Potential integer overflow by integer type conversion

Potential integer overflow by integer type conversion
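The diff fragment above replays the WAL only when no target version is set or the target lies beyond the loaded snapshot. The sketch below models that decision with plain integers; `replayUntil` and its slice-of-versions input are hypothetical simplifications, and reading a target of 0 as "replay everything" is inferred from the quoted condition.

```go
package main

import "fmt"

// replayUntil returns the version reached after replaying WAL entries on top
// of a snapshot at version current. walVersions must be ascending.
// Assumption: target == 0 means "replay to the latest entry".
func replayUntil(current int64, walVersions []int64, target int64) int64 {
	for _, v := range walVersions {
		if v <= current {
			continue // already covered by the snapshot
		}
		if target != 0 && v > target {
			break // stop once the requested version is reached
		}
		current = v
	}
	return current
}

func main() {
	wal := []int64{4, 5, 6, 7, 8}
	fmt.Println(replayUntil(5, wal, 7)) // stops at the target
	fmt.Println(replayUntil(5, wal, 0)) // replays to the end
	fmt.Println(replayUntil(5, wal, 3)) // target is behind the snapshot
}
```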
return nil, fmt.Errorf("failed to load current version: %w", err)
}

if int64(opts.TargetVersion) < version {

Check failure

Code scanning / gosec

Potential integer overflow by integer type conversion

Potential integer overflow by integer type conversion
@yihuang
Collaborator Author

yihuang commented May 25, 2023

It seems the async WAL writing goroutine quits unexpectedly when the file is already closed.

Where did you see that error?

I was running the IBC test:

Is it when quitting the node? It doesn't seem to reproduce here.

if err := mtree.CatchupWAL(wal, int64(opts.TargetVersion)); err != nil {
return nil, err
if opts.TargetVersion == 0 || int64(opts.TargetVersion) > mtree.Version() {
if err := mtree.CatchupWAL(wal, int64(opts.TargetVersion)); err != nil {

Check failure

Code scanning / gosec

Potential integer overflow by integer type conversion

Potential integer overflow by integer type conversion
Comment on lines +271 to +318
go func() {
defer db.pruneSnapshotLock.Unlock()

currentName, err := os.Readlink(currentPath(db.dir))
if err != nil {
db.logger.Error("failed to read current snapshot name", "err", err)
return
}

entries, err := os.ReadDir(db.dir)
if err != nil {
db.logger.Error("failed to read db dir", "err", err)
return
}

counter := db.snapshotKeepRecent
for i := len(entries) - 1; i >= 0; i-- {
name := entries[i].Name()
if !strings.HasPrefix(name, SnapshotPrefix) {
continue
}

if name >= currentName {
// ignore any newer snapshot directories; there could be an ongoing snapshot rewrite.
continue
}

if counter > 0 {
counter--
continue
}

db.logger.Info("prune snapshot", "name", name)
if err := os.RemoveAll(filepath.Join(db.dir, name)); err != nil {
db.logger.Error("failed to prune snapshot", "err", err)
}
}

// truncate WAL until the earliest remaining snapshot
earliestVersion, err := firstSnapshotVersion(db.dir)
if err != nil {
db.logger.Error("failed to find first snapshot", "err", err)
}

if err := db.wal.TruncateFront(uint64(earliestVersion + 1)); err != nil {
db.logger.Error("failed to truncate wal", "err", err, "version", earliestVersion+1)
}
}()

Check notice

Code scanning / CodeQL

Spawning a Go routine

Spawning a Go routine may be a possible source of non-determinism
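The selection logic inside the pruning goroutine can be read as a pure function: walk the directory entries from newest to oldest, skip anything at or after the current snapshot, keep the configured number of recent ones, and prune the rest. The sketch below extracts that logic; `snapshotsToPrune` is a hypothetical name, the `snapshot-` prefix value is assumed, and the zero-padded names are made up so that lexicographic order matches version order.

```go
package main

import (
	"fmt"
	"strings"
)

// snapshotPrefix stands in for SnapshotPrefix; the actual value is assumed.
const snapshotPrefix = "snapshot-"

// snapshotsToPrune returns the snapshot names that would be removed.
// names must be sorted ascending, which is the order os.ReadDir returns.
func snapshotsToPrune(names []string, current string, keepRecent int) []string {
	var pruned []string
	counter := keepRecent
	for i := len(names) - 1; i >= 0; i-- {
		name := names[i]
		if !strings.HasPrefix(name, snapshotPrefix) {
			continue // not a snapshot directory
		}
		if name >= current {
			continue // current snapshot, or a newer one still being written
		}
		if counter > 0 {
			counter-- // keep the most recent older snapshots
			continue
		}
		pruned = append(pruned, name)
	}
	return pruned
}

func main() {
	names := []string{"current", "snapshot-0010", "snapshot-0020",
		"snapshot-0030", "snapshot-0040", "wal"}
	fmt.Println(snapshotsToPrune(names, "snapshot-0030", 1))
}
```

Separating the selection from the file removal keeps it testable without touching the filesystem; the actual deletion and WAL truncation would still run where the PR places them.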
@yihuang yihuang added this pull request to the merge queue May 25, 2023
@mmsqe
Collaborator

mmsqe commented May 25, 2023

Is it when quitting the node? It doesn't seem to reproduce here.

Strange, I have only seen it once in e2e. Maybe I can add a unit test later.
