Crash when trying to open corrupted database #105

tmm1 · 2018-06-27T00:56:07Z

I have an app that takes regular backups of boltdb databases. Sometimes, for unknown reasons, the backups are corrupted.

I also have a restore UI that lets me browse and read from backups. Trying to open and read from these corrupted databases crashes my process. I'm using 4f5275f

unexpected fault address 0x8a6b008
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8a6b008 pc=0x42e0e2f]

goroutine 12 [running]:
runtime.throw(0x4a487e4, 0x5)
	/usr/local/Cellar/go/1.10.2/libexec/src/runtime/panic.go:616 +0x81 fp=0xc4206eee00 sp=0xc4206eede0 pc=0x402d5b1
runtime.sigpanic()
	/usr/local/Cellar/go/1.10.2/libexec/src/runtime/signal_unix.go:395 +0x211 fp=0xc4206eee50 sp=0xc4206eee00 pc=0x4042de1
github.com/coreos/bbolt.(*Cursor).search(0xc4206eefe0, 0xc4206ef118, 0x6, 0x20, 0x63)
	.go/src/github.com/coreos/bbolt/cursor.go:255 +0x5f fp=0xc4206eef08 sp=0xc4206eee50 pc=0x42e0e2f
github.com/coreos/bbolt.(*Cursor).seek(0xc4206eefe0, 0xc4206ef118, 0x6, 0x20, 0x0, 0x0, 0x4063d84, 0x614e000, 0x0, 0x48d8300, ...)
	.go/src/github.com/coreos/bbolt/cursor.go:159 +0xa5 fp=0xc4206eef58 sp=0xc4206eef08 pc=0x42e0725
github.com/coreos/bbolt.(*Bucket).Bucket(0xc4204976d8, 0xc4206ef118, 0x6, 0x20, 0xc4206ef118)
	.go/src/github.com/coreos/bbolt/bucket.go:105 +0xde fp=0xc4206ef010 sp=0xc4206eef58 pc=0x42dc66e
github.com/coreos/bbolt.(*Tx).Bucket(0xc4204976c0, 0xc4206ef118, 0x6, 0x20, 0x6)
	.go/src/github.com/coreos/bbolt/tx.go:101 +0x4f fp=0xc4206ef048 sp=0xc4206ef010 pc=0x42ebbef

test.db.gz

tmm1 · 2018-06-27T01:14:22Z

I tried to use tx.Check() but it also blows up. Perhaps because I'm using ReadOnly: true?

unexpected fault address 0xaf41008
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0xaf41008 pc=0x42e6aa7]

goroutine 90 [running]:
runtime.throw(0x4a48764, 0x5)
	/usr/local/Cellar/go/1.10.2/libexec/src/runtime/panic.go:616 +0x81 fp=0xc4205e0be0 sp=0xc4205e0bc0 pc=0x402d2e1
runtime.sigpanic()
	/usr/local/Cellar/go/1.10.2/libexec/src/runtime/signal_unix.go:395 +0x211 fp=0xc4205e0c30 sp=0xc4205e0be0 pc=0x4042b11
github.com/coreos/bbolt.(*freelist).read(0xc4200bf500, 0xaf41000)
	.go/src/github.com/coreos/bbolt/freelist.go:236 +0x37 fp=0xc4205e0ce0 sp=0xc4205e0c30 pc=0x42e6aa7
github.com/coreos/bbolt.(*DB).loadFreelist.func1()
	.go/src/github.com/coreos/bbolt/db.go:290 +0x12b fp=0xc4205e0d30 sp=0xc4205e0ce0 pc=0x42ef22b
sync.(*Once).Do(0xc42032f050, 0xc420055d78)
	/usr/local/Cellar/go/1.10.2/libexec/src/sync/once.go:44 +0xbe fp=0xc4205e0d68 sp=0xc4205e0d30 pc=0x406379e
github.com/coreos/bbolt.(*DB).loadFreelist(0xc42032ef00)
	.go/src/github.com/coreos/bbolt/db.go:283 +0x4e fp=0xc4205e0d98 sp=0xc4205e0d68 pc=0x42e201e
github.com/coreos/bbolt.(*Tx).check(0xc420384380, 0xc42039a600)
	.go/src/github.com/coreos/bbolt/tx.go:399 +0x47 fp=0xc4205e0fd0 sp=0xc4205e0d98 pc=0x42ed2c7
runtime.goexit()
	/usr/local/Cellar/go/1.10.2/libexec/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc4205e0fd8 sp=0xc4205e0fd0 pc=0x405b871
created by github.com/coreos/bbolt.(*Tx).Check
	.go/src/github.com/coreos/bbolt/tx.go:393 +0x67

tmm1 · 2018-06-27T01:17:50Z

Without ReadOnly, Open() crashes right away on a different backup:

unexpected fault address 0x8bf2008
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8bf2008 pc=0x42e6aa7]

goroutine 79 [running]:
runtime.throw(0x4a48764, 0x5)
	/usr/local/Cellar/go/1.10.2/libexec/src/runtime/panic.go:616 +0x81 fp=0xc42047f0d8 sp=0xc42047f0b8 pc=0x402d2e1
runtime.sigpanic()
	/usr/local/Cellar/go/1.10.2/libexec/src/runtime/signal_unix.go:395 +0x211 fp=0xc42047f128 sp=0xc42047f0d8 pc=0x4042b11
github.com/coreos/bbolt.(*freelist).read(0xc4205cf320, 0x8bf2000)
	.go/src/github.com/coreos/bbolt/freelist.go:236 +0x37 fp=0xc42047f1d8 sp=0xc42047f128 pc=0x42e6aa7
github.com/coreos/bbolt.(*DB).loadFreelist.func1()
	.go/src/github.com/coreos/bbolt/db.go:290 +0x12b fp=0xc42047f228 sp=0xc42047f1d8 pc=0x42ef22b
sync.(*Once).Do(0xc42038d050, 0xc42047f270)
	/usr/local/Cellar/go/1.10.2/libexec/src/sync/once.go:44 +0xbe fp=0xc42047f260 sp=0xc42047f228 pc=0x406379e
github.com/coreos/bbolt.(*DB).loadFreelist(0xc42038cf00)
	.go/src/github.com/coreos/bbolt/db.go:283 +0x4e fp=0xc42047f290 sp=0xc42047f260 pc=0x42e201e
github.com/coreos/bbolt.Open(0xc4200edc20, 0x41, 0x180, 0xc42047f388, 0xc4206446b8, 0x0, 0x0)
	.go/src/github.com/coreos/bbolt/db.go:260 +0x38e fp=0xc42047f330 sp=0xc42047f290 pc=0x42e1c4e

test2.db.gz

tmm1 · 2018-06-27T01:28:12Z

Similar issue: boltdb/bolt#698

tmm1 · 2018-06-27T01:34:05Z

Here's my repro code:

func readBackup(file string) error {
	db, err := bolt.Open(file, 0600, &bolt.Options{Timeout: 1 * time.Second, ReadOnly: true})
	if err != nil {
		return err
	}
	defer db.Close()

	// View returns the closure's error, so propagate it instead of dropping it.
	return db.View(func(tx *bolt.Tx) error {
		if groups := tx.Bucket([]byte("groups")); groups != nil {
			num := groups.Stats().KeyN
			log.Printf("num: %v", num)
		}
		return nil
	})
}

Would be really nice if there was some way I could check to see if the backup was consistent before trying to read it. Ideally bbolt would be able to deal with truncated/corrupted files itself and not crash the entire process.

subbu05 · 2018-12-12T01:40:46Z

defer func() {
	if err := recover(); err != nil {
		fmt.Printf("Corrupted or invalid boltDB file\n")
	}
}()

Add code to recover.

benma · 2022-11-09T10:00:28Z

I am also running into the issue that Check() on a corrupt DB crashes. Check() should definitely return an error instead of panicking.

panics-on-check.db.zip

cc @serathius - I saw you recently committed to the repo - who to ping? Is this repo still maintained?

Edit: the address fault is a segmentation fault, not a panic, so this can't even be recovered with recover(). This seems to require a bugfix in this library, as it cannot really be worked around.

serathius · 2022-11-09T12:50:04Z

@benma The etcd project still has maintainers; however, we are very stretched with work on etcd. We can review PRs and fix bugs, but there is no active development on bbolt.

cenkalti · 2023-04-01T00:09:47Z

With https://pkg.go.dev/runtime/debug#SetPanicOnFault , segmentation faults can be turned into panics.

ahrtr · 2023-04-01T00:20:47Z

> Check() should definitely return an error instead of panicking.

Agreed.

Fixing corrupted db files is my top priority right now. The most important thing is to figure out how to reproduce the issue. It would be great if anyone could provide clues on this. Please do not hesitate to ping me if you have any thoughts. Thanks.

FYI: we recently added a bbolt surgery clear-page-elements command as a workaround to fix a corrupt db file; see #417.

ahrtr · 2023-05-19T06:47:18Z

> I am also running into the issue that Check() on a corrupt DB crashes. Check() should definitely return an error instead of panicking.
>
> panics-on-check.db.zip

The DB (panics-on-check.db) was somehow corrupted during the last transaction. The corrupted db can be easily fixed by reverting the meta page (which effectively rolls back the last transaction).

$ ./bbolt surgery revert-meta-page /tmp/panics-on-check.db --output ./new.db
The meta page is reverted.
$ ./bbolt check ./new.db 
OK

I am almost sure that the corruption isn't caused by bbolt. The db file has 6 pages in total, but the bucket's root page is somehow a huge value 7631988 (0x747474). Most likely it's caused by other issues, e.g. hardware or OS issue?

@benma Do you still remember how the corrupt file was generated? Was there anything unusual (e.g. power off, OS crash, etc.) when the corrupt file was being generated? BTW, what's the bbolt version?

$ ./bbolt  page /tmp/panics-on-check.db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=4>
Freelist:   <pgid=5>
HWM:        <pgid=6>
Txn ID:     2
Checksum:   eef96d7a2c1b336e

$ ./bbolt  page /tmp/panics-on-check.db 1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=3>
Freelist:   <pgid=2>
HWM:        <pgid=4>
Txn ID:     1
Checksum:   264c351a5179480f

$ ./bbolt  page /tmp/panics-on-check.db 4
Page ID:    4
Page Type:  leaf
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 1

"bucket": <pgid=7631988,seq=0>

ahrtr · 2023-05-19T09:20:44Z

test.db.gz

The corrupted file provided by @tmm1 seems like a potential bbolt bug. What's your bbolt version?

The freelist page (108) was somehow reset (all fields have zero value).

What's confusing is that the two meta pages have exactly the same Root (99), Freelist (108) and HWM (482). Meta 0 has TXN 64921, while meta 1 has TXN 64920, which indicates that the last RW transaction did not change anything. But the freelist should change anyway. (This is a potential improvement point: we shouldn't sync the freelist if the RW TXN changes nothing.)

$ ./bbolt page /tmp/test.db  0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=99>
Freelist:   <pgid=108>
HWM:        <pgid=482>
Txn ID:     64921
Checksum:   aab8d660770b88f7

$ ./bbolt page /tmp/test.db  1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=99>
Freelist:   <pgid=108>
HWM:        <pgid=482>
Txn ID:     64920
Checksum:   929bdcc802b6f642

ahrtr · 2023-05-26T08:01:43Z

test.db.gz

There isn't even a way to fix the corrupted db file. The file is only 204800 bytes, so it's 50 pages (204800/4096). Obviously the root page ID (99), Freelist (108) and HWM (482) all exceed the file size. I can't even find the root page in the available 50 pages. It seems that the file was somehow truncated, and the root was in the truncated part.

$ ls -lrt test.db
-rw-r--r-- 1 wachao wheel 204800 May 26 15:15 test.db
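The arithmetic above can be turned into a cheap truncation check before opening a backup. The following is only a sketch: pagesInFile and pgidInRange are hypothetical helpers, not bbolt APIs, and the numbers are the ones reported for test.db in this thread. It assumes page ids are zero-based and that HWM is one past the highest allocated page id, so HWM may equal the page count but not exceed it.

```go
package main

import "fmt"

// pagesInFile returns how many whole pages fit in a file of fileSize bytes.
func pagesInFile(fileSize, pageSize int64) int64 {
	return fileSize / pageSize
}

// pgidInRange reports whether page id pgid lies inside a file holding n
// pages. Page ids are assumed to be zero-based, so valid ids are 0..n-1.
func pgidInRange(pgid, n int64) bool {
	return pgid >= 0 && pgid < n
}

func main() {
	// Values taken from the ls and "bbolt page" output for test.db.
	n := pagesInFile(204800, 4096)
	fmt.Println("pages in file:", n)

	fmt.Println("root 99 in range:", pgidInRange(99, n))
	fmt.Println("freelist 108 in range:", pgidInRange(108, n))

	// Assumption: HWM is the next page id to allocate, so it must be <= n.
	fmt.Println("HWM 482 fits:", int64(482) <= n)
}
```

A check like this only detects truncation. It says nothing about pages that are present but contain garbage, as in the panics-on-check.db case.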

benma mentioned this issue May 18, 2023

Is being dependent on the Check method enough for detecting boltdb corruption? #174

Closed

cenkalti added type/bug area/corruption labels May 18, 2023

ahrtr mentioned this issue Jul 14, 2023

panic: invalid page type: 26: 10 #537

Closed

github-actions bot added the stale label May 11, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jun 1, 2024

ahrtr reopened this Jun 1, 2024

ahrtr removed the stale label Jun 1, 2024

github-actions bot added the stale label Aug 31, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 21, 2024
