Add protection when `mmap` somehow fails #362

ahrtr · 2022-12-21T02:40:40Z

Provent bbolt from panicking in case #262

In (*DB) mmap, it munmap the file firstly, then mmap it again. If the munmap somehow fails, then the db.data is reset to nil. In this case, there is no need to execute rollback.

Signed-off-by: Benjamin Wang wachao@vmware.com

ptabor · 2022-12-21T11:58:20Z

Thank you @ahrtr. This seems alligned with bbolt_unix.go so LGTM.
Seems we should be running our tests on windows as well (#364).

I might missing something but in general the db.data=nil path looks scary to me:

db.meta1, db.meta2 pages keep pointing on some page allocated within the db.data variable...
seems golang will not free the memory (https://groups.google.com/g/golang-nuts/c/yNis7bQG_rY/m/yaJFoSx1hgIJ) but we have state that is inconsistent with some assumptions.
We should have a single method that closes 'data' and all cleans or state depending on 'data' (like meta).
What happens if we post such rollback, will try to beging another transaction (e.g. db.Update(...)).
From quick glance at the code I don't see any protection... and I expect panic to be just delayed till the next transaction attempt.

boreq · 2022-12-22T14:01:44Z

I am on "unix" and since mmap in bolt_unix.go doesn't clear db.data on error something like this would be needed to make this fix work for me:

diff --git a/db.go b/db.go
index a798c39..93a09aa 100644
--- a/db.go
+++ b/db.go
@@ -373,6 +373,7 @@ func (db *DB) mmap(minsz int) error {
 
 	// Memory-map the data file as a byte slice.
 	if err := mmap(db, size); err != nil {
+		db.data = nil
 		return err
 	}

I think it is a better approach as it works for all platforms. That being said there are more edge cases where this can break things. I don't have the time to write a test for this right now I am afraid though.

goroutine 9 [running]:
runtime.throw({0x105c574ab?, 0x1057140d8?})
	runtime/panic.go:1047 +0x40 fp=0x1301c22f0 sp=0x1301c22c0 pc=0x10573d660
runtime.sigpanic()
	runtime/signal_unix.go:842 +0x1a0 fp=0x1301c2320 sp=0x1301c22f0 pc=0x1057543e0
go.etcd.io/bbolt.(*DB).meta(0x1082e8108?)
	go.etcd.io/bbolt@v1.3.6/db.go:949 +0x28 fp=0x1301c2380 sp=0x1301c2330 pc=0x10598e248
go.etcd.io/bbolt.(*Tx).init(0x1301b81c0, 0x1300ccfc0)
	go.etcd.io/bbolt@v1.3.6/tx.go:50 +0xa0 fp=0x1301c23f0 sp=0x1301c2380 pc=0x1059956f0
go.etcd.io/bbolt.(*DB).beginTx(0x1300ccfc0)
	go.etcd.io/bbolt@v1.3.6/db.go:615 +0x1cc fp=0x1301c2480 sp=0x1301c23f0 pc=0x10598c82c
go.etcd.io/bbolt.(*DB).Begin(0x1301c2568?, 0x6c?)
	go.etcd.io/bbolt@v1.3.6/db.go:592 +0x30 fp=0x1301c24a0 sp=0x1301c2480 pc=0x10598c630
go.etcd.io/bbolt.(*DB).View(0x1301c2548?, 0x1301c2578)
	go.etcd.io/bbolt@v1.3.6/db.go:757 +0x30 fp=0x1301c2510 sp=0x1301c24a0 pc=0x10598d3f0
...irrelevant stuff

To sum this up I don't think this fix is as simple as it seems.

ahrtr · 2023-01-10T07:22:44Z

Updated. PTAL @ptabor @boreq

ahrtr · 2023-01-11T07:55:30Z

This PR should can fix the following issues:

ptabor · 2023-01-11T10:59:34Z

db.go

@@ -482,6 +482,14 @@ func (db *DB) mmap(minsz int) error {

 // munmap unmaps the data file from memory.
 func (db *DB) munmap() error {
+	defer func() {
+		db.dataref = nil


I would consider moving it to explicit db.invalidateUnsafe() method
(and comment it should be execute under the exclusive mmapLock and metaLock)

Updated (see below). I prefer not to have the suffix "Unsafe", and not to add the comment something like it should be execute under the exclusive mmapLock and metaLock as well. BoltDB has well-designed lock mechanism (see below), we shouldn't worry about the lock on each single internal method/function.

For write transaction, it requires the rwlock at db.rwlock.Lock(), and require the mmaplock when it needs to allocate more space accordingly needs to remap the db (of course release the lock when it finishes mapping), see db.mmaplock.Lock().

For readonly transaction, it only requires the mmaplock at db.mmaplock.RLock(), to prevent the db file from being remapped by anther write transaction.

The new added db.invalidateUnsafe() is only called in db.munmap(), so it's safe.

func (db *DB) invalidate() { db.dataref = nil db.data = nil db.datasz = 0 db.meta0 = nil db.meta1 = nil }

There is etcd convention (that I though is even golang convention, but I haven't found any signs) to call
methods 'UnsafeFoo' when the caller of such method needs to take care of proper locking before calling such method.

$ git grep Unsafe | grep func | nl

shows 93 cases.

I do think that the call of 'invalidate' is safe in that context. I just want to avoid situation that someone will call 'invalidate' from context where the locks are not taken... because the person haven't read carefully that the body of the method interacts with the db.data and db.meta and these fields must be mutated under the exclusive lock.

I'm not insisting on the "Unsafe" convention... but I don't think comment costs as here.

Yes, etcd has the convention (UnsafeFoo vs Foo), but there is NO such convention in bbolt for now. If we change it to invalidateUnsafe, then it isn't consistent with other internal methods.

So I'd like to keep it as it's for now.

Of course, please feel free to raise a separate ticket if you do think we should follow the same convention as etcd.

db_test.go

ptabor · 2023-01-11T12:53:43Z

db.go

+
+		db.meta0 = nil
+		db.meta1 = nil
+	}()
 	if err := munmap(db); err != nil {
 		return fmt.Errorf("unmap error: " + err.Error())


It's a topic for a separate PR... but I think we should start logging Errors early from the failed syscalls...

There is high probability they will lead to panic later (either on bbolt side, but most likely on customer'a application side) and there is significant probability that customer will not log the error themselves...
And this might lead to hard to diagnose problems of "etcd not working" while there was OOM or no disk space.

Failed syscalls is something that shoud not happen on well configured machines, so it should not make the app logs spammy.

Will think about this separately.

Signed-off-by: Benjamin Wang <wachao@vmware.com>

ptabor

Thank you for this.
Reported #382 to cover it with better (or any) integration testing.

ahrtr · 2023-01-12T22:44:58Z

Thanks @ptabor for the review.

ahrtr mentioned this pull request Dec 21, 2022

panic (unexpected fault address) on db.meta when rollback #262

Closed

ahrtr marked this pull request as draft December 23, 2022 23:34

ahrtr marked this pull request as ready for review January 10, 2023 07:16

ahrtr changed the title ~~skip rollback if db or db.data is nil~~ Add protection when mmap somehow fails Jan 10, 2023

ahrtr requested a review from ptabor January 10, 2023 07:22

ahrtr added this to the 1.3.7 milestone Jan 11, 2023

ptabor approved these changes Jan 11, 2023

View reviewed changes

ahrtr added 2 commits January 12, 2023 07:17

cleanup data and dataaz when unmapping db on Windows platform

63d0cb4

Signed-off-by: Benjamin Wang <wachao@vmware.com>

add protection when mmap somehow fails

eabffad

Signed-off-by: Benjamin Wang <wachao@vmware.com>

ptabor approved these changes Jan 12, 2023

View reviewed changes

ahrtr merged commit ece0e3d into etcd-io:master Jan 12, 2023

This was referenced Jan 13, 2023

SEGV during (*Tx).rollback #188

Closed

fatal error: fault (segmentation violation) during Commit #202

Closed

ahrtr mentioned this pull request Jan 29, 2023

fatal error when rolling back transaction with NoFreelistSync #183

Closed

hsjoberg mentioned this pull request Jun 28, 2023

kvdb: Update bbolt to 1.3.7 in order to address mmap issues lightningnetwork/lnd#7796

Merged

ahrtr mentioned this pull request Dec 28, 2023

Crash when writing data to database #653

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add protection when `mmap` somehow fails #362

Add protection when `mmap` somehow fails #362

ahrtr commented Dec 21, 2022

ptabor commented Dec 21, 2022

boreq commented Dec 22, 2022

ahrtr commented Jan 10, 2023

ahrtr commented Jan 11, 2023 •

edited

Loading

ptabor Jan 11, 2023

ahrtr Jan 12, 2023 •

edited

Loading

ptabor Jan 12, 2023

ahrtr Jan 12, 2023

ptabor Jan 11, 2023

ahrtr Jan 12, 2023

ptabor left a comment

ahrtr commented Jan 12, 2023

Add protection when mmap somehow fails #362

Add protection when mmap somehow fails #362

Conversation

ahrtr commented Dec 21, 2022

ptabor commented Dec 21, 2022

boreq commented Dec 22, 2022

ahrtr commented Jan 10, 2023

ahrtr commented Jan 11, 2023 • edited Loading

ptabor Jan 11, 2023

Choose a reason for hiding this comment

ahrtr Jan 12, 2023 • edited Loading

Choose a reason for hiding this comment

ptabor Jan 12, 2023

Choose a reason for hiding this comment

ahrtr Jan 12, 2023

Choose a reason for hiding this comment

ptabor Jan 11, 2023

Choose a reason for hiding this comment

ahrtr Jan 12, 2023

Choose a reason for hiding this comment

ptabor left a comment

Choose a reason for hiding this comment

ahrtr commented Jan 12, 2023

Add protection when `mmap` somehow fails #362

Add protection when `mmap` somehow fails #362

ahrtr commented Jan 11, 2023 •

edited

Loading

ahrtr Jan 12, 2023 •

edited

Loading