Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bbolt tests in case of syscalls failures #382

Closed
4 tasks
ptabor opened this issue Jan 12, 2023 · 13 comments
Closed
4 tasks

bbolt tests in case of syscalls failures #382

ptabor opened this issue Jan 12, 2023 · 13 comments

Comments

@ptabor
Copy link
Contributor

ptabor commented Jan 12, 2023

We need to grow bbolt tests to cover following scenario:

  • Transaction rollbacks due to lack of disk space
  • Transaction rollbacks due to lack of ability to mmap (too large file ?)
  • Failing mlock (not enough memory available to the user ?)
  • Failing munmap

It's hard to test, it's not performed in qualification and can hit customer's production badly.

It requires some design:

  • https://pkg.go.dev/io/fs to consider, but likely insufficient (no mmap support)
  • custom (mocked) mmap/unmap/mlock impementation for tests might be good pairing to the above.
@ahrtr
Copy link
Member

ahrtr commented Jan 13, 2023

We can leverage the https://github.com/etcd-io/gofail to intentionally inject some failpoints to mimic some error we want.

Let me setup the initial implementation & demo, and then I might breakdown the tasks so that other contributors can jump in to help.

@CaojiamingAlan
Copy link
Contributor

@ptabor @ahrtr Hi, I am working on this issue. When I use gofail to let munmap return an error, db.close() will return and leave the file locked(because funlock is after munmap), resulting in any future attempts to open the file fails before the process ends. Is this something should be fixed? If the user continuously tries to open the file this might cause a error. I think we may change the order of the cleanups in close, or do not return immediately when encountering an error.

@ptabor
Copy link
Contributor Author

ptabor commented Feb 3, 2023

Yes - sounds like a real issue.
The reason for the test was to identify problems like this one. The DB should be unlocked such that the DB can get reopened and continue working.

@CaojiamingAlan
Copy link
Contributor

Another issue encountered:

When testing this:

Transaction rollbacks due to lack of ability to mmap (too large file ?)

In db.allocate if the file exceeds the original mmap size, it will do a re-map:

if err := db.munmap(); err != nil {
	return err
}
if err := mmap(db, size); err != nil {
	return err
}

However if the munmap succeeds and mmap fails, the db will have its data map set to nil, and any operation will fail afterwards.

I think we should make the re-map procedure atomic.

Should I start a new issue about this, and fix this issue first and then I can proceed to this one?

@ahrtr
Copy link
Member

ahrtr commented Feb 3, 2023

When I use gofail to let munmap return an error, db.close() will return and leave the file locked(because funlock is after munmap), resulting in any future attempts to open the file fails before the process ends

Good catch! Please deliver a PR to fix this issue.

However if the munmap succeeds and mmap fails, the db will have its data map set to nil, and any operation will fail afterwards.

This should be expected. The users should close the db and open it again in such situation. So please fix above issue firstly.

@Elbehery
Copy link
Member

Elbehery commented Apr 3, 2023

Hi,

Is this still open ? .. If so, I can take over ? 🤔

cc @ahrtr @ptabor @spzala

@ahrtr
Copy link
Member

ahrtr commented Apr 7, 2023

Thanks @Elbehery. I already added some failpoints, and will double check before I close this ticket.

@Elbehery
Copy link
Member

Elbehery commented Apr 7, 2023

I am trying to find something to work on :/

@cenkalti
Copy link
Member

@ahrtr Is this finished? What else need to be done? I can help if there is any work to be done.

@ahrtr
Copy link
Member

ahrtr commented May 25, 2023

We need to add one more failpoint and a test case to simulate lack of disk space,

bbolt/tx.go

Line 188 in 504e7be

if err := tx.db.grow(int(tx.meta.Pgid()+1) * tx.db.pageSize); err != nil {

@vianamjr
Copy link
Contributor

Hey @ahrtr I can work on this

@vianamjr
Copy link
Contributor

@ahrtr please let me know if there are more failpoints necessary.
I've started contributing to etcd and this kind of issue is good for understanding the project.
Also if there is some other that you guess it's good for me, let me know.

@ahrtr
Copy link
Member

ahrtr commented Jul 31, 2023

thx for your first contribution to bbolt.

I think we can close this ticket.

@ahrtr ahrtr closed this as completed Jul 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

6 participants