
upon restart after crash 100% cpu is being used #908

Closed
dwoske opened this issue Jun 28, 2019 · 7 comments · Fixed by #920
Labels
area/performance Performance related issues. kind/bug Something is broken. priority/P1 Serious issue that requires eventual attention (can wait a bit) status/accepted We accept to investigate or work on it.

dwoske commented Jun 28, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12.4 darwin/amd64

What version of Badger are you using?

github.com/dgraph-io/badger v2.0.0-rc.2.0.20190626232749-b116882676f2+incompatible

Does this issue reproduce with the latest master?

Yes, as far as I can tell. I did an update a few days ago.

What are the hardware specifications of the machine (RAM, OS, Disk)?

MacBook Pro (13-inch, 2016, Four Thunderbolt 3 Ports)
2.9 GHz Intel Core i5
16 GB 2133 MHz LPDDR3
500 GB Flash Storage

What did you do?

I'm adding a web-api interface on top of badger, so there is an http listener and a program which does not stop running, except in event of a crash.

I open a Badger database and commit some keys and values. Commits work fine, and keys and values can be read from the DB as expected.

If I ctrl-c out of the program without calling db.Close(), then upon restart the CPU is 100% utilized after db.OpenManaged(). Badger is doing something in the background that uses all the CPU.

As a test, if I do a db.Close() before exiting the program, cpu usage does not peg like that.

However, I can't ensure that db.Close() will always be called - as my machine or process might crash. I never Close the DB during normal operations of the service.

The pprof during this is :

      flat  flat%   sum%        cum   cum%
    15.62s 46.57% 46.57%     33.52s 99.94%  github.com/dgraph-io/badger/y.(*WaterMark).process.func1
     9.04s 26.95% 73.52%      9.72s 28.98%  runtime.mapaccess1_fast64
     7.15s 21.32% 94.84%      8.18s 24.39%  runtime.mapdelete_fast64
     1.71s  5.10% 99.94%      1.71s  5.10%  runtime.newstack
         0     0% 99.94%     33.52s 99.94%  github.com/dgraph-io/badger/y.(*WaterMark).process

I've let this run for several minutes and the CPU does not drop down to normal. There can be any (small) number of key/values in the database.

After the restart, Badger appears to be working fine, I can insert/read entries from the database, but the CPU is pegged.

What did you expect to see?

I expect to see Badger recover somehow (quickly) after a process crash, and not peg the CPU utilization.

What did you see instead?

CPU usage pegged at 100%+


Is there something I can call immediately after Commit which puts the DB in a "good" state, so that if the process exits unexpectedly without calling DB.Close(), a restart won't peg the CPU? I tried calling Sync() directly after each commit, without any effect.

@campoy campoy added area/performance Performance related issues. kind/bug Something is broken. status/accepted We accept to investigate or work on it. labels Jun 28, 2019

campoy commented Jun 28, 2019

Thanks for the report, @dwoske

We'll investigate the issue soon, but in the meantime, I wanted to propose a quick fix by catching ctrl+c using the signal package. That should allow you to make sure any deferred statements are called before leaving the program.

I know that doesn't guarantee that the program won't ever crash without closing the database, so we will still make sure the bug is fixed too.


jarifibrahim commented Jul 1, 2019

Looks like waterMark.process is taking up too much CPU even though it shouldn't be doing much work in managed mode. Let me dig deeper and get back.

@jarifibrahim

@dwoske Can you share your badger directory? I would like to understand why watermark.process was utilizing 100% CPU when it shouldn't be doing much work when the DB is opened in managed mode.


dwoske commented Jul 2, 2019

cpupeg.tar.gz

This has a pprof dump and my badger directory. I open the DB like this:

	opts := badger.DefaultOptions("/Users/dwoskdo1/emma")
	opts.NumVersionsToKeep = math.MaxInt32
	db, err := badger.OpenManaged(opts)
	if err != nil {
		log.Fatal(err)
	}

I have also seen a problem where, if the process exits without calling DB.Close(), then upon restart, opening the database and calling DB.Close() never returns. This only happens after a process exit that skipped DB.Close(); open/close cycles all work fine when they happen as intended during a single process lifetime. Only when a close is missed do subsequent closes never return.


campoy commented Jul 2, 2019

Hi @dwoske, that does seem like a different issue.

I wrote a little program that gives the option to crash without closing the database, and everything worked correctly with v1.6.0.

Could you confirm your problem still occurs with that one and, if so, file a new issue?

Thanks!


dwoske commented Jul 3, 2019

I do still see the problem with 1.6 .. I've published my little test program here : https://github.com/dwoske/btest with the pprof.

The behavior I saw with 1.6, which I didn't notice with the current master, was this: the first time I stopped the program without a close, the CPU pegged for about 30 seconds. I ran it again without cleaning my badger directory, and it pegged for what seemed a little longer, but the CPU still dropped back to normal (at this point I was thinking 1.6 was fine). The third time, the CPU pegged and hadn't dropped after about 5 minutes, so I stopped the program; that is the profile currently in my repo. So I think the problem exists (for me) in 1.6 as well.

I'll file a new issue. I wonder if this is something with my particular laptop configuration. I can test this tonight when I get home on a different Mac.

@campoy campoy added the priority/P1 Serious issue that requires eventual attention (can wait a bit) label Jul 3, 2019
@campoy campoy added this to the v2.0.0 milestone Jul 3, 2019

dwoske commented Jul 4, 2019

FWIW, this did happen on a different Mac. Same kind of behavior, the first couple of times the CPU dropped to normal after some 30+ seconds, and the third time it stayed pegged for several minutes before I stopped the process. Sounds like you may have recreated this as well since the issue is accepted. Thanks.

manishrjain added a commit that referenced this issue Jul 4, 2019
When doneUntil is very large, process loop in watermark can end up in a busy-wait loop. Fix that by choosing to iterate either over integer range or over waiters, depending upon whichever produces a shorter loop.

Fixes #908 .

Changes:
* When doneUntil is very large, process loop in watermark can end up in a busy-wait loop. Fix that.
* Only call notifyAndRemove if a waiter is present.
* Move updating of w.doneUntil before notifying waiters.
jarifibrahim pushed a commit that referenced this issue Mar 12, 2020

(cherry picked from commit 1f8c43a)