
[BUG] LiteDB is locking up in the DiskWriterQueue #2307

Open
dgodwin1175 opened this issue Apr 11, 2023 · 12 comments
dgodwin1175 commented Apr 11, 2023

Version
LiteDB: 5.0.16
.NET Framework: 4.6.2

Describe the bug
Requests to read or write a collection hang and never return when the database is being used by multiple threads. The only way to recover from this situation is to kill our process that is hosting the database.
E.g. LiteDatabase.GetCollection(CollectionName)

Note that this is becoming a critical issue for our application and our customers, so we would appreciate a prompt response if at all possible.

Steps to Reproduce
Our application has concurrent threads that are simultaneously reading and updating documents in the database. Our LiteDatabase object is a singleton, and we use locking on the object to ensure only one thread is updating the database at a time.
After some time, all of the threads block waiting on the database lock, which is held by a call to GetCollection() that never returns.
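To make the pattern concrete, here is a minimal sketch of the kind of access described above (the class name, thread count, and document contents are hypothetical; our real application is more involved):

```csharp
using System;
using System.Threading.Tasks;
using LiteDB;

// Hypothetical reproduction sketch: a singleton LiteDatabase shared by many
// threads, with an application-level lock so only one thread uses it at a time.
public static class ConcurrentAccessSketch
{
    private static readonly LiteDatabase Db = new LiteDatabase("app.db");
    private static readonly object DbLock = new object();

    public static void Main()
    {
        var workers = new Task[16];
        for (var i = 0; i < workers.Length; i++)
        {
            var workerId = i;
            workers[i] = Task.Run(() =>
            {
                for (var n = 0; n < 10000; n++)
                {
                    lock (DbLock)
                    {
                        // The hang is observed here: GetCollection() never returns,
                        // so the lock is never released and every other thread blocks.
                        var col = Db.GetCollection("items");
                        var id = workerId * 10000 + n;
                        col.Upsert(new BsonDocument { ["_id"] = id, ["value"] = Guid.NewGuid().ToString() });
                        col.FindById(id);
                    }
                }
            });
        }

        Task.WaitAll(workers);
    }
}
```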

I have downloaded the latest LiteDB code and added some logging to pinpoint where the problem is. It appears to be in the DiskWriterQueue, where the ExecuteQueue function is scheduled via Task.Run(ExecuteQueue) but ExecuteQueue never actually runs, so the queue is never processed. See the log snippets below (all of this logging is mine):

2023-04-11 11:56:58.054 [17] ERROR Got lock in DiskWriter.Run(), queue size is 30
2023-04-11 11:56:58.054 [17] ERROR Run(): oldValue is 0, _running is 1, task is null False
2023-04-11 11:56:58.054 [17] ERROR Creating new task to ExecuteQueue, queue size is 30
2023-04-11 11:56:58.054 [17] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.163 [41] ERROR Got lock in DiskWriter.Run(), queue size is 60
2023-04-11 11:56:58.163 [41] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.163 [41] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.195 [9] ERROR Got lock in DiskWriter.Run(), queue size is 62
2023-04-11 11:56:58.195 [9] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.195 [9] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.210 [38] ERROR Got lock in DiskWriter.Run(), queue size is 63
2023-04-11 11:56:58.226 [38] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.226 [38] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.226 [53] ERROR Got lock in DiskWriter.Run(), queue size is 64
2023-04-11 11:56:58.226 [53] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.226 [53] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.242 [32] ERROR Got lock in DiskWriter.Run(), queue size is 65
2023-04-11 11:56:58.242 [32] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.242 [32] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.242 [44] ERROR Got lock in DiskWriter.Run(), queue size is 66
2023-04-11 11:56:58.242 [44] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.242 [44] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.242 [58] ERROR Got lock in DiskWriter.Run(), queue size is 67
2023-04-11 11:56:58.257 [58] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.257 [58] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.257 [39] ERROR Got lock in DiskWriter.Run(), queue size is 69
2023-04-11 11:56:58.257 [39] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.257 [39] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.273 [4] ERROR Got lock in DiskWriter.Run(), queue size is 70
2023-04-11 11:56:58.273 [4] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.273 [4] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.288 [54] ERROR Got lock in DiskWriter.Run(), queue size is 71
2023-04-11 11:56:58.288 [54] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.288 [54] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.288 [62] ERROR Got lock in DiskWriter.Run(), queue size is 72
2023-04-11 11:56:58.304 [62] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.304 [62] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.320 [40] ERROR Got lock in DiskWriter.Run(), queue size is 73
2023-04-11 11:56:58.320 [62] ERROR ExecuteQueue, queue size is 73
2023-04-11 11:56:58.320 [40] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.351 [40] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.429 [57] ERROR Got lock in DiskWriter.Run(), queue size is 4
2023-04-11 11:56:58.429 [57] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.429 [57] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.476 [62] ERROR ExecuteQueue, page is null, queue size 0
2023-04-11 11:56:58.476 [62] ERROR Queue is empty, exiting ExecuteQueue
2023-04-11 11:56:58.714 [54] ERROR Got lock in DiskWriter.Run(), queue size is 30
2023-04-11 11:56:58.714 [54] ERROR Run(): oldValue is 0, _running is 1, task is null False
2023-04-11 11:56:58.714 [54] ERROR Creating new task to ExecuteQueue, queue size is 30
2023-04-11 11:56:58.714 [54] ERROR Releasing lock in DiskWriter.Run()

Note that after this, ExecuteQueue never runs, so the queue is never emptied and keeps growing:

2023-04-11 11:56:59.403 [40] ERROR Got lock in DiskWriter.Run(), queue size is 60
2023-04-11 11:56:59.403 [40] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:59.403 [40] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:59.757 [39] ERROR Got lock in DiskWriter.Run(), queue size is 90
2023-04-11 11:56:59.772 [39] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:59.772 [39] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:57:01.401 [55] ERROR Got lock in DiskWriter.Run(), queue size is 120
2023-04-11 11:57:01.401 [55] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:57:01.401 [55] ERROR Releasing lock in DiskWriter.Run()

If anything hits the DiskWriterQueue.Wait() function after this, it blocks forever, waiting on _task to complete and empty the queue, which never happens.

The only way to recover from this is to restart our process.

Expected behavior
No locked-up threads; concurrent reads and writes should complete.

Additional context
Note that we tried removing all locking on our side, but this resulted in us hitting "Lock timeout on transaction after 01:00" exceptions, followed by "Maximum number of transactions reached" exceptions, which I assume are just a different manifestation of the same problem.

Note that if we repeat the test with version 4.1.4 of LiteDB, we do not see this issue.

@m-omerovic

We have run into this issue as well and have been able to reproduce it with a unit test. The root cause is this check here.
https://github.com/mbdavid/LiteDB/blob/master/LiteDB/Engine/Disk/DiskWriterQueue.cs#L95
If this condition is hit, then the _running flag is never set back to 0 and no new ExecuteQueue task is ever run.

So when does this happen? It happens when thread T1 calls EnqueuePage while T2 is running ExecuteQueue, and T2 drains the last queued item exactly while T1 is inside Run, between these lines: https://github.com/mbdavid/LiteDB/blob/master/LiteDB/Engine/Disk/DiskWriterQueue.cs#L59-L61. T2 sets _running back to 0 just before T1 changes it, so T1 starts a new ExecuteQueue task while the queue is empty; that task hits the early-exit check above without resetting _running, and everything is dead after that.

The fix is to get rid of that if condition, but I want to see if a simpler reproduction test can be written and will then make a change request.
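To illustrate the race and the shape of the fix, here is a simplified sketch of the schedule-one-worker pattern (this is not LiteDB's actual code; the names only mirror the ones discussed in this thread):

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Simplified sketch of the schedule-one-worker pattern discussed above.
// Not LiteDB's actual code; names mirror the ones used in this thread.
public class DiskWriterQueueSketch
{
    private readonly ConcurrentQueue<int> _queue = new ConcurrentQueue<int>();
    private int _running; // 0 = idle, 1 = a drain task is (believed to be) running

    public void EnqueuePage(int page)
    {
        _queue.Enqueue(page);

        // Only schedule a drain task if none is marked as running.
        if (Interlocked.CompareExchange(ref _running, 1, 0) == 0)
        {
            Task.Run(ExecuteQueue);
        }
    }

    private void ExecuteQueue()
    {
        // Buggy variant: an early "if the queue is empty, just return" here leaves
        // _running stuck at 1 when the previous drain task emptied the queue right
        // before this task started -- the race described above.
        while (true)
        {
            while (_queue.TryDequeue(out var page))
            {
                WritePage(page); // stand-in for the real disk write
            }

            // Fix: always hand the flag back before exiting, then re-check the queue
            // so an item enqueued during the hand-back window is not stranded.
            Volatile.Write(ref _running, 0);

            if (_queue.IsEmpty) return;
            if (Interlocked.CompareExchange(ref _running, 1, 0) != 0) return;
        }
    }

    private static void WritePage(int page)
    {
        // omitted in this sketch
    }
}
```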

m-omerovic pushed a commit to m-omerovic/LiteDB that referenced this issue Apr 26, 2023
Remove the if-queue-empty check that does not reset the running flag. If the queue was already emptied out by the previous ExecuteQueue, this way it will always reset the running flag back to 0.
@m-omerovic

This is my reproduction: master...m-omerovic:LiteDB:master-DiskWriterQueue-multi-threading-issue
This is the PR for the fix: #2317

@dgodwin1175
Author

@m-omerovic, I'm not sure if this is the same issue as we've been seeing, but we'll test it out to see if your fix resolves it. Thanks.

@m-omerovic

I have a strong suspicion that it is, please let me know if it does not solve your problem.

@Ido00

Ido00 commented Aug 6, 2023

Hopefully it's ok to write here without much information.

I suspect I am having the same issue. Everything works on 4.1.4 (like OP wrote). However, the issue still persists on 5.0.17. I've added the LiteDB code to my solution and use it directly. When I run it in debug mode, there is no issue. When I run it in release mode, however, the issue occurs again.

Since I can't reproduce the bug in debug mode, I will try to reproduce/debug using prints and update here. Hopefully the information I have provided (debug vs. release) can help on its own.

@Ido00

Ido00 commented Aug 6, 2023

I'll elaborate a bit on what I'm doing. I'm writing a server that concurrently receives messages, inserts them into the DB, then forwards them to their destination. For every client, an async task is launched to handle that client's connection and receive its messages. LiteDB is used as a singleton. I have a test of 100 clients that concurrently send messages (through the server) to each other.
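To make that concrete, the pattern is roughly the following (the class and method names are made up; the real server does more work per message):

```csharp
using System.Linq;
using System.Threading.Tasks;
using LiteDB;

// Rough sketch of the setup described above: one task per client,
// each inserting received messages into a shared LiteDatabase singleton.
public class MessageStoreSketch
{
    private readonly LiteDatabase _db = new LiteDatabase("messages.db");

    public Task SimulateClientsAsync(int clientCount, int messagesPerClient)
    {
        var clients = Enumerable.Range(0, clientCount).Select(clientId => Task.Run(() =>
        {
            var col = _db.GetCollection("messages");
            for (var n = 0; n < messagesPerClient; n++)
            {
                // On the 5.0.x release build the inserts eventually stop making progress here.
                col.Insert(new BsonDocument { ["from"] = clientId, ["seq"] = n });
            }
        }));

        return Task.WhenAll(clients);
    }
}
```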

On 4.1.4 (and on 5.0.17 in debug mode) there is no issue; the test passes quickly and successfully.

However, on the 5.0.17 release build the issue is as follows: after around 1000-2000 messages (roughly 1-2 seconds), every time this line is reached, the ReaderWriterLockSlim _transaction shows a larger _transaction.WaitingReadCount. That is, more and more tasks are waiting on the reader lock for some reason. (I reach this point via LiteDB.Engine: LiteEngine.Insert -> LiteEngine.AutoTransaction -> TransactionMonitor.GetTransaction -> LockService.EnterTransaction.)

If I add a Console.WriteLine(...) print anywhere in this Run function, the issue does not occur, so I assume the problem happens somewhere in this function. However, I am not sure how to investigate further, since I can't use the debugger and can't use prints.

@dgodwin1175
Author

I have a strong suspicion that it is, please let me know if it does not solve your problem.

Unfortunately, we have tested the latest code together with your PR changes, and the problem persists. For now we have had to revert back to 4.1.4.

@m-omerovic

@Ido00 @dgodwin1175 That's a bummer. I will try to run my original reproduction test on the new release and see what happens. If you guys have any reproduction code, then please share.
I guess debug build and console output will alter the timing and probably decrease the chance of the issue occurring.

@dgodwin1175
Author

@Ido00 @dgodwin1175 That's a bummer. I will try to run my original reproduction test on the new release and see what happens. If you guys have any reproduction code, then please share. I guess debug build and console output will alter the timing and probably decrease the chance of the issue occurring.

@m-omerovic, we don't have any test code that we can share at this stage; it may require some effort to write a test harness that replicates what our application is doing. If we get time to do this, we will.
Note that when reproducing on our side, it doesn't make a difference whether it's a debug or release build, and adding all of the additional logging from my original post didn't affect our ability to reproduce either.

@tjmoore

tjmoore commented Aug 22, 2023

Is this reproducible with .NET Core (.NET 6, for example)?

Also, is there any difference when creating LiteDatabase instances at the point of use (and disposing them afterwards) instead of using a singleton?
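For reference, the alternative I'm describing would look roughly like this (the connection string and collection name are only illustrative; with short-lived instances the shared connection mode would typically be used):

```csharp
using LiteDB;

// Sketch of the "create at point of use" alternative: open, use, and dispose
// a LiteDatabase per operation instead of holding a process-wide singleton.
public static class PerUseSketch
{
    public static int CountItems()
    {
        using (var db = new LiteDatabase("Filename=app.db;Connection=shared"))
        {
            return db.GetCollection("items").Count();
        }
    }
}
```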

@tjmoore

tjmoore commented Oct 23, 2023

When this occurs, are there any errors, or is it just hung?

In particular are there any "Access denied" or "File in use by another process" exceptions occurring just before this happens?

I've been working on customer issues for some time where we see these errors and then something locks up. I've removed all possible bottlenecks in our own code, but something still locks up. Just wondering if it's the same issue.

Oddly, this only occurs on certain customer servers. It is always Windows Server, though the version varies, and so far a common factor is that MsSense is active on these servers and accessing the database files, but I can't prove it has anything to do with the issue.

@tjmoore

tjmoore commented Oct 24, 2023

I tried 5.0.17 and it doesn't resolve the issue I'm seeing.

Using Process Monitor, I do see a few DELETE PENDING entries on the '-log.db' file just before it locks up. In normal operation I don't see that.

From a process dump I can see it's blocked on the mutex in the private OpenDatabase method in SharedEngine. It seems a similar issue has been found with that before: #2135
