[BUG] LiteDB is locking up in the DiskWriterQueue #2307
Comments
We have run into this issue as well and have been able to reproduce it with a unit test. The root cause is this check here. So when does this happen? It happens if thread T1 calls EnqueuePage while T2 is running ExecuteQueue, and T2 processes the queued item exactly while T1 is inside Run between these lines: https://github.com/mbdavid/LiteDB/blob/master/LiteDB/Engine/Disk/DiskWriterQueue.cs#L59-L61. T2 sets _running to 0 just before T1 tries to change it, so T1 starts a new ExecuteQueue task while the queue is already empty, and everything is dead after that. The fix is to get rid of that if condition, but I want to see whether a simpler reproduction test can be written first; I will then open a pull request.
Remove the "if queue empty" check that does not reset the running flag. If the queue was already emptied out by the previous ExecuteQueue, this way the running flag is always reset back to 0.
This is my reproduction: master...m-omerovic:LiteDB:master-DiskWriterQueue-multi-threading-issue
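To make the race easier to follow, here is a heavily condensed sketch of the mechanism described above. It is not the actual LiteDB source (the locking and the queue-count check in the real Run() are omitted, and all names are simplified); it only illustrates how an ExecuteQueue that returns early on an empty queue, without resetting the running flag, leaves the writer permanently stalled:

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Simplified illustration of the race described above -- NOT the real
// DiskWriterQueue, just the shape of the bug.
class SimplifiedWriterQueue
{
    private readonly ConcurrentQueue<int> _queue = new ConcurrentQueue<int>();
    private int _running;   // 1 while an ExecuteQueue task is (believed to be) active
    private Task _task;

    public void EnqueuePage(int page) => _queue.Enqueue(page);

    public void Run()
    {
        // T1 lands here just after T2 (inside ExecuteQueue) has drained the
        // queue -- including the page T1 just enqueued -- and reset _running
        // to 0. The CAS succeeds (0 -> 1) and T1 starts a new ExecuteQueue.
        if (Interlocked.CompareExchange(ref _running, 1, 0) == 0)
        {
            _task = Task.Run(ExecuteQueue);
        }
    }

    private void ExecuteQueue()
    {
        // The new task finds the queue already empty and returns WITHOUT
        // resetting _running. From now on every Run() sees _running == 1 and
        // never starts another task, so the queue only grows and Wait() hangs.
        if (_queue.IsEmpty) return;          // <-- the check the fix removes

        while (_queue.TryDequeue(out var page))
        {
            // write page to disk...
        }

        Interlocked.Exchange(ref _running, 0);   // normal exit resets the flag
    }
}
```

Removing the early return so that every exit path resets the flag, which is what the PR description above says, closes that window.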
@m-omerovic, I'm not sure if this is the same issue as the one we've been seeing, but we'll test it out to see if your fix resolves it. Thanks.
I have a strong suspicion that it is; please let me know if it does not solve your problem.
Hopefully it's ok to write here without much information. I suspect I am having the same issue. Everything works on 4.1.4 (like OP wrote); however, the issue still persists on 5.0.17. I've added the LiteDB code to my solution and use it directly. When I run it in debug mode, there is no issue; when I run it in release mode, however, the issue occurs again. Since I can't reproduce the bug in debug mode, I will try to reproduce/debug using prints and update here. But anyway, hopefully the information I provided (debug vs. release) can help on its own.
I'll elaborate a bit on what I'm doing. I'm writing a server that concurrently receives messages, inserts them into the DB, and then forwards them to their destination. For every client, an async task is launched to handle that client's connection and receive its messages. LiteDB is used as a singleton. I have a test in which 100 clients concurrently send a message (through the server) to each other. On 4.1.4 (and on 5.0.17 in debug) there is no issue; the test passes quickly and successfully. On 5.0.17 in release, however, the issue is as follows: after around 1000-2000 messages (roughly 1-2 seconds), every time this line is reached, the ReaderWriterLockSlim _transaction shows a larger _transaction.WaitingReadCount. That is, more and more tasks are waiting on the reader lock for some reason. (I reach this place from LiteDB.Engine: LiteEngine.Insert -> LiteEngine.AutoTransaction -> TransactionMonitor.GetTransaction -> LockService.EnterTransaction.) If I add a Console.WriteLine(...) print anywhere in this Run function, the issue does not occur, so I assume the problem is somewhere in that function; however, I am not sure how to investigate further, since I can't use the debugger and can't use prints.
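A minimal sketch of the load pattern described in that comment, in case it helps others reproduce. This is not the actual server code; the Message type, file name, and message counts are assumptions:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using LiteDB;

// Illustrative only: a singleton LiteDatabase with many tasks inserting
// concurrently, roughly matching the scenario described above.
public class Message
{
    public int Id { get; set; }
    public string Body { get; set; }
}

public static class ConcurrentInsertDemo
{
    public static async Task Main()
    {
        using var db = new LiteDatabase("messages.db");        // shared instance
        var messages = db.GetCollection<Message>("messages");

        // 100 "clients", each inserting messages in parallel. In the report
        // above the hang shows up after roughly 1000-2000 inserts.
        var clients = Enumerable.Range(0, 100).Select(c => Task.Run(() =>
        {
            for (var i = 0; i < 100; i++)
            {
                messages.Insert(new Message { Body = $"client {c}, message {i}" });
            }
        }));

        await Task.WhenAll(clients);
        Console.WriteLine($"Inserted {messages.Count()} documents");
    }
}
```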
Unfortunately, we have tested the latest code together with your PR changes, and the problem persists. For now we have had to revert to 4.1.4.
@Ido00 @dgodwin1175 That's a bummer. I will try to run my original reproduction test on the new release and see what happens. If you have any reproduction code, please share it.
@m-omerovic, we don't have any test code that we can share at this stage; it may require some effort to write a test harness that replicates what our application is doing. If we get time to do this, we will.
Is this reproducible with .NET Core (.NET 6, for example)? Also, is there any difference when creating LiteDatabase instances at the point of use (and disposing of them afterwards) instead of using a singleton?
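For clarity, the two patterns being compared in that question look roughly like this (an illustrative sketch only; the connection strings and types are placeholders):

```csharp
using LiteDB;

// Pattern A: one long-lived (singleton) LiteDatabase shared by every thread.
public static class SingletonDb
{
    public static readonly LiteDatabase Instance = new LiteDatabase("app.db");
}

// Pattern B: open at the point of use and dispose immediately afterwards.
// With the default direct connection only one open instance can hold the file,
// so this pattern is usually combined with a shared connection string such as
// "Filename=app.db;Connection=shared".
public static class PerUseDb
{
    public static void Insert<T>(string collection, T document)
    {
        using (var db = new LiteDatabase("Filename=app.db;Connection=shared"))
        {
            db.GetCollection<T>(collection).Insert(document);
        }
    }
}
```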
When this occurs, are there any errors, or is it just hung? In particular, are there any "Access denied" or "File in use by another process" exceptions occurring just before this happens? I've been working for some time on customer issues where we see these errors and then something locks up. I've removed all possible bottlenecks in our own code, but still something is locking up, so I'm wondering if it's the same issue. Oddly, this only occurs on certain customer servers. It is always Windows Server, though the version varies; so far a common factor is that MsSense is active on these servers and accessing the database files, but I can't prove it has anything to do with this.
I tried 5.0.17 and it doesn't resolve the issue I'm seeing. Using Process Monitor, I do see a few DELETE PENDING entries on the '-log.db' file just before it locks up; in normal operation I don't see that. From a process dump I can see it's blocked on the mutex in the private OpenDatabase method in SharedEngine. It seems a similar issue has been found with that before: #2135
Version
LiteDB: 5.0.16
.NET Framework: 4.6.2
Describe the bug
Requests to read or write a collection hang and never return when the database is being used by multiple threads. The only way to recover from this situation is to kill our process that is hosting the database.
E.g. LiteDatabase.GetCollection(CollectionName)
Note that this is becoming a critical issue for our application and our customers, so would appreciate a prompt response if at all possible.
Steps to Reproduce
Our application has concurrent threads that are simultaneously reading and updating documents in the database. Our LiteDatabase object is a singleton, and we use locking on the object to ensure only one thread is updating the database at a time.
After some time, the threads all block waiting on a lock on the database, and the lock is currently held by the above request to GetCollection(), which never returns.
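Roughly, the access pattern described above looks like the following sketch (illustrative only, not the actual application code; all names are invented):

```csharp
using LiteDB;

// Illustrative sketch of the usage described above: a singleton LiteDatabase
// plus an application-level lock so only one thread updates at a time.
public static class Store
{
    private static readonly LiteDatabase _db = new LiteDatabase("data.db");
    private static readonly object _sync = new object();

    public static void Upsert(string collectionName, BsonDocument doc)
    {
        lock (_sync)
        {
            // In the hang described above, this GetCollection() call is the
            // one holding the application lock and never returning.
            var col = _db.GetCollection(collectionName);
            col.Upsert(doc);
        }
    }
}
```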
I have downloaded the latest LiteDB code and added some logging to pinpoint where the problem is. It appears to be in the DiskWriterQueue, where the ExecuteQueue function is scheduled via Task.Run(ExecuteQueue); however, ExecuteQueue never actually runs, so the queue is never processed. See the log snippets below (all of these log lines were added by me):
2023-04-11 11:56:58.054 [17] ERROR Got lock in DiskWriter.Run(), queue size is 30
2023-04-11 11:56:58.054 [17] ERROR Run(): oldValue is 0, _running is 1, task is null False
2023-04-11 11:56:58.054 [17] ERROR Creating new task to ExecuteQueue, queue size is 30
2023-04-11 11:56:58.054 [17] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.163 [41] ERROR Got lock in DiskWriter.Run(), queue size is 60
2023-04-11 11:56:58.163 [41] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.163 [41] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.195 [9] ERROR Got lock in DiskWriter.Run(), queue size is 62
2023-04-11 11:56:58.195 [9] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.195 [9] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.210 [38] ERROR Got lock in DiskWriter.Run(), queue size is 63
2023-04-11 11:56:58.226 [38] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.226 [38] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.226 [53] ERROR Got lock in DiskWriter.Run(), queue size is 64
2023-04-11 11:56:58.226 [53] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.226 [53] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.242 [32] ERROR Got lock in DiskWriter.Run(), queue size is 65
2023-04-11 11:56:58.242 [32] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.242 [32] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.242 [44] ERROR Got lock in DiskWriter.Run(), queue size is 66
2023-04-11 11:56:58.242 [44] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.242 [44] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.242 [58] ERROR Got lock in DiskWriter.Run(), queue size is 67
2023-04-11 11:56:58.257 [58] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.257 [58] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.257 [39] ERROR Got lock in DiskWriter.Run(), queue size is 69
2023-04-11 11:56:58.257 [39] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.257 [39] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.273 [4] ERROR Got lock in DiskWriter.Run(), queue size is 70
2023-04-11 11:56:58.273 [4] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.273 [4] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.288 [54] ERROR Got lock in DiskWriter.Run(), queue size is 71
2023-04-11 11:56:58.288 [54] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.288 [54] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.288 [62] ERROR Got lock in DiskWriter.Run(), queue size is 72
2023-04-11 11:56:58.304 [62] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.304 [62] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.320 [40] ERROR Got lock in DiskWriter.Run(), queue size is 73
2023-04-11 11:56:58.320 [62] ERROR ExecuteQueue, queue size is 73
2023-04-11 11:56:58.320 [40] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.351 [40] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.429 [57] ERROR Got lock in DiskWriter.Run(), queue size is 4
2023-04-11 11:56:58.429 [57] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:58.429 [57] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:58.476 [62] ERROR ExecuteQueue, page is null, queue size 0
2023-04-11 11:56:58.476 [62] ERROR Queue is empty, exiting ExecuteQueue
2023-04-11 11:56:58.714 [54] ERROR Got lock in DiskWriter.Run(), queue size is 30
2023-04-11 11:56:58.714 [54] ERROR Run(): oldValue is 0, _running is 1, task is null False
2023-04-11 11:56:58.714 [54] ERROR Creating new task to ExecuteQueue, queue size is 30
2023-04-11 11:56:58.714 [54] ERROR Releasing lock in DiskWriter.Run()
Note that after this, ExecuteQueue never runs and so the queue is never emptied and keeps on growing:
2023-04-11 11:56:59.403 [40] ERROR Got lock in DiskWriter.Run(), queue size is 60
2023-04-11 11:56:59.403 [40] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:59.403 [40] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:56:59.757 [39] ERROR Got lock in DiskWriter.Run(), queue size is 90
2023-04-11 11:56:59.772 [39] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:56:59.772 [39] ERROR Releasing lock in DiskWriter.Run()
2023-04-11 11:57:01.401 [55] ERROR Got lock in DiskWriter.Run(), queue size is 120
2023-04-11 11:57:01.401 [55] ERROR Run(): oldValue is 1, _running is 1, task is null False
2023-04-11 11:57:01.401 [55] ERROR Releasing lock in DiskWriter.Run()
If anything hits the DiskWriterQueue.Wait() function after this, it waits forever, waiting on _task to complete and empty the queue, but it never does.
The only way to recover from this is to restart our process.
Expected behavior
No locked-up threads
Additional context
Note that we tried removing all locking on our side, but this resulted in us hitting "Lock timeout on transaction after 01:00" exceptions, followed by "Maximum number of transactions reached" errors, which I assume are just a different manifestation of the same problem.
Note that if we repeat the test with version 4.1.4 of LiteDB, we do not see this issue.