BlockingCollection<T>.TryTakeFromAny throws InvalidOperationException when underlying collection is ConcurrentBag<T> #26671
Comments
I believe the problem is that when we rewrote ConcurrentBag for .NET Core to significantly improve its performance, as one small part of that we effectively removed this check:
The problem with that, as this repro highlights, is that if multiple threads are taking/stealing, it's possible that a thread may miss an item if it's taken by another thread. Consider a situation with four threads, each with its own local queue, all of which are currently empty. Then consider this ordering of operations:
So, even though there was an item in the collection that could have been taken, it missed it. This sequence highlights why "Updated repro to use ThreadPool instead of tasks - it reproduces much more frequently now" made a difference: ThreadPool.QueueUserWorkItem(callback) puts work items into the global queue, whereas Task.Run from a thread pool thread puts the task into the thread's queue. That means the thread that's doing the add is very likely to keep doing adds rather than takes, which means it'll be much less likely to get into a situation like with steps (1) and (4) in the above sequence, where the same thread needed to add then take.
Unfortunately I think we're going to need to put back some kind of versioning check, where steals that fail check the versions, and if anything's been added since, it tries again.
cc: @kouvel, @benaadams
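To make the "versioning check" idea concrete, here is a minimal sketch of a take that retries when an add raced with its scan. It is not the corefx ConcurrentBag<T> implementation: ToyBag<T>, its slot parameter (standing in for a thread's local queue), and the _addVersion counter are all hypothetical names invented for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical toy bag for illustration only; NOT the real ConcurrentBag<T>.
// Each slot stands in for one thread's local queue.
public sealed class ToyBag<T>
{
    private readonly ConcurrentQueue<T>[] _queues;
    private long _addVersion; // incremented after every add

    public ToyBag(int slots)
    {
        _queues = new ConcurrentQueue<T>[slots];
        for (int i = 0; i < slots; i++) _queues[i] = new ConcurrentQueue<T>();
    }

    public void Add(int slot, T item)
    {
        _queues[slot].Enqueue(item);
        // Publish the add so a concurrent failed scan knows to retry.
        Interlocked.Increment(ref _addVersion);
    }

    public bool TryTake(int slot, out T item)
    {
        while (true)
        {
            long before = Interlocked.Read(ref _addVersion);

            // Scan the caller's own queue first, then try to steal from the others.
            for (int i = 0; i < _queues.Length; i++)
            {
                int victim = (slot + i) % _queues.Length;
                if (_queues[victim].TryDequeue(out item)) return true;
            }

            // The scan found nothing. Report "empty" only if no add happened
            // during the scan; otherwise an item may have landed in a queue
            // that was already visited, so scan again.
            if (Interlocked.Read(ref _addVersion) == before)
            {
                item = default(T);
                return false;
            }
        }
    }
}
```

With this shape, a failed take means no add completed between the start and the end of the scan, which is the guarantee a wrapper such as BlockingCollection relies on. The cost is an extra scan whenever adds and failed steals overlap.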
Thank you for taking a look, @stephentoub. One effect of this bug is that some items which are added to the collection cannot be retrieved even after successive calls to
Maybe this is because
@ReubenBond, would you be able to test the fix in dotnet/corefx#30947 and confirm that it fixes your issue?
@ReubenBond? If you can confirm the fix in master, that would make us more comfortable taking this into 2.1 servicing.
Apologies, I'll try to build CoreFx and test today. Please note the second issue which resides in
There's no issue if the wrapped collection behaves correctly, though, right?
@stephentoub that's right, so this fix should rectify our problems. It's just a question of resiliency, since a transient error with the underlying collection will break the BlockingCollection
Apologies, @stephentoub @danmosemsft, could you point me in the right direction for using a local corefx build from a csproj?
@ReubenBond you want https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/dogfooding.md. In particular, I think you need https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/dogfooding.md#option-1-framework-dependent (because this fix has not yet been built into a full product build).
Thanks, @danmosemsft. It worked! This change fixes the issue we were experiencing. As a bonus, I see a massive speed improvement in our little Orleans repro project when running netcoreapp3.0 with the latest nightly SDK compared to netcoreapp2.0 on SDK 2.1.301.
Before: 41.48 ms per iteration on average
EDIT: I'm not sure of the etiquette/workflow for corefx, but please feel free to close this.
Thanks for validating!
Did you try with netcoreapp2.1? My assumption is that's where the bulk of the wins are coming from, though there has already been some additional perf work for netcoreapp3.0, just not as much.
Thanks. We'll close it when we either close or merge the release/2.1 port PR.
Shiproom template
Description
ConcurrentBag.TryTake may fail to take an item from the collection even if it's known to be there. This in turn causes problems for wrappers, like BlockingCollection, that assume TryTake will succeed if they know the collection contains an item. Race conditions can result in BlockingCollection throwing exceptions and getting into a corrupted state due to TryTake failing to return an item when it should have been able to.
Customer Impact
Exceptions / corrupted data structures / deadlocks when multiple threads access a BlockingCollection wrapped around a ConcurrentBag and race in a manner that results in takes on the bag failing.
Regression?
Regression from 1.x
Risk
Low:
Ported to 2.1 with dotnet/corefx#31009
Pulled this temporarily as it missed 2.1.3 and they want a clean branch in case they need to rebuild 2.1.3 for some reason. Keeping this issue open in the meantime.
Moving this to 2.1.5 per shiproom today. @tarekgh, would you be able to help load balance this from @danmosemsft?
Moving label to PR per new process. BTW, I think the PR is already ready; no work required.
Fixed in the release/2.1 branch in PR dotnet/corefx#31162 (2nd attempt, after the first attempt in PR dotnet/corefx#31009 was reverted by PR dotnet/corefx#31132). The fix will ship as part of the 2.1.5 release.
When the underlying collection for BlockingCollection<T> is ConcurrentBag<T>, concurrent calls to the static BlockingCollection<T>.TryTakeFromAny method can sometimes throw InvalidOperationException with the message "The underlying collection was modified from outside of the BlockingCollection". This can occur without any external modification to the collection.
This behavior is present in .NET Core 2.0/2.1 but not in .NET Framework 4.6.1.
Repro: https://gist.github.com/ReubenBond/98de2cede0d57a989ededa8e113b0f39#file-blockingcollection_concurrentbag_issue-cs
EDIT: This can be reproduced without TryTakeFromAny by replacing that line in the repro with success = blockingCollection.TryTake(out _).
EDIT 2: This does not reproduce with an underlying collection of type ConcurrentQueue<T>.
EDIT 3: Updated repro to use ThreadPool instead of tasks - it reproduces much more frequently now.
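For readers who cannot open the gist, the following is a rough sketch of the shape of such a repro. It is not the gist's code: BagRepro, Run, and the worker/iteration parameters are hypothetical names chosen here. It uses the plain TryTake(out _) variant from the first EDIT and ThreadPool.QueueUserWorkItem as in EDIT 3. On an affected runtime the returned count may be non-zero, but the race is timing-dependent, so a zero result does not prove the bug is absent.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical repro sketch; names and parameters are illustrative only.
public static class BagRepro
{
    // Returns how many workers observed an InvalidOperationException.
    public static int Run(IProducerConsumerCollection<int> inner, int workers, int iterations)
    {
        using var blocking = new BlockingCollection<int>(inner);
        using var done = new CountdownEvent(workers);
        int failures = 0;

        for (int w = 0; w < workers; w++)
        {
            // Queue to the thread pool's global queue, as in EDIT 3.
            ThreadPool.QueueUserWorkItem(_ =>
            {
                try
                {
                    for (int i = 0; i < iterations; i++)
                    {
                        blocking.Add(i);
                        blocking.TryTake(out int _); // non-blocking take
                    }
                }
                catch (InvalidOperationException)
                {
                    // "The underlying collection was modified from outside
                    // of the BlockingCollection"
                    Interlocked.Increment(ref failures);
                }
                finally
                {
                    done.Signal();
                }
            });
        }

        done.Wait();
        return failures;
    }
}
```

Calling BagRepro.Run(new ConcurrentBag<int>(), 8, 100000) exercises the failing configuration, while passing new ConcurrentQueue<int>() instead corresponds to the case in EDIT 2 that does not reproduce.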