BlockingCollection<T>.TryTakeFromAny throws InvalidOperationException when underlying collection is ConcurrentBag<T> #26671

ReubenBond · 2018-07-02T06:52:35Z

When the underlying collection for BlockingCollection<T> is ConcurrentBag<T>, concurrent calls to the static BlockingCollection<T>.TryTakeFromAny method can sometimes throw InvalidOperationException with the message "The underlying collection was modified from outside of the BlockingCollection". This can occur without any external modification to the collection.

This behavior is present in .NET Core 2.0/2.1 but not in .NET Framework 4.6.1.

Repro: https://gist.github.com/ReubenBond/98de2cede0d57a989ededa8e113b0f39#file-blockingcollection_concurrentbag_issue-cs

EDIT: This can be reproduced without TryTakeFromAny by replacing that line in the repro with success = blockingCollection.TryTake(out _).

EDIT 2: This does not reproduce with an underlying collection of type ConcurrentQueue<T>

EDIT 3: Updated repro to use ThreadPool instead of tasks - it reproduces much more frequently now.
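
For readers who cannot open the gist, here is a minimal stress sketch in the same spirit. It is not the linked repro: the class name `BagStress`, the worker count, and the iteration count are hypothetical choices for illustration. Because the failure depends on thread timing, a run may report zero failures even on an affected runtime, and it is expected to report zero on a runtime that contains the fix.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public static class BagStress
{
    // Hypothetical helper: hammers a BlockingCollection<int> backed by a
    // ConcurrentBag<int> from thread pool work items and returns how many
    // InvalidOperationExceptions were observed from TryTake.
    public static int Run(int workers, int iterationsPerWorker)
    {
        int failures = 0;
        using (var bc = new BlockingCollection<int>(new ConcurrentBag<int>()))
        using (var done = new CountdownEvent(workers))
        {
            for (int w = 0; w < workers; w++)
            {
                // QueueUserWorkItem is used because the issue reports that
                // the problem reproduces more often this way than with tasks.
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    try
                    {
                        for (int i = 0; i < iterationsPerWorker; i++)
                        {
                            bc.Add(i);
                            try
                            {
                                bc.TryTake(out _);
                            }
                            catch (InvalidOperationException)
                            {
                                Interlocked.Increment(ref failures);
                            }
                        }
                    }
                    finally
                    {
                        done.Signal();
                    }
                });
            }

            done.Wait();
        }

        return failures;
    }
}
```

Calling `BagStress.Run(Environment.ProcessorCount, 100000)` and printing the result gives a rough indication of whether a given runtime is affected.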


stephentoub · 2018-07-02T15:16:55Z

I believe the problem is that when we rewrote ConcurrentBag for .NET Core to significantly improve its performance, as one small part of that we effectively removed this check:
https://referencesource.microsoft.com/#System/sys/system/collections/concurrent/ConcurrentBag.cs,397

The problem with that, as this repro highlights, is that if multiple threads are taking/stealing, it's possible that a thread may miss an item if it's taken by another thread. Consider a situation with four threads each with their own local queue, all of which are currently empty. Then consider this ordering of operations:

1. Thread 2 adds an item, incrementing the BlockingCollection's semaphore to 1.
2. Thread 4 tries to take an item; there is one, so it decrements the semaphore's count to 0, finds its local queue empty, and starts searching for an item to steal. It looks at thread 1 and finds its local queue empty.
3. Thread 1 adds an item, incrementing the BC's semaphore back to 1. Thread 4 just checked Thread 1's queue, so it's not going to check it again.
4. Thread 2 takes an item, decrementing the semaphore back to 0. It checks its own queue and finds that it contains an item. Success.
5. Thread 4 continues its search: it finds thread 2's list is empty and then finds thread 3's list is empty. It's now looked at all of the lists, and returns false from TryTake, even though thread 1's list contains an item and it successfully decremented the semaphore's count. BlockingCollection throws.

So, even though there was an item in the collection that could have been taken, it missed it.
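
The interleaving above can be replayed deterministically on a single thread with a toy model. This is only a sketch: `StealRaceModel`, the plain `Queue<int>` per thread, and the integer standing in for the semaphore are assumptions made for illustration and do not reflect how ConcurrentBag is actually implemented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StealRaceModel
{
    // Replays the five steps in order. Returns true when thread 4's take
    // fails even though an item is still stored in thread 1's queue.
    public static bool Replay()
    {
        // queues[0..3] stand in for the local queues of threads 1..4.
        Queue<int>[] queues = Enumerable.Range(0, 4)
            .Select(_ => new Queue<int>())
            .ToArray();
        int semaphore = 0; // models the BlockingCollection's available-item count

        // Step 1: thread 2 adds an item.
        queues[1].Enqueue(100);
        semaphore++;

        // Step 2: thread 4 claims an item, then probes its own queue and thread 1's.
        semaphore--;
        bool found = queues[3].Count > 0 || queues[0].Count > 0;

        // Step 3: thread 1 adds an item after thread 4 has already probed it.
        queues[0].Enqueue(200);
        semaphore++;

        // Step 4: thread 2 claims an item and takes it from its own queue.
        semaphore--;
        queues[1].Dequeue();

        // Step 5: thread 4 probes the remaining queues (threads 2 and 3).
        found = found || queues[1].Count > 0 || queues[2].Count > 0;

        int remaining = queues.Sum(q => q.Count);
        return !found && remaining == 1 && semaphore == 0;
    }
}
```

In this model `Replay()` returns true: the scan comes up empty while one item remains and the modeled semaphore has already been decremented for it.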

This sequence highlights why "Updated repro to use ThreadPool instead of tasks - it reproduces much more frequently now" made a difference: ThreadPool.QueueUserWorkItem(callback) puts work items into the global queue, whereas Task.Run from a thread pool thread puts the task into that thread's local queue. With Task.Run, the thread that's doing the add is therefore very likely to keep doing adds rather than takes, which makes it much less likely to get into a situation like steps (1) and (4) in the above sequence, where the same thread needed to add and then take.

Unfortunately I think we're going to need to put back some kind of versioning check, where steals that fail check the versions, and if anything's been added since, it tries again.
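
One possible shape for such a versioning check is sketched below. It is an illustration of the idea only, not the change that was eventually made in corefx: `VersionedToyBag`, its single lock, and the `afterEmptyProbe` hook (present purely so an interleaving can be injected from one thread) are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

public sealed class VersionedToyBag
{
    private readonly object _gate = new object();
    private readonly Queue<int>[] _queues;
    private long _addVersion; // incremented on every Add

    public VersionedToyBag(int threadCount)
    {
        _queues = new Queue<int>[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            _queues[i] = new Queue<int>();
        }
    }

    public void Add(int thread, int item)
    {
        lock (_gate)
        {
            _queues[thread].Enqueue(item);
            _addVersion++;
        }
    }

    // Scans the caller's queue first, then every other queue. If the scan
    // finds nothing but an Add happened in the meantime, the scan is retried.
    public bool TryTake(int self, out int item, Action<int> afterEmptyProbe = null)
    {
        while (true)
        {
            long versionAtStart;
            lock (_gate)
            {
                versionAtStart = _addVersion;
            }

            for (int offset = 0; offset < _queues.Length; offset++)
            {
                int victim = (self + offset) % _queues.Length;
                lock (_gate)
                {
                    if (_queues[victim].Count > 0)
                    {
                        item = _queues[victim].Dequeue();
                        return true;
                    }
                }

                // Hypothetical hook, invoked outside the lock after an empty probe.
                afterEmptyProbe?.Invoke(victim);
            }

            lock (_gate)
            {
                if (_addVersion == versionAtStart)
                {
                    item = 0;
                    return false; // nothing was added during the scan
                }
            }
            // Something was added while scanning, so scan again.
        }
    }
}
```

If the hook adds an item to queue 0 right after the taker has probed it, the first scan finds nothing, the version check notices the add, and the second scan picks the item up. As the risk notes later in this thread point out, a retry loop of this kind can in theory keep retrying if adds keep racing with scans.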

cc: @kouvel, @benaadams

ReubenBond · 2018-07-04T05:55:02Z

Thank you for taking a look, @stephentoub.

One effect of this bug is that some items which are added to the collection cannot be retrieved even after successive calls to TryTake, as seen in this modified repro, where the results typically look something like this:

IsCompleted returns true while the items are still in the underlying collection. The underlying collection shows the lost items, but blockingCollection.Count == 0.

Maybe this is because IsCompleted is implemented as IsAddingCompleted && _occupiedNodes.CurrentCount == 0, but when the InvalidOperationException is thrown from TryTakeWithNoTimeValidation, the semaphore is not released (even though an item was not taken), and on subsequent calls, that method will terminate early if IsCompleted is true.
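
The mismatch described above can be shown without any race by wrapping a BlockingCollection around a collection that deliberately fails one take. This is a sketch under assumptions: `FailOnceCollection<T>` and `LostItemDemo` are hypothetical names, and the injected failure merely stands in for the transient ConcurrentBag miss discussed in this issue.

```csharp
using System;
using System.Collections;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical collection: behaves like a queue, except that the first
// TryTake call returns false even when an item is present.
public sealed class FailOnceCollection<T> : IProducerConsumerCollection<T>
{
    private readonly ConcurrentQueue<T> _inner = new ConcurrentQueue<T>();
    private bool _failNext = true;

    public bool TryAdd(T item) { _inner.Enqueue(item); return true; }

    public bool TryTake(out T item)
    {
        if (_failNext)
        {
            _failNext = false;
            item = default(T);
            return false; // simulated transient miss
        }
        return _inner.TryDequeue(out item);
    }

    public int Count => _inner.Count;
    public T[] ToArray() => _inner.ToArray();
    public void CopyTo(T[] array, int index) => _inner.CopyTo(array, index);
    void ICollection.CopyTo(Array array, int index) => ((ICollection)_inner).CopyTo(array, index);
    bool ICollection.IsSynchronized => false;
    object ICollection.SyncRoot => throw new NotSupportedException();
    public IEnumerator<T> GetEnumerator() => _inner.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

public static class LostItemDemo
{
    public static (bool Threw, int WrapperCount, int InnerCount, bool IsCompleted) Run()
    {
        var inner = new FailOnceCollection<int>();
        using (var bc = new BlockingCollection<int>(inner))
        {
            bc.Add(42);

            bool threw = false;
            try
            {
                bc.TryTake(out _);
            }
            catch (InvalidOperationException)
            {
                threw = true; // the wrapper saw TryTake fail after claiming an item
            }

            bc.CompleteAdding();
            return (threw, bc.Count, inner.Count, bc.IsCompleted);
        }
    }
}
```

If the explanation above is right, the wrapper's count drops to 0 after the failed take while the inner collection still holds the item, so IsCompleted becomes true once adding is completed and later takes never reach the stranded item.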

stephentoub · 2018-07-11T21:10:01Z

@ReubenBond, would you be able to test the fix in dotnet/corefx#30947 and confirm that it fixes your issue?

danmoseley · 2018-07-13T22:43:25Z

@ReubenBond ? If you guys can OK master, that would make us more comfortable with taking this into 2.1 in servicing.

ReubenBond · 2018-07-13T22:53:01Z

Apologies, I'll try to build CoreFx and test today.
Thank you for addressing it so quickly.

Please note the second issue which resides in BlockingCollection<T> itself: https://github.com/dotnet/corefx/issues/30781#issuecomment-402371237

stephentoub · 2018-07-13T22:56:51Z

> Please note the second issue

There's no issue if the wrapped collection behaves correctly, though, right?

ReubenBond · 2018-07-13T23:12:54Z

@stephentoub that's right, so this fix should rectify our problems. It's just a question of resiliency, since a transient error with the underlying collection will break the BlockingCollection.

ReubenBond · 2018-07-13T23:49:18Z

Apologies, @stephentoub @danmosemsft, could you point me in the right direction for using a local corefx build from a csproj?

danmoseley · 2018-07-13T23:51:21Z

@ReubenBond you want https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/dogfooding.md, and in particular I think you need https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/dogfooding.md#option-1-framework-dependent (because this fix has not yet been built into a full product build).

ReubenBond · 2018-07-14T06:21:35Z

Thanks, @danmosemsft.

It worked! This change fixes the issue we were experiencing.

As a bonus, I see a massive speed improvement in our little Orleans repro project when running netcoreapp3.0 with the latest nightly SDK compared to netcoreapp2.0 on SDK 2.1.301.

Before: 41.48 ms per iteration on average
After: 14.07 ms per iteration on average

EDIT: I'm not sure of the etiquette/workflow for corefx, but please feel free to close this.

stephentoub · 2018-07-14T13:06:37Z

Thanks for validating!

> I see a massive speed improvement in our little Orleans repro project when running netcoreapp3.0 with the latest nightly SDK compared to netcoreapp2.0 on SDK 2.1.301.

Did you try with netcoreapp2.1? My assumption is that's where the bulk of the wins are coming from, though there has already been some additional perf work for netcoreapp3.0, just not as much.

> I'm not sure of the etiquette/workflow for corefx, but please feel free to close this.

Thanks. We'll close it when we either close or merge the release/2.1 port PR.

danmoseley · 2018-07-17T16:28:58Z

Shiproom template

Description

ConcurrentBag.TryTake may fail to take an item from the collection even if it’s known to be there. This in turn causes problems for wrappers that assume if they know the collection contains an item that TryTake will succeed, like BlockingCollection. Race conditions can result in BlockingCollection throwing exceptions and getting into a corrupted state due to TryTake failing to return an item when it should have been able to.

Customer Impact

Exceptions / corrupted data structures / deadlocks when multiple threads access a BlockingCollection wrapped around a ConcurrentBag and race in a manner that results in takes on the bag failing.
Reported by Orleans.

Regression?

Regression from 1.x

Risk

Low:
- Small perf hit due to additional synchronization on some code paths, but that synchronization was already there pre-.NET Core 2.0 and is in netfx.
- The fix involves retries if a particular status changes during the operation, and so in theory it's possible that each try could hit that same condition and result in livelock, but the chances of that are so small as to not be relevant, and the same issue was present in pre-.NET Core 2.0 and is in netfx.

danmoseley · 2018-07-17T18:27:17Z

Ported to 2.1 with dotnet/corefx#31009

danmoseley · 2018-07-17T18:36:00Z

Pulled this temporarily as it missed 2.1.3 and they want a clean branch in case they need to rebuild 2.1.3 for some reason. Keeping this issue open in the meantime.

joshfree · 2018-08-09T18:45:45Z

Moving this to 2.1.5 per shiproom today.

@tarekgh would you be able to help load balance this from @danmosemsft?

danmoseley · 2018-08-14T20:46:26Z

Moving label to PR per new process

BTW I think the PR is already ready, no work required

karelz · 2018-09-06T18:53:15Z

Fixed in release/2.1 branch in PR dotnet/corefx#31162 (2nd attempt after the first attempt in PR dotnet/corefx#31009 was reverted by PR dotnet/corefx#31132). The fix will ship as part of 2.1.5 release.

stephentoub self-assigned this Jul 2, 2018

stephentoub assigned danmoseley and unassigned stephentoub Jul 13, 2018

danmoseley closed this as completed Jul 17, 2018

danmoseley reopened this Jul 17, 2018

karelz closed this as completed Sep 6, 2018

msftgits transferred this issue from dotnet/corefx Jan 31, 2020

msftgits added this to the 2.1.x milestone Jan 31, 2020

ghost locked as resolved and limited conversation to collaborators Dec 16, 2020
