Fixing memory ordering issue in ConcurrentQueue #78142
Tagging subscribers to this area: @dotnet/area-system-collections
Issue Details
Fixes: #76501
I have been running the repro in multiple processes for about 20 minutes now with no failures. It typically fails in under one minute.
No failures for an hour. So it looks like the scenario has no other issues.
Thanks @filipnavara for the repro scenario!
Thanks for the fix, LGTM. It should probably be backported to 7.0.
Thanks. Nice find.
I'm fine with that; CQ is at the heart of the system due to its usage in ThreadPool. Just note this isn't new; it's been this way since I missed adding the volatile in 2016 :)
Thanks!!
The issue is on the ToArray/CopyTo/Enumerate path. I think it would be unusual to do such operations while the queue is being mutated. Any kind of synchronization that guarantees quiescence will likely make this issue disappear. It also requires a weakly ordered architecture and hardware that exploits that weakness aggressively. Missing write fences often cause trouble because write buffering is common. Missing read fences require the CPU to speculate far ahead, which is probably still not very common. Evidently, we have only seen this issue on M1. I think the chances of this causing problems in actual programs are low. On the other hand, it is possible, the fix is very low risk, and 7.0 may see a bigger share of weak architectures than prior releases. I think it may be worth porting.
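To make the read-fence point concrete, here is a minimal sketch in C++ rather than C#. It is not the ConcurrentQueue code or the actual fix; `Slot`, `publish`, `consume`, and `run_handoff` are hypothetical names chosen for illustration. It assumes a slot whose payload is guarded by a sequence number: the writer publishes with a release store, and the reader must use an acquire load so that its read of the payload cannot be satisfied before the sequence number is observed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical slot, loosely modeled on a queue entry: a payload plus a
// sequence number that tells readers when the payload is valid.
struct Slot {
    int item = 0;
    std::atomic<int> sequence{0};
};

// Writer: store the payload first, then publish it with a release store.
void publish(Slot& slot, int value) {
    slot.item = value;
    slot.sequence.store(1, std::memory_order_release);
}

// Reader: the acquire load orders the later read of `item` after the
// sequence check. With a relaxed load here, a weakly ordered CPU that
// speculates ahead could return a stale `item`.
int consume(Slot& slot) {
    while (slot.sequence.load(std::memory_order_acquire) != 1) {
        std::this_thread::yield();
    }
    return slot.item;
}

// Runs one writer and one reader against a fresh slot and returns the
// value the reader observed.
int run_handoff(int value) {
    Slot slot;
    int seen = 0;
    std::thread reader([&] { seen = consume(slot); });
    std::thread writer([&] { publish(slot, value); });
    writer.join();
    reader.join();
    return seen;
}
```

Calling `run_handoff(42)` returns the published value. Replacing the acquire load with a relaxed one would make the read of `item` a data race, which is the general class of bug described above, although such a failure may only show up on weakly ordered hardware.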
I also think this issue may become a test/stress nuisance in 7.0. It was quite annoying for NativeAOT on OSX lately, as it failed fairly often.
/backport to release/7.0
Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/3439145097
Fixes: #76501