Write a parallel deque for work stealing #4877
Related to #3095
Since I'm not sure that lock-free deques even exist, I've scaled back the scope of this slightly.
The deques used in work stealing typically have a special property: one thread (the owner) only pushes and pops at one end, and the other threads (the thieves) only ever dequeue from the opposite end; nobody else ever enqueues. This lets you get better efficiency under multithreading, but it means they are not general-purpose.
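To make that split concrete, here is a rough sketch in present-day Rust of how it can be expressed in the types: the owner holds a non-cloneable Worker handle that can only push and pop, while thieves hold cloneable Stealer handles that can only steal. The Mutex-backed storage and the Worker/Stealer names are stand-ins for illustration, not an existing API; a real work-stealing deque would replace the lock with the lock-free structure discussed below.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Owner half: held by exactly one thread.
struct Worker<T> {
    inner: Arc<Mutex<VecDeque<T>>>,
}

// Thief half: cloneable, handed to every other scheduler thread.
struct Stealer<T> {
    inner: Arc<Mutex<VecDeque<T>>>,
}

impl<T> Worker<T> {
    // Only the owning thread pushes and pops, at the "bottom" end.
    fn push(&self, t: T) {
        self.inner.lock().unwrap().push_back(t);
    }
    fn pop(&self) -> Option<T> {
        self.inner.lock().unwrap().pop_back()
    }
}

impl<T> Stealer<T> {
    // Thieves only ever take from the opposite ("top") end.
    fn steal(&self) -> Option<T> {
        self.inner.lock().unwrap().pop_front()
    }
}

impl<T> Clone for Stealer<T> {
    fn clone(&self) -> Stealer<T> {
        Stealer { inner: self.inner.clone() }
    }
}

// Create a deque and split it into its owner and thief halves.
fn deque<T>() -> (Worker<T>, Stealer<T>) {
    let q = Arc::new(Mutex::new(VecDeque::new()));
    (Worker { inner: q.clone() }, Stealer { inner: q })
}
```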
Somebody pointed out that the Chase-Lev deque is a lock-free parallel deque for work stealing. Sounds good to me.
Is there any progress on this?
@ILyoan No. There is a type in place at rt::work_queue but it is not implemented.
A related paper: "A Dynamic-Sized Nonblocking Work Stealing Deque", by Hendler, Lev, Moir, and Shavit: http://www.cs.bgu.ac.il/~hendlerd/papers/dynamic-size-deque.pdf
@Aatch was working on this at some point but stopped.
http://www.di.ens.fr/~zappa/readings/ppopp13.pdf This recent paper is a wonderfully detailed description of the atomic memory issues involved with the data structure. It includes pseudocode for a C11 memory model version with every atomic operation specified.
A good example implementation of such a scheduler can be found in this Haskell lib: https://github.com/ekmett/structures/blob/master/src/Control/Concurrent/Deque.hs
I am wondering whether we really want to use a work-stealing deque as our per-thread work queue. Something that allows for random access would prevent the starvation problems we have to hack around and would be more fair. Is there such a data structure that is lock-free?
I've got a work-in-progress implementation of the Chase-Lev deque. I started working from the C11 paper, but I found some bugs and omissions in their implementation so I'm having to go back to the original paper. Any help would be appreciated. I'm currently trying to wrap my head around Section 4 of the paper, which discusses how to adapt the algorithm for growing/shrinking the array without a garbage collector. https://github.com/toffaletti/rust-code/blob/master/chase_lev_deque.rs Even if this isn't used for the scheduler, I've been told it might be useful for Servo rendering work.
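For orientation, here is a minimal sketch (in Rust, with hypothetical names, not taken from the linked code) of what the grow step in Section 4 has to do: copy the live range of the old circular buffer into one of twice the capacity while preserving each element's logical index. The hard part the paper addresses is safely reclaiming the old buffer while stealers may still hold references to it; this sketch simply leaks it.

```rust
// Copy the live range [top, bottom) of the old ring buffer into a buffer of
// twice the capacity, keeping each element at the same logical index
// (modulo the new size). Reclaiming `old` is deliberately not handled here.
fn grow(old: &[usize], top: isize, bottom: isize) -> Vec<usize> {
    let old_cap = old.len();
    let new_cap = old_cap * 2;
    let mut new_buf = vec![0usize; new_cap];
    for i in top..bottom {
        new_buf[i as usize % new_cap] = old[i as usize % old_cap];
    }
    new_buf
}
```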
If you want to see a worked-out Chase-Lev implementation that's pretty readable, look at the deque branch of Edward Kmett's structures lib: https://github.com/ekmett/structures/blob/deque/src/Control/Concurrent/Deque.hs It uses some other primops you can see defined here: http://hackage.haskell.org/package/atomic-primops-0.4/docs/Data-Atomics-Counter-Reference.html Some of that may not be relevant for the no-GC context, but it at least gives a pretty readable working implementation to look at explicitly.
Thanks, @cartazio I will take a look. I haven't seen a fully working implementation that does array resizing and reclaiming without a GC. The original paper is light on specifics, just outlining a solution and then in a footnote saying "It is straightforward, however, to use the same solution for reclaiming buffers also when growing." I decided to use Relacy Race Detector to help get a correct implementation because helgrind and drd report too many false positives. That attempt is here: https://github.com/toffaletti/chase-lev It currently passes the tests, but I've gone the extremely heavy-handed route of making all atomic access use memory_order_seq_cst. I'll work on relaxing that and fixing any other bugs.
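For readers following along, here is a stripped-down sketch in present-day Rust of the core Chase-Lev operations with every atomic access at SeqCst, the same conservative choice described above. It is illustrative only: the buffer has a fixed capacity (no growing or shrinking), elements are plain usize task ids, and none of the names come from an actual implementation.

```rust
use std::sync::atomic::{AtomicIsize, AtomicUsize, Ordering::SeqCst};

const CAP: usize = 256; // fixed; a real implementation grows the buffer

struct Deque {
    top: AtomicIsize,    // stealers advance this index
    bottom: AtomicIsize, // the owner pushes/pops at this index
    buffer: Vec<AtomicUsize>,
}

impl Deque {
    fn new() -> Deque {
        Deque {
            top: AtomicIsize::new(0),
            bottom: AtomicIsize::new(0),
            buffer: (0..CAP).map(|_| AtomicUsize::new(0)).collect(),
        }
    }

    // Owner only: add a task at the bottom (ignores the buffer-full case).
    fn push(&self, task: usize) {
        let b = self.bottom.load(SeqCst);
        self.buffer[b as usize % CAP].store(task, SeqCst);
        self.bottom.store(b + 1, SeqCst);
    }

    // Owner only: take the most recently pushed task.
    fn pop(&self) -> Option<usize> {
        let b = self.bottom.load(SeqCst) - 1;
        self.bottom.store(b, SeqCst);
        let t = self.top.load(SeqCst);
        if t > b {
            // Empty: restore bottom and bail out.
            self.bottom.store(b + 1, SeqCst);
            return None;
        }
        let task = self.buffer[b as usize % CAP].load(SeqCst);
        if t != b {
            // More than one element left: no stealer can race for this slot.
            return Some(task);
        }
        // Exactly one element left: race the stealers for it by bumping top.
        let won = self.top.compare_exchange(t, t + 1, SeqCst, SeqCst).is_ok();
        self.bottom.store(b + 1, SeqCst);
        if won { Some(task) } else { None }
    }

    // Thieves: take the oldest task from the top, or give up on any race.
    fn steal(&self) -> Option<usize> {
        let t = self.top.load(SeqCst);
        let b = self.bottom.load(SeqCst);
        if t >= b {
            return None; // observed empty
        }
        let task = self.buffer[t as usize % CAP].load(SeqCst);
        self.top
            .compare_exchange(t, t + 1, SeqCst, SeqCst)
            .ok()
            .map(|_| task)
    }
}
```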
This adds an implementation of the Chase-Lev work-stealing deque to libstd under std::rt::deque. I've been unable to break the implementation of the deque itself, and it's not super highly optimized just yet (everything uses a SeqCst memory ordering).

The major snag in implementing the Chase-Lev deque is that the buffers used to store data internally cannot get deallocated back to the OS. In the meantime, a shared buffer pool (synchronized by a normal mutex) is used to deallocate/allocate buffers from. This is done in hope of not overcommitting too much memory. It is in theory possible to eventually free the buffers, but one must be very careful in doing so.

I was unable to get good numbers from the src/test/bench tests (I don't think many of them are slamming the work queue that much), but I was able to get good numbers from one of my own tests. In a recent rewrite of select::select(), I found that my implementation was incredibly slow due to contention on the shared work queue. Upon switching to the parallel deque, I saw the contention drop to 0 and the runtime go from 1.6s to 0.9s, with most of the time spent in libuv awakening the schedulers (plus allocations).

Closes rust-lang#4877
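A loose sketch of the shared buffer pool idea from the description above: retired ring buffers go onto a mutex-protected free list instead of being returned to the OS, and later allocations are satisfied from that list when possible. The names and types here are illustrative assumptions, not the libstd code.

```rust
use std::sync::Mutex;

// Keeps retired buffers around so growth doesn't have to hit the allocator,
// and so buffers that may still be referenced are never handed back to the OS.
struct BufferPool {
    free: Mutex<Vec<Box<[usize]>>>,
}

impl BufferPool {
    fn new() -> BufferPool {
        BufferPool { free: Mutex::new(Vec::new()) }
    }

    // Hand out a buffer of at least `cap` slots, reusing a retired one if possible.
    fn alloc(&self, cap: usize) -> Box<[usize]> {
        let mut free = self.free.lock().unwrap();
        if let Some(pos) = free.iter().position(|b| b.len() >= cap) {
            free.swap_remove(pos)
        } else {
            vec![0usize; cap].into_boxed_slice()
        }
    }

    // Return a buffer once it is no longer reachable by any stealer.
    fn release(&self, buf: Box<[usize]>) {
        self.free.lock().unwrap().push(buf);
    }
}
```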
The work-stealing algorithm uses a deque. The deque has some properties that might make further optimizations possible later (one end is only ever used by a single thread). We only need something simple to start with: a locked vector that just pushes, pops, shifts, and unshifts. Using a circular buffer would be better; lock-free would be better still (maybe).
Also probably relevant is the paper on data locality in work stealing, though I haven't read it yet.
There are some useful data structures for atomically reference counted types and mutexes in core::private.