Thread wakeup times have a significant impact on parallel task execution #6941

james7132 · 2022-12-13T10:26:14Z

What problem does this solve or what need does it fill?

Parallel tasks are currently bottlenecked by thread wake-up times. Even if we can schedule tasks from the parallel system scheduler or from parallel iteration, it can take quite a while to wake all of the threads in in the TaskPools up. This isn't particularly an issue for Io or AsyncCompute Tasks, but it has a measurable impact on Compute oriented tasks.

For example, see this run of propagate_transforms in Tracy. There's a 90.43us lag between when the first task and last task starts. This significantly impacts the system's total runtime.

There are likely multiple factors at work here:

futures_lite::block_on uses parking internally, which yields the thread back to the OS when there is no work to be done in an attempt to conserve CPU power. Waking threads up from this has notable latency.
async_executor conservatively wakes up only one thread at a time to avoid contention on the global and local queues. This forces threads to wake up in a cascading fashion. This seems to be the norm in both async_executor and tokio.
Our current use of Executor::try_tick in a hot loop seems to be increasing contention on the global task queue.

What solution would you like?

This will need some investigation. We can switch off of futures_lite::block_on with our own implementation that minimizes yielding, keeping cores hotter, but effectively spin-waiting for new tasks. This is likely a non-solution for battery bound platforms like mobile, but might net some improvements here, at the cost of higher reported idle CPU usage.

We could alternatively try to upstream or fork a change to async_executor with a different thread wake-up strategy. Our compute workloads tend to batch spawn tasks all at once, so it may be worth the contention directly schedule tasks in batches inside the executor and wake up an appropriate number of threads simultaneously instead of in a cascade.

Avoiding additional contention on the global executor by removing Executor::try_tick is done in #6503, which may improve the lag.

What alternative(s) have you considered?

Eating the perf cost. This really only affects systems with large core counts. However, this is increasingly becoming the norm as seen on Steam's hardware survey, where 40+% of desktop users now have access to a 8+ core machine.

Another alternative is just to parallelize everything we can. Bevy's internal systems aren't very parallel right now with many systems easily bottlenecking further execution. By keeping threads busier, the executor will not have the opportunity to yield to the OS as often, which will naturally eliminate the overhead of yielding and waking up threads.

The text was updated successfully, but these errors were encountered:

hymm · 2022-12-15T05:48:04Z

suddenly tracy is showing me more info about the thread state

blue areas are where the thread is sleeping, brown is the thread waking up, and green is the thread running.

Seems like there's about 10us between the thread waking up and it running the task. Not sure what that is about, might be the time it takes for async executor to find a task to run?

hymm · 2022-12-16T06:54:51Z

added some more spans including some in async executor and it seems like more than half of the green gap between systems running and a thread waking up in the picture above is the context switching overhead. With the rest being the time it takes for async executor to find a task and execute it.

hymm · 2022-12-16T18:25:15Z

looks like rayon does wake multiple threads at the same time
https://github.com/rayon-rs/rayon/blob/master/rayon-core/src/sleep/mod.rs#L333-L338

edit: not sure it's being called to wake multiple threads in the normal flow

farnoy · 2022-12-21T22:03:57Z

Very interesting to see this discussion. This is one of the reasons why I implemented a custom ParallelExecutor using Rayon after it was removed from Bevy. I don't know the numbers are directly comparable, but in my case, the delay between when the first and last system starts executing is at most 10us, usually lower - this is for independent systems.

hymm · 2022-12-23T04:03:54Z

looked into it a bit more. While rayon always calls the new_jobs funtion linked above with num_jobs = 1, it looks like it allows subsequent calls to the function wake up more threads. This is different from async executor where it prevents subsequent calls to the notify function from waking more threads until waking thread is fully awake.

See https://github.com/smol-rs/async-executor/blob/master/src/lib.rs#L499-L502 and also https://github.com/smol-rs/async-executor/blob/master/src/lib.rs#L581

As a quick test, I removed this code and for our par_for_each functions it did wake up all the threads in like 30us. Unsurprisingly this naive change was a net negative as it causes issues with the woken threads not being able to find work or hitting contention. Async executor probably needs a significant rewrite to avoid some of the contention before we can try to wake multiple threads at once.

james7132 · 2024-02-19T01:48:53Z

Rayon's documentation on their sleep module seems to have a more complete picture: https://github.com/rayon-rs/rayon/tree/main/rayon-core/src/sleep#jobs-event-counter. They're definitely tracking more closely the numerical state of the worker pool (i.e. how many pending jobs are there, how many threads are sleeping, etc.), and that drives how aggressively they wake up their threads.

james7132 · 2024-02-19T01:58:19Z

Reading through their documentation, this seems non-trivial to replicate in both Bevy and async_executor. I'm coming around to the idea of trying to have Rayon, or at least rayon-core, underpin our task pools on non-WASM platforms, and let it transparently handle non-async tasks, but have some extra handling for running async tasks in the same thread pools.

james7132 added C-Performance A change motivated by improving speed, memory usage or compile times A-Tasks Tools for parallel and async work S-Needs-Investigation This issue requires detective work to figure out what's going wrong labels Dec 13, 2022

james7132 changed the title ~~Thread wakeup times have a signifgant impact on parallel task execution~~ Thread wakeup times have a significant impact on parallel task execution Dec 13, 2022

james7132 mentioned this issue Feb 19, 2024

Wake up threads more aggressively if there is known work to do smol-rs/async-executor#100

Open

This was referenced Feb 19, 2024

Close performance gap between bevy_tasks and rayon #10064

Open

Use rayon as the bevy_task thread pool #11995

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Thread wakeup times have a significant impact on parallel task execution #6941

Thread wakeup times have a significant impact on parallel task execution #6941

james7132 commented Dec 13, 2022 •

edited

Loading

hymm commented Dec 15, 2022

hymm commented Dec 16, 2022 •

edited

Loading

hymm commented Dec 16, 2022 •

edited

Loading

farnoy commented Dec 21, 2022

hymm commented Dec 23, 2022

james7132 commented Feb 19, 2024

james7132 commented Feb 19, 2024 •

edited

Loading

Thread wakeup times have a significant impact on parallel task execution #6941

Thread wakeup times have a significant impact on parallel task execution #6941

Comments

james7132 commented Dec 13, 2022 • edited Loading

What problem does this solve or what need does it fill?

What solution would you like?

What alternative(s) have you considered?

hymm commented Dec 15, 2022

hymm commented Dec 16, 2022 • edited Loading

hymm commented Dec 16, 2022 • edited Loading

farnoy commented Dec 21, 2022

hymm commented Dec 23, 2022

james7132 commented Feb 19, 2024

james7132 commented Feb 19, 2024 • edited Loading

james7132 commented Dec 13, 2022 •

edited

Loading

hymm commented Dec 16, 2022 •

edited

Loading

hymm commented Dec 16, 2022 •

edited

Loading

james7132 commented Feb 19, 2024 •

edited

Loading