Use OpenMP-like synchronization patterns in Eigen thread pool #4236
I'll leave this as-is for the moment, as per the current main branch, but it would be interesting to check how this performs on different platforms. I think some mutex + condvar implementations will identify that the waiting threads require the lock that is still held, and defer any work until the lock is released, while others benefit from not sending the notification to a thread that would be unblocked briefly only to block again on the lock acquire.
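For readers following the thread, the two notification orderings being compared can be sketched roughly as below. This is only an illustration using the standard library; `Signal`, `NotifyUnderLock`, `NotifyAfterUnlock`, and `WaitUntilReady` are hypothetical names, and the comment above does not say which ordering the code under review uses.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical shared state for the illustration.
struct Signal {
  std::mutex m;
  std::condition_variable cv;
  bool ready = false;
};

// Variant A: notify while still holding the lock. A woken waiter may have to
// block again immediately on the mutex unless the implementation defers the
// wake-up until the lock is released.
inline void NotifyUnderLock(Signal& s) {
  std::lock_guard<std::mutex> lock(s.m);
  s.ready = true;
  s.cv.notify_one();
}

// Variant B: release the lock first, then notify. The woken waiter can
// usually acquire the mutex without an extra block/unblock cycle.
inline void NotifyAfterUnlock(Signal& s) {
  {
    std::lock_guard<std::mutex> lock(s.m);
    s.ready = true;
  }
  s.cv.notify_one();
}

// Waiter side, identical for both variants. The predicate guards against
// spurious wake-ups and against a notification sent before the wait began.
inline void WaitUntilReady(Signal& s) {
  std::unique_lock<std::mutex> lock(s.m);
  s.cv.wait(lock, [&] { return s.ready; });
}
```

Both variants are correct here because `ready` is always written under the mutex; they differ only in how many scheduler transitions the waiter may go through, which is the platform-dependent part.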
Can you please change the subject of this PR as it'll show up as-is in the commit logs? Thanks!
The log message can be changed when merging the PR.
0eafefc
Description:
This PR updates the thread pool implementation to make work distribution over the Eigen thread pool more closely resemble techniques used in OpenMP. In particular:
(1) A thread entering a parallel loop works on the iterations itself, rather than requiring a thread switch to/from a thread in the pool, if called from outside the thread pool.
(2) To support (1), work items pushed to the thread pool run a loop that claims iterations from a shared counter via atomic fetch-and-add, as opposed to having the work items themselves represent individual batches of iterations. This means that any thread working on the loop can execute any batch of iterations, including having the main thread run through all of the batches itself if the loop turns out to be short-running.
(3) As with OpenMP's active wait policy, the worker loop spins waiting for work before blocking. This avoids the OS blocking / wake-up paths in workloads that consist of a series of short-running parallel sections. The default spinning duration before blocking is measured at around 1 ms.
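A minimal sketch of the scheme in (1) and (2), not the PR's actual code: `ParallelForSketch` and its parameters are hypothetical names, and plain `std::thread` helpers stand in for the Eigen thread pool. Every participant, including the calling thread, runs the same loop and claims batches from one shared counter.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: runs fn(i) exactly once for every i in [0, total).
// Batches of `batch` iterations are claimed from a shared counter with
// fetch_add, so any participating thread can execute any batch.
inline void ParallelForSketch(std::size_t total, std::size_t batch,
                              unsigned num_helpers,
                              const std::function<void(std::size_t)>& fn) {
  if (batch == 0) batch = 1;
  std::atomic<std::size_t> next{0};

  // The loop every participant runs: claim a batch, execute it, repeat
  // until the counter has moved past the end of the iteration space.
  auto claim_and_run = [&]() {
    for (;;) {
      const std::size_t begin = next.fetch_add(batch, std::memory_order_relaxed);
      if (begin >= total) break;
      const std::size_t end = std::min(begin + batch, total);
      for (std::size_t i = begin; i < end; ++i) fn(i);
    }
  };

  // Stand-in for pushing work items to a pool (an assumption of this sketch).
  std::vector<std::thread> helpers;
  helpers.reserve(num_helpers);
  for (unsigned t = 0; t < num_helpers; ++t) helpers.emplace_back(claim_and_run);

  // The calling thread works on the loop itself instead of only waiting.
  claim_and_run();

  for (auto& h : helpers) h.join();
}
```

With `num_helpers` set to 0, or if the helpers start late, the calling thread simply claims every batch itself, which is the short-loop case described in (2). The spin-then-block behaviour in (3) is not modelled here.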
Performance tests on a 32-vCPU VM for CPU inference workloads show performance broadly similar to OpenMP builds: across 143 models, the p50 result is a 12% improvement and the p80 result is a 28% improvement.
Motivation and Context
The PR aims to simplify the configuration of threading in ORT by providing performance consistent with OpenMP-based parallelism.