
<execution>: cache thread::hardware_concurrency #1134

Closed
AlexGuteniev opened this issue Aug 3, 2020 · 10 comments · Fixed by #1143
Labels
fixed Something works now, yay! performance Must go faster

Comments

@AlexGuteniev
Contributor

Currently, <execution> does not cache the hardware_concurrency value, as it used to in certain configurations.

GetNativeSystemInfo may be a system call, so calling it on every invocation may defeat the speedup obtained by parallel algorithms.

Consider caching it again.

Note that since the number of CPUs may change at runtime, it may be a good idea to refresh the value after some GetTickCount64 interval.
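A minimal sketch of what such a cache with a timed refresh could look like; the function name, the relaxed atomics, and the 10-second interval are all illustrative assumptions, not the STL's actual implementation:

#include <Windows.h>

#include <atomic>

// Hypothetical cached wrapper around GetNativeSystemInfo with a timed refresh.
unsigned int __cached_hw_concurrency() noexcept {
    static std::atomic<unsigned int> _Cached_count{0};
    static std::atomic<unsigned long long> _Last_refresh{0};
    constexpr unsigned long long _Refresh_interval_ms = 10'000;

    const unsigned long long _Now = GetTickCount64();
    unsigned int _Count = _Cached_count.load(std::memory_order_relaxed);
    if (_Count == 0 || _Now - _Last_refresh.load(std::memory_order_relaxed) > _Refresh_interval_ms) {
        SYSTEM_INFO _Info{};
        GetNativeSystemInfo(&_Info); // the potentially expensive call we want to avoid on every query
        _Count = _Info.dwNumberOfProcessors;
        _Cached_count.store(_Count, std::memory_order_relaxed);
        _Last_refresh.store(_Now, std::memory_order_relaxed);
    }
    return _Count;
}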

@AlexGuteniev
Contributor Author

I don't know whether it should be fixed in _Thrd_hardware_concurrency or in __std_parallel_algorithms_hw_threads, or whether GetTickCount should actually be used (or maybe a thread pool refresh timer should be started instead).

StephanTLavavej added the performance Must go faster label Aug 3, 2020
@BillyONeal
Member

I think we would need a perf benchmark showing this as a substantial problem before we would want to change this; if I understand correctly, GetNativeSystemInfo is implemented by a simple memory read from a memory page mapped into both userspace and the kernel (as read-only in userspace and read/write in the kernel). But that's anecdotal; I've not actually checked.

Back when I did perf optimizations for the parallel algorithms, the cost of GetNativeSystemInfo was never high enough on the chart to care about. But if someone can produce a benchmark suggesting otherwise, I'd be all for it.

@AlexGuteniev
Contributor Author

Profiling has shown that GetNativeSystemInfo always calls ZwQuerySystemInformation and ZwQueryInformationProcess, both of which are kernel calls. The benchmark is just to run this:

#include <Windows.h>

int main() {
    for (;;) { // spin calling GetNativeSystemInfo as fast as possible
        SYSTEM_INFO si{};
        GetNativeSystemInfo(&si);
    }
}

under a profiler and see the OS calls show up as hotspots.
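Without a profiler, a rough way to quantify the same cost is to time the call directly; this timing harness is only a companion sketch, not code from this thread:

#include <Windows.h>

#include <chrono>
#include <cstdio>

// Estimates the per-call cost of GetNativeSystemInfo by timing many iterations.
int main() {
    constexpr int iterations = 1'000'000;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        SYSTEM_INFO si{};
        GetNativeSystemInfo(&si);
    }
    const auto stop = std::chrono::steady_clock::now();
    const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::printf("%.1f ns per GetNativeSystemInfo call\n", static_cast<double>(ns) / iterations);
}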

AlexGuteniev added a commit to AlexGuteniev/STL that referenced this issue Aug 4, 2020
resolves microsoft#1134 (conservatively)
revert to what was there before atomic wait
@CaseyCarter
Member

Note that since the number of CPUs may change at runtime, it may be a good idea to refresh the value after some GetTickCount64 interval.

Should we have a separate issue tracking updating the value so it doesn't get lost in the shuffle once this issue is closed?

@AlexGuteniev
Contributor Author

Should we have a separate issue tracking updating the value so it doesn't get lost in the shuffle once this issue is closed?

I think the decision whether to do this can be made in the current PR; my default take is not to do it.

@AlexGuteniev
Contributor Author

Reasons not to do a timed refresh:

@AlexGuteniev
Contributor Author

What I would do if we want to deal with dynamic CPU count change (see the sketch after this list):

  • Add __set_parallel_algorithms_hw_concurrency, taking an integer
  • A small positive value would set the number of threads
  • An invalid value (say 0 or -1) would reset to GetNativeSystemInfo, thus also performing a refresh
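A rough sketch of how such a setter could interact with the cached value; this was only a proposal in this thread and was never implemented, so the names and behavior below are hypothetical:

#include <Windows.h>

#include <atomic>

namespace {
    // 0 means "no override": fall back to asking the OS.
    std::atomic<unsigned int> __parallel_hw_concurrency_override{0};
}

// Hypothetical setter: a small positive value pins the thread count,
// anything else (e.g. 0 or -1) clears the override and forces a re-query.
void __set_parallel_algorithms_hw_concurrency(const int _Count) noexcept {
    const unsigned int _Value = _Count > 0 ? static_cast<unsigned int>(_Count) : 0;
    __parallel_hw_concurrency_override.store(_Value, std::memory_order_relaxed);
}

// Hypothetical getter used by the parallel algorithms.
unsigned int __parallel_algorithms_hw_threads() noexcept {
    const unsigned int _Override = __parallel_hw_concurrency_override.load(std::memory_order_relaxed);
    if (_Override != 0) {
        return _Override;
    }
    SYSTEM_INFO _Info{};
    GetNativeSystemInfo(&_Info); // re-querying the OS also acts as a refresh
    return _Info.dwNumberOfProcessors;
}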

@BillyONeal
Member

I think anyone who cares about hot-plugging CPUs cares enough to call the platform topology enumeration APIs and similar, since they probably want to support CPUs outside of the default group. As a result, I don't think we should worry about that.

@AlexGuteniev
Contributor Author

I think anyone who cares about hot-plugging CPUs cares enough to call the platform topology enumeration APIs and similar, since they probably want to support CPUs outside of the default group. As a result, I don't think we should worry about that.

And if the group APIs are called before thread::hardware_concurrency, a parallel algorithm may not parallelize anything, since the value may be zero. I guess if we really care about this case, we should at least provide a setter.

@AlexGuteniev
Contributor Author

And if the group APIs are called before thread::hardware_concurrency, a parallel algorithm may not parallelize anything, since the value may be zero. I guess if we really care about this case, we should at least provide a setter.

Decided against it. The thread pool is not actually controlled anyway.

StephanTLavavej pushed a commit that referenced this issue Aug 9, 2020
Fixes #1134.

Co-authored-by: Casey Carter <cartec69@gmail.com>
Co-authored-by: Billy O'Neal <bion@microsoft.com>
StephanTLavavej added the fixed Something works now, yay! label Aug 9, 2020