runtime: reduce scheduling overhead on large ppc64 systems #16476
Hi Bishop, It would also be useful to see request latency and CPU load. It may also make sense to benchmark 4 steal iterations + mayhavework. Changing this condition can also have a profound effect:

```go
if !_g_.m.spinning && 2*atomic.Load(&sched.nmspinning) >= procs-atomic.Load(&sched.npidle) {
	goto stop
}
```

I chose the threshold with blind intuition, and my desktop had only 16 hw threads at the time.
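To make the effect of that gate concrete, here is a minimal, self-contained sketch of the condition's behavior (`shouldSpin` and its parameters are illustrative stand-ins, not runtime names; the real code reads these counters from `sched`): a non-spinning M refuses to become a spinner once the number of spinning Ms reaches half the number of busy Ps.

```go
package main

import "fmt"

// shouldSpin reports whether a non-spinning M entering findrunnable
// should transition to spinning, mirroring the quoted condition:
// spin only while 2*nmspinning < procs-npidle (i.e. busy Ps).
func shouldSpin(alreadySpinning bool, nmspinning, procs, npidle int32) bool {
	if !alreadySpinning && 2*nmspinning >= procs-npidle {
		return false // enough spinners already; park instead
	}
	return true
}

func main() {
	// 120 Ps with 100 idle leaves 20 busy Ps, so at most 10 Ms spin.
	fmt.Println(shouldSpin(false, 9, 120, 100))  // true  (18 < 20)
	fmt.Println(shouldSpin(false, 10, 120, 100)) // false (20 >= 20)
}
```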
Hi Dmitry, You are right that I had simply assumed that reducing cumulative runtime in `findrunnable` would translate directly into better performance.

Of the 2 proposals, proposal 2), introducing `mayhavework`, is the one I consider more important.

I'm glad you mentioned the "don't spin" heuristic above. I'm puzzled at how this reduces CPU time, since threads branching to `stop` still recheck all of the run queues before parking.

I have in fact created a highly modified scheduler that keeps one thread spinning permanently and sends all others immediately to park. This scheduler delivers a solid 5+% throughput boost for our application at the expense of burning a hardware thread. I wouldn't suggest this as a general solution, but I could see us offering something like this to our clients under certain conditions.

For today though, I would ask you to consider these proposals based on first principles, even with the knowledge that decreasing the amount of time spent in work-stealing may slightly drop performance by missing out on chance encounters with work that was not present when the stealing loop was first entered. My own outlook is probably colored by having spent many years working on server energy efficiency, and I understand that not everyone is willing to make this kind of trade.
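For readers curious what the "one permanent spinner" shape looks like, here is a toy model of the policy, not the actual modified scheduler: the real variant busy-polls and burns a hardware thread, whereas this model blocks on a channel for simplicity. One goroutine is the sole work-finder, and every other idle worker parks immediately instead of stealing.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	tasks := make(chan int, 16) // stands in for all of the run queues
	handoff := make(chan int)   // parked workers wait here, never steal
	var wg sync.WaitGroup

	// The dedicated "spinner": the only thread that ever polls for
	// work. (The real variant busy-polls rather than blocking.)
	go func() {
		for t := range tasks {
			handoff <- t // wake exactly one parked worker per task
		}
		close(handoff)
	}()

	// Ordinary workers park immediately instead of work-stealing.
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for t := range handoff {
				fmt.Printf("worker %d ran task %d\n", id, t)
			}
		}(w)
	}

	for i := 0; i < 8; i++ {
		tasks <- i
	}
	close(tasks)
	wg.Wait()
}
```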
Isn't it a 2.9/1.8% TPS decrease?
Good question. As you noted, it can be suboptimal when we have lots of Ps and lots of churn in `findrunnable`. Maybe we could try to pop a G and obtain a P in the second check; if both succeed, we can proceed directly to running the G (return from `findrunnable`). Or maybe we should let each P check all queues once, then all except 1 P proceed to block and 1 is left spinning for longer. This requires careful thought, experimentation, and extensive benchmarking. I would suggest also testing this on the http benchmark from golang.org/x/benchmarks. It is not super realistic, but it is at least easy to set up and run, and it reports latency.
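Here is a sketch of the first idea with simplified stand-ins for the global run queue and the idle P list (hypothetical helper shapes, not the actual runtime code; the real change would need to hold `sched.lock`): the final recheck takes both a G and a P, and only then returns from `findrunnable`.

```go
package main

import "fmt"

type g struct{ id int }
type p struct{ id int }

var (
	globrunq []*g // stand-in for the global run queue
	pidle    []*p // stand-in for the idle P list
)

// recheck tries to take both a runnable G and an idle P; on success
// the caller runs the G directly instead of merely retrying.
func recheck() (*p, *g) {
	if len(globrunq) == 0 {
		return nil, nil // nothing to run; park
	}
	gp := globrunq[len(globrunq)-1]
	globrunq = globrunq[:len(globrunq)-1]

	if len(pidle) == 0 {
		globrunq = append(globrunq, gp) // got a G but no P; put it back
		return nil, nil
	}
	pp := pidle[len(pidle)-1]
	pidle = pidle[:len(pidle)-1]
	return pp, gp
}

func main() {
	globrunq = []*g{{id: 1}}
	pidle = []*p{{id: 0}}
	pp, gp := recheck()
	fmt.Printf("acquired p%d to run g%d\n", pp.id, gp.id)
}
```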
Regarding the slight throughput loss: yes, I am arguing for changes that will likely increase latency / reduce performance slightly in some cases, the argument being that a small part of the current performance is "accidental", due to spending longer than necessary spinning.

I have been looking at the HTTP benchmark for a couple of days. This benchmark is quite different from our application; for example, threads entering the stealing loop have about a 90% chance of stealing work in the HTTP benchmark, but only about 20% in our application. The results are noisy - run-to-run variation is on the same order (low single-digit percentages) as the differences we're trying to observe. However, if the data is real, then in general the things I've proposed seem to reduce CPU time slightly for large GOMAXPROCS and seem to increase latency slightly in the HTTP benchmark. The work-stealing "hit rate" goes from 86.3% for the current scheduler to 84.8% with 2 loops and the `mayhavework` check.

I agree that it would be useful to see a comprehensive study of scheduling heuristics on a large number of Go applications. Unfortunately I don't have a mandate to undertake that kind of study myself.
4 iterations were introduced in the first revision of the distributed scheduler. At that time we did not have the good iteration algorithm that covers all Ps in one pass, and we had lots of much more important problems to solve. There is no particular theory behind exactly 4 iterations.
@randall77 : The sharing of the `mayhavework` cache lines between Ps is a fair concern.
The typical case as I see it, in a heavily loaded system: a P is queued on the idle list (one write to clear its entry), later taken off the idle list to run work (a second write to set it), and in between its entry may be read once by a searching thread.

So we write mayhavework twice and read it once. That would argue for making writes fast. The other argument for making writes fast is that writes are on the critical path to getting work done; reads are done by otherwise idle Ps. Or does mayhavework indicate having work that is currently not being executed (that is, the P has 2+ goroutines assigned to it)? I'm not sure if that changes my math.
All of the numbers I quoted above are from an implementation where `mayhavework[i]` simply tracks whether the P is on the idle list. I started with more aggressive implementations where the flag tried to track the actual state of the P's run queue. I think the unknown in the more aggressive context is how often the question "does this P have work?" is answered yes vs. no. In the system I was working in that led to this issue, the answer was almost always "no" - most of the Ps were always idle and no longer updating their entries, but they were being read multiple times by each thread looking for work.
@randall77 if we distribute mayhavework per P, then we might as well just use runq instead of mayhavework to understand whether a P has work or not, and then we get exactly what we have now.
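Concretely, with per-P flags the check collapses into reading the victim P's own queue state, roughly what `runqempty` in proc.go already does. A simplified sketch (the real `runqempty` is more careful about races with `runqput`):

```go
package main

import "fmt"

// p carries just the queue fields relevant to the emptiness check;
// in the runtime, runnext is a guintptr where 0 means "no G parked".
type p struct {
	runqhead, runqtail uint32
	runnext            uint32
}

// runqempty reports whether pp has nothing to steal. Reading these
// fields touches the victim's cache line, which is exactly the remote
// traffic a centralized mayhavework array is meant to avoid.
func runqempty(pp *p) bool {
	return pp.runqhead == pp.runqtail && pp.runnext == 0
}

func main() {
	fmt.Println(runqempty(&p{})) // true: nothing to steal
}
```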
Any progress here? @aclements, do you have input?
I believe runtime scheduler overhead is something Bishop and others at IBM will be looking at more for Go 1.9.
Can this be closed? My memory is that we agreed that maybe runtime/internal/atomic can have a custom atomic used in the one line identified above, if it makes such a big difference, but that we wouldn't expose that same semantics under any name in sync/atomic. Maybe this has already been done, or maybe other optimizations have removed the hot spot?
/cc @laboger
Sorry, I don't seem to get these notifications. @ceseo is working on a change for Go 1.12 to add atomics for load-acquire and store-release to runtime/internal/atomic. There are several functions in the runtime package with comments as to where these can be used; maybe there are others.
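As a rough illustration of the publication pattern these weaker atomics serve (runtime/internal/atomic is not importable from user code, so this sketch uses sync/atomic, which is sequentially consistent and therefore stronger, and on POWER more expensive, than load-acquire/store-release):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

var (
	runq [256]uint32 // stand-in for a P's local run queue
	tail uint32
)

// produce publishes one element: write the slot, then store the new
// tail. The tail store only needs release semantics (StoreRel in the
// runtime) so that a consumer seeing the new tail also sees the slot.
func produce(v uint32) {
	t := atomic.LoadUint32(&tail)
	runq[t%256] = v
	atomic.StoreUint32(&tail, t+1)
}

// consume reads the tail before the slots it covers; the tail load
// only needs acquire semantics (LoadAcq in the runtime).
func consume(head uint32) (uint32, bool) {
	t := atomic.LoadUint32(&tail)
	if head == t {
		return 0, false
	}
	return runq[head%256], true
}

func main() {
	produce(7)
	fmt.Println(consume(0)) // 7 true
}
```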
Thanks to everyone for your attention on this issue. Although I haven't been working with Golang for a couple of years now, my recollection is that I was never able to find anything that was generally better (worthwhile to implement) than the current scheduler algorithms. If IBM is able to get more efficient atomics for POWER implemented, that will help a little. People concerned about this issue can use GOMAXPROCS to limit resource utilization.
Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (`go version`)?
Go 1.7beta2

What operating system and processor architecture are you using (`go env`)?
X86 (Ivy Bridge) - Ubuntu 16.04
POWER8 - Ubuntu 15.10
This issue/proposal is a bit long, and likely only of interest to readers concerned with goroutine scheduling.
I work on the Hyperledger fabric (https://github.com/hyperledger/fabric), a large Go application implementing a permissioned blockchain. As part of this work I have observed what I would call "excessive" amounts of cumulative time spent in `runtime.findrunnable` when running on large systems with GOMAXPROCS defaulting to the number of available CPUs. In the following I assume the reader is familiar with the `findrunnable` routine in `proc.go`.

Drilling down into a `findrunnable` profile, the obvious culprit is seen to be the work-stealing loop. This loop is inefficient on large systems for several reasons (a simplified sketch of the loop's shape follows the list):

1. "Spinners" poll the system 4 times while holding a P, and all threads poll once again after releasing their P.
2. The stealing loop checks for stealable work from all Ps, including Ps that have no possibility of having any work to steal. The atomic operations used to load the queue pointers in `runqgrab` require synchronization primitives on some architectures, and a subroutine call overhead on all architectures. This global polling is disruptive in an SMP-coherence sense, since the poller must pull cache lines from around the system in order to examine only a few fields of each line. The randomized polling order also defeats the hardware's prefetching heuristics.
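Here is that simplified sketch, a toy model with assumed types rather than the real proc.go code, showing why the cost scales with GOMAXPROCS: four randomized passes over every P, with the `runnext` slot considered only on the final pass, and each probe touching a remote queue even when the victim is empty.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

type g struct{ id int }

type runqueue struct {
	mu      sync.Mutex
	gs      []*g
	runnext *g
}

// trySteal models runqgrab, much simplified: take one G from the
// victim (the real code takes half its queue), and on the final pass
// also raid its runnext slot.
func (q *runqueue) trySteal(stealRunnext bool) *g {
	q.mu.Lock()
	defer q.mu.Unlock()
	if n := len(q.gs); n > 0 {
		w := q.gs[n-1]
		q.gs = q.gs[:n-1]
		return w
	}
	if stealRunnext && q.runnext != nil {
		w := q.runnext
		q.runnext = nil
		return w
	}
	return nil
}

// stealLoop mirrors the loop's shape: 4 full randomized passes over
// all Ps, so with GOMAXPROCS=120 an unlucky spinner makes hundreds of
// remote cache-line probes before giving up.
func stealLoop(ps []*runqueue, myID int) *g {
	for attempt := 0; attempt < 4; attempt++ {
		for _, victim := range rand.Perm(len(ps)) {
			if victim == myID {
				continue
			}
			if w := ps[victim].trySteal(attempt == 3); w != nil {
				return w
			}
		}
	}
	return nil
}

func main() {
	ps := make([]*runqueue, 8)
	for i := range ps {
		ps[i] = &runqueue{}
	}
	ps[5].runnext = &g{id: 42}
	if w := stealLoop(ps, 0); w != nil {
		fmt.Println("stole g", w.id)
	}
}
```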
Regarding 1): I understand why it is good to poll at least twice - first for easy pickings from the local run queues, and then a second pass that includes the longer-latency `runnext` stealing. It occurred to me that perhaps 4 loops were used in Go 1.6 because the randomization there was not guaranteed to visit every P, so polling 4X increased the odds of looking at every local queue. Now that this has been fixed in Go 1.7, polling more than twice is arguably unnecessary. The polling with `runnext` grabs included is so thorough that once this loop is finished there is no a priori reason to expect that another pass will bear fruit.

Regarding 2): Note that the answer to the question "Could this P possibly have any work to steal?" can be efficiently centralized, since the answer is modified relatively rarely but observed relatively often. I've created a modified scheduler that includes a global array called `mayhavework`, indexed by the id of a P. Currently, `mayhavework[i]` is `false` whenever a P is queued in the list of idle Ps, and `true` otherwise. More aggressive update protocols are also possible, but this simple protocol is sufficient to illustrate the benefit.

Setting/clearing `mayhavework[i]` adds a small overhead to the queue management of idle Ps, as well as a test during polling loops. Note that the polling loop in the "delicate dance" already includes what appears to be a redundant guard of `allp[i] != nil`, which is not made by the work-stealing loop. A sketch of this protocol appears below.
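A self-contained sketch of that simple protocol (names taken from the proposal, types simplified; the real change would hook the idle-P list management in proc.go):

```go
package main

import "fmt"

const nprocs = 8

type p struct {
	id   int
	runq []int // stand-in for the real run queue
}

var (
	mayhavework [nprocs]bool // indexed by P id; false while P is idle
	idle        []*p
)

// pidleput models queuing an idle P: clear its flag first, since an
// idle P cannot have stealable work.
func pidleput(pp *p) {
	mayhavework[pp.id] = false
	idle = append(idle, pp)
}

// pidleget models taking a P off the idle list: set its flag.
func pidleget() *p {
	if len(idle) == 0 {
		return nil
	}
	pp := idle[len(idle)-1]
	idle = idle[:len(idle)-1]
	mayhavework[pp.id] = true
	return pp
}

// stealable reports which victims are worth probing: a cheap local
// check that replaces a remote cache-line access for each idle P.
func stealable(ps []*p) (victims []int) {
	for _, pp := range ps {
		if mayhavework[pp.id] {
			victims = append(victims, pp.id)
		}
	}
	return victims
}

func main() {
	ps := make([]*p, nprocs)
	for i := range ps {
		ps[i] = &p{id: i}
		pidleput(ps[i]) // everything starts idle
	}
	pidleget() // one P becomes active
	fmt.Println("worth probing:", stealable(ps))
}
```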
Here are some results for an example Hyperledger fabric benchmark running on a 4-socket X86 Ivy Bridge server with 120 hardware threads. These examples are for illustration only and are not claimed to be exhaustive; the arguments for the proposal should be valid from first principles. Performance (throughput) of the server is measured in transactions per second (TPS). Cumulative profile percentages were reported by the Golang `net/http/pprof` profiling service running in the application. Results for GOMAXPROCS equal to 12 and 120 (the default) are presented.

The full proposal has no effect on `findrunnable` overhead or performance on this system with GOMAXPROCS=12. However, I have also run the experiment on a POWER8 server and observed a reduction in `findrunnable` overhead from 14.5% to 9.4% on that system with GOMAXPROCS=12. This may be due to the fact that `atomic.Load` includes a synchronization instruction on POWER.

For the full system there is a significant reduction in scheduling overhead. It is not clear whether the slight performance drop with 2 stealing loops only is real or due to experimental variation. In a number of experiments (on POWER8) I have seen what I believe are small, real performance increases and decreases from these modified heuristics, varying with the particular benchmark.
To summarize the proposal:

1. Only poll twice in the work-stealing loop;
2. Implement an efficient centralized data structure that records which Ps might possibly have any work to steal.
Bishop Brock