macOS Sierra extreme busy wait #1787
Comments
Unfortunately, I can't replicate this. @sylvanc, you are on Sierra; can you try to replicate? Anyone else using Sierra who can give this a try?
@Jonke which version of Sierra do you have specifically?
@SeanTAllen After changing
And before the change, @agarman?
25% with default slop. 300% with slop of 1.
@SeanTAllen At the time of the report it was macOS Sierra 10.12.3. However, I have since retested on 10.12.4 (`$ uname -a`) with slop 0, slop 1, and slop 20.
Lowering slop values should greatly increase CPU usage. How are you getting CPU usage? Is that a single moment in time? Is it consistently using that much CPU?
When I tested, it was non-stop 300% CPU utilization.
With a standard slop value, @agarman?
@SeanTAllen I reproduce the same results as @Jonke.
On macOS Sierra `nanosleep(0)` is now a no-op, which can cause extreme busy waiting and high CPU usage. Fixes ponylang#1787
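The no-op claim is easy to check in isolation. The following standalone C test (a sketch for this thread, not part of ponyc) times a batch of zero-duration `nanosleep` calls; on a platform where the call is a no-op it completes almost instantly, while a platform that actually reschedules will take noticeably longer:

```c
#include <stdio.h>
#include <time.h>

// Time a batch of zero-duration nanosleep calls. If nanosleep(0) is a
// no-op (as reported on Sierra), this finishes almost instantly; if the
// kernel actually reschedules or sleeps, it takes noticeably longer.
int main(void)
{
  struct timespec zero = {0, 0};
  struct timespec start, end;

  clock_gettime(CLOCK_MONOTONIC, &start);

  for(int i = 0; i < 100000; i++)
    nanosleep(&zero, NULL);

  clock_gettime(CLOCK_MONOTONIC, &end);

  double secs = (end.tv_sec - start.tv_sec) +
                (end.tv_nsec - start.tv_nsec) / 1e9;
  printf("100000 nanosleep(0) calls took %.3f s\n", secs);
  return 0;
}
```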
The suspicion here is that kqueue on Sierra (and maybe El Capitan) is firing spurious events. We have no proof of that, but we think we should look into it.
I've had a look again and kqueue isn't firing any spurious events. The kevent call here only returns once or twice every 2 seconds, with a single event every time. Using higher-frequency CPU usage tracking confirms this (see screenshot): the process is at 100% for just under half a second, then shows some intermittent 5-10% usage for a second, then drops to zero. This is definitely what I'd expect from looking at the function. Like I mentioned on the PR,
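For readers unfamiliar with the kqueue API being discussed, here is a minimal standalone sketch (not ponyc's ASIO code) of a blocking `kevent` wait on a 2-second timer; the healthy behavior described above is one wakeup per period:

```c
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
  int kq = kqueue();
  if(kq == -1) { perror("kqueue"); return 1; }

  // Register a timer that fires every 2000 ms.
  struct kevent change;
  EV_SET(&change, 1, EVFILT_TIMER, EV_ADD | EV_ENABLE, 0, 2000, NULL);
  if(kevent(kq, &change, 1, NULL, 0, NULL) == -1) { perror("kevent"); return 1; }

  struct kevent event;
  for(int i = 0; i < 3; i++)
  {
    // Blocks until the timer fires; a healthy kqueue returns here roughly
    // once per period with one event, rather than waking up spuriously.
    int n = kevent(kq, NULL, 0, &event, 1, NULL);
    printf("kevent returned %d event(s)\n", n);
  }

  close(kq);
  return 0;
}
```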
I see similar high CPU usage on pre-Sierra:
With the following ponyc:
I compiled and ran the following modified program:
Run output on OS X:
Same program run in an Ubuntu 16.04 VM on the same OS X host results in the following:
One similar timing-related issue from the Apple Darwin mailing lists is: https://lists.apple.com/archives/darwin-kernel/2007/Feb/msg00031.html. The thread seems to come to the conclusion that the extra busy-waiting-related CPU usage is due to the `nanosleep(0)` behavior on OS X. Assuming others agree that this is the root cause, the solution would be to sleep for a specific amount of time in the scheduler's pause logic instead of sleeping for 0.
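A minimal sketch of that proposal (the 100 µs value is illustrative only, not a proposal for the actual constant):

```c
#include <time.h>

// Instead of nanosleep with a zero timeout (which Sierra appears to treat
// as a no-op), always request an explicit, small duration.
static void pause_briefly(void)
{
  struct timespec ts = {0, 100000}; // 100 microseconds; illustrative
  nanosleep(&ts, NULL);
}
```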
Can the nanosleep value used be calibrated at runtime? Otherwise hardcoded values will become bugs in later OS updates.
@agarman I'm not sure I understand your comment/concern. So far, I think we've been relying on the implicit behavior of calling `nanosleep` with a value of 0, where it's up to the OS how long we actually wait. I don't see how this would become a bug in a later OS update, since we're just explicitly saying that we want to wait for a specific amount of time. Well, the above is my understanding, but maybe I'm missing something?
@dipinhora -- thanks for the clarification. If I understand correctly, the solution would no longer use `nanosleep 0`, but instead any OS specific durations?
@agarman I'm not sure what you mean by "any OS specific durations", but my suggestion is that we move away from using implicit waits (`nanosleep` with 0) in favor of explicit wait times that behave the same across platforms.
Got it. Sounds reasonable. FWIW erts looks like it uses sleep 0, but I'm just a tourist wrt ponyc or erts C internals.
Would we take any significant performance hit if we made the sleep time configurable via a runtime option?
Trying to explain how that interacts with things could be seriously difficult and black magic. I'm always wary of options that are basically magic.
Is it black magic? If I'm understanding it correctly, the concept seems fairly simple. It seems analogous to a global "slop" option, where you can increase the number for less CPU usage and decrease it for more precise timings, with diminishing returns in either direction if you stray too far from the default.
@jemc you are referring to the values in the CPU pause code? If yes, then I think it's black magic, because it's called as part of quiescence detection, and changing that value can have a large impact on performance and work stealing.
I'm somewhere in the middle. The following are my -$0.02 8*) I agree with @SeanTAllen that we have to be careful, since this is directly related to work stealing and other scheduler internals, and it's not very easy to understand or explain the impact. On the other hand, I think having it be a configurable option for advanced users is acceptable, and no more dangerous than some of the thread-related options. This assumes that we'll choose a sensible default that we would otherwise be hardcoding as a compile-time constant.
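For illustration, such an option might look like the sketch below. The `--ponyminsleep` flag name and all of the plumbing are hypothetical, invented here to make the discussion concrete; ponyc has no such option:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

// Hypothetical minimum sleep duration (nanoseconds) for the scheduler's
// idle pause. Both the default and the flag below are invented for
// illustration.
static uint64_t min_sleep_ns = 100000; // 100 us, illustrative default

// Parse a hypothetical "--ponyminsleep=<nanoseconds>" argument.
static void parse_min_sleep(int argc, char** argv)
{
  for(int i = 1; i < argc; i++)
  {
    if(strncmp(argv[i], "--ponyminsleep=", 15) == 0)
      min_sleep_ns = strtoull(argv[i] + 15, NULL, 10);
  }
}

// The idle pause would then use the configured duration.
static void pause_idle_thread(void)
{
  struct timespec ts;
  ts.tv_sec = (time_t)(min_sleep_ns / 1000000000);
  ts.tv_nsec = (long)(min_sleep_ns % 1000000000);
  nanosleep(&ts, NULL);
}
```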
I also observe this behaviour on Ubuntu 16.04.
@adam-antonik Can you please confirm some details? It seems surprising to me that Ubuntu 16.04 has the same behavior and I'd like to investigate a bit if you're able to provide the following information:
@dipinhora Requested details below. The difference from your output on OS X is that we don't use as much user time, but sys is still too high.
0 real 0m23.083s
@adam-antonik Thanks for the quick response. Based on my limited knowledge of Linux and its internal clocksources and timing subsystem, I don't think the behavior you're seeing is an issue with Pony itself, but a side effect of how Pony relies on frequent `nanosleep` calls combined with the timing subsystem in your environment. Additionally, my understanding is that all Linux timing-related system overhead is tied to the current clocksource that the kernel is using, and not all clocksources are created equal. Some are very low overhead (tsc) while others have much higher overhead (acpi_pm, I think). There's some information about clocksources in the accepted answer at https://unix.stackexchange.com/questions/164512/what-does-the-change-of-the-clocksource-influence and the links it provides. Can you confirm the output of `cat /sys/devices/system/clocksource/clocksource0/current_clocksource` and of `cat /sys/devices/system/clocksource/clocksource0/available_clocksource`?
Are you running on bare metal or in a VM of some kind? If a VM, it may not be relying on a hardware clocksource but instead a software-emulated one, and that could have a large system overhead (I've personally seen this sort of overhead impact a latency-sensitive application when the VM's clocksource was software emulated).
This is caused by work-stealing. The problem would be to make work stealing nicer in this scenario without impacting performance during high loads. Sylvan and I are poking at it, but we've poked at it before.
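To make the work-stealing connection concrete, here is a simplified sketch of the shape of the scheduler's idle loop (not ponyc's actual code; `cycle_counter` and `try_steal_from_peers` are stand-ins for runtime internals, and the sleep duration is illustrative):

```c
#include <stdint.h>
#include <stddef.h>
#include <time.h>

typedef struct actor_t actor_t;

extern uint64_t cycle_counter(void);        // e.g. a TSC read (assumed helper)
extern actor_t* try_steal_from_peers(void); // assumed runtime internal

static void core_pause(uint64_t tsc, uint64_t tsc2)
{
  // 10m cycles is about 3ms: spin for short idle periods, sleep for
  // longer ones. How long to sleep here is exactly the tuning being
  // discussed in this thread.
  if((tsc2 - tsc) < 10000000)
    return;

  // If this sleep is ever effectively zero, the loop below degenerates
  // into a busy wait.
  struct timespec ts = {0, 100000}; // 100 us; illustrative value
  nanosleep(&ts, NULL);
}

static actor_t* steal(void)
{
  uint64_t tsc = cycle_counter();
  actor_t* actor;

  // With nothing to run, a scheduler thread loops here, attempting to
  // steal work and pausing between attempts. A timer that fires every
  // couple of seconds keeps re-entering this loop, so a zero-length
  // pause shows up as sustained CPU usage.
  while((actor = try_steal_from_peers()) == NULL)
  {
    uint64_t tsc2 = cycle_counter();
    core_pause(tsc, tsc2);
  }

  return actor;
}
```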
@SeanTAllen I believe you're referring to this issue/ticket in general (i.e. the busy waiting related to the internal scheduler thread timing for work stealing) and not specifically to the unusually high sys time above? I don't believe @adam-antonik's issue is related to the work-stealing busy wait, but rather to something in the timing subsystem in his environment, as I tried to explain in my previous comment.
I'm referring to the issue in general. There may be other factors as well.
I'm a Pony newbie, but I have a long and usually-fruitful habit of poking at misbehaving processes with DTrace. I've had to make a copy of that script and then add the following:
It still occasionally loses events ... check its output for errors. For example, running:
That is an extraordinary number of system calls for a tiny app with a single timer. A very naive experiment suggests that each of those system calls contributes measurably to the CPU usage. I've put a raw DTrace output file at
... which prints the following. The "small" category events are merely counted, since most of them report elapsed time as 0 or 1 microseconds (too much rounding error lurking there, I think).
Update: @SeanTAllen found a lovely little bug in the scheduler's CPU pause code.
Fixes #1787

Interestingly, all the info needed to solve this issue a while ago was already in the issue, but it wasn't until @slfritchie put his additional comments in #1787 (comment) that it all clicked for me.

The excess CPU time is from us doing too much work stealing. In a normal scenario with nothing to do, we wouldn't do anything for a long time and we'd end up sleeping for quite a while. With the timer that goes off every few seconds as seen in the issue, that isn't what happens. We regularly get woken and end up in a work stealing cycle. Then, due to the lack of an `else` block for yielding, on OSX we'd nanosleep for 0, which is the same as an immediate return. To see what the impact of that would be on any platform, change:

```c
// 10m cycles is about 3ms
if((tsc2 - tsc) < 10000000) return;
```

to

```c
// 10m cycles is about 3ms
if((tsc2 - tsc) < 1000000000) return;
```

This is effectively what we were running. That's a lot more work-stealing. And note the increased CPU usage. The reason this was happening more on OSX is that on Linux, nanosleep 0 will sleep for at least a bit. Here we remove the variability and do a small nanosleep that will be the same across all platforms.
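The shape of the fix, as a simplified sketch (the 10m-cycle threshold is quoted from the commit message above; the other thresholds and durations are illustrative, not the literal patch): make sure the pause path always sleeps for a nonzero duration.

```c
#include <stdint.h>
#include <time.h>

static void core_pause_sketch(uint64_t tsc, uint64_t tsc2)
{
  // 10m cycles is about 3ms: for short idle periods, just spin.
  if((tsc2 - tsc) < 10000000)
    return;

  struct timespec ts = {0, 0};

  // Scale the sleep with idle time (values illustrative).
  if((tsc2 - tsc) > 1000000000)
    ts.tv_nsec = 1000000; // idle a long time: sleep 1 ms
  else
    ts.tv_nsec = 100000;  // the fix: a small, platform-independent sleep
                          // instead of leaving ts at {0, 0}, which OSX
                          // treats as an immediate return (a busy wait)

  nanosleep(&ts, NULL);
}
```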
Brought over from #1625
Reported by @Jonke.
I was trying out the Timer class, running an exact copy of the version in the doc: http://www.ponylang.org/ponyc/time-Timer/#timer
Very high proportional CPU usage on macOS Sierra.
```
$ ponyc --version
0.11.0 [release]
compiled with: llvm 3.9.1 -- Apple LLVM version 8.0.0 (clang-800.0.42.1)
```

```
$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin16)
Copyright (C) 2007 Free Software Foundation, Inc.
```

Installed via homebrew.
Built with every tag from 0.4 up to HEAD, with both llvm 3.8 and 3.9, and could not see any difference with the timer example from the time package.
However, if one changes the slop value it has a significant effect on the CPU usage: a slop of 1 gave me a CPU load of 200%, a slop value of 40 gave 0.2%, but then of course the timer doesn't work as expected.
The default slop of 20 gives ~25% CPU.
2.7 GHz Intel Core i5