
macOS sierra extreme busy wait #1787

Closed
SeanTAllen opened this issue Mar 30, 2017 · 32 comments

@SeanTAllen
Member

Brought over from #1625

Reported by @Jonke:
I was trying out the Timer class, running an exact copy of the version in the doc: http://www.ponylang.org/ponyc/time-Timer/#timer

use "time"

actor Main
  new create(env: Env) =>
    let timers = Timers
    let timer = Timer(Notify(env), 5_000_000_000, 2_000_000_000)
    timers(consume timer)

class Notify is TimerNotify
  let _env: Env
  var _counter: U32 = 0
  new iso create(env: Env) =>
    _env = env
  fun ref apply(timer: Timer, count: U64): Bool =>
    _env.out.print(_counter.string())
    _counter = _counter + 1
    true

Very high proportional CPU usage on macOS Sierra.

$ ponyc --version
0.11.0 [release]
compiled with: llvm 3.9.1 -- Apple LLVM version 8.0.0 (clang-800.0.42.1)
$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin16)
Copyright (C) 2007 Free Software Foundation, Inc.

Installed via Homebrew.

Built with every tag from 0.4 up to HEAD with both LLVM 3.8 and 3.9 and could not see any difference with the timer example from the time package.

However, if one changes the slop value it has a significant effect on CPU usage: slop = 1 gave me a CPU load of 200%, slop = 40 gave 0.2%, but then of course the timer doesn't work as expected.
The default slop = 20 gives ~25% CPU.

2.7 GHz Intel Core i5

@SeanTAllen SeanTAllen added the "bug: 1 - needs investigation" and "triggers release" (Major issue that when fixed, results in an "emergency" release) labels Mar 30, 2017
@SeanTAllen
Member Author

I unfortunately can't replicate this.

@sylvanc you are on Sierra, can you try to replicate?

Anyone else using Sierra who can give this a try?

@SeanTAllen
Member Author

@Jonke which version of Sierra do you have specifically?

@agarman

agarman commented Apr 2, 2017

@SeanTAllen After changing `let timers = Timers` to `let timers = Timers(1)` the test program took 300% CPU.

$ ponyc --version
0.12.3 [release]
compiled with: llvm 3.9.1 -- Apple LLVM version 8.0.0 (clang-800.0.42.1)
$ uname -a
Darwin Andrews-MacBook-Pro.local 16.5.0 Darwin Kernel Version 16.5.0: Fri Mar  3 16:52:33 PST 2017; root:xnu-3789.51.2~3/RELEASE_X86_64 x86_64

@SeanTAllen
Member Author

And before the change, @agarman?

@agarman

agarman commented Apr 2, 2017

25% with default slop. 300% with slop of 1.

@Jonke

Jonke commented Apr 4, 2017

@SeanTAllen At the time of the report, macOS Sierra 10.12.3.

However, on 10.12.4:
Pony built from source
../../ponyc/build/release/ponyc --version
0.12.3-4eb7bb1 [release]
compiled with: llvm 3.9.1 -- Apple LLVM version 8.0.0 (clang-800.0.42.1)

$ uname -a
Darwin Jonas-MBP 16.5.0 Darwin Kernel Version 16.5.0: Fri Mar 3 16:52:33 PST 2017; root:xnu-3789.51.2~3/RELEASE_X86_64 x86_64

slop 0
PID COMMAND %CPU TIME #TH #WQ #PORT MEM PURG CMPRS PGRP
6322 ponyplay 199.4 00:44.20 4/2 0 16 748K+ 0B 0B 632

slop 1
6324 ponyplay 197.6 01:03.65 4/2 0 16 720K+ 0B 0B 6324

slop 20
32 ponyplay 37.6 00:04.86 4 0 16 648K+ 0B 0B 6332

@SeanTAllen
Member Author

Lowering slop values should greatly increase CPU usage.

How are you getting CPU usage? Is that a single moment in time? Is that consistently using that much CPU?

@agarman

agarman commented Apr 5, 2017

When I tested it was non-stop 300% CPU utilization.

@SeanTAllen
Member Author

With a standard slop value @agarman?

@agarman

agarman commented Apr 5, 2017

@SeanTAllen I reproduce the same results as @Jonke:
At default slop (20): 15-25% CPU
At slop (1): 300% CPU

plietar added a commit to plietar/ponyc that referenced this issue Apr 7, 2017
On macOS Sierra nanosleep(0) is now a no-op, which can cause extreme
busy waiting and high CPU usage.

Fixes ponylang#1787
@SeanTAllen
Member Author

SeanTAllen commented Apr 12, 2017

Suspicion here is that kqueue on Sierra (and maybe El Capitan) is firing spuriously. We have no proof of that, but think we should look into it.

@SeanTAllen SeanTAllen added the "priority: 1 - low" label and removed the "triggers release" (Major issue that when fixed, results in an "emergency" release) and "needs discussion during sync" labels Apr 12, 2017
@plietar
Contributor

plietar commented Apr 12, 2017

I've had a look again and kqueue isn't firing any spurious events. The kevent call here only returns once or twice every 2 seconds, with a single event each time.

Back to ponyint_cpu_core_pause: for the first billion cycles after the thread has woken up, it will busy wait for a message before going back to longer sleeps. That's about half a second of busy waiting after each timer trigger (at a 2 GHz clock speed). With a 2 second timer interval, that's 25% CPU usage on average, so it matches what we see.

Using higher-frequency CPU usage tracking confirms this (see screenshot). The process is at 100% for just under half a second, then some intermittent 5-10% usage for a second, then drops to zero. This is definitely what I'd expect from looking at the function.
[screenshot: cpuusage]

Like I mentioned on the PR, nanosleep(0) on Linux actually takes a lot of time, which explains why this isn't an issue there. I still don't understand why pre-Sierra doesn't have this problem since it seems like nanosleep(0) was already fast there. (I only have Sierra so I had to ask someone else to measure it. If anyone with pre-Sierra wants to confirm this: https://gist.github.com/plietar/d0de6de87a7b709aa214788cfa5adc5f)
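For anyone who wants to measure this, here is a minimal sketch of the kind of timing check being asked for (this is not the linked gist itself; the iteration count and the use of gettimeofday are assumptions for illustration):

```c
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

int main(void)
{
  // Time many nanosleep(0) calls and report the average cost per call.
  struct timespec zero = {0, 0};
  struct timeval start, end;
  const int iterations = 100000;

  gettimeofday(&start, NULL);
  for(int i = 0; i < iterations; i++)
    nanosleep(&zero, NULL);
  gettimeofday(&end, NULL);

  double elapsed_us = ((end.tv_sec - start.tv_sec) * 1e6) +
    (end.tv_usec - start.tv_usec);
  printf("average nanosleep(0): %.3f us per call\n", elapsed_us / iterations);
  return 0;
}
```

If the analysis above is right, the reported average should be close to zero on Sierra and noticeably larger on Linux.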

@dipinhora
Contributor

I see similar high cpu usage on pre-Sierra:

$ uname -a
Darwin Dipins-MBP.home 14.5.0 Darwin Kernel Version 14.5.0: Wed Jul 29 02:26:53 PDT 2015; root:xnu-2782.40.9~1/RELEASE_X86_64 x86_64 i386 MacBookPro11,2 Darwin

With the following ponyc:

$ ponyc --version
0.13.0 [release]
compiled with: llvm 3.8.1 -- clang version 3.8.1 (tags/RELEASE_381/final)

I compiled and ran the following modified program:

use "time"

actor Main
  new create(env: Env) =>
    let timers = Timers
    let timer = Timer(Notify(env), 5_000_000_000, 2_000_000_000)
    timers(consume timer)

class Notify is TimerNotify
  let _env: Env
  var _counter: U32 = 0
  new iso create(env: Env) =>
    _env = env
  fun ref apply(timer: Timer, count: U64): Bool =>
    _env.out.print(_counter.string())
    _counter = _counter + 1
    if _counter == 10 then
      false
    else
      true
    end

Run output on OS X:

$ time ./timer-issue
0
1
2
3
4
5
6
7
8
9

real	0m23.079s
user	0m3.294s
sys	0m4.541s

Same program run in an Ubuntu 16.04 VM on the same OS X host results in the following:

$ time ./timer-issue
0
1
2
3
4
5
6
7
8
9

real	0m23.036s
user	0m0.572s
sys	0m0.088s

One similar timing-related issue from the Apple Darwin mailing lists is: https://lists.apple.com/archives/darwin-kernel/2007/Feb/msg00031.html. The thread seems to come to the conclusion that the extra busy-waiting CPU usage is due to the nanosleep call returning much more quickly on OS X than on Linux.

Assuming others agree that this is the root cause, the solution would be to sleep for a specific amount of time in nanosleep (and similarly on the Windows side) instead of sleeping for 0 or 1 nanoseconds and relying on the assumption that the OS will take significantly longer regardless and so will not burn extra CPU, since that assumption doesn't always hold.
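For illustration, a minimal sketch of what sleeping for an explicit, non-zero duration could look like in the scheduler's idle path (the 10 microsecond value is only an example pulled from this discussion, and `scheduler_idle_sleep` is a hypothetical helper, not the runtime's actual code):

```c
#include <time.h>

// Hypothetical helper: sleep for an explicit, non-zero duration instead of
// relying on the OS stretching a nanosleep(0) call into something longer.
static void scheduler_idle_sleep(void)
{
  struct timespec ts;
  ts.tv_sec = 0;
  ts.tv_nsec = 10000; // 10 microseconds; example value only, not a decided default
  nanosleep(&ts, NULL);
}
```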

@agarman

agarman commented Apr 19, 2017

Can the nanosleep value used be calibrated at runtime? Otherwise hardcoded values will become bugs in later OS updates.

@dipinhora
Contributor

@agarman I'm not sure I understand your comment/concern. The call to nanosleep is in the C code for the runtime scheduler logic for when there are no actors with any work available for the scheduler thread to run.

So far, I think we've been relying on the implicit behavior where nanosleep for 0 nanoseconds actually takes significantly longer than that. I think we should figure out the "appropriate" amount of time to sleep a scheduler thread if there is no work before checking for work again, instead of relying on unpredictable OS behavior. Granted, this wouldn't preclude the OS from sleeping for longer than we requested, but it would prevent unnecessary busy waiting and CPU burning in situations where the OS actually does a good job of trying to meet our requested sleep time.

I don't see how this would become a bug in a later OS update, since we're just explicitly saying that we want to wait 10 microseconds (for example) before we check for work again, instead of saying we want to wait 0 nanoseconds and assuming the OS will always take longer than that. I don't think later OS updates would invalidate our decision to wait for 10 microseconds.

Well, the above is my understanding but maybe I'm missing something?

@agarman

agarman commented Apr 20, 2017

@dipinhora -- thanks for the clarification. If I understand correctly, the solution would no longer use nanosleep of 0 nor any OS-specific durations?

@dipinhora
Contributor

dipinhora commented Apr 20, 2017

@agarman I'm not sure what you mean by "any OS-specific durations", but my suggestion is that we move away from using 0 nanoseconds to a "more realistic" duration that explicitly balances CPU usage with time between checks (instead of the current reliance on the implicit behavior of the nanosleep call taking significantly longer than requested). The decision could be that the "more realistic" duration is custom/specific to each OS, but that seems unlikely to be necessary.

@agarman

agarman commented Apr 20, 2017

Got it. Sounds reasonable. FWIW, ERTS looks like it uses sleep 0, but I'm just a tourist with respect to ponyc or ERTS C internals.

@jemc
Member

jemc commented Apr 20, 2017

Would we take any significant performance hit if we made the sleep time configurable via a --ponywhatever=$ns option at runtime? That is, having the value read from a static variable at runtime rather than being a compile-time constant.

@SeanTAllen
Member Author

SeanTAllen commented Apr 20, 2017

Trying to explain how that interacts with things could be seriously difficult and black magic.

I'm always wary of options that are basically magic.

@jemc
Member

jemc commented Apr 20, 2017

Is it black magic? If I'm understanding it correctly, the concept seems fairly simple.

It seems analogous to a global "slop" option, where you can increase the number for less CPU usage, and decrease it for more precise timings, with diminishing returns in either direction if you stray too far from the default.

@SeanTAllen
Member Author

@jemc you are referring to the values in ponyint_cpu_core_pause, correct?

If yes, then I think it's black magic because it's called as part of quiescence, and changing that value can have a large impact on performance and work stealing.

@dipinhora
Contributor

I'm somewhere in the middle. The following are my -$0.02 8*)

I agree with @SeanTAllen that we have to be careful, since this is directly related to work stealing and other scheduler internals, and the impact is not very easy to understand or explain.

On the other hand, I think having it be a configurable option for advanced users is acceptable and no more dangerous than some of the thread related options. This assumes that we'll choose a sensible default that we would otherwise be hardcoding into a compile time constant.

@adam-antonik

I also observe this behaviour on Ubuntu 16.04.

@dipinhora
Contributor

@adam-antonik Can you please confirm some details? It seems surprising to me that Ubuntu 16.04 has the same behavior, and I'd like to investigate a bit if you're able to provide some more information.

@adam-antonik

@dipinhora Requested details below. The difference from your output on OS X is that we don't use as much user time, but sys is still too high.

Linux adam-750-279na 4.4.0-81-generic #104-Ubuntu SMP Wed Jun 14 08:17:06 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

0.14.0-f70dba0 [release] compiled with: llvm 3.9.1 -- cc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609

0
1
2
3
4
5
6
7
8
9

real 0m23.083s
user 0m0.388s
sys 0m2.136s

@dipinhora
Contributor

@adam-antonik Thanks for the quick response. Based on my limited knowledge of Linux and its internal clocksources and timing subsystem, I don't think the behavior you're seeing is an issue with Pony itself, but a side effect of how Pony relies on the rdtsc instruction (see https://en.wikipedia.org/wiki/Time_Stamp_Counter) to read the TSC value for internal scheduler thread timing. I would be very surprised if other programs relying on the rdtsc instruction don't have similar overhead to what you're seeing in Pony. It is possible, if you're on a VM, that the VM is emulating the rdtsc instruction for reading the TSC and that is why system overhead is so high.

Additionally, my understanding is that all Linux timing-related system overhead is related to the current clocksource that the kernel is using, and not all clocksources are created equal. Some are very low overhead (tsc) while others have much higher overhead (acpi_pm, I think). There's some information about clocksources in the accepted answer at https://unix.stackexchange.com/questions/164512/what-does-the-change-of-the-clocksource-influence and the links it provides.

Can you confirm the output of:

cat /sys/devices/system/clocksource/clocksource0/available_clocksource

and:

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

Are you running on bare metal or in a VM of some kind? If a VM, it may not be relying on a hardware clocksource but instead a software-emulated one, and that could have a large system overhead (I've personally seen this sort of overhead impact a latency-sensitive application when the current_clocksource wasn't TSC, but I couldn't explain all of the details as to why/how it all interacts). If bare metal, it is possible the kernel detected an unstable clocksource (for TSC) and is using a less efficient one instead. This is usually possible to resolve either by overriding the current clocksource or by rebooting.
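For context, the TSC read being discussed looks roughly like this (a sketch using the GCC/Clang intrinsic; the runtime's actual code may differ):

```c
#include <stdio.h>
#include <x86intrin.h> // __rdtsc on GCC/Clang (x86/x86_64 only)

int main(void)
{
  // Two back-to-back TSC reads. On bare metal with a stable TSC this is very
  // cheap; on a VM that traps or emulates rdtsc, each read can be far more
  // expensive, which is the kind of extra overhead suspected above.
  unsigned long long t1 = __rdtsc();
  unsigned long long t2 = __rdtsc();
  printf("delta between reads: %llu cycles\n", t2 - t1);
  return 0;
}
```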

@SeanTAllen
Member Author

This is caused by work-stealing. The problem is to make work stealing nicer in this scenario without impacting performance during high loads. Sylvan and I are poking at it, but we've poked at it before.

@dipinhora
Contributor

@SeanTAllen I believe you're referring to this issue/ticket in general (i.e. the busy waiting related to the internal scheduler thread timing for work stealing) and not specifically about the unusually high system time that @adam-antonik is encountering. Please correct me if I'm wrong.

I don't believe @adam-antonik's issue is related to the work stealing busy wait but something with the timing subsystem in his environment as I tried to explain in my previous comment.

@SeanTAllen
Member Author

I'm referring to the issue in general. There may be other factors as well.

@slfritchie
Contributor

I'm a Pony newbie, but I have a long and usually-fruitful habit of
using strace, ktrace, and truss on various UNIX flavors to
observe their internals. OS X's closest equivalent is dtruss, which
is implemented in DTrace. Here's what I found with some experimenting
on my Sierra/10.12.6 machine this afternoon.

I had to make a copy of the dtruss script and then add the following
pragma at the top of the embedded D script, because the event rate is so
extremely high that the DTrace internal buffers overflow.

 #pragma D option dynvarsize=1024m

It still occasionally loses events ... check its output for errors
such as 346307 dynamic variable drops with non-empty dirty list.

For example, running
the Pony refdoc's timer example
of a single timer firing every 8 seconds ... nearly every syscall
found by dtruss is __semwait_signal(). And it is called at an
extremely high rate, in cycles. It goes roughly like this:

  1. A bunch of 10-11 millisecond sleeps, usually 1.8-1.9 seconds total but
    occasionally 3.x seconds.
  2. A huge number of 1 microsecond sleeps, typically 80K-120K syscalls.
  3. If the timer fires and prints output, it is during step 2.
    However, the elapsed time of step 1 is less than 8 seconds, so
    output may not happen during this step.
  4. A bunch of 1.0-1.2 millisecond sleeps, typically 0.8-0.9 seconds
    total.
  5. Goto step 1.

That is an extraordinary number of system calls for a tiny app with a
timer wheel with one timer in it.

A very naive experiment suggests that each nanosleep() call is
translated into a single __semwait_signal() system call. That holds
for extremely large intervals such as 1 second or more; I haven't
explored smaller intervals.
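A sketch of the kind of naive experiment described (the binary name and interval are arbitrary); running it under dtruss shows which syscalls each nanosleep() maps to:

```c
#include <time.h>

int main(void)
{
  // Call nanosleep() a few times with a large interval, then trace it with
  // something like: sudo dtruss ./nanosleep-test
  struct timespec ts = {1, 0}; // 1 second per call
  for(int i = 0; i < 5; i++)
    nanosleep(&ts, NULL);
  return 0;
}
```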

I've put a raw DTrace output file at
http://www.snookles.com/scotttmp/1787-timer-exploration.txt for those
who haven't twiddled their OS X settings to permit DTrace full access
to the OS. The really ugly 1-liner script that I used to classify
small/medium/large sleep times is:

% egrep 'semwait|write' /tmp/yoo6de | awk '/write/ { print "WRITE" } { if ($2 < 900) {if (l > 0) { printf("large %d usec\n", l); } print "small"; s++; m = 0; l = 0;} else if ($2 < 9999) { if (s > 0) { printf "%d small\n", s } print "medium"; s = 0; m += $2; l = 0; }  else { if (m > 0) { printf "medium %d usec\n", m } print "large" ; s = 0; m = 0; l += $2} }' | uniq | egrep 'usec|count|WRITE|small'

... which prints the following. The "small" category events are merely counted, since most of them report elapsed time as 0 or 1 microseconds (too much rounding error lurking there, I think).

large 824577 usec
small
87275 small
medium 887975 usec
large 1799915 usec
small
WRITE
small
127259 small
medium 888596 usec
large 3206314 usec
small
87696 small
medium 885989 usec
large 1861788 usec
small
WRITE
small
115488 small
medium 890829 usec
large 3213300 usec
small
87633 small
medium 886477 usec
large 1902987 usec
small
WRITE
small
107248 small
medium 885446 usec
large 3207897 usec
small
85618 small
medium 888300 usec
large 1965298 usec
small
WRITE
small
97915 small
medium 886948 usec

@slfritchie
Contributor

Update: @SeanTAllen found a lovely little bug in ponyint_cpu_core_pause(). I'll play with it a bit, but it seems very likely to be an effective fix for this problem.

@SeanTAllen SeanTAllen self-assigned this Sep 29, 2017
SeanTAllen added a commit that referenced this issue Sep 29, 2017
Fixes #1787

Interestingly, all the info needed to solve this issue a while ago was
already in the issue but it wasn't until @slfritchie put his additional
comments in
#1787 (comment)
that it all clicked for me.

The excess CPU time is from us doing too much work stealing. In a normal
scenario, with nothing to do, we'd not be doing anything for a long time
and we'd end up sleeping for quite a while.

With the timer that goes off every few seconds as seen in the issue,
that isn't what happens. We regularly get woken and end up in a work
stealing cycle.

Then, due to the lack of an `else` block for yielding, on OSX, we'd
nanosleep for 0, which is the same as an immediate return. To see what
the impact of that would be on any platform, change the:

```c
// 10m cycles is about 3ms
if((tsc2 - tsc) < 10000000)
  return;
```

to

```c
// 10m cycles is about 3ms
if((tsc2 - tsc) < 1000000000)
  return;
```

This is effectively what we were running. That's a lot more
work-stealing. And, note the increased CPU usage. The reason this was
happening more on OSX is that on Linux, nanosleep 0 will sleep for at
least a bit. Here we remove the variability and do a small nanosleep
that will be the same across all platforms.
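To make the described change concrete, here is a minimal sketch of the shape of the fix (an illustration, not the actual patch; the tier thresholds and sleep values are assumptions):

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

// Sketch only: when yielding, always pick a non-zero sleep instead of
// falling through to nanosleep(0), which macOS treats as an immediate return.
static void cpu_core_pause_sketch(uint64_t tsc, uint64_t tsc2, bool yield)
{
  // 10m cycles is about 3ms: below that, just spin.
  if((tsc2 - tsc) < 10000000)
    return;

  struct timespec ts = {0, 0};

  if(yield)
  {
    if((tsc2 - tsc) > 10000000000ULL)
      ts.tv_nsec = 30000000; // idle a long time: sleep 30 ms
    else if((tsc2 - tsc) > 3000000000ULL)
      ts.tv_nsec = 10000000; // sleep 10 ms
    else if((tsc2 - tsc) > 1000000000ULL)
      ts.tv_nsec = 1000000;  // sleep 1 ms
    else
      ts.tv_nsec = 100000;   // the previously missing `else`: small, fixed sleep
  }

  nanosleep(&ts, NULL);
}
```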
SeanTAllen added a commit that referenced this issue Sep 29, 2017 (same commit message as above)
Theodus pushed a commit that referenced this issue Sep 29, 2017 (same commit message as above)