Improve work-stealing "scheduler is blocked" logic #2355
Conversation
Force-pushed from e5f09de to c250f88.
Prior to this commit, we sent actor block and unblock messages each time we entered and left `steal`. Every instance of work stealing resulted in a block/unblock message pair being sent, even if stealing was immediately successful. This was wasteful in a number of ways:

1. extra memory allocations
2. extra message sends
3. extra handling and processing of pointless block/unblock messages

This commit changes the block/unblock message sending logic. Hat tip to Scott Fritchie for pointing out to me how bad the issue was. He spent some time with DTrace and came up with some truly terrifying numbers for how much extra work was being done. Dipin Hora and I independently came up with what was effectively the same solution for this problem. This commit melds the best of his implementation with the best of mine.

With this commit applied, work stealing will only result in a block/unblock message pair being sent if:

1. the scheduler in question has attempted to steal from every other scheduler (new behavior)
2. the scheduler in question has tried to steal for at least 10 billion clock cycles (about 5 seconds on most machines) (new behavior)
3. the scheduler in question has no unscheduled actors in its mutemap (existing behavior)

Item 2 is the biggest change. What we are doing is increasing program shutdown time by at least 5 seconds (perhaps slightly more due to cross-scheduler timing issues) in return for much better application performance while running.

Issue #2317 is mostly fixed by this commit (although there is still a small amount of memory growth due to another issue).

Issue #517 is changed by this commit. Its memory growth is much slower than before but still quite noticeable. On my machine, #517 will no longer OOM, as it eventually gets to around 8 gigs of memory usage and is able to keep up with freeing memory ahead of new memory allocations. Given that there is still an underlying problem with memory allocation patterns (the same as #2317), I think it's possible that the example program in #517 would still OOM on some test machines.

Fixes #647
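To make the gating concrete, here is a minimal sketch of the idea in C. This is illustrative only, not the actual libponyrt source: `cycle_count`, `sched_count`, `mutemap_size`, `try_steal_next`, `send_block`, and `send_unblock` are assumed helper names standing in for the runtime's real machinery, and the constant mirrors the 10-billion-cycle threshold described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_THRESHOLD 10000000000ULL /* ~5 seconds of clock cycles */

typedef struct scheduler_t scheduler_t;
typedef struct pony_actor_t pony_actor_t;

/* Assumed helpers; the real runtime has its own equivalents. */
extern uint64_t cycle_count(void);  /* read a cycle counter, e.g. rdtsc */
extern uint32_t sched_count(void);
extern size_t mutemap_size(scheduler_t* sched);
extern pony_actor_t* try_steal_next(scheduler_t* sched, uint32_t* tried);
extern void send_block(scheduler_t* sched);
extern void send_unblock(scheduler_t* sched);

static pony_actor_t* steal(scheduler_t* sched)
{
  uint64_t start = cycle_count();
  uint32_t tried = 0;
  bool block_sent = false;
  pony_actor_t* actor;

  while((actor = try_steal_next(sched, &tried)) == NULL)
  {
    /* Only announce that we're blocked once all three conditions hold:
     * every other scheduler has been tried, we've spun for the cycle
     * threshold, and no unscheduled actors sit in our mutemap. */
    if(!block_sent &&
      (tried >= sched_count() - 1) &&
      ((cycle_count() - start) > BLOCK_THRESHOLD) &&
      (mutemap_size(sched) == 0))
    {
      send_block(sched);
      block_sent = true;
    }
  }

  /* Pair the earlier block with an unblock only if one was actually sent,
   * instead of unconditionally on every steal as before. */
  if(block_sent)
    send_unblock(sched);

  return actor;
}
```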
Force-pushed from c250f88 to e751677.
Great find! It sounds like a lot of unnecessary work can be prevented this way. Does this mean that after this change, all Pony programs (including an almost-immediately finished "hello world" program) will take at least 5 seconds to detect quiescence and terminate? Not necessarily a dealbreaker for me - I'm just trying to understand the impact.
@jemc yes. 10 billion cycles, around 5 seconds.
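For rough scale: at a 2 GHz clock, 10 × 10⁹ cycles ÷ 2 × 10⁹ cycles/s = 5 s; at 3 GHz it is closer to 3.3 s, so "around 5 seconds" depends on the clock speed of the machine.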
For the record.
If we go forward with this, we'll just have to make sure we break users of their already-bad habit of using

It definitely seems reasonable to have to wait a little bit longer for program termination, to have better performance during runtime. On the other hand, if someone is using Pony to write a short program meant to terminate almost immediately, waiting 5 seconds may be a bother. For example, let's say someone wants to run

It would be nice to find a solution that still keeps those kinds of short programs terminating quickly, while also losing the extra overhead for long-running programs. At the very least, maybe we could make it a compile-time option?
@jemc I think we could make the duration to wait a compile-time option.
BTW, 10 billion cycles was a mostly arbitrary value that I chose after looking at the cpu_pause code, which was also mostly arbitrary when it was created. As in, "let's try some values, these seem to work well, let's use them".
To add my two bits, I've tried this branch on a c4.8xlarge EC2 instance. When running

With this branch, I don't see that explosion happening at 32 or 34 Pony threads.
MAKE SURE TO NOTE THE SHORT RUNNING PROGRAM IN RELEASE NOTES.
Hi, I am concerned about the side effect of this commit. I currently use Pony to build a command-line tool, because I can make use of Pony's type system to shorten my dev cycle. I also use Pony for other longer-running simulations, but even those applications are not intended to run indefinitely. This use case is at odds with an imposed long shutdown of the program.

In fact, because I am building a CLI, I am working on always completing the work in under 200ms. Prior to this commit, the Pony binary performed incredibly well: for the files I am currently processing, I get a startup-process-shutdown time of 6ms, which would be hard to beat with other higher-level languages. My projection puts my current implementation at remaining sub-200ms for the next year or two, and after that, if need be, I have strategies for improving my file representation and remaining sub-200ms.

So, my question is: given that I strongly depend on the previous behaviour of a short run time, can these features be turned on/off at compile time? Or, how easy would it be to add the ability to disable this behaviour at compile time? That way, the critical path could be left optimised for long-running programs, and the behaviour disabled for short-running applications.

An alternative would be to have an explicit function call, akin to Env.exitcode(), but rather Env.exit(), that instructs the Pony runtime to shut down cleanly, but immediately.

Thanks,
PR #2355 included a change that improved runtime performance at the cost of significantly delaying program termination. This PR makes that performance tuning an opt-in change.
PR ponylang#2355 included a change that improved runtime performance at the cost of significantly delaying program termination. This commit makes it possible to have the best of both worlds.

At start-up time, the quiescence cycle timeout count will default to 10E9 (around 5 seconds on 2017 hardware), so programs can still benefit from this enhancement. However, this commit exposes the means to update the cycle timeout to a lower value (defaulting to 0), so programs can still control their termination time.

The implementation ties the quiescence timeout configuration to the setting of an exit code. This is based on the idea that setting an exit code is a strong signal that the program intends to exit soon.

Finally, care is taken to propagate the timeout value to the scheduler threads via the scheduler messaging mechanism rather than using something like an atomic variable. This ensures that performance on the critical path is not negatively impacted by cache invalidation; instead, each scheduler_t holds the value locally on its own thread.
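A rough C sketch of that shape, with hypothetical names (this is not the actual diff): setting an exit code broadcasts the lowered timeout to each scheduler thread through its message queue, so the hot path never touches shared state.

```c
#include <stdint.h>

#define DEFAULT_QUIESCENCE_CYCLES 10000000000ULL /* start-up default */

typedef enum { SCHED_SET_QUIESCENCE_CYCLES } sched_msg_id_t;

/* Assumed helpers standing in for the runtime's real machinery. */
extern uint32_t scheduler_count(void);
extern void scheduler_send(uint32_t sched_index, sched_msg_id_t id,
  uint64_t arg);
extern void store_exitcode(int code);

/* Hypothetical entry point: setting an exit code signals that the program
 * intends to exit soon, so lower every scheduler's quiescence timeout to 0
 * via per-scheduler messages rather than a shared atomic. */
void set_exitcode_and_timeout(int code)
{
  store_exitcode(code);

  for(uint32_t i = 0; i < scheduler_count(); i++)
    scheduler_send(i, SCHED_SET_QUIESCENCE_CYCLES, 0);
}
```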
I have proposed an approach to managing this by allowing runtime control of the quiescence cycle timeout value in #2370.
PR ponylang#2355 included a change that improved runtime performance at the cost of significantly delaying program termination, using a magic value of `10000000000`. After some additional testing, it was discovered that a smaller magic value of `1000000` is equally effective without unnecessary termination delay. This commit changes the magic value from `10000000000` to `1000000`.
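For rough scale again: at 2 GHz, `1000000` cycles is about 0.5 ms, so even with several schedulers crossing the threshold, the added shutdown delay lands in the low milliseconds rather than seconds.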
PR ponylang#2355 included a change that improved runtime performance at the cost of significantly delaying program termination. This was amended via PR ponylang#2376. This commit makes it possible to have the best of both worlds.

At start-up time, the quiescence cycle timeout count will default to 10E6 (which still adds an overhead of ~10ms to shutdown), so programs can still benefit from the scheduler enhancements. However, this commit exposes the means to update the cycle timeout to a lower value (defaulting to 0), so programs can still control their termination time.

The implementation ties the quiescence timeout configuration to the setting of an exit code. This is based on the idea that setting an exit code is a strong signal that the program intends to exit soon.

Finally, care is taken to propagate the timeout value to the scheduler threads via the scheduler messaging mechanism rather than using something like an atomic variable. This ensures that performance on the critical path is not negatively impacted by cache invalidation; instead, each scheduler_t holds the value locally on its own thread.

There are also unit tests to check assumptions regarding field alignment and type sizes.
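The receiving side might look roughly like this (again, hypothetical names, a sketch rather than the real implementation): the timeout lives in a plain field of scheduler_t that only its owning thread writes when the message arrives, so reads in the steal loop cause no cross-core cache traffic.

```c
#include <stdint.h>

typedef enum
{
  SCHED_SET_QUIESCENCE_CYCLES /* ... other message kinds ... */
} sched_msg_id_t;

typedef struct sched_msg_t
{
  sched_msg_id_t id;
  uint64_t arg;
} sched_msg_t;

typedef struct scheduler_t
{
  /* ... other per-thread scheduler state ... */
  uint64_t quiescence_cycles; /* plain field: no atomics on the hot path */
} scheduler_t;

/* Each scheduler thread drains its own queue, so this write is always
 * performed by the owning thread; the steal loop then compares elapsed
 * cycles against sched->quiescence_cycles without touching shared state. */
static void handle_sched_msg(scheduler_t* sched, const sched_msg_t* msg)
{
  switch(msg->id)
  {
    case SCHED_SET_QUIESCENCE_CYCLES:
      sched->quiescence_cycles = msg->arg;
      break;

    default:
      break;
  }
}
```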