
Improve work-stealing "scheduler is blocked" logic #2355

Merged
SeanTAllen merged 2 commits into master from slower-sched-block-msg-sends on Nov 22, 2017

Conversation

SeanTAllen
Member

Prior to this commit, we sent actor block and unblock messages each time
we entered and left `steal`. Every instance of work stealing resulted in
a block/unblock message pair being sent, even if stealing was
immediately successful.

This was wasteful in a number of ways:

  1. extra memory allocations
  2. extra message sends
  3. extra handling and processing of pointless block/unblock messages

This commit changes the block/unblock message sending logic. Hat tip to
Scott Fritchie for pointing out to me how bad the issue was. He spent
some time with DTrace and came up with some truly terrifying numbers for
how much extra work was being done. Dipin Hora and I independently came
up with what was effectively the same solution for this problem. This
commit melds the best of his implementation with the best of mine.

With this commit applied, work stealing will only result in a
block/unblock message pair being sent if all of the following hold (a
rough sketch of the combined check follows the list):

  1. the scheduler in question has attempted to steal from every other
    scheduler (new behavior)
  2. the scheduler in question has tried to steal for at least 10 billion
    clock cycles (about 5 seconds on most machines) (new behavior)
  3. the scheduler in question has no unscheduled actors in its mutemap
    (existing behavior)
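
As a rough illustration only, here is a minimal sketch of how the combined check fits together. The struct fields, the `cpu_tick()` stub, and the `BLOCK_THRESHOLD` constant are assumptions made for this sketch, not the actual ponyc scheduler code.

```c
/* Minimal sketch of the combined "may we send a block message?" check.
 * All names here are illustrative; the real logic lives in the runtime's
 * scheduler code and is shaped differently. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_THRESHOLD 10000000000ULL /* ~10 billion cycles, roughly 5 seconds */

typedef struct sched_t
{
  uint32_t scheduler_count;  /* total number of scheduler threads */
  uint32_t steal_attempts;   /* distinct schedulers tried this steal round */
  uint64_t steal_start_tick; /* cycle count when this steal round began */
  size_t mutemap_size;       /* unscheduled (muted) actors held locally */
} sched_t;

/* Stub cycle counter so the sketch compiles standalone; a real runtime
 * would read the CPU's timestamp counter or similar. */
static uint64_t cpu_tick(void)
{
  static uint64_t fake_ticks = 0;
  return fake_ticks += 1000;
}

static bool should_send_block(sched_t* sched)
{
  /* 1. Must have attempted to steal from every other scheduler. */
  if(sched->steal_attempts < (sched->scheduler_count - 1))
    return false;

  /* 2. Must have been trying to steal for at least ~10 billion cycles. */
  if((cpu_tick() - sched->steal_start_tick) < BLOCK_THRESHOLD)
    return false;

  /* 3. Must have no unscheduled actors sitting in the mutemap. */
  if(sched->mutemap_size > 0)
    return false;

  return true;
}
```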

Item 2 is the biggest change. What we are doing is increasing program
shutdown time by at least 5 seconds (perhaps slightly more due to cross
scheduler timing issues) in return for much better application
performance while running.

Issue #2317 is mostly fixed by this commit (although there is still a
small amount of memory growth due to another issue).

Issue #517 is changed by this commit. Its memory growth is much slower
than before, but still quite noticeable. On my machine, #517 will no
longer OOM: it eventually reaches around 8 GB of memory usage and is
able to keep up with freeing memory ahead of new memory allocations.
Given that there is still an underlying problem with memory allocation
patterns (the same as #2317), I think it's possible that the example
program in #517 would still OOM on some test machines.

Fixes #647

SeanTAllen added the "changelog - fixed" label (Automatically add "Fixed" CHANGELOG entry on merge) on Nov 19, 2017
SeanTAllen force-pushed the slower-sched-block-msg-sends branch 3 times, most recently from e5f09de to c250f88 on November 19, 2017 13:51
@jemc
Member

jemc commented Nov 20, 2017

Great find! It sounds like a lot of unnecessary work can be prevented this way.


Does this mean that after this change, all pony programs (including an almost-immediately finished "hello world" program) will take at least 5 seconds to detect quiescence and terminate?

Not necessarily a dealbreaker for me - I'm just trying to understand the impact.

@SeanTAllen
Member Author

@jemc Yes. 10 billion cycles, around 5 seconds.

@slfritchie
Contributor

For the record:

```
% ./build/release-dtrace/ponyc examples/helloworld
% time ./helloworld
Hello, world.
0.061u 0.116s 0:04.74 3.5%	0+0k 0+0io 0pf+0w
```

@jemc
Member

jemc commented Nov 20, 2017

If we go forward with this, we'll just have to make sure we break users of their already-bad habit of using `time` to measure how long it takes their Pony program to do something.

It definitely seems reasonable to have to wait a little bit longer for program termination, to have better performance during runtime.

On the other hand, if someone is using Pony to write a short program meant to terminate almost immediately, waiting 5 seconds may be a bother. For example, let's say someone wants to run pony-stable to check their dependencies as part of their development process loop - adding 5 extra seconds into that loop kind of sucks.

It would be nice to find a solution that keeps those kinds of short programs terminating quickly, while also shedding the extra overhead for long-running programs. At the very least, maybe we could make it a compile-time option?

@SeanTAllen
Member Author

@jemc I think we could make the duration to wait a compile-time option.
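
A compile-time knob could be as simple as a guarded default. This is a hypothetical sketch only; the macro name and build flag are invented for illustration and are not an option ponyc actually exposes:

```c
/* Hypothetical compile-time override of the block-message threshold.
 * PONY_SCHED_BLOCK_THRESHOLD is a made-up name for this sketch. */
#ifndef PONY_SCHED_BLOCK_THRESHOLD
/* Default: ~10 billion cycles, about 5 seconds on a ~2 GHz machine. */
#define PONY_SCHED_BLOCK_THRESHOLD 10000000000ULL
#endif

/* A short-lived CLI build could then restore fast termination with e.g.:
 *   cc -DPONY_SCHED_BLOCK_THRESHOLD=0 ...
 */
```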

@SeanTAllen
Member Author

BTW, 10 billion cycles was a mostly arbitrary value that I picked after looking at the cpu_pause code, which was also mostly arbitrary when it was created. As in, "let's try some values, these seem to work well, let's use them".

@slfritchie
Contributor

slfritchie commented Nov 22, 2017

To add my two bits, I've tried this branch on a c4.8xlarge EC2 instance.

When running `extras/message-ubench/message-ubench --pingers 320 --initial-pings 5 --ponynoblock --ponythreads=N` using today's master, there's a pretty remarkable explosion of memory within 30 seconds of runtime.

| --ponythreads | RSS after 30 sec (MBytes, very roughly) |
|---------------|-----------------------------------------|
| 18            | 1880                                    |
| 14            | 650                                     |
| 13            | 200                                     |
| 12            | 14                                      |
| 11            | 13                                      |
| 10            | 13                                      |

With this branch, I don't see that explosion happening at 32 or 34 Pony threads.

@SeanTAllen
Member Author

MAKE SURE TO NOTE THE IMPACT ON SHORT-RUNNING PROGRAMS IN THE RELEASE NOTES.

SeanTAllen merged commit 2601b7c into master on Nov 22, 2017
SeanTAllen deleted the slower-sched-block-msg-sends branch on November 22, 2017 20:47
ponylang-main added a commit that referenced this pull request Nov 22, 2017
@sgebbie
Contributor

sgebbie commented Nov 23, 2017

Hi,

I am concerned about the side effect of this commit. I currently use Pony to build a command line tool. The reason is that I make use of Pony's type system to shorten my dev cycle. I also use Pony for other longer-running simulations, but even these applications are not intended to run indefinitely.

However, this use case is at odds with having an imposed long shutdown of the program. In fact, because I am building a CLI, I am aiming to always complete the work in under 200ms. Prior to this commit the Pony binary performed incredibly well: for the files I am currently processing I get a startup-process-shutdown time of 6ms, which would be hard to beat with other higher-level languages. My projection puts my current implementation at remaining sub-200ms for the next year or two, and after that, if need be, I have strategies for improving my file representation and staying sub-200ms.

So, my question is: given that I strongly depend on the previous behaviour of a short run time, can this feature be turned on/off at compile time? Or, how easy would it be to add the ability to disable this behaviour at compile time? That way, the critical path could be left optimised for long-running programs, and the delay disabled for short-running applications.

An alternative would be to have an explicit function call, akin to Env.exitcode() but instead Env.exit(), that instructs the Pony runtime to shut down cleanly, but immediately.

Thanks,
Stewart.

jemc added a commit that referenced this pull request Nov 24, 2017
PR #2355 included a change that improved runtime performance at
the cost of significantly delaying program termination.

This PR makes that performance tuning an opt-in change.
@jemc
Member

jemc commented Nov 24, 2017

@sgebbie - Thanks for your comment - I think it's important to accommodate use cases like yours, so I'm glad you spoke up.

You inspired me to go ahead and file #2369, which is intended to make the change from this PR an opt-in trade-off. Please feel free to participate in the discussion over there.

sgebbie added a commit to sgebbie/ponyc that referenced this pull request Nov 24, 2017
PR ponylang#2355 included a change that improved runtime performance at
the cost of significantly delaying program termination.

This commit makes it possible to have the best of both worlds. At
start-up time the quiescence cycle timeout count will default to 10E9
(around 5 seconds on 2017 hardware), so programs can still benefit from
this enhancement.

However, this exposes the means to then update the cycle timeout value
to a lower value (defaulting to 0). That way, programs can still control
their termination time.

The implementation ties the quiescence timeout configuration to the
setting of an exit code. This is based on the idea that setting an exit
code is a strong signal that the program is intending to exit soon.

Finally, care is taken to propagate the timeout value to the scheduler
thread via the scheduler messaging mechanism rather than using
something like an atomic variable. This ensures that performance on the
critical path is not negatively impacted by cache invalidation. Rather,
the scheduler_t holds the value locally in the scheduler's thread.
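
Purely as a sketch of the mechanism described above (the message tag, `send_msg()` stub, and field names are stand-ins, not the actual ponyc internals), the idea looks roughly like this:

```c
/* Sketch: propagate a new quiescence timeout to each scheduler via its
 * message queue, so the hot steal loop only ever reads a local field. */
#include <stdint.h>

typedef enum
{
  SCHED_MSG_SET_BLOCK_THRESHOLD /* hypothetical message tag */
} sched_msg_t;

typedef struct scheduler_t
{
  /* Local copy per scheduler thread: no shared atomic, no cache-line
   * ping-pong on the critical path. */
  uint64_t block_threshold;
} scheduler_t;

/* Stub standing in for enqueueing a message on scheduler `to`'s queue. */
static void send_msg(uint32_t to, sched_msg_t msg, uint64_t arg)
{
  (void)to; (void)msg; (void)arg; /* real code pushes onto a message queue */
}

/* Called when the program signals imminent exit (e.g. sets an exit code):
 * ask every scheduler to switch to the shorter timeout. */
void request_fast_termination(uint32_t scheduler_count, uint64_t new_threshold)
{
  for(uint32_t i = 0; i < scheduler_count; i++)
    send_msg(i, SCHED_MSG_SET_BLOCK_THRESHOLD, new_threshold);
}

/* In each scheduler's own message handler, the value is applied locally. */
void handle_msg(scheduler_t* sched, sched_msg_t msg, uint64_t arg)
{
  if(msg == SCHED_MSG_SET_BLOCK_THRESHOLD)
    sched->block_threshold = arg;
}
```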
@sgebbie
Contributor

sgebbie commented Nov 24, 2017

I have proposed an approach to managing this by allowing runtime control of the quiescence cycle timeout value in #2370.

dipinhora added a commit to dipinhora/ponyc that referenced this pull request Nov 26, 2017
PR ponylang#2355 included a change that improved runtime performance at
the cost of significantly delaying program termination using a
magic value of `10000000000`. After some additional testing, it
was discovered that a smaller magic value of `1000000` is equally
effective without unnecessary termination delay.

This commit changes the magic value from `10000000000` to
`1000000`.
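
In terms of the earlier hypothetical sketch (the macro name is still invented, not the real source), the change amounts to lowering the default threshold:

```c
/* Hypothetical: lower the default block-message threshold from ~10 billion
 * cycles (~5 seconds) to 1 million cycles (around a millisecond or less). */
#define PONY_SCHED_BLOCK_THRESHOLD 1000000ULL /* previously 10000000000ULL */
```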
jemc pushed a commit that referenced this pull request Nov 27, 2017
sgebbie added a commit to sgebbie/ponyc that referenced this pull request Nov 27, 2017
PR ponylang#2355 included a change that improved runtime performance at
the cost of significantly delaying program termination. This was
amended via PR ponylang#2376.

This commit makes it possible to have the best of both worlds. At
start-up time the quiescence cycle timeout count will default to 10E6
(which still adds an overhead of ~10ms to shutdown). Therefore, programs
can still benefit from the scheduler enhancements.

However, this commit exposes the means to then update the cycle timeout
value to a lower value (defaulting to 0). That way, programs can still
control their termination time.

The implementation ties the quiescence timeout configuration to the
setting of an exit code. This is based on the idea that setting an exit
code is a strong signal that the program is intending to exit soon.

Finally, care is taken to propagate the timeout value to the scheduler
thread via the scheduler messaging mechanism rather than using something
like an atomic variable. This ensures that performance on the critical
path is not negatively impacted by cache invalidation. Rather, the
scheduler_t holds the value locally in the scheduler's thread.

There are also unit tests to check assumptions regarding field
alignment and type sizes.
sgebbie added a commit to sgebbie/ponyc that referenced this pull request Nov 28, 2017