Improve work-stealing "scheduler is blocked" logic #2355
Conversation
Force-pushed from e5f09de to c250f88.
Prior to this commit, we sent actor block and unblock messages each time we entered and left `steal`. Every instance of work stealing resulted in a block/unblock message pair being sent, even if stealing was immediately successful. This was wasteful in a number of ways:

1. extra memory allocations
2. extra message sends
3. extra handling and processing of pointless block/unblock messages

This commit changes the block/unblock message sending logic. Hat tip to Scott Fritchie for pointing out to me how bad the issue was. He spent some time with DTrace and came up with some truly terrifying numbers for how much extra work was being done. Dipin Hora and I independently came up with what was effectively the same solution for this problem. This commit melds the best of his implementation with the best of mine.

With this commit applied, work stealing will only result in a block/unblock message pair being sent if:

1. the scheduler in question has attempted to steal from every other scheduler (new behavior)
2. the scheduler in question has tried to steal for at least 10 billion clock cycles (about 5 seconds on most machines) (new behavior)
3. the scheduler in question has no unscheduled actors in its mutemap (existing behavior)

Item 2 is the biggest change. What we are doing is increasing program shutdown time by at least 5 seconds (perhaps slightly more due to cross-scheduler timing issues) in return for much better application performance while running.

Issue #2317 is mostly fixed by this commit (although there is still a small amount of memory growth due to another issue).

Issue #517 is changed by this commit. Its memory growth is much slower than before but still quite noticeable. On my machine, #517 will no longer OOM, as it eventually gets to around 8 gigs of memory usage and is able to keep up with freeing memory ahead of new memory allocations. Given that there is still an underlying problem with memory allocation patterns (the same as #2317), I think it's possible that the example program in #517 would still OOM on some test machines.

Fixes #647
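To make the gating concrete, here is a minimal sketch of the idea in C. This is illustrative only, not the actual libponyrt source: `cycle_count`, `sched_count`, `mutemap_size`, `try_steal_next`, `send_block`, and `send_unblock` are assumed helper names standing in for the runtime's real machinery, and the constant mirrors the 10-billion-cycle threshold described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_THRESHOLD 10000000000ULL /* ~5 seconds of clock cycles */

typedef struct scheduler_t scheduler_t;
typedef struct pony_actor_t pony_actor_t;

/* Assumed helpers; the real runtime has its own equivalents. */
extern uint64_t cycle_count(void);  /* read a cycle counter, e.g. rdtsc */
extern uint32_t sched_count(void);
extern size_t mutemap_size(scheduler_t* sched);
extern pony_actor_t* try_steal_next(scheduler_t* sched, uint32_t* tried);
extern void send_block(scheduler_t* sched);
extern void send_unblock(scheduler_t* sched);

static pony_actor_t* steal(scheduler_t* sched)
{
  uint64_t start = cycle_count();
  uint32_t tried = 0;
  bool block_sent = false;
  pony_actor_t* actor;

  while((actor = try_steal_next(sched, &tried)) == NULL)
  {
    /* Only announce that we're blocked once all three conditions hold:
     * every other scheduler has been tried, we've spun for the cycle
     * threshold, and no unscheduled actors sit in our mutemap. */
    if(!block_sent &&
      (tried >= sched_count() - 1) &&
      ((cycle_count() - start) > BLOCK_THRESHOLD) &&
      (mutemap_size(sched) == 0))
    {
      send_block(sched);
      block_sent = true;
    }
  }

  /* Pair the earlier block with an unblock only if one was actually sent,
   * instead of unconditionally on every steal as before. */
  if(block_sent)
    send_unblock(sched);

  return actor;
}
```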
Force-pushed from c250f88 to e751677.
Great find! It sounds like a lot of unnecessary work can be prevented this way. Does this mean that after this change, all Pony programs (including an almost-immediately finished "hello world" program) will take at least 5 seconds to detect quiescence and terminate? Not necessarily a dealbreaker for me - I'm just trying to understand the impact.
@jemc yes. 10 billion cycles, around 5 seconds.
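For rough scale: at a 2 GHz clock, 10 × 10⁹ cycles ÷ 2 × 10⁹ cycles/s = 5 s; at 3 GHz it is closer to 3.3 s, so "around 5 seconds" depends on the clock speed of the machine.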
For the record.
If we go forward with this, we'll just have to make sure we break users of their already-bad habit of using

It definitely seems reasonable to have to wait a little bit longer for program termination, to have better performance during runtime. On the other hand, if someone is using Pony to write a short program meant to terminate almost immediately, waiting 5 seconds may be a bother. For example, let's say someone wants to run

It would be nice to find a solution that still keeps those kinds of short programs terminating quickly, while also losing the extra overhead for long-running programs. At the very least, maybe we could make it a compile-time option?
@jemc I think we could make the duration to wait a compile-time option.
BTW, 10 billion cycles was a mostly arbitrary value that I chose after looking at the cpu_pause code, which was also mostly arbitrary when it was created. As in, "let's try some values, these seem to work well, let's use them".
To add my two bits, I've tried this branch on a c4.8xlarge EC2 instance. When running

With this branch, I don't see that explosion happening at 32 or 34 Pony threads.
MAKE SURE TO NOTE THE SHORT RUNNING PROGRAM IN RELEASE NOTES.
Hi, I am concerned about the side effect of this commit. I currently use Pony to build a command-line tool, because I can make use of Pony's type system to shorten my dev cycle. I also use Pony for other longer-running simulations, but even those applications are not intended to run indefinitely. This use case is at odds with an imposed long shutdown of the program.

In fact, because I am building a CLI, I am working on always completing the work in under 200ms. Prior to this commit, the Pony binary performed incredibly well: for the files I am currently processing, I get a startup-process-shutdown time of 6ms, which would be hard to beat with other higher-level languages. My projection puts my current implementation at remaining sub-200ms for the next year or two, and after that, if need be, I have strategies for improving my file representation and remaining sub-200ms.

So, my question is: given that I strongly depend on the previous behaviour of a short run time, can these features be turned on/off at compile time? Or, how easy would it be to add the ability to disable this behaviour at compile time? That way, the critical path could be left optimised for long-running programs, and the behaviour disabled for short-running applications.

An alternative would be to have an explicit function call, akin to Env.exitcode(), but rather Env.exit(), that instructs the Pony runtime to shut down cleanly, but immediately.

Thanks,
PR #2355 included a change that improved runtime performance at the cost of significantly delaying program termination. This PR makes that performance tuning an opt-in change.
PR ponylang#2355 included a change that improved runtime performance at the cost of significantly delaying program termination. This commit makes it possible to have the best of both worlds.

At start-up time, the quiescence cycle timeout count will default to 10E9 (around 5 seconds on 2017 hardware), so programs can still benefit from this enhancement. However, this commit exposes the means to update the cycle timeout to a lower value (defaulting to 0), so programs can still control their termination time.

The implementation ties the quiescence timeout configuration to the setting of an exit code. This is based on the idea that setting an exit code is a strong signal that the program intends to exit soon.

Finally, care is taken to propagate the timeout value to the scheduler threads via the scheduler messaging mechanism rather than using something like an atomic variable. This ensures that performance on the critical path is not negatively impacted by cache invalidation; instead, each scheduler_t holds the value locally on its own thread.
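A rough C sketch of that shape, with hypothetical names (this is not the actual diff): setting an exit code broadcasts the lowered timeout to each scheduler thread through its message queue, so the hot path never touches shared state.

```c
#include <stdint.h>

#define DEFAULT_QUIESCENCE_CYCLES 10000000000ULL /* start-up default */

typedef enum { SCHED_SET_QUIESCENCE_CYCLES } sched_msg_id_t;

/* Assumed helpers standing in for the runtime's real machinery. */
extern uint32_t scheduler_count(void);
extern void scheduler_send(uint32_t sched_index, sched_msg_id_t id,
  uint64_t arg);
extern void store_exitcode(int code);

/* Hypothetical entry point: setting an exit code signals that the program
 * intends to exit soon, so lower every scheduler's quiescence timeout to 0
 * via per-scheduler messages rather than a shared atomic. */
void set_exitcode_and_timeout(int code)
{
  store_exitcode(code);

  for(uint32_t i = 0; i < scheduler_count(); i++)
    scheduler_send(i, SCHED_SET_QUIESCENCE_CYCLES, 0);
}
```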
I have proposed an approach to managing this by allowing runtime control of the quiescence cycle timeout value in #2370.
PR ponylang#2355 included a change that improved runtime performance at the cost of significantly delaying program termination, using a magic value of `10000000000`. After some additional testing, it was discovered that a smaller magic value of `1000000` is equally effective without unnecessary termination delay. This commit changes the magic value from `10000000000` to `1000000`.
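For rough scale again: at 2 GHz, `1000000` cycles is about 0.5 ms, so even with several schedulers crossing the threshold, the added shutdown delay lands in the low milliseconds rather than seconds.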
PR ponylang#2355 included a change that improved runtime performance at the cost of significantly delaying program termination. This was amended via PR ponylang#2376. This commit makes it possible to have the best of both worlds.

At start-up time, the quiescence cycle timeout count will default to 10E6 (which still adds an overhead of ~10ms to shutdown), so programs can still benefit from the scheduler enhancements. However, this commit exposes the means to update the cycle timeout to a lower value (defaulting to 0), so programs can still control their termination time.

The implementation ties the quiescence timeout configuration to the setting of an exit code. This is based on the idea that setting an exit code is a strong signal that the program intends to exit soon.

Finally, care is taken to propagate the timeout value to the scheduler threads via the scheduler messaging mechanism rather than using something like an atomic variable. This ensures that performance on the critical path is not negatively impacted by cache invalidation; instead, each scheduler_t holds the value locally on its own thread.

There are also unit tests to check assumptions regarding field alignment and type sizes.
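The receiving side might look roughly like this (again, hypothetical names, a sketch rather than the real implementation): the timeout lives in a plain field of scheduler_t that only its owning thread writes when the message arrives, so reads in the steal loop cause no cross-core cache traffic.

```c
#include <stdint.h>

typedef enum
{
  SCHED_SET_QUIESCENCE_CYCLES /* ... other message kinds ... */
} sched_msg_id_t;

typedef struct sched_msg_t
{
  sched_msg_id_t id;
  uint64_t arg;
} sched_msg_t;

typedef struct scheduler_t
{
  /* ... other per-thread scheduler state ... */
  uint64_t quiescence_cycles; /* plain field: no atomics on the hot path */
} scheduler_t;

/* Each scheduler thread drains its own queue, so this write is always
 * performed by the owning thread; the steal loop then compares elapsed
 * cycles against sched->quiescence_cycles without touching shared state. */
static void handle_sched_msg(scheduler_t* sched, const sched_msg_t* msg)
{
  switch(msg->id)
  {
    case SCHED_SET_QUIESCENCE_CYCLES:
      sched->quiescence_cycles = msg->arg;
      break;

    default:
      break;
  }
}
```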