Huge number of unix-systemcalls under macOS #2474
These are wildly different programs that happen to perform the same end task: an echo server. Correct me if I'm wrong, but your Python version is single-threaded and probably uses synchronous IO. Is that correct? The Pony version would be using multiple native threads to potentially multiplex many different actors concurrently, and it operates using async IO. A program designed in the fashion of the Pony one would have far more system call overhead than the synchronous single-threaded version. All other things being equal, it should also be able to handle far more concurrent requests. If you'd like to strike up a conversation about specifics, the mailing list and IRC are available.

For example, Pony programs, depending on the workload, can end up making a large number of system calls to sleep. Why? Scheduler threads put themselves to sleep for a while if they can't find any work to do. The longer they sleep, the fewer system calls. Also, the longer they sleep, the longer it takes to respond to an increase in workload. There's a tradeoff there that should be discussed on the merits of each option. If you have specific efficiency concerns and data to discuss, those can be fruitful conversations.

I'm closing this issue because there's nothing actionable here, but to reiterate: I encourage you to dig deeper into what you are seeing, and to avail yourself of IRC and the mailing list to learn more about the Pony runtime and become an active member of discussions about the tradeoffs we have to make.
@SeanTAllen As you proposed, I have collected some data regarding my concerns. I performed an strace under Linux, because under macOS I could not get dtrace to work (even with sudo). The result is pretty clear. For the actual communication, I count 24 syscalls for Pony, which is close to the 21 of Python. This confirms my expectation that Pony and Python should be on par for this workload. It is nice to see the use of two threads, visible from the different pids:
The strace dump also includes the explanation for the other ~4000 syscalls from my first post: those are all nanosleep waits. You already indicated this with the remark: "Pony programs depending on the workload can end up making a large number of system calls to sleep." Just for reference, here is the code which increases the sleep time in steps from 100us to 10ms. This is the relevant part with the Windows stuff removed for clarity:
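The backoff behavior being described can be sketched roughly as follows. This is an illustrative model, not the actual ponyc code: the function names (`next_pause_ns`, `idle_pause`) and the exact doubling scheme are assumptions; only the 100us-to-10ms range comes from the discussion above.

```c
#include <stdint.h>
#include <time.h>

/* Sketch of an exponential-backoff idle wait: each pass through the
 * idle loop sleeps a little longer, from 100 us up to a 10 ms cap.
 * The range mirrors the description above; the real scheduler's pause
 * logic differs in detail. */
static uint64_t next_pause_ns(uint64_t pause_ns)
{
  const uint64_t min_ns = 100 * 1000;        /* 100 us */
  const uint64_t max_ns = 10 * 1000 * 1000;  /* 10 ms */

  if(pause_ns < min_ns)
    return min_ns;

  pause_ns *= 2;                             /* back off exponentially */
  return (pause_ns > max_ns) ? max_ns : pause_ns;
}

static void idle_pause(uint64_t pause_ns)
{
  struct timespec ts = {
    .tv_sec = (time_t)(pause_ns / 1000000000),
    .tv_nsec = (long)(pause_ns % 1000000000)
  };

  /* Each of these calls is one of the nanosleep syscalls visible in
   * the strace dump: thousands of short sleeps while idling. */
  nanosleep(&ts, NULL);
}
```

An idle thread calling `idle_pause(next_pause_ns(...))` in a loop never blocks indefinitely, which is exactly why the syscall count grows with idle time.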
Now my question is quite simple: can those busy sleep loops be avoided by, e.g., using blocking pthread_mutex/pthread_cond calls or a semaphore? From a quick glance at the code, I assume a few changes in the scheduler could be sufficient, plus a command line option to enable pthread usage. If wakeup from a blocked condition is immediate, a Pony program could actually even be faster with this solution. But I have not found clear information on the internet about how fast a thread restarts after an unlock (whether it is immediate or an OS-defined cycle later). The reason I am concerned is that I own a MacBook, which often runs on battery. My intention is to run my socks-proxy constantly in the background. But if minor internet traffic makes the CPU jump to >10% for a while, then I lose hours of battery time. And I fear that macOS implements a 100us sleep with the CPU running at full speed, which is not good for the battery at all.
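A minimal sketch of the blocking alternative asked about here, using a pthread condition variable. All names (`work_lock`, `work_cond`, `work_available`, the two functions) are hypothetical, not ponyc identifiers:

```c
#include <pthread.h>
#include <stdbool.h>

/* An idle scheduler blocks in pthread_cond_wait (no CPU use, no
 * periodic syscalls) until a producer signals that work is available. */
static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_cond = PTHREAD_COND_INITIALIZER;
static bool work_available = false;

void wait_for_work(void)
{
  pthread_mutex_lock(&work_lock);

  /* Loop, don't `if`: pthread_cond_wait may wake spuriously. */
  while(!work_available)
    pthread_cond_wait(&work_cond, &work_lock);

  work_available = false;
  pthread_mutex_unlock(&work_lock);
}

void announce_work(void)
{
  pthread_mutex_lock(&work_lock);
  work_available = true;
  pthread_cond_signal(&work_cond);
  pthread_mutex_unlock(&work_lock);
}
```

On the wakeup-latency question: POSIX only guarantees that a signalled thread becomes runnable; how soon it actually runs is up to the OS scheduler, but it is event-driven rather than tied to a fixed polling interval.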
I'm not sure I follow. If I understand correctly, you are suggesting a semaphore/condition variable that would allow a scheduler to know that "work is available"?
Change dynamic scheduler scaling implementation in order to resolve the hangs encountered in ponylang#2451. The previous implementation assumed that signalling to wake a thread was a reliable operation. Apparently, that's not necessarily true (see https://en.wikipedia.org/wiki/Spurious_wakeup and https://askldjd.com/2010/04/24/the-lost-wakeup-problem/). Seeing as we couldn't find any other explanation for why the previous implementation was experiencing hangs, I've assumed it is either because of lost wake ups or spurious wake ups and redesigned the logic accordingly.

Now, when a thread is about to suspend, it will decrement the `active_scheduler_count` and then suspend. When it wakes up, it will check to see if the `active_scheduler_count` is at least as big as its `index`. If the `active_scheduler_count` isn't big enough, the thread will suspend itself again immediately. If it is big enough, it will resume. Threads no longer modify `active_scheduler_count` when they wake up. `active_scheduler_count` must now be modified by the thread that is waking up another thread prior to sending the wake up notification. Additionally, since we're now assuming that wake up signals can be lost, we now send multiple wake up notifications just in case. While this is somewhat wasteful, it is better than being in a situation where some threads aren't woken up at all (i.e. a hang).

This commit also includes a change inspired by ponylang#2474. Now, *all* scheduler threads can suspend as long as there is at least one noisy actor registered with the ASIO subsystem. If there are no noisy actors registered with the ASIO subsystem then scheduler 0 is not allowed to suspend itself.
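The `active_scheduler_count` protocol described in the commit message could be modeled roughly like this. It is an illustrative sketch under stated assumptions, not the actual ponyc code; the helper names are invented, and the real waker's count update may differ in detail:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A thread that wakes up (perhaps spuriously, or from a stale signal)
 * only stays awake if some waker has already raised
 * active_scheduler_count to cover this thread's index. */
static _Atomic uint32_t active_scheduler_count;

bool should_stay_awake(uint32_t my_index)
{
  /* Woken thread: resume only if the count covers our index;
   * otherwise suspend again immediately. */
  return atomic_load(&active_scheduler_count) > my_index;
}

void prepare_to_suspend(void)
{
  /* Suspending thread: leave the active set before sleeping. */
  atomic_fetch_sub(&active_scheduler_count, 1);
}

void wake_scheduler(uint32_t target_index)
{
  /* The waker (not the woken thread) grows the count *before* sending
   * the wake-up notification, so a lost or spurious signal can never
   * leave the count inconsistent... */
  uint32_t count = atomic_load(&active_scheduler_count);

  if(count < target_index + 1)
    atomic_store(&active_scheduler_count, target_index + 1);

  /* ...then it sends the (possibly unreliable) wake-up signal,
   * more than once if need be. */
}
```

Under this model, a spurious wakeup is harmless: `should_stay_awake` fails and the thread goes straight back to sleep, while a lost signal is papered over by sending multiple notifications.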
In the meantime I have forked ponyc. In this fork I have added an atomic variable, prey_count, which contains the total number of actors available to thieves. Keeping it correct seems to require only adding atomic add/sub to every mpmcq push/pop for all queues. On a push to the inject queue, I currently just wake up a thread without further consideration. In steal(), prey_count is checked and, depending on it, the thread continues stealing, wakes up more threads, or goes to sleep. As I do not understand every aspect of the scheduler, I now ask for your comments. This change totally avoids the busy sleep, which is my primary intention. At the definition of prey_count I have added this comment, which should help with understanding:
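Based on the description above, the prey_count bookkeeping might look roughly like this. This is a sketch of the idea only; the helper names are illustrative and not necessarily the fork's code:

```c
#include <stdatomic.h>

/* A global atomic tracking how many actors are currently queued and
 * therefore stealable. Every queue push/pop adjusts it; an idle
 * thread consults it to decide between stealing and blocking. */
static _Atomic long prey_count;

void on_queue_push(void) { atomic_fetch_add(&prey_count, 1); }
void on_queue_pop(void)  { atomic_fetch_sub(&prey_count, 1); }

long stealable_actors(void)
{
  return atomic_load(&prey_count);
}

/* An idle thief could then decide, roughly: */
int thief_should_sleep(void)
{
  return atomic_load(&prey_count) == 0; /* nothing to steal: block */
}
```

The concern raised later in this thread is that both hooks sit on the message-send hot path, so every send contends on one cache line shared by all scheduler threads.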
In the meantime I have checked with my socks-proxy and I am very pleased with the improvement in CPU load. Just one thing is still not working well, because after some time I received this error:
Running with lldb reveals:
The compiler is built with:
Now I am modifying the Makefile, because pthreads are still in use... found it: USE_SCHEDULER_SCALING_PTHREADS is hardcoded for macOS. I have pushed the change to GitHub and the socks-proxy is still running. Happily, the CPU load appears to be even a bit better.
This change will probably have a really awful performance impact for highly concurrent, heavily loaded servers.
From my point of view, a better worked out implementation of this change will only slightly impact the performance of a highly concurrent, heavily loaded server. As long as work is available, which is generally the case for such a server, no thread will ever sleep, and this change has no further impact besides the atomic counting. On the positive side, it allows simplifying the scheduler while keeping the work-stealing principle. This simplification may be sufficient to compensate for the atomic counting overhead. But this is just my opinion; it is not backed up with quantitative data. My use case is an application on a battery-driven computer. Having a CPU load of 10% versus now below 1% makes a real difference.
I am wholeheartedly in favor of a scheduler that does not need a busy sleep. As pointed out, a busy sleep is unacceptable for some systems, such as one of my target applications: using Pony for embedded systems. If a single scheduler isn't able to provide a solution, maybe we'll need pluggable schedulers or conditional compilation. In any case, I hope we can coalesce on a solution.
@winksaville @gin66 please see the work that @dipinhora has been doing that builds on the generalized runtime backpressure work that I did: #2483

@gin66 I appreciate your enthusiasm, but your atomic count will have a huge impact on performance, especially as it is in one of the hottest paths in the runtime. The more scheduler threads, the more contention on that atomic variable, on every message send. That's a really large impact. The goal is laudable, but putting a contended atomic variable in an extreme hot path is not a solution that the core team would support.
@SeanTAllen Unfortunately I am not familiar with the impact of atomic add/sub on today's multi-core CPUs with L1/L2/L3 caching and pipelining, and I cannot come up with any measurements to prove myself right or wrong either. For my specific use case, I simply value battery runtime over performance. @winksaville's idea of a user (aka SW developer) selectable scheduler would solve this conflict of interest. If I see it correctly, @dipinhora's work is already in the master branch. This code version has been the basis of my fork, and its CPU load is still much higher than with my own draft proposal, so I will not use it. As I currently have no better idea to solve the CPU load issue, I will rewrite my socks-proxy in Rust. With that language I have the freedom to set the priority depending on application need; it's just that the language itself is much more complicated than Pony. It's a real pity.
@gin66 I pointed you at an open PR; it is not on master. I think your fork is fine for your use case. It's not a good long-term solution, but we are working on one. There's no need to switch to Rust; you'll just need to maintain your fork for a while as we bring down the busy wait.
@gin66 PR #2483 goes a long way towards your goal of lowering syscalls because it will shut down all scheduler threads if possible (depending on workload). I've tried to architect it with performance in mind so it should have minimal impact on a normal busy workload. It's probably a few days or a week or so before that PR is hopefully merged into master. It would be great if you could run your application and benchmark the CPU usage and syscalls with the PR changes to confirm that it helps your use case in the meantime. My understanding is that it should significantly lower that 10% idle CPU usage but I'm not 100% sure of that. Also, there's still room for improvement beyond what that PR accomplishes and it would be awesome if you and @winksaville are able to help and give more specifics as to your use cases and needs so we can all brainstorm to find an appropriate solution. I believe that the current runtime scheduling mechanism is fairly versatile and we should be able to reach the performance targets required without pluggable schedulers or negatively impacting high load applications. |
My use case is: if there is no work to do and all actors are waiting for I/O, then all threads should be waiting in the OS. There should be no polling in the Pony runtime.
@winksaville That is what should happen once #2483 is merged (assuming no more missed edge cases or bugs), since it will suspend all scheduler threads as long as there is at least one actor subscribed with the ASIO subsystem. If there are no actors subscribed with the ASIO subsystem, then lack of work would mean quiescence, so scheduler thread 0 isn't allowed to suspend in that case: it needs to be awake to handle work and to detect quiescence. Also, @gin66, thank you for the inspiration regarding the suspending of all threads as long as at least one actor is subscribed with the ASIO subsystem. That functionality is a direct result of this ticket and your use case.
I wonder: what if we had a mode that disabled quiescence detection and put the burden on the app to decide when it's time to exit? Would that allow all threads to suspend? In an embedded system I can envision a setup where interrupts send messages directly to actors, and thus detection of quiescence might be difficult. Although I have no idea how quiescence is detected, so maybe this is a nonsensical question.
@winksaville Given the relationship between the ASIO subsystem and the scheduler threads (along with the dynamic scheduler changes PR logic), I think that is effectively already in place (assuming I'm understanding the scenario correctly). The output in my example comment on the PR (#2483 (comment)) shows a situation where an echo server is created to listen on a specific port (i.e. an actor is waiting for an async notification from the OS). In this scenario, all of the scheduler threads suspend (including sched 0). This is because the ASIO subsystem is waiting for an OS notification via either epoll or kqueue or iocp. Once the ASIO subsystem receives the OS notification of an event, it will wake up one of the scheduler threads to handle the notification. If there is no more work to do, the scheduler thread would suspend again, waiting to be woken by the ASIO subsystem again.

Quiescence would only occur if there are no actors registered with the ASIO subsystem. This is in the actor's control, because it has to explicitly tell the ASIO subsystem that it doesn't want to wait for any more notifications. Once there are no more actors registered with the ASIO subsystem, quiescence detection can proceed; it relies on the block/cnf/ack messages between the scheduler threads that scheduler thread 0 is responsible for coordinating.
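The suspend rule being described can be restated as a small predicate. This is an illustrative sketch with hypothetical names (`asio_noisy_count`, `scheduler_may_suspend`), not ponyc source:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of noisy actors currently registered with the ASIO
 * subsystem (i.e. actors waiting on epoll/kqueue/iocp events). */
static _Atomic long asio_noisy_count;

bool scheduler_may_suspend(uint32_t sched_index)
{
  /* Any scheduler other than 0 may always suspend. */
  if(sched_index != 0)
    return true;

  /* Scheduler 0 may suspend only while the ASIO subsystem can wake
   * everything back up; with no noisy actors, it must stay awake to
   * coordinate quiescence detection (block/cnf/ack messages). */
  return atomic_load(&asio_noisy_count) > 0;
}
```

This matches the echo-server example: while the listener is registered with ASIO, `asio_noisy_count > 0`, so even sched 0 can sleep and the whole process idles in the OS.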
OK, over time I'll get more familiar; for now things sound more than good enough, thanks.
Change dynamic scheduler scaling implementation in order to resolve the hangs encountered in ponylang#2451. The previous implementation assumed that signalling to wake a thread was a reliable operation. Apparently, that's not necessarily true (see https://en.wikipedia.org/wiki/Spurious_wakeup and https://askldjd.com/2010/04/24/the-lost-wakeup-problem/). Seeing as we couldn't find any other explanation for why the previous implementation was experiencing hangs, I've assumed it is either because of lost wake ups or spurious wake ups and redesigned the logic accordingly.

Now, when a thread is about to suspend, it will decrement the `active_scheduler_count` and then suspend. When it wakes up, it will check to see if the `active_scheduler_count` is at least as big as its `index`. If the `active_scheduler_count` isn't big enough, the thread will suspend itself again immediately. If it is big enough, it will resume. Threads no longer modify `active_scheduler_count` when they wake up. `active_scheduler_count` must now be modified by the thread that is waking up another thread prior to sending the wake up notification. Additionally, since we're now assuming that wake up signals can be lost, we now send multiple wake up notifications just in case. While this is somewhat wasteful, it is better than being in a situation where some threads aren't woken up at all (i.e. a hang).

Additionally, only use `scheduler_count_changing` for the `signals` implementation of dynamic scheduler scaling. The `pthreads` implementation now uses a mutex (`sched_mut`) in its place. We also now change the logic to only unlock the mutex in the `pthreads` implementation once threads have been woken, to avoid potential lost wake ups. This isn't an issue for the `signals` implementation, where the unlocking of `scheduler_count_changing` can remain where it is, prior to threads being woken up.

This commit also splits out the scheduler block/unblock message handling logic into its own functions (so that sched 0 can call those functions directly instead of sending messages to itself). This commit also includes a change inspired by ponylang#2474. Now, *all* scheduler threads can suspend as long as there is at least one noisy actor registered with the ASIO subsystem. If there are no noisy actors registered with the ASIO subsystem then scheduler 0 is not allowed to suspend itself because it is responsible for quiescence detection. Lastly, this commit adds logic to allow a scheduler thread to suspend even if it has already sent a scheduler block message, so that we can now suspend scheduler threads in most scenarios.
Currently I am developing a socks-proxy in Pony, which does a kind of load balancing via several ssh socks channels to three servers. It works well and has been fun to write with actors.
Now I am only concerned that this little workload shows up at 10% in macOS's Activity Monitor, while the actual ssh proxies (doing the whole work, including encryption) are at around 1% or less. On a MacBook this makes a difference for battery life, so this load is not acceptable at all.
In order to rule out my own code as the culprit, I have made a comparison between the echo-server from the examples directory and its Python counterpart.
Test method is to call:
and for ten requests:
Result is:
The Python numbers are as they should be. Even Pony's actor system should not lead to thousands of syscalls. Apparently the Pony runtime has efficiency issues, which need to be solved for it to really be a high-performance language.
Little side note: PONY is in PYthON :-)