
Generalized runtime backpressure #2264

Merged · 72 commits merged from sean-runtime-backpressure into master · Nov 17, 2017

Conversation

@SeanTAllen (Member):

This is a first draft of generalized runtime backpressure. A final version would require changes to TCPConnection and anything else that can become "overloaded" and would need to exert backpressure based on external conditions (such as a slow receiver).

@SeanTAllen added the "do not merge" label (This PR should not be merged at this time) on Oct 9, 2017.
@Praetonus (Member) left a comment:

Nice! I've left some comments on implementation details.

}
}

void maybe_mute(pony_ctx_t* ctx, pony_actor_t* to, pony_msg_t* first,
Member:

The function name should be ponyint_maybe_mute according to the runtime naming conventions.


pony_msg_t* m = first;

while(m != last)
Member:

Iterating the message chain here is going to be expensive, so I think it would be nice to refactor in order to remove the need for the loop.

In the current state of things, a message chain cannot contain ORCA messages, so a possible alternative would be to take the message chain length as a parameter. If we want to future-proof the function now, the parameter could be the number of application messages in the chain instead.
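
For illustration, the refactor would look roughly like this (a sketch only; the parameter name is hypothetical, and the actual change came in as a patch later in this thread):

void ponyint_maybe_mute(pony_ctx_t* ctx, pony_actor_t* to, size_t app_msg_count)
{
  // With a count supplied by the caller there is no need to walk the
  // message chain; zero application messages means muting can't trigger.
  if(app_msg_count == 0)
    return;

  // ... existing overload/mute checks on 'to' and ctx->current ...
}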

Member:

If we use that assumption about message chains not containing ORCA messages, we should likely try to add a pony_assert somewhere to ensure that.

Member:

That's true. I think the best place for that assertion would be in pony_chain.

Member Author:

@Praetonus If we had the number of application messages in the chain, that would become much easier: if there are more than 0, we do our check; if there are none, we skip it. I was going to bring that up.

Member:

Discussed this with @SeanTAllen, I'm going to implement that change (counting the number of application messages in the chain) since it also requires a change to one of the optimisation passes, and I'll submit the patch in this PR.

Member Author:

Praetonus' patch was applied.

@@ -212,8 +212,20 @@ bool ponyint_actor_run(pony_ctx_t* ctx, pony_actor_t* actor, size_t batch)
app++;
try_gc(ctx, actor);

// if we become muted as a result of handling a message, bail out now.
if(actor->muted > 0)
Member:

This should be an atomic_load_explicit with memory_order_relaxed. Atomic operations without an explicit memory order implicitly use memory_order_seq_cst and that's a performance hit.
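
For illustration, the suggested form (matching the pattern used later in this diff):

if(atomic_load_explicit(&actor->muted, memory_order_relaxed) > 0)
{
  // relaxed load: no need for the full seq_cst barrier here;
  // bail out of the run loop, as in the diff above
}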

@@ -225,6 +237,15 @@ bool ponyint_actor_run(pony_ctx_t* ctx, pony_actor_t* actor, size_t batch)
// We didn't hit our app message batch limit. We now believe our queue to be
// empty, but we may have received further messages.
pony_assert(app < batch);
pony_assert(actor->muted == 0);
Member:

atomic_load_explicit here.

if(ponyint_messageq_push(&to->q, first, last))
{
if(!has_flag(to, FLAG_UNSCHEDULED))
if(!has_flag(to, FLAG_UNSCHEDULED) && (to->muted == 0)) {
Member:

atomic_load_explicit here.

// 2. the sender isn't overloaded
// AND
// 3. we are sending to another actor (as compared to sending to self)
if((has_flag(to, FLAG_OVERLOADED) || (to->muted > 0)) &&
Member:

atomic_load_explicit here.

for(uint32_t i = 0; i < scheduler_count; i++)
{
if(&scheduler[i] != sched)
send_msg(i, SCHED_UNMUTE_ACTOR, (intptr_t)actor);
Member:

Now that all schedulers can broadcast, send_msg_single is unsafe. Calls to send_msg_single should be replaced by calls to send_msg.
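
A sketch of the substitution being suggested (the send_msg_all signature here is assumed from context, not copied from the source):

static void send_msg_all(sched_msg_t msg, intptr_t arg)
{
  // every scheduler may now broadcast, so the single-producer queue push
  // used by send_msg_single is no longer safe
  for(uint32_t i = 0; i < scheduler_count; i++)
    send_msg(i, msg, arg);  // previously send_msg_single(i, msg, arg)
}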

Member Author:

Does that mean we should remove send_msg_single entirely? I believe that is the implication, but I want to verify.

Member Author:

I only found one instance, in send_msg_all in scheduler.c.

Updating.

Member:

Yes, this is what I meant.

Member Author:

It's gone!

if(r == NULL)
{
ponyint_muteset_putindex(&mref->value, sender, index2);
sender->muted += 1;
Member:

This is currently equivalent to atomic_fetch_add_explicit(&sender->muted, 1, memory_order_seq_cst), i.e. a very expensive operation.

As far as I can see only one scheduler can mute/unmute a given actor at a time, so this can be replaced with

uint8_t muted = atomic_load_explicit(&sender->muted, memory_order_relaxed);
atomic_store_explicit(&sender->muted, muted + 1, memory_order_relaxed);

If I'm wrong and multiple schedulers can modify the muted field of an actor at the same time, this should be atomic_fetch_add_explicit(&sender->muted, 1, memory_order_relaxed) instead.

Member Author:

That's correct. Only a single scheduler can mute an actor at a time, as muting happens on message send.


void ponyint_sched_unmute(pony_ctx_t* ctx, pony_actor_t* actor, bool inform)
{
// this needs a better name. its not unmuting actor.
Member:

I'd suggest ponyint_sched_unmute_senders.

while((muted = ponyint_muteset_next(&mref->value, &i)) != NULL)
{
pony_assert(muted->muted > 0);
muted->muted -= 1;
Member:

Same as in ponyint_sched_mute, these two lines and the if below can be replaced with

uint8_t muted_count = atomic_load_explicit(&muted->muted, memory_order_relaxed);
pony_assert(muted_count > 0);
muted_count--;
atomic_store_explicit(&muted->muted, muted_count, memory_order_relaxed);

if(muted_count == 0)
...

@slfritchie (Contributor):

@SeanTAllen Would you consider also adding some variation of https://gist.github.com/slfritchie/0dab74fd729b7ecdd2a11c32c1f984cb?

@SeanTAllen (Member Author):

@slfritchie I think that's reasonable.

Do you think coarse-grained tracking of the number of times an actor is overloaded or muted would be interesting? (Also overload cleared and unmuted.)

muted_count--;
atomic_store_explicit(&muted->muted, muted_count, memory_order_relaxed);

if (muted->muted == 0)
Member:

This needs to either be an atomic_load_explicit, or use the muted_count local.

Member Author:

Thanks, I missed that.

@slfritchie (Contributor):

@SeanTAllen tl;dr: yes.

DTrace (and presumably SystemTap) permits easy dynamic probes at entry and exit of any function. This code is structured almost well enough that dynamic function-entry probes could tell you most of what you'd like to know. Overload and not-overloaded have dedicated functions, but mute and unmute do not: muting status changes are buried deep inside ponyint_sched_mute() and ponyint_sched_unmute_senders().

The code could be restructured to give dedicated small functions for the actual mute state changes; then dynamic function entry probes are easy. However, there's infrastructure value in defining static probes for important events in the system. These new events are the kinds of thing that affect scheduling, and visibility into scheduling is Good.
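
As a sketch of that restructuring, a tiny dedicated function per state change gives dynamic function-entry probes (or a static probe) an obvious attach point. The helper name below is hypothetical; today the state changes live inside ponyint_sched_mute()/ponyint_sched_unmute_senders():

static void actor_set_muted(pony_actor_t* actor, uint8_t value)
{
  // a static DTrace/SystemTap probe for mute/unmute could also fire here
  atomic_store_explicit(&actor->muted, value, memory_order_relaxed);
}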

@SeanTAllen (Member Author):

I've found a couple of problems with this implementation.

  1. Program termination doesn't take into account that there might be muted (unscheduled) actors which will become scheduled again, which can lead to early program termination. The fewer actors running, the more likely that is to occur. Really we need to know whether there are any muted actors before termination, because if there are, we should keep trying to steal actors rather than exiting that scheduler thread.

  2. The incrementing and decrementing of the mute value for an actor isn't thread safe: more than one scheduler could try to decrement the mute value at a time, which could result in a data race and FUN. We'd probably want a CAS operation for that, OR (what would also solve issue 1) we need an actor like the cycle detector that handles all muting and unmuting and can know whether there are any "live" muted actors around (in which case a scheduler shouldn't exit).

@Praetonus (Member):

@SeanTAllen The first problem can be solved by two small modifications to the workstealing and quiescence detection algorithm.

  1. A scheduler shouldn't send SCHED_BLOCK if the size of its mutemap isn't 0 (a sketch follows this list).
  2. When a scheduler is looping in steal and reschedules a previously muted actor as a result of receiving SCHED_UNMUTE_ACTOR, it should resume its normal execution.
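
A minimal sketch of the first modification, assuming the per-scheduler mutemap introduced in this PR; the field and helper names are illustrative rather than the actual patch:

// In the scheduler's quiescence path: only announce SCHED_BLOCK when we
// are not tracking any muted senders, since those actors may wake up later.
if(ponyint_mutemap_size(&sched->mutemap) == 0)
  send_msg(0, SCHED_BLOCK, 0);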

Could you detail the circumstances in which the second problem can occur? It seems to me that a given actor can only be in one mutemap at a time.

@SeanTAllen (Member Author):

@Praetonus Excellent ideas. And now that I am a little less tired, I realize you are correct that an actor can only be in a single mutemap at a time. I need to add comments to that effect.

@SeanTAllen (Member Author):

At the next sync, I'd like to discuss what sort of documentation this might need. Inclusion in the tutorial? On the website? Just notes in the code?

@Praetonus (Member):

@SeanTAllen Here's the diff containing the changes needed to remove the message chain iteration in ponyint_maybe_mute: https://gist.github.com/Praetonus/e6f9d24d1f88e4d1fbfd97dbdc340fef

@SeanTAllen (Member Author):

@Praetonus Patch applied. Looking good.

@SeanTAllen (Member Author):

@Praetonus everything we talked about is in place. Sylvan helped me track down a bug.

I have to add the ability for actors to manually indicate that they can't make progress, so that it's included in the backpressure system and perf testing, but this is getting close.

Right now, "can't make progress" would be a TCPConnection that is unable to send (due to backpressure, for example).

@SeanTAllen force-pushed the sean-runtime-backpressure branch 2 times, most recently from a0ee64d to f1a3b74, on October 20, 2017 at 21:25.
@SeanTAllen (Member Author):

@slfritchie I added the telemetry info. Can you have a look to make sure I did it correctly?

@SeanTAllen (Member Author) commented Oct 21, 2017:

Things I need to do:

  • add docs for the Backpressure package
  • performance testing using Wallaroo

Please give this another review. It's ready for more feedback.

Question: how, if at all, should we document this somewhat advanced feature beyond the package-level docs in Backpressure?

if(ponyint_messageq_push_single(&to->q, first, last))
{
if(!has_flag(to, FLAG_UNSCHEDULED))
if(!has_flag(to, FLAG_UNSCHEDULED) &&
(atomic_load_explicit(&to->muted, memory_order_relaxed) == 0)) {
Member Author:

Should I move this to a nicely named function?
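
For instance, such a helper might look like this (the name is hypothetical; the condition is taken from the diff above):

static bool should_schedule(pony_actor_t* to)
{
  // schedule the receiver only if it is neither unscheduled nor muted
  return !has_flag(to, FLAG_UNSCHEDULED) &&
    (atomic_load_explicit(&to->muted, memory_order_relaxed) == 0);
}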

@SeanTAllen (Member Author):

I just updated ProcessMonitor to use the backpressure mechanism to prevent unbounded pending-queue growth. This is a breaking API change to the constructor, as an ApplyReleaseBackpressureAuth token is now required.

@SeanTAllen (Member Author):

OK, with the updates to the TCPConnection documentation and with the addition of backpressure to ProcessMonitor, it appears to me that all of the "runaway memory growth" actors in the standard library now have some sort of backpressure coverage.

@SeanTAllen (Member Author):

There's a problem with work stealing and block messages. At the time a scheduler enters into steal() it might have a muted actor. This will cause it to not send a block message. When that actor is unmuted, it might be stolen by another scheduler, leaving the existing scheduler blocked but looping in steal, without ever being able to exit and without ever having sent a block message.

@slfritchie (Contributor):

> I added the telemetry info. Can you have a look to make sure I did it correctly?

The change to examples/dtrace/telemetry.d looks fine, @SeanTAllen.

@SeanTAllen (Member Author):

We did performance testing with Wallaroo.

Using our standard testing app under a normal load of 3 million messages a second, we saw no change in latencies. Awesome!

ponyint_sched_unmute_senders(ctx, actor, true);
}

PONY_API void pony_apply_backpressure()
Member:

I think this should follow the convention for runtime functions and take a pony_ctx_t* parameter.

Member Author:

This was intentionally done this way to make calling from Pony straightforward. Sylvan C and I spent a while coming up with this approach.

Member:

Ok, that makes sense.

set_flag(pony_ctx()->current, FLAG_UNDER_PRESSURE);
}

PONY_API void pony_release_backpressure()
Member:

Same as above.

Member Author:

See above

ponyint_sched_unmute_senders(ctx, ctx->current, true);
}

bool ponyint_triggers_muting(pony_actor_t* actor)
Member:

Same as above.

Member Author:

Given this only needs the actor, it's unclear to me why we should do that.

Member:

That's true, I missed that.

ponyint_is_muted(actor);
}

bool ponyint_is_muted(pony_actor_t* actor)
Member:

Same as above.

Member Author:

See above

@SeanTAllen (Member Author):

@jemc @Praetonus @mfelsche @sylvanc I added comments and did some cleanup. Please have a look. Where should there be additional explanation, comments, etc.?

@SeanTAllen (Member Author):

The latest perf testing round looks good. On to the cleanup, blog post, etc.

jemc pushed a commit that referenced this pull request on Nov 17, 2017:

A microbenchmark for measuring message passing rates in the Pony runtime.

This microbenchmark executes a sequence of intervals.  During an interval,
1 second long by default, the SyncLeader actor sends an initial
set of ping messages to a static set of Pinger actors.  When a Pinger
actor receives a ping() message, the Pinger will randomly choose
another Pinger to forward the ping() message.  This technique limits
the total number of messages "in flight" in the runtime to avoid
causing unnecessary memory consumption & overhead by the Pony runtime.

This small program has several intended uses:

* Demonstrate use of three types of actors in a Pony program: a timer,
  a SyncLeader, and many Pinger actors.

* As a stress test for Pony runtime development, for example, finding
  deadlocks caused by experiments in the "Generalized runtime
  backpressure" work in pull request
  #2264

* As a stress test for measuring message send & receive overhead for
  experiments in the "Add DTrace probes for all message push and pop
  operations" work in pull request
  #2295
@SeanTAllen (Member Author):

I'm planning on squashing and merging this today. Here's the planned commit message. Anything else that should be included? If not, I'll get this merged down and then start working on a blog post to announce the feature.


This commit has backpressure to Pony runtime scheduling.

Prior to this commit, it was possible to create Pony programs that could cause runaway memory growth due to a producer/consumer imbalance in message sending. There are a variety of actor topologies that could cause the problem.

Because Pony actor queues are unbounded, runaway memory growth is possible. This commit contains a program that demonstrates this: examples/overload has a large number of actors sending to a single actor. Under the original scheduler algorithm, each of these actors would receive a roughly equal number of chances to process messages. Each time an actor is given access to a scheduler, it is allowed to process up to batch-size messages. The default batch size is 100. The overload example has many, many actors sending to a single actor that can't keep up with what is sent to it.

This commit adjusts the Pony scheduler to apply backpressure. The basic idea is:

1- Pony message queues are unbounded
2- Memory can grow without end if an actor isn't able to keep up with the incoming messages
3- We need a way to detect if an actor is overloaded and if it is, apply backpressure

With this commit, we apply backpressure according to the following rules:

1- If an actor processes batch-size application messages, it is overloaded: it wasn't able to drain its message queue during a scheduler run.
2- Sending to an overloaded actor will result in the sender being "muted"
3- Muting means that an actor won't be scheduled for a period of time allowing overloaded actors to catch up

Particular details on this

1- Sending to an overloaded or muted actor will result in the sender being muted, unless the sender itself is overloaded (the send-side check is sketched below).
2- Muted actors will remain unscheduled until the muted/overloaded actors they sent to are no longer muted/overloaded.
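
Condensed into code, the send-side check looks roughly like the condition already shown in this review (a sketch, not the literal patch):

if((has_flag(to, FLAG_OVERLOADED) ||
  (atomic_load_explicit(&to->muted, memory_order_relaxed) > 0)) &&
  !has_flag(ctx->current, FLAG_OVERLOADED) &&  // overloaded senders are exempt
  (ctx->current != to))                        // sending to self never mutes
{
  // record ctx->current in to's mutemap and mute it until 'to' recovers
  // (handled by the runtime's ponyint_maybe_mute / ponyint_sched_mute path)
}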

With this commit, the basics of backpressure are in place. Still to come:

Backpressure isn't currently applied from the cycle detector, so its queue can still grow in an unbounded fashion. More work/thought needs to go into addressing that problem.

It's possible that, due to implementation bugs, this commit results in deadlocks for some actor topologies. I found a number of implementation issues that had to be fixed after my first pass. The basic algorithm, though, should be fine.

There are a number of additional work items that could be added on to the basic scheme. Some might turn out to be actual improvements, some might turn out to not make sense.

1- Allow for notification of senders when they send to a muted/overloaded actor. This would allow application level decisions on possible load shedding or other means to address the underlying imbalance.

2- Allow an actor to know that it has become overloaded so it can take application-level action.

3- Allow actors to have different batch sizes that might result in better performance for some actor topologies

This work was performance tested at Wallaroo Labs and was found under heavy loads to have no noticeable impact on performance.

@jemc (Member) commented Nov 17, 2017:

Excellent! 👍

One small typo I noticed in the first line: I think "has" should be "adds".

@SeanTAllen merged commit 1104a6c into master on Nov 17, 2017.
@SeanTAllen deleted the sean-runtime-backpressure branch on November 17, 2017 at 17:08.
@SeanTAllen removed the "do not merge" (This PR should not be merged at this time) and "needs discussion during sync" labels on Nov 17, 2017.