
Generalized runtime backpressure #2264

Merged · 72 commits merged from sean-runtime-backpressure into master · Nov 17, 2017

Conversation

@SeanTAllen (Member):

This is a first draft of generalized runtime backpressure. A final version would require changes to TCPConnection and anything else that can become "overloaded" and would need to exert backpressure based on external conditions (such as a slow receiver).

@SeanTAllen added the "do not merge" label (This PR should not be merged at this time) on Oct 9, 2017.
@Praetonus (Member) left a comment:

Nice! I've left some comments on implementation details.

}
}

void maybe_mute(pony_ctx_t* ctx, pony_actor_t* to, pony_msg_t* first,
Member:

The function name should be ponyint_maybe_mute according to the runtime naming conventions.


pony_msg_t* m = first;

while(m != last)
Member:

Iterating the message chain here is going to be expensive, so I think it would be nice to refactor in order to remove the need for the loop.

In the current state of things, a message chain cannot contain ORCA messages, so a possible alternative would be to take the message chain length as a parameter. If we want to future-proof the function now, the parameter could be the number of application messages in the chain instead.
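
For illustration, the refactor would look roughly like this (a sketch only; the parameter name is hypothetical, and the actual change came in as a patch later in this thread):

void ponyint_maybe_mute(pony_ctx_t* ctx, pony_actor_t* to, size_t app_msg_count)
{
  // With a count supplied by the caller there is no need to walk the
  // message chain; zero application messages means muting can't trigger.
  if(app_msg_count == 0)
    return;

  // ... existing overload/mute checks on 'to' and ctx->current ...
}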

Member:

If we use that assumption about message chains not containing ORCA messages, we should likely try to add a pony_assert somewhere to ensure that.

Member:

That's true. I think the best place for that assertion would be in pony_chain.

Member Author:

@Praetonus If we had the number of application messages in the chain, that would become much easier: if there are more than 0, we do our check; if there are none, we skip it. I was going to bring that up.

Member:

Discussed this with @SeanTAllen, I'm going to implement that change (counting the number of application messages in the chain) since it also requires a change to one of the optimisation passes, and I'll submit the patch in this PR.

Member Author:

Praetonus' patch was applied.

@@ -212,8 +212,20 @@ bool ponyint_actor_run(pony_ctx_t* ctx, pony_actor_t* actor, size_t batch)
app++;
try_gc(ctx, actor);

// if we become muted as a result of handling a message, bail out now.
if(actor->muted > 0)
Member:

This should be an atomic_load_explicit with memory_order_relaxed. Atomic operations without an explicit memory order implicitly use memory_order_seq_cst and that's a performance hit.
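
For illustration, the suggested form (matching the pattern used later in this diff):

if(atomic_load_explicit(&actor->muted, memory_order_relaxed) > 0)
{
  // relaxed load: no need for the full seq_cst barrier here;
  // bail out of the run loop, as in the diff above
}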

@@ -225,6 +237,15 @@ bool ponyint_actor_run(pony_ctx_t* ctx, pony_actor_t* actor, size_t batch)
// We didn't hit our app message batch limit. We now believe our queue to be
// empty, but we may have received further messages.
pony_assert(app < batch);
pony_assert(actor->muted == 0);
Member:

atomic_load_explicit here.

if(ponyint_messageq_push(&to->q, first, last))
{
if(!has_flag(to, FLAG_UNSCHEDULED))
if(!has_flag(to, FLAG_UNSCHEDULED) && (to->muted == 0)) {
Member:

atomic_load_explicit here.

// 2. the sender isn't overloaded
// AND
// 3. we are sending to another actor (as compared to sending to self)
if((has_flag(to, FLAG_OVERLOADED) || (to->muted > 0)) &&
Member:

atomic_load_explicit here.

for(uint32_t i = 0; i < scheduler_count; i++)
{
if(&scheduler[i] != sched)
send_msg(i, SCHED_UNMUTE_ACTOR, (intptr_t)actor);
Member:

Now that all schedulers can broadcast, send_msg_single is unsafe. Calls to send_msg_single should be replaced by calls to send_msg.
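
A sketch of the substitution being suggested (the send_msg_all signature here is assumed from context, not copied from the source):

static void send_msg_all(sched_msg_t msg, intptr_t arg)
{
  // every scheduler may now broadcast, so the single-producer queue push
  // used by send_msg_single is no longer safe
  for(uint32_t i = 0; i < scheduler_count; i++)
    send_msg(i, msg, arg);  // previously send_msg_single(i, msg, arg)
}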

Member Author:

Does that mean we should remove send_msg_single entirely? I believe that is the implication, but I want to verify.

Member Author:

I only found one instance, in send_msg_all in scheduler.c.

Updating.

Member:

Yes, this is what I meant.

Member Author:

It's gone!

if(r == NULL)
{
ponyint_muteset_putindex(&mref->value, sender, index2);
sender->muted += 1;
Member:

This is currently equivalent to atomic_fetch_add_explicit(&sender->muted, 1, memory_order_seq_cst), i.e. a very expensive operation.

As far as I can see only one scheduler can mute/unmute a given actor at a time, so this can be replaced with

uint8_t muted = atomic_load_explicit(&sender->muted, memory_order_relaxed);
atomic_store_explicit(&sender->muted, muted + 1, memory_order_relaxed);

If I'm wrong and multiple schedulers can modify the muted field of an actor at the same time, this should be atomic_fetch_add_explicit(&sender->muted, 1, memory_order_relaxed) instead.

Member Author:

That's correct. Only a single scheduler can mute an actor at a time, as muting happens on message send.


void ponyint_sched_unmute(pony_ctx_t* ctx, pony_actor_t* actor, bool inform)
{
// this needs a better name. its not unmuting actor.
Member:

I'd suggest ponyint_sched_unmute_senders.

while((muted = ponyint_muteset_next(&mref->value, &i)) != NULL)
{
pony_assert(muted->muted > 0);
muted->muted -= 1;
Member:

Same as in ponyint_sched_mute, these two lines and the if below can be replaced with

uint8_t muted_count = atomic_load_explicit(&muted->muted, memory_order_relaxed);
pony_assert(muted_count > 0);
muted_count--;
atomic_store_explicit(&muted->muted, muted_count, memory_order_relaxed);

if(muted_count == 0)
...

@slfritchie (Contributor):

@SeanTAllen Would you consider also adding some variation of https://gist.github.com/slfritchie/0dab74fd729b7ecdd2a11c32c1f984cb?

@SeanTAllen (Member Author):

@slfritchie I think that's reasonable.

Do you think coarse-grained tracking of the number of times an actor is overloaded or muted would be interesting? (Also overload cleared and unmuted.)

muted_count--;
atomic_store_explicit(&muted->muted, muted_count, memory_order_relaxed);

if (muted->muted == 0)
Member:

This needs to either be an atomic_load_explicit, or use the muted_count local.

Member Author:

Thanks, I missed that.

@slfritchie (Contributor):

@SeanTAllen tl;dr: yes.

DTrace (and presumably SystemTap) permits easy dynamic probes at entry and exit of any function. This code is structured almost well enough that dynamic function-entry probes could tell you most of what you'd like to know. Overload and not-overloaded have dedicated functions, but mute and unmute do not: muting status changes are buried deep inside ponyint_sched_mute() and ponyint_sched_unmute_senders().

The code could be restructured to give dedicated small functions for the actual mute state changes; then dynamic function entry probes are easy. However, there's infrastructure value in defining static probes for important events in the system. These new events are the kinds of thing that affect scheduling, and visibility into scheduling is Good.
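
As a sketch of that restructuring, a tiny dedicated function per state change gives dynamic function-entry probes (or a static probe) an obvious attach point. The helper name below is hypothetical; today the state changes live inside ponyint_sched_mute()/ponyint_sched_unmute_senders():

static void actor_set_muted(pony_actor_t* actor, uint8_t value)
{
  // a static DTrace/SystemTap probe for mute/unmute could also fire here
  atomic_store_explicit(&actor->muted, value, memory_order_relaxed);
}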

@SeanTAllen (Member Author):

I've found a couple of problems with this implementation.

  1. Program termination doesn't take into account that there might be muted (unscheduled) actors which will become scheduled again, which can lead to early program termination. The fewer actors running, the more likely that is to occur. Really we need to know whether there are any muted actors before termination, because if there are, we should keep trying to steal actors rather than exiting that scheduler thread.

  2. The incrementing and decrementing of the mute value for an actor isn't thread safe: more than one scheduler could try to decrement the mute value at a time, which could result in a data race and FUN. We'd probably want a CAS operation for that, OR (what would also solve issue 1) we need an actor like the cycle detector that handles all muting and unmuting and can know whether there are any "live" muted actors around (in which case a scheduler shouldn't exit).

@Praetonus (Member):

@SeanTAllen The first problem can be solved by two small modifications to the workstealing and quiescence detection algorithm.

  1. A scheduler shouldn't send SCHED_BLOCK if the size of its mutemap isn't 0 (a sketch follows this list).
  2. When a scheduler is looping in steal and reschedules a previously muted actor as a result of receiving SCHED_UNMUTE_ACTOR, it should resume its normal execution.
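
A minimal sketch of the first modification, assuming the per-scheduler mutemap introduced in this PR; the field and helper names are illustrative rather than the actual patch:

// In the scheduler's quiescence path: only announce SCHED_BLOCK when we
// are not tracking any muted senders, since those actors may wake up later.
if(ponyint_mutemap_size(&sched->mutemap) == 0)
  send_msg(0, SCHED_BLOCK, 0);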

Could you detail the circumstances in which the second problem can occur? It seems to me that a given actor can only be in one mutemap at a time.

@SeanTAllen (Member Author):

@Praetonus Excellent ideas. And now that I am a little less tired, I realize you are correct that an actor can only be in a single mutemap at a time. I need to add comments to that effect.

@SeanTAllen (Member Author):

At the next sync, I'd like to discuss what sort of documentation this might need. Inclusion in the tutorial? On the website? Just notes in the code?

@Praetonus (Member):

@SeanTAllen Here's the diff containing the changes needed to remove the message chain iteration in ponyint_maybe_mute: https://gist.github.com/Praetonus/e6f9d24d1f88e4d1fbfd97dbdc340fef

@SeanTAllen (Member Author):

@Praetonus Patch applied. Looking good.

@SeanTAllen (Member Author):

@Praetonus everything we talked about is in place. Sylvan helped me track down a bug.

I have to add the ability for actors to manually indicate that they can't make progress, so that it's included in the backpressure system and perf testing, but this is getting close.

Right now, "can't make progress" would be a TCPConnection that is unable to send (due to backpressure, for example).

@SeanTAllen force-pushed the sean-runtime-backpressure branch 2 times, most recently from a0ee64d to f1a3b74, on October 20, 2017 at 21:25.
@SeanTAllen (Member Author):

@slfritchie I added the telemetry info. Can you have a look to make sure I did it correctly?

@SeanTAllen (Member Author) commented Oct 21, 2017:

Things I need to do:

  • add docs for the Backpressure package
  • performance testing using Wallaroo

Please give this another review. It's ready for more feedback.

Question: how, if at all, should we document this somewhat advanced feature beyond the package-level docs in Backpressure?

if(ponyint_messageq_push_single(&to->q, first, last))
{
if(!has_flag(to, FLAG_UNSCHEDULED))
if(!has_flag(to, FLAG_UNSCHEDULED) &&
(atomic_load_explicit(&to->muted, memory_order_relaxed) == 0)) {
Member Author:

Should I move this to a nicely named function?
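
For instance, such a helper might look like this (the name is hypothetical; the condition is taken from the diff above):

static bool should_schedule(pony_actor_t* to)
{
  // schedule the receiver only if it is neither unscheduled nor muted
  return !has_flag(to, FLAG_UNSCHEDULED) &&
    (atomic_load_explicit(&to->muted, memory_order_relaxed) == 0);
}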

@SeanTAllen (Member Author):

I just updated ProcessMonitor to use the backpressure mechanism to prevent unbounded pending-queue growth. This is a breaking API change to the constructor, as an ApplyReleaseBackpressureAuth token is now required.

@SeanTAllen (Member Author):

OK, with the updates to the TCPConnection documentation and with the addition of backpressure to ProcessMonitor, it appears to me that all of the "runaway memory growth" actors in the standard library now have some sort of backpressure coverage.

@SeanTAllen (Member Author):

There's a problem with work stealing and block messages. At the time a scheduler enters into steal() it might have a muted actor. This will cause it to not send a block message. When that actor is unmuted, it might be stolen by another scheduler, leaving the existing scheduler blocked but looping in steal, without ever being able to exit and without ever having sent a block message.

@slfritchie (Contributor):

> I added the telemetry info. Can you have a look to make sure I did it correctly?

The change to examples/dtrace/telemetry.d looks fine, @SeanTAllen.

@SeanTAllen (Member Author):

We did performance testing with Wallaroo.

Using our standard testing app under a normal load of 3 million messages a second, we saw no change in latencies. Awesome!

ponyint_sched_unmute_senders(ctx, actor, true);
}

PONY_API void pony_apply_backpressure()
Member:

I think this should follow the convention for runtime functions and take a pony_ctx_t* parameter.

Member Author:

This was intentionally done this way to make calling from Pony straightforward. Sylvan C and I spent a while coming up with this approach.

Member:

Ok, that makes sense.

set_flag(pony_ctx()->current, FLAG_UNDER_PRESSURE);
}

PONY_API void pony_release_backpressure()
Member:

Same as above.

Member Author:

See above

ponyint_sched_unmute_senders(ctx, ctx->current, true);
}

bool ponyint_triggers_muting(pony_actor_t* actor)
Member:

Same as above.

Member Author:

Given this only needs the actor, it's unclear to me why we should do that.

Member:

That's true, I missed that.

ponyint_is_muted(actor);
}

bool ponyint_is_muted(pony_actor_t* actor)
Member:

Same as above.

Member Author:

See above

@SeanTAllen (Member Author):

@jemc @Praetonus @mfelsche @sylvanc I added comments and did some cleanup. Please have a look. Where should there be additional explanation, comments, etc.?

@SeanTAllen (Member Author):

The latest perf testing round looks good. On to the cleanup, blog post, etc.

jemc pushed a commit that referenced this pull request on Nov 17, 2017:

A microbenchmark for measuring message passing rates in the Pony runtime.

This microbenchmark executes a sequence of intervals.  During an interval,
1 second long by default, the SyncLeader actor sends an initial
set of ping messages to a static set of Pinger actors.  When a Pinger
actor receives a ping() message, the Pinger will randomly choose
another Pinger to forward the ping() message.  This technique limits
the total number of messages "in flight" in the runtime to avoid
causing unnecessary memory consumption & overhead by the Pony runtime.

This small program has several intended uses:

* Demonstrate use of three types of actors in a Pony program: a timer,
  a SyncLeader, and many Pinger actors.

* As a stress test for Pony runtime development, for example, finding
  deadlocks caused by experiments in the "Generalized runtime
  backpressure" work in pull request
  #2264

* As a stress test for measuring message send & receive overhead for
  experiments in the "Add DTrace probes for all message push and pop
  operations" work in pull request
  #2295
@SeanTAllen (Member Author):

I'm planning on squashing and merging this today. Here's the planned commit message. Anything else that should be included? If not, I'll get this merged down and then start working on a blog post to announce the feature.


This commit has backpressure to Pony runtime scheduling.

Prior to this commit, it was possible to create Pony programs that could cause runaway memory growth due to a producer/consumer imbalance in message sending. There are a variety of actor topologies that could cause the problem.

Because Pony actor queues are unbounded, runaway memory growth is possible. This commit contains a program that demonstrates this: examples/overload has a large number of actors sending to a single actor. Under the original scheduler algorithm, each of these actors would receive a roughly equal number of chances to process messages. Each time an actor is given access to a scheduler, it is allowed to process up to batch-size messages. The default batch size is 100. The overload example has many, many actors sending to a single actor that can't keep up with what is sent to it.

This commit adjusts the Pony scheduler to apply backpressure. The basic idea is:

1- Pony message queues are unbounded
2- Memory can grow without end if an actor isn't able to keep up with the incoming messages
3- We need a way to detect if an actor is overloaded and if it is, apply backpressure

With this commit, we apply backpressure according to the following rules:

1- If an actor processes batch-size application messages, it is overloaded: it wasn't able to drain its message queue during a scheduler run.
2- Sending to an overloaded actor will result in the sender being "muted"
3- Muting means that an actor won't be scheduled for a period of time allowing overloaded actors to catch up

Particular details on this

1- Sending to an overloaded or muted actor will result in the sender being muted, unless the sender itself is overloaded (the send-side check is sketched below).
2- Muted actors will remain unscheduled until the muted/overloaded actors they sent to are no longer muted/overloaded.
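
Condensed into code, the send-side check looks roughly like the condition already shown in this review (a sketch, not the literal patch):

if((has_flag(to, FLAG_OVERLOADED) ||
  (atomic_load_explicit(&to->muted, memory_order_relaxed) > 0)) &&
  !has_flag(ctx->current, FLAG_OVERLOADED) &&  // overloaded senders are exempt
  (ctx->current != to))                        // sending to self never mutes
{
  // record ctx->current in to's mutemap and mute it until 'to' recovers
  // (handled by the runtime's ponyint_maybe_mute / ponyint_sched_mute path)
}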

With this commit, the basics of backpressure are in place. Still to come:

Backpressure isn't currently applied from the cycle detector, so its queue can still grow in an unbounded fashion. More work/thought needs to go into addressing that problem.

It's possible that, due to implementation bugs, this commit results in deadlocks for some actor topologies. I found a number of implementation issues that had to be fixed after my first pass. The basic algorithm, though, should be fine.

There are a number of additional work items that could be added on to the basic scheme. Some might turn out to be actual improvements, some might turn out to not make sense.

1- Allow for notification of senders when they send to a muted/overloaded actor. This would allow application level decisions on possible load shedding or other means to address the underlying imbalance.

2- Allow an actor to know that it has become overloaded so it can take application-level action.

3- Allow actors to have different batch sizes that might result in better performance for some actor topologies

This work was performance tested at Wallaroo Labs and was found under heavy loads to have no noticeable impact on performance.

@jemc (Member) commented Nov 17, 2017:

Excellent! 👍

One small typo I noticed in the first line: I think "has" should be "adds".

@SeanTAllen merged commit 1104a6c into master on Nov 17, 2017.
@SeanTAllen deleted the sean-runtime-backpressure branch on November 17, 2017 at 17:08.
@SeanTAllen removed the "do not merge" (This PR should not be merged at this time) and "needs discussion during sync" labels on Nov 17, 2017.