DISCUSSION PROPOSAL Polly eventing and metrics architecture

Version: 0.7

Authors/Contributors: Dylan Reisenberger @reisenberger

Thanks also to: @lakario; @ankitbko; @seanfarrow; @tomkerkhove whose earlier ideas, comments and prototypes have already influenced this.

Date: 28 October 2017

Status: Proposals for discussion.

Comment via this GitHub issue or join our Slack channel for metrics updates and discussion.

1 Modelling `PolicyEvent`

1.1 PolicyEvent: Type design

Type/interface hierarchy? Or keep it relatively flat (perhaps single sealed type), extensible with dictionary-like semantics for data specific to individual policy/event types? Or combination?

1.1.1 Defining shared `EventType`s across producers and consumers

Producers and consumers need a common way of specifying and identifying event types.

These could live either in a separate NuGet package named (eg) Polly.EventTypes or Polly.Events.EventTypes, which both producers and consumers reference, or in the main Polly package.

If a type hierarchy is used, the .NET type itself distinguishes events for .NET consumers.

If events are ever to be consumed by a non-.NET platform, a string-constant/enumeration-like identifier could also be useful. A quasi-enumeration of policy event types could be: (example, not necessarily exact structure)

Polly.Retry.Events.OnRetry = "Retry.OnRetry"; // or just "OnRetry", if identified to "Retry" policy elsewhere
Polly.Retry.Events.OnRetrySuccess = "Retry.OnRetrySuccess";
Polly.CircuitBreaker.Events.OnOpen = "CircuitBreaker.OnOpen";
Polly.CircuitBreaker.Events.OnClose = "CircuitBreaker.OnClose";
Polly.Fallback.Events.OnFallbackInvoked = "Fallback.OnFallbackInvoked";

If so, a string is probably preferable to a pure enum. Users coding custom policies may want to add custom event types: a string is open for extension while a pure enum is closed. A public static string can still provide compile-time-bound matching, eg .Where(e => e.EventType == Polly.CircuitBreaker.Events.OnOpen), if a type hierarchy is not available.
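As a rough illustration of the string-constant approach, the sketch below shows a quasi-enumeration, a custom extension of it, and compile-time-bound matching in a LINQ filter. All type and member names here are hypothetical, and `const` is used where the text above says `public static string`; either would serve.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical quasi-enumeration built from string constants.
public static class CircuitBreakerEvents
{
    public const string OnOpen = "CircuitBreaker.OnOpen";
    public const string OnClose = "CircuitBreaker.OnClose";
}

// A custom policy author can add event types without touching the class above,
// which a closed enum would not allow.
public static class MyCustomPolicyEvents
{
    public const string OnThrottled = "MyCustomPolicy.OnThrottled";
}

// Minimal stand-in for an event that carries only its type identifier.
public sealed class PolicyEvent
{
    public PolicyEvent(string eventType) { EventType = eventType; }
    public string EventType { get; }
}

public static class EventTypeFilterExample
{
    // A typo in the constant name fails at compile time,
    // unlike a typo in a raw string literal.
    public static int CountCircuitOpens(IEnumerable<PolicyEvent> events) =>
        events.Count(e => e.EventType == CircuitBreakerEvents.OnOpen);
}
```

A non-.NET consumer would see only the string value (for example "CircuitBreaker.OnOpen"), so the same identifier works on both sides of the wire.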

1.2 PolicyEvent: Content

A PolicyEvent would comprise (thoughts so far) three main kinds of data:

1 Metadata: common to all event types
2 Event data: Data specific to the given policy type and event type
3 User data: Custom data which the user could add to events

1.2.1 PolicyEvent: common metadata

A property-value/key-value store of metadata common to all events:

PolicyWrapKey: The key of the PolicyWrap (if applicable) executing
PolicyKey: The key of the Policy generating the event
ExecutionKey: (better renamed?: CallSiteKey): A key identifying the call site within the code generating the event. Potentially differs from PolicyKey, as a policy instance may be re-used in multiple call sites.
ExecutionGuid: a Guid distinguishing this particular execution.

And:

SourceTimestamp: UTC timestamp of the time in the source system at which the event was raised

See later discussion also on capturing call execution time, re:

SourceTimerTicks: Tick count in the source system at which the event was raised
SourceTicksPerSecond: resolution of ticks at source

And:

possibly PolicyType: eg Retry; CircuitBreaker; Fallback.
EventType: String constant indicating the type of event. Drawn from a quasi-enumeration.

More?
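To make the list above concrete, here is one possible shape for the common metadata, following the "flat, sealed type" option from 1.1. The class name and the `Create` helper are assumptions for illustration only; the property names are taken from the list.

```csharp
using System;
using System.Diagnostics;

// Hypothetical flat, sealed carrier for the metadata common to all events.
public sealed class PolicyEventMetadata
{
    public string PolicyWrapKey { get; set; }      // null when no PolicyWrap is executing
    public string PolicyKey { get; set; }
    public string ExecutionKey { get; set; }       // possibly better named CallSiteKey
    public Guid ExecutionGuid { get; set; }
    public DateTime SourceTimestamp { get; set; }  // UTC time at the source
    public long SourceTimerTicks { get; set; }     // tick count at the source
    public long SourceTicksPerSecond { get; set; } // resolution of those ticks
    public string PolicyType { get; set; }         // eg Retry, CircuitBreaker
    public string EventType { get; set; }          // drawn from the quasi-enumeration

    // Illustrative helper: stamps the time-related fields at the moment of creation.
    public static PolicyEventMetadata Create(
        string policyKey, string policyType, string eventType, Guid executionGuid)
    {
        return new PolicyEventMetadata
        {
            PolicyKey = policyKey,
            PolicyType = policyType,
            EventType = eventType,
            ExecutionGuid = executionGuid,
            SourceTimestamp = DateTime.UtcNow,
            SourceTimerTicks = Stopwatch.GetTimestamp(),
            SourceTicksPerSecond = Stopwatch.Frequency
        };
    }
}
```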

1.2.2 PolicyEvent: policy-specific and event-specific data

A property-value/key-value store of data specific to the policy type and/or event type. This might contain a mixture of configuration information and state. For instance, for retry policy events it might contain:

MaxRetries: 3
CurrentTry: 1

For advanced circuit-breaker, eg:

FailureThreshold: 0.5
(other similar configuration data)
CircuitState: HalfOpen
CircuitBrokenUntil: time
(etc)

Any value in splitting this into two sections?: policy-constant elements (eg configuration) and varying elements (temporal state, or perhaps data particular to that one event)? (Is distinction sufficiently clear to always maintain, always add value?). Probably becomes clearer as we list out events per policy type.

Should configuration be (i) written to each event, and/or (ii) modelled as a separate stream? (Compare: https://github.com/Netflix/Hystrix/wiki/Metrics-and-Monitoring#configuration-stream). Perhaps initially (i), then later also (ii). Separate developments are underway within Polly around dynamic reconfiguration.

1.2.3 PolicyEvent: user data

A key-value store of data the user may wish added to each event.

Users might have their own metadata they attach to executions via Polly's Context, which they may want to expose to eventing dashboards / downstream metrics ingestion. Examples:

a CorrelationId tracking the progress of an original user request among downstream microservice interactions
the application/component generating the event
when horizontally-scaling, the instance/node which generated the event

Idea: Users could specify a Func<Context, IEnumerable<KeyValuePair<string, object>>> as a projection of user data from Context. If so, it certainly must be a selective Func like this rather than simply serialize the whole of Context. Context may include sensitive user data which it would be inappropriate to distribute; serializing the whole of Context likely wasteful.
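A minimal sketch of such a selective projection follows. The `Context` class here is a stand-in modelled as a plain dictionary so that the example is self-contained, and the allowed key names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for Polly's Context, modelled as a dictionary for this sketch only.
public sealed class Context : Dictionary<string, object> { }

public static class UserDataProjectionExample
{
    // Hypothetical allow-list: only these keys are ever copied onto events.
    private static readonly HashSet<string> AllowedKeys =
        new HashSet<string> { "CorrelationId", "Component", "Node" };

    // The user-supplied projection: selects a subset of Context
    // rather than serializing all of it.
    public static readonly Func<Context, IEnumerable<KeyValuePair<string, object>>> ProjectUserData =
        context => context.Where(kvp => AllowedKeys.Contains(kvp.Key)).ToList();
}
```

With this shape, anything sensitive placed in Context stays out of the event stream unless the user explicitly opts it in.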

1.3 PolicyEvent: compatibility

We should possibly consider compatibility/convertibility of the Polly event format to other formats we might want to interface with, such as Azure Event Grid, input to AppInsights, Hystrix dashboard, etc.

Anyone is welcome to put time into these comparisons and draw out anything we need to learn.

2 Architecture for emitting and consuming events

2.1 Metrics Architecture: Layers

Layers could be:

(1) Core Polly package
(2a) Polly.Events.Rx
(2b) Polly.Metrics.Rx
(3) Polly.Metrics.Rx.AppInsights, Polly.Metrics.Rx.HystrixDashboard (etc)

2.1.1 Metrics Architecture: (1) Core `Polly` package

The core Polly package should ideally not take a dependency on Rx.

We may be able to rely on the fact that System.IObservable<T> is in the core BCLs, outside Rx.

Or it may be that core Polly policies should expose a traditional .NET event hook (or similar: see discussion on de-duplication) for raising initial events.

Testability may influence the choice.
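To illustrate the first option, the sketch below implements an event source against System.IObservable<T> and System.IObserver<T> only, with no reference to Rx. The class names are hypothetical, and events are reduced to plain strings to keep the example short.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event source relying only on BCL interfaces, not on Rx.
public sealed class PolicyEventSource : IObservable<string>
{
    private readonly object _gate = new object();
    private readonly List<IObserver<string>> _observers = new List<IObserver<string>>();

    public IDisposable Subscribe(IObserver<string> observer)
    {
        lock (_gate) { _observers.Add(observer); }
        return new Unsubscriber(this, observer);
    }

    // Called by policy code when an event occurs.
    public void Raise(string eventType)
    {
        IObserver<string>[] snapshot;
        lock (_gate) { snapshot = _observers.ToArray(); }
        foreach (var observer in snapshot) observer.OnNext(eventType);
    }

    // Disposing the subscription removes the observer again.
    private sealed class Unsubscriber : IDisposable
    {
        private readonly PolicyEventSource _source;
        private readonly IObserver<string> _observer;

        public Unsubscriber(PolicyEventSource source, IObserver<string> observer)
        {
            _source = source;
            _observer = observer;
        }

        public void Dispose()
        {
            lock (_source._gate) { _source._observers.Remove(_observer); }
        }
    }
}

// A trivial observer that records what it receives; handy in unit tests.
public sealed class RecordingObserver : IObserver<string>
{
    public List<string> Received { get; } = new List<string>();
    public void OnNext(string value) => Received.Add(value);
    public void OnError(Exception error) { }
    public void OnCompleted() { }
}
```

An Rx-based layer could then consume such a source directly, since Rx operators work on any IObservable<T>.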

2.1.2 Metrics Architecture: (2a) `Polly.Events.Rx` and (2b) `Polly.Metrics.Rx`

Separation between (2a) and (2b) may initially look / be unnecessary, but see later discussion about shipping off box.

2.1.2.1 Metrics Architecture: (2a) `Polly.Events.Rx`

(2a) Polly.Events.Rx provides an implementation that ensures layer (1) can emit an Rx stream of events.

Perhaps Observable.FromEvent(...) pattern (or similar) if (1) is based around .NET events (or similar).

Or if (1) expresses signatures in System.IObservable<T>, it might be an Rx event-pump that could be injected into policies in (1) to turn on the events.

2.1.2.2 Metrics Architecture: (2b) `Polly.Metrics.Rx`

(2b) Polly.Metrics.Rx would offer a range of Rx functions aggregating events to create a range of standard metrics: (below quick examples, would be many more)

on retry policies, a ?rolling-window gauge? of the average number of retries needed to achieve success
on cache policy, rolling cacheHit/cacheMiss ratio
on Policys and PolicyWraps, average call execution time

2.1.2.3 Classes of information

The main classes of information that can be aggregated from source events are:

Informational:

simple informational properties, eg the name of the PolicyKey
configuration properties, eg how many retries are configured for this retry policy

Counts:

pure integer. A count of how many times something has happened since metrics tracking started.
- Would this need to be accompanied by a timestamp indicating when counting started?

Timer:

how long something took, eg call execution time.

Gauge:

Both count and timer measurements may then be averaged across a recent window of time to produce a rolling gauge. For example:

Average number of retries needed for success on this call channel was 1.2, in the last 60 seconds
Average execution time of calls through this channel was 4.3ms, in the last 60 seconds
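The rolling gauge can be sketched without Rx as a plain windowed average over timestamped samples. This is only an illustration of the calculation; an Rx-based layer (2b) would more likely use windowing operators, and the names below are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RollingGaugeExample
{
    // Averages the samples whose timestamp falls inside the window ending at 'now'.
    // Returns null when no sample falls in the window.
    public static double? RollingAverage(
        IEnumerable<(DateTime TimestampUtc, double Value)> samples,
        DateTime now,
        TimeSpan window)
    {
        var recent = samples
            .Where(s => s.TimestampUtc > now - window && s.TimestampUtc <= now)
            .Select(s => s.Value)
            .ToList();

        return recent.Count == 0 ? (double?)null : recent.Average();
    }
}
```

The same function serves both examples above: feed it retry counts per successful call, or call durations in milliseconds.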

Terminology

Similar implementations (eg Hystrix and StatsD) use these terms in differing manners: research, and clarify our usage? Ref: https://github.com/Netflix/Hystrix/wiki/Metrics-and-Monitoring#bucketed-event-counters (and following); http://statsd.readthedocs.io/en/v0.5.0/types.html

2.1.3 Metrics architecture: (3) Polly.Metrics.Rx.AppInsights, Polly.Metrics.Rx.HystrixDashboard (etc)

Packages such as Polly.Metrics.Rx.AppInsights would transform/render metrics for input to an individual dashboard. Consumers need not only be dashboards: they could also be (eg) logging systems, or alert-raising systems.

Suggest: these layers are as 'dumb' as possible: they should not contain any knowledge/processing which manipulates data from the layer below into some further useful aggregation or statistic. If a new, useful aggregation or statistic is conceived, the logic/function to create that aggregation should ideally be implemented in (2b), so that it may be useful to other dashboards; layer (3) should only manipulate metrics computed by (2b) into formats acceptable to the consumer (3) targets. Analogy: similar to (3) being a 'dumb' view/view-model in MVC or MVVM; 'thinking' happens in lower layers.

2.2 Threading

We need to consider when to ship the Rx work off the user's thread: http://www.introtorx.com/content/v1.0.10621.0/15_SchedulingAndThreading.html. At (2b) Polly.Metrics.Rx ?

2.3 When to ship/flush events 'off box'?

For high-throughput systems, it may become important to be able to ship events 'off box' (off the main production server, towards processing capacity dedicated just to handling events/metrics), and to offer options for when to do so. Other high volume stats/metrics implementations like StatsD and Hystrix consider this.

The suggested division of packages above pre-plans for a couple of options for when to ship:

(1)-(2a) -> ship raw events off app servers -> (2b)--?>-(3) : Ship raw events (2a) off box before aggregating (2b). Increases network traffic but decreases metrics CPU load on the app servers.

or

(1)-(2a)-(2b) -> ship aggregated stats off app servers -> (3) : Aggregate stats (2b) still on app servers, before shipping off box. Decreases network traffic, but increases metrics CPU load on app servers.

Perhaps not for first implementation/don't need an answer immediately, but dividing (2a) and (2b) as separate packages [or keeping in mind the ability to do so] would forward plan for this.

When shipping off box, consider also batching events.

3 Timing/call duration metrics

3.1 What timing metrics should be captured?

Any policy type could emit two kinds of timing information (or the events necessary to calculate them):

Elapsed execution time of the overall policy execution, including work done by the policy code
Elapsed execution time of the user delegate execution, excluding work done by the policy code

These could be achieved by each policy instance emitting events:

PolicyExecutionStart
DelegateExecutionStart
DelegateExecutionEnd
PolicyExecutionEnd

3.2 How should timings be calculated?

The events detailed above (PolicyExecutionStart, DelegateExecutionStart, etc), and indeed any other event, could include long Ticks properties, allowing duration calculations by subtraction.
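A small sketch of the subtraction follows, assuming each event carries a tick count and the tick resolution (as SourceTimerTicks and SourceTicksPerSecond in 1.2.1). The helper name is hypothetical. Note that the ticks of a high-resolution timer are not necessarily the 100-nanosecond ticks used by TimeSpan, so the difference is converted via the resolution rather than passed to TimeSpan directly.

```csharp
using System;

public static class DurationExample
{
    // Converts the difference between two tick counts into a TimeSpan,
    // using the tick resolution reported by the source.
    public static TimeSpan Between(long startTicks, long endTicks, long ticksPerSecond)
    {
        if (ticksPerSecond <= 0)
            throw new ArgumentOutOfRangeException(nameof(ticksPerSecond));

        return TimeSpan.FromSeconds((endTicks - startTicks) / (double)ticksPerSecond);
    }
}
```

The overall policy time would be the difference between PolicyExecutionStart and PolicyExecutionEnd, and the delegate time the difference between DelegateExecutionStart and DelegateExecutionEnd; subtracting the second from the first gives the overhead of the policy code itself.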

Options for Ticks sources:

3.2.1 `DateTime.UtcNow.Ticks`

3.2.2 `System.Diagnostics.Stopwatch.ElapsedTicks`

Stopwatch is traditionally recommended as more precise, and is also preferable as a monotonic clock over a time-of-day clock.

To avoid repeated extra allocations of Stopwatch instances, Polly could run a thread-safe singleton Stopwatch instance. Int64.MaxValue is large enough relative to Stopwatch.Frequency that the tick count would not overflow within any realistic app lifecycle.
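The overflow claim can be checked with simple arithmetic: divide the largest long value by the tick frequency to get seconds, then by seconds per year. The sketch below does this for any frequency; the two frequencies mentioned afterwards are common values chosen for illustration, and the actual value of Stopwatch.Frequency depends on the platform.

```csharp
using System;
using System.Diagnostics;

public static class StopwatchOverflowExample
{
    // Years until a long tick counter starting at zero would overflow,
    // for a given tick frequency in ticks per second.
    public static double YearsUntilOverflow(long ticksPerSecond)
    {
        const double secondsPerYear = 365.25 * 24 * 60 * 60;
        return (long.MaxValue / (double)ticksPerSecond) / secondsPerYear;
    }

    // The figure for the machine this code runs on.
    public static double YearsUntilOverflowHere() =>
        YearsUntilOverflow(Stopwatch.Frequency);
}
```

At 10,000,000 ticks per second this gives roughly 29,000 years, and at 1,000,000,000 ticks per second roughly 290 years, so overflow is not a practical concern in either case.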

If so, the central Stopwatch instance should be abstracted and replaceable-by-property-injection, in the same manner core Polly already abstracts SystemClock, to support unit testing.
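One way this abstraction might look is sketched below. `PolicyTimer` is a hypothetical name, not an existing Polly type. As a simplification, the sketch uses the static Stopwatch.GetTimestamp() method instead of a singleton Stopwatch instance; it also avoids per-call allocation, though it reads the underlying timer's raw value rather than counting from zero.

```csharp
using System;
using System.Diagnostics;

// Hypothetical replaceable tick source, intended to be swapped out in unit tests.
public static class PolicyTimer
{
    // Current tick count. Defaults to the process-wide high-resolution timer.
    public static Func<long> GetTimestamp = Stopwatch.GetTimestamp;

    // Resolution of the ticks returned by GetTimestamp.
    public static Func<long> TicksPerSecond = () => Stopwatch.Frequency;

    // Restores the default implementations, eg at the end of a test.
    public static void Reset()
    {
        GetTimestamp = Stopwatch.GetTimestamp;
        TicksPerSecond = () => Stopwatch.Frequency;
    }
}
```

A test could set `PolicyTimer.GetTimestamp = () => 42;` to make event tick values deterministic, then call `PolicyTimer.Reset()` afterwards.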

A Stopwatch rebases to zero on each process start and is not synchronized between processes: tick values from the stopwatch would not be comparable across different running processes using Polly, and are only good for subtraction between policy events from the same process. Layer (2b) metrics aggregation should expose only durations, and mask the source stopwatch ticks, to prevent inadvertent downstream misuse of non-correlating stopwatch ticks from multiple Polly processes.

Absolute DateTime.UtcNow at event source should likely still be emitted alongside stopwatch ticks in events, as informational for logging.

4 Considerations around `PolicyWrap`

4.1 Dashboarding in a world of custom `PolicyWrap`s

A Polly PolicyWrap is intentionally free-form (policies can be wrapped in any combination) in a way that a Hystrix command is not.

This means that there will be no such thing as a one-size-fits-all Polly dashboard. However:

For PolicyWrap: We can offer standard statistics which apply whatever the composition of the PolicyWrap, eg overall execution time.
- It is relatively trivial also for PolicyWrap to identify when it is executing the innermost or outermost policy of the wrap, allowing related statistics.
For individual Policy types: We can offer common metrics, and common dashboard visualizations of them:
- eg for retry, average number of tries needed;
- for circuit-breaker, percentage of time circuit open/closed in recent window, etc.

An additional possibility would be to develop a standard set of PolicyWrap metrics which work across a semi-standard PolicyWrap including (optionally) (up to 1 of) each main Policy type, in a specific sequence, likely: Fallback -> Cache -> Retry -> CircuitBreaker -> Bulkhead -> Timeout (this sequence corresponds to the PolicyWrap wiki). This would also offer a prototype which others could adapt from if they have bespoke PolicyWraps featuring policy types in different sequence or number.

4.2 `PolicyWrap`, subscribing and de-duplication

We may want convenience methods/properties on PolicyWrap to:

enable events for all individual Policys in a PolicyWrap
subscribe to a stream of events emitted by all Policys in a PolicyWrap

We may need to consider how to do this without creating unintended duplicate subscriptions (at whatever layer) and possible duplicate events. It would be possible to tag events with Guids and de-duplicate, but it is better to architect so that duplicates cannot arise. The issue may not arise at all, depending on architecture. Intended multiple subscription should be permitted.

5 Events to be broadcast by policy type

Will start a separate document/discussion for the events to be emitted - and thus the statistics which could be aggregated - for each policy type.
