Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Future Polly Roadmap - **community feedback sought** #90

Closed
reisenberger opened this issue Mar 7, 2016 · 60 comments
Closed

Future Polly Roadmap - **community feedback sought** #90

reisenberger opened this issue Mar 7, 2016 · 60 comments

Comments

@reisenberger
Copy link
Member

reisenberger commented Mar 7, 2016

App-vNext has published a Roadmap for Polly.

We are actively seeking community feedback on the roadmap.

Which aspects of the roadmap would be a priority for you, and which not?

Do you have views on how the features in the roadmap should be implemented?

Are there features you'd like to see, that are not on the roadmap?

What is your view on the longer-term possibility, to grow Polly towards a more general resiliency framework for .NET? (as Hystrix is for Java) – for detail see Roadmap. .

Please comment on the Roadmap under this placeholder issue.

@mfjerome
Copy link

mfjerome commented Mar 8, 2016

About your last point:

Several of our teams working on .NET projects have recommended Polly as being useful in building microservice-based systems. It encourages the fluent expression of transient exception-handling policies and the Circuit Breaker pattern, including policies such as Retry, Retry Forever and Wait and Retry. Similar libraries already exist in other languages (Hystrix for Java for example), and Polly is a welcome addition from the .NET community.

Taken from ThoughWorks Technology Radar. There is clearly a lack of good offers in the .NET ecosystem for general resiliency framework and the spotlight is already on Polly as a Hystrix(Java) alternative so why not jump on the occasion =)

@bunceg
Copy link

bunceg commented Mar 19, 2016

On the open issue #6, I appreciate it wasn't the path the team wanted to go at the time, but I really can't see any other clean way to solve the issue below...

Two instances in their own processes, each has an instance of circuit breaker. Instance A trips at 00:00 and lasts for 5 minutes.. Due to random load balancing, most requests just happen to go to instance A. However, by chance, at 04:50 instance B gets some requests and also trips. it's 5 minute trip timer now starts. However, the tripped service is back at 05:00. The instance A breaker is cleared at 05:01 but the instance B breaker is not able to be cleared until 09:50 - 4+ minutes after when it should be able to serve requests.

I take on board all the comments re: negotiation between services and what if connection to the shared state handler goes down etc. but I'm taking the pessimistic view that if a service the breaker is protecting is dead then it's more likely that the service is really dead rather than the alternative that the service is alive but one instance can't seem to talk to it.

Besides, IMO, that can be some kind of "quorum" logic - two instances, two need to vote dead etc. More complex, but this can be part of the state handler to be developed by the poor soul (i.e. me!) who desires this kind of functionality...

@reisenberger
Copy link
Member Author

Hi @bunceg, many thanks for the input! Noted that you are still looking for the best solution to shared circuit-breaker state across processes with Polly, and/or a feature to inject state into a circuit. Further views on this welcome from across the community...

@bunceg I'll respond to the specific scenario back within #6, since that feels like it might turn into a longer discussion.

@reisenberger
Copy link
Member Author

reisenberger commented Apr 19, 2016

Following the live demo today at devinteractions, there was a request to add caching into the policy pipeline. So we've added to the roadmap a proposal drawing on ICacheProvider, to allow any existing .NET cache implementation.

// Possible configuration syntax
var cachingPolicy = Policy
  .CacheProvider(ICacheProvider provider);

// Execution syntax might different slightly from the normal policy .Execute() syntax, eg:
// Possible syntax:
cachingPolicy.ExecuteOrGetExisting<TResult>(Func<TResult> func, string cacheKey);

Other .ExecuteOrGetExisting(...) overloads could also be provided, mimicking ICacheProvider .AddOrGetExisting(...) overloads, for easy comprehension.

Comments welcome on this (or any other feature in the roadmap) under this issue.

@pj-moviestarplanet
Copy link

Have you thought about implementing some kind of metrics functionality similar to what Hystrix has. It would be very useful to be able to see Circuit Breaker metrics in for example Grafana.

So the idea would be to configure for example a statsd client and use it to send metrics such as call latency, number of succeful calls, number of failed calls etc.

@joelhulen
Copy link
Member

This is an interesting suggestion. We've spoken about ways to visualize the process flow, but not necessarily telemetry information. I wonder if capturing that data would be best served as a plug-in, and not a part of the core package? My line of thinking about this is that there is usually a trade-off between capturing this data and performance. I believe the ideal way to capture the data would be via an asynchronous tool similar to Application Insights. Minimal impact and something you can add in.

Do you have an example of the metrics Hystrix captures that you would like to see? For instance, what would be a priority at first?

@reisenberger
Copy link
Member Author

reisenberger commented Apr 21, 2016

Hey @pj-moviestarplanet Great idea, added stats to the roadmap. Happy for any community feedback on an architecture for this.

One question is phasing: whether to start emitting stats/events from Polly as of now or when the fuller Polly Pipeline is in place. Phased would be great if we can see far enough ahead not to introduce breaking changes later.

Related: The Polly Pipeline proposal differs from the Hystrix approach in one significant way. While Hystrix has a fixed structure in which the various elements are composed (fallback, cache, circuit-breaker etc; hmm, no retry), the Polly Pipeline proposal is currently more flexible, allowing users to compose whichever elements of the (envisaged) Polly resilience armoury they want, in any order.

The intended advantage of this flexibility is simplicity: users select and compose only the elements they want. However, a disadvantage may be if (if) that flexibility makes emitting overall stats for calls more difficult ... it may depend whether Polly's role is to emit stats or raw(er) events (perhaps as a ReactiveExtensions Observable<T>) which something else compiles to/pushes to stats.

Keen for any community feedback on the current Pipeline proposal - strengths, weaknesses? Definitely an area of the roadmap that's still up for grabs how we architect it ...

@joelhulen joelhulen mentioned this issue Apr 30, 2016
@PaybackMan
Copy link

Please support .Net Standard Profile

@joelhulen
Copy link
Member

joelhulen commented May 24, 2016

@PaybackMan That is a great suggestion. Out of curiosity, what is the current holdup you are facing in using the PCL version of the Polly library, or is this request more out of providing greater flexibility for 3rd party libraries to be less dependent upon a specific .NET version within a given project?

@KBRETON1
Copy link

Please support async methods in .NET 4.0 by using the portable Microsoft.Bcl.Async nuget.

@reisenberger
Copy link
Member Author

@KBRETON1 Thanks for the vote for this! (async methods in .NET 4.0 using Microsoft.Bcl.Async). This is already in progress as #103 .

@shmuelie
Copy link

Really want to use this in a ASP.NET Core application, any idea how I can currently?

@mattwoberts
Copy link

@SamuelEnglard 👍

I'm also interested in this. A lot of people are coming to dotnet core having used the snappily named "The Transient Fault Handling Application Block for Windows Azure" (topaz), and since that project seems pretty dead, it would be AWESOME if polly could take it's place, and provide a dotnet core version.

@reisenberger
Copy link
Member Author

reisenberger commented Jun 25, 2016

@mattwoberts The Polly team view .NET core compatibility as essential and high prio on the roadmap. The current PCL259 codebase is entirely netstandard1.0 compatible, so no major code issues. The recent flux in tooling from Microsoft has been a barrier (we have wanted to avoid tracking a moving target and that impacting other active feature development over last few weeks). We are hoping for some stability from Microsoft in the tooling coming Monday 27 but remains to be seen ...

@SamuelEnglard has made a start at this (thanks @SamuelEnglard !), @mattwoberts see conversation and links at end of #51 . I am hoping to look at this in the coming days/weeks, @mattwoberts if you also have time to evaluate and take further with @SamuelEnglard please do!

(I am just now completing #14, which will unlock the path to many of the hystrix-like features that @SeanFarrow, I and many others are interested in on the long-term Roadmap ...)

@shmuelie
Copy link

@reisenberger should I make a proper pull request?

@adamhathcock
Copy link

I'm interested in the pipeline idea. Is there more details on a proposal that should be implemented or copied?

@reisenberger
Copy link
Member Author

Hi @adamhathcock , thanks for your interest! We are hoping to start work on the Pipeline in about a week's time (this week's focus is nuget packages supporting .NET Core). Would be great to hear any input you have on how you see this working, or the context you'd be using it in. I'm also hoping to put out an implementation proposal for comment in the next few days - @SeanFarrow and others also interested in this feature.

For now, see nesting policies in the wiki, for a syntax for nesting calls through multiple policies at present. Similar syntax also works for async.

@reisenberger
Copy link
Member Author

@adamhathcock @SeanFarrow For a brief comment on implementation thoughts so far around the policy pipeline: I see a Pipeline represented as a linked list. However, we probably also want to retain the ability to re-use a policy across more than one call, and therefore in more than one pipeline (for benefits of this see wiki). This probably means a pipeline is a linked list of pointers to policy instances, not a linked list of actual policy instances. Thoughts?

@SeanFarrow
Copy link
Contributor

SeanFarrow commented Jul 4, 2016

That makes sense, do we want to give people the ability to add steps at any point in the pipeline? If yes, does a policy have an id?

@adamhathcock
Copy link

@reisenberger @SeanFarrow What I have is an implementation that does exactly what you say: a pipeline that is basically a linked list and can be mutated for different calls. I want to use Polly with it and if you're doing a pipeline thing anyway maybe I could just wholesale remove my implementation.

I don't really like the look of the Nested Policies thing. It looks like there could be a better way to hook things together but each policy remain atomic? Just thinking out loud. Maybe there would have to be a special collection PolicyList that is basically a linked list that allows insertion anywhere and can be executed which invokes the policies in order at anytime.

@adamhathcock
Copy link

I haven't looked closely but I would also like the ability to add custom Policies into the pipeline and I don't think that functionality currently exists (appending a custom Policy). I guess that would basically mean just allowing a custom class that uses the same interface. Maybe that's too big a refactor?

If that was on the table, I would envision policies looking a lot like ASP.NET Core middleware. However the next step would be a method parameter instead of a constructor parameter. Just thinking outloud again :)

@SeanFarrow
Copy link
Contributor

SeanFarrow commented Jul 4, 2016

What custom policies do you envisage adding?

@adamhathcock
Copy link

Well, just anything. Right now, one of things that my pipeline does is handle the custom OAuth token and refreshes it if necessary. I guess I would think I could add something like that in.

I do note that a lot of the exception handling and execute methods do take custom lambdas so you could do it there. I just thought formalizing it into an official custom policy might be better.

@reisenberger
Copy link
Member Author

All thinking out loud welcome! (lots of interesting ideas guys)

Rather than respond piece-by-piece, I'm working up (and will post shortly) a full proposal statement for each of the envisaged wider resilience features.

I think it will help to consider them as a group: a lot of these proposed policies and considerations interlock (eg the need for an id/key on a policy, and/or on an individual execution, also comes into play for using a cache, for emitting events/stats, for configuring policies from a PolicyRegistry, etc...)

@reisenberger
Copy link
Member Author

@adamhathcock , re:

I don't really like the look of the Nested Policies thing. It looks like there could be a better way to
hook things together but each policy remain atomic? [...] Maybe there would have to be ...

Yes, the nested policies thing in the wiki is (for clarity) not a proposal. It's just some standard C# syntax that I put out as doco to guide people with chaining the existing policies, when App-vNext took on Polly six months ago. What you're suggesting is exactly the essence of the Pipeline proposal to provide something nicer 😄 (great to hear; all thinking out loud welcome! 😉 )

@reisenberger
Copy link
Member Author

reisenberger commented Jul 7, 2016

@SeanFarrow,

Could we not make the changes and then work out how we extract an interface at a later date?

Yes (potentially). I'm also thinking about whether some interface segregation could stage things ... this will also become clearer as the new functionality is planned and executed.

Worth pointing out though that it was a huge benefit that the founder of Polly hadn't started with an interface at 1.0; we would have been breaking them repeatedly as the library grew. To judge ongoing.

Also imperative to decide whether to tighten the number of overloads (see foot of roadmap) before publishing an interface.

@SeanFarrow See for example also Hystrix holding off interfaces in places to give themselves room to grow. Big potential interface benefits for reasons previously stated (and usual design reasons), but care also in still-evolving library?

@reisenberger
Copy link
Member Author

reisenberger commented Jul 8, 2016

@PaybackMan You asked for .Net Standard Support for Polly. Just alerting you that we're in beta on this! There are beta packages posted (targeting .NETStandard versions 1.0 for PCL; and 1.6 for .NET Core) in the threads #132 and #133 respectively . Please feel free to try them out! (and we would love feedback). (We are letting interested parties pre-trial them before posting them to nuget, though they should go out there fairly soon...)

@mfjerome
Copy link

mfjerome commented Jul 8, 2016

Hi @reisenberger. I am very happy with the recently updated roadmap. The reason I initially proposed the "Fallback" improvement and pushed for Hystrix-like features in March was because we are currently assembling our project templates/frameworks to build microservices inside large distributed, scalable apps for our finance business and we are using Polly in it.

Priority for us would be to add Timeout, Bulkheads and Fallbacks to policies, and chaining them in a pipeline. We are currently experimenting some implementation ideas proposed in this tool: https://github.com/hudl/Mjolnir .

We might have a programmer or 2 available during the summer if you come up with specific up-for-grabs. Cheers.

@reisenberger
Copy link
Member Author

reisenberger commented Jul 11, 2016

@mfjerome That’s great to hear! And perfect timing – after a big refactor (#130), we’re just about to embark on those Hystrix-like wider resilience policies. Please stay tuned as we’ll be putting up deeper docs on those features in the coming days/24hrs(?) and it would be great to have your feedback!

Extra dev power could well be useful @mfjerome, definitely opportunities. Are you able to contact me briefly off-github at software[at]reisenberger.net, we can talk a little more freely about how you/your devs could contribute? (and timings - I have some away time coming up). All contributors will be credited!

Timeout, Bulkheads and Fallbacks [... for use in ...] project templates/frameworks to build microservices

A good fit. I have experience of these in a microservices environment, part of my desire to apply to Polly.

Thank you for your support and involvement!

@reisenberger
Copy link
Member Author

reisenberger commented Jul 11, 2016

I thought it might be useful to set down some thoughts I had about differences in approach/implementation (on resilience policies) between Hystrix and Polly.

NB In each case, neither of the approaches is necessarily better/worse (I can often see advantages from one perspective, disadvantages from another), but identifying them helps stimulate discussion and think about product direction for Polly. Hystrix-derivatives in .NET also tend to have Hystrix characteristics.

(The comparison is intended to call out differences in approach/emphasis, not be a worklist, so the observations assume the intended features on the Polly roadmap will be in place.)

Hystrix Polly
has a central command-type class which brings together all the resilience features. has a fixed set of elements all of which the dev has potentially to understand and configure before using a command (slightly heavier-weight) has separate components - users can choose to use only the Policys they need
the command class is derived from to manage an execution to be put through Hystrix. Code to be executed through the command is placed in an overridden run() method or similar. a Policy is not coupled to the code it can run. A Policy can be applied to any delegate. Policies can thread-safely be reused across multiple delegates (for benefits see wiki). Policies can be passed to entirely different parts of the codebase (allows decoupling/dependency injection/easier stubbing by DI, for testing).
uses the fixed set of strategies-in-a-command always in the same order allows users to compose policies to apply to a call in any order
uses each strategy (circuit-breaker; fallback etc) once in each command allows configuring PolicyWraps which use individual strategies (eg retry; fallback) multiple times. Combined with filtering, allows for responding to different exceptions differently, eg different retry strategies by exception (retry only once a XException, but on YException 5 times); or eg different fallback strategies/stub values for different outcomes.
has configuration by setting properties on the command instance has fluent configuration during instantiation of a policy
has the ability to recalibrate command parameters after initialisation does not (yet) have the ability to recalibrate policy parameters after initialisation (policies are immutable and pretty much 'black boxes') UPDATE: January 2018, now offered via PolicyRegistry
has central system-wide defaults for configuration parameters has no/few hidden defaults (policies only do what you explicitly configure them to do)
has no retry policies has retry policies
is geared more explicitly around result-returning calls, the execution of remote requests caters natively for both result-returning and void-returning actions
has exception filtering only by exception type; 'greedy' exception-catching with opt out can filter by deeper exception properties; specify explicitly the exceptions to handle (opt in), but can greedily catch-all if desired
is not (check?) able to interpret return values as faults to handle is able to interpret return values as faults to handle

(And maybe other dimensions - if you think there are others to consider, or if a comparison seems inaccurate / could be better stated, please comment!)

(And: There is clearly a raft of things Hystrix-in-Java does beyond fault-handling, in terms of stats/metrics, dashboard for visibility etc, beyond these resilience aspects.)


To date I have tried to grow Polly (since AppvNext took stewardship) in the spirit of the original, that is: fluent configuration; lightweight; minimal impact on your code (to use) after you have configured a policy.

Any thoughts arising from the above? Things we should/shouldn't do in Polly? Should/shouldn't do like Hystrix?

@mfjerome
Copy link

a Policy is not coupled to the code it can run.

This second element in your table is definitely a key difference that should stay going forward!

@reisenberger
Copy link
Member Author

reisenberger commented Jul 11, 2016

@adamhathcock / @Amerdrix / @SeanFarrow / @mfjerome / @joelhulen / anyone interested: I have now posted some more detailed proposals for the possible operation/implementation of the planned, hystrix-like resilience features:

#80 Fallback
#136 Cache
#137 Timeout
#138 Throttle
#139 (Keys, to use with cache, potential later PolicyRegistry and events/stats emission)
#140 Policy Wrap (was: Pipeline)

I appreciate this is a lot of material to read: thanks in advance for anyone's time/interest/involvement. Most are essentially more detailed statements of an implementation already outlined in the Roadmap. The biggest change is in #140 Policy wrap (may add more comments/qs around this shortly).

Please do add comments / qs / thoughts under the relevant issue! (Or equally if you think a proposal sounds good-to-go as it is.)

@reisenberger
Copy link
Member Author

To answer the v good question @SeanFarrow raised under specific new proposed policies: All new policies would cater for all existing Execute syntaxes and sync/async variants unless otherwise stated. The syntax examples given were just examples (too many overloads to state all).

In general:

  • ExecuteAndCapture() should be supported throughout
  • sync and async should be supported throughout

... and if there are exceptions the Scope section (or detailed notes) in each policy proposal should call them out. For example:

  • Cache makes no sense for void returning 😉
  • Timeout requires new co-operative-cancellation overloads with Cancellation for sync .Execute()
  • ExecuteAndCapture() probably wants special treatment in a policy wrap: ... one would want more-inner policies in the wrap still to throw, and only the outermost call to ...Capture?)

Thanks for the great q. I'll go over them again and look for any missed cases. Anyone shout if you think any thinking steps missed on this!

@gsogol
Copy link

gsogol commented Jul 18, 2016

How about reusing and integrating with hystrix dashboard via turbine (https://github.com/Netflix/Turbine/wiki)? This is useful because the hystrix community is huge. No point in recreating dashboards if integration exists via server sent events.

@drewburlingame
Copy link

... and if there are exceptions the Scope section (or detailed notes) in each policy proposal should call them out. For example:

Cache makes no sense for void returning 😉

Ideally, we'd prefer a pass through vs throwing an exception.

The null pattern would be helpful in cases where all the methods for a service could use the same PolicyWrap. If some of the methods were void and the Cache policy threw an exception for void methods, then we'd need to create two PolicyWraps. The null pattern would allow us to use just one PolicyWrap.

If there are cases where we'd want to throw an exception, then hopefully it's a config option we can setup for the policy. i.e. ignoreVoidMethods=true;

@lakario
Copy link
Contributor

lakario commented Jan 9, 2017

Have you thought about implementing some kind of metrics functionality similar to what Hystrix has. It would be very useful to be able to see Circuit Breaker metrics in for example Grafana.

My current project would benefit greatly from being able to export more detailed state from circuit breaker and other policies. Specific information like total number of calls, time stamps for each call (start and end), as well as aggregates by CircuitState.

@tomkerkhove
Copy link

+1 for the metrics

@reisenberger
Copy link
Member Author

COMMUNITY FEEDBACK SOUGHT

Following release of Polly v5.0, the potential Polly roadmap has been updated with a few new items and clarifications.

There is clear community interest in metrics; work on this is beginning, including discussion in our slack channel. Full proposals will be posted on GitHub as they cohere.

Beyond metrics, what is your view of priorities among the other items?

What would you like to see, that is not on the roadmap?

@shmuelie
Copy link

Personally failover and dynamic configurations would be my next top items. I'd have dynamic configuration first since that can be used to create failovers

@g7ed6e
Copy link

g7ed6e commented Apr 2, 2017

CircuitBreaker and AdvancedCircuitBreaker have a locking nature due to the use of a Monitor in the TimedLock implementation. A implementation using the Interlocked class may avoid to lock the circuit state. What do you think about it ? Would it be relevant ?

@reisenberger
Copy link
Member Author

reisenberger commented Apr 8, 2017

@railarmenien re:

CircuitBreaker and AdvancedCircuitBreaker have a locking nature due to the use of a Monitor in the
TimedLock implementation. A implementation using the Interlocked class may avoid to lock the circuit
state. What do you think about it ? Would it be relevant ?

Thanks @railarmenien for this great point! The circuit-breaker design App-vNext inherited when taking over the project, as you say, uses locking. I started investigating moving the design to Interlocked in January when we resolved #216 with lock-free techniques. My broad-brush assessment was:

(a) Metrics calculation and general operation in Closed state: is the key case to optimise. It makes sense to absolutely minimise any latency through contention for concurrent, multi-threaded requests on a healthy system. As most metrics updates consist of increments and assignments/queue-manipulation (rolling the metrics buckets over), these can likely be moved to Interlocked and/or thread-safe collections. Same for HalfOpen state if #239 is pursued.

(b) Transitions of circuit-state (where we have locks guarding modifications to several variables): seems not/less worth optimising away the locks. If your circuit is breaking, the problems you are likely suffering (eg network latency; faulting downstream system; breaking the circuit for presumably several seconds) are all orders of magnitude greater time-wise than the differences between locking techniques. No value to increase code complexity/risks here?

(c) Read locks: Some locks effectively act only as read locks. Some I think could be reasoned away as unnecessary. Others (eg if read-then-modify) cannot.

There's also extensive discussion here on pros/cons including from the eminent Reed Copsey. The broad conclusion is 'measure don't assume' (so we should do that). My instinct, reviewing the code, is that we should pursue (a), and would make gains, but we ought to take an evidence-based approach and get some benchmarking around this (as @ankitbko has suggested elsewhere) .

What do you / does anyone think?

I can take this forward but it requires a dedicated chunk of time. I'd be very happy to hear further comment/contributions from anyone in the community.

@ankitbko
Copy link
Contributor

ankitbko commented Apr 8, 2017

@reisenberger
There may be another approach but I am not sure about this. Hystrix seems to do it little differently (need to check in detail). It extensively uses Reactive for both circuit breaking and metrics implementation. What I saw was that events are written into a thread-local Rx stream. From there it is then aggregated into command-specific stream. The entire circuit breaker logic(bucketing and aggregating) is done through Rx streams (by using window). Since Rx streams are thread safe, this may allow us to circumvent manual locking.

Benchmarking first should allow us to compare the results and also check if introducing addional complexity makes sense or not.

PS: I am not sure about this. I was just glancing over Hystrix code quite some time ago. I am not also very familiar with Rx.

@reisenberger
Copy link
Member Author

Added the idea of a RateLimitPolicy (serial throttle) to the Roadmap.

@MihaMarkic
Copy link

Would be nice to have a .NET Standard 2.0 target in NuGet package as well. The reason is that NuGet would import a ton of referenced packages otherwise.

@reisenberger
Copy link
Member Author

@MihaMarkic @devapalanisamy Polly v6.0.0, now published, includes a .NET Standard 2.0 target.

@reisenberger
Copy link
Member Author

Closing - most of the discussion here is now historic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests