Add section about guarantees to documentation #8

mkaszubowski · 2017-12-13T13:13:56Z

Hello,

While I really like how the library looks, I think it would be nice to have a more precise description of the guarantees that it provides.

For example:

Sage guarantees that either all the transactions in a saga are successfully completed or compensating transactions are run to amend a partial execution.

I think we should all be really careful when using words like "guarantees" in the context of distributed systems, because it's really easy to provide a false sense of confidence.

I think it would be really good to consider the following example situations:

What if a process executing a pipeline fails?
What if a node is restarted during the pipeline?
Can the pipeline be retried by a different node in a cluster?
What if an external API fails and it's impossible to revert a step?
What if a request to the external API actually succeeds, but the reply is lost or there's a timeout?

Note that this isn't a criticism of the library. I know that these are all really hard problems and I don't expect the library to solve them (some of them cannot be solved anyway). I just think it's important to provide a more formal section on guarantees so that clients are aware of the potential problems and risks.

Of course, I'd be happy to help with the description (if I'm competent enough) :)

Thanks!

AndrewDryga · 2017-12-13T13:43:06Z

Hello there,

I think this is an awesome idea. I'll write a draft, and if we agree that it's clear enough, I'll add it to the readme 👍.

AndrewDryga · 2017-12-13T15:08:42Z

I think it would be really good to consider the following example situations:

Q/A on Execution Guarantees

What if my transaction has bugs or other errors?
Transactions are wrapped in a try..catch block and will tolerate any raised exception, throw, or exit. After the compensations have run, the error will be reraised.
What if my compensation has bugs or other errors?
By default, compensations would be aborted. But you can write an adapter that will handle those cases. For more information, see the Critical Error Handling section.
What if the process that executes Sage or the whole node fails?
Right now Sage doesn't provide a way to tolerate failures of the executing process. (However, there is an RFC that aims to address this.)
What if an external API fails and it's impossible to revert a step?
This is a matter for manual intervention, but you will be notified as loudly as possible. You can make it even more error-prone by submitting compensation errors and state to some external notification system (e.g. a Slack channel or Rollbar).
What if a request to the external API actually succeeds, but the reply is lost or there's a timeout?
To cover these cases it's a good idea to retry compensations. To do so, however, they should be as idempotent as possible. For example, when deleting a resource you should accept both 2XX and 404 response codes: if the resource doesn't exist, you can assume that it has already been deleted.
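To make the last answer concrete, here is a minimal, language-agnostic sketch in Python of a retried delete compensation that accepts 2XX and 404 as "done". It is not Sage's API: `TransientError`, `delete_is_done`, `compensate_delete` and the `send_delete` callback are hypothetical names used only for illustration, and the retry policy (three attempts, fixed delay) is an assumption.

```python
import time


class TransientError(Exception):
    """Hypothetical error standing in for a timeout or a lost reply."""


def delete_is_done(status_code):
    # 2XX: the resource was deleted by this request.
    # 404: the resource is already gone, e.g. an earlier attempt
    # succeeded but its reply never reached us.
    return 200 <= status_code < 300 or status_code == 404


def compensate_delete(send_delete, max_attempts=3, delay_seconds=0.0):
    """Call send_delete() until the delete is confirmed or attempts run out.

    send_delete is a hypothetical callback that performs the HTTP DELETE
    and returns the response status code, or raises TransientError.
    Returns True if the delete is confirmed, False otherwise.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            status = send_delete()
        except TransientError:
            status = None  # outcome unknown, so retrying is the only option
        if status is not None and delete_is_done(status):
            return True
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return False
```

A `False` result here still means the outcome is unknown or failed, so it should be surfaced to a human or a monitoring system rather than ignored.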

@mkaszubowski what do you think?

mkaszubowski · 2017-12-13T15:41:20Z

I'd replace 5th point with something like this:

Can I be absolutely sure that everything went well?

Unfortunately, you cannot. As with any other distributed system, messages can be lost. This means that you cannot be sure whether the compensating transaction was successful, even if it is retried.

It is also possible that the reply from the external API is lost even though the request actually succeeded. In such cases, Sage might retry the request/compensation, which might result in an unexpected state. It is best to use idempotent messages when possible and to have proper monitoring tools for critical transactions.
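One common way to get idempotent messages is to attach a client-generated key to each logical operation and reuse it on every retry. The sketch below is only an illustration in Python: `FakePaymentAPI` is a hypothetical in-memory stand-in for an external service, and it assumes the service deduplicates by key, which not every real API does.

```python
import uuid


class FakePaymentAPI:
    """Hypothetical in-memory service that deduplicates by idempotency key."""

    def __init__(self):
        self.charges = []  # every charge that was actually applied
        self._seen = {}    # idempotency key -> charge id already returned

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._seen:
            # Same logical request seen again: return the stored result
            # instead of charging a second time.
            return self._seen[idempotency_key]
        self.charges.append(amount)
        charge_id = len(self.charges)
        self._seen[idempotency_key] = charge_id
        return charge_id


def charge_with_retry(api, amount, lose_first_reply=True):
    # One key per logical operation, reused for every retry of it.
    key = str(uuid.uuid4())
    first_reply = api.charge(key, amount)
    if lose_first_reply:
        # Simulate the reply being lost: the caller cannot tell whether
        # the charge happened, so it sends the same request again.
        return api.charge(key, amount)
    return first_reply
```

With a key like this, the retry after a lost reply returns the original charge instead of creating a second one. Without it, the caller has no safe way to retry.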

Also, maybe add some introduction before the questions. Something like:

While Sage will do its best to compensate for failures in transactions and leave the system in a predictable state, there are some edge cases where this is not possible.

In the 4th point you used "error-prone", which means that something is likely to fail, and I think you meant the opposite :) I'd replace the answer to this point with something like:

What if an external API fails and it's impossible to revert a step?
In such cases, the process handling the pipeline will crash and an exception will be thrown. Make sure that you have a way of reacting to such cases (in some cases it might be acceptable to ignore the error, while others might require manual intervention).
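The behaviour described in this thread (run compensations, then reraise; abort when a compensation itself fails) can be sketched roughly as follows. This is a simplified Python illustration, not Sage's implementation: `run_saga` and `CompensationError` are hypothetical names, and the sketch assumes that only steps which completed are compensated.

```python
class CompensationError(Exception):
    """Hypothetical error raised when a compensation itself fails."""

    def __init__(self, step_name, cause):
        super().__init__(f"compensation for {step_name!r} failed: {cause!r}")
        self.step_name = step_name
        self.cause = cause


def run_saga(steps):
    """Run (name, transaction, compensation) tuples in order.

    If a transaction raises, compensations of the completed steps run in
    reverse order and the original error is re-raised. If a compensation
    raises, the remaining ones are aborted and CompensationError is raised.
    """
    completed = []
    try:
        for name, transaction, compensation in steps:
            transaction()
            completed.append((name, compensation))
    except Exception:
        for name, compensation in reversed(completed):
            try:
                compensation()
            except Exception as cause:
                # The step could not be reverted: stop and surface it loudly.
                raise CompensationError(name, cause) from cause
        raise  # all compensations ran, so re-raise the original error
```

A caller would typically catch `CompensationError` separately from ordinary failures, since it marks the cases that may need manual intervention.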

mkaszubowski · 2017-12-13T15:44:11Z

One more thing: I'd change the sentence from the beginning of the readme:

Sage guarantees that either all the transactions in a saga are successfully completed or compensating transactions are run to amend a partial execution.

Since it cannot guarantee this, I propose something like:

Sage tries to ensure that either all the transactions in a saga are successfully completed or compensating transactions are run to amend a partial execution.

AndrewDryga mentioned this issue Dec 13, 2017

Added section about guarantees #10

Merged

AndrewDryga closed this as completed in #10 Dec 13, 2017
