Add section about guarantees to documentation #8

Closed
mkaszubowski opened this issue Dec 13, 2017 · 4 comments

Comments

@mkaszubowski

Hello,

While I really like how the library looks, I think it would be nice to have a more precise description of the guarantees that it provides.

For example:

Sage guarantees that either all the transactions in a saga are successfully completed or compensating transactions are run to amend a partial execution.

I think we should all be really careful when using words like "guarantees" in the context of distributed systems, because it's really easy to give a false sense of confidence.

I think it would be really good to consider the following example situations:

  1. What if a process executing a pipeline fails?
  2. What if a node is restarted during the pipeline?
  3. Can the pipeline be retried by a different node in a cluster?
  4. What if external API fails and it's impossible to revert a step?
  5. What if a request to the external API actually succeeds, but the reply is lost or there's a timeout?

Note that this isn't a criticism of the library. I know that these are all really hard problems and I don't expect the library to solve them (some of them cannot be solved anyway). I just think it's important to provide a more formal section on guarantees so that users are aware of the potential problems and risks.

Of course, I'd be happy to help with the description (if I'm competent enough) :)

Thanks!

@AndrewDryga
Member

Hello there,

I think this is an awesome idea. I'll write a draft, and if we agree that it's clear enough, I'll add it to the readme 👍.

@AndrewDryga
Member

I think it would be really good to consider following example situations:

Q/A on Execution Guarantees

  1. What if my transaction has bugs or other errors?
    Transactions are wrapped in a try..catch block, so any raised exception, throw or exit is tolerated. Once the compensations have run, the error is reraised.

  2. What if my compensation has bugs or other errors?
    By default, compensations are aborted. But you can write an adapter that handles those cases. For more information, see the Critical Error Handling section.

  3. What if the process that executes Sage or whole node fails?
    Right now Sage doesn't provide a way to tolerate failures of the executing process. (However, there is an RFC that aims to address this.)

  4. What if external API fails and it's impossible to revert a step?
    This requires manual intervention, but you will be notified as loudly as possible. You can make it even more error-prone by submitting compensation errors and state to an external notification system (e.g. a Slack channel or Rollbar).

  5. What if a request to the external API actually succeeds, but the reply is lost or there's a timeout?
    To cover these cases it's a good idea to retry compensations. To do so, however, they should be as idempotent as possible. E.g. when deleting a resource you should accept both 2XX and 404 response codes: if the resource doesn't exist, you can assume that it's already deleted.
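
To make the last answer concrete, here is a minimal Python sketch of a retried, idempotent delete compensation. It is not Sage's API: `compensate_delete`, `TransientError` and `FakeApi` are hypothetical names, and the fake API loses its first reply on purpose so that the retry lands on a 404.

```python
import time


class TransientError(Exception):
    """Raised when a call times out or the reply is lost (hypothetical)."""


def compensate_delete(delete_fn, resource_id, attempts=3, delay=0.0):
    """Retry a delete until the resource is known to be gone.

    delete_fn is a hypothetical callable returning an HTTP status code.
    2xx means deleted now, 404 means already gone; both count as success.
    """
    for attempt in range(1, attempts + 1):
        try:
            status = delete_fn(resource_id)
        except TransientError:
            status = None  # outcome unknown, so we can only try again
        if status is not None and (200 <= status < 300 or status == 404):
            return True
        if attempt < attempts:
            time.sleep(delay)
    return False


class FakeApi:
    """In-memory stand-in for an external API whose first reply is lost."""

    def __init__(self):
        self.resources = {"booking-1"}
        self.calls = 0

    def delete(self, resource_id):
        self.calls += 1
        if resource_id not in self.resources:
            return 404
        self.resources.discard(resource_id)
        if self.calls == 1:
            # The delete was applied, but the caller never sees the reply.
            raise TransientError("reply lost after the delete was applied")
        return 204


api = FakeApi()
ok = compensate_delete(api.delete, "booking-1")
```

If every attempt fails, the function returns `False` and the real outcome is still unknown, so the caller has to escalate rather than assume the resource is gone.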

@mkaszubowski what do you think?

@mkaszubowski
Author

I'd replace the 5th point with something like this:

  1. Can I be absolutely sure that everything went well?

Unfortunately, you cannot. As with any other distributed system, messages can be lost. This means that you cannot be sure whether the compensating transaction was successful, even if it is retried.

It is also possible that the reply from the external API is lost even though the request actually succeeded. In such cases, Sage might retry the request/compensation, which might result in an unexpected state. It is best to use idempotent messages when possible and to have proper monitoring tools for critical transactions.
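
One common way to make a retried request idempotent is an idempotency key that the external API deduplicates on. The Python sketch below assumes the API supports such a key. `FakePaymentApi`, `LostReply` and `charge_with_retry` are hypothetical names for illustration and are not part of Sage.

```python
import uuid


class LostReply(Exception):
    """The request may have been applied, but no reply arrived (hypothetical)."""


class FakePaymentApi:
    """In-memory stand-in for an API that deduplicates by idempotency key."""

    def __init__(self):
        self.charges = {}  # idempotency key -> amount
        self.drop_next_reply = True

    def charge(self, key, amount):
        # Apply the charge only the first time this key is seen.
        self.charges.setdefault(key, amount)
        if self.drop_next_reply:
            self.drop_next_reply = False
            raise LostReply("timeout while waiting for the response")
        return {"key": key, "amount": self.charges[key]}


def charge_with_retry(api, amount, attempts=3):
    """Use one key per logical operation so retries cannot double-charge."""
    key = str(uuid.uuid4())
    last_error = None
    for _ in range(attempts):
        try:
            return api.charge(key, amount)
        except LostReply as error:
            last_error = error
    # Every attempt lost its reply: the real outcome is still unknown.
    raise last_error


api = FakePaymentApi()
receipt = charge_with_retry(api, 100)
```

The first call is applied but its reply is dropped. The retry reuses the same key, so the fake API records a single charge. Without server-side deduplication, the same retry could charge twice, which is why monitoring is still needed.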

Also, maybe add some introduction before the questions. Something like:

While Sage will do its best to compensate for failures in transactions and leave the system in a predictable state, there are some edge cases where this is not possible.

In the 4th point you used "error-prone", which means that something is likely to fail, and I think you meant the opposite :) I'd replace the answer to this point with something like:

  1. What if external API fails and it's impossible to revert a step?
    In such cases, the process handling the pipeline will crash and an exception will be raised. Make sure that you have a way of reacting to such cases (in some cases it might be acceptable to ignore the error, while others might require manual intervention).
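
As a rough illustration of "having a way of reacting", the Python sketch below runs steps in order, compensates completed steps in reverse when one fails, and reports a failed compensation through a `notify` hook before raising. It is a simplified model and does not reproduce Sage's behaviour or API: `run_saga`, `CompensationFailed` and the step names are all hypothetical.

```python
class CompensationFailed(Exception):
    """Raised when a rollback step itself fails and needs human attention."""


def run_saga(steps, notify):
    """Run (name, transaction, compensation) steps in order.

    On a transaction failure, run the compensations of completed steps in
    reverse order and re-raise the original error. If a compensation fails,
    call notify and raise CompensationFailed instead of continuing silently.
    """
    completed = []
    for name, transaction, compensation in steps:
        try:
            transaction()
        except Exception as error:
            for done_name, done_compensation in reversed(completed):
                try:
                    done_compensation()
                except Exception as comp_error:
                    notify(done_name, comp_error)
                    raise CompensationFailed(done_name) from comp_error
            raise error
        completed.append((name, compensation))
    return [name for name, _ in completed]


def fail(message):
    raise RuntimeError(message)


alerts = []
steps = [
    # The hotel booking succeeds, but its compensation cannot reach the API.
    ("book_hotel", lambda: None, lambda: fail("hotel API is down")),
    # The flight booking fails, which triggers the rollback.
    ("book_flight", lambda: fail("no seats left"), lambda: None),
]

try:
    run_saga(steps, notify=lambda name, error: alerts.append((name, str(error))))
    outcome = "completed"
except CompensationFailed as error:
    outcome = f"manual intervention needed for {error}"
```

In a real system the `notify` hook would forward the failed step and its state to whatever alerting channel the team watches, so the partial rollback does not go unnoticed.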

@mkaszubowski
Author

One more thing: I'd change the sentence from the beginning of the readme:

Sage guarantees that either all the transactions in a saga are successfully completed or compensating transactions are run to amend a partial execution.

Since it cannot guarantee this, I propose something like:

Sage tries to ensure that either all the transactions in a saga are successfully completed or compensating transactions are run to amend a partial execution.
