Idempotency and Lambda timeouts #1038

Closed
n2N8Z opened this issue Oct 5, 2021 · 22 comments
Labels: bug (Something isn't working), p2, researching

Comments

@n2N8Z (Contributor) commented Oct 5, 2021

Is your feature request related to a problem? Please describe.
Currently, if a Lambda times out, the idempotency record is left in the INPROGRESS status and subsequent Lambda retries fail with IdempotencyAlreadyInProgressError.

Describe the solution you'd like
Add an additional field to the persistence record, e.g. function_timeout, and store in it the timestamp at which the Lambda invocation will time out: now().timestamp() + int(context.get_remaining_time_in_millis() / 1000)
Expand the DDB condition with: OR function_timeout < :now
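
A rough sketch of what that conditional put could look like with boto3 (table, attribute, and key names here are illustrative, not the utility's actual schema):

import time

import boto3

dynamodb = boto3.client("dynamodb")


def put_in_progress_record(idempotency_key: str, context) -> None:
    """Claim the idempotency key, storing the absolute time at which this
    invocation will be killed by Lambda (all names are illustrative)."""
    now = int(time.time())
    function_timeout = now + int(context.get_remaining_time_in_millis() / 1000)

    dynamodb.put_item(
        TableName="IdempotencyTable",  # illustrative table name
        Item={
            "id": {"S": idempotency_key},
            "status": {"S": "INPROGRESS"},
            "expiration": {"N": str(now + 3600)},  # regular record expiry
            "function_timeout": {"N": str(function_timeout)},
        },
        # Allow the put if the key is new, the record has expired, OR the previous
        # invocation's hard timeout has passed (so it cannot still be running).
        ConditionExpression=(
            "attribute_not_exists(id) OR expiration < :now OR function_timeout < :now"
        ),
        ExpressionAttributeValues={":now": {"N": str(now)}},
    )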

Describe alternatives you've considered
Create a signal handler in the Lambda that raises an exception just before the Lambda is about to time out.
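
A minimal sketch of that alternative using SIGALRM (the handler name, buffer, and do_work are illustrative placeholders; note the reliability caveats discussed later in this thread):

import signal


class NearTimeout(Exception):
    """Raised shortly before Lambda would terminate the invocation."""


def _raise_near_timeout(signum, frame):
    raise NearTimeout("About to hit the Lambda timeout, failing fast so cleanup can run")


def do_work(event):
    ...  # placeholder for the real business logic


def handler(event, context):
    # Arm an alarm a couple of seconds before the real timeout (buffer is illustrative).
    remaining_seconds = int(context.get_remaining_time_in_millis() / 1000)
    signal.signal(signal.SIGALRM, _raise_near_timeout)
    signal.alarm(max(remaining_seconds - 2, 1))
    try:
        return do_work(event)
    finally:
        signal.alarm(0)  # disarm once we finish (or fail) normally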

Additional context
N/A

@heitorlessa (Contributor):

@cakepietoast didn't we have that covered in the initial implementation?

Unsure if it's a regression or if we accidentally missed that

@michaelbrewer (Contributor):

@heitorlessa could this be referring to cases where the Lambda execution times out before updating the status from INPROGRESS to COMPLETED?

One recommendation would be to increase the Lambda timeout to be much longer (even if the API Gateway request times out within 25 seconds, the Lambda can run for much longer).

Otherwise, the idempotent handler could check the remaining time and clean up before exiting?
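
A rough sketch of that idea, assuming the work can be split into checkpoints and the handler has direct access to the idempotency table and key - in practice the utility manages these internally, and compute_idempotency_key / steps_of_work below are hypothetical helpers:

import boto3

dynamodb = boto3.client("dynamodb")
TABLE_NAME = "IdempotencyTable"  # illustrative table name
SAFETY_BUFFER_MS = 2_000         # illustrative buffer before the hard timeout


def compute_idempotency_key(event) -> str:
    ...  # hypothetical: however the record key is derived for this table


def steps_of_work(event):
    ...  # hypothetical: the business logic broken into resumable checkpoints


def handler(event, context):
    key = compute_idempotency_key(event)
    for step in steps_of_work(event):
        if context.get_remaining_time_in_millis() < SAFETY_BUFFER_MS:
            # Too close to the timeout: remove our INPROGRESS record so the
            # retry isn't rejected with IdempotencyAlreadyInProgressError,
            # then fail fast instead of being killed mid-step.
            dynamodb.delete_item(TableName=TABLE_NAME, Key={"id": {"S": key}})
            raise TimeoutError("Bailing out before the Lambda timeout")
        step()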

@heitorlessa (Contributor) commented Oct 7, 2021 via email

@michaelbrewer (Contributor):

Yes, a TTL on INPROGRESS records would allow for this to clean itself up. However, I would still suggest extending the Lambda timeout to be as long as possible.

@n2N8Z (Contributor, Author) commented Oct 11, 2021

When the Lambda fails, the first retry occurs 1 minute later, and the second retry 2 minutes after that.
So if the Lambda has failed due to a timeout, the idempotency item needs to be cleaned up within 1 minute, otherwise the retry will fail. Using TTL to have an item reliably deleted in under 1 minute does not sound like it will work - "TTL typically deletes expired items within 48 hours of expiration."
If the timeout timestamp is stored in the idempotency item, it can reliably be checked on the next run.

@heitorlessa (Contributor):

It depends on the event source/invocation type. But what I meant by a shorter INPROGRESS expiry was both a TTL and a timestamp condition, since TTL alone - as you pointed out - wouldn't be sufficient or reliable.

@to-mc (Contributor) commented Oct 13, 2021

> When the Lambda fails, the first retry occurs 1 minute later, and the second retry 2 minutes after that. So if the Lambda has failed due to a timeout, the idempotency item needs to be cleaned up within 1 minute, otherwise the retry will fail.

To take your example - just because the Lambda has timed out doesn't mean there were no side effects. The idempotency utility is designed to safely wrap code that should not be run more than once.

The issue here is that there's an assumption being made that if Lambda times out, it is safe to retry. Consider an example where a Lambda function starts executing, then code runs that causes a side effect (like charging a payment), then the function times out before completion. If we delete the INPROGRESS record and allow a retry, the payment will be charged again. Put another way, there is no way for the idempotency utility to know whether a retry can safely be made or not - that is domain/implementation specific. We can't change the default behaviour there, as customers are likely relying on it.

> Using TTL to have an item reliably deleted in under 1 minute does not sound like it will work - "TTL typically deletes expired items within 48 hours of expiration." If the timeout timestamp is stored in the idempotency item, it can reliably be checked on the next run.

As Heitor mentions, this isn't a problem - even if the item still persists in DynamoDB, the idempotency utility checks the expiry timestamp to see if the record has expired or not.

Taking the above into account, the only solution I can think of is to add an option to allow for setting a shorter timeout on INPROGRESS records so they can be expired on a different schedule to successful requests. We can also add a section to clarify the behavior (what happens when a Lambda function times out?).

@n2N8Z (Contributor, Author) commented Oct 14, 2021 via email

@to-mc (Contributor) commented Oct 15, 2021

One small clarification, even though I don't think it's the most relevant piece here: Lambda doesn't really have a default for retries - it depends on how it is invoked. For example, in the case of a synchronous execution there are no retries, and the client needs to decide how to handle them.

> agreed, but currently you are changing the AWS default.

I don't see it that way. We're adding a utility which makes it simple(r) to implement business logic which causes retries to fail if we can't guarantee the code didn't already run. You could also think of the utility as protecting functionality from retries - regardless of where they come from.

> setting it to the lambda's actual timeout would be about the same as what I suggested. Anything else will potentially interfere with the default retries.

"Interfering with retries" is in many ways exactly what the idempotency utility is supposed to do. The utility wraps a piece of your code that you consider unsafe to execute more than once (with the same arguments, within a certain time period). After a Lambda function times out, it may have been executed completely, partially, or not at all. If we execute the function again without knowing more than that, its no longer idempotent:

"The property of idempotency means that an operation does not cause additional side effects if it is called more than once with the same input parameters."

If you want retries to re-trigger your function by default after a timeout - even if you aren't sure how much of your code has already run - my gut feeling is that you probably shouldn't be using the idempotency utility. That said, I'm always keen to learn about new use cases we might not have thought of, so if you're able to share more details about what you want to achieve, I'd be happy to dive deeper. If you prefer Slack, please drop by the #lambda-powertools channel on the AWS Developers Slack (details are in this project's README).

@n2N8Z (Contributor, Author) commented Oct 18, 2021

> Lambda doesn't really have a default for retries - it depends on how it is invoked. For example, in the case of a synchronous execution there are no retries, and the client needs to decide how to handle them.

True, there are no retries for synchronous execution, but for event-based execution there are default retries that occur, as I've mentioned previously.

> I don't see it that way. We're adding a utility which makes it simple(r) to implement business logic which causes retries to fail if we can't guarantee the code didn't already run. You could also think of the utility as protecting functionality from retries - regardless of where they come from.

The utility is treating explicit errors from the Lambda differently from timeout errors, whereas AWS performs retries regardless of whether it was an application error or a timeout.

> "Interfering with retries" is in many ways exactly what the idempotency utility is supposed to do.

I guess that's largely true, but I want the idempotency utility to only stop reprocessing of successfully processed events. If the Lambda fails or times out, I want the retries to occur.

@to-mc (Contributor) commented Oct 18, 2021

> I guess that's largely true, but I want the idempotency utility to only stop reprocessing of successfully processed events. If the Lambda fails or times out, I want the retries to occur.

Understood. I'll try to get a PR in this week to add a new, independently configurable "inprogress_expires_after_seconds" arg, which you could then set to the same value as your Lambda timeout. Something like this:

persistence_store = DynamoDBPersistenceLayer(
    table_name="SomeDynamoDBTable",
    expires_after_seconds=3600,  # successful idempotent records will expire after 60 minutes
    inprogress_expires_after_seconds=30  # in progress idempotent records will expire after 30 seconds 
)


@idempotent(persistence_store=persistence_store)
def handler(event, context):
    ...

Can you confirm that would solve for your use case?

@n2N8Z (Contributor, Author) commented Oct 18, 2021

In principle yes, but the details matter:
If it relies on DynamoDB TTL expiry to delete the idempotency record, then it will probably not work.
I think the actual timestamp at which the Lambda invocation will time out needs to be stored, rather than the Lambda duration: "now().timestamp() + int(context.get_remaining_time_in_millis() / 1000)". The idempotency record MUST be deleted, or be verifiable as invalid, when the retry invocation occurs 1 minute and 2 minutes after the failures.

@heitorlessa heitorlessa transferred this issue from aws-powertools/powertools-lambda-python Nov 13, 2021
@heitorlessa heitorlessa transferred this issue from aws-powertools/powertools-lambda Feb 22, 2022
@heitorlessa (Contributor):

@n2N8Z and everyone - We have recently found a better way to solve this using an internal no-op extension.

POC below - this allows us to receive a SIGTERM in a background thread to do any cleanup, like changing the status of the idempotency data record -- we let it take its course if we don't receive a SIGTERM signal.

https://gist.github.com/heitorlessa/4aad06c39a1d520ff8c42adc72b0bcd5
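
For context, a minimal sketch of the pattern (not the exact POC; the extension name and the cleanup helper are illustrative): register a no-op internal extension during init so the runtime process receives SIGTERM on shutdown, then run cleanup from the signal handler:

import json
import os
import signal
import sys
import urllib.request


def _register_noop_internal_extension() -> None:
    """Register an internal extension that subscribes to no events, so the
    runtime process gets a SIGTERM before the sandbox is shut down."""
    runtime_api = os.environ["AWS_LAMBDA_RUNTIME_API"]
    request = urllib.request.Request(
        f"http://{runtime_api}/2020-01-01/extension/register",
        data=json.dumps({"events": []}).encode(),
        headers={
            "Lambda-Extension-Name": "noop-sigterm",  # illustrative extension name
            "Content-Type": "application/json",
        },
    )
    urllib.request.urlopen(request)


def _cleanup_idempotency_record() -> None:
    ...  # hypothetical: delete or expire the INPROGRESS record


def _on_sigterm(signum, frame):
    # Small cleanup window before the sandbox is recycled.
    _cleanup_idempotency_record()
    sys.exit(0)


_register_noop_internal_extension()  # runs once, during function init
signal.signal(signal.SIGTERM, _on_sigterm)


def handler(event, context):
    ...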

@n2N8Z (Contributor, Author) commented Mar 24, 2022

Sorry, I kept meaning to get back to this and forgetting.
What is sending the SIGTERM? Is it the actual Lambda scheduler?
I would think it would be a very bad idea to intercept the actual timeout mechanism, thereby telling AWS: "No, we are not going to terminate now as requested; instead we are going to do some more work, and we'll finish when we're good and ready."
The other problem with bringing signal handlers into this is that you now rely on every module you use not to interfere with your signal handler. I unfortunately ran into this because I had a long-running Lambda where I was scheduling a timed signal to do a quick bit of cleanup before the real timeout occurred, but my Python code was not receiving the SIGALRM while the binary MySQL module was executing.

Why is this proposal better than adding an additional field to the persistence record, e.g. function_timeout, storing in it the timestamp at which the Lambda will time out: now().timestamp() + int(context.get_remaining_time_in_millis() / 1000), and expanding the DDB condition with: OR function_timeout < :now?

@heitorlessa (Contributor):

Yes, it is the Lambda scheduler, which orchestrates sandboxes that are recycled when a timeout happens. A SIGTERM is only sent if an internal or external extension is registered -- the former is what the POC above does using a no-op internal extension. The Lambda scheduler guarantees 500ms for this logic to run.

SIGALRM, as you discovered, is not guaranteed. For cleanup activities whether a timeout happens or not (clean exit, idle for X time), an external Lambda extension is more appropriate, since you not only have 2s to handle it, it also doesn't interfere with the request path (e.g., early return is possible while cleanup happens in the background).

https://docs.aws.amazon.com/lambda/latest/dg/runtimes-extensions-api.html

Hopefully that explains it.

@cakepietoast had thoughts about both solutions not being good enough

@to-mc (Contributor) commented Mar 25, 2022

Fundamentally, I'm still of the mind that deleting INPROGRESS records violates idempotency. When "stuck" INPROGRESS records occur, it is because there's no way for the utility to know whether the Lambda failed before, during, or after the unit of work it's being asked to perform. As such, in the rare cases where this occurs, you would need to evaluate/investigate to be able to make the correct decision on how to proceed.

With that said, the discussion on this thread demonstrates that we can improve this utility to apply to a wider scope of problems, where strict idempotency guarantees are not required. Of the two possible solutions we have, I'm in favour of the initial suggestion to use a separate timeout counter for INPROGRESS records. It is simpler, requires less work for users to implement, and is less likely to cause issues similar to the one @n2N8Z mentions in the last comment.

I haven't had time to work on this - and am not likely to have time for it in the near future. If anyone would like to submit a PR for it at a time when the repository is accepting feature requests, the team would be happy to review.

@heitorlessa (Contributor):

Sounds like we have a winner implementation then - use a timeout counter property and an OR condition, as originally suggested by @n2N8Z.

Two clarifications, though:

  • The timeout extension wouldn't require any work from the customer; it'd happen transparently, with no code change.
  • SIGTERM is guaranteed by the Lambda scheduler for internal/external extensions upon function timeout. SIGALRM, however, isn't, and we wouldn't use it as a solution, for accuracy and reliability reasons.

Given this is borderline a bug, I'd be happy to review and contribute to any PR. If no one has the bandwidth, we can revisit in the August-onwards timeframe.

Thanks a lot everyone

michaelbrewer added a commit to gyft/aws-lambda-powertools-python that referenced this issue May 1, 2022
Changes:
- Initial draft on an option to clean up on function timeout

close aws-powertools#1038
@michaelbrewer (Contributor):

I am working on an initial draft PR based on what @n2N8Z has suggested (#1198). It is still untested and is more to see the impact of the changes.

This does not include the fancy extension method @heitorlessa spoke about.

If I have time to complete the test coverage and do some end-to-end testing, then maybe @cakepietoast or @heitorlessa can look at it further.

Otherwise the docs will have to be updated with the list of caveats. :)

@michaelbrewer (Contributor) commented May 4, 2022

Deployed and tested a version of my PR #1198 and it works.

template.yaml with Powertools based on #1198:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Testing idempotency timeouts
Globals:
  Function:
    Timeout: 2

Resources:
  IdempotencyTable:
    Type: AWS::DynamoDB::Table
    Properties:
      AttributeDefinitions:
        -   AttributeName: id
            AttributeType: S
      KeySchema:
        -   AttributeName: id
            KeyType: HASH
      TimeToLiveSpecification:
        AttributeName: expiration
        Enabled: true
      BillingMode: PAY_PER_REQUEST

  IdempotencyFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: IdempotencyFunction
      CodeUri: src/
      Handler: app.handler
      Runtime: python3.9
      MemorySize: 512
      Layers: 
        - !Ref PowertoolsLayer
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref IdempotencyTable
      Environment:
        Variables:
          LOG_LEVEL: INFO
          POWERTOOLS_SERVICE_NAME: example
          DYNAMODB_TABLE: !Ref IdempotencyTable

  PowertoolsLayer:
    Type: AWS::Serverless::LayerVersion
    Properties:
      LayerName: MyLambdaLayer
      ContentUri: layer/
      CompatibleRuntimes:
        - python3.9

app.py handler with a sleep longer than the function timeout:

import time
from os import environ
from uuid import uuid4

from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    IdempotencyConfig,
    idempotent,
)

DYNAMODB_TABLE = environ["DYNAMODB_TABLE"]
config = IdempotencyConfig(function_timeout_clean_up=True)
persistence_layer = DynamoDBPersistenceLayer(table_name=DYNAMODB_TABLE)


@idempotent(persistence_store=persistence_layer, config=config)
def handler(event, context):
    time.sleep(3)  # sleep longer than the 2 second function timeout to force a timeout
    uuid_value: str = str(uuid4())
    return {"message": uuid_value}

@michaelbrewer (Contributor):

I have tested this approach on a separate repo and it all seems to be good so far.

@michaelbrewer (Contributor):

@n2N8Z - I have deployed and tested the PR which fixes this issue. Have a look to see if it fits what you were expecting.

@heitorlessa heitorlessa added the bug Something isn't working label Jul 19, 2022
@github-actions github-actions bot added the pending-release Fix or implementation already in dev waiting to be released label Jul 29, 2022
@github-actions (bot):

This is now released in version 1.26.7!
