Idempotency and Lambda timeouts #1038

Closed
n2N8Z opened this issue Oct 5, 2021 · 22 comments
Labels: bug (Something isn't working), p2, researching

Comments

@n2N8Z (Contributor) commented Oct 5, 2021

Is your feature request related to a problem? Please describe.
Currently, if a Lambda times out, the idempotency record is left in the INPROGRESS status and subsequent Lambda retries fail with IdempotencyAlreadyInProgressError.

Describe the solution you'd like
Add an additional field to the persistence record, e.g. function_timeout, and store in it the timestamp at which the Lambda invocation will time out: now().timestamp() + int(context.get_remaining_time_in_millis() / 1000)
Expand the DDB condition with: OR function_timeout < :now
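
A rough sketch of what that conditional put could look like with boto3 (table, attribute, and key names here are illustrative, not the utility's actual schema):

import time

import boto3

dynamodb = boto3.client("dynamodb")


def put_in_progress_record(idempotency_key: str, context) -> None:
    """Claim the idempotency key, storing the absolute time at which this
    invocation will be killed by Lambda (all names are illustrative)."""
    now = int(time.time())
    function_timeout = now + int(context.get_remaining_time_in_millis() / 1000)

    dynamodb.put_item(
        TableName="IdempotencyTable",  # illustrative table name
        Item={
            "id": {"S": idempotency_key},
            "status": {"S": "INPROGRESS"},
            "expiration": {"N": str(now + 3600)},  # regular record expiry
            "function_timeout": {"N": str(function_timeout)},
        },
        # Allow the put if the key is new, the record has expired, OR the previous
        # invocation's hard timeout has passed (so it cannot still be running).
        ConditionExpression=(
            "attribute_not_exists(id) OR expiration < :now OR function_timeout < :now"
        ),
        ExpressionAttributeValues={":now": {"N": str(now)}},
    )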

Describe alternatives you've considered
Create a signal handler in the Lambda that raises an exception just before the Lambda is about to time out.
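
A minimal sketch of that alternative using SIGALRM (the handler name, buffer, and do_work are illustrative placeholders; note the reliability caveats discussed later in this thread):

import signal


class NearTimeout(Exception):
    """Raised shortly before Lambda would terminate the invocation."""


def _raise_near_timeout(signum, frame):
    raise NearTimeout("About to hit the Lambda timeout, failing fast so cleanup can run")


def do_work(event):
    ...  # placeholder for the real business logic


def handler(event, context):
    # Arm an alarm a couple of seconds before the real timeout (buffer is illustrative).
    remaining_seconds = int(context.get_remaining_time_in_millis() / 1000)
    signal.signal(signal.SIGALRM, _raise_near_timeout)
    signal.alarm(max(remaining_seconds - 2, 1))
    try:
        return do_work(event)
    finally:
        signal.alarm(0)  # disarm once we finish (or fail) normally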

Additional context
N/A

@heitorlessa (Contributor):

@cakepietoast didn't we have that covered in the initial implementation?

Unsure if it's a regression or if we accidentally missed that

@michaelbrewer (Contributor):

@heitorlessa could this be referring to cases where the Lambda execution times out before updating the status from INPROGRESS to COMPLETED?

One recommendation would be to increase the Lambda timeout to be much longer (even if the API Gateway request times out within 25 seconds, the Lambda can run for much longer).

Otherwise, the idempotent handler could check the remaining time and clean up before exiting?
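
A rough sketch of that idea, assuming the work can be split into checkpoints and the handler has direct access to the idempotency table and key - in practice the utility manages these internally, and compute_idempotency_key / steps_of_work below are hypothetical helpers:

import boto3

dynamodb = boto3.client("dynamodb")
TABLE_NAME = "IdempotencyTable"  # illustrative table name
SAFETY_BUFFER_MS = 2_000         # illustrative buffer before the hard timeout


def compute_idempotency_key(event) -> str:
    ...  # hypothetical: however the record key is derived for this table


def steps_of_work(event):
    ...  # hypothetical: the business logic broken into resumable checkpoints


def handler(event, context):
    key = compute_idempotency_key(event)
    for step in steps_of_work(event):
        if context.get_remaining_time_in_millis() < SAFETY_BUFFER_MS:
            # Too close to the timeout: remove our INPROGRESS record so the
            # retry isn't rejected with IdempotencyAlreadyInProgressError,
            # then fail fast instead of being killed mid-step.
            dynamodb.delete_item(TableName=TABLE_NAME, Key={"id": {"S": key}})
            raise TimeoutError("Bailing out before the Lambda timeout")
        step()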

@heitorlessa (Contributor) commented Oct 7, 2021 via email

@michaelbrewer (Contributor):

Yes, a TTL on INPROGRESS records would allow for this to clean itself up. However, I would still suggest extending the Lambda timeout to be as long as possible.

@n2N8Z (Contributor, Author) commented Oct 11, 2021

When the Lambda fails, the first retry occurs 1 minute later, and the second retry 2 minutes after that.
So if the Lambda has failed due to a timeout, the idempotency item needs to be cleaned up within 1 minute, otherwise the retry will fail. Using TTL to have an item reliably deleted in under 1 minute does not sound like it will work - "TTL typically deletes expired items within 48 hours of expiration."
If the timeout timestamp is stored in the idempotency item, it can reliably be checked on the next run.

@heitorlessa (Contributor):

It depends on the event source/invocation type. But what I meant by a shorter INPROGRESS expiry was both a TTL and a timestamp condition, since TTL alone - as you pointed out - wouldn't be sufficient or reliable.

@to-mc (Contributor) commented Oct 13, 2021

> When the Lambda fails, the first retry occurs 1 minute later, and the second retry 2 minutes after that. So if the Lambda has failed due to a timeout, the idempotency item needs to be cleaned up within 1 minute, otherwise the retry will fail.

To take your example - just because the Lambda has timed out doesn't mean there were no side effects. The idempotency utility is designed to safely wrap code that should not be run more than once.

The issue here is that there's an assumption being made that if Lambda times out, it is safe to retry. Consider an example where a Lambda function starts executing, then code runs that causes a side effect (like charging a payment), then the function times out before completion. If we delete the INPROGRESS record and allow a retry, the payment will be charged again. Put another way, there is no way for the idempotency utility to know whether a retry can safely be made or not - that is domain/implementation specific. We can't change the default behaviour there, as customers are likely relying on it.

> Using TTL to have an item reliably deleted in under 1 minute does not sound like it will work - "TTL typically deletes expired items within 48 hours of expiration." If the timeout timestamp is stored in the idempotency item, it can reliably be checked on the next run.

As Heitor mentions, this isn't a problem - even if the item still persists in DynamoDB, the idempotency utility checks the expiry timestamp to see if the record has expired or not.

Taking the above into account, the only solution I can think of is to add an option to allow for setting a shorter timeout on INPROGRESS records so they can be expired on a different schedule to successful requests. We can also add a section to clarify the behavior (what happens when a Lambda function times out?).

@n2N8Z (Contributor, Author) commented Oct 14, 2021 via email

@to-mc (Contributor) commented Oct 15, 2021

One small clarification, even though I don't think it's the most relevant piece here: Lambda doesn't really have a default for retries - it depends on how it is invoked. For example, in the case of a synchronous execution there are no retries, and the client needs to decide how to handle them.

> agreed, but currently you are changing the AWS default.

I don't see it that way. We're adding a utility which makes it simple(r) to implement business logic which causes retries to fail if we can't guarantee the code didn't already run. You could also think of the utility as protecting functionality from retries - regardless of where they come from.

> setting it to the lambda's actual timeout would be about the same as what I suggested. Anything else will potentially interfere with the default retries.

"Interfering with retries" is in many ways exactly what the idempotency utility is supposed to do. The utility wraps a piece of your code that you consider unsafe to execute more than once (with the same arguments, within a certain time period). After a Lambda function times out, it may have been executed completely, partially, or not at all. If we execute the function again without knowing more than that, its no longer idempotent:

"The property of idempotency means that an operation does not cause additional side effects if it is called more than once with the same input parameters."

If you want retries to re-trigger your function by default after a timeout - even if you aren't sure how much of your code has already run - my gut feeling is that you probably shouldn't be using the idempotency utility. That said, I'm always keen to learn about new use cases we might not have thought of, so if you're able to share more details about what you want to achieve, I'd be happy to dive deeper. If you prefer Slack, please drop by the #lambda-powertools channel on the AWS Developers Slack (details are in this project's README).

@n2N8Z (Contributor, Author) commented Oct 18, 2021

> Lambda doesn't really have a default for retries - it depends on how it is invoked. For example, in the case of a synchronous execution there are no retries, and the client needs to decide how to handle them.

True, there are no retries for synchronous execution, but for event-based execution there are default retries that occur, as I've mentioned previously.

> I don't see it that way. We're adding a utility which makes it simple(r) to implement business logic which causes retries to fail if we can't guarantee the code didn't already run. You could also think of the utility as protecting functionality from retries - regardless of where they come from.

The utility is treating explicit errors from the Lambda differently from timeout errors, whereas AWS performs retries regardless of whether it was an application error or a timeout.

> "Interfering with retries" is in many ways exactly what the idempotency utility is supposed to do.

I guess that's largely true, but I want the idempotency utility to only stop reprocessing of successfully processed events. If the Lambda fails or times out, I want the retries to occur.

@to-mc (Contributor) commented Oct 18, 2021

> I guess that's largely true, but I want the idempotency utility to only stop reprocessing of successfully processed events. If the Lambda fails or times out, I want the retries to occur.

Understood. I'll try to get a PR in this week to add a new, independently configurable "inprogress_expires_after_seconds" arg, which you could then set to the same value as your Lambda timeout. Something like this:

persistence_store = DynamoDBPersistenceLayer(
    table_name="SomeDynamoDBTable",
    expires_after_seconds=3600,  # successful idempotent records will expire after 60 minutes
    inprogress_expires_after_seconds=30  # in progress idempotent records will expire after 30 seconds 
)


@idempotent(persistence_store=persistence_store)
def handler(event, context):
    ...

Can you confirm that would solve for your use case?

@n2N8Z (Contributor, Author) commented Oct 18, 2021

In principle yes, but the details matter:
If it relies on DynamoDB TTL expiry to delete the idempotency record, then it will probably not work.
I think the actual timestamp at which the Lambda invocation will time out needs to be stored, rather than the Lambda duration: "now().timestamp() + int(context.get_remaining_time_in_millis() / 1000)". The idempotency record MUST be deleted, or be verifiable as invalid, when the retry invocation occurs 1 minute and 2 minutes after the failures.

@heitorlessa heitorlessa transferred this issue from aws-powertools/powertools-lambda-python Nov 13, 2021
@heitorlessa heitorlessa transferred this issue from aws-powertools/powertools-lambda Feb 22, 2022
@heitorlessa (Contributor):

@n2N8Z and everyone - We have recently found a better way to solve this using an internal no-op extension.

POC below - this allows us to receive a SIGTERM in a background thread to do any cleanup, like changing the status of the idempotency data record -- we let it take its course if we don't receive a SIGTERM signal.

https://gist.github.com/heitorlessa/4aad06c39a1d520ff8c42adc72b0bcd5
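
For context, a minimal sketch of the pattern (not the exact POC; the extension name and the cleanup helper are illustrative): register a no-op internal extension during init so the runtime process receives SIGTERM on shutdown, then run cleanup from the signal handler:

import json
import os
import signal
import sys
import urllib.request


def _register_noop_internal_extension() -> None:
    """Register an internal extension that subscribes to no events, so the
    runtime process gets a SIGTERM before the sandbox is shut down."""
    runtime_api = os.environ["AWS_LAMBDA_RUNTIME_API"]
    request = urllib.request.Request(
        f"http://{runtime_api}/2020-01-01/extension/register",
        data=json.dumps({"events": []}).encode(),
        headers={
            "Lambda-Extension-Name": "noop-sigterm",  # illustrative extension name
            "Content-Type": "application/json",
        },
    )
    urllib.request.urlopen(request)


def _cleanup_idempotency_record() -> None:
    ...  # hypothetical: delete or expire the INPROGRESS record


def _on_sigterm(signum, frame):
    # Small cleanup window before the sandbox is recycled.
    _cleanup_idempotency_record()
    sys.exit(0)


_register_noop_internal_extension()  # runs once, during function init
signal.signal(signal.SIGTERM, _on_sigterm)


def handler(event, context):
    ...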

@n2N8Z (Contributor, Author) commented Mar 24, 2022

Sorry, I kept meaning to get back to this and forgetting.
What is sending the SIGTERM? Is it the actual Lambda scheduler?
I would think it would be a very bad idea to intercept the actual timeout mechanism, thereby telling AWS: "No, we are not going to terminate now as requested; instead we are going to do some more work, and we'll finish when we're good and ready."
The other problem with bringing signal handlers into this is that you now rely on every module you use not to interfere with your signal handler. I unfortunately ran into this because I had a long-running Lambda where I was scheduling a timed signal to do a quick bit of cleanup before the real timeout occurred, but my Python code was not receiving the SIGALRM while the binary MySQL module was executing.

Why is this proposal better than adding an additional field to the persistence record, e.g. function_timeout, storing in it the timestamp at which the Lambda will time out: now().timestamp() + int(context.get_remaining_time_in_millis() / 1000), and expanding the DDB condition with: OR function_timeout < :now?

@heitorlessa (Contributor):

Yes, it is the Lambda scheduler, which orchestrates sandboxes that are recycled when a timeout happens. A SIGTERM is only sent if an internal or external extension is registered -- the former is what the POC above does using a no-op internal extension. The Lambda scheduler guarantees 500ms for this logic to run.

SIGALRM, as you discovered, is not guaranteed. For cleanup activities whether a timeout happens or not (clean exit, idle for X time), an external Lambda extension is more appropriate, since you not only have 2s to handle it, it also doesn't interfere with the request path (e.g., early return is possible while cleanup happens in the background).

https://docs.aws.amazon.com/lambda/latest/dg/runtimes-extensions-api.html

Hopefully that explains it.

@cakepietoast had thoughts about both solutions not being good enough

@to-mc (Contributor) commented Mar 25, 2022

Fundamentally, I'm still of the mind that deleting INPROGRESS records violates idempotency. When "stuck" INPROGRESS records occur, it is because there's no way for the utility to know whether the Lambda failed before, during, or after the unit of work it's being asked to perform. As such, in the rare cases where this occurs, you would need to evaluate/investigate to be able to make the correct decision on how to proceed.

With that said, the discussion on this thread demonstrates that we can improve this utility to apply to a wider scope of problems, where strict idempotency guarantees are not required. Of the two possible solutions we have, I'm in favour of the initial suggestion to use a separate timeout counter for INPROGRESS records. It is simpler, requires less work for users to implement, and is less likely to cause issues similar to the one @n2N8Z mentions in the last comment.

I haven't had time to work on this - and am not likely to have time for it in the near future. If anyone would like to submit a PR for it at a time when the repository is accepting feature requests, the team would be happy to review.

@heitorlessa (Contributor):

Sounds like we have a winner implementation then - use a timeout counter property and an OR condition, as originally suggested by @n2N8Z.

Two clarifications, though:

  • The timeout extension wouldn't require any work from the customer; it'd happen transparently, with no code change.
  • SIGTERM is guaranteed by the Lambda scheduler for internal/external extensions upon function timeout. SIGALRM, however, isn't, and we wouldn't use it as a solution, for accuracy and reliability reasons.

Given this is borderline a bug, I'd be happy to review and contribute to any PR. If no one has the bandwidth, we can revisit in the August-onwards timeframe.

Thanks a lot everyone

michaelbrewer added a commit to gyft/aws-lambda-powertools-python that referenced this issue May 1, 2022
Changes:
- Initial draft on an option to clean up on function timeout

close aws-powertools#1038
@michaelbrewer (Contributor):

I am working on an initial draft PR based on what @n2N8Z has suggested (#1198). It is still untested and is more to see the impact of the changes.

This does not include the fancy extension method @heitorlessa spoke about.

If I have time to complete the test coverage and do some end-to-end testing, then maybe @cakepietoast or @heitorlessa can look at it further.

Otherwise the docs will have to be updated with the list of caveats. :)

@michaelbrewer (Contributor) commented May 4, 2022

Deployed and tested a version of my PR #1198 and it works.

template.yaml with Powertools based on #1198:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Testing idempotency timeouts
Globals:
  Function:
    Timeout: 2

Resources:
  IdempotencyTable:
    Type: AWS::DynamoDB::Table
    Properties:
      AttributeDefinitions:
        -   AttributeName: id
            AttributeType: S
      KeySchema:
        -   AttributeName: id
            KeyType: HASH
      TimeToLiveSpecification:
        AttributeName: expiration
        Enabled: true
      BillingMode: PAY_PER_REQUEST

  IdempotencyFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: IdempotencyFunction
      CodeUri: src/
      Handler: app.handler
      Runtime: python3.9
      MemorySize: 512
      Layers: 
        - !Ref PowertoolsLayer
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref IdempotencyTable
      Environment:
        Variables:
          LOG_LEVEL: INFO
          POWERTOOLS_SERVICE_NAME: example
          DYNAMODB_TABLE: !Ref IdempotencyTable

  PowertoolsLayer:
    Type: AWS::Serverless::LayerVersion
    Properties:
      LayerName: MyLambdaLayer
      ContentUri: layer/
      CompatibleRuntimes:
        - python3.9

app.py handler with a sleep longer than the function timeout:

import time
from os import environ
from uuid import uuid4

from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    IdempotencyConfig,
    idempotent,
)

DYNAMODB_TABLE = environ["DYNAMODB_TABLE"]
config = IdempotencyConfig(function_timeout_clean_up=True)
persistence_layer = DynamoDBPersistenceLayer(table_name=DYNAMODB_TABLE)


@idempotent(persistence_store=persistence_layer, config=config)
def handler(event, context):
    time.sleep(3)  # sleep longer than the 2 second function timeout to force a timeout
    uuid_value: str = str(uuid4())
    return {"message": uuid_value}

@michaelbrewer (Contributor):

I have tested this approach on a separate repo and it all seems to be good so far.

@michaelbrewer (Contributor):

@n2N8Z - I have deployed and tested the PR which fixes this issue. Have a look to see if it fits what you were expecting.

@heitorlessa heitorlessa added the bug Something isn't working label Jul 19, 2022
@github-actions github-actions bot added the pending-release Fix or implementation already in dev waiting to be released label Jul 29, 2022
@github-actions (bot):

This is now released in version 1.26.7!
