
Fix panic when failing to create Duration for exponential backoff from a large float #3621

Merged
merged 9 commits into main from ysaito/fix-panic-in-exp-backoff
May 13, 2024

Conversation

Contributor

@ysaito1001 ysaito1001 commented May 1, 2024

Motivation and Context

Avoids a panic when the Duration for exponential backoff cannot be created from a large float.

Description

Duration::from_secs_f64 may panic. This PR switches to its fallible sibling, Duration::try_from_secs_f64, to avoid the panic. If Duration::try_from_secs_f64 returns an Err, we fall back to max_backoff for subsequent retries. Furthermore, we learned from internal discussion that jitter also needs to be applied to max_backoff. This PR updates calculate_exponential_backoff to handle all of that logic in one place.
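
For illustration, a minimal sketch of a fallible backoff calculation of this shape follows; the function name matches the one mentioned above, but the parameter names and the exact jitter formula are assumptions rather than the verbatim smithy-rs implementation.

use std::time::Duration;

// `base` is a random jitter factor in [0.0, 1.0).
fn calculate_exponential_backoff(
    base: f64,
    initial_backoff: Duration,
    retry_attempts: u32,
    max_backoff: Duration,
) -> Duration {
    // 2^attempts * initial_backoff overflows f64 to infinity for large attempt counts.
    let unjittered_secs = 2_f64.powi(retry_attempts as i32) * initial_backoff.as_secs_f64();
    // `try_from_secs_f64` returns Err for NaN or too-large values, in which case we
    // fall back to `max_backoff` instead of panicking like `from_secs_f64` would.
    let capped = Duration::try_from_secs_f64(unjittered_secs)
        .map(|d| d.min(max_backoff))
        .unwrap_or(max_backoff);
    // Jitter is applied after capping, so it also applies when `max_backoff` is used.
    capped.mul_f64(base)
}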

Testing

  • Added a unit test should_not_panic_when_exponential_backoff_duration_could_not_be_created
  • Manually verified reproduction steps provided in the original PR
More details
// Imports assumed for this repro (not shown in the original snippet):
use aws_config::{retry::RetryConfig, timeout::TimeoutConfig, Region};
use aws_sdk_s3::Client;
use std::time::Duration;

#[tokio::test]
async fn repro_1133() {
    let config = aws_config::from_env()
        .region(Region::new("non-existing-region")) // forces retries
        .retry_config(
            RetryConfig::standard()
                .with_initial_backoff(Duration::from_millis(1))
                .with_max_attempts(100),
        )
        .timeout_config(
            TimeoutConfig::builder()
                .operation_attempt_timeout(Duration::from_secs(180))
                .operation_timeout(Duration::from_secs(600))
                .build(),
        )
        .load()
        .await;

    let client: Client = Client::new(&config);
    let res = client
        .list_objects_v2()
        .bucket("bucket-name-does-not-matter")
        .send()
        .await;

    dbg!(res);
}

Without changes in this PR:

---- repro_1133 stdout ----
thread 'repro_1133' panicked at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/time.rs:741:23:
can not convert float seconds to Duration: value is either too big or NaN
stack backtrace:
...

failures:
    repro_1133

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 9 filtered out; finished in 338.18s

With changes in this PR:

// no panic
---- repro_1133 stdout ----
res = Err(
    TimeoutError(
        TimeoutError {
            source: MaybeTimeoutError {
                kind: Operation,
                duration: 600s,
            },
        },
    ),
)

Checklist

  • I have updated CHANGELOG.next.toml if I made changes to the smithy-rs codegen or runtime crates
  • I have updated CHANGELOG.next.toml if I made changes to the AWS SDK, generated SDK code, or SDK runtime crates

Appendix

runtime-versioner bug fix

This PR also fixes a limitation in runtime-versioner audit. I included the fix in this PR because the issue only occurs under special conditions, and we don't get to reproduce it every day. The issue manifests as follows.

  1. We have a branch off main whose latest release tag at the time was release-2024-04-30
  2. main has since moved ahead, and a new smithy-rs release release-2024-05-08 has been made
  3. We run git merge main, pre-commit hooks run, and we then get audit failures from runtime-versioner:
2024-05-10T16:54:36.434968Z  WARN runtime_versioner::command::audit: there are newer releases since 'release-2024-04-30'
aws-config was changed and version bumped, but the new version number (1.4.0) has already been published to crates.io. Choose a new version number.
aws-runtime was changed and version bumped, but the new version number (1.2.2) has already been published to crates.io. Choose a new version number.
aws-smithy-runtime-api was changed and version bumped, but the new version number (1.6.0) has already been published to crates.io. Choose a new version number.
aws-smithy-types was changed and version bumped, but the new version number (1.1.9) has already been published to crates.io. Choose a new version number.
aws-types was changed and version bumped, but the new version number (1.2.1) has already been published to crates.io. Choose a new version number.
Error: there are audit failures in the runtime crates

This happens because, while the latest main is being merged into our branch, runtime-versioner audit should use release-2024-05-08 as the previous_release_tag, but the pre-commit hooks run the tool with the latest release tag reachable from HEAD of our branch, which is release-2024-04-30. Hence the error.

The fix adds an environment variable, SMITHY_RS_RUNTIME_VERSIONER_AUDIT_PREVIOUS_RELEASE_TAG, to specify a previous release tag override, in addition to the existing --previous-release-tag command-line argument.
Furthermore, the fix relaxes a check in audit. Taking our example, when HEAD is behind release-2024-05-08, the audit no longer fails just because release-2024-05-08 is not an ancestor of HEAD (as stated above, git merge-base --is-ancestor cannot see that while main is being merged), as long as release-2024-04-28 (the latest release seen from HEAD) is an ancestor of release-2024-05-08; a rough sketch of this check follows the diagram below.

   release-2024-04-28               release-2024-05-08           
            │                                │                   
────────────┼───────────────┬────────────────┼───────x─────► main
            │               │HEAD            │       x           
                            │ of                     x           
                            │branch                  x           
                            │                        x           
                            │                        x           
                            │              xxxxxxxxxxx           
                            │              x                     
                            │              x  git merge main     
                            │              x                     
                            │              ▼                     
                            │                                    
                            └──────────────────► feature branch  
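
As a rough sketch (not the actual runtime-versioner code, and assuming the tool shells out to git via std::process::Command), the relaxed ancestry check can be thought of as:

use std::process::Command;

// Returns true if `ancestor` is an ancestor of `descendant` in git history.
fn is_ancestor(ancestor: &str, descendant: &str) -> bool {
    Command::new("git")
        .args(["merge-base", "--is-ancestor", ancestor, descendant])
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

// The previous release tag is acceptable if it is an ancestor of HEAD, or, in the
// mid-merge case drawn above, if the latest release reachable from HEAD is itself
// an ancestor of the specified previous release tag.
fn previous_release_tag_is_valid(previous_release_tag: &str, latest_tag_from_head: &str) -> bool {
    is_ancestor(previous_release_tag, "HEAD")
        || is_ancestor(latest_tag_from_head, previous_release_tag)
}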

To use the fix, set the environment variable to the new release tag and run git merge main:

➜  smithy-rs git:(ysaito/fix-panic-in-exp-backoff) ✗ export SMITHY_RS_RUNTIME_VERSIONER_AUDIT_PREVIOUS_RELEASE_TAG=release-2024-05-08
➜  smithy-rs git:(ysaito/fix-panic-in-exp-backoff) ✗ git merge main
     ...
     Running `/Users/awsaito/src/smithy-rs/tools/target/debug/runtime-versioner audit`
2024-05-10T19:32:26.665578Z  WARN runtime_versioner::tag: expected previous release to be 'release-2024-04-30', but 'release-2024-05-08' was specified. Proceeding with 'release-2024-05-08'.
SUCCESS

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


github-actions bot commented May 1, 2024

A new generated diff is ready to view.

  • AWS SDK (ignoring whitespace)
  • No codegen difference in the Client Test
  • No codegen difference in the Server Test
  • No codegen difference in the Server Test Python
  • No codegen difference in the Server Test Typescript

A new doc preview is ready to view.

@ysaito1001 ysaito1001 marked this pull request as ready for review May 1, 2024 21:09
@ysaito1001 ysaito1001 requested review from a team as code owners May 1, 2024 21:09
tracing::warn!(
    "could not create `Duration` for exponential backoff: {e}"
);
Err(ShouldAttempt::No)
Contributor

This seems logical to me but just to be sure we've done our due diligence I'm going to poke at this.

  1. Why does it make sense to not attempt at all instead of just capping the backoff at retry_cfg.max_backoff()?
  2. I glanced through the SEP and don't see any guidance on this scenario. Have we asked other SDKs what they do, to ensure there is consistency? It may be worth updating the SEP if this is a grey area.

Separate question: if we are setting 100 attempts, what does the token bucket look like here? I would have thought we'd at least be getting close to exceeding our quotas.

Contributor

Yeah, I think this would be better:

Duration::try_from_secs_f64(backoff).unwrap_or(Duration::MAX).min(retry_cfg.max_backoff())

Contributor Author

After internal discussion, we learned it's still debatable whether we should continue making attempts. So we're not going to deliberately make that decision in this PR (and not making that decision doesn't make anything worse than it is today). We can still make an improvement in that area once we've reached a consensus.

This PR allows customers to keep retrying even when overflow occurs; this was updated in 54ad1f0.

Contributor Author

Duration::try_from_secs_f64(backoff).unwrap_or(Duration::MAX).min(retry_cfg.max_backoff())

This one-liner is tempting, but there was a clarification in the design that jitter also needs to be applied to max_backoff. To keep everything contained in one function (handling overflow, the fallback to max_backoff, and ease of testing), I put things the way I did in the above commit. Let me know if you can think of a different way.
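
For completeness, a unit test along the lines of the one named in the Testing section could look like the following, using the sketch signature shown earlier in the description; the actual test in the PR may differ.

#[test]
fn should_not_panic_when_exponential_backoff_duration_could_not_be_created() {
    // With a huge attempt count, 2^attempts * initial_backoff overflows f64 to
    // infinity; the fallible conversion then falls back to the (jittered)
    // `max_backoff` instead of panicking.
    let backoff = calculate_exponential_backoff(
        1.0,                     // jitter factor of 1.0 keeps the test deterministic
        Duration::from_secs(1),  // initial backoff
        100_000,                 // attempt count large enough to overflow
        Duration::from_secs(20), // max backoff
    );
    assert_eq!(Duration::from_secs(20), backoff);
}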

@@ -21,6 +21,9 @@ use std::{
process::Command,
};

const SMITHY_RS_RUNTIME_VERSIONER_AUDIT_PREVIOUS_RELEASE_TAG: &str =
Contributor

question: Why do we need an env variable when the arguments already include previous_release_tag?

Also, can we not make use of the env attribute so that clap sources this for us into the Audit command-line arguments?

Contributor Author

@ysaito1001 ysaito1001 commented May 13, 2024

During development, we usually run runtime-versioner as part of pre-commit hooks. Do you know if there is a way to specify --previous-release-tag <some release tag> to the above invocation done by pre-commit hooks?

Also can we not make use of the env attribute

Haven't used this feature before. Let me try that. Updated in 3b76d4b. Thanks for the suggestion.
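
For reference, a minimal sketch of wiring the environment variable through clap's derive API (the struct and field names here are illustrative, not the actual runtime-versioner definitions; clap's `env` feature must be enabled):

use clap::Parser;

#[derive(Parser, Debug)]
struct AuditArgs {
    /// Previous release tag to audit against. Can also be supplied via the
    /// environment variable, which is what the pre-commit hook path relies on.
    #[arg(long, env = "SMITHY_RS_RUNTIME_VERSIONER_AUDIT_PREVIOUS_RELEASE_TAG")]
    previous_release_tag: Option<String>,
}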

@ysaito1001 ysaito1001 enabled auto-merge May 13, 2024 16:16

@ysaito1001 ysaito1001 added this pull request to the merge queue May 13, 2024
Merged via the queue into main with commit 9454074 May 13, 2024
44 checks passed
@ysaito1001 ysaito1001 deleted the ysaito/fix-panic-in-exp-backoff branch May 13, 2024 17:20