
Fix panic when failing to create Duration for exponential backoff from a large float #3621

Merged
merged 9 commits into main from ysaito/fix-panic-in-exp-backoff
May 13, 2024

Conversation

Contributor

@ysaito1001 ysaito1001 commented May 1, 2024

Motivation and Context

Avoids a panic when the Duration for exponential backoff cannot be created from a large float.

Description

Duration::from_secs_f64 may panic. This PR switches to its fallible sibling, Duration::try_from_secs_f64, to avoid the panic. If Duration::try_from_secs_f64 returns an Err, we fall back to max_backoff for subsequent retries. Furthermore, we learned from internal discussion that jitter also needs to be applied to max_backoff. This PR updates calculate_exponential_backoff to handle all of that logic in one place.
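
For illustration, a minimal sketch of a fallible backoff calculation of this shape follows; the function name matches the one mentioned above, but the parameter names and the exact jitter formula are assumptions rather than the verbatim smithy-rs implementation.

use std::time::Duration;

// `base` is a random jitter factor in [0.0, 1.0).
fn calculate_exponential_backoff(
    base: f64,
    initial_backoff: Duration,
    retry_attempts: u32,
    max_backoff: Duration,
) -> Duration {
    // 2^attempts * initial_backoff overflows f64 to infinity for large attempt counts.
    let unjittered_secs = 2_f64.powi(retry_attempts as i32) * initial_backoff.as_secs_f64();
    // `try_from_secs_f64` returns Err for NaN or too-large values, in which case we
    // fall back to `max_backoff` instead of panicking like `from_secs_f64` would.
    let capped = Duration::try_from_secs_f64(unjittered_secs)
        .map(|d| d.min(max_backoff))
        .unwrap_or(max_backoff);
    // Jitter is applied after capping, so it also applies when `max_backoff` is used.
    capped.mul_f64(base)
}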

Testing

  • Added a unit test should_not_panic_when_exponential_backoff_duration_could_not_be_created
  • Manually verified reproduction steps provided in the original PR
More details
// Imports assumed for this repro (not shown in the original snippet):
use aws_config::{retry::RetryConfig, timeout::TimeoutConfig, Region};
use aws_sdk_s3::Client;
use std::time::Duration;

#[tokio::test]
async fn repro_1133() {
    let config = aws_config::from_env()
        .region(Region::new("non-existing-region")) // forces retries
        .retry_config(
            RetryConfig::standard()
                .with_initial_backoff(Duration::from_millis(1))
                .with_max_attempts(100),
        )
        .timeout_config(
            TimeoutConfig::builder()
                .operation_attempt_timeout(Duration::from_secs(180))
                .operation_timeout(Duration::from_secs(600))
                .build(),
        )
        .load()
        .await;

    let client: Client = Client::new(&config);
    let res = client
        .list_objects_v2()
        .bucket("bucket-name-does-not-matter")
        .send()
        .await;

    dbg!(res);
}

Without changes in this PR:

---- repro_1133 stdout ----
thread 'repro_1133' panicked at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/time.rs:741:23:
can not convert float seconds to Duration: value is either too big or NaN
stack backtrace:
...

failures:
    repro_1133

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 9 filtered out; finished in 338.18s

With changes in this PR:

// no panic
---- repro_1133 stdout ----
res = Err(
    TimeoutError(
        TimeoutError {
            source: MaybeTimeoutError {
                kind: Operation,
                duration: 600s,
            },
        },
    ),
)

Checklist

  • I have updated CHANGELOG.next.toml if I made changes to the smithy-rs codegen or runtime crates
  • I have updated CHANGELOG.next.toml if I made changes to the AWS SDK, generated SDK code, or SDK runtime crates

Appendix

runtime-versioner bug fix

This PR also fixes a limitation in runtime-versioner audit. I included the fix in this PR because the issue only occurs under special conditions, and we don't get to reproduce it every day. The issue manifests as follows.

  1. We have a branch off main whose latest release tag at the time was release-2024-04-30
  2. main has since moved ahead, and a new smithy-rs release release-2024-05-08 has been made
  3. We run git merge main, pre-commit hooks run, and we then get audit failures from runtime-versioner:
2024-05-10T16:54:36.434968Z  WARN runtime_versioner::command::audit: there are newer releases since 'release-2024-04-30'
aws-config was changed and version bumped, but the new version number (1.4.0) has already been published to crates.io. Choose a new version number.
aws-runtime was changed and version bumped, but the new version number (1.2.2) has already been published to crates.io. Choose a new version number.
aws-smithy-runtime-api was changed and version bumped, but the new version number (1.6.0) has already been published to crates.io. Choose a new version number.
aws-smithy-types was changed and version bumped, but the new version number (1.1.9) has already been published to crates.io. Choose a new version number.
aws-types was changed and version bumped, but the new version number (1.2.1) has already been published to crates.io. Choose a new version number.
Error: there are audit failures in the runtime crates

This happens because, while the latest main is being merged into our branch, runtime-versioner audit should use release-2024-05-08 as the previous_release_tag, but the pre-commit hooks run the tool with the latest release tag reachable from HEAD of our branch, which is release-2024-04-30. Hence the error.

The fix adds an environment variable, SMITHY_RS_RUNTIME_VERSIONER_AUDIT_PREVIOUS_RELEASE_TAG, to specify a previous release tag override, in addition to the existing --previous-release-tag command-line argument.
Furthermore, the fix relaxes a check in audit. Taking our example, when HEAD is behind release-2024-05-08, the audit no longer fails just because release-2024-05-08 is not an ancestor of HEAD (as stated above, git merge-base --is-ancestor cannot see that while main is being merged), as long as release-2024-04-28 (the latest release seen from HEAD) is an ancestor of release-2024-05-08; a rough sketch of this check follows the diagram below.

   release-2024-04-28               release-2024-05-08           
            │                                │                   
────────────┼───────────────┬────────────────┼───────x─────► main
            │               │HEAD            │       x           
                            │ of                     x           
                            │branch                  x           
                            │                        x           
                            │                        x           
                            │              xxxxxxxxxxx           
                            │              x                     
                            │              x  git merge main     
                            │              x                     
                            │              ▼                     
                            │                                    
                            └──────────────────► feature branch  
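
As a rough sketch (not the actual runtime-versioner code, and assuming the tool shells out to git via std::process::Command), the relaxed ancestry check can be thought of as:

use std::process::Command;

// Returns true if `ancestor` is an ancestor of `descendant` in git history.
fn is_ancestor(ancestor: &str, descendant: &str) -> bool {
    Command::new("git")
        .args(["merge-base", "--is-ancestor", ancestor, descendant])
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

// The previous release tag is acceptable if it is an ancestor of HEAD, or, in the
// mid-merge case drawn above, if the latest release reachable from HEAD is itself
// an ancestor of the specified previous release tag.
fn previous_release_tag_is_valid(previous_release_tag: &str, latest_tag_from_head: &str) -> bool {
    is_ancestor(previous_release_tag, "HEAD")
        || is_ancestor(latest_tag_from_head, previous_release_tag)
}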

To use the fix, set the environment variable to the new release tag and run git merge main:

➜  smithy-rs git:(ysaito/fix-panic-in-exp-backoff) ✗ export SMITHY_RS_RUNTIME_VERSIONER_AUDIT_PREVIOUS_RELEASE_TAG=release-2024-05-08
➜  smithy-rs git:(ysaito/fix-panic-in-exp-backoff) ✗ git merge main
     ...
     Running `/Users/awsaito/src/smithy-rs/tools/target/debug/runtime-versioner audit`
2024-05-10T19:32:26.665578Z  WARN runtime_versioner::tag: expected previous release to be 'release-2024-04-30', but 'release-2024-05-08' was specified. Proceeding with 'release-2024-05-08'.
SUCCESS

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


github-actions bot commented May 1, 2024

A new generated diff is ready to view.

  • AWS SDK (ignoring whitespace)
  • No codegen difference in the Client Test
  • No codegen difference in the Server Test
  • No codegen difference in the Server Test Python
  • No codegen difference in the Server Test Typescript

A new doc preview is ready to view.

@ysaito1001 ysaito1001 marked this pull request as ready for review May 1, 2024 21:09
@ysaito1001 ysaito1001 requested review from a team as code owners May 1, 2024 21:09
tracing::warn!(
    "could not create `Duration` for exponential backoff: {e}"
);
Err(ShouldAttempt::No)
Contributor

This seems logical to me but just to be sure we've done our due diligence I'm going to poke at this.

  1. Why does it make sense to not attempt at all instead of just capping the backoff at retry_cfg.max_backoff()?
  2. I glanced through the SEP and don't see any guidance on this scenario. Have we asked other SDKs what they do, to ensure there is consistency? It may be worth updating the SEP if this is a grey area.

Separate question: if we are setting 100 attempts, what does the token bucket look like here? I would have thought we'd at least be getting close to exceeding our quotas.

Contributor

Yeah, I think this would be better:

Duration::try_from_secs_f64(backoff).unwrap_or(Duration::MAX).min(retry_cfg.max_backoff())

Contributor Author

After internal discussion, we learned it's still debatable whether we should continue making attempts. So we're not going to deliberately make that decision in this PR (and not making that decision doesn't make anything worse than it is today). We can still make an improvement in that area once we've reached a consensus.

This PR allows customers to keep retrying even when overflow occurs; this was updated in 54ad1f0.

Contributor Author

Duration::try_from_secs_f64(backoff).unwrap_or(Duration::MAX).min(retry_cfg.max_backoff())

This one-liner is tempting, but there was a clarification in the design that jitter also needs to be applied to max_backoff. To keep everything contained in one function (handling overflow, the fallback to max_backoff, and ease of testing), I put things the way I did in the above commit. Let me know if you can think of a different way.
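
For completeness, a unit test along the lines of the one named in the Testing section could look like the following, using the sketch signature shown earlier in the description; the actual test in the PR may differ.

#[test]
fn should_not_panic_when_exponential_backoff_duration_could_not_be_created() {
    // With a huge attempt count, 2^attempts * initial_backoff overflows f64 to
    // infinity; the fallible conversion then falls back to the (jittered)
    // `max_backoff` instead of panicking.
    let backoff = calculate_exponential_backoff(
        1.0,                     // jitter factor of 1.0 keeps the test deterministic
        Duration::from_secs(1),  // initial backoff
        100_000,                 // attempt count large enough to overflow
        Duration::from_secs(20), // max backoff
    );
    assert_eq!(Duration::from_secs(20), backoff);
}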

@@ -21,6 +21,9 @@ use std::{
process::Command,
};

const SMITHY_RS_RUNTIME_VERSIONER_AUDIT_PREVIOUS_RELEASE_TAG: &str =
Contributor

question: Why do we need an env variable when the arguments already include previous_release_tag?

Also, can we not make use of the env attribute so that clap sources this for us into the Audit command-line arguments?

Contributor Author

@ysaito1001 ysaito1001 commented May 13, 2024

During development, we usually run runtime-versioner as part of pre-commit hooks. Do you know if there is a way to specify --previous-release-tag <some release tag> to the above invocation done by pre-commit hooks?

Also can we not make use of the env attribute

Haven't used this feature before. Let me try that. Updated in 3b76d4b. Thanks for the suggestion.
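
For reference, a minimal sketch of wiring the environment variable through clap's derive API (the struct and field names here are illustrative, not the actual runtime-versioner definitions; clap's `env` feature must be enabled):

use clap::Parser;

#[derive(Parser, Debug)]
struct AuditArgs {
    /// Previous release tag to audit against. Can also be supplied via the
    /// environment variable, which is what the pre-commit hook path relies on.
    #[arg(long, env = "SMITHY_RS_RUNTIME_VERSIONER_AUDIT_PREVIOUS_RELEASE_TAG")]
    previous_release_tag: Option<String>,
}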

@ysaito1001 ysaito1001 enabled auto-merge May 13, 2024 16:16

@ysaito1001 ysaito1001 added this pull request to the merge queue May 13, 2024
Merged via the queue into main with commit 9454074 May 13, 2024
44 checks passed
@ysaito1001 ysaito1001 deleted the ysaito/fix-panic-in-exp-backoff branch May 13, 2024 17:20