Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RFC: Providing fallback credentials on timeout #2218

Merged
merged 21 commits into from
Jan 27, 2023
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
06ea250
Add RFC: providing fallback credentials on timeout
ysaito1001 Jan 17, 2023
77c2b4f
Update design/src/rfcs/rfc0031_providing_fallback_credentials_on_time…
ysaito1001 Jan 17, 2023
8904a5e
Update design/src/rfcs/rfc0031_providing_fallback_credentials_on_time…
ysaito1001 Jan 17, 2023
d5657d3
Update design/src/rfcs/rfc0031_providing_fallback_credentials_on_time…
ysaito1001 Jan 17, 2023
e4c571c
Update design/src/rfcs/rfc0031_providing_fallback_credentials_on_time…
ysaito1001 Jan 17, 2023
9323dad
Update design/src/rfcs/rfc0031_providing_fallback_credentials_on_time…
ysaito1001 Jan 17, 2023
3834c0e
Update design/src/rfcs/rfc0031_providing_fallback_credentials_on_time…
ysaito1001 Jan 17, 2023
9be64b2
Update design/src/rfcs/rfc0031_providing_fallback_credentials_on_time…
ysaito1001 Jan 17, 2023
e5db721
Update design/src/rfcs/rfc0031_providing_fallback_credentials_on_time…
ysaito1001 Jan 17, 2023
7f55765
Update design/src/rfcs/rfc0031_providing_fallback_credentials_on_time…
ysaito1001 Jan 17, 2023
c79eee5
Update design/src/rfcs/rfc0031_providing_fallback_credentials_on_time…
ysaito1001 Jan 17, 2023
e79eca1
Update design/src/rfcs/rfc0031_providing_fallback_credentials_on_time…
ysaito1001 Jan 18, 2023
69477ae
Merge branch 'main' into ysaito/rfc-providing-fallback-credentials-on…
LukeMathWalker Jan 18, 2023
d413e58
Incorporate review feedback into RFC
ysaito1001 Jan 20, 2023
5bb7f01
Update rfc0031_providing_fallback_credentials_on_timeout.md
ysaito1001 Jan 20, 2023
7b62d47
Update RFC
ysaito1001 Jan 21, 2023
7311f87
Update rfc0031_providing_fallback_credentials_on_timeout.md
ysaito1001 Jan 22, 2023
7aca08b
Update rfc0031_providing_fallback_credentials_on_timeout.md
ysaito1001 Jan 23, 2023
9608f25
Update rfc0031_providing_fallback_credentials_on_timeout.md
ysaito1001 Jan 26, 2023
16a14ba
Merge branch 'main' into ysaito/rfc-providing-fallback-credentials-on…
ysaito1001 Jan 26, 2023
6cc0a72
Merge branch 'main' into ysaito/rfc-providing-fallback-credentials-on…
ysaito1001 Jan 27, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions design/src/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@
- [RFC-0028: SDK Credential Cache Type Safety](./rfcs/rfc0028_sdk_credential_cache_type_safety.md)
- [RFC-0029: Finding New Home for Credential Types](./rfcs/rfc0029_new_home_for_cred_types.md)
- [RFC-0030: Serialization And Deserialization](./rfcs/rfc0030_serialization_and_deserialization.md)
- [RFC-0031: Providing Fallback Credentials on Timeout](./rfcs/rfc0031_providing_fallback_credentials_on_timeout.md)

- [Contributing](./contributing/overview.md)
- [Writing and debugging a low-level feature that relies on HTTP](./contributing/writing_and_debugging_a_low-level_feature_that_relies_on_HTTP.md)
2 changes: 2 additions & 0 deletions design/src/rfcs/overview.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,3 +39,5 @@
- [RFC-0027: Endpoints 2.0](./rfc0027_endpoints_20.md)
- [RFC-0028: SDK Credential Cache Type Safety](./rfc0028_sdk_credential_cache_type_safety.md)
- [RFC-0029: Finding New Home for Credential Types](./rfc0029_new_home_for_cred_types.md)
- [RFC-0030: Serialization And Deserialization](./rfc0030_serialization_and_deserialization.md)
- [RFC-0031: Providing Fallback Credentials on Timeout](./rfc0031_providing_fallback_credentials_on_timeout.md)
251 changes: 251 additions & 0 deletions design/src/rfcs/rfc0031_providing_fallback_credentials_on_timeout.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,251 @@
RFC: Providing fallback credentials on external timeout
=======================================================

> Status: RFC
>
> Applies to: client

For a summarized list of proposed changes, see the [Changes Checklist](#changes-checklist) section.

This RFC proposes a fallback mechanism for credentials providers on external timeout (see the [Terminology](#terminology) section), allowing them to continue serving (possibly stale) credentials for the sake of overall reliability of the intended service; the IMDS credentials provider is an example that must fulfill such a requirement to support static stability.
ysaito1001 marked this conversation as resolved.
Show resolved Hide resolved

Terminology
-----------

- External timeout: A timeout within which a call to the `provide_credentials` method must complete. `provide_credentials` returns a future and if it does not complete within this timeout, it gets dropped on the floor and there is no output retrieved from the future.
ysaito1001 marked this conversation as resolved.
Show resolved Hide resolved
- Internal timeout: A timeout imposed on functions/methods internally called by `provide_credentials` that, when failing to meet any of them, renders `provide_credentials` return an `CredentialsError`. For instance, internal timeout includes connection timeout, TLS negotiation timeout, and HTTP request timeout, if `provide_credentials` needs to fetch credentials via HTTP(S). This is a kind of timeout `provide_credentials` can react and handle it accordingly, i.e. returning fallback credentials or an `CredentialsError`.
ysaito1001 marked this conversation as resolved.
Show resolved Hide resolved

Assumption
----------

This RFC is concerned only with external timeout, as the cost of designing APIs poorly is much higher than that for internal timeout. The former will affect a public trait implemented by all credentials providers whereas the latter can be handled locally by individual credentials providers without affecting one another.
ysaito1001 marked this conversation as resolved.
Show resolved Hide resolved

Problem
-------

We have mentioned static stability. Supporting it calls for the following functional requirement, among others:
ysaito1001 marked this conversation as resolved.
Show resolved Hide resolved
- REQ 1: Once a credentials provider has served credentials, it should continue serving them in the event of a timeout (whether internal or external) while obtaining refreshed credentials.

Today, we have the following trait method to obtain credentials:
```rust
fn provide_credentials<'a>(&'a self) -> future::ProvideCredentials<'a>
where
Self: 'a,
```
This method can be used in async contexts, which means that nothing stops it from being raced against a timeout future, as demonstrated by the following code snippet from `LazyCredentialsCache`:
ysaito1001 marked this conversation as resolved.
Show resolved Hide resolved

```rust
let timeout_future = self.sleeper.sleep(self.load_timeout); // by default self.load_timeout is 5 seconds.
// --snip--
let future = Timeout::new(provider.provide_credentials(), timeout_future);
let result = cache
.get_or_load(|| {
async move {
let credentials = future.await.map_err(|_err| {
CredentialsError::provider_timed_out(load_timeout)
})??;
// --snip--
}
}).await;
ysaito1001 marked this conversation as resolved.
Show resolved Hide resolved
// --snip--
```
This creates an external timeout for `provide_credentials`. If `timeout_future` wins the race, a future for `provide_credentials` gets dropped, `timeout_future` returns an error, and the error is mapped to `CredentialsError::ProviderTimedOut` and returned. This makes it impossible for the variable `provider` above to serve credentials as stated in REQ 1.

A slightly more complex use case involves `CredentialsProviderChain`. It is a manifestation of the chain of responsibility pattern and keeps calling the `provide_credentials` method on each credentials provider down the chain until credentials are returned by one of them. In addition to REQ 1, we have the following functional requirement with respect to `CredentialsProviderChain`:
- REQ 2: Once a credentials provider in the chain has returned credentials, it should continue serving them even in the event of a timeout (whether internal or external) without falling back to another credentials provider.

Referring back to the code snippet above, we analyze two relevant cases (and suppose provider 2 below must meet REQ 1 and REQ 2 in each case):

**Case 1:** Provider 2 successfully loaded credentials but later failed to do so because an external timeout kicked in.

<p align="center">
<img width="750" alt="chain-provider-ext-timeout-1" src="https://user-images.githubusercontent.com/15333866/212421638-d08e4821-2dbe-497f-82c5-c78aab8acbe9.png">
</p>

The figure above illustrates an example. This `CredentialsProviderChain` consists of three credentials providers. When `CredentialsProviderChain::provide_credentials` is called, provider 1's `provide_credentials` is called but does not find credentials so passes the torch to provider 2, which in turn successfully loads credentials and returns them. The next time the method is called, provider 1 does not find credentials but neither does provider 2 this time, because an external timeout by `timeout_future` given to the whole chain kicked in and the future is dropped while provider 2's `provide_credentials` was running. Given the functional requirements, provider 2 should return the previously available credentials but today the code snippet from `LazyCredentialsCache` returns a `CredentialsError::ProviderTimedOut` instead.

**Case 2:** Provider 2 successfully loaded credentials but later was not reached because its preceding provider was still running when an external timeout kicked in.

<p align="center">
<img width="750" alt="chain-provider-ext-timeout-2" src="https://user-images.githubusercontent.com/15333866/212421712-8c6eab11-a0c1-4229-8ba3-67b0bb6056e7.png">
</p>

The figure above illustrates an example with the same setting as the previous figure. Again, when `CredentialsProviderChain::provide_credentials` is called the first time, provider 1 does not find credentials but provider 2 does. The next time the method is called, provider 1 is still executing `provide_credentials` and then an external time by `timeout_future` kicked in. Consequently, the execution of `CredentialsProviderChain::provide_credentials` has been terminated. Given the functional requirements, provider 2 should return the previously available credentials but today the code snippet from `LazyCredentialsCache` returns `CredentialsError::ProviderTimedOut` instead.
ysaito1001 marked this conversation as resolved.
Show resolved Hide resolved


Proposal
--------
To address the problem in the previous section, we propose to add a new method to the `ProvideCredentials` trait called `on_timeout` that looks something like this:
```rust
mod future {
// --snip--

pub struct OnTimeout<'a>(NowOrLater<Option<Credentials>, BoxFuture<'a, Option<Credentials>>>);
Velfi marked this conversation as resolved.
Show resolved Hide resolved

// impls for OnTimeout similar to those for the ProvideCredentials future newtype
}

pub trait ProvideCredentials: Send + Sync + std::fmt::Debug {
// --snip--

fn on_timeout<'a>(&'a self) -> future::OnTimeout<'a> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason this shouldn't be synchronous? What's the use-case for asynchronously retrieving the credentials on timeout? Would we have a timeout for the on_timeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason it shouldn't. I think each has pros/cons:

  • synchronous
    pros: Easier to reason about.
    cons: Will be a breaking change if we want to switch to async in the future.
  • asynchronous
    pros: It's async from the beginning, so less likely to introduce a breaking change.
    cons: Exactly what you brought up, may need to specify a timeout.

Currently, there are no concrete use cases requiring this to be async.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we have a timeout for the on_timeout?

It is possible, although I don't think we specify that as an input argument to fallback_on_interrupt for the same reason I described in the Alternative section in the RFC. Whoever uses fallback_on_timeout can race it against a timeout future just like LazyCredentialsCache does with provide_credentials today. It just that the timeout for fallback_on_timeout will probably be way shorter than that for provide_credentials, as the execution of fallback_on_timeout shouldn't take too long.

Haven't decided whether the timeout for fallback_on_timeout should be introduced today (something like load_timeout for LazyCredentialsCache). If we are to introduce it later, we need to see if we can do so without causing breaking changes.

future::OnTimeout::ready(None)
}
}
```
This method allows credentials providers to have a fallback mechanism on an external timeout and to serve credentials to users if needed. It comes with a default implementation to return a future that immediately yields `None`. Credentials providers that need a different behavior can choose to override the method.

The user experience for the code snippet in question will look like this once this proposal is implemented:
```rust
let timeout_future = self.sleeper.sleep(self.load_timeout); // by default self.load_timeout is 5 seconds.
// --snip--
let future = Timeout::new(provider.provide_credentials(), timeout_future);
let result = cache
.get_or_load(|| {
async move {
let credentials= match future.await {
ysaito1001 marked this conversation as resolved.
Show resolved Hide resolved
Ok(creds) => creds?,
Err(_err) => match provider.on_timeout().await { // can provide fallback credentials
Some(creds) => creds,
None => return Err(CredentialsError::provider_timed_out(load_timeout)),
}
};
// --snip--
}
}).await;
// --snip--
```


How to actually implement this RFC
----------------------------------

Almost all credentials providers do not have to implement their own `on_timeout` except for `CredentialsProviderChain` (`ImdsCredentialsProvider` also needs to implement `on_timeout` when we are adding static stability support to it but that is outside the scope of this RFC).

Considering the two cases we analyzed above, implementing `CredentialsProviderChain::on_timeout` is not so straightforward. Keeping track of whose turn in the chain it is to call `provide_credentials` when an external timeout has ocurred is a challening task. Even if we figured it out, that would still not satisfy `Case 2` above, because it was provider 1 that was actively running when the external timeout kicked in, but the chain should return credentials from provider 2, not from provider 1.
ysaito1001 marked this conversation as resolved.
Show resolved Hide resolved

With all that taken into account, we can consider starting with the following simple approach:
ysaito1001 marked this conversation as resolved.
Show resolved Hide resolved
```rust
impl ProvideCredentials for CredentialsProviderChain {
// --snip--

fn on_timeout<'a>(&'a self) -> future::OnTimeout<'a> {
future::OnTimeout::new(async move {
for (_, provider) in &self.providers {
match provider.on_timeout().await {
creds @ Some(_) => return creds,
None => {}
}
}
None
})
}
}
```
`CredentialsProviderChain::on_timeout` will invoke each provider's `on_timeout` method until credentials are returned by one of them. It ensures that the updated code snippet for `LazyCredentialsCache` can return credentials from provider 2 in both `Case 1` and `Case 2`. Even if `timeout_future` wins the race, the execution subsequently calls `provider.on_timeout().await` to obtain fallback credentials from provider 2, assuming provider 2's `on_timeout` is implemented to return credentials retrieved on the first call to `CredentialsProviderChain::provide_credentials`.
Velfi marked this conversation as resolved.
Show resolved Hide resolved

The downside of this simple approach is that the behavior is not clear if more than one credentials provider in the chain can return credentials from their `on_timeout`. Note, however, that it is the exception rather than the norm for a provider's `on_timeout` to return fallback credentials, at least at the time of writing (01/13/2023). The fact that it returns fallback credentials means that the provider successfully loaded credentials at least once, and it usually continues serving credentials on subsequent calls to `provide_credentials`.

Should we have more than one provider in the chain that can potentially return fallback credentials from `on_timeout`, we could configure the behavior of `CredentialsProviderChain` managing in what order and how each `on_timeout` should be executed; it can be possible to introduce a new API without breaking changes. That being said, we first need to investigate and understand the use case behind it to design the right API.

Alternative
-----------

In this section, we will describe our alternative approach that we ended up dismissing as unworkable.
ysaito1001 marked this conversation as resolved.
Show resolved Hide resolved

Instead of `on_timeout`, we considered the following method to be added to the `ProvideCredentials` trait:
```rust
pub trait ProvideCredentials: Send + Sync + std::fmt::Debug {
// --snip--

/// Returns a future that provides credentials within the given `timeout`.
///
/// The default implementation races `provide_credentials` against
/// a timeout future created from `timeout`.
fn provide_credentials_with_timeout<'a>(
&'a self,
sleeper: Arc<dyn AsyncSleep>,
timeout: Duration,
) -> future::ProvideCredentials<'a>
where
Self: 'a,
{
let timeout_future = sleeper.sleep(timeout);
let future = Timeout::new(self.provide_credentials(), timeout_future);
future::ProvideCredentials::new(async move {
let credentials = future
.await
.map_err(|_err| CredentialsError::provider_timed_out(timeout))?;
credentials
})
}
```
`provide_credentials_with_timeout` encapsulated the timeout race and allowed users to specify how long the external timeout for `provide_credentials` would be. The code snippet from `LazyCredentialsCache` then looked like
```rust
let sleeper = Arc::clone(&self.sleeper);
let load_timeout = self.load_timeout; // by default self.load_timeout is 5 seconds.
// --snip--
let result = cache
.get_or_load(|| {
async move {
let credentials = provider
.provide_credentials_with_timeout(sleeper, load_timeout)
.await?;
// --snip--
}
}).await;
// --snip--
```
However, implementing `CredentialsProviderChain::provide_credentials_with_timeout` quickly ran into the following problem:
```rust
impl ProvideCredentials for CredentialsProviderChain {
// --snip--

fn provide_credentials_with_timeout<'a>(
&'a self,
sleeper: Arc<dyn AsyncSleep>,
timeout: Duration,
) -> future::ProvideCredentials<'a>
where
Self: 'a,
{
future::ProvideCredentials::new(self.credentials_with_timeout(sleeper, timeout))
}
}

impl CredentialsProviderChain {
// --snip--

async fn credentials_with_timeout(
&self,
sleeper: Arc<dyn AsyncSleep>,
timeout: Duration,
) -> provider::Result {
for (_, provider) in &self.providers {
match provider
.provide_credentials_with_timeout(Arc::clone(&sleeper), /* how do we calculate timeout for each provider ? */)
.await
{
Ok(credentials) => {
return Ok(credentials);
}
Err(CredentialsError::ProviderTimedOut(_)) => {
// --snip--
}
Err(err) => {
// --snip--
}
}
}
Err(CredentialsError::provider_timed_out(timeout))
}
```
There are mainly two problems with this approach. The first problem is that as shown above, there is no sensible way to calculate a timeout for each provider in the chain. The second problem is that exposing a parameter like `timeout` at a public trait's level is giving too much control to users; delegating overall timeout to the individual provider means each provider has to get it right.

Changes checklist
-----------------

- [ ] Add `on_timeout` method to the `ProvideCredentials` trait with the default implementation
- [ ] Implement the `OnTimeout` future newtype
- [ ] Implement `CredentialsProviderChain::on_timeout`
- [ ] Add unit tests for `Case 1` and `Case 2`
ysaito1001 marked this conversation as resolved.
Show resolved Hide resolved