Cache AWS Config's CredentialsProvider to reduce STS calls #1235

Merged
merged 7 commits into from
Mar 28, 2024

Conversation

erhancagirici
Collaborator

@erhancagirici erhancagirici commented Mar 22, 2024

Description of your changes

Fixes #997
Introduces a global credential cache to reduce AWS STS calls.
Only IRSA credentials are cached.

The new provider cache is a two-layer hierarchical cache. The L1 cache is an AWS SDK for Go aws.CredentialsCache, and the L2 cache holds those aws.CredentialsCache instances. The L2 cache key is derived from the well-known IRSA authentication parameters and the contents of the OIDC ID token file. The new cache also stores the AWS account ID for a given IRSA configuration and replaces the identity cache for IRSA configurations.
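
To make the two layers concrete, here is a minimal sketch of the idea, not the code in this PR. The names `providerCache`, `entry`, `retriever` and `countBuilds` are hypothetical, and `retriever` only stands in for the aws.CredentialsCache so that the example runs with the standard library alone.

```go
package main

import (
	"fmt"
	"sync"
)

// credentials is a stand-in for aws.Credentials.
type credentials struct {
	AccessKeyID string
}

// retriever is a hypothetical stand-in for the L1 cache, which is an
// aws.CredentialsCache in the PR.
type retriever interface {
	Retrieve() (credentials, error)
}

// entry is what the L2 cache keeps per configuration: the L1 cache
// and the AWS account ID resolved for it.
type entry struct {
	l1        retriever
	accountID string
}

// providerCache is the L2 cache, mapping a cache key to an entry.
type providerCache struct {
	mu      sync.Mutex
	entries map[string]*entry
}

func newProviderCache() *providerCache {
	return &providerCache{entries: make(map[string]*entry)}
}

// get returns the entry for key and builds it with build on a miss.
// For simplicity the lock is held while build runs.
func (c *providerCache) get(key string, build func() (*entry, error)) (*entry, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[key]; ok {
		return e, nil
	}
	e, err := build()
	if err != nil {
		return nil, err
	}
	c.entries[key] = e
	return e, nil
}

// staticRetriever is a fake L1 cache that always returns the same value.
type staticRetriever struct{ creds credentials }

func (s staticRetriever) Retrieve() (credentials, error) { return s.creds, nil }

// countBuilds looks up the same key n times and reports how often the
// entry had to be built, i.e. how many L2 misses occurred.
func countBuilds(n int) int {
	cache := newProviderCache()
	builds := 0
	for i := 0; i < n; i++ {
		e, err := cache.get("example-key", func() (*entry, error) {
			builds++
			return &entry{
				l1:        staticRetriever{creds: credentials{AccessKeyID: "EXAMPLE"}},
				accountID: "123456789012",
			}, nil
		})
		if err != nil {
			panic(err)
		}
		if _, err := e.l1.Retrieve(); err != nil {
			panic(err)
		}
	}
	return builds
}

func main() {
	fmt.Println("L2 misses for 100 lookups:", countBuilds(100))
}
```

Repeated lookups with the same key reuse the stored entry, so the expensive work behind `build` (creating the credentials provider and resolving the account ID) happens once per key.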

Background:

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

Tested manually with provider configs with:

  • IRSA auth configuration
  • IRSA + RoleChain configuration
  • WebIdentity + RoleChain configuration [no credential caching with this PR]
  • @turkenf has also validated the provider package index.docker.io/ulucinar/provider-aws-ec2:v1.3.0-0fbbf02b3656352c729396851646d12ef80a1496 for Upbound authentication on Upbound Cloud.
  • A static (long term) credential configuration (with authentication type Secret) has succeeded here: https://github.com/crossplane-contrib/provider-upjet-aws/actions/runs/8468707015

Two experiments were done using 4 managed resources (MRs) with a plain IRSA configuration and an IRSA configuration with an assume role chain of length two with the following ProviderConfig.aws:

apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  assumeRoleChain:
  - roleARN: arn:aws:iam::<account ID>:role/alper-rc-1
  - roleARN: arn:aws:iam::<account ID>:role/alper-rc-2
  credentials:
    source: IRSA

During these experiments, we forced frequent reconciliations of the MRs (every 3 seconds) in constant update loops and also observed the AWS CloudTrail event history for an extended period of time. Here are the relevant events from CloudTrail:
[image: CloudTrail event history]

As the logs show, for these 4 MRs, at most one sts.AssumeRoleWithWebIdentity operation per hour has been recorded, which shows the effectiveness of the credential cache for IRSA authentication. Please note that the temporary credentials issued by sts.AssumeRoleWithWebIdentity are valid for one hour. It's the L1 cache that discards these temporary credentials after one hour and renews them. During this extended period, because the L2 cache item is not discarded, only one sts.GetCallerIdentity operation has been observed.

I also tested the L2 cache by invalidating the cache entry prematurely. The following event logs show how this results in a premature call to sts.AssumeRoleWithWebIdentity:
[image: CloudTrail event history after invalidating the L2 cache entry]

After the temporary credentials were fetched at March 28, 2024, 16:36:47, we would not expect them to be refreshed for an hour. However, because the L2 cache entry was forced to go stale, there was a premature call at March 28, 2024, 16:46:22.
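
The expiry behaviour described in these experiments can be simulated with a small sketch. This is not the AWS SDK implementation: `expiringCache`, `Invalidate` and `simulate` are hypothetical names, the clock is injected so no real time passes, and `Invalidate` only models the effect of the L2 entry going stale by dropping the cached value.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// expiringCache is a simplified model of the L1 cache: it keeps one
// value and fetches a new one only when the old one has expired.
type expiringCache struct {
	mu        sync.Mutex
	now       func() time.Time // injected clock (assumption for testing)
	ttl       time.Duration
	fetch     func() string
	value     string
	expiresAt time.Time
	valid     bool
}

// Retrieve returns the cached value, refreshing it when it is missing
// or expired.
func (c *expiringCache) Retrieve() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.valid || !c.now().Before(c.expiresAt) {
		c.value = c.fetch()
		c.expiresAt = c.now().Add(c.ttl)
		c.valid = true
	}
	return c.value
}

// Invalidate drops the cached value, so the next Retrieve fetches again.
func (c *expiringCache) Invalidate() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.valid = false
}

// simulate calls Retrieve every stepSec seconds for durationSec seconds
// and returns the number of fetches. A negative invalidateAtSec means
// the cache is never invalidated.
func simulate(durationSec, stepSec, ttlSec, invalidateAtSec int) int {
	start := time.Unix(0, 0)
	current := start
	fetches := 0
	c := &expiringCache{
		now: func() time.Time { return current },
		ttl: time.Duration(ttlSec) * time.Second,
		fetch: func() string {
			fetches++
			return fmt.Sprintf("temporary-credentials-%d", fetches)
		},
	}
	for t := 0; t < durationSec; t += stepSec {
		current = start.Add(time.Duration(t) * time.Second)
		if t == invalidateAtSec {
			c.Invalidate()
		}
		c.Retrieve()
	}
	return fetches
}

func main() {
	fmt.Println("two hours, no invalidation:", simulate(7200, 3, 3600, -1))
	fmt.Println("one hour, invalidated after ten minutes:", simulate(3600, 3, 3600, 600))
}
```

With a one-hour lifetime and a retrieval every 3 seconds, the simulation fetches once per hour, and an invalidation in the middle of the hour causes one additional, premature fetch.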

Also tested the PR on top of @mergenci's API Call Counters PR. Under an update loop, the reported API call counters for sts.AssumeRoleWithWebIdentity & sts.GetCallerIdentity are not increasing:

❯ curl -s http://localhost:8080/metrics | grep upjet | grep upjet_resource_external_api_calls_total
# HELP upjet_resource_external_api_calls_total The number of external API calls.
# TYPE upjet_resource_external_api_calls_total counter
upjet_resource_external_api_calls_total{operation="AssumeRole",service="STS"} 2
upjet_resource_external_api_calls_total{operation="AssumeRoleWithWebIdentity",service="STS"} 1
upjet_resource_external_api_calls_total{operation="CreateRole",service="IAM"} 1
upjet_resource_external_api_calls_total{operation="GetCallerIdentity",service="STS"} 1
upjet_resource_external_api_calls_total{operation="GetRole",service="IAM"} 26
upjet_resource_external_api_calls_total{operation="GetRolePolicy",service="IAM"} 25
upjet_resource_external_api_calls_total{operation="ListAttachedRolePolicies",service="IAM"} 25
upjet_resource_external_api_calls_total{operation="ListRolePolicies",service="IAM"} 25
upjet_resource_external_api_calls_total{operation="PutRolePolicy",service="IAM"} 1
❯ curl -s http://localhost:8080/metrics | grep upjet | grep upjet_resource_external_api_calls_total
# HELP upjet_resource_external_api_calls_total The number of external API calls.
# TYPE upjet_resource_external_api_calls_total counter
upjet_resource_external_api_calls_total{operation="AssumeRole",service="STS"} 2
upjet_resource_external_api_calls_total{operation="AssumeRoleWithWebIdentity",service="STS"} 1
upjet_resource_external_api_calls_total{operation="CreateRole",service="IAM"} 1
upjet_resource_external_api_calls_total{operation="GetCallerIdentity",service="STS"} 1
upjet_resource_external_api_calls_total{operation="GetRole",service="IAM"} 61
upjet_resource_external_api_calls_total{operation="GetRolePolicy",service="IAM"} 60
upjet_resource_external_api_calls_total{operation="ListAttachedRolePolicies",service="IAM"} 60
upjet_resource_external_api_calls_total{operation="ListRolePolicies",service="IAM"} 60
upjet_resource_external_api_calls_total{operation="PutRolePolicy",service="IAM"} 1
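
Comparing two scrapes by eye works for a handful of counters. A small helper can do the same comparison, sketched below under some assumptions: `parseCounters` and `unchanged` are hypothetical helpers, the parser only handles the simple `name{labels} value` lines shown above (no timestamps), and the two scrapes are shortened copies of the output above.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

const firstScrape = `# TYPE upjet_resource_external_api_calls_total counter
upjet_resource_external_api_calls_total{operation="AssumeRole",service="STS"} 2
upjet_resource_external_api_calls_total{operation="AssumeRoleWithWebIdentity",service="STS"} 1
upjet_resource_external_api_calls_total{operation="GetCallerIdentity",service="STS"} 1
upjet_resource_external_api_calls_total{operation="GetRole",service="IAM"} 26
`

const secondScrape = `# TYPE upjet_resource_external_api_calls_total counter
upjet_resource_external_api_calls_total{operation="AssumeRole",service="STS"} 2
upjet_resource_external_api_calls_total{operation="AssumeRoleWithWebIdentity",service="STS"} 1
upjet_resource_external_api_calls_total{operation="GetCallerIdentity",service="STS"} 1
upjet_resource_external_api_calls_total{operation="GetRole",service="IAM"} 61
`

// parseCounters maps each series (metric name plus labels) to its
// value. Comment lines and lines that do not end in a number are
// skipped.
func parseCounters(text string) map[string]float64 {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		i := strings.LastIndex(line, " ")
		if i < 0 {
			continue
		}
		v, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			continue
		}
		out[line[:i]] = v
	}
	return out
}

// unchanged returns the sorted series whose value is the same in both
// scrapes.
func unchanged(before, after map[string]float64) []string {
	var names []string
	for k, v := range before {
		if w, ok := after[k]; ok && w == v {
			names = append(names, k)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	for _, s := range unchanged(parseCounters(firstScrape), parseCounters(secondScrape)) {
		fmt.Println("not increasing:", s)
	}
}
```

For the sample data, the three STS series are reported as not increasing, while the IAM GetRole counter keeps growing with each reconciliation.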

@ulucinar ulucinar force-pushed the aws-credentials-cache branch from 607c80f to 81321c1 Compare March 26, 2024 11:48
Collaborator

@ulucinar ulucinar left a comment

Thank you @erhancagirici, left some comments to record the suggested changes I'll push as a separate commit. I've also broken the comment lines at (most) 80 chars, the convention in Crossplane repositories.

Let's also add the relevant unit tests.

}

type awsCredentialsProviderCacheEntry struct {
*aws.Config

As discussed, we had better remove the *aws.Config from the cache entry to be on the safe side for concurrency.

// since this is a hot-path in the execution, do not always update
// the last access times, it is fine to evict the LRU entry on a less
// granular precision.
if time.Since(cacheEntry.AccessedAt) > 10*time.Minute {

To be on the thread-safe side, we had better read the last access time in the critical section above as the cache entry is a pointer.
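
A minimal sketch of what this could look like, not the final code: the last access time is copied while the lock is held, and the write lock is taken only when the coarse 10-minute threshold has passed. The type and function names (`cache`, `cacheEntry`, `touch`, `countUpdates`) are hypothetical, and the use of a sync.RWMutex is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// accessUpdateInterval is the coarse precision at which the last
// access time is maintained.
const accessUpdateInterval = 10 * time.Minute

type cacheEntry struct {
	accessedAt time.Time
}

type cache struct {
	mu      sync.RWMutex
	entries map[string]*cacheEntry
}

// touch looks up key and refreshes its last access time only when the
// stored time is older than accessUpdateInterval. The stored time is
// read inside the critical section because the entry is a shared
// pointer.
func (c *cache) touch(key string, now time.Time) (found, updated bool) {
	c.mu.RLock()
	e, ok := c.entries[key]
	var last time.Time
	if ok {
		last = e.accessedAt // copy while holding the lock
	}
	c.mu.RUnlock()
	if !ok {
		return false, false
	}
	if now.Sub(last) <= accessUpdateInterval {
		return true, false
	}
	c.mu.Lock()
	e.accessedAt = now
	c.mu.Unlock()
	return true, true
}

// countUpdates simulates accesses spaced stepSeconds apart and returns
// how many of them wrote the last access time.
func countUpdates(accesses, stepSeconds int) int {
	start := time.Unix(0, 0)
	c := &cache{entries: map[string]*cacheEntry{
		"k": &cacheEntry{accessedAt: start},
	}}
	updates := 0
	for i := 1; i <= accesses; i++ {
		now := start.Add(time.Duration(i*stepSeconds) * time.Second)
		if _, updated := c.touch("k", now); updated {
			updates++
		}
	}
	return updates
}

func main() {
	fmt.Println("writes for 1000 accesses, 3s apart:", countUpdates(1000, 3))
}
```

The hot path then takes only the read lock for most calls, and no goroutine reads `accessedAt` outside a critical section.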

cacheEntry.AccessedAt = time.Now()
c.mu.Unlock()
}
return cacheEntry.credProvider.Retrieve(ctx)

Here we assume the aws.CredentialsProvider is thread-safe, which may not be true. Either we call aws.CredentialsProvider.Retrieve in a critical section, or we make sure that the aws.CredentialsProvider implementation is properly synchronized, e.g., by enforcing the field type so that we always use an aws.CredentialsCache, which is properly synchronized.

AccessedAt time.Time
}

func (c *AWSCredentialsProviderCache) RetrieveCredentials(ctx context.Context, pc *v1beta1.ProviderConfig, awsCfg *aws.Config) (aws.Credentials, error) {

Let's hide the *aws.Config here as we just need the AWS credential cache & region:

Suggested change
func (c *AWSCredentialsProviderCache) RetrieveCredentials(ctx context.Context, pc *v1beta1.ProviderConfig, awsCfg *aws.Config) (aws.Credentials, error) {
func (c *AWSCredentialsProviderCache) RetrieveCredentials(ctx context.Context, pc *v1beta1.ProviderConfig, region string, awsCredCache *aws.CredentialCache) (aws.Credentials, error) {

// NewAWSCredentialsProviderCache returns a new empty *AWSCredentialsProviderCache with the default GetAWSConfig method.
func NewAWSCredentialsProviderCache(opts ...AWSCredentialsProviderCacheOption) *AWSCredentialsProviderCache {
// zl := zap.New(zap.UseDevMode(false))
logr := logging.NewLogrLogger(zap.New(zap.UseDevMode(false)).WithName("provider-aws-credentials-cache"))

Looks like we will not be able to configure the default logger in development (debug) mode and we do not honor the -d command-line option of the provider. I suggest we configure a noop logger here and pass the root logger down to the cache manager so that we can use a child logger of it.
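
One possible shape for this, sketched with the functional options the constructor already accepts. Everything here is illustrative: the `Logger` interface is a reduced, hypothetical stand-in for the logger type used by the provider, and `WithLogger`, `credsCache`, `memoryLogger` and `demo` are assumed names.

```go
package main

import "fmt"

// Logger is a reduced, hypothetical logging interface.
type Logger interface {
	Debug(msg string, keysAndValues ...any)
	WithValues(keysAndValues ...any) Logger
}

// nopLogger discards everything. It is the default when no logger is
// passed in.
type nopLogger struct{}

func (nopLogger) Debug(string, ...any)       {}
func (n nopLogger) WithValues(...any) Logger { return n }

// memoryLogger records messages so that the example can inspect them.
type memoryLogger struct {
	prefix string
	lines  *[]string
}

func (m memoryLogger) Debug(msg string, _ ...any) {
	*m.lines = append(*m.lines, m.prefix+msg)
}

func (m memoryLogger) WithValues(kv ...any) Logger {
	return memoryLogger{prefix: m.prefix + fmt.Sprintf("%v ", kv), lines: m.lines}
}

type credsCache struct {
	log Logger
}

// Option configures a credsCache.
type Option func(*credsCache)

// WithLogger makes the cache log through a child of the given root
// logger, so the root logger's debug setting is honored.
func WithLogger(root Logger) Option {
	return func(c *credsCache) {
		c.log = root.WithValues("component", "credentials-cache")
	}
}

func newCredsCache(opts ...Option) *credsCache {
	c := &credsCache{log: nopLogger{}}
	for _, o := range opts {
		o(c)
	}
	return c
}

func (c *credsCache) lookup(key string) {
	c.log.Debug("cache miss", "key", key)
}

// demo performs one lookup and returns what the root logger recorded.
func demo(withRootLogger bool) []string {
	var lines []string
	var opts []Option
	if withRootLogger {
		opts = append(opts, WithLogger(memoryLogger{lines: &lines}))
	}
	newCredsCache(opts...).lookup("example-key")
	return lines
}

func main() {
	fmt.Println(demo(false))
	fmt.Println(demo(true))
}
```

Without the option the cache stays silent, and with it the messages flow through the root logger, whatever mode that logger was configured in.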

)

// GlobalAWSCredentialsProviderCache is a global AWS CredentialsProvider cache to be used by all controllers.
var GlobalAWSCredentialsProviderCache = NewAWSCredentialsProviderCache()

We may consider getting rid of this global cache variable by initializing a cache in clients.SelectTerraformSetup.

@ulucinar
Collaborator

/test-examples="examples/iam/v1beta1/role.yaml"

@ulucinar ulucinar force-pushed the aws-credentials-cache branch from a09b3cd to 3e748f1 Compare March 26, 2024 16:47
if err != nil {
return aws.Credentials{}, errors.Wrap(err, "cannot calculate the hash for the credentials file")
}
cacheKeyParams = append(cacheKeyParams, authKeyIRSA, tokenHash, os.Getenv("AWS_WEB_IDENTITY_TOKEN_FILE"), os.Getenv("AWS_ROLE_ARN"))

The cacheKeyParams already contains the credential source name (pc.Spec.Credentials.Source), so no need to append it once more:

Suggested change
cacheKeyParams = append(cacheKeyParams, authKeyIRSA, tokenHash, os.Getenv("AWS_WEB_IDENTITY_TOKEN_FILE"), os.Getenv("AWS_ROLE_ARN"))
cacheKeyParams = append(cacheKeyParams, tokenHash, os.Getenv("AWS_WEB_IDENTITY_TOKEN_FILE"), os.Getenv("AWS_ROLE_ARN"))
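
For illustration, here is a self-contained sketch of deriving such a key from the token file contents and the other IRSA parameters. It is not the code in creds_cache.go: the helper names `hashTokenFile`, `irsaCacheKey` and `keyForToken` are hypothetical, and both the use of SHA-256 and the way the parameters are joined are assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"strings"
)

// hashTokenFile returns a hex-encoded SHA-256 digest of the token file
// contents, so that a rotated token yields a different cache key.
func hashTokenFile(path string) (string, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("cannot read the token file: %w", err)
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// irsaCacheKey combines the credential source with the IRSA parameters
// into one key. The NUL separator avoids ambiguous concatenations.
func irsaCacheKey(source, tokenHash, tokenFile, roleARN string) string {
	params := []string{source, tokenHash, tokenFile, roleARN}
	sum := sha256.Sum256([]byte(strings.Join(params, "\x00")))
	return hex.EncodeToString(sum[:])
}

// keyForToken writes content to a temporary token file and returns the
// cache key for it. The token path and role ARN used in the key are
// fixed example values.
func keyForToken(content string) string {
	f, err := os.CreateTemp("", "irsa-token-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	if _, err := f.WriteString(content); err != nil {
		panic(err)
	}
	if err := f.Close(); err != nil {
		panic(err)
	}
	tokenHash, err := hashTokenFile(f.Name())
	if err != nil {
		panic(err)
	}
	return irsaCacheKey("IRSA", tokenHash,
		"/example/path/to/token", "arn:aws:iam::123456789012:role/example")
}

func main() {
	fmt.Println("same token, same key:", keyForToken("token-a") == keyForToken("token-a"))
	fmt.Println("rotated token, new key:", keyForToken("token-a") != keyForToken("token-b"))
}
```

Because the token hash is part of the key, a rotated OIDC token maps to a new L2 entry instead of reusing a provider built for the old token.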

@ulucinar ulucinar force-pushed the aws-credentials-cache branch 2 times, most recently from 3af8512 to b61758b Compare March 27, 2024 11:39
@ulucinar ulucinar changed the title cache AWS Config's CredentialsProvider to reduce STS calls Cache AWS Config's CredentialsProvider to reduce STS calls Mar 27, 2024
@ulucinar ulucinar marked this pull request as ready for review March 28, 2024 10:38
@ulucinar
Collaborator

/test-examples="examples/sns/v1beta1/topic.yaml"

@ulucinar ulucinar force-pushed the aws-credentials-cache branch from 0fc27f1 to 18856ac Compare March 28, 2024 13:21
@sergenyalcin
Collaborator

/test-examples="examples/sns/v1beta1/topic.yaml"

erhancagirici and others added 7 commits March 28, 2024 17:09
Signed-off-by: Erhan Cagirici <erhan@upbound.io>
- Use an aws.CredentialCache in the cache manager,
  which is known to be thread-safe.
- Break comments in creds_cache.go at line 80.

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
- Remove the global variable for the cache manager

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
the credential cache key.

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
…cache misses

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
…thentication

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
@ulucinar ulucinar force-pushed the aws-credentials-cache branch from 18856ac to 0fbbf02 Compare March 28, 2024 14:17
@ulucinar
Collaborator

/test-examples="examples/sns/v1beta1/topic.yaml"

Collaborator

@sergenyalcin sergenyalcin left a comment

Thanks @erhancagirici and @ulucinar LGTM!

@ulucinar
Collaborator

Thanks @erhancagirici, @sergenyalcin, lgtm.

@ulucinar ulucinar merged commit 80644da into crossplane-contrib:main Mar 28, 2024
12 checks passed
Successfully merging this pull request may close these issues.

Excessive calls to AssumeRoleWithWebIdentity w/ IRSA
3 participants