-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache AWS Config's CredentialsProvider to reduce STS calls #1235
Cache AWS Config's CredentialsProvider to reduce STS calls #1235
Conversation
607c80f
to
81321c1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @erhancagirici, left some comments to record the suggested changes I'll push as a separate commit. I've also broken the comment lines at (most) 80 chars, the convention in Crossplane repositories.
Let's also add the relevant unit tests.
internal/clients/creds_cache.go
Outdated
} | ||
|
||
type awsCredentialsProviderCacheEntry struct { | ||
*aws.Config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had better remove the *aws.Config
to be on the safe side for concurrency as discussed from the cache.
internal/clients/creds_cache.go
Outdated
// since this is a hot-path in the execution, do not always update | ||
// the last access times, it is fine to evict the LRU entry on a less | ||
// granular precision. | ||
if time.Since(cacheEntry.AccessedAt) > 10*time.Minute { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be on the thread-safe side, we had better read the last access time in the critical section above as the cache entry is a pointer.
internal/clients/creds_cache.go
Outdated
cacheEntry.AccessedAt = time.Now() | ||
c.mu.Unlock() | ||
} | ||
return cacheEntry.credProvider.Retrieve(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we assume the aws.CredentialsProvider
is thread-safe, which may not be true. Either we call aws.CredentialsProvider.Retrieve
in a critical section, or we can also make sure that the aws.CredentialProvider
implementation is properly synchronized, e.g., make sure we use an aws.CredentialCache
, which is properly synchronized, by enforcing the field type.
internal/clients/creds_cache.go
Outdated
AccessedAt time.Time | ||
} | ||
|
||
func (c *AWSCredentialsProviderCache) RetrieveCredentials(ctx context.Context, pc *v1beta1.ProviderConfig, awsCfg *aws.Config) (aws.Credentials, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's hide the *aws.Config
here as we just need the AWS credential cache & region:
func (c *AWSCredentialsProviderCache) RetrieveCredentials(ctx context.Context, pc *v1beta1.ProviderConfig, awsCfg *aws.Config) (aws.Credentials, error) { | |
func (c *AWSCredentialsProviderCache) RetrieveCredentials(ctx context.Context, pc *v1beta1.ProviderConfig, region string, awsCredCache *aws.CredentialCache) (aws.Credentials, error) { |
internal/clients/creds_cache.go
Outdated
// NewAWSCredentialsProviderCache returns a new empty *AWSCredentialsProviderCache with the default GetAWSConfig method. | ||
func NewAWSCredentialsProviderCache(opts ...AWSCredentialsProviderCacheOption) *AWSCredentialsProviderCache { | ||
// zl := zap.New(zap.UseDevMode(false)) | ||
logr := logging.NewLogrLogger(zap.New(zap.UseDevMode(false)).WithName("provider-aws-credentials-cache")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we will not be able to configure the default logger in development (debug) mode and we do not honor the -d
command-line option of the provider. I will suggest we configure a noop logger here and pass the root logger down to the cache manager so that we can use a child logger of it.
internal/clients/creds_cache.go
Outdated
) | ||
|
||
// GlobalAWSCredentialsProviderCache is a global AWS CredentialsProvider cache to be used by all controllers. | ||
var GlobalAWSCredentialsProviderCache = NewAWSCredentialsProviderCache() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may consider getting rid of this global cache variable by initializing a cache in clients.SelectTerraformSetup
.
/test-examples="examples/iam/v1beta1/role.yaml" |
a09b3cd
to
3e748f1
Compare
internal/clients/creds_cache.go
Outdated
if err != nil { | ||
return aws.Credentials{}, errors.Wrap(err, "cannot calculate the hash for the credentials file") | ||
} | ||
cacheKeyParams = append(cacheKeyParams, authKeyIRSA, tokenHash, os.Getenv("AWS_WEB_IDENTITY_TOKEN_FILE"), os.Getenv("AWS_ROLE_ARN")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cacheKeyParams
already contains the credential source name (pc.Spec.Credentials.Source
), so no need to append it once more:
cacheKeyParams = append(cacheKeyParams, authKeyIRSA, tokenHash, os.Getenv("AWS_WEB_IDENTITY_TOKEN_FILE"), os.Getenv("AWS_ROLE_ARN")) | |
cacheKeyParams = append(cacheKeyParams, tokenHash, os.Getenv("AWS_WEB_IDENTITY_TOKEN_FILE"), os.Getenv("AWS_ROLE_ARN")) |
3af8512
to
b61758b
Compare
/test-examples="examples/sns/v1beta1/topic.yaml" |
0fc27f1
to
18856ac
Compare
/test-examples="examples/sns/v1beta1/topic.yaml" |
Signed-off-by: Erhan Cagirici <erhan@upbound.io>
- Use an aws.CredentialCache in the cache manager, which is known to be thread-safe. - Break comments in creds_cache.go at line 80. Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
- Remove the global variable for the cache manager Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
the credential cache key. Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
…cache misses Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
…thentication Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
18856ac
to
0fbbf02
Compare
/test-examples="examples/sns/v1beta1/topic.yaml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @erhancagirici and @ulucinar LGTM!
Thanks @erhancagirici, @sergenyalcin, lgtm. |
Description of your changes
Fixes #997
Introduces a global credential cache to reduce AWS STS calls.
Only IRSA credentials are cached.
The new provider cache is a two-layer hierarchical cache. L1 cache is an AWS SDK Go
aws.CredentialsCache
and the L2 cache is for caching theaws.CredentialsCache
s. The cache key for the L2 cache is derived from the well known IRSA authentication parameters as well as the contents of the OIDC ID token file. The new cache also caches the AWS account ID for a given IRSA configuration and replaces the identity cache for IRSA configurations.Background:
I have:
make reviewable
to ensure this PR is ready for review.backport release-x.y
labels to auto-backport this PR if necessary.How has this code been tested
Tested manually with provider configs with:
index.docker.io/ulucinar/provider-aws-ec2:v1.3.0-0fbbf02b3656352c729396851646d12ef80a1496
forUpbound
authentication on Upbound Cloud.Secret
) has succeeded here: https://github.com/crossplane-contrib/provider-upjet-aws/actions/runs/8468707015Two experiments were done using 4 managed resources (MRs) with a plain IRSA configuration and an IRSA configuration with an assume role chain of length two with the following
ProviderConfig.aws
:During these experiments, we forced frequent reconciliations of the MRs (every 3 seconds) in constant update loops and
we also observed the AWS CloudTrail event history for an extended period of time. Here are the relevant events from CloudTrail:
As the logs show, for these 4 MRs, at most only one
sts.AssumeRoleWithWebIdentity
operation per an hour has been recorded, showing the effectiveness of the credential cache for IRSA authentication. Please note that the temporary credentials issued by thests.AssumeRoleWithWebIdentity
are valid for one hour. It's the L1 cache that discards these temporary credentials after one hour and renews them. During this extended period, because the L2 cache item is not discarded, only onests.GetCallerIdentity
operation has been observed.I also did a test for the L2 cache by invaliding the cache entry prematurely. The following event logs show how this results in a premature call to
sts.AssumeRoleWithWebIdentity
:After the temporary credentials were fetched at
March 28, 2024, 16:36:47
, we would not expect them to be refreshed before an hour but causing the L2 cached entry go stale, there's been a premature call atMarch 28, 2024, 16:46:22
.Also tested the PR on top of @mergenci's API Call Counters PR. Under an update loop, the reported API call counters for
sts.AssumeRoleWithWebIdentity
&sts.GetCallerIdentity
are not increasing: