Get credentials from AWS ECR when needed #174

squaremo · 2021-09-30T10:41:23Z

A rebase of #147 (to avoid force pushing over a main branch).

As discussed in #139

If the flag --user-aws-ecr is set, the controller will log in to AWS
Elastic Container Registry by fetching a temporary token from AWS
(when a given image is hosted in ECR).

TODOs carried over (in a comment, and based on Flux v1's implementation):

Caching of tokens (one per account/region pair), this fetches a fresh token every time
- handling of expiry

The quota for getting an auth token is 500/second/region: https://docs.aws.amazon.com/general/latest/gr/ecr.html. Since image scanning tends to be done on periods of O(minutes), you would need more than 500 * 60 images to scan (ImageRepository objects) before it was not viable to get an auth token for each scan. You might have that many if you run automation in lots of clusters in the same region, under the same account. But in my estimation, not bothering with caching is the 80/20 (80% of the value with 20% of the complexity, in this case) solution.

Expiry only needs to be accounted for if tokens are cached.

Back-Off in case of errors

Yes, retrying after failures is an easy way to burn through your API quota (see above). As it happens though, if an image is detected as being from ECR, and the flag for ECR auto-login was set, then failing to get a token will error out of .Reconcile(), and the controller-runtime machinery will back off.

Possibly: special behaviour for non-global partitions (China, GovCloud)

I'm not aware of any special treatment needed for these -- perhaps if you wanted to scan an image in ECR for some non-global partition, from another partition. But let's cross that bridge when we get to it.

squaremo · 2021-09-30T10:42:13Z

I would love to see tests for this, but mocking AWS services is no small task. Perhaps something like https://github.com/localstack/localstack could help.

main.go

controllers/imagerepository_controller.go

stefanprodan · 2021-09-30T11:25:47Z

@squaremo does this work with IRSA or with node IAM bindings or with both?

squaremo · 2021-09-30T12:27:01Z

does this work with IRSA or with node IAM bindings or with both?

From the explanation in #139 (comment) I would expect it works with the node IAM bindings that EKS does automatically, as well as IRSA (IAM roles for service accounts) and possibly other mechanisms, since "one can grant additional IAM permissions via a multitude of mechanics (IAM node role, IAM service account, or even configuring access keys manually via environment variables), and all official AWS SDKs will automatically try each until it gets a valid identity". @MartinEmrich Are you able to confirm that this is the correct expectation?

MartinEmrich · 2021-09-30T12:51:27Z

@squaremo @stefanprodan

It works with the IAM instance profile that each EKS node has to have.
I guess AWS puts the ecr credentials helper (https://github.com/awslabs/amazon-ecr-credential-helper) in their provided AMIs, or they run a patched kubelet which regularly gets an ECR login token.
See also: https://docs.aws.amazon.com/AmazonECR/latest/userguide/ECR_on_EKS.html
The authenticated entity here is the Node/Kubelet itself, not the user workload.

Container image pulling should have no relationship to IAM service accounts/IRSA (after all, service accounts are assigned to pods, which run an already pulled image).

IRSA (or, putting access keys somewhere in your workloads, which is always bad and unneccesary), operates within your workloads (Pods), so are on a completely different level.

(Disclaimer: I a just a dude with a google, I have no hidden insights in how AWS does stuff internally.)

squaremo · 2021-09-30T14:31:54Z

Regarding the name of the flag: I don't really have a strong opinion on use vs enable vs with; consistency can guide us there. But I would like it to be precise and descriptive, so I would go for --use-aws-iam-for-ecr or --aws-use-iam-for-ecr (I'm undecided on how I would want the flags to sort, whether by aws or by enable or use; but again, consistency should be the guide).

MartinEmrich · 2021-10-04T07:36:35Z

As for the flag, I would like to urge you just one final time to make it enabled by the default, to follow the Principle Of Least Surprise. Again, this feature has no security implications, drawbacks or unexpected behaviour. (IMHO, even a flag to disable it makes no sense, as it does not protect anything from anything)

squaremo · 2021-10-05T10:16:43Z

This is my understanding:

Flux v1 uses ecrClient.GetAuthorizationToken. This will work if the pod has permissions to read ECR in the account given.
- A controller pod in EKS will generally have read permissions to ECR, granted through the node IAM role (mysteriously, the mechanism whereby pods get the node permissions is not mentioned there, but it is implied elsewhere)
- Another possible mechanism for granting permissions in EKS is using IAM Roles for Service Accounts (IRSA) to give the controller service account the appropriate IAM role; it's possible to block the node permissions being granted to pods, in which case this would become necessary.
- If you create an EKS cluster with eksctl or by following the getting started guide then the cluster nodes are given permission to read ECR
Flux v1 will only use this facility if you specifically allow it to, or if it detects (via the metadata service) that it's running in EC2 (which has its own problems).
At present, Flux v2 does not try to get authentication credentials for ECR.

This PR unconditionally asks ECR for credentials (given the enable flag, or if it were always enabled). That has the following effect:

Controller environment	Effect
EKS	By default, all `ImageRepository` objects can read ECR for the account under which EKS cluster is run
AWS in general	All `ImageRepository` objects can read ECR according to assigned IAM permissions
Outside EC2	For ECR images, `ecr.GetAuthorizationToken` will fail and the `ImageRepository` object will be marked unready (whether or not another secret was supplied)

So I think there are two reasons to guard this feature with a flag:

if you are not running in AWS, but you are using ECR for some reason, you don't want it to fail unconditionally (this can be fixed in the code by only trying ecr.GetAuthorizationToken when there's no secret referenced)
Usually, every pod can read everything in (the cluster account in) ECR, and having ImageRepository objects work differently would be strange; however, it's possible to remove that global permission for pods, so it ought to be possible to remove it for ImageRepository objects.

stefanprodan · 2021-10-05T10:23:38Z

if you are not running in AWS, but you are using ECR for some reason, you don't want it to fail unconditionally (this can be fixed in the code by only trying ecr.GetAuthorizationToken when there's no secret referenced)

I think this is the way to go, especially since many orgs don't use the same account for ECR and EKS. We can't assume that credentials present on nodes are for the registry that hosts the app images.

stefanprodan

Can we add an example to the API docs and explain how people can enable auto login of ECR?

squaremo · 2021-10-07T14:55:37Z

Can we add an example to the API docs and explain how people can enable auto login of ECR?

I was wondering about that. It's not part of the API, but there's no "running the controller" docs kept in the individual repo. I think it would go somewhere in the docs website.

squaremo · 2021-10-07T15:11:28Z

Can we add an example to the API docs and explain how people can enable auto login of ECR?

I was wondering about that. It's not part of the API, but there's no "running the controller" docs kept in the individual repo. I think it would go somewhere in the docs website.

Having said that, it can be mentioned in the API guide somewhere around the SecretRef explanation.

As discussed in #139 If the flag `--user-aws-ecr` is set, the controller will log in to AWS Elastic Container Registry by fetching a temporary token from AWS (when a given image is hosted in ECR). Signed-off-by: Martin Emrich <martin@martinemrich.me>

- rename to parseAwsImage, since it's not a URL - make return value include a bool rather than an error, since it's not an error for the image to not be ECR - return values explicitly rather than assigning them, since that can be prone to shadowing mistakes Signed-off-by: Michael Bridgen <michael@weave.works>

Checking for ECR whether or not there's a secret referenced means the reconciliation can fail despite there being a valid credential. If you are not running in EKS (or otherwise with AWS IAM permissions), but using ECR, you will still want to be able to supply a secret. Signed-off-by: Michael Bridgen <michael@weave.works>

This is my attempt at a descriptive flag for this feature. It mentions AWS and ECR, and suggests automatically getting credentials. It starts with `aws` so it will sort alongside any other AWS-specific flags that come along later. Signed-off-by: Michael Bridgen <michael@weave.works>

The TODO items for scanning ECR were based on Flux v1's ECR-specific code for image scanning; but it turns out they are not necessary for a viable implementation. I have removed the TODOs, and given an explanation why it's fit for purpose as it is. Signed-off-by: Michael Bridgen <michael@weave.works>

Signed-off-by: Michael Bridgen <michael@weave.works>

main.go

stefanprodan

LGTM

Thanks @squaremo ☕

PS. Maybe we should create issues for Azure and GCP auto login and point to this PR as an example for future contributors.

squaremo · 2021-10-14T10:56:49Z

Maybe we should create issues for Azure and GCP auto login and point to this PR as an example for future contributors.

Good thinking:

There's also #11 which is something like an umbrella issue for these.

jtnz · 2021-10-21T04:00:35Z

Hi @squaremo, I just tried this in Production today and it works. Thanks! Just one thing to note, I believe that you only have to log into a registry (account + region combination) once. This implementation appears to be doing if for every repository within the same registry, which will bulk out the logs and hit the API quite a lot when we start scanning our 100s of repos every minute or so. Perhaps some naive approach of working out all registries in the loop that will need logging into then just doing them once (in our use case that's 1 login vs. n logins for n repos)?

squaremo · 2021-10-21T10:32:46Z

I just tried this in Production today and it works.

🎉 Thank you for guinea-pigging this feature!

I believe that you only have to log into a registry (account + region combination) once. This implementation appears to be doing if for every repository within the same registry, which will bulk out the logs and hit the API quite a lot when we start scanning our 100s of repos every minute or so.

Yes, it's the simplest possible implementation to start with. I did a quick calculation to check that it would be well within the quota for requests for most users (you can check my assumption here). But, I do understand being nervous about quotas, and there is certainly room to reduce the number of requests required. Flux v1 caches the tokens it gets, and since this is a flag that applies to all image scans, the implementation here could do the same.

…d): elaborate on how to enable autologin for ECR fluxcd/flux#1619 fluxcd/flux2#107 fluxcd#11 fluxcd#139 fluxcd#147 fluxcd#174 fluxcd#179 fluxcd#180 fluxcd/website#601 fluxcd#193 fluxcd/website#611 https://fluxcd.io/docs/guides/image-update/#aws-elastic-container-registry Signed-off-by: lloydchang <lloydchang@gmail.com>

…d): elaborate on how to enable autologin for ECR fluxcd/flux#1619 fluxcd/flux2#107 fluxcd#11 fluxcd#139 fluxcd#147 fluxcd#174 fluxcd#179 fluxcd#180 fluxcd/website#601 fluxcd#193 fluxcd/website#611 https://fluxcd.io/docs/guides/image-update/#aws-elastic-container-registry Signed-off-by: lloydchang <lloydchang@gmail.com> Signed-off-by: Kingdon Barrett <kingdon@weave.works>

stefanprodan reviewed Sep 30, 2021

View reviewed changes

main.go Outdated Show resolved Hide resolved

controllers/imagerepository_controller.go Outdated Show resolved Hide resolved

squaremo force-pushed the eks-iam branch 2 times, most recently from 879ce9f to c90d8c5 Compare October 6, 2021 13:46

squaremo requested a review from stefanprodan October 7, 2021 14:27

stefanprodan reviewed Oct 7, 2021

View reviewed changes

MartinEmrich and others added 6 commits October 14, 2021 10:00

Mention ECR/EKS-specific flag in API guide

4760575

Signed-off-by: Michael Bridgen <michael@weave.works>

squaremo force-pushed the eks-iam branch from 2091215 to 4760575 Compare October 14, 2021 09:00

stefanprodan reviewed Oct 14, 2021

View reviewed changes

main.go Show resolved Hide resolved

stefanprodan approved these changes Oct 14, 2021

View reviewed changes

This was referenced Oct 14, 2021

GKE/GCR "autologin" #179

Closed

Azure/ACR "autologin" #180

Closed

squaremo merged commit 0db137f into main Oct 14, 2021

squaremo deleted the eks-iam branch October 14, 2021 11:17

squaremo mentioned this pull request Oct 14, 2021

Log in to AWS ECR automatically using the cluster-provided credentials. #147

Closed

MartinEmrich mentioned this pull request Oct 19, 2021

Authentication failure with ECR #139

Closed

kingdonb mentioned this pull request Oct 31, 2021

Add note about AWS Autologin to Image Update guide fluxcd/website#611

Merged

lloydchang mentioned this pull request Oct 31, 2021

docs(image-reflector-controller/docs/spec/v1beta1/imagerepositories.md): elaborate on how to enable autologin for ECR kingdonb/image-reflector-controller#1

Merged

darkowlzz mentioned this pull request Sep 28, 2022

Add regex to support ECR as OCI Helm registry fluxcd/pkg#362

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Get credentials from AWS ECR when needed #174

Get credentials from AWS ECR when needed #174

squaremo commented Sep 30, 2021 •

edited

Loading

squaremo commented Sep 30, 2021

stefanprodan commented Sep 30, 2021

squaremo commented Sep 30, 2021

MartinEmrich commented Sep 30, 2021

squaremo commented Sep 30, 2021

MartinEmrich commented Oct 4, 2021

squaremo commented Oct 5, 2021

stefanprodan commented Oct 5, 2021 •

edited

Loading

stefanprodan left a comment

squaremo commented Oct 7, 2021

squaremo commented Oct 7, 2021

stefanprodan left a comment

squaremo commented Oct 14, 2021

jtnz commented Oct 21, 2021

squaremo commented Oct 21, 2021 •

edited

Loading

Get credentials from AWS ECR when needed #174

Get credentials from AWS ECR when needed #174

Conversation

squaremo commented Sep 30, 2021 • edited Loading

squaremo commented Sep 30, 2021

stefanprodan commented Sep 30, 2021

squaremo commented Sep 30, 2021

MartinEmrich commented Sep 30, 2021

squaremo commented Sep 30, 2021

MartinEmrich commented Oct 4, 2021

squaremo commented Oct 5, 2021

stefanprodan commented Oct 5, 2021 • edited Loading

stefanprodan left a comment

Choose a reason for hiding this comment

squaremo commented Oct 7, 2021

squaremo commented Oct 7, 2021

stefanprodan left a comment

Choose a reason for hiding this comment

squaremo commented Oct 14, 2021

jtnz commented Oct 21, 2021

squaremo commented Oct 21, 2021 • edited Loading

squaremo commented Sep 30, 2021 •

edited

Loading

stefanprodan commented Oct 5, 2021 •

edited

Loading

squaremo commented Oct 21, 2021 •

edited

Loading