Skip to content
This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Native ECR auth #1619

Merged
merged 10 commits into from
Jan 8, 2019
Merged

Native ECR auth #1619

merged 10 commits into from
Jan 8, 2019

Conversation

squaremo
Copy link
Member

@squaremo squaremo commented Dec 27, 2018

Taking inspiration (and code) from #1455, this PR builds support for fetching authorization tokens into the registry credentials code.

The design is simple: it's a wrapper for ImageCreds which keeps an authentication token, refreshing when it needs to via the AWS API. This is pretty similar to ImageCredsWithDefaults, and is composed with that in fluxd's main.go. The arguments --registry-ecr-region, --registry-ecr-include-id and --registry-ecr-exclude-id control which images it will scan:

  • region gives the region(s) that will be scanned; if empty, the cluster region will be assumed
  • include-id gives the AWS accounts (registries) to scan; if empty, any image in the allowed regions will be scanned
  • exclude-id excludes an AWS account (registry) from those to scan; it defaults to the account used for EKS "system" images (e.g., coredns), so images from there won't be scanned.

None of the arguments are necessary; the default will be "scan images from any registry in the cluster region, except the EKS system images".

If fluxd is not running on EC2 (or hasn't been rigged so that it can authenticate as an AWS user, to access ECR), the wrapper will fail to be initialised, a message will be logged, and image scanning will proceed as before.

Fixes #539.

(EDIT: update to reflect latest changes)

@squaremo squaremo changed the title [WIP] Native ECR auth Native ECR auth Dec 31, 2018
@squaremo
Copy link
Member Author

Image pushed to quay.io/squaremo/flux:issue-539-native-ecr-auth-b76ac67 should anyone wish to try this out ...

site/daemon.md Outdated Show resolved Hide resolved
cmd/fluxd/main.go Outdated Show resolved Hide resolved
@errordeveloper
Copy link
Contributor

Please don't consider my comments as blockers for the PR, those are just a few thoughts on how we can improve things in the future...

@squaremo
Copy link
Member Author

squaremo commented Jan 2, 2019

Please don't consider my comments as blockers for the PR, those are just a few thoughts on how we can improve things in the future...

The question is, if we released this as is, would it limit our ability to improve it? Probably not -- we can always deprecate those arguments, or reinterpret them to mean "just auth this region / these account IDs". But it would be good to get it right the first time, too :-)

@squaremo
Copy link
Member Author

squaremo commented Jan 2, 2019

Two more points:

  • usually we don't know all the registry URLs in play; so we wouldn't be able to authenticate against all of them at once unless we're told them explicitly. So perhaps --registry-id could mean "authenticate for these and only these ECR registries".
  • @errordeveloper tells me that using a registry in another region would incur data costs (of course -- everything costs, on AWS). So although, in principle, we could authenticate for whichever region is in the registry URL, it might be desirable to limit it to the region in which the cluster is running, by default.

With those in mind, I propose that I change the arguments:

  • --aws-region=[] -- if empty, only authenticate against the cluster region; otherwise, authenticate against all listed regions (if they come up in a registry URL)
  • --aws-account-id=[] -- if empty, attempt to authenticate against any account ID that is used in a registry URL; otherwise only authenticate for those given.

@squaremo
Copy link
Member Author

squaremo commented Jan 3, 2019

Image built from the most recent commit at quay.io/squaremo/flux:issue-539-native-ecr-auth-dd7e7cd

@runningman84
Copy link

@squaremo I tried the new image but I still get logs like this:
flux/flux-79477cfbbf-t5xpx[flux]: ts=2019-01-03T16:04:36.487877004Z caller=warming.go:192 component=warmer canonical_name=602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns auth="{map[602401143452.dkr.ecr.eu-central-1.amazonaws.com:<registry creds for AWS@602401143452.dkr.ecr.eu-central-1.amazonaws.com, from AWS API>]}" err="requesting tags: denied: User: arn:aws:sts::1234567890:assumed-role/eks-preprod20190102102137043800000008/i-06edfb64ff3d5add4 is not authorized to perform: ecr:ListImages on resource: arn:aws:ecr:eu-central-1:602401143452:repository/eks/coredns"
Is this a known problem for eks internal repos?

@squaremo
Copy link
Member Author

squaremo commented Jan 3, 2019

@squaremo I tried the new image

Thank you!

... but I still get logs like this:

Yep, this is the problem I was referring to in

EKS uses images from some other ECR registry for some infrastructural services, which any old IAM user doesn't have full access to

The way to shut it up is to mention your own account ID in the argument --aws-account-id, which will mean it restricts scanning to just your own registry and ignores images elsewhere in ECR.

Having said that, it looks like it's the same account ID causing log noise in eu-central-1 as it is in eu-west-1, so I wonder if I could make the default to ignore that account ID ...

@runningman84
Copy link

Yes please ignore it by default this seams to be an aws account of the eks team.

if len(config.Regions) == 0 {
// this forces the AWS SDK to load config, so we can get the default region
sess := session.Must(session.NewSessionWithOptions(session.Options{
SharedConfigState: session.SharedConfigEnable,
Copy link
Contributor

@errordeveloper errordeveloper Jan 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good start, hopefully it will be sufficient long term and we won't need to add too many input variables, as it will always run in a container and won't need to support a host of configuration modes that eksctl has to support (environment variables, flags and ~/.aws/config file with profiles and MFA etc).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to leave space for mounting config files (e.g., by attempting to load config before using the ec2 metadata service). You are right that contexts etc. complicate this :-/ We'll see ..

site/faq.md Outdated
- Google Container Registry works this way; Flux will
automatically attempt to use platform-provided credentials when
scanning images in GCR.
- (Amazon) Elastic Container Registry has its own authentication
Copy link
Contributor

@errordeveloper errordeveloper Jan 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would change this to

Amazon Elastic Container Registry (ECR) has its own authentication;
Flux will attempt to scan for images in ECR, if EC2 instance it is running
on has access to ECR.

Which also suggest that we might want to consider to support for ECR outside of EC2, but I don't think kubelet supports that kind of mode yet (IIRC, it was mentioned in the comment in kubelet code linked from here).

As far as EC2 itself is concerned, there is a possibility that user may wish to avoid using instance roles and pass credentials to flux directly (but we should check if kubelet supports that also). The reason someone may wish to do that is security, i.e. for the lack of real IAM role enforcement for pods, someone may prefer to disallow access to ECR for any but the chose process in their cluster. One can use separate sets of nodes to isolate privileged/unprivileged pods, but that won't work in the case, as kubelets will have to fetch images on all nodes. However, it's worse noting that fine-grain IAM roles can be created to avoid the possibility of certain images being fetched on certain nodes.

Just a few thoughts (well, more of a brain dump on ECR and IAM), perhaps we could open an issue for now, if it's worse it.

Copy link
Member Author

@squaremo squaremo Jan 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be possible (though ungainly) to use credentials on the host to grant kubelet access to ECR when not running in EC2. And, I think it'll be possible (with this PR) to make credentials available to flux too, by providing a ~/.aws/credentials.

But I'd expect this to be well off the beaten path. I agree that it's important to set out the caveats early -- I'll try to rephrase.

site/faq.md Outdated
service accounts, platform-provided credentials on GCP or AWS, and
a Docker config file if you mount one into the fluxd container (see
the [command-line usage](./daemon.md)).
- When using images in ECR, from AWS, the IAM account used to run the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rephrase as:

 - When using images in AWS ECR, the EC2 instance role of the node
  fluxd container runs on must have permissions to query the ECR registry
  or registries in question. In turns, this is a requirement for kubelet also,
  so if your cluster can run containers from ECR, Flux should work too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it definitely the EC2 instance role that decides whether you have access? I confess I skipped lightly over the question of how the container gets any IAM permissions at all (and Just Made It Work TM). I'm happy to be more specific about the mechanism.

Copy link
Contributor

@errordeveloper errordeveloper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more comments, hopefully helpful.

This takes the essential workings of
#1455 and creates a helper for
using AWS authorization with ECR (the AWS container registry).
Instead of asking for the AWS region and set of registry IDs (really
account IDs), take these from the registry URLs as they come up.

This means we have to make a token request for each region; so, keep
track of the expiry, per region. And so we don't keep asking when
there's a failure, embargo requests for a region (for ten minutes)
whenever we fail to get a token.

The arguments `--aws-region` and `--aws-registry-id` are now ignored,
since the values are taken from the registry URLs; but, the plumbing
for those values is left in, because the arguments are useful with an
altered meaning of "only these regions/accountIDs". This change will
be made in a subsequent commit.
After the last commit, which takes the region and account ID from the
registry URL, this commit repurposes the arguments --aws-region and
(renamed) --aws-account-id to be "allow lists".

They differ in the default policy: if no account IDs are given, all
accounts IDs are allowed; whereas, if no region is given, the allowed
regions is taken to be the "cluster region".

What is the cluster region? It's the region as detected from
configuration (so it can be overridden), or as given by the EC2
metadata service, as a last resort. The rationale for always
restricting the region is that using images from a different region
will incurs costs, so it's better to encourage people to be deliberate
about it.

 - [ ] does it still work with minikube (or does it fail at creating wrapper)
This removes a duplicate label from the AWS-specific logger.
This introduces the flag `--registry-ecr-exclude-id`, and renames
`--registry-ecr-account-id` to `--registry-ecr-include-id` for
consistency.

Oddly, `--registry-ecr-exclude-id` is most useful when you don't
supply it, or its "include" sibling, since then fluxd will scan
everything except the registry used for EKS system images (which will
always fail, so far as I can determine).
@squaremo
Copy link
Member Author

squaremo commented Jan 7, 2019

Image that works as described in the newly-edited description: quay.io/squaremo/flux:issue-539-native-ecr-auth-1b5a3f2

This build

  • changes the argument names
  • excludes the EKS system images from being scanned, by default

@squaremo
Copy link
Member Author

squaremo commented Jan 7, 2019

@errordeveloper I've tweaked the wording in the FAQ (two places) -- is it any better, do you think?

@squaremo squaremo merged commit 3ca823d into master Jan 8, 2019
@squaremo
Copy link
Member Author

squaremo commented Jan 8, 2019

Many thanks @errordeveloper and @hiddeco !

@squaremo squaremo deleted the issue/539-native-ecr-auth branch January 8, 2019 10:46
@roffe
Copy link
Contributor

roffe commented Jan 8, 2019

awesome sauce!

@runningman84
Copy link

just waiting for a new release...

kingdonb pushed a commit to lloydchang/image-reflector-controller that referenced this pull request Oct 31, 2021
kingdonb pushed a commit to kingdonb/image-reflector-controller that referenced this pull request Oct 31, 2021
kingdonb pushed a commit to kingdonb/image-reflector-controller that referenced this pull request Dec 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants