v2.0.1 Authentication failures with token retrieved via aws_eks_cluster_auth #1131
Comments
Hi, same problem here with Terraform v0.14.5, but different error message:
The configuration is the same as with the previous provider version.
|
Can you try running
Alternatively, running the Kubernetes provider in separate There's also a working EKS example you can compare with your configs. There are some improvements coming soon for the example, since we're working on related authentication issues. |
@dak1n1 I am considering this as a temporary workaround. |
Not sure about the 15mins issue, as we've been using this provider for almost a year now and the token validity has never been a problem. In fact, downgrading the provider to <2.0 works as expected. I'll try force refreshing the token and report back the results. |
Thanks! And about the downgrade fixing this -- that makes sense. Depending on your provider configuration, prior to 2.0, the Kubernetes provider may have actually been reading the |
The KUBECONFIG issue is not present in our environment as we run Terraform in GitLab CI and never use that file to authenticate to clusters from it. |
I tried an apply with a clean state, using exec instead of the token in the Kubernetes provider, on the initial run when the EKS cluster is created. I get the same Using the exec
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.26.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.0.2"
    }
  }
}
provider "aws" {
  region = var.region
}
data "aws_eks_cluster_auth" "cluster_token" {
  name = module.eks.name
}
provider "kubernetes" {
  host                   = module.eks.endpoint
  cluster_ca_certificate = base64decode(module.eks.certificate)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", module.eks.name]
    command     = "aws"
  }
}
The Kubernetes resources are created correctly on a retry of the pipeline, as stated in the comments above, using either the token or the exec method. |
@loungerider Thanks for testing this. I believe the issue in your case has to do with certain parameters passed into the Kubernetes provider which are unknown at the time of the provider initialization. I'm guessing In the data source, the value of
I'm assuming you're using the EKS module here, which has an output that waits for the cluster API to be ready ( I also added a data source to read the cluster's hostname and CA cert data, so it will be able to read the new hostname and certs, if those ever change, such as on the first apply, or during cluster replacement. Although a single apply scenario like this is less reliable than running apply twice, it is possible to do, it just has these gotchas to be aware of. |
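The configuration attached to the comment above did not survive in this thread. A minimal sketch of the approach it describes might look like the following. It assumes the EKS module exposes an output named `cluster_id` that only becomes known once the cluster API is reachable; that output name is an assumption for illustration, not something confirmed by this comment.

```hcl
# Assumption: module.eks.cluster_id is a module output that is only known
# after the cluster API is ready, so this data source is read after that point.
data "aws_eks_cluster" "default" {
  name = module.eks.cluster_id
}

provider "kubernetes" {
  # Hostname and CA data come from the data source, so they are re-read
  # if the cluster is created or replaced.
  host                   = data.aws_eks_cluster.default.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.default.certificate_authority[0].data)

  # Token is obtained through the AWS CLI rather than a stored value.
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", data.aws_eks_cluster.default.name]
  }
}
```

This only orders the reads within a single apply; as noted above, it is still less reliable than keeping the cluster and the Kubernetes resources in separate applies.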
@dak1n1 I'm getting the same errors with the following:
As you can see, the only change I'm attempting is to upgrade EKS from 1.18 to 1.19. Without posting all the code, the relevant portions are:
My module follows the same conventions as the module you mentioned above, except that I'm using the token instead of the exec method. We use Terraform Cloud for our workflow and I don't believe the AWS CLI is installed on those workers. The docs also warn against trying to install extra software on workers, and even if you decide to ignore that advice, doing so is kinda hacky to say the least. So IMO using the AWS CLI to generate creds should not be a solution to this issue. I've tried running this multiple times and always get errors like these:
My first question would be: is the token being stored somewhere in the state? I would assume the data source is refreshed on every run in case something changed (in this case I assume the token would be new with every run), so the 15-minute expiration should only be an issue on initial cluster creation, where the token is created before the cluster. In the case above I would assume that should never happen due to the dependency chain of If the token is refreshed every time, then why am I seeing this error when specifying an upgrade to an already provisioned cluster? The upgrade is not changing the cluster name; it should change in place. The existing cluster should be there, so the token should be created and the provider should be able to read the cluster state and make an appropriate plan. I also find it very curious that I don't see any errors like this related to resources provisioned by the helm provider. I don't know if that's because the errors in the kubernetes provider are ending the plan before it gets to helm, or if there is something different in how Helm is doing things that dodges this issue. I may try downgrading my provider to < 2.0 to see if this works there. If that's the case it's not a hidden |
Did some further digging and we may be barking in the wrong place: hashicorp/terraform-provider-aws#10269 (comment) |
@jw-maynard I'm glad you found that other issue! It sounds like the EKS cluster could be getting replaced rather than updated in-place. Could you do a What I saw in your configuration is what we call a "single apply" scenario (that is, a configuration which contains both the EKS cluster ( This is a known limitation in Terraform core, which I recently saw described well in this comment. It's a problem any time you have a provider that depends on a resource (in this case, the Kubernetes provider is dependent on information from If an underlying Kubernetes cluster is going to be replaced, and you already have Kubernetes resources provisioned using the Kubernetes provider, you'll have to work around this issue by doing a This workaround is only needed in single-apply scenarios where you have the cluster and the Kubernetes resources sharing a single state. In general, it's more reliable to keep the Kubernetes resources in a separate state from the EKS cluster resource (for example, a different workspace in TFC, or a different root module). Two applies will work every time, but a single apply involves some work-arounds, depending on the scenario. |
@dak1n1 It never gets that far because the plan errors, but I know that version upgrades in EKS are an update-in-place scenario for sure. I guess they could have introduced a bug in the aws provider, but I don't think so. I did a lot of digging around in logs at the TRACE level for this plan and found some differences in how a successful plan handles the two data sources compared to how it handles them in a plan where I try to upgrade the version. Unfortunately I'm not familiar enough with the inner workings of TF and its providers to know if this is fixable in the provider or not. I'm happy to share my findings privately with anyone at HashiCorp who's willing to listen. Single apply scenarios seem to be something that a fair number of people would like to be able to do when working with Kubernetes on cloud providers. I can share what I think is the difference in the two runs. The failed one ends up in here for both EKS data sources (I'm just sharing
This appears to be coming from here https://github.com/hashicorp/terraform/blob/618a3edcd13f5231a77a699b7ba2a3fba352b7a3/terraform/eval_read_data_plan.go#L65 which tells me that In a working run where the version is not updated, I don't see the above at all, but I see this:
Then a call to eks/DescribeCluster. This So in the failed state it seems like the data source is not even updating for some reason. Odd, considering the cluster would be updated in place. The fact that there's no read of the data source in the failure when something is changing just makes me feel like there's a logical bug somewhere, maybe in core, but I don't feel knowledgeable enough to articulate it in an issue over there. All that being said, I am aware of the pitfalls with single apply scenarios and this certainly may be one of those issues. The unfortunate part is that, like they do with the EKS module you posted above, there are some things in EKS that require managing resources inside the cluster ( |
@dak1n1 This config worked for me. Thanks!
|
Using |
We faced the same issue when running destroy (introduced in Terraform 0.14). Multiple providers are actually affected. Related issue (which is closed): For example, any provider using the aws_eks_cluster_auth data source will fail on destroy:
The proposed workaround is to run plan or refresh (which may not be the best solution for every team). |
I had a similar problem running an apply from CI/CD.
The apply worked locally because I had the AWS region configured in my AWS credentials, but not in the pipeline. This configuration worked for me:
data "aws_eks_cluster" "default" {
  name = var.cluster_name
}
data "aws_eks_cluster_auth" "default" {
  name = var.cluster_name
}
provider "kubernetes" {
  host                   = data.aws_eks_cluster.default.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.default.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", var.cluster_name, "--region", var.aws_region]
    command     = "aws"
  }
}
|
Correct me if i'm wrong, but that assumes you have |
This is to fix the authentication error, caused by this issue: hashicorp/terraform-provider-kubernetes#1131
We run into this issue with virtually every apply now that we use Atlantis:
This happens whenever the time between step 2 and step 4 is more than 15 minutes. The workaround of calling Is it a limitation of Terraform that this provider cannot refresh the token during |
@jbg Without logs and samples of your configuration, there isn't a lot to go on in your report. The Terraform, provider, and cluster versions involved are also missing. Please help us help you. |
Sorry @alexsomesan I thought that the issue was well understood from earlier discussion in this issue. Is it not the case that a) the token expires after 15 minutes, and b) this provider does not request a new token during the (TF 1.1.5, terraform-provider-kubernetes 2.8.0, k8s 1.21, but the issue has existed for more than a year while always using the latest version of TF and the provider, and through k8s 1.19->1.20->1.21. It just affected us less before we moved to using Atlantis because we usually applied very soon after planning.) |
The provider will only request a new token if you configure it to use the cloud provider's auth plugin, by using the If you use the data source from the AWS provider, that will only refresh once per operation (apply or plan). If your apply takes longer than the token expiration period, then by using the data source you run the risk of using an expired token at some point in your apply run. |
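To make the distinction in the comment above concrete, here is a hedged side-by-side sketch of the two ways of supplying credentials discussed in this thread. The variable names and provider aliases are hypothetical and exist only so that both variants fit in one file; a real configuration would normally pick one.

```hcl
# Hypothetical inputs; names are illustrative and not from this thread.
variable "cluster_name" { type = string }
variable "cluster_endpoint" { type = string }
variable "cluster_ca_certificate" { type = string } # base64-encoded CA data

# Variant A: static token from the AWS provider's data source.
# The token is read once per plan or apply operation and then reused.
data "aws_eks_cluster_auth" "this" {
  name = var.cluster_name
}

provider "kubernetes" {
  alias                  = "static_token"
  host                   = var.cluster_endpoint
  cluster_ca_certificate = base64decode(var.cluster_ca_certificate)
  token                  = data.aws_eks_cluster_auth.this.token
}

# Variant B: exec auth plugin.
# The provider runs the AWS CLI itself to obtain a token, so it does not
# depend on a value captured earlier in the run.
provider "kubernetes" {
  alias                  = "exec_plugin"
  host                   = var.cluster_endpoint
  cluster_ca_certificate = base64decode(var.cluster_ca_certificate)

  exec {
    # v1alpha1 matches the examples in this thread; newer AWS CLI versions
    # may emit a different API version, so check what your CLI produces.
    api_version = "client.authentication.k8s.io/v1alpha1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", var.cluster_name]
  }
}
```

Variant B requires the `aws` binary to be available wherever Terraform runs, which is the trade-off several commenters raise for hosted runners.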
This is not an issue of the apply taking more than 15 minutes. The problem occurs if the gap between plan and apply is more than 15 minutes. Applying a single resource (which takes mere seconds) still demonstrates the issue. It appears that the token is not refreshed (data source is not re-read) at apply time. The |
I would not call the Have a look at the contents of a kubeconfig file produced by the AWS CLI:
➤ aws eks update-kubeconfig --name k8s-dev
Added new context arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev to /Users/alex/.kube/config
➤ kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: https://XXXXXXXXXXXXXXXXXXXXXXXX.gr7.eu-central-1.eks.amazonaws.com
  name: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
contexts:
- context:
    cluster: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
    user: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
  name: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
current-context: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
kind: Config
preferences: {}
users:
- name: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      args:
      - --region
      - eu-central-1
      - eks
      - get-token
      - --cluster-name
      - k8s-dev
      command: aws
      env: null
      interactiveMode: IfAvailable
      provideClusterInfo: false
The same happens on GKE, and for good reason. Most IAM systems advise using short-lived credentials obtained via some sort of dynamic role impersonation. EKS doesn't allow setting the lifespan of the token for the same reason. They want users to adopt role impersonation, which is the least risky way to handle credentials. This really isn't a hack. Back on the topic of Terraform, there is a solid reason why the data source is not refreshed before apply in your scenario. Since Atlantis is supplying a pre-generated plan to the In conclusion, there really isn't any better way of handling these short-lived credentials other than auth plugins. |
Thanks, this is the key insight I was missing: it is indeed not possible for the data source to be refreshed at apply time. It's unfortunate, though, that this means Terraform Cloud users are out of luck. We can build the AWS CLI into our Atlantis image and set up processes for keeping it up to date. That is an inconvenience but not that bad, but on some platforms there is no similar solution that would allow the |
TFC allows one to use custom agents, as docker containers. Should be easy to add the auth plugins to those. |
This issue at the very least should require a review of all of the official documentation, since you cannot actually use the provider in its documented state. |
A related issue to this, is that this provider seems to update the state with the changes that it attempted to apply, as if the apply was successful, even though the authentication failed due to expired credentials. So if you plan a change, and then wait 15 minutes, and then try to apply the plan, you will get an error like "Error: the server has asked for the client to provide credentials". Then if you try to plan again with |
I'm just using local exec to deploy the few Kubernetes resources I want to "manage" with Terraform. At the moment I don't want to split my rather small Terraform state into at least two layers just to be able to use the Kubernetes provider properly with an AWS EKS Kubernetes cluster 💁♀️ |
You can get around this with Kubernetes Service Account Tokens. The code snippet would look something like this:
# create service account
resource "kubernetes_service_account_v1" "terraform_admin" {
  metadata {
    name      = "terraform-admin"
    namespace = "kube-system"
    labels    = local.labels
  }
}
# grant privileges to the service account
module "terraform_admin" {
  source  = "aidanmelen/kubernetes/rbac"
  version = "v0.1.1"
  labels  = local.labels
  cluster_roles = {
    "cluster-admin" = {
      create_cluster_role       = false
      cluster_role_binding_name = "terraform-admin-global"
      cluster_role_binding_subjects = [
        {
          kind = "ServiceAccount"
          name = kubernetes_service_account_v1.terraform_admin.metadata[0].name
        }
      ]
    }
  }
}
# retrieve service account token from secret
data "kubernetes_secret" "terraform_admin" {
  metadata {
    name      = kubernetes_service_account_v1.terraform_admin.metadata[0].name
    namespace = kubernetes_service_account_v1.terraform_admin.metadata[0].namespace
  }
}
# call provider with long-lived service account token
provider "kubernetes" {
  alias                  = "terraform-admin"
  host                   = "https://kubernetes.docker.internal:6443"
  cluster_ca_certificate = data.kubernetes_secret.terraform_admin.data["ca.crt"]
  token                  = data.kubernetes_secret.terraform_admin.data["token"]
}
Please see the authn-authz example from the aidanmelen/kubernetes/rbac module for more information. |
I ran into the same problem in TFC.
provider "aws" {
  assume_role {
    role_arn = var.assume_role_arn
  }
}
I solved this problem by explicitly specifying the IAM role when I get a token, such as:
exec {
  api_version = "client.authentication.k8s.io/v1alpha1"
  command     = "aws"
  args        = ["eks", "get-token", "--cluster-name", module.eks.name, "--role-arn", var.assume_role]
}
Also, you may have to add your AWS region. |
This does work when using Terraform Cloud. It's how we have it working. |
in the general terraform k8s provide, is defined to use |
But the difference here is that you are using the AWS CLI to produce that kubeconfig. Of course it is sensible to also use the aws cli exec pattern to get the token within that kubeconfig. In Terraform, the expectation is that I should be able to utilize Terraform to interact with AWS for all things I need including getting a valid kubernetes auth token. I completely understand the limitations within terraform that prevent this data source from being resolved during apply if a plan state is provided, but maybe a reasonable solution would be to make the token TTL configurable (obviously an ask for the aws provider, not this kubernetes provider) |
Running the code below with the commented out block gives us the same error message as above:
data "aws_eks_cluster_auth" "cluster" {
  name = data.terraform_remote_state.kubernetes.outputs.cluster_name
}
provider "kubernetes" {
  host                   = data.terraform_remote_state.kubernetes.outputs.cluster_endpoint
  cluster_ca_certificate = data.terraform_remote_state.kubernetes.outputs.cluster_ca_certificate
  token                  = data.aws_eks_cluster_auth.cluster.token
}
/*
provider "helm" {
  kubernetes {
    host                   = data.terraform_remote_state.kubernetes.outputs.cluster_endpoint
    cluster_ca_certificate = data.terraform_remote_state.kubernetes.outputs.cluster_ca_certificate
    token                  = data.aws_eks_cluster_auth.cluster.token
  }
}
*/
This seems pretty weird to me, not sure if it's helpful. |
I found a big issue: terraform-aws-modules/terraform-aws-eks#1234. This is my part, and it has been working for 2 years:
|
A few days after I created my EKS cluster, we are facing the same issue.
It seems that the kubernetes and helm providers cannot communicate with my cluster. Here is my configuration:
I've tried running terraform refresh --target module.eks, and it works, but it doesn't seem to help, as I get the same error when I try a refresh, plan, or apply. |
Marking this issue as stale due to inactivity. If this issue receives no comments in the next 30 days it will automatically be closed. If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. This helps our maintainers find and focus on the active issues. Maintainers may also remove the stale label at their discretion. Thank you! |
Terraform Version, Provider Version and Kubernetes Version
Affected Resource(s)
Terraform Configuration Files
Debug Output
Panic Output
Steps to Reproduce
Expected Behavior
What should have happened?
Resources should have been created/modified/deleted.
Actual Behavior
What actually happened?
Important Factoids
No, we're just using EKS.
References