Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWSMachine controller leaks EC2 instances when specifying a Name tag in additionalTags #4629

Closed
mjlshen opened this issue Nov 10, 2023 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@mjlshen
Copy link
Contributor

mjlshen commented Nov 10, 2023

/kind bug

What steps did you take and what happened:
After applying two infrastructure.cluster.x-k8s.io/v1beta2 AWSMachines with the following additionalTags:

spec:
  additionalTags:
    Name: sample-name-tag

124 EC2 instances were created in quick succession for these AWSMachines before .spec.providerID was populated. All of the EC2 instances (leaked and not-leaked) do have the tags Name: sample-name-tag

❯ k get awsmachine                                                  
NAME                                    CLUSTER                            STATE     READY   INSTANCEID                              MACHINE
sample-name-tag-251d8688-cjd7r   ${ID}   running   true    aws:///us-east-2a/i-019c51ee0ded3e51a   sample-name-tag-workers-55b788fd4dx5n4zb-kn2h6
sample-name-tag-251d8688-hwhb4   ${ID}   running   true    aws:///us-east-2a/i-0c9169fb1781d2a8e   sample-name-tag-workers-55b788fd4dx5n4zb-8brbs

What did you expect to happen:
One EC2 instance to be created per AWSMachine, with the tag Name: sample-name-tag

Anything else you would like to add:
I think this occurs because:

  1. We try to find an instance during a normal reconcile loop -
    // Find existing instance
    instance, err := r.findInstance(machineScope, ec2svc)
    if err != nil {
    machineScope.Error(err, "unable to find instance")
    conditions.MarkUnknown(machineScope.AWSMachine, infrav1.InstanceReadyCondition, infrav1.InstanceNotFoundReason, err.Error())
    return ctrl.Result{}, err
    }
  2. We look for the instance with these tags, including the name of the AWSMachine
    input := &ec2.DescribeInstancesInput{
    Filters: []*ec2.Filter{
    filter.EC2.VPC(s.scope.VPC().ID),
    filter.EC2.ClusterOwned(s.scope.Name()),
    filter.EC2.Name(scope.Name()),
    filter.EC2.InstanceStates(ec2.InstanceStateNamePending, ec2.InstanceStateNameRunning),
    },
    }
  3. If instance == nil, then a new EC2 instance is created https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/56c9a39dd834640ee4a027e679ad2a5757098dfd/controllers/awsmachine_controller.go#L517C29-L522
  4. However, this only happens when the instance cannot be found by tags (or an AWS not found/permission error). So the bug is with a change in v1beta2 to allow overwriting the Name tag: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/api/v1beta2/tags.go#L244-L264

vs v1beta1

// Build builds tags including the cluster tag and returns them in map form.
func Build(params BuildParams) Tags {
tags := make(Tags)
for k, v := range params.Additional {
tags[k] = v
}
if params.ClusterName != "" {
tags[ClusterTagKey(params.ClusterName)] = string(params.Lifecycle)
}
if params.Role != nil {
tags[NameAWSClusterAPIRole] = *params.Role
}
if params.Name != nil {
tags["Name"] = *params.Name
}
return tags
}
where the AWSMachine's own .metadata.name had precedence

So I think we need to either not allow overwriting the Name tag or we need to be smarter about how we find the instance based off of tags. Even if we update the filtering logic to account for the additional Name tag, we don't want to find an instance that's already been claimed by another AWSMachine.

Environment:

  • Cluster-api-provider-aws version: v2.0.2
  • Kubernetes version: (use kubectl version):
Client Version: v1.28.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.6+f67aeb3
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority labels Nov 10, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If CAPA/CAPI contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 10, 2023
@mjlshen
Copy link
Contributor Author

mjlshen commented Nov 13, 2023

As for why findInstance even needs to be called - normally providerID is populated when an EC2 instance has been created at least once, however I noticed that this bug occurred in my case because CAPA did not have the ability to perform ec2:CreateTags on ENIs

// Once all the network interfaces attached to the specific instanace are found, the similar tags of instance are created for network interfaces too
if len(networkInterfaces) > 0 {
s.scope.Debug("Attempting to create tags from resource", "resource-id", out.ID)
for _, networkInterface := range networkInterfaces {
// Create/Update tags in AWS.
if err := s.UpdateResourceTags(networkInterface.NetworkInterfaceId, out.Tags, nil); err != nil {
return nil, errors.Wrapf(err, "failed to create tags for resource %q: ", *networkInterface.NetworkInterfaceId)
}
}
}
so the providerID was never returned and lost, resulting in findInstance needing to fallback to tag-based detection

@cnmcavoy
Copy link
Contributor

cnmcavoy commented Nov 16, 2023

I am surprised the AWSMachine controller uses Name as a key to look up AWS resources. I would have expected to fallback to a tag key CAPA explicitly controlled, e.g sigs.k8s.io/cluster-api-provider-aws/aws-machine. Perhaps CAPA should use a tag key it knows it can own and control rather than one that users expect to manage, like Name, as Name exists mostly for the human UI in the first place.

@k8s-ci-robot
Copy link
Contributor

@celebdor: Re-titling can only be requested by trusted users, like repository collaborators.

In response to this:

/retitle OCPBUGS-23174: AWSMachine controller leaks EC2 instances when specifying a Name tag in additionalTags

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@celebdor
Copy link

sorry for the noise

@dlipovetsky
Copy link
Contributor

From today's office hours: I wonder if can address this once and for all by using the client token strategy described in https://docs.aws.amazon.com/AWSEC2/latest/APIReference/Run_Instance_Idempotency.html

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 10, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 9, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale May 9, 2024
@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.
Projects
None yet
6 participants