ASG Warmpool instances join before Lifecycle hook is in effect #16582

jim-barber-he · 2024-05-21T04:29:24Z

/kind bug

1. What kops version are you running? The command kops version, will display
this information.

Client version: 1.28.5 (git-v1.28.5)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Server Version: v1.28.9

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

Create a number of new instance groups, some of which have the spec.warmPool defined with spec.warmPool.enableLifecycleHook: true set.
In my case I was creating 9 instance groups of which 3 of them had Warmpools enabled.
All instances were created by passing their manifests to kops create -f, then once that is done for all of them a kops update cluster --admin --yes command is run to put them into effect.

5. What happened after the commands executed?

All the instance groups were created fine and it appeared to be without issue...

But about an hour later when some of the instances that were in the warmpool were pressed into service, they joined the cluster as expected and started serving web traffic for our application until the ASG terminated them after they had been in service for 10 minutes.
Investigation via the AWS CloudTrail logs showed these EC2 instances had tried to run the CompleteLifecycleAction event a number of times when they joined the Warmpool all producing a ValidationException error with the message No active Lifecycle Action found with instance ID i-xxxxxxxxxxxxxxxxx.
So they have come up before the Lifecycle hook was in place.

Then later when put into service, the Lifecycle hook is now in place on the ASG, but these instances do not try to notify it, so the ASG assumes they are unhealthy after 10 minutes and terminates them.
In our case it killed our webservers causing a complete outage of our application for a while.

6. What did you expect to happen?

Instances that have joined the Warmpool should be properly executing the ASG Lifecycle hook notification when pressed into service so that the ASG won't terminate them after 10 minutes.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

I am unable to reproduce this problem if I just create one new instance group at a time.
In that case the Lifecycle hook manages to be in effect before new instances need it.

But I am able to reliably reproduce it when I create a number of instance groups at once (9 of them since I just reproduced the problem we had in production).
I guess you need a lot of them to cause some AWS API server throttling or something?

I have a Slack topic detailing the issue: https://kubernetes.slack.com/archives/C3QUFP0QM/p1715733980241579

The problem would be fixed if the ASG didn't add instances to it until all lifecycle hooks are created.

The text was updated successfully, but these errors were encountered:

Fixes: kubernetes#16582 When creating a new ASG with a Warmpool using Lifecycle hooks, the instances that first join the Warmpool when the ASG is created could come up before the Lifecycle hook is in effect. This can lead to problems such as those instances not calling the Lifecycle hook notification when pressed into service, causing the ASG to terminate them approximately 10 mins after they've been performing work in the cluster. Setting a dependency on for the Warmpool to take into account any Lifecycle hooks means that the Warmpool won't be created until the hooks are ready. Signed-off-by: Jim Barber <jim.barber@healthengine.com.au>

jim-barber-he · 2024-05-21T04:37:05Z

I created a pull request to address the above: #16583

jim-barber-he · 2024-05-31T06:21:57Z

We had the ASG terminate an instance today that didn't get the Lifecycle hook notification even though there were no issues with the instance when it joined the Warmpool.
The nodeup command added this in the version where we started having problems:

	loader.Builders = append(loader.Builders, &networking.AmazonVPCRoutedENIBuilder{NodeupModelContext: modelContext})

And that step does the change to set MACAddressPolicy = none and to run systemctl restart systemd-networkd
Is there a chance this restart of the network is happening at a bad time interfering with the ASG Lifecycle hook notification sometimes?
Are these loader.Builders run in order, or in parallel? Since the networking.AmazonVPCRoutedENIBuilder is called towards the end of the list and a long time after model.HookBuilder for example...

justinsb · 2024-06-04T11:09:12Z

Based on your PR, I think you've figured this out, but ....

Are these loader.Builders run in order, or in parallel? Since the networking.AmazonVPCRoutedENIBuilder is called towards the end of the list and a long time after model.HookBuilder for example...

The builders are run in order, but they set up a list of Tasks which run in parallel. The Tasks don't run fully parallelized, rather they run in "waves", the first wave has all the tasks that have no dependencies, then when all of those complete we move onto the next wave, which comprises the tasks that are newly unblocked.

Dependencies are either because we see that an object has a reference to a Task (and there's some "resolving" that happens based on the Task ID so we don't need to pass around Tasks everywhere), or because there are explicit Dependencies.

(We could change it to run each task as soon as possible, "waves" are just how it works today. I don't think it will make a huge difference in execution time, and it might be a little less deterministic i.e. more bugs, so that's why we haven't prioritized it)

At least that's my recollection :-)

justinsb · 2024-06-04T11:27:16Z

Looking at how the lifecycle hook should be called from nodeup:

kops/upup/pkg/fi/nodeup/command.go

Lines 375 to 382 in 0e70556

    
           if nodeupConfig.EnableLifecycleHook { 
        
           	if bootConfig.CloudProvider == api.CloudProviderAWS { 
        
           		err := completeWarmingLifecycleAction(ctx, cloud.(awsup.AWSCloud), modelContext) 
        
           		if err != nil { 
        
           			return fmt.Errorf("failed to complete lifecylce action: %w", err) 
        
           		} 
        
           	} 
        
           }

That is actually called after all the nodeup tasks. So if the node isn't delivering that hook, then I don't think it's about task execution.

A few things to check:

Is the node getting enableLifecycleHook: true in the nodeup config (e.g.

kops/tests/integration/update_cluster/minimal-warmpool/data/aws_s3_object_nodeupconfig-nodes_content

Line 20 in 0e70556

EnableLifecycleHook: true

) from one of the tests
Is nodeup printing klog.Info("No ASG lifecycle hook found")

kops/upup/pkg/fi/nodeup/command.go

Line 434 in 0e70556

klog.Info("No ASG lifecycle hook found")

- that would be a race between setting up the hooks (which seems a little unlikely, but ... possible?)
Did it print Lifecycle action completed ? If so ... very curious, because that suggests that everything worked.

Based on the CloudTrail events, I'm assuming that the No active Lifecycle Action found with instance ID i-xxxxxxxxxxxxxxxxx. is coming from the CompleteLifecycleAction call here, but that's surprising I think given we just fetched it. Maybe the key word is "active" there, like maybe there's a delay in the lifecycle action becoming ready?

One thing though, you mention model.HookBuilder - IIUC that is for custom hooks, not for the built-in warmpool hook I've linked above. Are you using a custom hook for the lifecycle action?

On the ENI task, I did a quick spot check and the the error handling / retry logic looked correct, so I'd be surprised if that was interfering with the CompleteLifecycleAction call. But ... something is plainly going wrong.

jim-barber-he · 2024-06-05T00:49:52Z

Hi @justinsb
Thanks for looking into this.

This issue has been overloaded a little bit in that it is covering two issues and perhaps you're mixing the two given your responses above?

The first (and the main thing this issue was about), was the instances in the ASG starting before the ASG Lifecycle Hook was in place.
This occurs as soon as the ASG is created and happens to both the Warmpool instances and the instances that skip the Warmpool to join the ASG to be put into service.
It's these instances that encounter the No active Lifecycle Action found with instance ID i-xxxxxxxxxxxxxxxxx error.
In the case of the instances that skipped the Warmpool and were put into service at this time, it isn't an issue for them since the ASG won't kill them because the Lifecycle wasn't in place at the time they joined.
But when the Warmpool instances go into service at a later time due to a scale-up or node replacement, they don't attempt to perform the Lifecycle notification, and so will be killed by the ASG causing an outage.

The Pull Request I raised fixes the problem above for the instances that join the Warmpool only. They longer encounter the No active Lifecycle Action found with instance ID i-xxxxxxxxxxxxxxxxx error.
However the instances that skip the Warmpool to go into service immediately will still throw these messages, but it's not an issue and they are safe from being reaped by the ASG because they joined before the hook was in effect.

The above problem has nothing to do with the loader.Builders as far as I can tell, and are purely because they've started before the Lifecycle hook was active on the ASG.

For kOps 1.28.4 and earlier (where we were using Ubuntu 20.04) we have never seen the above issue before.
It all started at kOps 1.28.5 (where we also switched to Ubuntu 22.04) and persists in kOps 1.29.0.
As to why we had never encountered it before I cannot say.

The 2nd issue is what I posted about on May 31, and this is the one where I was wondering if the loader.Builders came into play.

In this case these instances joined the Warmpool a long time after the ASG and its Lifecycle hook was established and had no errors at all.
CloudTrail shows no issues with the instances as they joined the Warmpool, and shows that they properly performed the Lifecycle hook notification at the time they joined the Warmpool.

But later when they leave the Warmpool to go into service, there are no logs in CloudTrail showing them attempting to perform the Lifecycle hook notification (successfully or not); and after 10 mins the ASG kills them for having not completed the Lifecycle notification, causing an outage.
So either they didn't attempt the Lifecycle notification at all (which seems unlikely) OR they did, but it got lost.
If it got lost, I was wondering if it was because it happened at the time the task created by the loader.Builder executes the systemctl restart systemd-networkd command causing networking to go offline for a moment.

Note that what I failed to mention at the time I wrote about this 2nd problem on May 31 was that it was encountered on a Kubernetes 1.29.5 cluster that was created with kOps 1.29.0.
But this problem doesn't happen often and we didn't run the Kubernetes 1.28.9 cluster built with kOps 1.28.5 for very long (about 2 weeks) so for all I know the problem existed there as well, but not necessarily.

Still talking about the 2nd issue but addressing your questions...

One thing though, you mention model.HookBuilder - IIUC that is for custom hooks, not for the built-in warmpool hook I've linked above. Are you using a custom hook for the lifecycle action?

No that was me making a bad assumption that it might have had something to do with the creation of the lifecycle hooks.
The point I was trying to make is that there are all sorts of tasks being created there, where the networking.AmazonVPCRoutedENIBuilder one that restarts networking is towards the end.
I was thinking this probably needed to be one of the first tasks performed, and anything that requires the use of the network should probably depend on it having called systemctl restart systemd-networkd before they attempt to do anything network related.
However you're saying the lifecycle hook is called after all the node tasks are completed, so I guess my theory above about the network restart possibly interrupting things can't be right...

A few things to check

Unfortunately with this problem any node specific configuration and the logs produced by nodeup or caught by the systemd journal are lost, since we are not actually aware there is a problem with the instance until the ASG has already terminated it.
The vast majority of the instances do not encounter the problem at all, and it has only happened a couple of times.

We've since stopped deploying our own node-termination-handler that was running as a Daemon Set in IMDS mode to catch spot interruptions into having kOps deploy it in queue-processor mode so that it also handles when the ASG terminates the instances that fail lifecycle hooks.
This makes it less obvious if we are still having nodes terminated due to this problem since we'll no longer have outages caused by a node suddenly going away while still serving requests; so I'd have to go back to CloudTrail again to try and find the frequency this is happening.

Is the node getting enableLifecycleHook: true in the nodeup config

It should be getting the same configuration as all the other instances on the same ASG that didn't encounter the issue.
But with the instance having been terminated I have no way of checking.

Is nodeup printing klog.Info("No ASG lifecycle hook found")

Due to the ASG terminating the instances, I have no way of knowing as far as I can tell?

With my theory of the systemctl restart systemd-networkd being incorrect because the lifecycle hook call is done after all tasks have completed their run, then I don't know why in these few cases the instances have appeared to have not performed the lifecycle hook call; yet still managed to be put into service within the cluster for ~10 mins before the ASG terminates them for failing to perform the Lifecycle hook call.

jim-barber-he · 2024-06-06T00:02:54Z

Hi @justinsb
This issue was closed due to the 1st issue being addressed.
Do I need to open a new issue covering the 2nd one where random hosts that joined the warmpool much later in the ASG's life had no problem then but we were terminated by the ASG later for not having called the lifecycle hook?

Fixes: kubernetes#16582 When creating a new ASG with a Warmpool using Lifecycle hooks, the instances that first join the Warmpool when the ASG is created could come up before the Lifecycle hook is in effect. This can lead to problems such as those instances not calling the Lifecycle hook notification when pressed into service, causing the ASG to terminate them approximately 10 mins after they've been performing work in the cluster. Setting a dependency on for the Warmpool to take into account any Lifecycle hooks means that the Warmpool won't be created until the hooks are ready. Signed-off-by: Jim Barber <jim.barber@healthengine.com.au>

justinsb · 2024-06-06T18:52:32Z

Hopefully this will do it:

/reopen

Just to confirm, you are using the AWS VPC CNI? The systemd-networkd change does look like a likely possibly-related change, but it should only be in play if you're using AWS VPC CNI

k8s-ci-robot · 2024-06-06T18:52:37Z

@justinsb: Reopened this issue.

In response to this:

Hopefully this will do it:

/reopen

Just to confirm, you are using the AWS VPC CNI? The systemd-networkd change does look like a likely possibly-related change, but it should only be in play if you're using AWS VPC CNI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

jim-barber-he · 2024-06-06T21:34:25Z

Yes. We are using the AWS VPC CNI.

k8s-triage-robot · 2024-09-04T21:54:32Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-10-04T22:36:54Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

jim-barber-he · 2024-10-16T08:16:00Z

Hi @justinsb

I'm guessing you either didn't get time to look at this issue any more; or you did look but couldn't find the cause of the issue?
Just bumping this since the bot will close this pretty soon.

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 21, 2024

jim-barber-he mentioned this issue May 21, 2024

Make ASG Warmpool depend on ASG Lifecycle hook #16583

Merged

jim-barber-he changed the title ~~ASG Warmpool instance join before Lifecycle hook is in effect~~ ASG Warmpool instances join before Lifecycle hook is in effect May 21, 2024

hakman added the kind/office-hours label May 31, 2024

k8s-ci-robot closed this as completed in #16583 Jun 5, 2024

k8s-ci-robot reopened this Jun 6, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 4, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ASG Warmpool instances join before Lifecycle hook is in effect #16582

ASG Warmpool instances join before Lifecycle hook is in effect #16582

jim-barber-he commented May 21, 2024 •

edited

Loading

jim-barber-he commented May 21, 2024

jim-barber-he commented May 31, 2024

justinsb commented Jun 4, 2024

justinsb commented Jun 4, 2024

jim-barber-he commented Jun 5, 2024

jim-barber-he commented Jun 6, 2024 •

edited

Loading

justinsb commented Jun 6, 2024

k8s-ci-robot commented Jun 6, 2024

jim-barber-he commented Jun 6, 2024

k8s-triage-robot commented Sep 4, 2024

k8s-triage-robot commented Oct 4, 2024

jim-barber-he commented Oct 16, 2024 •

edited

Loading

ASG Warmpool instances join before Lifecycle hook is in effect #16582

ASG Warmpool instances join before Lifecycle hook is in effect #16582

Comments

jim-barber-he commented May 21, 2024 • edited Loading

jim-barber-he commented May 21, 2024

jim-barber-he commented May 31, 2024

justinsb commented Jun 4, 2024

justinsb commented Jun 4, 2024

jim-barber-he commented Jun 5, 2024

jim-barber-he commented Jun 6, 2024 • edited Loading

justinsb commented Jun 6, 2024

k8s-ci-robot commented Jun 6, 2024

jim-barber-he commented Jun 6, 2024

k8s-triage-robot commented Sep 4, 2024

k8s-triage-robot commented Oct 4, 2024

jim-barber-he commented Oct 16, 2024 • edited Loading

jim-barber-he commented May 21, 2024 •

edited

Loading

jim-barber-he commented Jun 6, 2024 •

edited

Loading

jim-barber-he commented Oct 16, 2024 •

edited

Loading