Slow startup of AL2023 AMI #1751
@cartermckinnon suggested it might have something to do with EBS lazy block fetching resulting in slow startup for the executables we need: #1696 (comment)
One way we can speed things up is to make sure that kubelet and containerd are loaded earlier in the chain.
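If "loaded earlier in the chain" is read as warming the binaries' blocks before their services start, a minimal Go sketch of that idea could look like the following. The paths and the approach are assumptions for illustration, not the AMI's actual mechanism:

```go
package main

import (
	"io"
	"log"
	"os"
)

// warm reads a file end-to-end so its blocks are restored from the EBS
// snapshot (and land in the page cache) before anything needs them.
func warm(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(io.Discard, f)
	return err
}

func main() {
	// Paths are assumptions for illustration.
	for _, p := range []string{"/usr/bin/kubelet", "/usr/bin/containerd"} {
		if err := warm(p); err != nil {
			log.Printf("warming %s: %v", p, err)
		}
	}
}
```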
We can test the EBS hypothesis by enabling fast-restore on the snapshot: https://docs.aws.amazon.com/ebs/latest/userguide/ebs-fast-snapshot-restore.html This is pretty expensive, so I understand if you don't want to give it a try; but if you can update the PR description with your launch steps, we can have consistency across our experiments @stijndehaes
@cartermckinnon I am willing to give it a try.
Here is the new result, with fast restore on the snapshot enabled.
@stijndehaes from the output in the initial log, it seems like it's this: amazon-eks-ami/nodeadm/internal/containerd/sandbox.go, lines 20 to 32 in f55411c
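For context, a minimal sketch of the lookup pattern that sandbox.go is described as using: shelling out to `containerd config dump` and regex-matching the sandbox image. The function name and regex here are illustrative assumptions, not the actual nodeadm code:

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
)

// Assumed pattern; the real nodeadm code may differ.
var sandboxImageRegex = regexp.MustCompile(`sandbox_image = "(.*)"`)

func getSandboxImage() (string, error) {
	// Shelling out to containerd is slow on a cold volume because the
	// containerd binary's blocks may not have been restored yet.
	out, err := exec.Command("containerd", "config", "dump").Output()
	if err != nil {
		return "", err
	}
	m := sandboxImageRegex.FindSubmatch(out)
	if m == nil {
		return "", fmt.Errorf("sandbox image not found in containerd config")
	}
	return string(m[1]), nil
}

func main() {
	img, err := getSandboxImage()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("sandbox image:", img)
}
```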
@dims Sorry it took so long to get back to you. I have retried with fast restore and got some better results. It takes around 50s with fast restore (still slower than the 40-ish seconds I got with AL2 without fast restore) and 100s without fast restore. So that appears to have an influence. Here are the logs of the services with and without fast restore: nodeadm-run with fast-restore
Thanks for testing this out, my findings concur. I'll see what I can do!
@cartermckinnon thank you very much, if I can help with anything please reach out :) At the moment I have a good flow of testing things out ;)
Just retried it, the results are still similar. The biggest issue appears to be the call to get the version of kubelet; if we could get the version another way, I think that would help a lot.
I will try something out to see if this helps
@stijndehaes you mean this (guessing from https://github.com/awslabs/amazon-eks-ami/blob/main/nodeadm/internal/kubelet/version.go#L18)?
@dims correct! That is the reason for the big timeout between these 2 log lines:
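For context, a minimal sketch of the version lookup referenced above (kubelet/version.go#L18): the version is obtained by exec'ing the kubelet binary itself, so the whole binary has to be read from the EBS volume first. The parsing below is an assumption for illustration, not the actual nodeadm code:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func getKubeletVersion() (string, error) {
	// Exec'ing kubelet forces the (large) binary to be read from the
	// EBS volume, which is what makes this step slow on first boot.
	out, err := exec.Command("kubelet", "--version").Output()
	if err != nil {
		return "", err
	}
	// Typical output looks like "Kubernetes v1.29.x".
	fields := strings.Fields(string(out))
	if len(fields) == 0 {
		return "", fmt.Errorf("unexpected kubelet --version output: %q", out)
	}
	return fields[len(fields)-1], nil
}

func main() {
	v, err := getKubeletVersion()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("kubelet version:", v)
}
```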
I did the following experiment. I changed the
And changed the systemd config to the following:
Obviously the hard-coded KUBELET_VERSION needs to be changed. This resulted in the following (results from systemd-analyze): the nodeadm-config.service went from 21s to 5.12s, shaving off around 15s in the hot startup path. Do you guys think it makes sense to include this optimisation? Obviously we still need to fill in the correct kubelet version. And sadly it is still slower than the AL2 AMI.
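As an illustration of the experiment described above, a hedged Go sketch: prefer a KUBELET_VERSION environment variable (set in the systemd unit or baked in at image build time) and only exec kubelet as a fallback. The variable name and the fallback are assumptions, not the actual change:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// kubeletVersion prefers a build-time/unit-provided KUBELET_VERSION and only
// execs the kubelet binary as a fallback.
func kubeletVersion() (string, error) {
	// Fast path: no dependency on the kubelet binary's blocks having
	// been restored from the snapshot yet.
	if v := os.Getenv("KUBELET_VERSION"); v != "" {
		return v, nil
	}
	// Slow path: exec the binary and parse the version from its output.
	out, err := exec.Command("kubelet", "--version").Output()
	if err != nil {
		return "", err
	}
	fields := strings.Fields(string(out))
	if len(fields) == 0 {
		return "", fmt.Errorf("unexpected kubelet --version output")
	}
	return fields[len(fields)-1], nil
}

func main() {
	v, err := kubeletVersion()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(v)
}
```

The design point is that the fast path removes the dependency on the kubelet executable entirely, which is what takes the cold-EBS read out of the hot startup path.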
@stijndehaes looking at the graphs for al2023 above, what happens if you disable the update-motd.service? (the before picture shows significant contribution from that one)
@stijndehaes another one, what happens if you switch the order of getting kubelet version first (caching it) and THEN configuring the kubelet service?
Just disabling update-motd.service results in: So actually a great improvement!
I am not exactly sure I understand what you mean by this. Can you elaborate?
@stijndehaes so please do file a PR for disabling the motd and let the team review it. On the other one, I was thinking of something like this diff where:
@dims I made a PR for disabling motd: #1910. As to your suggestion to cache the kubelet version, the biggest issue is that we are still using the kubelet executable, and that needs to be fetched when that part of the EBS volume is not initialised yet. So it will still be slow. That was the reason for my suggestion of the environment variable set during build: there is no dependency anymore on the kubelet executable.
@cartermckinnon I see you have been hard at work at improving the boot times! Thanks for the hard work, as soon as an AMI is available with your improvements I will start testing :)
Would be quite interested to see if the same work can be done for the GPU nodes as well; the nvidia components add quite a bit of boot time, on AL2 at least.
@sidewinder12s I'd love to speed things up on the AL2 GPU variant, but that's a different quest. When the transition to NVIDIA's open source kmod is complete, things will get much better.
I'm going to close this out for now, as we've addressed the main items brought up in this thread. AL2023 nodes should have time-to-ready that is comparable to AL2 nodes. @stijndehaes let me know if your tests indicate otherwise!
Oh, I mainly meant you might be able to get some of the same improvements on the AL2 or AL2023 GPU AMIs, but I was only able to look at the AL2 GPU AMI currently. I was mainly thinking that adding the nvidia binaries that are in the critical path to the initialize script might be an easy win.
After trying the AL2023 AMI I noticed startup is way slower compared to the AL2 image.
After some help from @ndbaker1 I made the following analysis using systemd-analyze. I added a plot using systemd-analyze; the nodeadm component appears to be in the hot path, with nodeadm-config.service taking 20s before it allows cloud-init.service (7.2s) to start, and then nodeadm-run.service taking 10.53s, though I think it is waiting for containerd.service (8.7s) to start.
Looking at the logs of nodeadm-config, it appears to take a long time to start before the first log (5s), but it also takes 13s between starting to configure kubelet and getting the kubelet version.
Looking at the logs of nodeadm-run, most time appears to be spent looking up the sandbox image. I noticed in the code this is done using containerd config dump and a regex; for some reason this takes a long time.
Not sure what can be done to optimize these steps, but I am willing to help :) I can write golang code and make custom AMIs with custom nodeadm if needed to test things out.
nodeadm-config logs, via journalctl -u nodeadm-config
nodeadm-run logs, via journalctl -u nodeadm-run:
:Originally posted by @stijndehaes in #1696 (comment)