Boot time for ISO image 3975.2.0 release has increased to ~1m30s #1514
Comments
I just booted an instance on VSphere and the
The ISO image is a generic image (without OEM tools), so it's normal that there is no sign of the OEM VMWare tools. I need to investigate how image-builder builds the Flatcar VSphere image; I guess it boots the Flatcar ISO and from the ISO it runs the |
Here's the Packer vars that image-builder is using when building Flatcar for VSphere - https://github.com/kubernetes-sigs/image-builder/blob/main/images/capi/packer/ova/flatcar.json |
After some debugging on Slack we discovered that the boot time for |
Some logs on local ISO with qemu:
I did not investigate further. |
alpha-3913.0.0 boots ok, this issue first occurred with alpha 3941.0.0. |
This is the guilty file: https://github.com/flatcar/bootengine/blob/flatcar-master/dracut/31decrypt-root/decrypt-root.service |
Maybe use |
which means that if there is no flatcar.oem.id parameter, maybe we can skip it. Still, a better and sounder approach should be used. For reference, here is the cmdline for a normal boot on qemu:
For ISO:
|
FYI - I encountered another issue with this release and image-builder after increasing the boot timeout. It seems like the boot command configured in image-builder is now failing because the first command is returning a warning that seems to be causing the |
I have reproduced this on 4054.0.0 with QEMU, so it hasn't been magically fixed since. I'll keep investigating. |
Since the ISO configuration is identical to
I would add a condition to the unit (https://github.com/flatcar/bootengine/blob/flatcar-master/dracut/31decrypt-root/decrypt-root.service) to check for CDROM:
I'm starting to think that some Equinix Metal failures (PXE boot) might be explained by this timeout. |
I've now set this up in a way that I can very quickly test changes. Dropping that unit entirely does indeed fix the problem. Our other Dracut modules check for this situation with this. Perhaps it would be best to do something similar here?
    usr=$(cmdline_arg mount.usr)
    if [[ -z "${usr}" && -f /usr.squashfs ]]; then
|
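For illustration only, here is a minimal sketch of the check quoted above, written so it can be run outside the initrd. The `cmdline_value` function is a hypothetical stand-in for the `cmdline_arg` helper (whose real implementation is not shown in this thread), and `is_live_boot` is an illustrative name, not something from bootengine. It assumes, as in the quoted snippet, that an ISO/PXE boot has no `mount.usr` argument and has a `/usr.squashfs` file.

```shell
# Hypothetical stand-in for the cmdline_arg helper: prints the value of
# key $1 found in the kernel command line string $2 (nothing if absent).
cmdline_value() (
    key=$1
    set -f    # disable globbing while the command line is word-split
    for word in $2; do
        case $word in
            "$key"=*) printf '%s\n' "${word#*=}"; exit 0 ;;
        esac
    done
)

# Succeeds when the boot looks like ISO/PXE: no mount.usr argument and a
# squashfs image present at path $2 (defaults to /usr.squashfs).
is_live_boot() (
    usr=$(cmdline_value mount.usr "$1")
    squashfs=${2:-/usr.squashfs}
    [ -z "$usr" ] && [ -f "$squashfs" ]
)
```

In an initrd one would presumably pass the contents of /proc/cmdline as the first argument, e.g. `is_live_boot "$(cat /proc/cmdline)"`.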
This works:
    ConditionKernelCommandLine=!mount.usr
    ConditionPathExists=!/usr.squashfs
But only if I remove this line:
    Wants=dev-disk-by\x2dpartlabel-ROOT.device
I think that device is dynamic, so I don't know whether removing that is a good idea. |
Changing the |
Bleh, it doesn't seem to work. If I drop to a shell before it switches root, I find that it hasn't decrypted the root partition, but doing it manually works. It works automatically if I change it back to That does make sense. The by-partlabel symlink takes time to appear, so we want this to wait for that, but only if we're fairly sure it will appear eventually. I'm having a similar issue with the other boot stuff I'm looking at, so I will keep experimenting. |
I've been doing some more thinking. I don't think we should assume that a user would not want to decrypt root when booting from the ISO or via PXE. Serving Flatcar via PXE but storing data on local encrypted disks is a key use case, right? The point of this decrypt-root service is to make the decryption magically work without explicit configuration. Either we have to wait a reasonable amount of time for the root device to show up, or we have to make it work more reactively via a udev rule. The question is whether this decrypted disk needs to be ready before some other boot time component. Do we do anything special with these "root" disks? What happens if it's not encrypted? If some other part of the boot process depends on this, doesn't that also need to wait for the symlink to appear? |
There may be a usecase that uses an encrypted root disk with PXE, but then there would need to be a a) no b) is what we want to fix here. a) works. c) is not supported (because of no
Decrypt-root runs from the initrd, and we need to have the decrypted root mounted at There is the added fun that |
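I wanted to allow for (c), and I had an idea, so I reworked this around a udev rule.
    ACTION=="add", SUBSYSTEM=="block", ENV{ID_PART_ENTRY_NAME}=="ROOT", ENV{ID_FS_TYPE}=="crypto_LUKS", TAG+="systemd", ENV{SYSTEMD_WANTS}+="decrypt-root.service"
I also made decrypt-root.service a bit simpler. I haven't tried (a) yet, but (b) and (c) work really well. This also decrypts root even if you don't actually specify root=/dev/mapper/rootencrypted. If you do specify it, then systemd knows to wait. I'll make a PR tomorrow. |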
How does the udev rule interact with Ignition? For cases a) and c), the encryption would be done by Ignition, so would decrypt-root.service race with Ignition's actions?
On Monday, August 19, 2024, James Le Cuirot wrote:
I wanted to allow for (c), and I had an idea, so I reworked this around a udev rule.
ACTION=="add", SUBSYSTEM=="block", ENV{ID_PART_ENTRY_NAME}=="ROOT", ENV{ID_FS_TYPE}=="crypto_LUKS", TAG+="systemd", ENV{SYSTEMD_WANTS}+="decrypt-root.service"
I also made decrypt-root.service a bit simpler. I haven't tried (a) yet, but (b) and (c) work really well. This also decrypts root even if you don't actually specify root=/dev/mapper/rootencrypted. If you do specify it, then systemd knows to wait. I'll make a PR tomorrow.
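As a rough way to reason about which devices the quoted rule would pick up, the sketch below applies the rule's two property matches (`ID_PART_ENTRY_NAME` is `ROOT` and `ID_FS_TYPE` is `crypto_LUKS`) to a list of `KEY=VALUE` device properties read from stdin. The function name `wants_decrypt_root` is made up for this example, and the sketch ignores the `ACTION` and `SUBSYSTEM` matches; it only mirrors the matching logic and is not how udev evaluates rules internally.

```shell
# Reads KEY=VALUE device properties on stdin and succeeds only when both
# property matches from the rule hold: the partition label is ROOT and
# the detected filesystem type is crypto_LUKS.
wants_decrypt_root() (
    name= fstype=
    while IFS='=' read -r key value; do
        case $key in
            ID_PART_ENTRY_NAME) name=$value ;;
            ID_FS_TYPE) fstype=$value ;;
        esac
    done
    [ "$name" = ROOT ] && [ "$fstype" = crypto_LUKS ]
)

# Possible use on a running system (the device path is only an example):
#   udevadm info --query=property --name=/dev/sda9 | wants_decrypt_root
```

An unencrypted ROOT partition fails the second match, so under this logic the service would only be pulled in when there is actually something to decrypt.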
|
Good news all round after more testing. The rule works correctly on first boot and subsequently for (a) and (c) because of I also tried with In theory, this could be used to automatically decrypt any LUKS partitions but that's probably taking it too far. |
Closes: flatcar/Flatcar#1514 Signed-off-by: James Le Cuirot <jlecuirot@microsoft.com>
Bug: flatcar/bootengine#98 Closes: flatcar/Flatcar#1514 Signed-off-by: James Le Cuirot <jlecuirot@microsoft.com>
See: flatcar/bootengine#98 See: flatcar/Flatcar#1514 Signed-off-by: James Le Cuirot <jlecuirot@microsoft.com>
Sorry, this was closed a little prematurely. I'll close it when the fix is in the stable release. |
I've backported this to the 4012 Beta. |
Hello @AverageMarcus, the fix has been ported to the latest Beta release (4054.1.0). Do you think you could confirm that it solves the initial issue with image-builder (i.e. by removing the boot_wait override)? That could help to promote this fix to Stable. Thanks! |
Sure, though I'm not sure when I'll get the chance, as I'm catching up on things after being away for the past week. I'll report back when I can. |
@AverageMarcus never mind, I see this change is not available on Beta yet. It will be available in the next Beta. Sorry about that 🙏 but it's available on Alpha if you want to test it. |
udev rule See: flatcar/bootengine#98 See: flatcar/Flatcar#1514 Signed-off-by: James Le Cuirot <jlecuirot@microsoft.com> Signed-off-by: Mathieu Tortuyaux <mtortuyaux@microsoft.com>
It is already fixed in Stable 3975.2.2, Beta 4081.1.0, Alpha 4116.0.0. |
Description
The latest stable release, 3975.2.0, is taking a considerably longer time to boot compared to previous releases. E.g.
vs. the previous release:
This has resulted in a delay at boot of roughly 1min 30sec compared to the previous release.
Impact
When building images with image-builder, the default boot wait time is 60s, after which Packer will start typing out the boot command. As this happens before the OS has finished booting, the flatcar-install script isn't actually run correctly, which can result in the OS not being correctly set up in an environment such as VSphere (it doesn't install the vmtools, for example). With image-builder this then results in it hanging at "Waiting for IP..." as vcenter is unable to retrieve the IP from the guest OS because no vmtools have been installed.
Environment and steps to reproduce
3975.2.0 on VSphere via image-builder
a. With image-builder run: make build-node-ova-vsphere-flatcar
b. The 3975.2.0 image is used to launch a new VM (https://stable.release.flatcar-linux.net/amd64-usr/3975.2.0/flatcar_production_iso_image.iso)
c. The VM boots but vcenter fails to retrieve the IP address due to lack of vmtools on the guest OS
Waiting for IP...
Expected behavior
The OS boots in under 60 seconds. Ideally back down below 10s like the previous release.
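To make the timing constraint concrete, here is a small sketch using the figures reported in this issue (roughly 90s boot time against the 60s default boot wait). The function name `boot_command_is_safe` is invented for this example and is not part of image-builder or Packer; it only expresses the comparison that decides whether the boot command is typed too early.

```shell
# Succeeds when the guest finishes booting before the boot command is typed.
# $1: observed boot time in whole seconds, $2: boot wait in whole seconds.
boot_command_is_safe() {
    [ "$1" -lt "$2" ]
}

# Figures from this issue: ~1m30s boot time, 60s default boot wait.
if boot_command_is_safe 90 60; then
    echo "boot command is typed after boot completes"
else
    echo "boot command is typed too early"
fi
```

With a boot time back under 10s, as in the previous release, the same check passes with a wide margin against the 60s default.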
Additional information
This issue was previously about open-vm-tools not being installed, but after debugging with @tormath1 on Slack (see thread) we learnt that the problem was actually caused by the delay in boot, so I updated the issue to reflect that.