OSTree boot links can get out of sync #2283

Closed
jlebon opened this issue Feb 22, 2021 · 9 comments

@jlebon
Member

jlebon commented Feb 22, 2021

We've had one instance recently in RHCOS where a system failed to boot with:

systemd[1]: Starting OSTree Prepare OS/...
ostree-prepare-root[887]: ostree-prepare-root: Couldn't find specified OSTree root '/sysroot//ostree/boot.1/rhcos/a77c6a70e3037f40c6dc3634d31e63f797d4a5cf9059f5927141ca53690d1942/0': No such file or directory
systemd[1]: ostree-prepare-root.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: ostree-prepare-root.service: Failed with result 'exit-code'.
systemd[1]: Failed to start OSTree Prepare OS/.
systemd[1]: ostree-prepare-root.service: Triggering OnFailure= dependencies.

And digging deeper it seemed to be due to the bootversion being out of sync between the bootloader and the sysroot:

$ ls -la /sysroot/ostree/boot.*/rhcos/*
/sysroot/ostree/boot.0.1/rhcos/57fdb4d93971e78e2ba1dfb53ac646d8372e016bb97dedf2cdadd44b355ec7e9:
total 0
drwxr-xr-x. 2 root root  15 Jan 21 09:35 .
drwxr-xr-x. 4 root root 150 Jan 21 09:35 ..
lrwxrwxrwx. 1 root root  95 Jan 21 09:35 0 -> ../../../deploy/rhcos/deploy/046de410ba240083996855d797207e09374bbfdf65e5be34baa1e1f1ab49918a.0

/sysroot/ostree/boot.0.1/rhcos/a77c6a70e3037f40c6dc3634d31e63f797d4a5cf9059f5927141ca53690d1942:
total 0
drwxr-xr-x. 2 root root  15 Jan 21 09:35 .
drwxr-xr-x. 4 root root 150 Jan 21 09:35 ..
lrwxrwxrwx. 1 root root  95 Jan 21 09:35 0 -> ../../../deploy/rhcos/deploy/ba247419bcd04867a38aa336701cf54cf62d36114bd125e64198567ec14f262a.0

/sysroot/ostree/boot.0/rhcos/57fdb4d93971e78e2ba1dfb53ac646d8372e016bb97dedf2cdadd44b355ec7e9:
total 0
drwxr-xr-x. 2 root root  15 Jan 21 09:35 .
drwxr-xr-x. 4 root root 150 Jan 21 09:35 ..
lrwxrwxrwx. 1 root root  95 Jan 21 09:35 0 -> ../../../deploy/rhcos/deploy/046de410ba240083996855d797207e09374bbfdf65e5be34baa1e1f1ab49918a.0

/sysroot/ostree/boot.0/rhcos/a77c6a70e3037f40c6dc3634d31e63f797d4a5cf9059f5927141ca53690d1942:
total 0
drwxr-xr-x. 2 root root  15 Jan 21 09:35 .
drwxr-xr-x. 4 root root 150 Jan 21 09:35 ..
lrwxrwxrwx. 1 root root  95 Jan 21 09:35 0 -> ../../../deploy/rhcos/deploy/ba247419bcd04867a38aa336701cf54cf62d36114bd125e64198567ec14f262a.0

So the bootversion in the sysroot is 0, but the bootloader is pointing to 1. The logs for the previous boot (where the update happened, which did a bootconfig swap) show that the previous bootversion was 1. That means the boot.0 symlink is correct, so the bootloader symlink must be the one out of sync.
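
The comparison described above can be sketched in Python. This is only an illustration, not ostree's own logic: the helper names (`bootversion_from_path`, `sysroot_bootversions`, `is_out_of_sync`) are hypothetical, and it assumes the bootversion is the `N` in `/ostree/boot.N/` while entries like `boot.0.1` are the bootversion.subbootversion directories that the `boot.N` links point at.

```python
import re
from typing import Iterable, Set

# Matches the bootversion in paths such as
# /ostree/boot.1/rhcos/<checksum>/0 (the form seen in the error message).
_BOOT_RE = re.compile(r"/ostree/boot\.(\d+)/")


def bootversion_from_path(path: str) -> int:
    """Return N from a path containing '/ostree/boot.N/'."""
    # Collapse doubled slashes such as '/sysroot//ostree/...'.
    normalized = re.sub(r"/+", "/", path)
    match = _BOOT_RE.search(normalized)
    if match is None:
        raise ValueError(f"no bootversion in {path!r}")
    return int(match.group(1))


def sysroot_bootversions(entries: Iterable[str]) -> Set[int]:
    """Return the bootversions for which a plain 'boot.N' entry exists.

    Entries such as 'boot.0.1' are ignored; only 'boot.N' names count.
    """
    found = set()
    for name in entries:
        match = re.fullmatch(r"boot\.(\d+)", name)
        if match:
            found.add(int(match.group(1)))
    return found


def is_out_of_sync(requested_path: str, entries: Iterable[str]) -> bool:
    """True if the requested bootversion has no 'boot.N' entry in the sysroot."""
    return bootversion_from_path(requested_path) not in sysroot_bootversions(entries)
```

Fed the path from the `ostree-prepare-root` error and the names found under `/sysroot/ostree`, `is_out_of_sync` reports whether the bootloader is asking for a bootversion the sysroot does not have.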

Maybe some bug in the FIFREEZE/FITHAW code which caused /boot to remain unsynced?

Original RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1921217

@cgwalters
Member

OK I actually saw this once or twice when doing cosa build-fast which I am embarrassed to admit I didn't try to chase down at the time, but yeah we need to take this seriously.

@cgwalters
Member

This is also related to #1951

jlebon added a commit to jlebon/ostree that referenced this issue Mar 2, 2021
Was looking at this code more closely today to investigate issues
related to bootlink mismatches (ostreedev#2283).
@jlebon
Member Author

jlebon commented Mar 3, 2021

So here is another case: https://bugzilla.redhat.com/show_bug.cgi?id=1933116
There, the failure happens in the real root when the MCO tries to do rpm-ostree status --json:

Feb 23 23:41:20 master00002 rpm-ostree[32245]: Reading config file '/etc/rpm-ostreed.conf'
Feb 23 23:41:20 master00002 rpm-ostree[32245]: error: Couldn't start daemon: Error setting up sysroot: loading sysroot: Parsing deployment 0 in stateroot 'rhcos': readlinkat: No such file or directory

In this case, the bootversion in /boot/loader is 0, but in /ostree/boot it's 1. From the timestamps in the sosreport, we can tell that the /boot/loader configs are two days older than the /ostree/boot symlinks. So here too, it seems like the bootloader wasn't updated. The interesting thing though is that /proc/cmdline shows the bootversion when booting was 1 (at least when sosreport was run). I'm not sure how to make sense of that.
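
A rough way to check the two sides against each other on disk is sketched below. It is an illustration under stated assumptions, not how ostree or rpm-ostree load the sysroot: it assumes `boot/loader` is a symlink to `loader.N` and that the sysroot side is whichever `ostree/boot.N` symlinks exist. The function names and the `root` parameter are hypothetical.

```python
import os
import re


def read_bootversions(root):
    """Return (loader_version, sysroot_versions) found under root.

    loader_version is N if root/boot/loader is a symlink to 'loader.N',
    otherwise None. sysroot_versions is the set of N for which
    root/ostree/boot.N exists as a symlink.
    """
    loader_version = None
    loader = os.path.join(root, "boot", "loader")
    if os.path.islink(loader):
        target = os.path.basename(os.readlink(loader))
        match = re.fullmatch(r"loader\.(\d+)", target)
        if match:
            loader_version = int(match.group(1))

    sysroot_versions = set()
    ostree_dir = os.path.join(root, "ostree")
    for name in os.listdir(ostree_dir):
        match = re.fullmatch(r"boot\.(\d+)", name)
        if match and os.path.islink(os.path.join(ostree_dir, name)):
            sysroot_versions.add(int(match.group(1)))
    return loader_version, sysroot_versions


def links_in_sync(root):
    """True if the loader's bootversion also exists on the sysroot side."""
    loader_version, sysroot_versions = read_bootversions(root)
    return loader_version is not None and loader_version in sysroot_versions
```

For the state described in this comment, `read_bootversions` would return `(0, {1})` and `links_in_sync` would return `False`.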

@jlebon
Member Author

jlebon commented Mar 3, 2021

> I'm not sure how to make sense of that.

To expand, this is what somehow looks to have happened:

So maybe this is related to ostree operating on a freshly bootconfig swapped state, but before rebooting? Maybe we're not reloading all the state somehow and think that /ostree/boot doesn't need updating?

Anyway, I've asked for the full logs to try to get more info on the order of operations.

@miabbott
Collaborator

Another possible instance of this downstream - https://bugzilla.redhat.com/show_bug.cgi?id=1985913

@cgwalters
Member

I'm 93% certain this is coreos/fedora-coreos-tracker#976

@frzifus

frzifus commented May 14, 2024

Today I ran into the same problem after installing an RPM from copr with --apply-live.

rpm-ostree --apply-live install -y opentelemetry-collector-0.99.0-1.20240514165826437302.pr63.7.g984bf2a.fc40.x86_64

After restarting my machine I ended up in the rescue shell, which showed a message similar to the one mentioned here: #2283 (comment)

The ostree service searched for an image in /sysroot//ostree/boot.1/..., but /sysroot//ostree/ contained only boot.0, boot.0.1, deploy, lock and repo.

Comparing the paths that existed on my machine and the path required by the ostree service, I noticed that the paths were quite similar. The only difference was that these files were not in the boot.1 folder, but in the boot.0 folder.

Editing this in the grub config solved it for me.
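
The manual edit amounts to changing the bootversion inside the `ostree=` kernel argument. A minimal sketch of that string rewrite follows; it assumes the argument has the form `ostree=/ostree/boot.N/...` (the path in the error message without the `/sysroot/` prefix), and `set_bootversion` is a hypothetical helper. It only mirrors the workaround and does not address why the links diverged.

```python
import re

# Captures the text before and after the bootversion digits in the
# ostree= kernel argument, e.g. ostree=/ostree/boot.1/rhcos/<checksum>/0
_OSTREE_ARG_RE = re.compile(r"(\bostree=/ostree/boot\.)\d+(/)")


def set_bootversion(cmdline, new_version):
    """Return cmdline with the bootversion in its ostree= argument replaced."""
    updated, count = _OSTREE_ARG_RE.subn(
        lambda m: f"{m.group(1)}{new_version}{m.group(2)}", cmdline
    )
    if count != 1:
        # Refuse to guess when the argument is missing or repeated.
        raise ValueError("expected exactly one ostree= argument")
    return updated
```

Applied to a kernel command line that contains `ostree=/ostree/boot.1/...`, `set_bootversion(line, 0)` returns the same line pointing at `boot.0` and leaves every other argument untouched.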

@RoyalOughtness

RoyalOughtness commented Nov 5, 2024

@cgwalters Can this ticket be reopened? It is either not fixed or has resurfaced.

I'm seeing reports of this issue, with an identical error to the first message in this thread. This is pretty bad as it results in ostensibly soft-bricked machines (booting into the prior deployment doesn't work either).

[Attached photos: IMG_20241105_193502_502.jpg, IMG_20241105_193445_921.jpg]

@xynydev

xynydev commented Nov 5, 2024

I had this issue back in April as well. There were no layers or rebases involved; my custom image of Fedora Silverblue had just updated to F40. I could also confirm that one other user with a similar setup had encountered a similar issue back then. Clearly there are still problems... Thanks!
