Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add upgrade-recovery subcommand #1974

Merged

Conversation

anmazzotti
Copy link
Contributor

@anmazzotti anmazzotti commented Feb 23, 2024

A different take of #1966

This PR does not actually introduce a new upgrade-recovery subcommand, but instead just adds a --recovery-only upgrade subcommand argument.

Scratch the above.
This is now something in the middle.

elemental upgrade-recovery --recovery-system.uri <imgRef> is the final choice.

elemental upgrade --recovery can still be used as usual to upgrade both recovery and system at the same time.

The upgrade-recovery subcommand shares the UpgradeSpec with upgrade to reuse the same configuration/initialization logic, but filters it at a command level.
Only --recovery-system probably makes sense here, most of other flags should be disabled.

Copy link
Contributor

@frelon frelon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer adding the upgrade-recovery command.

Adding flags that completely changes the implementation of the upgrade command feels like a recipe for unintended consequences.

@codecov-commenter
Copy link

codecov-commenter commented Feb 23, 2024

Codecov Report

Attention: Patch coverage is 38.38863% with 130 lines in your changes are missing coverage. Please review.

Project coverage is 72.27%. Comparing base (9538960) to head (3718d94).

Files Patch % Lines
pkg/action/upgrade-recovery.go 56.03% 40 Missing and 11 partials ⚠️
cmd/upgrade-recovery.go 20.00% 31 Missing and 1 partial ⚠️
cmd/config/config.go 0.00% 20 Missing ⚠️
pkg/types/v1/config.go 0.00% 16 Missing ⚠️
pkg/action/upgrade.go 41.66% 5 Missing and 2 partials ⚠️
pkg/constants/constants.go 0.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1974      +/-   ##
==========================================
- Coverage   73.04%   72.27%   -0.77%     
==========================================
  Files          74       76       +2     
  Lines        8713     8895     +182     
==========================================
+ Hits         6364     6429      +65     
- Misses       1831     1938     +107     
- Partials      518      528      +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@anmazzotti
Copy link
Contributor Author

I prefer adding the upgrade-recovery command.

Adding flags that completely changes the implementation of the upgrade command feels like a recipe for unintended consequences.

From @davidcassany other comment I got the impression adding the argument was better. This is indeed simpler to implement. (And adding a separate action here is probably not needed too)

@anmazzotti
Copy link
Contributor Author

Part of rancher/elemental#1218

@anmazzotti anmazzotti changed the title Add --recovery-only upgrade argument Add upgrade-recovery subcommand Feb 26, 2024
@anmazzotti anmazzotti force-pushed the implement-upgrade-recovery-command-2 branch from 49451b4 to 3246717 Compare February 27, 2024 12:45
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
@anmazzotti anmazzotti force-pushed the implement-upgrade-recovery-command-2 branch from 0e19aad to 825c402 Compare February 28, 2024 11:03
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
@anmazzotti anmazzotti marked this pull request as ready for review February 29, 2024 08:01
@anmazzotti anmazzotti requested a review from a team as a code owner February 29, 2024 08:01
Copy link
Contributor

@davidcassany davidcassany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work 👍

I think the only actual request change I have to actually include the yaml loading unit test. I think the current sub viper setup is wrong (speaking from memory I might miss some detail), but in any case a simple unit test in that front will tell us and actually verify it works when fixed, in case it needs to be fixed.

The other issues, probably it is worth addressing them in a separate PR, not so sure. If we don't face obvious breaks and regressions with this I'd vote to merge it and iterate from that point. I think it might be easier this way.

statePath := filepath.Join(u.spec.Partitions.State.MountPoint, constants.InstallStateFile)
if u.spec.Partitions.State.MountPoint == "/" || u.spec.Partitions.State.MountPoint == "/.snapshots" {
statePath = filepath.Join(constants.RunningStateDir, constants.InstallStateFile)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lines are concerning, these are likely to explain why this does not work in a pod run. I got rid of this hack in upgrade because those paths shouldn't be hardcoded and we should try to minimize suc-upgrade impact in toolkit code (so I'd say that testing against /host/ or /host/.snapshots is also not an option).

Copy link
Contributor

@davidcassany davidcassany Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a compromise to start walking is not updating the state.yaml in State partition and figure out the consequences of such an inconsistency an figure out if there are better alternatives to manage state.yaml data (probably could be stored somewhere else such as in OEM, even I am not a fan of the idea but it would make is handling way easier).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I did inherit the hack and I didn't notice it was removed.
It's gone now. Thank you

return elementalError.NewFromError(err, elementalError.MountStatePartition)
}
cleanup.Push(umount)
umount, err = elemental.MountRWPartition(u.cfg.Config, u.spec.Partitions.State)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is likely to produce an undesired effect or even fail, it is likely to attempt mounting the current root as RW. By default /run/initramfs/elemental-state is already mounted as RW on a running system with the current setup. So it should not be needed to actually mount it as RW, it already is as RW.

Also, on a running system on btrfs, the active subvolume is mounted by default on this partition and it could also be the reported mount point from ghw library. This is the actual reason this was only called in recovery mode in the former code. Probably we need to move or add the logic at

func findStateMount(runner v1.Runner, device string) (rootDir string, stateMount string, err error) {
output, err := runner.Run("findmnt", "-lno", "SOURCE,TARGET", device)
if err != nil {
return "", "", err
}
r := regexp.MustCompile(`@/.snapshots/\d+/snapshot`)
scanner := bufio.NewScanner(strings.NewReader(strings.TrimSpace(string(output))))
for scanner.Scan() {
lineFields := strings.Fields(scanner.Text())
if len(lineFields) != 2 {
continue
}
if strings.Contains(lineFields[1], constants.RunningStateDir) {
stateMount = lineFields[1]
} else if r.MatchString(lineFields[0]) {
rootDir = lineFields[1]
}
}
if stateMount == "" || rootDir == "" {
err = fmt.Errorf("could not find expected mountpoints, findmnt output: %s", string(output))
}
return rootDir, stateMount, err
}
into utils.GetAllPartitions method so we can have a known and consistent mountpoint defined for state partition.
The problem is that current Partitions struct and the underlaying ghw library assumes a single mountpoint per partition, while there could be more. In fact in the upgrade pod mountpoints under /host/run and /run are duplicated, we have not criteria from elemental-toolkit to distinguish them or even reporting them.
At mid term I'd probably drop ghw library use in favor of an lsblk wrapper (we used to have that at a time) and report at list of mountpoints per partition. But this is relatively ambitious refactor... specially in terms of testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I actually changed this to always remount it because the CI tests were failing because /run/initramfs/elemental-state could not be found. I will revert this change here, most likely this is not the solution if there was an issue with it.
I also noticed the state partition was already RW mounted, this is why I added the debug log somewhere else to confirm this. Then the test just worked. 🤷

if err != nil {
return nil, fmt.Errorf("failed initializing upgrade recovery spec: %v", err)
}
vp := viper.Sub("upgrade-recovery")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure this is correct? shouldn't it be vp := viper.Sub("upgrade")? IIRC this states the sub schema item, hence it maps whatever is loaded from the yaml under upgrade-recovery stanza. I guess this could be tested and validated by checking this method properly reads upgrade section of a yaml. I assume this falls into the scope of cmd/config/config_test.go.

I think this should be fixed or at least validated with a unit test in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry I got confused (again :D ) with it.
I removed it all and just added a (bit ugly) flag to sanitize the same stanza when used for recovery only, since it's the only logical difference in there.

Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
return u.cfg.WriteInstallState(
u.spec.State, statePath,
u.spec.State, filepath.Join(u.spec.Partitions.State.MountPoint, constants.InstallStateFile),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is likely to cause failures in recovery-only upgrades as for those deployments using btrfs snapshotter the u.spec.Partitions.State.Mountpoint is likely to point to / (if running directly on the host) or /host (if running i the system-upgrade-controller pod).

As said before I think thin can be addressed later in follow up PR, it is not a regression.

@anmazzotti anmazzotti merged commit 2c5be14 into rancher:main Feb 29, 2024
14 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants