Do not migrate legacy images if snapshots are already present #1990

Merged

Conversation

davidcassany
Contributor

This commit prevents executing the legacy images migration logic if the snapshotter already finds available snapshots. This mostly means the migration was already executed and legacy images already had a chance to be converted into snapshots.
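
For context on the fix itself, here is a minimal Go sketch of the guard this PR describes: skip the legacy image migration whenever the snapshotter already reports existing snapshots. All type and method names below (`Snapshot`, `LoopDevice`, `listSnapshots`, `migrateLegacyImages`, `InitSnapshotter`) are illustrative assumptions, not the actual snapshotter API touched by this change.

```go
package snapshotter

// Illustrative sketch only: the types and method names here are assumptions,
// not the real loopdevice snapshotter API of this repository.

// Snapshot stands in for whatever per-snapshot metadata the snapshotter tracks.
type Snapshot struct {
	ID int
}

// LoopDevice stands in for the loop-device based snapshotter.
type LoopDevice struct {
	listSnapshots       func() ([]Snapshot, error) // hypothetical snapshot lookup
	migrateLegacyImages func() error               // hypothetical legacy image migration
}

// InitSnapshotter applies the guard described above: if snapshots already
// exist, the legacy image migration most likely ran on a previous upgrade,
// so it is skipped instead of being executed again.
func (l *LoopDevice) InitSnapshotter() error {
	snapshots, err := l.listSnapshots()
	if err != nil {
		return err
	}
	if len(snapshots) > 0 {
		// Migration already had its chance; running it again could create
		// duplicate snapshots from the same legacy images.
		return nil
	}
	return l.migrateLegacyImages()
}
```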

@davidcassany davidcassany requested a review from a team as a code owner March 5, 2024 14:01
@codecov-commenter

codecov-commenter commented Mar 5, 2024

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

Project coverage is 72.52%. Comparing base (b266360) to head (63df62d).

Files Patch % Lines
pkg/snapshotter/loopdevice.go 0.00% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1990      +/-   ##
==========================================
- Coverage   72.55%   72.52%   -0.04%     
==========================================
  Files          76       76              
  Lines        8902     8906       +4     
==========================================
  Hits         6459     6459              
- Misses       1910     1913       +3     
- Partials      533      534       +1     

☔ View full report in Codecov by Sentry.

@anmazzotti
Contributor

anmazzotti left a comment

Nice. I think this explains why I saw multiple snapshots being created at times.

@davidcassany
Contributor Author

davidcassany commented Mar 5, 2024

Right, I just manually tested the following sequence:

  1. upgrade from stable to dev
    1.1 manual reboot to fallback brings cluster up again
  2. upgrade from dev to this PR
    2.1 manual reboot to fallback brings cluster up again
  3. reboot to fallback and from there upgrade to this PR again
    3.1 manual reboot to fallback brings cluster up again

@anmazzotti @fgiudici about the upgrade resources lifecycle: all three upgrades (1st, 2nd and 3rd) were done with a dedicated ManagedOSImage resource without deleting the previous one, so after all three upgrades three different upgrade groups are still alive. Then:

  1. I deleted the cluster (reset was enabled) and the node got reset to stable (recovery was not upgraded)
  2. Created a new cluster with the same name as the deleted one
  3. All three ManagedOSImages got triggered and started simultaneously
  4. The 2nd upgrade resource won the download race and managed to run the upgrade; the 3rd upgrade got blocked in suc-upgrade (the script prevents concurrent executions).
  5. On reboot, the 3rd upgrade got quickly applied (it was already the same version, hence it succeeded without making any change). The 1st upgrade got rebooted while the upgrade pod was initializing and it doesn't seem to continue after reboot; it feels like it got stuck in pod initialization...
  6. Deleted the pod stuck in initialization and the system-upgrade-controller immediately recreated the pod and triggered the upgrade. This upgrade was fully executed because its image was pulled from a different repository in the registry (suc-upgrade only prevents downgrades if the host and upgrade images come from the same repository in the registry).

After all that, the cluster got up and ready again and the node got the three upgrade labels, but the process turned out to be a bit insane and there was no guarantee of knowing what the end result would be (image from the 1st or from the 2nd). I bet that if it had been a multi-node cluster we would have seen heterogeneous results, some nodes on image A and some others on image B.

All in all, keeping upgrade resources around seems to be a really tricky practice; however, I can't think of a meaningful criterion to determine when an upgrade resource should be deleted (think at scale). In that particular case I think it would have been way more sane and reliable to just keep updating the same upgrade resource, modifying the image to upgrade to, so you could adopt the rule of thumb that a single upgrade group shall exist for a given group of clusters. I do think making upgrade groups editable makes sense.

Finally, also because I was already in this weird scenario, I tried to actually re-trigger the 3rd upgrade on the cluster. I could do that by deleting the related label from the node resource, but funny enough, I could not delete the label from the Rancher UI, I had to use kubectl against that cluster to do so 🤷🏽‍♂️ deleting the label from the Rancher UI did nothing, it was not effective.

This commit prevents executing the legacy images migration logic
if the snapshotter already finds available snapshots. This mostly
means the migration was already executed and legacy images already
had a chance to be converted into snapshots.

Signed-off-by: David Cassany <dcassany@suse.com>
@davidcassany davidcassany force-pushed the fix_legacy_images_management branch from 10b5e8d to 44fb99f Compare March 5, 2024 17:23
@davidcassany davidcassany enabled auto-merge (rebase) March 5, 2024 17:37
@davidcassany davidcassany linked an issue Mar 5, 2024 that may be closed by this pull request
@fgiudici
Member

fgiudici commented Mar 5, 2024

[...]
All in all, keeping upgrade resources around seems to be a really tricky practice; however, I can't think of a meaningful criterion to determine when an upgrade resource should be deleted (think at scale). In that particular case I think it would have been way more sane and reliable to just keep updating the same upgrade resource, modifying the image to upgrade to, so you could adopt the rule of thumb that a single upgrade group shall exist for a given group of clusters. I do think making upgrade groups editable makes sense.
[...]

Thanks for sharing all the tests, David, quite interesting!
Keeping the upgrade resources around makes things messy indeed... having the upgrade groups editable could be an option. Having a "Done" flag in the status, as per Andrea's idea, would also be handy; we surely need to figure out how to give "feedback" to the upgrade process. Right now, having these two improvements together sounds great to me.
In any case, we have to improve the upgrade lifecycle for sure.

@davidcassany davidcassany disabled auto-merge March 5, 2024 21:25
@davidcassany davidcassany force-pushed the fix_legacy_images_management branch from 67c92f9 to 44fb99f Compare March 5, 2024 22:53
Signed-off-by: David Cassany <dcassany@suse.com>
@davidcassany davidcassany enabled auto-merge (rebase) March 5, 2024 22:56
Signed-off-by: David Cassany <dcassany@suse.com>
@davidcassany davidcassany disabled auto-merge March 6, 2024 08:36
@davidcassany davidcassany merged commit c4bea71 into rancher:main Mar 6, 2024
17 of 19 checks passed
@davidcassany davidcassany deleted the fix_legacy_images_management branch March 6, 2024 09:10
@davidcassany davidcassany restored the fix_legacy_images_management branch May 15, 2024 13:23
Development

Successfully merging this pull request may close these issues.

Root password lost after an OS upgrade