# design proposal: new migration process for backend-storage PVCs (#314)

`design-proposals/backend-storage-clone-on-migration.md` (new file, 82 additions)
# Overview
This proposal is about supporting the migration of all VMs that use a persistent feature, such as persistent vTPM.

To make this document as simple as possible, let's consider that storage in Kubernetes can either be filesystem (FS) or
block mode, and either ReadWriteOnce (RWO) or ReadWriteMany (RWX) access mode, with storage classes supporting any
number of combinations of both.

As of KubeVirt 1.3, here is how each of the 4 combinations is supported by backend-storage for persistent VM state:
- RWO FS: supported by backend-storage but makes VMs non-migratable. **This entire design proposal is about removing that limitation.**
- RWX FS: preferred combination, supported by backend-storage and compatible with VM migrations
- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way
- RWX block: same as above, with migratability functional but considered unsafe, because it leads to 2 pods (source and target virt-launchers) mounting the same ext3 filesystem concurrently.
**alicefr (Member), Aug 14, 2024:**
Sorry, I'm a bit late on this. Have we ever considered using a clustered filesystem like gfs2 instead of ext3?

**Author (Contributor):**
That's a really good question. gfs2 somehow didn't pop up when I looked for potentially suitable filesystems for formatting the block storage, probably worth investigating!
That being said, this proposal (and even the alternative it contains) doesn't consider the option of formatting the block storage (anymore), and not just because I couldn't find a suitable filesystem.
The fact that it requires adding a bunch of tools to virt-launcher and virt-handler, and the general complexity of managing partitions ourselves, make it something we should avoid, in my opinion (see draft PR).

**alicefr (Member), Aug 15, 2024:**
Yes, I agree. I read a bit about gfs2 as supported by Red Hat, and it requires clvm, so I think that also excludes it as a solution.


The above is the best we can do with the current implementation, which involves mounting the same partition on migration source and target.
This proposal is about switching to a different model which would instead create a new backend-storage PVC on migration, enabling RWO FS.

Furthermore, the implementation of this design proposal would also allow a potential block-based implementation, compatible with both RWO and RWX and safely migratable.
That, however, is out-of-scope for this design proposal, and may never be needed, since most/all block storage classes also support RWO FS.

## Alternative approach
The alternative to this design is for upstream components (libvirt - qemu - swtpm) to add block storage support for TPM/EFI.
KubeVirt could then create one backend-storage PVC per feature (EFI/TPM) and pass them directly to libvirt.
That alternative approach would enable RWX block backend-storage (as opposed to RWO FS in this proposal).

Pros:
- Less code to write and maintain in KubeVirt
- No need to duplicate backend-storage PVCs on migrations
**Reviewer (Member), suggested change:** amend the previous bullet to read "No need to duplicate backend-storage PVCs on migrations and delete the correct one after migration success/failure".
- libvirt gains an important feature

**Reviewer (Member):**
Please file a public request on libvirt to add this functionality, and include the URL to it here. The simplest way to "kill" this alternative is if libvirt rejects it. I would think that this would be a mistake on their part, but that's their question to answer.

**Author (Contributor):**
I discussed this with a few members of the libvirt team, and they actually don't appear to have strong objections to it.
There would be 4 parts to this:

- block storage support in swtpm: already done
- libvirt glue for swtpm block storage: draft PR open (https://gitlab.com/libvirt/libvirt/-/merge_requests/380)
- block storage support in qemu for NVRAM: could work out of the box, since the NVRAM is already just 1 file (we just have to ensure they don't run file-specific operations on it)
- libvirt glue for NVRAM block storage: could also work out of the box, though that is unlikely, but it shouldn't require too much work (just ensuring the "file" is never moved/deleted/chmoded...)

So there is really not too much left to do overall; however, we would have to wait for the first release of every component that needs changes, which could take a long time.
This approach is also definitely not my first choice, and I would hate to request something, have upstream devs work hard on it, and ultimately not consume it...
That's why I didn't file any request yet.

**Reviewer (Member):**
> This approach is also definitely not my first choice

Please make sure to spell out all of your reasoning as its "cons". For example, I did not see the issue of timing written in the proposal.

> That's why I didn't file any request yet.

Let's have a formal request and link to it here. We should be explicit that, because of the reasons spelled out here, we may not use it, or may not use it immediately. But let us give the libvirt team an opportunity to refute our assumptions - maybe they can deliver this functionality fast?

**Author (Contributor):**
> Please file a public request on libvirt to add this functionality, and include the URL to it here.

https://issues.redhat.com/browse/RHEL-54325
https://issues.redhat.com/browse/RHEL-54326

**Reviewer (Member):**
Thanks. GitHub comments can easily get lost. Please add these to the design doc - the fact that they are not done is a reason against the alternative option.

**Author (Contributor):**
Yes, I will rework this design proposal this week and include those links.


Cons:
- Relies on third-party projects adding and maintaining options to store config files on block devices without filesystems
- Requires one PVC per feature for every VM

## Motivation/Goals
- Users want to be able to use any storage class for backend-storage
- Users want all VMs with persistent features to be potentially migratable

## Definition of Users
Any VM owner that wants to use persistent features, such as TPM or EFI

## User Stories
As a user, I want to be able to seamlessly enable persistent features while keeping the migratability of my VMs.

## Repos
kubevirt/kubevirt

# Design
- When creating a new VM with persistent features, we use either (first match wins):
- The storage class specified in the KubeVirt Custom Resource
- The storage class marked as the default for virtualization purposes, unless it has a StorageProfile that shows only block is supported
- The storage class marked as the default for the cluster
- All new persistent storage PVCs will be created with the name `persistent-state-for-<vm_name>-<random_string>` and the new annotation `persistent-state-for-<vm_name>=true`
- When starting an existing VM with persistent features, we will look for any PVC with the annotation, or any PVC with the legacy name, for backwards compatibility
- The volume mode will be Filesystem
- The access mode will be RWO, unless a StorageProfile exists for the Storage Class and shows only RWX is supported
- Note: the CR field documentation will be adjusted to reflect that RWX is no longer needed
- When a migration is created:
- We create a new empty PVC for the target, with the name `persistent-state-for-<vm_name>-<random_string>` and no annotation
- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object (a cleanup sketch follows this list)
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object
- In the unlikely event that a VM shuts down towards the very end of a migration, the migration finalizer will decide which PVC to keep and which one to delete
- **Important note**: with the way the migration controller currently works, this is potentially subject to a race condition. We need to make absolutely sure that cannot happen.
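
Below is a minimal sketch of the success/failure cleanup described above, assuming a plain client-go clientset. The package, function name, and argument list are illustrative assumptions, not the final implementation; only the annotation key and the source/target PVC handling come from the proposal itself.

```go
package backendstorage

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// finalizeMigrationPVCs runs once the migration outcome is known. sourcePVC and
// targetPVC come from the migration status fields proposed in the API section.
func finalizeMigrationPVCs(ctx context.Context, client kubernetes.Interface,
	namespace, vmName, sourcePVC, targetPVC string, succeeded bool) error {
	pvcs := client.CoreV1().PersistentVolumeClaims(namespace)

	if !succeeded {
		// Migration failed: the freshly created target PVC is not needed, delete it.
		return pvcs.Delete(ctx, targetPVC, metav1.DeleteOptions{})
	}

	// Migration succeeded: mark the target PVC as the active backend-storage PVC
	// by setting the proposed annotation, then delete the source PVC.
	annotationKey := fmt.Sprintf("persistent-state-for-%s", vmName)
	patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{"%s":"true"}}}`, annotationKey))
	if _, err := pvcs.Patch(ctx, targetPVC, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return err
	}
	return pvcs.Delete(ctx, sourcePVC, metav1.DeleteOptions{})
}
```

Storing both PVC names in the migration status, rather than re-deriving them, is what allows the same cleanup to be invoked from the migration finalizer in the shutdown-during-migration case mentioned above.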

## API
The only API addition is a couple of status fields in the VirtualMachineMigration object (a sketch follows the list):
- `SourcePersistentStatePVCName`
- `TargetPersistentStatePVCName`
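
For illustration, the addition could look roughly like the following. The struct is an abridged, hypothetical stand-in for the migration object's status type, and the JSON tags are assumptions:

```go
// Abridged sketch of the migration status with the two proposed fields.
type VirtualMachineMigrationStatus struct {
	// ... existing status fields elided ...

	// SourcePersistentStatePVCName is the backend-storage PVC the VM was using
	// before the migration started; it is deleted if the migration succeeds.
	SourcePersistentStatePVCName string `json:"sourcePersistentStatePVCName,omitempty"`

	// TargetPersistentStatePVCName is the freshly created backend-storage PVC for
	// the target virt-launcher; it is deleted if the migration fails.
	TargetPersistentStatePVCName string `json:"targetPersistentStatePVCName,omitempty"`
}
```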

## Scalability
No new scalability concern will be introduced.

## Update/Rollback Compatibility
Since the name of the backend-storage PVC will change, we will keep fallback code to look for the legacy PVC.
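
A minimal sketch of that fallback lookup follows, assuming the legacy PVC name is `persistent-state-for-<vm_name>` without the random suffix; the helper name is hypothetical:

```go
package backendstorage

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// findBackendStoragePVC looks for the VM's backend-storage PVC: first any PVC
// carrying the new annotation, then the legacy fixed-name PVC as a fallback.
func findBackendStoragePVC(ctx context.Context, client kubernetes.Interface,
	namespace, vmName string) (*corev1.PersistentVolumeClaim, error) {
	pvcs := client.CoreV1().PersistentVolumeClaims(namespace)
	annotationKey := fmt.Sprintf("persistent-state-for-%s", vmName)

	// Annotations cannot be selected server-side, so list and filter locally.
	list, err := pvcs.List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for i := range list.Items {
		if list.Items[i].Annotations[annotationKey] == "true" {
			return &list.Items[i], nil
		}
	}

	// Fallback for VMs created before this change: the legacy fixed-name PVC
	// (assumed here to be the annotation key itself, i.e. no random suffix).
	legacy, err := pvcs.Get(ctx, annotationKey, metav1.GetOptions{})
	if errors.IsNotFound(err) {
		return nil, nil // no backend-storage PVC exists yet
	}
	return legacy, err
}
```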

## Functional Testing Approach
All existing backend-storage-related functional tests will still apply. More tests could be added to ensure all 4 combinations
of FS/block RWX/RWO work for backend-storage, if we think it's worth the additional stress on the e2e test lanes.

# Implementation Phases
The first phase is block support; that effort is already well underway.
The second phase is changing the PVC handling; it could be done as part of the first-phase PR or as a separate one.
Either way, it's important that both phases land in the same KubeVirt release.