design proposal: new migration process for backend-storage PVCs #314

Open
wants to merge 4 commits into base: main

Conversation

jean-edouard (Contributor)

@kubevirt-bot added the dco-signoff: yes label (indicates the PR's author has DCO signed all their commits) on Jul 23, 2024
@dankenigsberg (Member) left a comment

Hi @jean-edouard! Thanks for this much-needed work. I have a few questions whose answers could be incorporated into the proposal.

- The storage class marked as the default for the cluster
- All new persistent storage PVCs will be created with the name `persistent-state-for-<vm_name>-<random_string>` and the new annotation `persistent-state-for-<vm_name>=true`
- When starting an existing VM with persistent features, we will look for any PVC with the annotation, or any PVC with the legacy name, for backwards compatibility
- For modes, the filesystem volume mode will be preferred (less complexity), and we default to RWO (because RWX won't be needed anymore)
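For illustration, here is a minimal sketch of what such a backend-storage PVC could look like for a hypothetical VM named `vm-a` (the random suffix, annotation value, modes and size are illustrative assumptions, not final API):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: persistent-state-for-vm-a-x7k2q   # persistent-state-for-<vm_name>-<random_string>
  annotations:
    persistent-state-for-vm-a: "true"     # marks this PVC as the active backend storage for vm-a
spec:
  accessModes:
    - ReadWriteOnce                        # proposed default access mode
  volumeMode: Filesystem                   # proposed default volume mode
  resources:
    requests:
      storage: 10Mi                        # size is illustrative
```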
Member

We may be better off deferring to CDI logic, using the StorageProfile of the selected storage class.

Contributor Author

Yes, the storage profile will definitely be used, but if multiple access modes are available we still need to pick one

- For modes, the filesystem volume mode will be preferred (less complexity), and we default to RWO (because RWX won't be needed anymore)
- On migration, no matter what volume/access mode the PVC uses, we create a new empty PVC with the name `persistent-state-for-<vm_name>-<random_string>` and no annotation
- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object
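As a sketch of the migration flow described above, the migration object could track the two PVCs roughly like this (the status field names below are hypothetical; the real names belong to the API section of the proposal):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migration-vm-a
spec:
  vmiName: vm-a
status:
  migrationState:
    # Hypothetical field names, shown only to illustrate the bookkeeping:
    sourcePersistentStatePVCName: persistent-state-for-vm-a-x7k2q   # deleted if the migration succeeds
    targetPersistentStatePVCName: persistent-state-for-vm-a-p9d4m   # annotated on success, deleted on failure
```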
Member

This sounds racy. What happens if the migration succeeds and the VM crashes before we move over the label?

Contributor Author

This is why status fields are added to the migration object. The migration won't move to the finished state as long as the PVCs are not sorted out

Member

Please include a deeper discussion of the mode of failure. I don't understand how you can coordinate between the source and destination without interaction with qemu.


This is the best we can do with the current implementation, which involves mounting the same partition on migration source and target.
This proposal is about switching to a different model which would instead create a new backend-storage PVC on migration,
allowing all 4 cases above to have proper migration support.
Member

why do we care about migratability with RWO storage? We don't support it for root disk. Would it simplify our solution if we assume RWX, so that the same PVC is accessible both on source and destination nodes?

Contributor Author

The point of this proposal is to stop mounting the PVC on both source and target, to enable migratability of block-based backend-storage.

Member

I do not understand why we are interested in RWO. If we have only RWO, we cannot migrate the root disk. I don't understand why we conflate this proposal (and future tests) with this combination (migration+RWO).

Member

The backend-storage may be set to a different storage class than the other disks.
What is being discussed here is that, since the content of this volume will be copied, the access mode of that volume becomes irrelevant and might as well be RWO.

Member

Thinking about RWO and discussing it makes this proposal more complex and slower to test. What is the benefit of it?

Member

From my side, we can ignore it. However, in many cases RWO is the default access mode. We don't have to test or support this.

@mhenriks (Member) Jul 30, 2024

What about containerdisk VMs? They can be live migrated, correct? Can we use emptydir volumes for backend storage in that case?

EDIT: obviously a weird case, probably not necessary to support, at least initially.

Contributor Author

Using emptydir for backend storage is effectively what happens for non-persistent TPM/EFI.
If we want persistence, we need a PVC.
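For context, a persistent feature is requested in the VMI spec along these lines; a minimal sketch assuming the existing `tpm.persistent`/`efi.persistent` knobs (VM name, image and memory are illustrative):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vm-a
spec:
  domain:
    memory:
      guest: 1Gi
    devices:
      disks:
        - name: containerdisk
          disk:
            bus: virtio
      tpm:
        persistent: true        # vTPM state must survive restarts, hence the backend-storage PVC
    firmware:
      bootloader:
        efi:
          secureBoot: false     # kept off in this sketch to avoid the SMM requirement
          persistent: true      # persistent EFI NVRAM, stored on the same backend-storage PVC
  volumes:
    - name: containerdisk
      containerDisk:
        image: quay.io/containerdisks/fedora:latest   # illustrative image
```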

- RWO FS: supported by backend-storage but makes VMs non-migratable
- RWX FS: preferred combination, supported by backend-storage and compatible with VM migrations
- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way
- RWX block: same as above, with migratability functional but considered unsafe (as it concurrently mounts an ext3 partition)
Member

Can you provide some background on why this is the case?
We support RWX block for the root disk with no problem. Why does vTPM require creating a filesystem on the storage? Have we requested qemu/libvirt to provide support for filesystem-free vTPM persistence?

In other words: does libvirt have the same problem? If the only shared storage a user has is block, can they use vTPM+migration?

Contributor Author

This is because, unlike with VM disks, we have 2 components concurrently accessing the PVC (source and target virt-launchers).
They both maintain the filesystem metadata for the mounted partition, which will lead to conflicts. I can add more verbiage to explain that in the doc.

Member

I can add more verbiage to explain that in the doc.

Please do. In particular, please refer to my question:

Does libvirt have the same problem? If the only shared storage a user has is block, can they use vTPM+migration?

Member

Currently the TPM implementation uses multiple files under the hood.

This means one of two things must be true. Either some component in the stack must manage a filesystem on the block device or the TPM implementation must be rewritten from the ground up. The latter doesn't seem like a fair request to the underlying projects. At least not in a timeline that suits our needs.

The filesystem being used is ext3, so it can either be mounted on the source or destination only. These mounts are mutually exclusive with each other regardless of whether or not the PVC is using RWX mode. This is simply a limitation of ext3 (or most filesystems for that matter).

Given these constraints, KubeVirt is the best component to manage this problem, because it can manage PVCs where libvirt cannot, since libvirt runs inside a container.

Member

@stu-gott do I understand correctly that your answer to my question is affirmative? libvirt has the same problem and cannot use shared block+vTPM+migration?

Have we asked them to support this combination?
Do they not mind?

Member

@dankenigsberg I answered in another comment. The answer is, to the best of my knowledge, no.
swtpm, qemu and/or libvirt don't support saving tpm state on block storage and there is no plan to support this.
If we want to support this scenario in KubeVirt we will need to provide the management around this.

- Users want all VMs with persistent features to be potentially migratable

## Definition of Users
Any VM user that wants to use persistent features, such as TPM or EFI
Member

Suggested change
Any VM user that wants to use persistent features, such as TPM or EFI
Any VM owner that wants to use persistent features, such as TPM or EFI

defining "user" via the word "user" sounds circular to me.


## Functional Testing Approach
All existing backend-storage-related functional tests will still apply. More tests could be added to ensure all 4 combinations
of FS/block RWX/RWO work for backend-storage, if we think it's worth the additional stress on the e2e test lanes.
Member

Migration tests are already quite time-consuming. Can a vTPM device be added to existing migrating VMs so as not to make the test run longer?

Contributor Author

Yes, this is definitely something to keep in mind when adding functional test coverage for this. It's indeed what I was trying to hint at there

@jean-edouard (Contributor Author)

Hi @jean-edouard! Thanks for this much-needed work. I have a few questions whose answers could be incorporated into the proposal.

Thank you for the review! Comments answered on GitHub; I will amend the design proposal to reflect them.

- RWO FS: supported by backend-storage but makes VMs non-migratable
- RWX FS: preferred combination, supported by backend-storage and compatible with VM migrations
- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way
- RWX block: same as above, with migratability functional but considered unsafe (as it concurrently mounts an ext3 partition)
Member

I can add more verbiage to explain that in the doc.

Please do. In particular, please refer to my question:

Does libvirt have the same problem? If the only shared storage a user has is block, can they use vTPM+migration?


This is the best we can do with the current implementation, which involves mounting the same partition on migration source and target.
This proposal is about switching to a different model which would instead create a new backend-storage PVC on migration,
allowing all 4 cases above to have proper migration support.
Member

I do not understand why we are interested in RWO. If we have only RWO, we cannot migrate the root disk. I don't understand why we conflate this proposal (and future tests) with this combination (migration+RWO).

- For modes, the filesystem volume mode will be preferred (less complexity), and we default to RWO (because RWX won't be needed anymore)
- On migration, no matter what volume/access mode the PVC uses, we create a new empty PVC with the name `persistent-state-for-<vm_name>-<random_string>` and no annotation
- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object
Member

Please include a deeper discussion of the mode of failure. I don't understand how you can coordinate between the source and destination without interaction with qemu.

Signed-off-by: Jed Lejosne <jed@redhat.com>

This is the best we can do with the current implementation, which involves mounting the same partition on migration source and target.
This proposal is about switching to a different model which would instead create a new backend-storage PVC on migration,
allowing all 4 cases above to have proper migration support.
Member

Thinking about RWO and discussing it makes this proposal more complex and slower to test. What is the benefit of it?

- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object
- In the unlikely event that a VM shuts down towards the very end of a migration, the migration finalizer will decide which PVC to keep and which one to get rid of
Member

how could the finalizer make this decision? I don't think it has enough information.

Member

The finalizer itself doesn't solve this issue. It only makes sure that the migration object won't get deleted before the correct volume is annotated by the controller.

However, there is still a race that this proposal will need to address.

There is a short period after the migration is finalized on the host but before the controller has annotated the correct volume. The VM could shut down and start again, picking up an incorrect volume.

I think that this issue can be addressed by simply re-queueing the newly created VMI object if it has an unfinalized migration.
The fact that a new VM can start while it still has an unfinished migration is an issue of its own.

Member

I believe that the existence and nature of these race conditions should be highlighted in this proposal. We should discuss means to eliminate them.

Member

Yes, I agree. @jean-edouard could you please highlight this race and the suggested solution?

- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way
- RWX block: same as above, with migratability functional but considered unsafe, because it leads to 2 pods (source and target virt-launchers) mounting the same ext3 filesystem concurrently.

This is the best we can do with the current implementation, which involves mounting the same partition on migration source and target.
Member

Would you elaborate on partitioning? I see "partitioning complexity" below, but the significance is not clear to me.

What do you mean by "current implementation"? KubeVirt's? libvirt's? Have we considered other (possibly better, yet more intrusive) implementations?

Member

swtpm, qemu and/or libvirt don't support saving TPM state on block storage.
libvirt supports saving this state in an in-memory secret, but it won't be migratable.
Live migration is only supported with state saved on shared storage.
There is no plan to support any other implementation. To the best of my knowledge.

To support block storage as backing for vTPM, KubeVirt would need to have access to the volume in the swtpm container, build a filesystem on top, and mount it.
The issues arise when KubeVirt attempts to mount this volume on both the source and the target containers.

I hope this answers your question.

This proposal suggests creating two separate block volumes and allowing qemu to copy the state; however, it will require KubeVirt to manage the volumes on migration, which creates other challenges to overcome.

Member

There is no plan to support any other implementation. To the best of my knowledge.

Would you elaborate on this? I don't suppose that KubeVirt is the only virtualization management project that cares about block storage+tpm+migration. Can we point to a request to libvirt to support this, with the reasoning for its rejection?

@vladikr (Member) Jul 29, 2024

To my knowledge, other platforms are fine with a shared filesystem volume for migration, or don't support it at all [1].

IIRC, we asked. @jean-edouard @stu-gott @mhenriks @alicefr could you please add your thoughts here?

OpenStack Nova: "Live migration, evacuation, shelving and rescuing of servers with vTPMs is not currently supported."
[1] https://docs.openstack.org/nova/latest/admin/emulated-tpm.html

Member

@xpivarc pointed out that it looks like swtpm supports block storage, so I take my previous statement back until we figure out if we can use this option.

Contributor Author

Indeed, with --tpmstate backend-uri=file://, swtpm can serialize its files into something like a block storage device.
This option is, however, not supported by libvirt, and it also doesn't work for EFI.

- RWO FS: supported by backend-storage but makes VMs non-migratable
- RWX FS: preferred combination, supported by backend-storage and compatible with VM migrations
- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way
- RWX block: same as above, with migratability functional but considered unsafe (as it concurrently mounts an ext3 partition)
Member

@stu-gott do I understand correctly that your answer to my question is affirmative? libvirt has the same problem and cannot use shared block+vTPM+migration?

Have we asked them to support this combination?
Do they not mind?

- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way
- RWX block: same as above, with migratability functional but considered unsafe, because it leads to 2 pods (source and target virt-launchers) mounting the same ext3 filesystem concurrently.

This is the best we can do with the current implementation, which involves mounting the same partition on migration source and target.
Member

There is no plan to support any other implementation. To the best of my knowledge.

Would you elaborate on this? I don't suppose that KubeVirt is the only virtualization management project that cares about block storage+tpm+migration. Can we point to a request to libvirt to support this, with the reasoning for its rejection?

- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object
- In the unlikely event that a VM shuts down towards the very end of a migration, the migration finalizer will decide which PVC to keep and which one to get rid of
Member

I believe that the existence and nature of these race conditions should be highlighted in this proposal. We should discuss means to eliminate them.

@jean-edouard (Contributor Author)

This could also be a concern: https://github.com/libvirt/libvirt/blob/e40a533118efdc3d5f6fd1dcc9bb8143b54691e2/src/qemu/qemu_tpm.c#L571
(if the TPM is stored under a "shared" (nfs/ceph) storage, it looks like it's automatically considered as shared between the source and the target)

- All new persistent storage PVCs will be created with the name `persistent-state-for-<vm_name>-<random_string>` and the new annotation `persistent-state-for-<vm_name>=true`
- When starting an existing VM with persistent features, we will look for any PVC with the annotation, or any PVC with the legacy name, for backwards compatibility
- For volume/access modes:
- When multiple volume modes are listed in the StorageProfile of the StorageClass, we will default to filesystem (to avoid partitioning complexity)
Member

I think we can avoid partitioning entirely. Any StorageClass that supports block should natively support filesystem (like rook-ceph-block)
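To illustrate that point, the same (hypothetical) `rook-ceph-block` StorageClass could back both a filesystem-mode RWO claim for backend-storage and a block-mode RWX claim for a migratable root disk (names and sizes are made up):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: persistent-state-for-vm-a-x7k2q
spec:
  storageClassName: rook-ceph-block
  volumeMode: Filesystem        # backend-storage: small filesystem claim
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Mi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-a-rootdisk
spec:
  storageClassName: rook-ceph-block
  volumeMode: Block             # root disk: shared block claim, migratable
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
```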

Contributor Author

That's huge. Block support is messy and somewhat illogical. Reducing this design to only add RWO FS support would make everything so much simpler and sensible.

- For volume/access modes:
- When multiple volume modes are listed in the StorageProfile of the StorageClass, we will default to filesystem (to avoid partitioning complexity)
- When multiple access modes are listed for the selected volume mode, we will arbitrarily default to RWO.
- When CDI is not present, or the StorageClass has no StorageProfile, or the StorageProfile has no modes, we'll default to RWO FS.
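As a sketch of that selection logic, the CDI StorageProfile of the chosen StorageClass could be consulted roughly like this (the profile content below is an illustrative assumption):

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: StorageProfile
metadata:
  name: rook-ceph-block          # StorageProfiles are named after their StorageClass
status:
  claimPropertySets:
    # Backend-storage would prefer a Filesystem entry, then default to RWO among its access modes.
    - accessModes:
        - ReadWriteOnce
      volumeMode: Filesystem
    - accessModes:
        - ReadWriteMany
      volumeMode: Block
```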
Member

RWO FS is the lowest common denominator in k8s; you will be hard-pressed to find a provisioner that does not support it. Not sure we should even consider any other combinations.

Member

RWO FS is the lowest common denominator in k8s; you will be hard-pressed to find a provisioner that does not support it.

That's a very important note. How common is RWX FS, which we already support with migration?

There may be operational problems with splitting VM persistence between several storage classes (disaster recovery and storage migration may become more complex, virt-default StorageClass) which may keep this proposal useful.

Member

That's a very important note. How common is RWX FS, which we already support with migration?

RWX FS is not very popular. Shared filesystems have been declining in popularity since object storage entered the scene; they are generally considered complex and slow, used mainly for legacy applications, and avoided whenever possible.

There may be operational problems with splitting VM persistence between several storage classes (disaster recovery and storage migration may become more complex, virt-default StorageClass) which may keep this proposal useful.

I'm not following this. Where's the splitting happening? My point is that the same storageclass can be used for block and fs. For example the rook-ceph-block StorageClass can do RWX block and RWO fs.

Member

For example the rook-ceph-block StorageClass can do RWX block and RWO fs.

I didn't know that. Thanks.

So to support TPM on rook-ceph-block we have two options: either ask libvirt to support RWX block, or add our own support for RWO file. This proposal should discuss the pros and cons of each.

@jean-edouard (Contributor Author) Jul 31, 2024

Well this proposal is all about option 2, but I guess we could touch on option 1 in the intro.
A brief pros/cons:
Option 1 requires some work on our side (multi-PVC support) and a lot of work on libvirt's (add support for swtpm block + work with qemu to design and implement EFI block). The main con IMO is the unreasonable ask of libvirt/qemu to support pre-1961 technology (disks without partitions/filesystems).
Option 2 just requires implementing this design proposal, including ensuring there's no race when rebooting during a migration. Seems a lot saner to me, even if it's more work for us.

Edit: merge request opened for libvirt swtpm block support: https://gitlab.com/libvirt/libvirt/-/merge_requests/380

Member

work with qemu to design and implement EFI block

Seems like a big chunk of work for another team and would ultimately require 2 PVCs for VMs with EFI+TPM, correct? With filesystem PVCs, TPM and EFI can share a single PVC, right?

Contributor Author

Exactly yes

Member

Fewer PVCs is a big pro for option 2.

of FS/block RWX/RWO work for backend-storage, if we think it's worth the additional stress on the e2e test lanes.

# Implementation Phases
First phase is block support. That effort is well underway already.
Member

I don't think it is worth investing in block PVC support. What provisioner only supports block PVCs?

@mhenriks (Member)

This could also be a concern: https://github.com/libvirt/libvirt/blob/e40a533118efdc3d5f6fd1dcc9bb8143b54691e2/src/qemu/qemu_tpm.c#L571 (if the TPM is stored under a "shared" (nfs/ceph) storage, it looks like it's automatically considered as shared between the source and the target)

Interesting. So it could be possible to ask for an RWO Filesystem PVC, get NFS, and at migration time the data would not be copied to the new PVC?

I believe we've encountered a similar problem before, but in reverse: a user requests an RWX PVC but gets ext4 because the VM is scheduled on the same node as the NFS server. I believe LD_PRELOAD hacks were used to get around the issue until there was a proper fix.

…support out of scope

Signed-off-by: Jed Lejosne <jed@redhat.com>

Pros:
- Less code to write and maintain in KubeVirt
- No need to duplicate backend-storage PVCs on migrations
Member

Suggested change
- No need to duplicate backend-storage PVCs on migrations
- No need to duplicate backend-storage PVCs on migrations and delete the correct one after migration success/failure
- libvirt gains an important feature

Member

Please file a public request on libvirt to add this functionality, and include the URL to it here. The simplest way to "kill" this alternative is if libvirt rejects it. I would think that this would be a mistake on their part, but that's their question to answer.

Contributor Author

I discussed this with a few members of the libvirt team, and they actually don't appear to have strong objections to it.
There would be 4 parts to this:

  • block storage support in swtpm: already done
  • libvirt glue for swtpm block storage: draft PR open (https://gitlab.com/libvirt/libvirt/-/merge_requests/380)
  • block storage support in qemu for NVRAM: could work out of the box, since the NVRAM is already just 1 file (just have to ensure they don't run file-specific operations on it)
  • libvirt glue for NVRAM block storage: could also work out of the box, although unlikely, but shouldn't require too much work (just ensuring the "file" is never moved/deleted/chmoded...)

So really not too much left to do overall; however, we would have to wait for the first release of every component that needs changes, which could take a long time.
This approach is also definitely not my first choice, and I would hate to request something, have upstream devs work hard on it and ultimately not consume it...
That's why I didn't file any request yet.

Member

This approach is also definitely not my first choice

Please make sure to spell out all your reasoning as its "cons". For example, I did not see the issue of timing written in the proposal.

That's why I didn't file any request yet.

Let's have a formal request and link to it here. We should be explicit that, because of the reasons spelled out here, we may not use it, or may not use it immediately. But let us give the libvirt team an opportunity to refute our assumptions; maybe they can deliver this functionality fast?

Contributor Author

Please file a public request on libvirt to add this functionality, and include the URL to it here.

https://issues.redhat.com/browse/RHEL-54325
https://issues.redhat.com/browse/RHEL-54326

Member

Thanks. GitHub comments can easily get lost. Please add these to the design doc; the fact that they are not done is a reason against the alternative option.

Contributor Author

Yes, I will rework this design proposal this week and include those links.

- RWO FS: supported by backend-storage but makes VMs non-migratable. **This entire design proposal is about removing that limitation.**
- RWX FS: preferred combination, supported by backend-storage and compatible with VM migrations
- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way
- RWX block: same as above, with migratability functional but considered unsafe, because it leads to 2 pods (source and target virt-launchers) mounting the same ext3 filesystem concurrently.
@alicefr (Member) Aug 14, 2024

Sorry, I'm a bit late on this. Have we ever considered using a clustered filesystem like gfs2 instead of ext3?

Contributor Author

That's a really good question. gfs2 somehow didn't pop up when I looked for potentially suitable filesystems for formatting the block storage, probably worth investigating!
That being said, this proposal (and even the alternative it contains) doesn't consider the option of formatting the block storage (anymore), and not just because I couldn't find a suitable filesystem.
The fact that it requires adding a bunch of tools to virt-launcher and virt-handler, and the general complexity of managing partitions ourselves, make it something we should avoid, in my opinion (see draft PR).

@alicefr (Member) Aug 15, 2024

Yes, I agree. I read a bit about gfs2 as supported by Red Hat, and it requires clvm, so I think that also excludes it as a solution.

Signed-off-by: Jed Lejosne <jed@redhat.com>
- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way
- RWX block: same as above, with migratability functional but considered unsafe, because it leads to 2 pods (source and target virt-launchers) mounting the same ext3 filesystem concurrently.

The above is the best we can do with the current implementation, which involves mounting the same partition on migration source and target.
Member

Suggested change
The above is the best we can do with the current implementation, which involves mounting the same partition on migration source and target.
The above is the best we can do with the current libvirt implementation, which involves mounting the same partition on migration source and target.

I'm not at all sure I understand to which "current implementation" you refer.

Contributor Author

I'll rephrase, what I mean is "without partitioning block / creating new PVCs on migration / libvirt block support"

- Libvirt gains a feature

Cons:
- Relies on third party projects adding and maintaining options to store config files into block devices without filesystems
Member

Suggested change
- Relies on third party projects adding and maintaining options to store config files into block devices without filesystems
- Relies on libvirt and qemu adding and maintaining options to store config files into block devices without filesystems

are these the third parties you refer to?

Comment on lines +34 to +36
- We would have to either
- Maintain 2 different backend-storage implementations (the existing RWX FS one and this RWX block one)
- Deprecate RWX FS and require RWX block to enable backend-storage
Member

I don't understand this point. In the Primary Design, what do you plan to do with the existing code referring to RWX FS?

Contributor Author

With the primary design, we keep support for RWX FS and extend it to support RWO FS. Since both use virtually the same codepath, it's just one thing to maintain.
However, RWX block would need a completely different approach, so if we want to avoid maintaining 2 different ways to do the same thing we'd have to deprecate RWX FS.

Member

I don't know the code, but ideally, by delegating the trouble to libvirt, the only difference between RWX FS and RWX Block would be passing a block device to libvirt instead of a filesystem path.

In the primary implementation, we need to add code to synchronize between source and destination during success and various modes of failures.

Contributor Author

I don't know the code, but ideally, by delegating the trouble to libvirt, the only difference between RWX FS and RWX Block would be passing a block device to libvirt instead of a filesystem path.

Yes, well multiple block devices. That means changing every function that touches backend-storage to handle multiple PVCs.
It would also require changing the volumes and mounts we pass to virt-launcher. Finally the libvirt options would indeed look different.
That is a non-trivial amount of differences.

In the primary implementation, we need to add code to synchronize between source and destination during success and various modes of failures.

Yes, before we can mark a (RWO) migration as completed we have to add a label to a PVC and delete another one. It is an extra step, and we have to make sure it's done right, but it's not much code at all.

@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jean-edouard. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dankenigsberg (Member) left a comment

The toughest issue in the primary solution is handling migration and VM failures (and the reason I prefer the alternative solution). Would you elaborate on that issue?

- We create a new empty PVC for the target, with the name `persistent-state-for-<vm_name>-<random_string>` and no annotation
- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object
Member

Does this hold true also with post-copy migration?

@jean-edouard (Contributor Author) Oct 1, 2024

I believe so. In a failed post-copy migration, I don't think we can rely on the state of the target, but we know the source is sane.

Member

I think we should be more certain about this. The guest os is running on the destination host. I assume that it makes changes to the storage there and not to the source - but please check this out.

Contributor Author

Your concern is valid, but it's just part of the dangers of post-copy. Yes, if a guest is undergoing a post-copy migration and, after the handoff, makes changes to the TPM, and then someone trips on the network cable before the end of the migration, the changes made to the TPM will be lost. But that's exactly why 1) post-copy comes with a big warning and 2) BitLocker strongly encourages you to write down your recovery key.
You can also imagine the opposite scenario, where we decided to keep the target PVC and the (post-handoff post-copy) migration failed right during a change to the TPM. The PVC on the target would now contain a corrupt state and switching to it would break the VM (that's what I was referring to above). The same concern applies with RWX, by the way.
Post-copy migration might break stuff; I don't think there's any way around it...


if a guest is undergoing a post-copy migration and, after the handoff, makes changes to the TPM, and then someone trips on the network cable before the end of the migration, the changes made to the TPM will be lost.

That is not quite correct from the QEMU/libvirt POV. In the event of failed post-copy, both the source and dest QEMU will remain in existence, but both in paused state, since neither has the full state in isolation.

When the network problem is resolved, you can call virDomainMigrateStartPostCopy with the VIR_MIGRATE_POSTCOPY_RESUME flag, to recover from the problem and complete the migration, without data loss.

- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object
- In the unlikely event that a VM shuts down towards the very end of a migration, the migration finalizer will decide which PVC to keep and which one to get rid of
Member

How would the migration finalizer know which of the two PVCs is the correct one?

Contributor Author

This needs to be reworded. If the VMI disappears (i.e. a shutdown), the migration is automatically failed, and we fall into the bullet point above.

Signed-off-by: Jed Lejosne <jed@redhat.com>
- We create a new empty PVC for the target, with the name `persistent-state-for-<vm_name>-<random_string>` and no annotation
- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object
Member

I think we should be more certain about this. The guest os is running on the destination host. I assume that it makes changes to the storage there and not to the source - but please check this out.

- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object
- In the unlikely event that a VM shuts down during a migration, the migration will be considered failed and the target PVC will get discarded
Member

What if, right after the migration ends successfully at the libvirt level, the VM makes a change to the TPM and shuts down? Discarding the target PVC would lose that data.

Contributor Author

Yes, see #314 (comment)

Member

I don't find this acceptable.

Contributor Author

This is not a KubeVirt problem, same thing would happen on bare-metal... If one doesn't find the risks of post-copy acceptable, they should just not use it.
What we can and should do is document those risks thoroughly.

Contributor Author

What if, right after the migration ends successfully at the libvirt level, the VM makes a change to the TPM and shuts down? Discarding the target PVC would lose that data.

Sorry I misread your question. If a libvirt migration ends successfully then the KubeVirt migration will also end successfully and all will be fine.
I read something like "What if right after migration handoff [...]", my bad.

Contributor Author

I see. To repeat back and summarize, if all of the following are true, we have a problem:

  • We're inside the (at most 400ms) window when the libvirt migration is finished but KubeVirt doesn't know yet
  • The source pod/node crashes
  • The VM writes to the TPM

Then the VMIM will fail and that TPM write will be dropped.
Honestly, to me, this scenario feels statistically near-impossible...
If we want to address this, do you have a suggestion for how?
I brainstormed this with @stu-gott and this is the only thing we could come up with:

  • Libvirt+swtpm implement a state lock that's automatically enabled at the end of a migration
  • Once KubeVirt sees that the libvirt migration succeeded, it instructs libvirt to release that lock

If that makes sense to you we can discuss it with the libvirt team.
However, would you agree to do that as a follow-up effort?

Thanks!

Member

The 400ms is only the time it takes for the migration monitor to update the domain metadata.
Then there is also the time required to propagate the migration status to other kubevirt components/objects.
Overall, it's possible that this can be more than just a statistically insignificant issue.

I didn't have much time to think this through.
However, if we are going to continue with this effort we need to come up with a mechanism that will allow us to choose the right volume on the next VM startup.
I don't think that locking is enough - I don't see how it solves the issue. It doesn't look like there is a competing process that can write into the source volume.

Instead of discarding the target volume for unfinalized migrations, we should keep it around.
Then, on the next VM startup we can select the volume with the latest timestamp.

Contributor Author

Instead of discarding the target volume for unfinalized migrations, we should keep it around.
Then, on the next VM startup we can select the volume with the latest timestamp.

We can easily change the code to keep target volumes around for migrations that didn't succeed.
However, that timestamp thing doesn't make sense to me. We have no idea what the state of the target PVC is as long as we don't see the migration succeed. It could very well be empty or partially populated. What we could do is keep it around and clean it up the next time we migrate the VM. That way it's there for recovery purposes in case the highly unlikely scenario above occurs.

Member

Instead of discarding the target volume for unfinalized migrations, we should keep it around.
Then, on the next VM startup we can select the volume with the latest timestamp.

We can easily change the code to keep target volumes around for migrations that didn't succeed.

For migrations that didn't reach the completed status - not for any failed migration.

However, that timestamp thing doesn't make sense to me. We have no idea what the state of the target PVC is as long as we don't see the migration succeed. It could very well be empty or partially populated. What we could do is keep it around and clean it up the next time we migrate the VM. That way it's there for recovery purposes in case the highly unlikely scenario above occurs.

We are only speaking about the scenario where the VM is on the target and is writing to the target PVC while the source is crashed/unavailable. There should be something on the target, as the migration has actually been completed (only the source doesn't know about it and didn't update all the other components).

I feel that there is a disconnect, let's take it offline when I'm back.

Member

@alicefr This issue and kubevirt/kubevirt#13112 may affect the already implemented volume migration.

@xpivarc (Member)

xpivarc commented Nov 18, 2024

/sig compute
