design proposal: new migration process for backend-storage PVCs #314

Open
wants to merge 4 commits into base: main

Conversation

jean-edouard (Contributor)

@kubevirt-bot added the dco-signoff: yes label (indicates the PR's author has DCO signed all their commits) on Jul 23, 2024
@dankenigsberg (Member) left a comment

Hi @jean-edouard! Thanks for this much-needed work. I have a few questions whose answers could be incorporated into the proposal.

- The storage class marked as the default for the cluster
- All new persistent storage PVCs will be created with the name `persistent-state-for-<vm_name>-<random_string>` and the new annotation `persistent-state-for-<vm_name>=true`
- When starting an existing VM with persistent features, we will look for any PVC with the annotation, or any PVC with the legacy name, for backwards compatibility
- For modes, the filesystem volume mode will be preferred (less complexity), and we default to RWO (because RWX won't be needed anymore)
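For illustration, here is a minimal sketch of what such a backend-storage PVC could look like for a hypothetical VM named `vm-a` (the random suffix, annotation value, modes and size are illustrative assumptions, not final API):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: persistent-state-for-vm-a-x7k2q   # persistent-state-for-<vm_name>-<random_string>
  annotations:
    persistent-state-for-vm-a: "true"     # marks this PVC as the active backend storage for vm-a
spec:
  accessModes:
    - ReadWriteOnce                        # proposed default access mode
  volumeMode: Filesystem                   # proposed default volume mode
  resources:
    requests:
      storage: 10Mi                        # size is illustrative
```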
Member

We may be better off deferring to CDI logic, using the StorageProfile of the selected storage class.

Contributor Author

Yes, the storage profile will definitely be used, but if multiple access modes are available we still need to pick one

- For modes, the filesystem volume mode will be preferred (less complexity), and we default to RWO (because RWX won't be needed anymore)
- On migration, no matter what volume/access mode the PVC uses, we create a new empty PVC with the name `persistent-state-for-<vm_name>-<random_string>` and no annotation
- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object
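As a sketch of the migration flow described above, the migration object could track the two PVCs roughly like this (the status field names below are hypothetical; the real names belong to the API section of the proposal):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migration-vm-a
spec:
  vmiName: vm-a
status:
  migrationState:
    # Hypothetical field names, shown only to illustrate the bookkeeping:
    sourcePersistentStatePVCName: persistent-state-for-vm-a-x7k2q   # deleted if the migration succeeds
    targetPersistentStatePVCName: persistent-state-for-vm-a-p9d4m   # annotated on success, deleted on failure
```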
Member

This sounds racy. What happens if the migration succeeds and the VM crashes before we move over the label?

Contributor Author

This is why status fields are added to the migration object. The migration won't move to the finished state as long as the PVCs are not sorted out

Member

Please include a deeper discussion of the mode of failure. I don't understand how you can coordinate between the source and destination without interaction with qemu.


This is the best we can do with the current implementation, which involves mounting the same partition on migration source and target.
This proposal is about switching to a different model which would instead create a new backend-storage PVC on migration,
allowing all 4 cases above to have proper migration support.
Member

why do we care about migratability with RWO storage? We don't support it for root disk. Would it simplify our solution if we assume RWX, so that the same PVC is accessible both on source and destination nodes?

Contributor Author

The point of this proposal is to stop mounting the PVC on both source and target, to enable migratability of block-based backend-storage.

Member

I do not understand why we are interested in RWO. If we have only RWO, we cannot migrate the root disk. I don't understand why we conflate this proposal (and future tests) with this combination (migration+RWO).

Member

The backend-storage may be set to a different storage class than the other disks.
What is being discussed here is that, since the content of this volume will be copied, the access mode of that volume becomes irrelevant and might as well be RWO.

Member

Thinking about RWO and discussing it makes this proposal more complex and slower to test. What is the benefit of it?

Member

From my side, we can ignore it. However, in many cases RWO is the default access mode. We don't have to test or support this.

@mhenriks (Member) Jul 30, 2024

What about containerdisk VMs? They can be live migrated, correct? Can we use emptydir volumes for backend storage in that case?

EDIT: obviously a weird case, probably not necessary to support, at least initially.

Contributor Author

Using emptydir for backend storage is effectively what happens for non-persistent TPM/EFI.
If we want persistence, we need a PVC.
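For context, a persistent feature is requested in the VMI spec along these lines; a minimal sketch assuming the existing `tpm.persistent`/`efi.persistent` knobs (VM name, image and memory are illustrative):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vm-a
spec:
  domain:
    memory:
      guest: 1Gi
    devices:
      disks:
        - name: containerdisk
          disk:
            bus: virtio
      tpm:
        persistent: true        # vTPM state must survive restarts, hence the backend-storage PVC
    firmware:
      bootloader:
        efi:
          secureBoot: false     # kept off in this sketch to avoid the SMM requirement
          persistent: true      # persistent EFI NVRAM, stored on the same backend-storage PVC
  volumes:
    - name: containerdisk
      containerDisk:
        image: quay.io/containerdisks/fedora:latest   # illustrative image
```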

- RWO FS: supported by backend-storage but makes VMs non-migratable
- RWX FS: preferred combination, supported by backend-storage and compatible with VM migrations
- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way
- RWX block: same as above, with migratability functional but considered unsafe (as it concurrently mounts an ext3 partition)
Member

Can you provide some background on why this is the case?
We support RWX block for the root disk with no problem. Why does vTPM require creating a filesystem on the storage? Have we requested qemu/libvirt to provide support for filesystem-free vTPM persistence?

In other words: does libvirt have the same problem? If the only shared storage a user has is block, can they use vTPM+migration?

Contributor Author

This is because, unlike with VM disks, we have 2 components concurrently accessing the PVC (source and target virt-launchers).
They both maintain the filesystem metadata for the mounted partition, which will lead to conflicts. I can add more verbiage to explain that in the doc.

Member

I can add more verbiage to explain that in the doc.

Please do. In particular, please refer to my question:

Does libvirt have the same problem? If the only shared storage a user has is block, can they use vTPM+migration?

Member

Currently the TPM implementation uses multiple files under the hood.

This means one of two things must be true. Either some component in the stack must manage a filesystem on the block device or the TPM implementation must be rewritten from the ground up. The latter doesn't seem like a fair request to the underlying projects. At least not in a timeline that suits our needs.

The filesystem being used is ext3, so it can either be mounted on the source or destination only. These mounts are mutually exclusive with each other regardless of whether or not the PVC is using RWX mode. This is simply a limitation of ext3 (or most filesystems for that matter).

Given these constraints, KubeVirt is the best component to manage this problem, because it can manage PVCs where libvirt cannot, since libvirt runs inside a container.

Member

@stu-gott do I understand correctly that your answer to my question is affirmative? libvirt has the same problem and cannot use shared block+vTPM+migration?

Have we asked them to support this combination?
Do they not mind?

Member

@dankenigsberg I answered in another comment. The answer is, to the best of my knowledge, no.
swtpm, qemu and/or libvirt don't support saving tpm state on block storage and there is no plan to support this.
If we want to support this scenario in KubeVirt we will need to provide the management around this.

- Users want all VMs with persistent features to be potentially migratable

## Definition of Users
Any VM user that wants to use persistent features, such as TPM or EFI
Member

Suggested change
Any VM user that wants to use persistent features, such as TPM or EFI
Any VM owner that wants to use persistent features, such as TPM or EFI

defining "user" via the word "user" sounds circular to me.


## Functional Testing Approach
All existing backend-storage-related functional tests will still apply. More tests could be added to ensure all 4 combinations
of FS/block RWX/RWO work for backend-storage, if we think it's worth the additional stress on the e2e test lanes.
Member

Migration tests are already quite time-consuming. Can a vTPM device be added to existing migrating VMs so as not to make the test run longer?

Contributor Author

Yes, this is definitely something to keep in mind when adding functional test coverage for this. It's indeed what I was trying to hint at there

@jean-edouard (Contributor Author)

Hi @jean-edouard! Thanks for this much-needed work. I have a few questions whose answers could be incorporated into the proposal.

Thank you for the review! Comments answered on GitHub; I will amend the design proposal to reflect them.

- RWO FS: supported by backend-storage but makes VMs non-migratable
- RWX FS: preferred combination, supported by backend-storage and compatible with VM migrations
- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way
- RWX block: same as above, with migratability functional but considered unsafe (as it concurrently mounts an ext3 partition)
Member

I can add more verbiage to explain that in the doc.

Please do. In particular, please refer to my question:

Does libvirt have the same problem? If the only shared storage a user has is block, can they use vTPM+migration?


This is the best we can do with the current implementation, which involves mounting the same partition on migration source and target.
This proposal is about switching to a different model which would instead create a new backend-storage PVC on migration,
allowing all 4 cases above to have proper migration support.
Member

I do not understand why we are interested in RWO. If we have only RWO, we cannot migrate the root disk. I don't understand why we conflate this proposal (and future tests) with this combination (migration+RWO).

- For modes, the filesystem volume mode will be preferred (less complexity), and we default to RWO (because RWX won't be needed anymore)
- On migration, no matter what volume/access mode the PVC uses, we create a new empty PVC with the name `persistent-state-for-<vm_name>-<random_string>` and no annotation
- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object
Member

Please include a deeper discussion of the mode of failure. I don't understand how you can coordinate between the source and destination without interaction with qemu.

Signed-off-by: Jed Lejosne <jed@redhat.com>

This is the best we can do with the current implementation, which involves mounting the same partition on migration source and target.
This proposal is about switching to a different model which would instead create a new backend-storage PVC on migration,
allowing all 4 cases above to have proper migration support.
Member

Thinking about RWO and discussing it makes this proposal more complex and slower to test. What is the benefit of it?

- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object
- In the unlikely event that a VM shuts down towards the very end of a migration, the migration finalizer will decide which PVC to keep and which one to get rid of
Member

how could the finalizer make this decision? I don't think it has enough information.

Member

The finalizer itself doesn't solve this issue. It only makes sure that the migration object won't get deleted before the correct volume is annotated by the controller.

However, there is still a race that this proposal will need to address.

There is a short period after the migration is finalized on the host but before the controller has annotated the correct volume. The VM could shut down and start again, picking up an incorrect volume.

I think that this issue can be addressed by simply re-queueing the newly created VMI object if it has an unfinalized migration.
The fact that a new VM can start while it still has an unfinished migration is an issue of its own.

Member

I believe that the existence and nature of these race conditions should be highlighted in this proposal. We should discuss means to eliminate them.

Member

Yes, I agree. @jean-edouard could you please highlight this race and the suggested solution?

- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way
- RWX block: same as above, with migratability functional but considered unsafe, because it leads to 2 pods (source and target virt-launchers) mounting the same ext3 filesystem concurrently.

This is the best we can do with the current implementation, which involves mounting the same partition on migration source and target.
Member

Would you elaborate on partitioning? I see "partitioning complexity" below, but the significance is not clear to me.

What do you mean by "current implementation"? KubeVirt's? libvirt's? Have we considered other (possibly better, yet more intrusive) implementations?

Member

swtpm, qemu and/or libvirt don't support saving TPM state on block storage.
libvirt supports saving this state in an in-memory secret, but it won't be migratable.
Live migration is only supported with state saved on shared storage.
There is no plan to support any other implementation. To the best of my knowledge.

To support block storage as backing for vTPM, KubeVirt would need to have access to the volume in the swtpm container, build a filesystem on top, and mount it.
The issues arise when KubeVirt attempts to mount this volume on both the source and the target containers.

I hope this answers your question.

This proposal suggests creating two separate block volumes and allowing qemu to copy the state; however, it will require KubeVirt to manage the volumes on migration, which creates other challenges to overcome.

Member

There is no plan to support any other implementation. To the best of my knowledge.

Would you elaborate on this? I don't suppose that KubeVirt is the only virtualization management project that cares about block storage+tpm+migration. Can we point to a request to libvirt to support this, with the reasoning for its rejection?

@vladikr (Member) Jul 29, 2024

To my knowledge, other platforms are fine with a shared filesystem volume for migration, or don't support it at all [1].

IIRC, we asked. @jean-edouard @stu-gott @mhenriks @alicefr could you please add your thoughts here?

OpenStack Nova: "Live migration, evacuation, shelving and rescuing of servers with vTPMs is not currently supported."
[1] https://docs.openstack.org/nova/latest/admin/emulated-tpm.html

Member

@xpivarc pointed out that it looks like swtpm supports block storage, so I take my previous statement back until we figure out if we can use this option.

Contributor Author

Indeed, with --tpmstate backend-uri=file://, swtpm can serialize its files into something like a block storage device.
This option is, however, not supported by libvirt, and it also doesn't work for EFI.

- RWO FS: supported by backend-storage but makes VMs non-migratable
- RWX FS: preferred combination, supported by backend-storage and compatible with VM migrations
- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way
- RWX block: same as above, with migratability functional but considered unsafe (as it concurrently mounts an ext3 partition)
Member

@stu-gott do I understand correctly that your answer to my question is affirmative? libvirt has the same problem and cannot use shared block+vTPM+migration?

Have we asked them to support this combination?
Do they not mind?

- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way
- RWX block: same as above, with migratability functional but considered unsafe, because it leads to 2 pods (source and target virt-launchers) mounting the same ext3 filesystem concurrently.

This is the best we can do with the current implementation, which involves mounting the same partition on migration source and target.
Member

There is no plan to support any other implementation. To the best of my knowledge.

Would you elaborate on this? I don't suppose that KubeVirt is the only virtualization management project that cares about block storage+tpm+migration. Can we point to a request to libvirt to support this, with the reasoning for its rejection?

- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object
- In the unlikely event that a VM shuts down towards the very end of a migration, the migration finalizer will decide which PVC to keep and which one to get rid of
Member

I believe that the existence and nature of these race conditions should be highlighted in this proposal. We should discuss means to eliminate them.

@jean-edouard (Contributor Author)

This could also be a concern: https://github.com/libvirt/libvirt/blob/e40a533118efdc3d5f6fd1dcc9bb8143b54691e2/src/qemu/qemu_tpm.c#L571
(if the TPM is stored under a "shared" (nfs/ceph) storage, it looks like it's automatically considered as shared between the source and the target)

- All new persistent storage PVCs will be created with the name `persistent-state-for-<vm_name>-<random_string>` and the new annotation `persistent-state-for-<vm_name>=true`
- When starting an existing VM with persistent features, we will look for any PVC with the annotation, or any PVC with the legacy name, for backwards compatibility
- For volume/access modes:
- When multiple volume modes are listed in the StorageProfile of the StorageClass, we will default to filesystem (to avoid partitioning complexity)
Member

I think we can avoid partitioning entirely. Any StorageClass that supports block should natively support filesystem (like rook-ceph-block)
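To illustrate that point, the same (hypothetical) `rook-ceph-block` StorageClass could back both a filesystem-mode RWO claim for backend-storage and a block-mode RWX claim for a migratable root disk (names and sizes are made up):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: persistent-state-for-vm-a-x7k2q
spec:
  storageClassName: rook-ceph-block
  volumeMode: Filesystem        # backend-storage: small filesystem claim
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Mi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-a-rootdisk
spec:
  storageClassName: rook-ceph-block
  volumeMode: Block             # root disk: shared block claim, migratable
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
```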

Contributor Author

That's huge. Block support is messy and somewhat illogical. Reducing this design to only add RWO FS support would make everything so much simpler and sensible.

- For volume/access modes:
- When multiple volume modes are listed in the StorageProfile of the StorageClass, we will default to filesystem (to avoid partitioning complexity)
- When multiple access modes are listed for the selected volume mode, we will arbitrarily default to RWO.
- When CDI is not present, or the StorageClass has no StorageProfile, or the StorageProfile has no modes, we'll default to RWO FS.
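As a sketch of that selection logic, the CDI StorageProfile of the chosen StorageClass could be consulted roughly like this (the profile content below is an illustrative assumption):

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: StorageProfile
metadata:
  name: rook-ceph-block          # StorageProfiles are named after their StorageClass
status:
  claimPropertySets:
    # Backend-storage would prefer a Filesystem entry, then default to RWO among its access modes.
    - accessModes:
        - ReadWriteOnce
      volumeMode: Filesystem
    - accessModes:
        - ReadWriteMany
      volumeMode: Block
```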
Member

RWO FS is the lowest common denominator in k8s; you will be hard-pressed to find a provisioner that does not support it. Not sure we should even consider any other combinations.

Member

RWO FS is the lowest common denominator in k8s; you will be hard-pressed to find a provisioner that does not support it.

That's a very important note. How common is RWX FS, which we already support with migration?

There may be operational problems with splitting VM persistence between several storage classes (disaster recovery and storage migration may become more complex, virt-default StorageClass) which may keep this proposal useful.

Member

That's a very important note. How common is RWX FS, which we already support with migration?

RWX FS is not very popular. Shared filesystems have been declining in popularity since object storage entered the scene; they are generally considered complex and slow, used mainly for legacy applications, and avoided whenever possible.

There may be operational problems with splitting VM persistence between several storage classes (disaster recovery and storage migration may become more complex, virt-default StorageClass) which may keep this proposal useful.

I'm not following this. Where's the splitting happening? My point is that the same storageclass can be used for block and fs. For example the rook-ceph-block StorageClass can do RWX block and RWO fs.

Member

For example the rook-ceph-block StorageClass can do RWX block and RWO fs.

I didn't know that. Thanks.

So to support TPM on rook-ceph-block we have two options: either ask libvirt to support RWX block, or add our own support for RWO file. This proposal should discuss the pros and cons of each.

@jean-edouard (Contributor Author) Jul 31, 2024

Well this proposal is all about option 2, but I guess we could touch on option 1 in the intro.
A brief pros/cons:
Option 1 requires some work on our side (multi-PVC support) and a lot of work on libvirt's (add support for swtpm block + work with qemu to design and implement EFI block). The main con IMO is the unreasonable ask of libvirt/qemu to support pre-1961 technology (disks without partitions/filesystems).
Option 2 just requires implementing this design proposal, including ensuring there's no race when rebooting during a migration. Seems a lot saner to me, even if it's more work for us.

Edit: merge request opened for libvirt swtpm block support: https://gitlab.com/libvirt/libvirt/-/merge_requests/380

Member

work with qemu to design and implement EFI block

Seems like a big chunk of work for another team and would ultimately require 2 PVCs for VMs with EFI+TPM, correct? With filesystem PVCs, TPM and EFI can share a single PVC, right?

Contributor Author

Exactly yes

Member

Fewer PVCs is a big pro for option 2.

of FS/block RWX/RWO work for backend-storage, if we think it's worth the additional stress on the e2e test lanes.

# Implementation Phases
First phase is block support. That effort is well underway already.
Member

I don't think it is worth investing in block PVC support. What provisioner only supports block PVCs?

@mhenriks (Member)

This could also be a concern: https://github.com/libvirt/libvirt/blob/e40a533118efdc3d5f6fd1dcc9bb8143b54691e2/src/qemu/qemu_tpm.c#L571 (if the TPM is stored under a "shared" (nfs/ceph) storage, it looks like it's automatically considered as shared between the source and the target)

Interesting. So it could be possible to ask for an RWO Filesystem PVC, get NFS, and at migration time the data would not be copied to the new PVC?

I believe we've encountered a similar problem before, but in reverse: a user requests an RWX PVC but gets ext4 because the VM is scheduled on the same node as the NFS server. I believe LD_PRELOAD hacks were used to get around the issue until there was a proper fix.

…support out of scope

Signed-off-by: Jed Lejosne <jed@redhat.com>

Pros:
- Less code to write and maintain in KubeVirt
- No need to duplicate backend-storage PVCs on migrations
Member

Suggested change
- No need to duplicate backend-storage PVCs on migrations
- No need to duplicate backend-storage PVCs on migrations and delete the correct one after migration success/failure
- libvirt gains an important feature

Member

Please file a public request on libvirt to add this functionality, and include the URL to it here. The simplest way to "kill" this alternative is if libvirt rejects it. I would think that this would be a mistake on their part, but that's their question to answer.

Contributor Author

I discussed this with a few members of the libvirt team, and they actually don't appear to have strong objections to it.
There would be 4 parts to this:

  • block storage support in swtpm: already done
  • libvirt glue for swtpm block storage: draft PR open (https://gitlab.com/libvirt/libvirt/-/merge_requests/380)
  • block storage support in qemu for NVRAM: could work out of the box, since the NVRAM is already just 1 file (just have to ensure they don't run file-specific operations on it)
  • libvirt glue for NVRAM block storage: could also work out of the box, although unlikely, but shouldn't require too much work (just ensuring the "file" is never moved/deleted/chmoded...)

So really not too much left to do overall; however, we would have to wait for the first release of every component that needs changes, which could take a long time.
This approach is also definitely not my first choice, and I would hate to request something, have upstream devs work hard on it and ultimately not consume it...
That's why I didn't file any request yet.

Member

This approach is also definitely not my first choice

Please make sure to spell out all your reasoning as its "cons". For example, I did not see the issue of timing written in the proposal.

That's why I didn't file any request yet.

Let's have a formal request and link to it here. We should be explicit that, because of the reasons spelled out here, we may not use it, or may not use it immediately. But let us give the libvirt team an opportunity to refute our assumptions; maybe they can deliver this functionality fast?

Contributor Author

Please file a public request on libvirt to add this functionality, and include the URL to it here.

https://issues.redhat.com/browse/RHEL-54325
https://issues.redhat.com/browse/RHEL-54326

Member

Thanks. GitHub comments can easily get lost. Please add these to the design doc; the fact that they are not done is a reason against the alternative option.

Contributor Author

Yes, I will rework this design proposal this week and include those links.

- RWO FS: supported by backend-storage but makes VMs non-migratable. **This entire design proposal is about removing that limitation.**
- RWX FS: preferred combination, supported by backend-storage and compatible with VM migrations
- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way
- RWX block: same as above, with migratability functional but considered unsafe, because it leads to 2 pods (source and target virt-launchers) mounting the same ext3 filesystem concurrently.
@alicefr (Member) Aug 14, 2024

Sorry, I'm a bit late on this. Have we ever considered using a clustered filesystem like gfs2 instead of ext3?

Contributor Author

That's a really good question. gfs2 somehow didn't pop up when I looked for potentially suitable filesystems for formatting the block storage, probably worth investigating!
That being said, this proposal (and even the alternative it contains) doesn't consider the option of formatting the block storage (anymore), and not just because I couldn't find a suitable filesystem.
The fact that it requires adding a bunch of tools to virt-launcher and virt-handler, and the general complexity of managing partitions ourselves, make it something we should avoid, in my opinion (see draft PR).

@alicefr (Member) Aug 15, 2024

Yes, I agree. I read a bit about gfs2 as supported by Red Hat, and it requires clvm, so I think that also excludes it as a solution.

Signed-off-by: Jed Lejosne <jed@redhat.com>
- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way
- RWX block: same as above, with migratability functional but considered unsafe, because it leads to 2 pods (source and target virt-launchers) mounting the same ext3 filesystem concurrently.

The above is the best we can do with the current implementation, which involves mounting the same partition on migration source and target.
Member

Suggested change
The above is the best we can do with the current implementation, which involves mounting the same partition on migration source and target.
The above is the best we can do with the current libvirt implementation, which involves mounting the same partition on migration source and target.

I'm not at all sure I understand to which "current implementation" you refer.

Contributor Author

I'll rephrase, what I mean is "without partitioning block / creating new PVCs on migration / libvirt block support"

- Libvirt gains a feature

Cons:
- Relies on third party projects adding and maintaining options to store config files into block devices without filesystems
Member

Suggested change
- Relies on third party projects adding and maintaining options to store config files into block devices without filesystems
- Relies on libvirt and qemu adding and maintaining options to store config files into block devices without filesystems

are these the third parties you refer to?

Comment on lines +34 to +36
- We would have to either
- Maintain 2 different backend-storage implementations (the existing RWX FS one and this RWX block one)
- Deprecate RWX FS and require RWX block to enable backend-storage
Member

I don't understand this point. In the Primary Design, what do you plan to do with the existing code referring to RWX FS?

Contributor Author

With the primary design, we keep support for RWX FS and extend it to support RWO FS. Since both use virtually the same codepath, it's just one thing to maintain.
However, RWX block would need a completely different approach, so if we want to avoid maintaining 2 different ways to do the same thing we'd have to deprecate RWX FS.

Member

I don't know the code, but ideally, by delegating the trouble to libvirt, the only difference between RWX FS and RWX Block would be passing a block device to libvirt instead of a filesystem path.

In the primary implementation, we need to add code to synchronize between source and destination during success and various modes of failures.

Contributor Author

I don't know the code, but ideally, by delegating the trouble to libvirt, the only difference between RWX FS and RWX Block would be passing a block device to libvirt instead of a filesystem path.

Yes, well multiple block devices. That means changing every function that touches backend-storage to handle multiple PVCs.
It would also require changing the volumes and mounts we pass to virt-launcher. Finally the libvirt options would indeed look different.
That is a non-trivial amount of differences.

In the primary implementation, we need to add code to synchronize between source and destination during success and various modes of failures.

Yes, before we can mark a (RWO) migration as completed we have to add a label to a PVC and delete another one. It is an extra step, and we have to make sure it's done right, but it's not much code at all.

@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jean-edouard. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dankenigsberg (Member) left a comment

The toughest issue in the primary solution is handling migration and VM failures (and the reason I prefer the alternative solution). Would you elaborate on that issue?

- We create a new empty PVC for the target, with the name `persistent-state-for-<vm_name>-<random_string>` and no annotation
- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object
Member

Does this hold true also with post-copy migration?

@jean-edouard (Contributor Author) Oct 1, 2024

I believe so. In a failed post-copy migration, I don't think we can rely on the state of the target, but we know the source is sane.

Member

I think we should be more certain about this. The guest os is running on the destination host. I assume that it makes changes to the storage there and not to the source - but please check this out.

Contributor Author

Your concern is valid, but it's just part of the dangers of post-copy. Yes, if a guest is undergoing a post-copy migration and, after the handoff, makes changes to the TPM, and then someone trips on the network cable before the end of the migration, the changes made to the TPM will be lost. But that's exactly why 1) post-copy comes with a big warning and 2) BitLocker strongly encourages you to write down your recovery key.
You can also imagine the opposite scenario, where we decided to keep the target PVC and the (post-handoff post-copy) migration failed right during a change to the TPM. The PVC on the target would now contain a corrupt state and switching to it would break the VM (that's what I was referring to above). The same concern applies with RWX, by the way.
Post-copy migration might break stuff; I don't think there's any way around it...


if a guest is undergoing a post-copy migration and, after the handoff, makes changes to the TPM, and then someone trips on the network cable before the end of the migration, the changes made to the TPM will be lost.

That is not quite correct from the QEMU/libvirt POV. In the event of failed post-copy, both the source and dest QEMU will remain in existence, but both in paused state, since neither has the full state in isolation.

When the network problem is resolved, you can call virDomainMigrateStartPostCopy with the VIR_MIGRATE_POSTCOPY_RESUME flag, to recover from the problem and complete the migration, without data loss.

- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object
- In the unlikely event that a VM shuts down towards the very end of a migration, the migration finalizer will decide which PVC to keep and which one to get rid of
Member

How would the migration finalizer know which of the two PVCs is the correct one?

Contributor Author

This needs to be reworded. If the VMI disappears (i.e. a shutdown), the migration is automatically failed, and we fall into the bullet point above.

Signed-off-by: Jed Lejosne <jed@redhat.com>
- We create a new empty PVC for the target, with the name `persistent-state-for-<vm_name>-<random_string>` and no annotation
- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object
Member

I think we should be more certain about this. The guest os is running on the destination host. I assume that it makes changes to the storage there and not to the source - but please check this out.

- In the migration object status, we store the names of both the source and target PVCs (see API section below)
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object
- In the unlikely event that a VM shuts down during a migration, the migration will be considered failed and the target PVC will get discarded
Member

What if, right after the migration ends successfully at the libvirt level, the VM makes a change to the TPM and shuts down? Discarding the target PVC would lose that data.

Contributor Author

Yes, see #314 (comment)

Member

I don't find this acceptable.

Contributor Author

This is not a KubeVirt problem, same thing would happen on bare-metal... If one doesn't find the risks of post-copy acceptable, they should just not use it.
What we can and should do is document those risks thoroughly.

Contributor Author

What if, right after the migration ends successfully at the libvirt level, the VM makes a change to the TPM and shuts down? Discarding the target PVC would lose that data.

Sorry I misread your question. If a libvirt migration ends successfully then the KubeVirt migration will also end successfully and all will be fine.
I read something like "What if right after migration handoff [...]", my bad.

Contributor Author

I see. To repeat back and summarize, if all of the following are true, we have a problem:

  • We're inside the (at most 400ms) window when the libvirt migration is finished but KubeVirt doesn't know yet
  • The source pod/node crashes
  • The VM writes to the TPM

Then the VMIM will fail and that TPM write will be dropped.
Honestly, to me, this scenario feels statistically near-impossible...
If we want to address this, do you have a suggestion for how?
I brainstormed this with @stu-gott and this is the only thing we could come up with:

  • Libvirt+swtpm implement a state lock that's automatically enabled at the end of a migration
  • Once KubeVirt sees that the libvirt migration succeeded, it instructs libvirt to release that lock

If that makes sense to you we can discuss it with the libvirt team.
However, would you agree to do that as a follow-up effort?

Thanks!

Member

The 400ms is only the time it takes for the migration monitor to update the domain metadata.
Then there is also the time required to propagate the migration status to other kubevirt components/objects.
Overall, it's possible that this can be more than just a statistically insignificant issue.

I didn't have much time to think this through.
However, if we are going to continue with this effort we need to come up with a mechanism that will allow us to choose the right volume on the next VM startup.
I don't think that locking is enough - I don't see how it solves the issue. It doesn't look like there is a competing process that can write into the source volume.

Instead of discarding the target volume for unfinalized migrations, we should keep it around.
Then, on the next VM startup we can select the volume with the latest timestamp.

Contributor Author

Instead of discarding the target volume for unfinalized migrations, we should keep it around.
Then, on the next VM startup we can select the volume with the latest timestamp.

We can easily change the code to keep target volumes around for migrations that didn't succeed.
However, that timestamp thing doesn't make sense to me. We have no idea what the state of the target PVC is as long as we don't see the migration succeed. It could very well be empty or partially populated. What we could do is keep it around and clean it up the next time we migrate the VM. That way it's there for recovery purposes in case the highly unlikely scenario above occurs.

Member

Instead of discarding the target volume for unfinalized migrations, we should keep it around.
Then, on the next VM startup we can select the volume with the latest timestamp.

We can easily change the code to keep target volumes around for migrations that didn't succeed.

For migrations that didn't reach the completed status - not for any failed migration.

However, that timestamp thing doesn't make sense to me. We have no idea what the state of the target PVC is as long as we don't see the migration succeed. It could very well be empty or partially populated. What we could do is keep it around and clean it up the next time we migrate the VM. That way it's there for recovery purposes in case the highly unlikely scenario above occurs.

We are only speaking about the scenario where the VM is on the target and is writing to the target PVC while the source is crashed/unavailable. There should be something on the target, as the migration has actually been completed (only the source doesn't know about it and didn't update all the other components).

I feel that there is a disconnect, let's take it offline when I'm back.

Member

@alicefr This issue and kubevirt/kubevirt#13112 may affect the already implemented volume migration.

@xpivarc (Member)

xpivarc commented Nov 18, 2024

/sig compute
