design proposal: new migration process for backend-storage PVCs #314
Conversation
jean-edouard commented on Jul 23, 2024
Hi @jean-edouard! Thanks for this much-needed work. I have a few questions whose answers could be incorporated into the proposal.
- The storage class marked as the default for the cluster | ||
- All new persistent storage PVCs will be created with the name `persistent-state-for-<vm_name>-<random_string>` and the new annotation `persistent-state-for-<vm_name>=true` | ||
- When starting an existing VM with persistent features, we will look for any PVC with the annotation, or any PVC with the legacy name, for backwards compatibility | ||
- For modes, the filesystem volume mode will be preferred (less complexity), and we default to RWO (because RWX won't be needed anymore) |
We may be better off deferring to CDI logic, using the StorageProfile of the selected storage class.
Yes, the storage profile will definitely be used, but if multiple access modes are available we still need to pick one
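To illustrate what that selection could look like, here is a minimal sketch of the defaulting logic being discussed. It assumes a simplified stand-in for CDI's StorageProfile claim property sets; the type and function names are illustrative, not the proposed implementation.

```go
package backendstorage

import corev1 "k8s.io/api/core/v1"

// claimProperties mirrors, in simplified form, one entry of a CDI
// StorageProfile's status.claimPropertySets. It is a stand-in for
// illustration only, not the real CDI type.
type claimProperties struct {
	VolumeMode  corev1.PersistentVolumeMode
	AccessModes []corev1.PersistentVolumeAccessMode
}

// pickModes is a hypothetical helper showing the defaulting described in the
// proposal: prefer the Filesystem volume mode, default to ReadWriteOnce, and
// fall back to RWO FS when no usable profile information exists.
func pickModes(sets []claimProperties) (corev1.PersistentVolumeMode, corev1.PersistentVolumeAccessMode) {
	for _, set := range sets {
		// Prefer the filesystem volume mode to avoid partitioning complexity.
		if set.VolumeMode != corev1.PersistentVolumeFilesystem {
			continue
		}
		for _, am := range set.AccessModes {
			// RWX is no longer needed, since the PVC is never shared
			// between source and target, so default to RWO if offered.
			if am == corev1.ReadWriteOnce {
				return corev1.PersistentVolumeFilesystem, corev1.ReadWriteOnce
			}
		}
		if len(set.AccessModes) > 0 {
			return corev1.PersistentVolumeFilesystem, set.AccessModes[0]
		}
	}
	// No CDI, no StorageProfile, or no modes listed: default to RWO FS.
	return corev1.PersistentVolumeFilesystem, corev1.ReadWriteOnce
}
```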
- For modes, the filesystem volume mode will be preferred (less complexity), and we default to RWO (because RWX won't be needed anymore) | ||
- On migration, no matter what volume/access mode the PVC uses, we create a new empty PVC with the name `persistent-state-for-<vm_name>-<random_string>` and no annotation | ||
- In the migration object status, we store the names of both the source and target PVCs (see API section below) | ||
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object |
This sounds racy. What happens if the migration succeeds and the VM crashes before we move over the label?
This is why status fields are added to the migration object. The migration won't move to the finished state until the PVCs are sorted out.
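As a rough sketch of how those status fields and the success/failure handling could fit together (the field and function names below are hypothetical, since the actual API section is only referenced here), the controller would resolve the two PVCs only once the outcome of the migration is known:

```go
package backendstorage

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// migrationPVCInfo sketches the status fields proposed for the migration
// object; the real field names live in the API section of the design document.
type migrationPVCInfo struct {
	SourcePersistentStatePVC string
	TargetPersistentStatePVC string
}

// finalizeBackendStorage shows the intended handling: on success the target
// PVC gets the annotation and the source PVC is deleted, on failure the
// freshly created target PVC is discarded.
func finalizeBackendStorage(ctx context.Context, c kubernetes.Interface, ns, vmName string, info migrationPVCInfo, succeeded bool) error {
	pvcs := c.CoreV1().PersistentVolumeClaims(ns)
	if succeeded {
		patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{"persistent-state-for-%s":"true"}}}`, vmName))
		if _, err := pvcs.Patch(ctx, info.TargetPersistentStatePVC, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			return err
		}
		return pvcs.Delete(ctx, info.SourcePersistentStatePVC, metav1.DeleteOptions{})
	}
	// Failed migration: the source PVC stays authoritative, the target is discarded.
	return pvcs.Delete(ctx, info.TargetPersistentStatePVC, metav1.DeleteOptions{})
}
```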
Please include a deeper discussion of the mode of failure. I don't understand how you can coordinate between the source and destination without interaction with qemu.
|
||
This is the best we can do with the current implementation, which involves mounting the same partition on migration source and target. | ||
This proposal is about switching to a different model which would instead create a new backend-storage PVC on migration, | ||
allowing all 4 cases above to have proper migration support. |
Why do we care about migratability with RWO storage? We don't support it for the root disk. Would it simplify our solution if we assume RWX, so that the same PVC is accessible on both the source and destination nodes?
The point of this proposal is to stop mounting the PVC on both source and target, to enable migratability of block-based backend-storage.
I do not understand why we are interested in RWO. If we have only RWO, we cannot migrate the root disk. I don't understand why we conflate this proposal (and future tests) with this combination (migration+RWO).
The backend-storage may be set to a different storage class than the other disks.
What is being discussed here is that since the content of this volume will be copied, the access mode of such a volume becomes irrelevant and might as well be RWO.
Thinking about RWO and discussing it makes this proposal more complex and slower to test. What is the benefit of it?
From my side, we can ignore it. However, in many cases RWO is the default access mode. We don't have to test or support this.
What about containerdisk VMs? They can be live migrated, correct? Can we use emptyDir volumes for backend storage in that case?
EDIT: obviously a weird case, probably not necessary to support, at least initially.
Using emptydir for backend storage is effectively what happens for non-persistent TPM/EFI.
If we want persistence, we need a PVC.
- RWO FS: supported by backend-storage but makes VMs non-migratable | ||
- RWX FS: preferred combination, supported by backend-storage and compatible with VM migrations | ||
- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way | ||
- RWX block: same as above, with migratability functional but considered unsafe (as it concurrently mounts an ext3 partition) |
Can you provide some background on why this is the case?
We support RWX block for the root disk with no problem. Why does vTPM require creating a filesystem on the storage? Have we requested qemu/libvirt to provide support for filesystem-free vTPM persistence?
In other words: does libvirt have the same problem? If the only shared storage a user has is block, can they use vTPM+migration?
This is because, unlike with VM disks, we have 2 components concurrently accessing the PVC (source and target virt-launchers).
They both maintain the filesystem metadata for the mounted partition, which will lead to conflicts. I can add more verbiage to explain that in the doc.
I can add more verbiage to explain that in the doc.
Please do. In particular, please refer to my question:
Does libvirt have the same problem? If the only shared storage a user has is block, can they use vTPM+migration?
Currently the TPM implementation uses multiple files under the hood.
This means one of two things must be true: either some component in the stack must manage a filesystem on the block device, or the TPM implementation must be rewritten from the ground up. The latter doesn't seem like a fair request to the underlying projects, at least not in a timeline that suits our needs.
The filesystem being used is ext3, so it can only be mounted on either the source or the destination at any given time. These mounts are mutually exclusive with each other regardless of whether or not the PVC is using RWX mode. This is simply a limitation of ext3 (or most filesystems, for that matter).
Given these constraints, KubeVirt is the best component to manage this problem because it can manage PVCs, which libvirt can't, since libvirt runs inside a container.
@stu-gott do I understand correctly that your answer to my question is affirmative? libvirt has the same problem and cannot use shared block+vTPM+migration?
Have we asked them to support this combination?
Do they not mind?
@dankenigsberg I answered in another comment. The answer is, to the best of my knowledge, no.
swtpm, qemu and/or libvirt don't support saving tpm state on block storage and there is no plan to support this.
If we want to support this scenario in KubeVirt we will need to provide the management around this.
- Users want all VMs with persistent features to be potentially migratable | ||
|
||
## Definition of Users | ||
Any VM user that wants to use persistent features, such as TPM or EFI |
Any VM user that wants to use persistent features, such as TPM or EFI | |
Any VM owner that wants to use persistent features, such as TPM or EFI |
defining "user" via the word "user" sounds circular to me.
|
||
## Functional Testing Approach | ||
All existing backend-storage-related functional tests will still apply. More tests could be added to ensure all 4 combinations | ||
of FS/block RWX/RWO work for backend-storage, if we think it's worth the additional stress on the e2e test lanes. |
Migration tests are already quite time-consuming. Can a vTPM device be added to VMs in existing migration tests so as not to make the test run longer?
Yes, this is definitely something to keep in mind when adding functional test coverage for this. It's indeed what I was trying to hint at there.
Thank you for the review! Comments answered on GitHub; I will amend the design proposal to reflect them.
Force-pushed from 06a1147 to 45edea3 (Signed-off-by: Jed Lejosne <jed@redhat.com>)
- In the migration object status, we store the names of both the source and target PVCs (see API section below) | ||
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object | ||
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object | ||
- In the unlikely event that a VM shuts down towards the very end of a migration, the migration finalizer will decide which PVC to keep and which one to get rid of |
how could the finalizer make this decision? I don't think it has enough information.
The finalizer itself doesn't solve this issue. It only makes sure that the migration object won't get deleted before the correct volume is annotated by the controller.
However, there is still a race that this proposal will need to address.
There is a short period after the migration is finalized on the host but before the controller has annotated the correct volume. The VM could shut down and start again, picking up an incorrect volume.
I think that this issue can be addressed by simply re-queueing the newly created VMI object if it has an unfinalized migration.
It is an issue in its own right that a new VM can start while it has an unfinished migration.
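A minimal sketch of such a re-queue guard, assuming a standard controller work queue and a simplified view of the VMI; the names are illustrative and not KubeVirt's actual controller code:

```go
package backendstorage

import (
	"k8s.io/client-go/util/workqueue"
)

// vmiSnapshot is a simplified stand-in for the VMI and its active migration
// as seen by the controller; the real controller works on KubeVirt API types.
type vmiSnapshot struct {
	Key                     string // namespace/name used as the queue key
	HasUnfinalizedMigration bool   // migration finished on the host, PVCs not yet resolved
}

// enqueueVMI refuses to process a freshly (re)started VMI while a previous
// migration's backend-storage PVCs are still being sorted out, and instead
// re-queues it with a delay so it is retried once the annotation has moved.
func enqueueVMI(queue workqueue.RateLimitingInterface, vmi vmiSnapshot) {
	if vmi.HasUnfinalizedMigration {
		// Do not let the VMI start against a possibly wrong PVC;
		// retry after the migration controller resolves the volumes.
		queue.AddRateLimited(vmi.Key)
		return
	}
	queue.Add(vmi.Key)
}
```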
I believe that the existence and nature of these race conditions should be highlighted in this proposal. We should discuss means to eliminate them.
Yes, I agree. @jean-edouard could you please highlight this race and the suggested solution?
- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way | ||
- RWX block: same as above, with migratability functional but considered unsafe, because it leads to 2 pods (source and target virt-launchers) mounting the same ext3 filesystem concurrently. | ||
|
||
This is the best we can do with the current implementation, which involves mounting the same partition on migration source and target. |
Would you elaborate about partitioning? I see "partitioning complexity" below, but the significance is not clear to me.
What do you mean by "current implementation"? KubeVirt's? libvirt's? Have we considered other (possibly better, yet more intrusive) implementations?
swtpm, qemu and/or libvirt don't support saving TPM state on block storage.
libvirt supports saving this state in an in-memory secret, but it won't be migratable.
Live migration is only supported with state saved on shared storage.
There is no plan to support any other implementation, to the best of my knowledge.
To support block storage as a backend for vTPM, KubeVirt would need to access the block device in the swtpm container, build a filesystem on top of it, and mount it.
The issues arise when KubeVirt attempts to mount this volume on both the source and the target containers.
I hope this answers your question.
This proposal suggests creating two separate block volumes and letting qemu copy the state; however, it will require KubeVirt to manage the volumes on migration, which creates other challenges to overcome.
There is no plan to support any other implementation. To the best of my knowledge.
Would you elaborate on this? I don't suppose that KubeVirt is the only virtualization management project that cares about block storage+TPM+migration. Can we point to a request filed with libvirt to support this, with the reasoning for its rejection?
To my knowledge, other platforms are fine with a shared filesystem volume for migration, or don't support it at all [1].
IIRC, we asked. @jean-edouard @stu-gott @mhenriks @alicefr could you please add your thoughts here?
OpenStack Nova: "Live migration, evacuation, shelving and rescuing of servers with vTPMs is not currently supported."
[1] https://docs.openstack.org/nova/latest/admin/emulated-tpm.html
@xpivarc pointed out that it looks like swtpm supports block storage, so I take my previous statement back until we figure out if we can use this option.
Indeed, with --tpmstate backend-uri=file://, swtpm can serialize its files into something like a block storage device.
This option is, however, not supported by libvirt. That also doesn't work with EFI.
- In the migration object status, we store the names of both the source and target PVCs (see API section below) | ||
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object | ||
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object | ||
- In the unlikely event that a VM shuts down towards the very end of a migration, the migration finalizer will decide which PVC to keep and which one to get rid of |
I believe that the existence and nature of these race conditions should be highlighted in this proposal. We should discuss means to eliminate them.
This could also be a concern: https://github.com/libvirt/libvirt/blob/e40a533118efdc3d5f6fd1dcc9bb8143b54691e2/src/qemu/qemu_tpm.c#L571 |
- All new persistent storage PVCs will be created with the name `persistent-state-for-<vm_name>-<random_string>` and the new annotation `persistent-state-for-<vm_name>=true` | ||
- When starting an existing VM with persistent features, we will look for any PVC with the annotation, or any PVC with the legacy name, for backwards compatibility | ||
- For volume/access modes: | ||
- When multiple volume modes are listed in the StorageProfile of the StorageClass, we will default to filesystem (to avoid partitioning complexity) |
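For illustration, creating a PVC that follows this naming and annotation scheme could look roughly like the sketch below. This is not KubeVirt's actual code; the function name is hypothetical, and storage size, storage class selection and owner references are left out.

```go
package backendstorage

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newPersistentStatePVC sketches how a backend-storage PVC with the proposed
// naming scheme and annotation could be built. Note that the migration-target
// variant would be created without the annotation, which is only added once
// the migration succeeds.
func newPersistentStatePVC(vmName string) *corev1.PersistentVolumeClaim {
	fs := corev1.PersistentVolumeFilesystem
	return &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			// GenerateName yields the `persistent-state-for-<vm_name>-<random_string>` form.
			GenerateName: fmt.Sprintf("persistent-state-for-%s-", vmName),
			Annotations: map[string]string{
				fmt.Sprintf("persistent-state-for-%s", vmName): "true",
			},
		},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			VolumeMode:  &fs,
			// spec.resources.requests.storage and storageClassName would be
			// filled in from the cluster configuration in a real implementation.
		},
	}
}
```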
I think we can avoid partitioning entirely. Any StorageClass that supports block should natively support filesystem (like rook-ceph-block)
That's huge. Block support is messy and somewhat illogical. Reducing this design to only add RWO FS support would make everything much simpler and more sensible.
- For volume/access modes: | ||
- When multiple volume modes are listed in the StorageProfile of the StorageClass, we will default to filesystem (to avoid partitioning complexity) | ||
- When multiple access modes are listed for the selected volume mode, we will arbitrarily default to RWO. | ||
- When CDI is not present, or the StorageClass has no StorageProfile, or the StorageProfile has no modes, we'll default to RWO FS. |
RWO FS is the lowest common denominator in k8s; you will be hard-pressed to find a provisioner that does not support it. Not sure we should even consider any other combinations.
RWO FS is the lowest common denominator in k8s you will be hard pressed to find a provisioner that does not support that
That's a very important note. How common is RWX FS which we already support with migration?
There may be operational problems with splitting VM persistence between several storage classes (disaster recovery and storage migration may become more complex, virt-default StorageClass) which may keep this proposal useful.
That's a very important note. How common is RWX FS which we already support with migration?
RWX FS is not very popular. Shared filesystems have been declining in popularity since object storage entered the scene; they are generally considered complex and slow, suited to legacy applications, and avoided whenever possible.
There may be operational problems with splitting VM persistence between several storage classes (disaster recovery and storage migration may become more complex, virt-default StorageClass) which may keep this proposal useful.
I'm not following this. Where's the splitting happening? My point is that the same storageclass can be used for block and fs. For example the rook-ceph-block StorageClass can do RWX block and RWO fs.
For example the rook-ceph-block StorageClass can do RWX block and RWO fs.
I didn't know that. Thanks.
So to support TPM on rook-ceph-block we have two options: either ask libvirt to support RWX block, or add our own support for RWO file. This proposal should discuss the pros and cons of each.
Well this proposal is all about option 2, but I guess we could touch on option 1 in the intro.
A brief pros/cons:
Option 1 requires some work on our side (multi-PVC support) and a lot of work on libvirt's (add support for swtpm block + work with qemu to design and implement EFI block). The main con IMO is the unreasonable ask of libvirt/qemu to support pre-1961 technology (disks without partitions/filesystems).
Option 2 just requires implementing this design proposal, including ensuring there's no race when rebooting during a migration. Seems a lot saner to me, even if it's more work for us.
Edit: merge request opened for libvirt swtpm block support: https://gitlab.com/libvirt/libvirt/-/merge_requests/380
work with qemu to design and implement EFI block
Seems like a big chunk of work for another team and would ultimately require 2 PVCs for VMs with EFI+TPM, correct? With filesystem PVCs, TPM and EFI can share a single PVC, right?
Exactly yes
Fewer PVCs is a big pro for option 2.
of FS/block RWX/RWO work for backend-storage, if we think it's worth the additional stress on the e2e test lanes. | ||
|
||
# Implementation Phases | ||
First phase is block support. That effort is well underway already. |
I don't think it is worth investing in block PVC support. What provisioner only supports block PVCs?
Interesting. So it could be possible to ask for an RWO Filesystem PVC, get NFS, and at migration time the data would not be copied to the new PVC? I believe we've encountered a similar problem before, but in reverse: a user requests an RWX PVC but gets ext4 because the VM is scheduled on the same node as the NFS server. I believe LD_PRELOAD hacks were used to get around the issue until there was a proper fix.
…support out of scope Signed-off-by: Jed Lejosne <jed@redhat.com>
|
||
Pros: | ||
- Less code to write and maintain in KubeVirt | ||
- No need to duplicate backend-storage PVCs on migrations |
- No need to duplicate backend-storage PVCs on migrations | |
- No need to duplicate backend-storage PVCs on migrations and delete the correct one after migration success/failure | |
- libvirt gains an important feature |
Please file a public request on libvirt to add this functionality, and include the URL to it here. The simplest way to "kill" this alternative is if libvirt rejects it. I would think that this would be a mistake on their part, but that's their question to answer.
I discussed this with a few members of the libvirt team, and they actually don't appear to have strong objections against it.
There would be 4 parts to this:
- block storage support in swtpm: already done
- libvirt glue for swtpm block storage: draft PR open (https://gitlab.com/libvirt/libvirt/-/merge_requests/380)
- block storage support in qemu for NVRAM: could work out of the box, since the NVRAM is already just 1 file (just have to ensure they don't run file-specific operations on it)
- libvirt glue for NVRAM block storage: could also work out of the box, although unlikely, but shouldn't require too much work (just ensuring the "file" is never moved/deleted/chmoded...)
So really not too much left to do overall; however, we would have to wait for the first release of every component that needs changes, which could take a long time.
This approach is also definitely not my first choice, and I would hate to request something, have upstream devs work hard on it and ultimately not consume it...
That's why I didn't file any request yet.
This approach is also definitely not my first choice
Please make sure to spell out all your reasoning as its "cons". For example, I did not see the issue of timing written in the proposal.
That's why I didn't file any request yet.
Let's have a formal request and link to it here. We should be explicit that, for the reasons spelled out here, we may not use it, or may not use it immediately. But let us give the libvirt team an opportunity to refute our assumptions - maybe they can deliver this functionality fast?
Please file a public request on libvirt to add this functionality, and include the URL to it here.
https://issues.redhat.com/browse/RHEL-54325
https://issues.redhat.com/browse/RHEL-54326
Thanks. GitHub comments can easily get lost. Please add these to the design doc - the fact that they are not done is a reason against the alternative option.
Yes, I will rework this design proposal this week and include those links.
- RWO FS: supported by backend-storage but makes VMs non-migratable. **This entire design proposal is about removing that limitation.** | ||
- RWX FS: preferred combination, supported by backend-storage and compatible with VM migrations | ||
- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way | ||
- RWX block: same as above, with migratability functional but considered unsafe, because it leads to 2 pods (source and target virt-launchers) mounting the same ext3 filesystem concurrently. |
Sorry, I'm a bit late on this. Have we ever considered to use a clustered filesystem like gfs2 instead of ext3?
That's a really good question. gfs2 somehow didn't pop up when I looked for potentially suitable filesystems for formatting the block storage, probably worth investigating!
That being said, this proposal (and even the alternative it contains) doesn't consider the option of formatting the block storage (anymore), and not just because I couldn't find a suitable filesystem.
The fact that it requires adding a bunch of tools to virt-launcher and virt-handler, and the general complexity of managing partitions ourselves, make it something we should avoid, in my opinion (see draft PR).
Yes, I agree. I read a bit about gfs2, which is supported by Red Hat, and it requires clvm. So I think this also excludes it as a solution.
Signed-off-by: Jed Lejosne <jed@redhat.com>
- RWO block: not supported yet, but PRs exist to implement it in a non-migratable way | ||
- RWX block: same as above, with migratability functional but considered unsafe, because it leads to 2 pods (source and target virt-launchers) mounting the same ext3 filesystem concurrently. | ||
|
||
The above is the best we can do with the current implementation, which involves mounting the same partition on migration source and target. |
The above is the best we can do with the current implementation, which involves mounting the same partition on migration source and target. | |
The above is the best we can do with the current libvirt implementation, which involves mounting the same partition on migration source and target. |
I'm not at all sure I understand to which "current implementation" you refer.
I'll rephrase, what I mean is "without partitioning block / creating new PVCs on migration / libvirt block support"
Force-pushed from 8aa9913 to 5d5771f
- Libvirt gains a feature | ||
|
||
Cons: | ||
- Relies on third party projects adding and maintaining options to store config files into a block devices without filesystems |
- Relies on third party projects adding and maintaining options to store config files into a block devices without filesystems | |
- Relies on libvirt and qemu adding and maintaining options to store config files into a block devices without filesystems |
are these the third parties you refer to?
- We would have to either | ||
- Maintain 2 different backend-storage implementations (the existing RWX FS one and this RWX block one) | ||
- Deprecate RWX FS and require RWX block to enable backend-storage |
I don't understand this point. In the Primary Design, what do you plan to do with the existing code referring to RWX FS?
With the primary design, we keep support for RWX FS and extend it to support RWO FS. Since both use virtually the same codepath, it's just one thing to maintain.
However, RWX block would need a completely different approach, so if we want to avoid maintaining 2 different ways to do the same thing we'd have to deprecate RWX FS.
I don't know the code, but ideally, by delegating the trouble to libvirt, the only difference between RWX FS and RWX Block would be passing a block device to libvirt instead of a filesystem path.
In the primary implementation, we need to add code to synchronize between source and destination during success and various modes of failures.
I don't know the code, but ideally, by delegating the trouble to libvirt, the only difference between RWX FS and RWX Block would be passing a block device to libvirt instead of a filesystem path.
Yes, well multiple block devices. That means changing every function that touches backend-storage to handle multiple PVCs.
It would also require changing the volumes and mounts we pass to virt-launcher. Finally the libvirt options would indeed look different.
That is a non-trivial amount of differences.
In the primary implementation, we need to add code to synchronize between source and destination during success and various modes of failures.
Yes, before we can mark a (RWO) migration as completed we have to add a label to a PVC and delete another one. It is an extra step, and we have to make sure it's done right, but it's not much code at all.
Force-pushed from 5d5771f to 1298835
The toughest issue in the primary solution is handling migration and VM failures (and the reason I prefer the alternative solution). Would you elaborate on that issue?
- We create a new empty PVC for the target, with the name `persistent-state-for-<vm_name>-<random_string>` and no annotation | ||
- In the migration object status, we store the names of both the source and target PVCs (see API section below) | ||
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object | ||
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object |
Does this hold true also with post-copy migration?
I believe so. In a failed post-copy migration, I don't think we can rely on the state of the target, but we know the source is sane.
I think we should be more certain about this. The guest os is running on the destination host. I assume that it makes changes to the storage there and not to the source - but please check this out.
Your concern is valid, but it's just part of the dangers of post-copy. Yes, if a guest is undergoing a post-copy migration and, after the handoff, makes changes to the TPM, and then before the end of the migration someone trips on the network cable, the changes made to the TPM will be lost. But that's exactly why 1) post-copy comes with a big warning, 2) BitLocker strongly encourages you to write down your recovery key.
You can also imagine the opposite scenario: if we decided to keep the target PVC and the (post-handoff post-copy) migration failed right during a change to the TPM, the PVC on the target would now contain a corrupt state and switching to it would break the VM (that's what I was referring to above). The same concern applies in RWX, by the way.
Post-copy migration might break stuff, I don't think there's any way around it...
if a guest is ongoing a post-copy migration and, after the handoff, makes changes to the TPM, then before the end of the migration someone trips on the network cable, the changes made to the TPM will be lost.
That is not quite correct from the QEMU/libvirt POV. In the event of failed post-copy, both the source and dest QEMU will remain in existence, but both in paused state, since neither has the full state in isolation.
When the network problem is resolved, you can call virDomainMigrateStartPostCopy with the VIR_MIGRATE_POSTCOPY_RESUME flag, to recover from the problem and complete the migration, without data loss.
- In the migration object status, we store the names of both the source and target PVCs (see API section below) | ||
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object | ||
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object | ||
- In the unlikely event that a VM shuts down towards the very end of a migration, the migration finalizer will decide which PVC to keep and which one to get rid of |
How would the migration finalizer know which of the two PVCs is the correct one?
This needs to be reworded. If the VMI disappears (i.e. is shut down), the migration is automatically failed, and we fall into the bullet point above.
Force-pushed from 1298835 to 31f2309 (Signed-off-by: Jed Lejosne <jed@redhat.com>)
- In the migration object status, we store the names of both the source and target PVCs (see API section below) | ||
- If a migration succeeds, we set the annotation `persistent-state-for-<vm_name>=true` on the new (target) PVC and delete the old (source) PVC, using the source PVC name from the migration object | ||
- If a migration fails, we delete the target backend-storage PVC that was just created, using the target PVC name from the migration object | ||
- In the unlikely event that a VM shuts down during a migration, the migration will be considered failed and the target PVC will get discarded |
What if right after migration ends successfully in the libvirt level, the VM makes a change to the TPM and shuts down? Discarding the target PVC would make the data lost.
Yes, see #314 (comment)
I don't find this acceptable.
This is not a KubeVirt problem; the same thing would happen on bare metal... If one doesn't find the risks of post-copy acceptable, they should just not use it.
What we can and should do is document those risks thoroughly.
What if right after migration ends successfully in the libvirt level, the VM makes a change to the TPM and shuts down? Discarding the target PVC would make the data lost.
Sorry I misread your question. If a libvirt migration ends successfully then the KubeVirt migration will also end successfully and all will be fine.
I read something like "What if right after migration handoff [...]", my bad.
I see. To repeat back and summarize, if all the following is true we have a problem:
- We're inside the, at most 400ms, window when the libvirt migration is finished but KubeVirt doesn't know yet
- The source pod/node crashes
- The VM writes to the TPM
Then the VMIM will fail and that TPM write will be dropped.
Honestly, to me, this scenario feels quite statistically impossible...
If we want to address this, do you have a suggestion how?
I brainstormed this with @stu-gott and this is the only thing we could come up with:
- Libvirt+swtpm implement a state lock that's automatically enabled at the end of a migration
- Once KubeVirt sees that the libvirt migration succeeded, it instructs libvirt to release that lock
If that makes sense to you we can discuss it with the libvirt team.
However, would you agree to do that as a follow-up effort?
Thanks!
The 400ms is only the time it takes for the migration monitor to update the domain metadata.
Then there is also the time required to propagate the migration status to other kubevirt components/objects.
Overall, it's possible that this can be more than just a statistically insignificant issue.
I didn't have much time to think this through.
However, if we are going to continue with this effort we need to come up with a mechanism that will allow us to choose the right volume on the next VM startup.
I don't think that locking is enough - I don't see how it solves the issue. It doesn't look like there is a competing process that can write into the source volume.
Instead of discarding the target volume for unfinalized migrations, we should keep it around.
Then, on the next VM start up we can select the volume with the latest time stamp.
Instead of discarding the target volume for unfinalized migrations, we should keep it around.
Then, on the next VM start up we can select the volume with the latest time stamp.
We can easily change the code to keep target volumes around for migrations that didn't succeed.
However, that timestamp thing doesn't make sense to me. We have no idea what the state of the target PVC is as long as we haven't seen the migration succeed. It could very well be empty or partially populated. What we could do is keep it around and clean it up the next time we migrate the VM. That way it's there for recovery purposes in case the highly unlikely scenario above occurs.
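A sketch of what that clean-up could look like when the next migration is prepared. The selection criteria (name prefix, missing annotation) are assumptions for illustration; a real implementation would also have to skip the target PVC of any in-flight migration.

```go
package backendstorage

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupLeftoverTargetPVCs sketches the "keep it around, clean it up on the
// next migration" idea: any PVC that matches the backend-storage naming
// prefix for this VM but never received the `persistent-state-for-<vm_name>`
// annotation is assumed to be a leftover target from an unfinished migration
// and is removed before a new target PVC is created.
func cleanupLeftoverTargetPVCs(ctx context.Context, c kubernetes.Interface, ns, vmName string) error {
	prefix := fmt.Sprintf("persistent-state-for-%s-", vmName)
	annotation := fmt.Sprintf("persistent-state-for-%s", vmName)

	list, err := c.CoreV1().PersistentVolumeClaims(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pvc := range list.Items {
		if !strings.HasPrefix(pvc.Name, prefix) {
			continue
		}
		if pvc.Annotations[annotation] == "true" {
			// This is the authoritative PVC for the VM; keep it.
			continue
		}
		// Leftover, never-annotated target PVC: safe to discard here
		// (a real implementation would exclude active migration targets).
		if err := c.CoreV1().PersistentVolumeClaims(ns).Delete(ctx, pvc.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```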
Instead of discarding the target volume for unfinalized migrations, we should keep it around. Then, on the next VM startup we can select the volume with the latest timestamp.
We can easily change the code to keep target volumes around for migrations that didn't succeed.
For migrations that didn't reach the completed status - not for any failed migration.
However, that timestamp thing doesn't make sense to me. We have no idea what the state of the target PVC is as long as we don't see the migration succeed. It could very well be empty or partially populated. What we can do is keep it around and clean it up next time we migrate the VM? That way it's there for recovery purposes in case the highly unlikely scenario above occurs.
We are only speaking about the scenario when the VM is on the target and is writing to the target PVC when the source is crashed/unavailable. There should be something on the target as the migration has actually been completed. (only the source doesn't know about it and didn't update all other components)
I feel that there is a disconnect, let's take it offline when I'm back.
@alicefr This issue and kubevirt/kubevirt#13112 may affect the already implemented volume migration.
/sig compute |