cdi-importer: import virtualbox ova or tar archive in oneshot #2748

lxs137 · 2023-06-12T07:25:44Z

What this PR does / why we need it:

This PR try to make CDI support import Virtualbox OVA file (tar archive is also supported) in oneshot.

with nbdkit-tar-filter, we can extract vmdk file in remote OVA endpoint, which make the conversion don't need scratch disk space.

Without the PR, we need to download OVA file, and extract vmdk file in it, then convert it to QCOW2 file, upload the QCOW2 file into the PVC.

something need help

if there are multiple vmdk in one ova file, should find a way to let user specify the target vmdk file.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Release note:

cdi-importer: http-datasource can import (compressed) tar archive (or virtualbox OVA) streamly

kubevirt-bot · 2023-06-12T07:25:54Z

Hi @lxs137. Thanks for your PR.

I'm waiting for a kubevirt member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

awels · 2023-06-12T13:15:41Z

/test all

awels · 2023-06-12T13:31:27Z

Hi, thanks for the PR, looks like there are some unit tests failures. You can run the unit tests by calling make test-unit

awels · 2023-06-12T13:44:19Z

/test pull-containerized-data-importer-e2e-hpp-latest

mhenriks · 2023-06-12T20:12:32Z

hack/build/rpm-deps.sh

@@ -46,6 +46,7 @@ nbdkit-basic-filters
 nbdkit-curl-plugin
 nbdkit-xz-filter
 nbdkit-gzip-filter
+nbdkit-tar-filter


We need to understand if/how nbdkit does caching with this filter. I bet some caching is going on as it would be pretty expensive to always open/seek to the appropriate offset in the archive. If there is any significant on disk/memory caching I'd prefer we use simply the golang tar reader to seek to the file and download it to scratch. Then convert from scratch to target. Thus not getting nbdkit involved at all.

Sorry, I can't get the point, why nbdkit cache will be the concern.
I believe nbdkit-tar-filter will only retrieve tar header block, if find the header name match "tar-entry", then get the target file data offset. (I think golang tar reader will do the same thing)

Looks to me like tar plugin writes to tmp which is something we have to avoid: https://github.com/libguestfs/nbdkit/blob/45b72f5bd8fc1b475fa130d06c86cd877bf595d5/filters/tar/tar.c#L146

But I'm not an expert on nbdkit.

@rwmjones care to comment on how curl + tar filter uses tmp space?

tar filter will not use tmp space (or more accurately, only a few bytes are used for a temporary file used to parse offsets printed by tar). Here's how it works:

https://gitlab.com/nbdkit/nbdkit/-/blob/462eb6cbf605c2969f6e851bb9076f8e3961cccb/filters/tar/tar.c#L138

I realise the code isn't very easy to follow. What it does is this:

(1) When we connect first time, we start the command tar --no-auto-compress -t --block-number -v -f - filename > /tmp/output

(2) We read over the first blocks of the tar file and feed them to stdin of tar. This does mean that if the file inside the tar is not early on inside the tar then we end up reading through the earlier files. But we do not write them to disk.

(3) As we go along we look at the output of tar which will eventually print the offset of the file we want in the tar to the randomly named /tmp/output file.

(4) We are then able to read at the right offset/size within the tar directly when the NBD client reads the disk image.

mhenriks · 2023-06-13T18:03:37Z

/ok-to-test

mhenriks · 2023-06-13T18:06:08Z

I'd like to think about this PR more as "tar file support" then "ova support" so a couple suggestions. If the tar archive only has one file, use that regardless of extension. Support for other extensions like ".iso" ".raw" etc

What do you think @awels?

rwmjones · 2023-06-13T18:49:28Z

It should be possible to tell cheaply if a tar file contains either just one file or (more realistic and still usable) the large file is contained early on. Tar files are basically:

GLOBAL HEADER
FILE1 HEADER
FILE1 CONTENT
FILE2 HEADER
FILE2 CONTENT
&c

(There is no central index unlike zip or any modern format.) You could read the first NN bytes (eg. 10M or whatever you consider small enough), feed that into tar -t --block-number -f - filename and see if it produces a result.

In fact this would be a simple modification to nbdkit-tar-file, adding some sort of "read limit". (@ebblake)

mhenriks · 2023-06-13T19:26:48Z

FYI: single file in a tar came from a discussion with @rmohr regarding how GCE imports disk images as described here: https://cloud.google.com/compute/docs/import/import-existing-image#requirements_for_the_image_file

rwmjones · 2023-06-13T19:31:52Z

Patch for above posted: https://listman.redhat.com/archives/libguestfs/2023-June/031837.html

mhenriks · 2023-06-13T19:37:54Z

FYI: single file in a tar came from a discussion with @rmohr regarding how GCE imports disk images as described here: https://cloud.google.com/compute/docs/import/import-existing-image#requirements_for_the_image_file

Supporting "tar.gz" is probably beyond the scope of this PR though.

Edit: for now (this pr), I think supporting the additional file extensions may be good enough

lxs137 · 2023-06-14T07:29:05Z

Update:

support extract file extensions: "qcow2", "vmdk", "vdi", "vhd", "vhdx", "iso", "raw"
if the tar archive only has one file, use that regardless of extension
fix e2e functional tests

lxs137 · 2023-06-14T14:51:03Z

/test pull-containerized-data-importer-e2e-ceph

lxs137 · 2023-06-15T01:32:47Z

/test pull-containerized-data-importer-e2e-ceph

lxs137 · 2023-06-15T08:48:51Z

I am trying to use nbdkit serve a remote tar.gz archive with curl plugin, tar filter and gzip filter, but I find it will be super slow even execute qemu-img info over it.

fast:

nbdkit --foreground --readonly --exit-with-parent -U /tmp/nbdkit.sock --pidfile /tmp/nbdkit.pid --filter=tar -r curl url=http://xxx.tar tar-entry=xx.img &

// will cost near 0.1s
qemu-img info nbd+unix:///?socket=/tmp/nbdkit.sock

slow (60x):

nbdkit --foreground --readonly --exit-with-parent -U /tmp/nbdkit.sock --pidfile /tmp/nbdkit.pid --filter=tar --filter=gzip -r curl url=http://xxx.tar.gz tar-entry=xx.img &

// will cost near 6s
qemu-img info nbd+unix:///?socket=/tmp/nbdkit.sock

I find it's because when combined tar filter and gzip filter, nbdkit will try to download whole remote file.
If I use tar filter only, nbdkit just peek serval bytes in remote file.

@rwmjones will you take a look on it, thank you :-)

rwmjones · 2023-06-15T08:56:24Z

gzip format isn't seekableᵃ. Use xz format as described here: https://libguestfs.org/nbdkit-xz-filter.1.html#Getting-best-random-access-performance-from-xz

ᵃWell, it's a complicated story. It may be possible to make a modifiable seekable gzip: https://rwmj.wordpress.com/2022/12/01/creating-a-modifiable-gzipped-disk-image/

lxs137 · 2023-06-15T09:38:05Z

gzip format isn't seekableᵃ. Use xz format as described here: https://libguestfs.org/nbdkit-xz-filter.1.html#Getting-best-random-access-performance-from-xz

ᵃWell, it's a complicated story. It may be possible to make a modifiable seekable gzip: https://rwmj.wordpress.com/2022/12/01/creating-a-modifiable-gzipped-disk-image/

OK, I test a tar.xz file with small block size, the qemu-img command run more fast (3x slow than run qemu-img over tar file).

If user provide an unseekable compressed tar archive (gzip, xz with large block size), run qemu-img info command over nbdkit will be super slow.

@mhenriks What do you think?

This can be used to ensure that the tar filter does not read indefinite amounts of input when opening the tar file. See: kubevirt/containerized-data-importer#2748 (comment)

kubevirt-bot · 2023-07-22T08:12:08Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign awels for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

lxs137 · 2023-07-22T10:03:19Z

@mhenriks Sorry for my late reply, I have done all the work. :-)

tar (inside file has known file extension): use nbdkit-tar-filter to extract inner file content
compressed tar: FormatReaders will seek to inner file content
tar (inside file has unknown file extension): can't handle

Signed-off-by: menyakun <lxs137@hotmail.com>

* if find a compressed file, it will look inner file format * for archive format, user can read archive inner file content directly Signed-off-by: menyakun <lxs137@hotmail.com>

* use nbdkit-tar-filter to extract archive inner file content streamly * for unseekable compressed archive (gzip, xz with large block size), qemu-img over nbdkit will be super slow, so use FormatReaders to download inner file to scratch space (fallback to previous method) Signed-off-by: menyakun <lxs137@hotmail.com>

Signed-off-by: menyakun <lxs137@hotmail.com>

lxs137 · 2023-07-27T08:06:18Z

/test pull-cdi-unit-test

alromeros · 2023-07-29T09:15:26Z

/test pull-containerized-data-importer-non-csi-hpp

a180285 · 2023-07-29T10:26:41Z

tools/cdi-func-test-file-host-init/main.go

@@ -52,6 +52,8 @@ func main() {
 		[]string{".qcow2", ".gz"},
 		[]string{".qcow2", ".xz"},
 		[]string{".qcow2", ".zst"},
+		[]string{".tar", ".gz"},


I wondering, if it can deal with file like xxx.raw.tar.gz ?

@a180285 Sure, actually we match file by magic number instead of extension.

lxs137 · 2023-08-09T03:08:49Z

@mhenriks Sorry for my late reply, I have done all the work. :-)

tar (inside file has known file extension): use nbdkit-tar-filter to extract inner file content

compressed tar: FormatReaders will seek to inner file content

tar (inside file has unknown file extension): can't handle

@mhenriks Hi, do you have some time to review on this PR ~

kubevirt-bot · 2023-08-12T07:23:23Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

awels · 2023-08-24T19:15:54Z

Sorry this needs a rebase before we can review it.

lxs137 · 2023-08-25T08:55:40Z

I found we disable the nbdkit streamly importing in #2832 .
But in this PR, I try to extract the tar inner content with nbdkit filter (except compressed tar, use golang reader instead).
What do you think?
@awels @mhenriks

lxs137 · 2023-08-26T03:30:54Z

I found we disable the nbdkit streamly importing in #2832 . But in this PR, I try to extract the tar inner content with nbdkit filter (except compressed tar, use golang reader instead). What do you think? @awels @mhenriks

Read some comments from #2809.

Maybe this PR should hold unitl ndkit-1.35.8 is availavle in centos 9 stream,
or split this PR into golang importing part and nbdkit importing part, we just focus on the golang importing part.

kubevirt-bot · 2023-11-24T03:33:16Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot · 2023-12-24T03:44:41Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

kubevirt-bot · 2024-01-23T03:59:02Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

kubevirt-bot · 2024-01-23T03:59:06Z

@kubevirt-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kubevirt-bot requested review from maya-r and mhenriks June 12, 2023 07:25

kubevirt-bot added the size/L label Jun 12, 2023

mhenriks reviewed Jun 12, 2023

View reviewed changes

kubevirt-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 13, 2023

lxs137 force-pushed the extract-ova-and-tar-archive-streamly branch from a907711 to e158570 Compare June 14, 2023 07:21

lxs137 changed the title ~~cdi-importer: import virtualbox ova file in oneshot~~ cdi-importer: import virtualbox ova or other tar archive in oneshot Jun 14, 2023

lxs137 force-pushed the extract-ova-and-tar-archive-streamly branch from e158570 to 546fa75 Compare June 14, 2023 09:23

lxs137 changed the title ~~cdi-importer: import virtualbox ova or other tar archive in oneshot~~ cdi-importer: import virtualbox ova or tar archive in oneshot Jun 14, 2023

lxs137 force-pushed the extract-ova-and-tar-archive-streamly branch from 7a86f10 to 293f7fe Compare July 22, 2023 08:13

kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Jul 22, 2023

kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2023

lxs137 added 4 commits July 27, 2023 15:00

Add nbdkit-tar-filter rpm deps

a443363

Signed-off-by: menyakun <lxs137@hotmail.com>

FormatReaders can parse (compressed) tar archive:

ce7f782

* if find a compressed file, it will look inner file format * for archive format, user can read archive inner file content directly Signed-off-by: menyakun <lxs137@hotmail.com>

fix golang ci lint

d8b9611

Signed-off-by: menyakun <lxs137@hotmail.com>

lxs137 force-pushed the extract-ova-and-tar-archive-streamly branch from f122355 to d8b9611 Compare July 27, 2023 07:05

kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 27, 2023

alromeros mentioned this pull request Jul 29, 2023

Support raw.tar.gz files #2822

Closed

a180285 reviewed Jul 29, 2023

View reviewed changes

kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 12, 2023

kubevirt-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 24, 2023

kubevirt-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 24, 2023

kubevirt-bot closed this Jan 23, 2024

cdi-importer: import virtualbox ova or tar archive in oneshot #2748

cdi-importer: import virtualbox ova or tar archive in oneshot #2748

Conversation

lxs137 commented Jun 12, 2023 • edited Loading

kubevirt-bot commented Jun 12, 2023

awels commented Jun 12, 2023

awels commented Jun 12, 2023

awels commented Jun 12, 2023

mhenriks Jun 12, 2023

Choose a reason for hiding this comment

lxs137 Jun 13, 2023

Choose a reason for hiding this comment

mhenriks Jun 13, 2023 • edited Loading

Choose a reason for hiding this comment

rwmjones Jun 13, 2023

Choose a reason for hiding this comment

rwmjones Jun 13, 2023 • edited Loading

Choose a reason for hiding this comment

mhenriks commented Jun 13, 2023

mhenriks commented Jun 13, 2023

rwmjones commented Jun 13, 2023

mhenriks commented Jun 13, 2023

rwmjones commented Jun 13, 2023

mhenriks commented Jun 13, 2023 • edited Loading

lxs137 commented Jun 14, 2023

lxs137 commented Jun 14, 2023

lxs137 commented Jun 15, 2023

lxs137 commented Jun 15, 2023 • edited Loading

rwmjones commented Jun 15, 2023

lxs137 commented Jun 15, 2023

kubevirt-bot commented Jul 22, 2023

lxs137 commented Jul 22, 2023

lxs137 commented Jul 27, 2023

alromeros commented Jul 29, 2023

a180285 Jul 29, 2023

Choose a reason for hiding this comment

lxs137 Jul 30, 2023

Choose a reason for hiding this comment

lxs137 commented Aug 9, 2023

kubevirt-bot commented Aug 12, 2023

awels commented Aug 24, 2023

lxs137 commented Aug 25, 2023

lxs137 commented Aug 26, 2023

kubevirt-bot commented Nov 24, 2023

kubevirt-bot commented Dec 24, 2023

kubevirt-bot commented Jan 23, 2024

kubevirt-bot commented Jan 23, 2024

lxs137 commented Jun 12, 2023 •

edited

Loading

mhenriks Jun 13, 2023 •

edited

Loading

rwmjones Jun 13, 2023 •

edited

Loading

mhenriks commented Jun 13, 2023 •

edited

Loading

lxs137 commented Jun 15, 2023 •

edited

Loading