feat: Add archive extraction support for http(s) #24

pvaneck · 2022-05-04T04:53:12Z

This enables archive extraction/decompression for the zip format and the gz/tar/tar.gz formats.
If a user provides an http(s) link to one of these formats, pullman will now automatically
extract the contents into the destination directory.

Unit tests were added for the extraction functions; the archives they use are generated during the tests.

Closes: #16

kserve-oss-bot · 2022-05-04T04:53:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pvaneck

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [pvaneck]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pvaneck · 2022-05-04T04:54:02Z

These helper functions can be leveraged for future archive extraction support for other providers.

njhill

Thanks for this @pvaneck.

Couple of additional comments, probably not for this PR though:

As you mentioned, it would be good to have this be storage-type agnostic. I know it's most useful for http, and it's fine to just add it for that first. I guess performance-wise it's good to have some provider-specific logic so we can unzip/untar a stream rather than writing to disk first.
I wonder if blind extraction could cause a problem for runtimes that might expect an archive themselves. I think some also autodetect and accept either, so it wouldn't be a problem for those. But per some of the prior internal discussion, it would be nice to add a parameter on the ServingRuntime with a list of archive types natively supported, so that the extraction would be bypassed if appropriate/necessary.
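
On the streaming point above: Go's `compress/gzip` and `archive/tar` readers compose over any `io.Reader`, so an HTTP response body could in principle be walked without staging the archive on disk. The following is only a minimal sketch of that idea, not the code in this PR. `listTarGzStream` and `sampleTarGz` are hypothetical names introduced for illustration, and the sketch only lists entry names instead of writing files.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"io"
)

// listTarGzStream walks a gzipped tar read directly from r (for example an
// HTTP response body) and returns the entry names, without writing the
// archive to disk first. Hypothetical helper, not part of pullman.
func listTarGzStream(r io.Reader) ([]string, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	var names []string
	tr := tar.NewReader(zr)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return nil, err
		}
		// A real extractor would validate hdr.Name against the destination
		// directory and copy the entry body from tr into a file here.
		names = append(names, hdr.Name)
	}
}

// sampleTarGz builds a one-entry .tar.gz in memory for demonstration only.
func sampleTarGz(name, content string) (*bytes.Buffer, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	tw := tar.NewWriter(zw)
	hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(content))}
	if err := tw.WriteHeader(hdr); err != nil {
		return nil, err
	}
	if _, err := tw.Write([]byte(content)); err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return &buf, nil
}
```

Passing `resp.Body` where the sketch takes `r` would be the http case; other providers would need their own way of exposing a stream.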

njhill · 2022-05-12T22:26:05Z

pullman/helpers.go

+		destFilePath := filepath.Join(dest, zipFile.Name)
+
+		// Zip slip vulnerability check
+		if !strings.HasPrefix(destFilePath, filepath.Clean(dest)+string(os.PathSeparator)) {


nit: could move filepath.Clean(dest)+string(os.PathSeparator) before the loop
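
A minimal sketch of what the nit amounts to, assuming the check keeps the same shape as in the diff above: the cleaned destination prefix is computed once and reused for every entry. `resolveEntryPaths` is a hypothetical helper written for illustration; the PR performs this check inline while iterating over the zip entries.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// resolveEntryPaths maps archive entry names to paths under dest and rejects
// any entry that would land outside dest (the "zip slip" case).
// Hypothetical helper for illustration; not part of pullman.
func resolveEntryPaths(dest string, names []string) ([]string, error) {
	// Computed once, before the loop, as suggested in the review.
	destPrefix := filepath.Clean(dest) + string(os.PathSeparator)

	paths := make([]string, 0, len(names))
	for _, name := range names {
		// filepath.Join cleans the result, so "../" segments are resolved
		// before the prefix comparison.
		destFilePath := filepath.Join(dest, name)
		if !strings.HasPrefix(destFilePath, destPrefix) {
			return nil, fmt.Errorf("illegal file path in archive: %s", name)
		}
		paths = append(paths, destFilePath)
	}
	return paths, nil
}
```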

njhill · 2022-05-13T19:10:48Z

pullman/storageproviders/http/downloader.go

-		return fmt.Errorf("error writing resource to local file '%s': %w", filename, err)
+	contentType := resp.Header.Get("Content-type")
+	if strings.Contains(contentType, "application/x-tar") || strings.Contains(contentType, "application/x-gtar") ||
+		strings.Contains(contentType, "application/x-gzip") || strings.Contains(contentType, "application/gzip") {


There is also application/tar and application/tar+gzip (though I guess Contains would catch the latter).

Also wondering whether we should detect based on file extensions in the URL even if Content-Type isn't specified?

pvaneck · 2022-05-25

Sure, I can add file extension checks as fallbacks.
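
One way the fallback discussed here could look, as a rough sketch only: match the Content-Type values from the diff plus the two mentioned in the review, and if none match, fall back to the extension of the URL path. `looksLikeTarOrGzip` and the two lists are hypothetical names, and the extension list is an assumption rather than something taken from the PR.

```go
package main

import (
	"net/url"
	"strings"
)

// Content types from the diff plus those raised in review. Because matching
// uses strings.Contains, "application/tar" also covers "application/tar+gzip".
var tarGzipContentTypes = []string{
	"application/x-tar", "application/x-gtar", "application/tar",
	"application/x-gzip", "application/gzip",
}

// Assumed set of extensions to fall back on when the header is unhelpful.
var tarGzipExtensions = []string{".tar", ".tar.gz", ".tgz", ".gz"}

// looksLikeTarOrGzip reports whether a download is probably a tar and/or
// gzip file, preferring Content-Type and falling back to the URL path.
// Hypothetical helper for illustration; not part of pullman.
func looksLikeTarOrGzip(contentType, rawURL string) bool {
	ct := strings.ToLower(contentType)
	for _, t := range tarGzipContentTypes {
		if strings.Contains(ct, t) {
			return true
		}
	}

	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	// Use only the path so query strings such as "?token=..." are ignored.
	p := strings.ToLower(u.Path)
	for _, ext := range tarGzipExtensions {
		if strings.HasSuffix(p, ext) {
			return true
		}
	}
	return false
}
```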

njhill · 2022-05-13T19:17:39Z

pullman/helpers.go

+	return nil
+}
+
+// Extract a tar/tgz/tar.gz archive file into the provided destination directory.


This won't currently work for a non-gzipped tar, and we should probably also support gzipped non-tar (single-file) content. I think we need some additional differentiation logic: some of the content types should be unambiguous, but for plain application/gzip we might need to unzip and then inspect (or maybe peek at the beginning of the stream).

pvaneck · 2022-05-25 (edited)

~~Based on my testing, this does actually work with non-gzipped tars even with the gzip reader included.~~ Never mind, I was mistaken here (my tar file was actually gzipped). Will adjust the logic here.

For gzip-only support, as you say, the content type application/gzip can be used for both .gz and .tar.gz. Would checking for a lone .gz extension and then sending it through the gzip reader suffice for this? I feel like gzip-only files probably aren't nearly as common as the tar.gz use case.
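
As an alternative to relying on the extension, the "unzip and then inspect" idea from the review could be sketched roughly as follows: decompress only the first 512-byte block and look for the `ustar` marker that POSIX tar headers carry at offset 257. `gzipWrapsTar` and `gzipBuffer` are hypothetical names; this is an illustration under those assumptions, not what the PR implements.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"io"
)

// gzipWrapsTar decompresses just the first 512-byte block of a gzip stream
// and reports whether it looks like a tar header. Tar archives written in the
// old pre-POSIX (v7) format carry no magic and would not be detected.
// Hypothetical helper for illustration; not part of pullman.
func gzipWrapsTar(r io.Reader) (bool, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return false, err
	}
	defer zr.Close()

	header := make([]byte, 512)
	n, err := io.ReadFull(zr, header)
	// Payloads shorter than one block are fine; they are simply not tar.
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return false, err
	}
	return n >= 262 && string(header[257:262]) == "ustar", nil
}

// gzipBuffer compresses data in memory; used only to build sample input.
func gzipBuffer(data []byte) (*bytes.Buffer, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return &buf, nil
}
```

Note that the reader is consumed by the check, so a caller working on a stream would need to buffer or reopen the source before extracting.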

pvaneck · 2022-06-08T03:41:41Z

Updated this PR to rely on magic bytes for checking the file format (as opposed to the Content-Type headers). Tested with actual .gz, .tar.gz, .tar, and .zip files.
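
For readers unfamiliar with magic-byte detection, a minimal sketch of the general technique is shown here. It uses the well-known signatures for zip (`PK\x03\x04`), gzip (`0x1f 0x8b`), and POSIX tar (`ustar` at offset 257). `detectFormat` is a hypothetical function written for illustration, and the PR's actual detection code may differ.

```go
package main

import "bytes"

// detectFormat inspects the leading bytes of a file and returns "zip",
// "gzip", "tar", or "" when the format is not recognized. A .tar.gz is
// reported as "gzip"; the tar layer only becomes visible after decompression.
// Hypothetical helper for illustration; not part of pullman.
func detectFormat(header []byte) string {
	switch {
	// Local file header signature of a non-empty zip archive.
	case bytes.HasPrefix(header, []byte("PK\x03\x04")):
		return "zip"
	// gzip member header: ID1 = 0x1f, ID2 = 0x8b.
	case bytes.HasPrefix(header, []byte{0x1f, 0x8b}):
		return "gzip"
	// POSIX tar stores the magic "ustar" at offset 257 of the first block.
	case len(header) >= 262 && string(header[257:262]) == "ustar":
		return "tar"
	}
	return ""
}
```

Reading the first 512 bytes of the downloaded file is enough input for all three checks.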

This enables archive extraction for the zip format and the gz/tar/tar.gz formats. If a user provides an http(s) link to one of these formats, pullman will now automatically extract the contents into the destination directory.

Signed-off-by: Paul Van Eck <pvaneck@us.ibm.com>

njhill

Thanks @pvaneck.

Another thing I thought of - since we are changing the filename (and possibly changing to a directory) after unzipping / untarring, we may need to add a way for the new model file/dir location to be returned from the puller. This is because the subsequent adapter logic expects a local file/dir in the location specified by the original model path (which in this case could end in e.g. somefile.tar.gz), and it won't find that.

Not sure if you had a chance to try this out with one of the built-in model servers (e.g. Triton), but unless I'm mistaken I don't think it will work, which may limit the usefulness of this feature.

njhill · 2022-06-14T23:39:57Z

pullman/storageproviders/http/downloader.go

+			}
+			os.Remove(filename)
+		}
+	}


Could make sense here to split into two - check for gz or zip first and if one of them unzip and update the filename and extension vars. Then separately check extension var for tar after (which will be either original or unzipped).
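
The two-stage flow suggested here could be sketched roughly as follows. This is a hypothetical planning function that works on file extensions only, whereas the PR detects formats from magic bytes; `planExtraction` and the step names are illustrative and do not come from the PR.

```go
package main

import (
	"path/filepath"
	"strings"
)

// planExtraction returns the ordered extraction steps for filename and the
// name that remains once every layer has been removed.
// Hypothetical helper for illustration; not part of pullman.
func planExtraction(filename string) (steps []string, finalName string) {
	name := filename
	ext := strings.ToLower(filepath.Ext(name))

	// Stage 1: undo the outer gz/zip layer, if any, and update the name.
	switch ext {
	case ".gz":
		steps = append(steps, "gunzip")
		name = strings.TrimSuffix(name, filepath.Ext(name))
	case ".tgz":
		steps = append(steps, "gunzip")
		name = strings.TrimSuffix(name, filepath.Ext(name)) + ".tar"
	case ".zip":
		steps = append(steps, "unzip")
		name = strings.TrimSuffix(name, filepath.Ext(name))
	}

	// Stage 2: check the (possibly updated) extension for tar separately.
	ext = strings.ToLower(filepath.Ext(name))
	if ext == ".tar" {
		steps = append(steps, "untar")
		name = strings.TrimSuffix(name, filepath.Ext(name))
	}
	return steps, name
}
```

With this split, `model.tar.gz` goes through both stages, while `model.tar` and `weights.gz` each go through one.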

pvaneck · 2022-06-15T20:09:08Z

> Another thing I thought of - since we are changing the filename (and possibly changing to a directory) after unzipping / untarring, we may need to add a way for the new model file/dir location to be returned from the puller. This is because the subsequent adapter logic expects a local file/dir in the location specified by the original model path (which in this case could end in e.g. somefile.tar.gz), and it won't find that.
>
> Not sure if you had a chance to try this out with one of the built-in model servers (e.g. Triton), but unless I'm mistaken I don't think it will work, which may limit the usefulness of this feature.

🤔 Thanks for pointing that out. I think you are right. Let me do some more thorough end to end tests and get back to you.

pvaneck · 2022-06-18T00:01:49Z

/hold

kserve-oss-bot requested review from animeshsingh and chinhuang007 May 4, 2022 04:53

kserve-oss-bot added the approved label May 4, 2022

njhill self-requested a review May 10, 2022 16:02

njhill reviewed May 13, 2022

View reviewed changes

pvaneck force-pushed the http-archive branch 2 times, most recently from 0c8d91e to f0d00a5 Compare June 7, 2022 18:36

pvaneck force-pushed the http-archive branch from f0d00a5 to 441152f Compare June 13, 2022 18:57

njhill reviewed Jun 15, 2022

View reviewed changes

kserve-oss-bot added the do-not-merge/hold label Jun 18, 2022

njhill mentioned this pull request Sep 12, 2022

How to load model from .tar.gz/.tgz? kserve/modelmesh-serving#226

Open
