Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add archive extraction support for http(s) #24

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

pvaneck
Copy link
Member

@pvaneck pvaneck commented May 4, 2022

This enables archive extraction/decompression for the zip format and the gz/tar/tar.gz formats.
If a user provides an http(s) link to one of these formats, pullman will now automatically
extract the contents into the destination directory.

Unit tests were added to test the extraction functions where archives are generated during the test.

Closes: #16

@kserve-oss-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pvaneck

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pvaneck
Copy link
Member Author

pvaneck commented May 4, 2022

These helper functions can be leveraged for future archive extraction support for other providers.

@njhill njhill self-requested a review May 10, 2022 16:02
Copy link
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @pvaneck.

Couple of additional comments, probably not for this PR though:

  • As you mentioned it would be good to have this be storage-type agnostic. I know it's most useful for http, and fine to just add it for that first. I guess performance-wise its good to have some provider-specific logic so we can unzip/tar a stream rather than writing to disk first.
  • I wonder if blind extraction could cause a problem for runtimes that might expect an archive themselves. I think some also autodetect and accept either so it wouldn't be a problem for those. But per some of the prior internal discussion it would be nice to add an parameter on the ServingRuntime with a list of archive types natively supported, so that the extraction would be bypassed if appropriate/necessary.

destFilePath := filepath.Join(dest, zipFile.Name)

// Zip slip vulnerability check
if !strings.HasPrefix(destFilePath, filepath.Clean(dest)+string(os.PathSeparator)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could move filepath.Clean(dest)+string(os.PathSeparator) before the loop

return fmt.Errorf("error writing resource to local file '%s': %w", filename, err)
contentType := resp.Header.Get("Content-type")
if strings.Contains(contentType, "application/x-tar") || strings.Contains(contentType, "application/x-gtar") ||
strings.Contains(contentType, "application/x-gzip") || strings.Contains(contentType, "application/gzip") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also application/tar and application/tar+gzip (though I guess Contains would catch the latter).

Also wondering whether we should detect based on file extensions in the URL even if Content-Type isn't specified?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can add file extension checks as fall backs.

return nil
}

// Extract a tar/tgz/tar.gz archive file into the provided destination directory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work currently for a non-gzipped tar, and we should probably also support gzipped non-tar (single file). I think we need some additional differentiation logic e.g. some of the content types should be unambiguous, just application/gzip might need to unzip and then inspect (or peek beginning of stream maybe).

Copy link
Member Author

@pvaneck pvaneck May 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my testing, this does actually work with non-gzipped tars even with the gzip reader included. Nevermind, I was mistaken here (my tar file was actually gzipped). Will adjust the logic here.

For gzip only support, as you say, the content-type application/gzip can be used for both .gz and .tar.gz. Would checking for a lone .gz extension then sending it through the gzip reader suffice for this? I feel like gzip only files probably aren't nearly as common as the tar.gz use case.

@pvaneck pvaneck force-pushed the http-archive branch 2 times, most recently from 0c8d91e to f0d00a5 Compare June 7, 2022 18:36
@pvaneck
Copy link
Member Author

pvaneck commented Jun 8, 2022

Updated this PR to rely on magic bytes for checking the file format (as opposed to the Content-Type headers). Tested with actual .gz, .tar.gz, .tar, and .zip files.

This enables archive extraction for the zip format and the
gz/tar/tar.gz formats. If a user provides an http(s) link to one of
these formats, pullman will now automatically extract the contents
into the destination directory.

Signed-off-by: Paul Van Eck <pvaneck@us.ibm.com>
Copy link
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pvaneck.

Another thing I thought of - since we are changing the filename (and possibly changing to a directory) after unzipping / untarring, we may need to add a way for the new model file/dir location to be returned from the puller. This is because the subsequent adapter logic expects a local file/dir in the location specified by the original model path (which in this case could end in e.g. somefile.tar.gz), and it won't find that.

Not sure if you had a chance to try this out with one of the built in model servers (e.g. Triton) but unless I'm mistaken I don't think it will work. Which may limit the usefulness of this feature.

}
os.Remove(filename)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could make sense here to split into two - check for gz or zip first and if one of them unzip and update the filename and extension vars. Then separately check extension var for tar after (which will be either original or unzipped).

@pvaneck
Copy link
Member Author

pvaneck commented Jun 15, 2022

Another thing I thought of - since we are changing the filename (and possibly changing to a directory) after unzipping / untarring, we may need to add a way for the new model file/dir location to be returned from the puller. This is because the subsequent adapter logic expects a local file/dir in the location specified by the original model path (which in this case could end in e.g. somefile.tar.gz), and it won't find that.

Not sure if you had a chance to try this out with one of the built in model servers (e.g. Triton) but unless I'm mistaken I don't think it will work. Which may limit the usefulness of this feature.

🤔 Thanks for pointing that out. I think you are right. Let me do some more thorough end to end tests and get back to you.

@pvaneck
Copy link
Member Author

pvaneck commented Jun 18, 2022

/hold

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable archive extraction for HTTP storage provider
3 participants