[async] Support URLFile in the upload_file function #1987

aron · 2024-10-11T14:53:34Z

We would like predict functions to be able to return a remote URL rather than a local file on disk and have it behave like a file object. And when it is passed to the file uploader it will stream the file from the remote to the destination provided.

class Predictor(BasePredictor):
    def predict(self, **kwargs) -> File:
        return URLFile("https://replicate.delivery/czjl/9MBNrffKcxoqY0iprW66NF8MZaNeH322a27yE0sjFGtKMXLnA/hello.webp")

This PR modifies the URLFile class to use urllib.request.openurl instead of requests as the HTTP client for reading the file data.

The urllib.response.addinfourl interface conforms to the one used by urllib3.response.HTTPResponse so I don't think we have any gaps here, longer term we probably want to figure out a more reliable shim. But as this branch is mostly used for language models we're probably okay.

Note

I've tested this locally by creating a function that returns a FileURL() pointing to a remote address and the file is uploaded successfully.

This commit modifies the URLFile class to use `httpx` instead of `requests` as the HTTP client for reading the file data. The `httpx.Response` interface is different from the one used by `urllib3.response.HTTPResponse` so it's unclear where the gaps are here, longer term we probably want to figure out a more reliable shim. But as this branch is mostly used for language models we're probably okay. The `ChunkFileReader` has been updated to use the `iter()` function to convert the `io.IOBase` object into an iterator. This works as usual for file handlers. For the `httpx.Response` object we'll return a `iter_raw()` to yield the bytes from the stream. Because `httpx.Response` doesn't support `seek()` to reset the file we instead generate a new request each time the iterator is called.

This commit refactors the existing change to use httpx instead to use the urllib module from the standard library instead. This already conforms to a file like object and fits in better with the design.

aron

Updated the code to use urllib instead of requests or httpx. The resulting urllib.response.addinfourl is a file-like object and works well with the design of URLFile.

python/cog/server/clients.py

aron · 2024-10-11T14:57:39Z

python/cog/types.py

+        # URLFile doesn't support `seek()` so the upload_file function
+        # cannot reset the request should it need to retry. As a hack
+        # we re-create the request each time the iterator is requested.
+        if response.num_bytes_downloaded > 0:
+            response.close()
+            object.__delattr__(self, "__target__")


I don't really like this, it was a bit of a last minute bodge to get upload retries working. An alternative would be to implement seek() and seekable() and if seek(0) is called reset the target, else raise an error.

python/cog/server/clients.py

aron · 2024-10-14T13:08:06Z

python/cog/types.py

+            # We create a streaming response here, much like the `requests`
+            # version in the main 0.9.x branch. The only concerning bit here
+            # is that the book keeping for closing the response needs to be
+            # handled elsewhere. There's probably a better design for this
+            # in the long term.
+            res = urllib.request.urlopen(url)
+            object.__setattr__(self, "__target__", res)


This feels so much simpler than the requests version. What am I missing, why would we use requests here?

Hmmm, actually, thinking about this some more -- does this actually work? This doesn't spin off a background process to download the file, it blocks until the file is downloaded and stores the whole thing in memory.

Can we chat about this sync? Is requests threaded by default?

We discussed this in-person and this interface will do what we want. It looks like we can actually get something similar using httpx as well:

with httpx.stream("GET", "https://example.com/") as response: raw_stream = response.extensions["network_stream"] print(raw_stream.read(2)) # b'\x1f\x8b'

encode/httpx#2296 (comment)

But for the moment we'll go with this simpler implementation and re-factor later.

python/cog/types.py

python/tests/server/test_clients.py

nickstenning

This mostly looks good, but I do think we're potentially giving ourselves a security headache if we don't constrain URL schemes.

aron changed the title ~~Support URLFile in the upload_file function~~ [async] Support URLFile in the upload_file function Oct 11, 2024

aron force-pushed the async-upload-file-url branch from a11fc98 to 43d1428 Compare October 11, 2024 14:59

Refactor URLFile to use urllib.response instead of httpx

84fb135

This commit refactors the existing change to use httpx instead to use the urllib module from the standard library instead. This already conforms to a file like object and fits in better with the design.

aron commented Oct 14, 2024

View reviewed changes

aron added 2 commits October 15, 2024 15:51

fix linting issues

3c3f215

Fix tests

ff41cbb

aron requested a review from nickstenning October 15, 2024 15:22