-
Notifications
You must be signed in to change notification settings - Fork 557
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[async] Support URLFile in the upload_file function #1987
Conversation
This commit modifies the URLFile class to use `httpx` instead of `requests` as the HTTP client for reading the file data. The `httpx.Response` interface is different from the one used by `urllib3.response.HTTPResponse` so it's unclear where the gaps are here, longer term we probably want to figure out a more reliable shim. But as this branch is mostly used for language models we're probably okay. The `ChunkFileReader` has been updated to use the `iter()` function to convert the `io.IOBase` object into an iterator. This works as usual for file handlers. For the `httpx.Response` object we'll return a `iter_raw()` to yield the bytes from the stream. Because `httpx.Response` doesn't support `seek()` to reset the file we instead generate a new request each time the iterator is called.
a11fc98
to
43d1428
Compare
This commit refactors the existing change to use httpx instead to use the urllib module from the standard library instead. This already conforms to a file like object and fits in better with the design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the code to use urllib
instead of requests
or httpx
. The resulting urllib.response.addinfourl
is a file-like object and works well with the design of URLFile.
python/cog/types.py
Outdated
# URLFile doesn't support `seek()` so the upload_file function | ||
# cannot reset the request should it need to retry. As a hack | ||
# we re-create the request each time the iterator is requested. | ||
if response.num_bytes_downloaded > 0: | ||
response.close() | ||
object.__delattr__(self, "__target__") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really like this, it was a bit of a last minute bodge to get upload retries working. An alternative would be to implement seek()
and seekable()
and if seek(0)
is called reset the target, else raise an error.
python/cog/types.py
Outdated
# We create a streaming response here, much like the `requests` | ||
# version in the main 0.9.x branch. The only concerning bit here | ||
# is that the book keeping for closing the response needs to be | ||
# handled elsewhere. There's probably a better design for this | ||
# in the long term. | ||
res = urllib.request.urlopen(url) | ||
object.__setattr__(self, "__target__", res) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels so much simpler than the requests version. What am I missing, why would we use requests
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, actually, thinking about this some more -- does this actually work? This doesn't spin off a background process to download the file, it blocks until the file is downloaded and stores the whole thing in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we chat about this sync? Is requests threaded by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this in-person and this interface will do what we want. It looks like we can actually get something similar using httpx as well:
with httpx.stream("GET", "https://example.com/") as response:
raw_stream = response.extensions["network_stream"]
print(raw_stream.read(2))
# b'\x1f\x8b'
But for the moment we'll go with this simpler implementation and re-factor later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mostly looks good, but I do think we're potentially giving ourselves a security headache if we don't constrain URL schemes.
We would like
predict
functions to be able to return a remote URL rather than a local file on disk and have it behave like a file object. And when it is passed to the file uploader it will stream the file from the remote to the destination provided.This PR modifies the URLFile class to use
urllib.request.openurl
instead ofrequests
as the HTTP client for reading the file data.The
urllib.response.addinfourl
interface conforms to the one used byurllib3.response.HTTPResponse
so I don't think we have any gaps here, longer term we probably want to figure out a more reliable shim. But as this branch is mostly used for language models we're probably okay.Note
I've tested this locally by creating a function that returns a
FileURL()
pointing to a remote address and the file is uploaded successfully.