
Use io.IOBase as base class for StreamingBody #879

Closed
depp opened this issue Apr 13, 2016 · 17 comments
Labels
feature-request This issue requests a feature.

Comments

@depp

depp commented Apr 13, 2016

StreamingBody is a "file-like object", but file-like objects should probably inherit from IOBase or implement a similar suite of methods. This will provide an easy fix for #767.

@kyleknap
Contributor

Are there any methods in particular that you are looking for? It may be difficult to inherit from IOBase, as not all of its methods may be possible to implement (given that, underneath, we are reading an unseekable stream from the server).

kyleknap added the closing-soon and feature-request labels and removed the closing-soon label on Apr 15, 2016
@akvadrako

@kyleknap, IOBase can certainly work with streaming data. This is sufficient for my current use-case, but with a little more work I think it could be fully supported. My main interest was in line-by-line iteration:

from io import RawIOBase

class StreamingBodyIO(RawIOBase):
    """Wrap a boto StreamingBody in the IOBase API."""
    def __init__(self, body):
        self.body = body

    def readable(self):
        return True

    def read(self, n=-1):
        # StreamingBody.read expects None for "read everything".
        n = None if n < 0 else n
        return self.body.read(n)
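
For what it's worth, once the body is wrapped like this, line-by-line iteration falls out of the standard io machinery. A minimal usage sketch, assuming the StreamingBodyIO class above; the bucket and key names are placeholders:

import io
import boto3

# Placeholder bucket/key, for illustration only.
body = boto3.resource('s3').Object('my_bucket', 'my_object.txt').get()['Body']

# BufferedReader buffers the raw adapter, so iterating yields one bytes
# line at a time without a read() call per byte.
buffered = io.BufferedReader(StreamingBodyIO(body))
for line in buffered:
    print(line.decode('utf-8'), end='')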

@atrigent

There are two other open feature requests (#1197 and #767) that would be solved by inheriting from IOBase or RawIOBase, and one open PR (#1195) that should be closed because it does not use this approach.

@atrigent

...actually, is there any point to keeping StreamingBody around at all? It looks like urllib3 implements content length checking itself, and it looks like that socket timeout thing might've been resolved as well.

@GFernie

GFernie commented Oct 9, 2017

When returning a file-like object such as StreamingBody, io.IOBase is a very reasonable abstraction to assume. From the io.IOBase docstring:

This class provides empty abstract implementations for many methods that derived classes can override selectively; the default implementations represent a file that cannot be read, written or seeked.

E.g., the missing io.IOBase attributes make it very difficult to lazily read a compressed file from S3:

from gzip import GzipFile
import boto3

body_stream_gz = boto3.resource('s3').Object('my_bucket', 'my_object.txt.gz').get()['Body']
with GzipFile(fileobj=body_stream_gz) as body_stream:
    content = body_stream.read(1000)

---------------------------------------------------------------------------
AttributeError: 'StreamingBody' object has no attribute 'tell'

Here's an accepted cpython PR for the same issue:
python/cpython#3249

@glyph

glyph commented Aug 30, 2018

Another problem that this would resolve: handling text response bodies as text while still streaming them. TextIOWrapper(StreamingBody(...), encoding="utf-8") seems like it should work, but instead raises builtins.AttributeError: 'StreamingBody' object has no attribute 'readable'
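
Until that lands, the RawIOBase adapter sketched earlier in this thread appears to be enough to make TextIOWrapper happy, since it supplies readable() and the other io attributes. A rough sketch, with placeholder bucket and key names and assuming the StreamingBodyIO class from above:

import io
import boto3

# Placeholder bucket/key; StreamingBodyIO is the adapter from the
# earlier comment in this thread.
body = boto3.resource('s3').Object('my_bucket', 'my_object.txt').get()['Body']
text_stream = io.TextIOWrapper(io.BufferedReader(StreamingBodyIO(body)),
                               encoding='utf-8')
for line in text_stream:
    print(line, end='')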

@AlJohri

AlJohri commented Nov 30, 2018

@GFernie did this already get fixed? I was able to get the following code working:

import json
import gzip
import boto3
from urllib.parse import urlparse

def parse_s3_uri(uri):
    o = urlparse(uri)
    return o.netloc, o.path.lstrip('/')

def stream_file(uri):
    bucket, key = parse_s3_uri(uri)
    fileobj = boto3.resource('s3').Object(bucket, key).get()['Body']
    with gzip.GzipFile(fileobj=fileobj) as f:
        yield from (json.loads(line.decode('utf-8')) for line in f)

if __name__ == "__main__":
    for row in stream_file('s3://mybucket/my/path/key.json.gz'):
        print(row)

It lazily reads from a compressed .json.gz file.

@glyph

glyph commented Dec 27, 2018

I am also very curious about @AlJohri's question! Is this actually fixed now?

@evmin

evmin commented Nov 4, 2019

Would be keen to get it fixed as well.

@evmin

evmin commented Nov 4, 2019

Especially since there has been a pull request.

@joguSD
Contributor

joguSD commented Nov 10, 2020

We've taken a couple of stabs at this (#2208, #2150), but both attempts failed once we did more extensive integration testing across our packages (botocore, boto3, aws-cli, s3transfer).

StreamingBody currently defines a close method that closes the underlying stream it wraps. When we extend io.IOBase we also inherit its finalization behavior, which calls close as part of finalization. In the lifecycle of returning a response from botocore we end up wrapping the underlying stream with StreamingBody twice, so one of those wrappers can run its full life cycle and be garbage collected before we return the response. The result is that we return a StreamingBody whose underlying stream has already been closed and can no longer be read.

This behavior seems to be avoidable on Python 3 by stubbing out the implementation of __del__, but on Python 2 there doesn't seem to be a way around it. Unfortunately, I think we're blocked on inheriting from io.IOBase as long as we support Python 2, and even on Python 3 I have mixed feelings about how robust that workaround is. We'll have to take another look at this in the future; otherwise we may just have to re-implement the methods directly on StreamingBody.
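
For readers following along, here is a minimal sketch of the Python 3 workaround described above (illustration only, not botocore's actual code): defining a no-op __del__ on the subclass replaces the finalizer inherited from io.IOBase, so a garbage-collected wrapper no longer closes the shared underlying stream.

import io

class _IOBaseNoFinalize(io.IOBase):
    # io.IOBase's finalizer calls close() when the object is garbage
    # collected. A no-op __del__ in the subclass replaces that finalizer,
    # so a discarded wrapper does not close the stream it wraps.
    def __del__(self):
        pass

Explicit close() calls still behave as before; only the implicit close-on-GC goes away.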

@glyph

glyph commented Jan 27, 2021

@joguSD Could you perhaps implement the finalization behavior yourself, and lie about inheriting from IOBase using an if TYPE_CHECKING or something similar?
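
If I'm reading that suggestion right, it would look roughly like the following sketch. The assumption is that only type checkers need to see the IOBase relationship, and the _Base name is made up for illustration:

import io
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Type checkers treat StreamingBody as an io.IOBase subclass...
    _Base = io.IOBase
else:
    # ...but at runtime it avoids IOBase's close-on-finalize behavior.
    _Base = object

class StreamingBody(_Base):
    # The file-like methods (read, readable, close, ...) and any
    # finalization logic would be implemented here by hand.
    pass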

@nateprewitt
Contributor

@glyph There are likely some more intricate approaches we can take to circumvent the finalization behavior in Python 2. Our initial research suggested it was going to be a fairly deep dive to get this working nicely across all of our packages.

Given we're ending Python 2 support in the next 5 months, I'd say it's more likely we'll implement the Python 3 __del__ solution then, and keep the code simpler if possible.

@glyph

glyph commented Jan 27, 2021

@nateprewitt In Twisted, when Python 2 was holding us back in some area, our approach was to implement the Python 3-only version of the feature but not expose it on py2, so py3 users could start taking advantage of it. It wouldn't be a regression for py2 users if you did that today, after all.
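
In boto terms that might look something like the sketch below; the class bodies are hypothetical stubs, just to show the shape of the idea:

import sys

if sys.version_info[0] >= 3:
    import io

    class StreamingBody(io.IOBase):
        # Python 3 users get the io.IOBase interface (readable(), tell(),
        # iteration, ...) layered over the wrapped urllib3 stream.
        pass
else:
    class StreamingBody(object):
        # Python 2 keeps the existing behaviour, so nothing regresses.
        pass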

@mdavis-xyz

In April I submitted a PR with a short-term fix for this. The PR has not been merged, closed, or commented on.

Can anyone here tell me the status of this repo? I see fresh commits, so it appears to be actively maintained.
How can I get the PR merged, or at least get feedback on why it can't be merged?
Should I raise a support request with AWS in the console? (I wouldn't even know which service to select.)

@tim-finnigan
Contributor

Closing this as the PR linked above was merged.

@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
