Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

still doesn't work with boto3 StreamingBody #17135

Closed
uiur opened this issue Aug 1, 2017 · 14 comments
Closed

still doesn't work with boto3 StreamingBody #17135

uiur opened this issue Aug 1, 2017 · 14 comments
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

Comments

@uiur
Copy link

uiur commented Aug 1, 2017

ref: #16135 #16150

Example

s3_object = client.get_object(Bucket=bucket, Key=key)
result = read_csv(s3_object["Body"])
# ValueError: Invalid file path or buffer object type: <class 'botocore.response.StreamingBody'>

Problem description

issue: is_file_like requirements are too strict for boto3 S3 objects #16135
pull request: #16150

This issue was closed but it's not working. I've found that the test is skipped and has a bug.
I'll send a pull request to reproduce the behavior.

I think it's because botocore.response.StreamingBody doesn't have __iter__ so is_file_like returns False.

INSTALLED VERSIONS

commit: c8dcf19
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL:
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.21.0.dev+317.gc8dcf19
pytest: 3.1.3
pip: 7.1.2
setuptools: 36.2.6
Cython: 0.26
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: 0.1.2
pandas_gbq: None
pandas_datareader: None

@uiur uiur mentioned this issue Aug 1, 2017
4 tasks
@uiur
Copy link
Author

uiur commented Aug 1, 2017

cc @gfyoung

@gfyoung gfyoung added IO CSV read_csv, to_csv IO Data IO issues that don't fit into a more specific label Regression Functionality that used to work in a prior pandas version and removed IO CSV read_csv, to_csv labels Aug 1, 2017
@gfyoung
Copy link
Member

gfyoung commented Aug 1, 2017

@uiureo : Thanks for the report! From our conversation back in that issue, we got the impression that the PR would address this issue. Network tests are more difficult to reproduce because they are out of our control, but that's a little unfortunate that the purported patch failed to address this.

@gfyoung
Copy link
Member

gfyoung commented Aug 1, 2017

One issue currently is that we are facing this delicate balancing act between accepting as many "valid" IO objects as possible without allowing for invalid ones. We check for __iter__ because of #15337, as that is a robust method of rejecting such objects.

For your patch to be robust, you will need to find a way that rejects mock.Mock() while allowing for StreamingBody. That would be fantastic if you could!

@uiur
Copy link
Author

uiur commented Aug 11, 2017

This issue would be resolved if this PR boto/botocore#1195 is merged at botocore.

@gfyoung
Copy link
Member

gfyoung commented Aug 11, 2017

@uiureo : Thanks for letting us know! That would definitely be one way to resolve this issue.

@stonefly
Copy link

stonefly commented Sep 19, 2017

It's been widely accepted that as long as an object has read or write function it can be seen as a file-like object.

Much more importantly, this is the behavior before pandas 0.20.*.

A lot of codes relying on this assumption now stop working.

I'm wondering what is the point to re-define what is file-like here? Is the iter actually used in the implementation of read_csv for example?

@jorisvandenbossche jorisvandenbossche added this to the 0.21.0 milestone Sep 19, 2017
@jreback jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017
@klintan
Copy link

klintan commented Feb 22, 2018

This is almost 5 month old still open, however it seems neither 0.21.0 or 0.22.0 lets you use the boto3 . Had to go back all the way to 0.19.2 for it to work.

Is it a design decision to not include it at this point or will we see this capability in future versions ?

@jreback
Copy link
Contributor

jreback commented Feb 22, 2018

we let’s see we have 2200 issues and only a small number of volunteers - how shall things be prioritized?

submitting a PR would be the fastest way here

@kokes
Copy link
Contributor

kokes commented Aug 27, 2018

I can now confirm that this is no longer an issue with a recent boto3 (the abovementioned PR has been merged). Since boto3 is not a requirement for pandas, there is nothing to done here, is there?

In [16]: obj=client.get_object(Bucket='my-bucket', Key='my/data/foo.csv')

In [17]: obj['Body']
Out[17]: <botocore.response.StreamingBody at 0x7f26ac400d30>

In [18]: df=pandas.read_csv(obj['Body'])

In [19]: len(df)
Out[19]: 7865

@gfyoung
Copy link
Member

gfyoung commented Aug 27, 2018

@kokes : That's great to hear! Might you be able to construct a unit test for us?

@kokes
Copy link
Contributor

kokes commented Aug 27, 2018

@gfyoung I tried - it's my first code PR, so I'm not sure about things like adding dev dependencies (should I edit all the travis/circleci metadata as well?). Also, not sure if I can just have a test without asserts, since all I'm testing is that it doesn't fail, so any exception should break the test.

Running this with botocore older than 1.10.47 will result in an error, as expected.

        if not is_file_like(filepath_or_buffer):
            msg = "Invalid file path or buffer object type: {_type}"
>           raise ValueError(msg.format(_type=type(filepath_or_buffer)))
E           ValueError: Invalid file path or buffer object type: <class 'botocore.response.StreamingBody'>

pandas/io/common.py:232: ValueError
================================================== 1 failed, 1 passed in 0.36 seconds ===================================================
root@114ec59e9d98:/pandas# 

@gfyoung
Copy link
Member

gfyoung commented Aug 27, 2018

@kokes : I would suggest writing the test first and don't worry about the dependencies part yet. We can discuss that once you have written the test.

How does that sound?

@kokes
Copy link
Contributor

kokes commented Aug 27, 2018

@gfyoung we’ve already iterated once with Tom, see #22520

@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Aug 28, 2018
@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Nov 6, 2018
@mroeschke mroeschke added good first issue and removed IO Data IO issues that don't fit into a more specific label Regression Functionality that used to work in a prior pandas version labels Oct 16, 2019
@mroeschke mroeschke added the Needs Tests Unit test(s) needed to prevent regressions label Oct 16, 2019
@mroeschke
Copy link
Member

Appears this was closed by #22520

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants