COMPAT: reading json with lines=True from s3, xref #17200 #17201

Merged
merged 22 commits into from
Nov 27, 2017

alph486

@alph486 alph486 commented Aug 8, 2017

Attempt to decode the bytes array with the `encoding` passed to the call.
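
For readers skimming the thread, here is a minimal sketch of the idea, not the code that was merged. The helper name `combine_json_lines` and the UTF-8 fallback are assumptions made for illustration. It shows why bytes coming back from a remote read need decoding before the line-delimited input is split.

```python
import json


def combine_json_lines(data, encoding=None):
    """Turn JSON Lines input (str or bytes) into one JSON array string.

    Hypothetical helper for illustration only; not a pandas function.
    """
    # Remote reads (S3, URLs) can return bytes on Python 3. Decode first so
    # the splitting below always operates on text. UTF-8 is an assumed default.
    if isinstance(data, bytes):
        data = data.decode(encoding or "utf-8")
    # Split on "\n" only and drop blank lines such as the trailing one.
    lines = [line.strip() for line in data.split("\n") if line.strip()]
    return "[" + ",".join(lines) + "]"


# Bytes in, Python records out.
records = json.loads(combine_json_lines(b'{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n'))
```

Without the `decode` step, splitting bytes on a `str` separator raises `TypeError` on Python 3, which matches the symptom described in the linked issue.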

@gfyoung gfyoung added the IO JSON read_json, to_json, json_normalize label Aug 8, 2017
@gfyoung
Member

gfyoung commented Aug 8, 2017

@alph486 : Thanks for the PR! You will need to write tests for us to merge this. I would also take a look at what @TomAugspurger suggested in terms of patching this.

@alph486
Author

alph486 commented Aug 8, 2017

Is there a particular way you guys go about handling s3 in your tests? I see some files in a bucket somewhere, but idk if there are any jsonl type files or how to add one. For that matter, I don't see any tests for read_json where the file is in s3.

This reverts commit 8255cd0.
@pep8speaks

pep8speaks commented Aug 8, 2017

Hello @alph486! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on November 27, 2017 at 00:04 Hours UTC

@alph486
Author

alph486 commented Aug 8, 2017

@gfyoung After going back to clean this up, I realized that reading JSONL from a URL-based file suffers from the same issue. I have a cleaner fix that I think works in both cases.

Can you let me know how I should proceed for getting a JSONL file uploaded to the public s3 bucket for testing this in the test suite? I don't have write access.

@gfyoung
Member

gfyoung commented Aug 8, 2017

> Can you let me know how I should proceed for getting a JSONL file uploaded to the public s3 bucket for testing this in the test suite? I don't have write access.

@TomAugspurger @jreback : thoughts?

@jreback
Contributor

jreback commented Aug 8, 2017

If you put the file on a gist somewhere, @TomAugspurger or I can upload it to the bucket.

@jreback jreback changed the title from "Fix for #17200" to "COMPAT: reading json with lines=True from s3, xref #17200" on Aug 8, 2017
@alph486
Author

alph486 commented Aug 9, 2017

@jreback @gfyoung Thanks for the assist here. I'll wrap this up soon; just after I get my current project completed that requires this change.

@alph486
Author

alph486 commented Aug 15, 2017

@jreback @TomAugspurger Here's a file that I use in my tests. If you can give me an s3 path where you drop it I'll push the tests. https://gist.github.com/alph486/c7eff9a896d80dccae5a3b3283e75a4e#file-items-jsonl

@alph486
Author

alph486 commented Aug 18, 2017

bump?

@gfyoung
Member

gfyoung commented Aug 18, 2017

@alph486 : Are you still waiting on the updated URL to patch the failing tests?

@alph486
Author

alph486 commented Aug 18, 2017

@gfyoung yeah, i just need a target s3 url to put in my code

@gfyoung
Member

gfyoung commented Aug 18, 2017

ping @jreback @TomAugspurger

@TomAugspurger
Contributor

TomAugspurger commented Aug 18, 2017 via email

@alph486
Author

alph486 commented Aug 21, 2017

@jreback ?

@jreback
Contributor

jreback commented Aug 21, 2017

yeah I don't think I have this either. @TomAugspurger IIRC you can login using the Continuum dashboard.

@jreback
Contributor

jreback commented Aug 21, 2017

if that works we should prob change it :>

@TomAugspurger
Contributor

TomAugspurger commented Aug 22, 2017 via email

@alph486
Author

alph486 commented Aug 24, 2017

Sorry to keep nudging here, but any progress from IT, @TomAugspurger? I'd love to try out a patched official version so we can use our feature in production.

@TomAugspurger
Contributor

TomAugspurger commented Aug 24, 2017

No word yet.

Really, we should be using https://github.com/spulec/moto to mock all our S3 calls. Now that s3fs (and boto3) are handling all the network stuff, we don't need to be testing it. Just our text parsing. I'll open an issue about that (and you can look at fixing it if you're interested :)

#17325
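
As a rough illustration of the moto approach suggested above, and not the fixture that ended up in this PR, the sketch below round-trips a payload through a mocked S3 bucket. The function name, the dummy credentials, and the region are assumptions. The mock entry point is `mock_aws` in moto 5 and later and `mock_s3` in older releases, so the import is hedged both ways.

```python
import os


def roundtrip_through_mock_s3(bucket, key, payload):
    """Upload bytes to a moto-mocked S3 bucket and read them back.

    Hypothetical helper for illustration; no real network call is made.
    """
    # Imported lazily so this module still imports when the optional,
    # test-only dependencies (boto3, moto) are not installed.
    import boto3
    try:
        from moto import mock_aws as mock_s3_backend  # moto >= 5
    except ImportError:
        from moto import mock_s3 as mock_s3_backend  # older moto releases

    # Dummy credentials (assumed values) so botocore never looks for real ones.
    os.environ.setdefault("AWS_ACCESS_KEY_ID", "testing")
    os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "testing")

    with mock_s3_backend():
        client = boto3.client("s3", region_name="us-east-1")
        client.create_bucket(Bucket=bucket)
        client.put_object(Bucket=bucket, Key=key, Body=payload)
        return client.get_object(Bucket=bucket, Key=key)["Body"].read()
```

A test could then hand the returned bytes straight to the parsing code, which keeps the assertion about text handling rather than about the network layer.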

@alph486
Author

alph486 commented Aug 24, 2017 via email

@TomAugspurger
Contributor

Thanks, I'd be totally fine with just using moto for your test for now, and following up later by moving most of the other S3 tests to moto.

You'll need to edit one of the ci files to include moto. Probably ci/requirements-3.6.run

@alph486
Author

alph486 commented Aug 24, 2017 via email

@alph486
Author

alph486 commented Aug 29, 2017

Any guidance on which requirements files I should change for release packaging, development, and CI? (I currently have ci/requirements-3.6.run and requirements_dev.txt.) I just have to add s3fs to some of them, and moto where appropriate.

@jreback
Contributor

jreback commented Aug 29, 2017

you shouldn't add s3fs anywhere

@alph486
Author

alph486 commented Aug 29, 2017

Unless I'm misunderstanding, you cannot run any of the test suites that require pandas.io.s3 locally unless s3fs is part of requirements_dev.txt, no?

Edit:

For example, from my attempt at running `pytest pandas/tests/io/json/test_pandas.py -v`:


pandas/tests/io/json/test_pandas.py:996:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/io/json/json.py:324: in read_json
    encoding=encoding)
pandas/io/common.py:208: in get_filepath_or_buffer
    from pandas.io import s3
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    """ s3 support for remote file interactivity """
    from pandas import compat
    try:
        import s3fs
        from botocore.exceptions import NoCredentialsError
    except:
>       raise ImportError("The s3fs library is required to handle s3 files")
E       ImportError: The s3fs library is required to handle s3 files

pandas/io/s3.py:7: ImportError
=====================================

@TomAugspurger
Contributor

@alph486 close! Still a few style things in https://travis-ci.org/pandas-dev/pandas/jobs/279576239#L1938

FYI if you run flake8 pandas/tests/io it should print out all the issues. LMK if you want me to push a commit cleaning these up.

@alph486
Author

alph486 commented Sep 25, 2017

@TomAugspurger Yeah, but some of those unused imports are a backwards way of me leveraging the s3_resource, so they aren't necessarily just cleanup. I replied earlier to one of your comments about using s3_resource. I'd need to move it and potentially refactor some of that code to use it in my module. Can you check that out and let me know your thoughts?

@TomAugspurger
Contributor

@alph486 sorry I missed your comment earlier. I'll push a commit in a minute that refactors the s3_resource stuff.

@alph486
Author

alph486 commented Sep 25, 2017

Ok thanks, sorry about it taking so long. I'm not super confident in here.

@TomAugspurger
Contributor

OK pushed two commits. One gets you back up to date with master, and 6979fb8 is the actual changes. I added a pandas/tests/io/conftest.py with the s3_resource that both places use now.

If you git pull to test locally, you'll have to rebuild the C extensions (`python setup.py build_ext --inplace`). Hopefully everything is OK though.

@alph486
Author

alph486 commented Sep 26, 2017

@jreback are you okay with the changes made per your review?

@TomAugspurger TomAugspurger dismissed jreback’s stale review September 26, 2017 17:35

Fixed the whitespace

@TomAugspurger
Contributor

Should be all good. ping on green @alph486

@@ -1,2 +1,3 @@
xarray==0.9.1
pandas-gbq
Contributor

this needs to be reverted

@@ -5,4 +5,4 @@ cython
pytest>=3.1.0
pytest-cov
flake8
Contributor

revert this, s3fs is NOT a requirement for dev; we should be robust to not having this installed
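
One way to stay robust when an optional dependency is missing is to skip the affected tests instead of failing on import. The sketch below uses only the standard library (`importlib.util.find_spec` with `unittest.skipUnless`) rather than whatever mechanism the pandas test suite uses; the helper and class names are made up for illustration.

```python
import importlib.util
import unittest


def has_module(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


# Evaluated once at import time; False simply means the S3 tests get skipped.
HAS_S3FS = has_module("s3fs")


class TestReadJsonFromS3(unittest.TestCase):
    @unittest.skipUnless(HAS_S3FS, "s3fs is not installed")
    def test_s3fs_is_usable(self):
        import s3fs  # only reached when the guard above passed

        self.assertTrue(hasattr(s3fs, "S3FileSystem"))
```

With this guard, a developer without s3fs sees a skipped test rather than the ImportError shown in the traceback earlier in the thread.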

@@ -26,3 +26,4 @@ sqlalchemy
bottleneck
pymysql
Contributor

this is ok

"""
If PY3 and/or isinstance(json, bytes)
"""
if isinstance(json, bytes):
Contributor

1 line comment

@@ -0,0 +1,2 @@
{"a": 1, "b": 2}
Contributor

what is the purpose of this file?

Contributor

I see, ok you have to have this named .json otherwise it won't be picked up by setup.py (IOW the install test will fail).

@pytest.fixture(scope='module')
def tips_file():
    return os.path.join(tm.get_data_path(), 'tips.csv')

Contributor

cool

@jreback
Contributor

jreback commented Nov 10, 2017

can you rebase / update

@jreback jreback added this to the 0.21.1 milestone Nov 27, 2017
@jreback jreback added the Compat pandas objects compatability with Numpy or Python functions label Nov 27, 2017
@jreback jreback merged commit 4fd104a into pandas-dev:master Nov 27, 2017
@jreback
Contributor

jreback commented Nov 27, 2017

thanks @alph486

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Dec 8, 2017
TomAugspurger pushed a commit that referenced this pull request Dec 11, 2017
Successfully merging this pull request may close these issues.

read_json(lines=True) broken for s3 urls in Python 3 (v0.20.3)
5 participants