-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Default to_* methods to compression='infer' #22011
Default to_* methods to compression='infer' #22011
Conversation
@@ -28,7 +28,7 @@ | |||
# interface to/from | |||
def to_json(path_or_buf, obj, orient=None, date_format='epoch', | |||
double_precision=10, force_ascii=True, date_unit='ms', | |||
default_handler=None, lines=False, compression=None, | |||
default_handler=None, lines=False, compression='infer', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure where to update the to_json docs... didn't see a docstring in this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 1905 in 322dbf4
Convert the object to a JSON string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs tests
The following test is failing in Python 2.7 (see travis log): pandas/pandas/tests/test_common.py Lines 254 to 264 in 1033e8b
Let me look into #21227 as to what this test is for. Update: this test was added in 91451cb / #21478 (not #21227). @minggli can you explain the purpose of Here is the relevant pytest fixture: Lines 131 to 138 in 55cbd7d
|
Codecov Report
@@ Coverage Diff @@
## master #22011 +/- ##
==========================================
- Coverage 92.07% 92.06% -0.01%
==========================================
Files 170 170
Lines 50690 50704 +14
==========================================
+ Hits 46672 46680 +8
- Misses 4018 4024 +6
Continue to review full report at Codecov.
|
@WillAyd I believe the different values for |
@dhimmel saw your comment. thanks for contributing! this This is because it's not supported as stated in: pandas/pandas/io/formats/csvs.py Line 134 in 1033e8b
I don't see why it would fail your work on Having checked out your PR, I saw the issue but can't replicate it on master. Are you working off the latest master branch? |
Yep. Don't need to overthink it with all the various parameter combinations but at least need a test to ensure this now defaults to infer (since that is what you are changing) |
@minggli, this PR currently branches off at 322dbf4, which is now two commits behind. I can rebase, but don't think that's the issue.
I was thinking the change of the to_csv compression default may have caused this issue, but it doesn't make sense to me. The test explicitly specifies compression in |
I think the change of default in to_csv did change things. there is no infer_compression procedure in to_csv. I think that's what caused the failing of test. add _infer_compression in to_csv should solve this problem. by the way, it appears that #17900 has removed zip from docstrings in to_csv and to_json based on Oct 2017 discussion but zip compression for writing has been added in 0.23 Jun 2018: pandas/doc/source/whatsnew/v0.23.0.txt Line 543 in 67b6277
could you add 'zip' back in the docstrings of to_csv and to_json? |
Hmm. I thought to_csv now supports
Is there consensus on this change. IIRC from reading past issues, it was considered poor practice to save a single file to a zip archive. Thus pandas would be lenient on reading, but stricter on writing. I don't have a strong opinion, but don't want to hold up this PR for another controversial issue. So let's diagnose the failing test and see if it's somehow related to the zip issue. |
#17900 added infer in _get_handle but to_csv uses compression argument before calling _get_handle; therefore raises RuntimeWarning at https://github.com/dhimmel/pandas/blob/648bf4d1810a2c2b9cbff1d4b941ab7cb7bc0b35/pandas/io/formats/csvs.py#L133 because compression='infer' instead of None as it was before. It on itself shouldn't be a problem but pytest-dev/pytest#2917 exists so that test_compression_warning fails because the warning has been raised in earlier test before this case. simply changing the order of the test is not thread-safe so I think the best way is: inside to_csv: In regards to zip in docstring, it's what production is showing right and supported already.
|
So I'm a bit concerned with what has happened since I last worked on compression & file IO. My impression was that we should start to use a unified API for inferring compression and opening files: see #15008. Happy to revert be724fa or put it in another PR, just wanted to see the CI test output to know what this would break. |
why is this a problem? this is clear separation of concerns. I don't think the goal has drifted from when you last worked on this. A number of bugs / consolidations have happened in the interim. Happy to take further cleanup. |
Having |
Attempt to diagnose testing failure of Python 2 test_compression_warning https://travis-ci.org/pandas-dev/pandas/jobs/407300547#L3853
Hello @dhimmel! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on August 01, 2018 at 14:41 Hours UTC |
d41ede5
to
cebc0d9
Compare
Why is
|
# GH 21227 | |
def test_compression_warning(compression_only): | |
df = DataFrame(100 * [[0.123456, 0.234567, 0.567567], | |
[12.32112, 123123.2, 321321.2]], | |
columns=['X', 'Y', 'Z']) | |
with tm.ensure_clean() as filename: | |
f, _handles = _get_handle(filename, 'w', compression=compression_only) | |
with tm.assert_produces_warning(RuntimeWarning, | |
check_stacklevel=False): | |
with f: | |
df.to_csv(f, compression=compression_only) |
The error is:
E AssertionError: Did not see expected warning of class 'RuntimeWarning'.
It doesn't make sense to me how the changes in this PR would effect whether the RuntimeWarning
. This PR only changes the default compression value. The test specifies compression
so the default should not matter.
In cebc0d9 I added some debugging statements to help diagnose the testing failure
In CSVFormatter.save:
In test_compression_warning
:
Here's the output:
------------------------------ Captured log call -------------------------------
test_common.py 261 WARNING debug_1: gzip
test_common.py 266 WARNING debug_2: gzip
csvs.py 138 WARNING debug_3: gzip
csvs.py 139 WARNING debug_4: <gzip open file '/tmp/tmpQYAr5L', mode 'wb' at 0x7f7c02e56db0 0x7f7c04698b10>
csvs.py 141 WARNING debug_5: True
csvs.py 143 WARNING debug_6: in loop, should RuntimeWarn
So the logging shows in the failing test that the code enters the if statement that triggers the RuntimeWarning. So I don't understand how the warning could not be raised or where this issue could be coming from.
@WillAyd other than this issue how do things look? I added a test for to_csv
defaulting to inference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks OK, though I think the test can be improved. Will have to take a look again after some of the logging things get cleaned up
Test that to_csv defaults to inferring compression from paths. | ||
https://github.com/pandas-dev/pandas/pull/22011 | ||
""" | ||
df = DataFrame({"A": [1]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I suppose this "works" it is definitely focused on gzip
and doesn't make any assertions about how the behavior works across a combination of infer
, None
and an explicit compression.
We have a fixture called compression
in conftest.py
that covers the various compression options - is there a way to parametrize that as part of this test and make more comprehensive assertions about the potential combinations of behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The possible values that compression can take for to_csv
are already tested. The purpose of this test is simply to test that compression
is defaulting to 'infer'
. I kept the test simple so that it ONLY tests that the default is infer and won't break or malfunction should other aspects of the API change. The test will break if the default for compression is put back to None (and hopefully not for other reasons).
I think it may make sense to make a similar test for to_json
, but don't see how testing compression='infer'
for all possible compression extensions is within scope of this PR as this PR just changes the default and is not creating the infer option. Is the worry that there is currently inadequate testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point was that this test only makes sure that gzip compression works by default with infer
, but how does it guarantee that other compression types play nice with infer? I'll admit it's a subtle distinction but nuances like that do pop up and it doesn't seem like it should be that much more work to leverage the existing compression fixture for this test to increase coverage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should be that much more work to leverage the existing compression fixture for this test to increase coverage
In abd19e3, I modified an existing parametrized test to look for compression by default for paths where inference should occur. This actually caught an issue (we hadn't switched default for Series.to_csv
-- fixed in 2f670fe).
Should I delete the gzip test? Note the parametrized test doesn't test that the right compression is occurring, just that a compression is occurring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of going about it in this fashion can you not do a round trip with to_csv
and read_csv
using infer with a file extension for the former and the parametrized value as an argument for compression
in the latter? So in pseudo-code:
with tm.ensure_clean('compressed.csv.{}'.format(compression_only)) as path:
df.to_csv(path)
result = pd.read_csv(path, compression=compression_only)
tm.assert_frame_equal(result, df)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a big fan of the roundtrip approach here, since it never actually tests that the file is compressed on disk. Given that the to_* and read_* methods rely on much of the same compression infrastructure, I think it's possible to modify the code such that all compression gets disabled and the roudtrip works perfectly. Now hopefully there are enough other tests to catch such a situation.
|
Codecov Report
@@ Coverage Diff @@
## master #22011 +/- ##
==========================================
+ Coverage 92.07% 92.07% +<.01%
==========================================
Files 170 170
Lines 50690 50685 -5
==========================================
- Hits 46672 46669 -3
+ Misses 4018 4016 -2
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. ping on green
pandas/core/frame.py
Outdated
If 'infer' and `path_or_buf` is path-like, then detect compression | ||
from the following extensions: '.gz', '.bz2', '.zip' or '.xz' | ||
(otherwise no compression). | ||
.. versionchanged:: 0.24.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to have a blank after this or it has a warning, @TomAugspurger @datapythonista ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in f8829a6, but would be good to hear from @TomAugspurger and @datapythonista, since we have complex situations such as:
DOCLINE
DOCLINE
.. versionchanged:: 0.23.0
here is what was added
.. versionchanged:: 0.24.0 here is what changed
DOCLINE
For example, is the above OKAY or do we need additional blanks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dhimmel I think you need the additional blank lines (before, and not sure if after).
The reason is not that much an standard in this case, but about sphinx understanding the directive. What we expect in the documentation, is that it's rendered like in the validate
method of the merge
docstring: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
But if you don't leave the right blank lines, sphinx doesn't detect it's a directive, and the text is rendered as it is. See this case: https://pandas.pydata.org/pandas-docs/version/0.23.1/generated/pandas.IntervalIndex.from_tuples.html
So, the best is if you can build the documentation, and check that it's rendered all right. This can be done by ./doc/make.py html
(or ./doc/make.py html --single pandas.DataFrame.read_csv
)
Let me know if you have any issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in eadf68e and built the docs locally to confirm they're rendering properly.
Turns out the blank line before is required. After is not required. In between multiple statements is not required.
pandas/core/generic.py
Outdated
A string representing the compression to use in the output file, | ||
only used when the first argument is a filename. | ||
|
||
.. versionadded:: 0.21.0 | ||
.. versionchanged:: 0.24.0 | ||
'infer' option added and set to default | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. like this is good
pandas/core/series.py
Outdated
Allowed values are None, 'gzip', 'bz2', 'zip', 'xz', and 'infer'. | ||
This input is only used when the first argument is a filename. | ||
.. versionchanged:: 0.24.0 | ||
'infer' option added and set to default | ||
date_format: string, default None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
pandas/tests/io/test_common.py
Outdated
|
||
import pandas as pd | ||
import pandas.util.testing as tm | ||
import pandas.io.common as cmn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you rename to icom
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in af8c137
pandas/tests/io/test_common.py
Outdated
@@ -285,4 +285,100 @@ def test_unknown_engine(self): | |||
df = tm.makeDataFrame() | |||
df.to_csv(path) | |||
with tm.assert_raises_regex(ValueError, 'Unknown engine'): | |||
read_csv(path, engine='pyt') | |||
pd.read_csv(path, engine='pyt') | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put a market comment (e.g. line of --- or whatever), and put compression tests to delineate this section of the tests (also ok with a new test file tests_compression.py (maybe simpler)
@jreback it's 🍏, i.e. |
Refs pandas-dev#22011 (comment) Blanks are needed before but not after or in between.
lgtm @WillAyd merge when satisfied |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nits / question around comments
pandas/tests/io/test_compression.py
Outdated
def test_dataframe_compression_defaults_to_infer( | ||
write_method, write_kwargs, read_method, compression_only): | ||
# Test that DataFrame.to_* methods default to inferring compression from | ||
# paths. GH 22004 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just change comment to # GH22004
(standard in other tests). The rest here doesn't add anything that isn't inferred from the test name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in cf5b62e
pandas/tests/io/test_compression.py
Outdated
def test_series_compression_defaults_to_infer( | ||
write_method, write_kwargs, read_method, read_kwargs, | ||
compression_only): | ||
# Test that Series.to_* methods default to inferring compression from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
pandas/tests/io/test_compression.py
Outdated
# Assert that passing a file object to to_csv while explicitly specifying a | ||
# compression protocol triggers a RuntimeWarning, as per GH 21227. | ||
# Note that pytest has an issue that causes assert_produces_warning to fail | ||
# in Python 2 if the warning has occurred in previous tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be overlooking it but where in the links is there mention about Python 2 behavior? We get some random Resource Warnings in our tests which is maybe related so could be good info, but wasn't immediately apparent to me where that was
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pytest issue at https://git.io/fNEBm / pytest-dev/pytest#2917.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm I'll merge after AppVeyor goes green. Thanks @dhimmel
Not sure what the Travis failure is about:
|
@dhimmel : Thanks a lot for this! |
* master: (47 commits) Run tests in conda build [ci skip] (pandas-dev#22190) TST: Check DatetimeIndex.drop on DST boundary (pandas-dev#22165) CI: Fix Travis failures due to lint.sh on pandas/core/strings.py (pandas-dev#22184) Documentation: typo fixes in MultiIndex / Advanced Indexing (pandas-dev#22179) DOC: added .join to 'see also' in Series.str.cat (pandas-dev#22175) DOC: updated Series.str.contains see also section (pandas-dev#22176) 0.23.4 whatsnew (pandas-dev#22177) fix: scalar timestamp assignment (pandas-dev#19843) (pandas-dev#19973) BUG: Fix get dummies unicode error (pandas-dev#22131) Fixed py36-only syntax [ci skip] (pandas-dev#22167) DEPR: pd.read_table (pandas-dev#21954) DEPR: Removing previously deprecated datetools module (pandas-dev#6581) (pandas-dev#19119) BUG: Matplotlib scatter datetime (pandas-dev#22039) CLN: Use public method to capture UTC offsets (pandas-dev#22164) implement tslibs/src to make tslibs self-contained (pandas-dev#22152) Fix categorical from codes nan 21767 (pandas-dev#21775) BUG: Better handling of invalid na_option argument for groupby.rank(pandas-dev#22124) (pandas-dev#22125) use memoryviews instead of ndarrays (pandas-dev#22147) Remove depr. warning in SeriesGroupBy.count (pandas-dev#22155) API: Default to_* methods to compression='infer' (pandas-dev#22011) ...
Just wanted to add that I posted on Steem about this pull request, as part of the https://utopian.io initiative. Today, the utopian account upvoted my post, thereby rewarding it, as an incentive for open source contributions! So thanks Utopian and happy to help any other Pandas contributors get set up with a Steem account and use Utopian, just email me. |
git diff upstream/master -u -- "*.py" | flake8 --diff
This PR does the following:
to_csv
,to_json
, andto_pickle
methods to infer.test_compression_defaults_to_infer
to test that compression='infer' is default for the relevant to_* methods.compression='infer'
with a file object would produce a RuntimeWarning.test_compression_warning
which can fail due to a pytest bug.pandas/tests/test_common.py
topandas/tests/io/test_common.py