add compression support for 'read_pickle' and 'to_pickle' #13317
tests! |
pandas/io/common.py
    return inferred_compression


def _get_handle(path, mode, encoding=None, compression=None, is_txt=True):
what is is_txt for?
The _get_handle function was originally written for CSV file I/O, where it wraps the file handle in a TextIOWrapper. Pickle needs a binary file handle, so I added the parameter to bypass the TextIOWrapper.
hmm, ok, let's call this is_text=True then, and pls add a doc-string to _get_handle (should have been there already, but since params are being added now...). thanks
doc string updated
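For context, the text-versus-binary distinction discussed in this thread can be sketched with the standard library alone. The helper below is a hypothetical stand-in: the name `open_handle`, its supported compression names, and its default encoding are assumptions, and it is not the `_get_handle` code from this PR. It always opens the underlying stream in binary mode and adds an `io.TextIOWrapper` only when `is_text` is true, which is what CSV-style text I/O wants and what pickle has to avoid.

```python
import bz2
import gzip
import io
import lzma
import pickle

# Compression names mapped to binary openers (assumed set; zip is left out).
_OPENERS = {None: open, 'gzip': gzip.open, 'bz2': bz2.open, 'xz': lzma.open}


def open_handle(path, mode, encoding=None, compression=None, is_text=True):
    """Hypothetical sketch of a handle opener, not the pandas implementation."""
    if compression not in _OPENERS:
        raise ValueError('Unrecognized compression: {!r}'.format(compression))

    # Always open the raw (possibly compressed) stream in binary mode.
    binary_mode = mode.replace('t', '').replace('b', '') + 'b'
    handle = _OPENERS[compression](path, binary_mode)

    if is_text:
        # Text consumers such as CSV readers get a decoding wrapper.
        handle = io.TextIOWrapper(handle, encoding=encoding or 'utf-8')
    return handle
```

With `is_text=False` the returned handle can be passed straight to `pickle.dump` or `pickle.load`; with the default `is_text=True` it accepts and returns `str`.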
Codecov Report
@@ Coverage Diff @@
## master #13317 +/- ##
=========================================
Coverage ? 91%
=========================================
Files ? 143
Lines ? 49312
Branches ? 0
=========================================
Hits ? 44874
Misses ? 4438
Partials ? 0
pandas/io/pickle.py
    """
    with open(path, 'wb') as f:
    inferred_compression = _get_inferred_compression(path, compression)
    if inferred_compression:
_get_handle will open the file, so the first statement should be enough
code looks ok, needs quite a few tests though. see here for what to test (and you can model them pretty much like this). |
OK, I added the test code in test_pickle.py. I will move them to compression.py tomorrow. |
@goldenbull no, this should be separate from the other |
pandas/io/tests/test_pickle.py
@@ -217,6 +217,16 @@ def python_unpickler(path):
        result = python_unpickler(path)
        self.compare_element(result, expected, typ)

    def test_compression(self):
        self.data.to_pickle('data.pkl')
you will need to separate these tests as you can't guarantee the dependencies for the compression.
But yes the general approach is fine.
also pls use the tm.ensure_clean() to generate the paths
🆗
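A rough illustration of both requests, using only the standard library. `clean_path` is a hypothetical stand-in for `tm.ensure_clean()` and `available_compressions` is an invented helper, so neither name comes from pandas. The idea is that each compression method is probed separately, so a missing optional module (for example `lzma` on some Python builds) only drops its own test case.

```python
import contextlib
import importlib
import os
import tempfile


@contextlib.contextmanager
def clean_path(suffix=''):
    """Yield a temporary file path and remove the file afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)


def available_compressions():
    """Return the compression names whose backing module can be imported."""
    modules = {'gzip': 'gzip', 'bz2': 'bz2', 'xz': 'lzma'}
    found = []
    for name, module in modules.items():
        try:
            importlib.import_module(module)
        except ImportError:
            # Dependency not installed: skip this method only.
            continue
        found.append(name)
    return found
```

A test can then loop over `available_compressions()` (or skip per case) instead of bundling every method into one test that fails as a whole.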
a455e14
to
100476d
Compare
tests added. |
pandas/io/common.py

    Parameters
    ----------
    path
explanation for the parameters pls.
can you rebase / update |
Rebased onto master branch. It's been a long time since my last commit, my memory is fading 🤕 |
@goldenbull something went wrong with your rebase. Normally doing:
should solve this |
Actually I didn't use |
There is nothing per se wrong with |
8ff7829
to
f187514
Compare
o yes, I understand now. Rebase is necessary for code review 😄 |
    """
    Pickle (serialize) object to input file path.

    Parameters
    ----------
    path : string
        File path
    compression : {'infer', 'gzip', 'bz2', 'xz', None}, default 'infer'
need a versionadded tag
pandas/io/common.py
@@ -285,8 +285,45 @@ def ZipFile(*args, **kwargs):
    ZipFile = zipfile.ZipFile


def _get_handle(path, mode, encoding=None, compression=None, memory_map=False):
def _get_inferred_compression(filepath_or_buffer, compression):
    if compression == 'infer':
can you add a Parameters/Returns in a doc-string
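As an illustration of the requested docstring layout, here is a minimal sketch of an inference helper in numpydoc style. The name `get_inferred_compression`, the extension table, and the error message are assumptions made for this example; the behavior is modeled on the documentation quoted elsewhere in this thread rather than copied from the PR.

```python
# Extension-to-method table, taken from the docs text quoted in this review.
_EXTENSION_TO_COMPRESSION = {'.gz': 'gzip', '.bz2': 'bz2',
                             '.zip': 'zip', '.xz': 'xz'}


def get_inferred_compression(filepath_or_buffer, compression):
    """
    Resolve the compression method to use (hypothetical sketch).

    Parameters
    ----------
    filepath_or_buffer : str or file-like
        Path whose extension is inspected when ``compression='infer'``.
    compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}
        Requested method. ``None`` means no compression.

    Returns
    -------
    str or None
        The resolved method, or None for no compression.

    Raises
    ------
    ValueError
        If ``compression`` is not one of the recognized values.
    """
    if compression is None:
        return None
    if compression == 'infer':
        # Only string paths carry an extension; buffers infer nothing.
        if isinstance(filepath_or_buffer, str):
            for ext, method in _EXTENSION_TO_COMPRESSION.items():
                if filepath_or_buffer.endswith(ext):
                    return method
        return None
    if compression in _EXTENSION_TO_COMPRESSION.values():
        return compression
    raise ValueError('Unrecognized compression type: {}'.format(compression))
```

An explicit method always wins over the extension, so `get_inferred_compression('data.pkl.compress', 'gzip')` resolves to gzip even though the suffix is unknown.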
pandas/io/common.py
        encoding for text file
    compression : string, default None
        { None, 'gzip', 'bz2', 'zip', 'xz' }
    is_txt : bool, default True
is_txt -> is_text
document memory_map arg
pandas/io/parsers.py
            inferred_compression = None
        else:
            inferred_compression = None
    inferred_compression = _get_inferred_compression(filepath_or_buffer, kwds.get('compression'))
pass compression=kwds.get('compression')
@@ -15,12 +16,18 @@ def to_pickle(obj, path):
    obj : any object
    path : string
        File path
    compression : {'infer', 'gzip', 'bz2', 'xz', None}, default 'infer'
add a versionadded
pandas/io/pickle.py
    """
    with open(path, 'wb') as f:
    inferred_compression = _get_inferred_compression(path, compression)
pass compression via kwarg
pandas/io/pickle.py
    """
    with open(path, 'wb') as f:
    inferred_compression = _get_inferred_compression(path, compression)
    if inferred_compression:
can just do _get_handle (it will open if necessary when no compression is specified)
7e42f8a
to
ccbeaa9
Compare
can you rebase / update |
@goldenbull just want to encourage you on this pull request. I just realized the current limitation: there is no way to do compressed pickle IO in pandas. This would be a huge enhancement for https://github.com/cognoma/machine-learning where reading TSVs into pandas is slow, causing difficulty for our contributors. Pickles load almost instantaneously, but are too big uncompressed.
@dhimmel btw, nothing stopping anyone from picking this up! it's actually almost all of the way there!
@jreback just for clarification, are there still comments that have to be addressed, or is it just rebasing? (I see commits after the last round of comments, but it has been a long time since I looked at the PR).
yeah i don't remember |
I see that all tests are now moved to
|
|
# Conflicts: # pandas/io/tests/test_pickle.py
merged the latest master branch. |
pls change to use the parametrize style
pandas/tests/io/test_pickle.py
@@ -302,3 +302,49 @@ def test_pickle_v0_15_2():
    # with open(pickle_path, 'wb') as f: pickle.dump(cat, f)
    #
    tm.assert_categorical_equal(cat, pd.read_pickle(pickle_path))


class TestPickleCompression(object):
don't use a class, follow the existing style
pandas/tests/io/test_pickle.py
            df2 = pd.read_pickle(path, compression=compression)
            tm.assert_frame_equal(df, df2)

    def test_compression_explicit(self):
use parametrize
pandas/tests/io/test_pickle.py
            df = tm.makeDataFrame()
            df.to_pickle(path, compression=compression)

    def test_compression_explicit_bad(self):
same
pandas/tests/io/test_pickle.py
            df.to_pickle(path)
            tm.assert_frame_equal(df, pd.read_pickle(path))

    def test_compression_infer(self):
same
parameterized style is easier to read. 👍 |
@goldenbull looks good! can you add a whatsnew sub-section (pickle has gained compression support or something like this). show an example of writing / reading (you will need to use IOW, something like
|
also have a look at the io.rst docs on pickle. maybe worth adding a small section on using compression (see how we did this for read_csv, and copy the same formatting) |
pandas/tests/io/test_pickle.py
        for ext in extensions:
            yield self.compression_infer, ext

    @pytest.mark.parametrize('ext', ['', '.gz', '.bz2', '.xz', '.no_compress'])
Would it make sense to test a .pkl extension, as this will be the most common (and shouldn't be compressed)?
Is there anything official that says .pkl is a pickle extension? (though by definition .pkl would NOT be compressed, while .pkl.gz would be, for example). Also let's have .gzip the same as .gz. I think this is common.
The extension for pickled files is not standardized. The common ones are .pkl, .p, and .pickle.
"Also let's have .gzip the same as .gz"
Currently, we don't infer the .gzip extension as gzip compression. I think that's out of scope for this PR. Something to consider for the future. Perhaps we could outsource inference to mimetypes.guess_type's encoding detection.
@dhimmel good point. can you create an issue?
@jreback, I'm thinking this isn't something that I would advocate for. Here is what mimetypes does:
import mimetypes
print(mimetypes.guess_type('file-name.tsv.gz'))
print(mimetypes.guess_type('file-name.tsv.gzip'))
outputs:
('text/tab-separated-values', 'gzip')
(None, None)
So if we use mimetypes, the .gzip extension won't be recognized. We could code our compression inference to recognize .gzip and .gz, but I don't think that would be the right decision, since you could then write gzip compressed files that end in .gzip using inference. I'd rather stick with the extensions recognized by mimetypes.
If you'd still like to discuss this further via a dedicated issue, just let me know and I'll post it... with the caveat I'm not advocating for it 😼.
sure, specific extensions are prob fine.
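To make the trade-off in this thread concrete, here is a small sketch of what outsourcing inference to `mimetypes` could look like. The function name and the mapping from `mimetypes` encoding names to the compression names used in this PR are assumptions for illustration; the PR itself sticks with an explicit extension list.

```python
import mimetypes

# mimetypes reports encodings as 'gzip', 'bzip2' and 'xz'; map them to the
# compression names used in this PR (this mapping is an assumption).
_ENCODING_TO_COMPRESSION = {'gzip': 'gzip', 'bzip2': 'bz2', 'xz': 'xz'}


def infer_compression_via_mimetypes(filename):
    """Return a compression name based on mimetypes' encoding guess, or None."""
    _, encoding = mimetypes.guess_type(filename)
    return _ENCODING_TO_COMPRESSION.get(encoding)
```

Under this approach `data.pkl.gz` resolves to gzip while `data.pkl.gzip` resolves to nothing, which matches the behavior described above. Note that `mimetypes` knows nothing about zip as a content encoding, so zip support would still need its own rule.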
pandas/tests/io/test_pickle.py
    with tm.ensure_clean(get_random_path() + ext) as path:
        df = tm.makeDataFrame()
        df.to_pickle(path)
        tm.assert_frame_equal(df, pd.read_pickle(path))
These tests just test that the round trip works, but not that inference works.
Here's some pseudo-code to more explicitly test inference.
with tm.ensure_clean(get_random_path() + ext) as infer_path, tm.ensure_clean(get_random_path() + ext) as explicit_path:
    df = tm.makeDataFrame()
    df.to_pickle(infer_path)
    df.to_pickle(explicit_path, compression=compression)
    # Implement: Assert files equal
    tm.assert_frame_equal(df, pd.read_pickle(explicit_path))
maybe we should add a .pkl.gz file to travis itself and test with it?
Yes, actually, all write-then-read operations should be split into three steps:
- write to a file1, compressed or uncompressed
- compress or decompress file1 into file2 using external util or a standalone piece of code
- read file2
Then compare content from file2 with that written to file1. Seems I need to re-write all the tests.
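The three steps above can be sketched with the standard library. In this sketch plain `pickle` and `gzip` calls stand in for the pandas writer and reader under test, and the function name is invented for the example; the point is only that the compression in step 2 is done by independent code, so a bug shared by the writer and the reader cannot cancel itself out.

```python
import gzip
import os
import pickle
import shutil
import tempfile


def roundtrip_via_independent_compression(obj):
    """Write uncompressed, compress with standalone code, read the result."""
    tmpdir = tempfile.mkdtemp()
    try:
        file1 = os.path.join(tmpdir, 'data.pkl')
        file2 = os.path.join(tmpdir, 'data.pkl.gz')

        # Step 1: write file1 with the writer under test (plain pickle here).
        with open(file1, 'wb') as f:
            pickle.dump(obj, f)

        # Step 2: compress file1 into file2 using code the writer never touches.
        with open(file1, 'rb') as src, gzip.open(file2, 'wb') as dst:
            shutil.copyfileobj(src, dst)

        # Step 3: read file2 with the reader under test.
        with gzip.open(file2, 'rb') as f:
            return pickle.load(f)
    finally:
        shutil.rmtree(tmpdir)
```

The caller compares the returned object with the one that was written. The mirror-image test (write compressed, decompress independently, read uncompressed) follows the same pattern.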
code / tests lgtm. comments are about docs now.
pandas/tests/io/test_pickle.py
        # read compressed file by inferred compression method
        df2 = pd.read_pickle(p2)
        tm.assert_frame_equal(df, df2)
can you remove here and below
My mistake, forgot to delete before commit.
pandas/tests/io/test_pickle.py
        df = tm.makeDataFrame()
        # write to uncompressed file
        df.to_pickle(p1, compression=None)
        # compress
can you put a blank line before comments (easier to read)
doc/source/whatsnew/v0.20.0.txt
@@ -99,6 +99,34 @@ support for bz2 compression in the python 2 c-engine improved (:issue:`14874`).

.. _whatsnew_0200.enhancements.uint64_support:
this tag should go after this section. Then add a similar type tag for this.
.. _whatsnew_0200.enhancements.pickle_compression:
or something.
doc/source/whatsnew/v0.20.0.txt
Pickle file I/O now supports compression
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``read_pickle`` and ``to_pickle`` can now read from and write to compressed
use :func:`read_pickle` and :meth:`DataFrame.to_pickle`
doc/source/whatsnew/v0.20.0.txt
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``read_pickle`` and ``to_pickle`` can now read from and write to compressed
pickle files. Compression methods can be explicit parameter or be inferred
can be an explicit parameter
doc/source/io.rst
@@ -2908,6 +2908,38 @@ any pickled pandas object (or any other pickled object) from file:
   import os
   os.remove('foo.pkl')
add this as a sub-section (of pickle section), with a ref-link.
add a versionadded tag
doc/source/io.rst
@@ -2908,6 +2908,38 @@ any pickled pandas object (or any other pickled object) from file:
   import os
   os.remove('foo.pkl')

The ``to_pickle`` and ``read_pickle`` methods can read and write compressed pickle files.
For ``read_pickle`` method, ``compression`` parameter can be one of
{``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``None``}, default ``'infer'``.
don't talk about the compression parameter, rather about the types of compression supported (e.g. gzip, bz2, zip, xz)
doc/source/io.rst
If 'infer', then use gzip, bz2, zip, or xz if filename ends in '.gz', '.bz2', '.zip', or
'.xz', respectively. If using 'zip', the ZIP file must contain only one data file to be
read in. Set to ``None`` for no decompression.
``to_pickle`` works in a similar way, except that 'zip' format is not supported. If the
give a paragraph break and talk about infer
doc/source/io.rst
       'A': np.random.randn(1000),
       'B': np.random.randn(1000),
       'C': np.random.randn(1000)})
   df.to_pickle("data.pkl.xz")
instead of showing 3 methods, I would show 1, then show using the infer kw.
doc/source/whatsnew/v0.20.0.txt
   df.to_pickle("data.pkl.compress", compression="gzip")
   df["A"].to_pickle("s1.pkl.bz2")

   df = pd.read_pickle("data.pkl.xz")
similar comment to above about what to show.
332b462
to
e9c5fd2
Compare
thanks @goldenbull ! note that I changed the tests (put in to a class, which you can do with pytest as long as it inherits from |
@goldenbull congrats on this important addition! Really excited for 0.20. |
closes pandas-dev#11666 Author: goldenbull <goldenbull@gmail.com> Author: Chen Jinniu <goldenbull@users.noreply.github.com> Closes pandas-dev#13317 from goldenbull/pickle_io_compression and squashes the following commits: e9c5fd2 [goldenbull] docs update d50e430 [goldenbull] update docs. re-write all tests to avoid round-trip read/write comparison. 86afd25 [goldenbull] change test to new pytest parameterized style 945e7bb [goldenbull] Merge remote-tracking branch 'origin/master' into pickle_io_compression ccbeaa9 [goldenbull] move pickle compression tests into a new class 9a07250 [goldenbull] Remove prepared compressed data. _get_handle will take care of compressed I/O 1cb810b [goldenbull] add zip decompression support. refactor using lambda. b8c4175 [goldenbull] add compressed pickle data file to io/tests 6df6611 [goldenbull] pickle compression code update 81d55a0 [Chen Jinniu] Merge branch 'master' into pickle_io_compression 025a0cd [goldenbull] add compression support for pickle
closes #11666
My code is not pythonic enough and may need some refactoring. Any comments are welcome.