
add compression support for 'read_pickle' and 'to_pickle' #13317

Closed
goldenbull wants to merge 11 commits into pandas-dev:master from goldenbull:pickle_io_compression

Conversation

@goldenbull (Contributor) commented May 29, 2016

closes #11666
My code is not pythonic enough and may need some refactoring. Any comments are welcome.

@jreback (Contributor) commented May 29, 2016

tests!

@jreback added the IO label (Data IO issues that don't fit into a more specific label) on May 29, 2016
return inferred_compression


def _get_handle(path, mode, encoding=None, compression=None, is_txt=True):
Review comment (Contributor):

what is is_txt for?

Reply (Contributor Author):

The _get_handle function was originally for CSV file I/O, which wraps the file handle in a TextIOWrapper. Pickle needs a binary file handle, so I added the parameter to bypass the TextIOWrapper.
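A minimal sketch of the idea (not the exact diff; compression handling abridged to gzip only, encoding handling omitted):

import gzip
from io import TextIOWrapper

def _get_handle(path, mode, encoding=None, compression=None, is_txt=True):
    # open the raw (binary) handle, honoring compression
    if compression == 'gzip':
        f = gzip.open(path, mode)
    else:
        f = open(path, mode)

    # CSV readers want str, so compressed handles get wrapped in a
    # TextIOWrapper; pickle needs raw bytes, so is_txt=False skips this
    if compression and is_txt:
        f = TextIOWrapper(f, encoding=encoding)
    return f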

Review comment (Contributor):

hmm, ok, let's call this is_text=True then

and pls add a doc-string to _get_handle (it should have been there already, but since you're now adding params)...

thanks

Reply (Contributor Author):

doc string updated

@codecov-io commented May 29, 2016

Codecov Report

❗ No coverage uploaded for pull request base (master@a1d3ff3).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master   #13317   +/-   ##
=========================================
  Coverage          ?      91%           
=========================================
  Files             ?      143           
  Lines             ?    49312           
  Branches          ?        0           
=========================================
  Hits              ?    44874           
  Misses            ?     4438           
  Partials          ?        0
Impacted Files Coverage Δ
pandas/io/pickle.py 79.54% <100%> (ø)
pandas/core/generic.py 96.25% <100%> (ø)
pandas/io/common.py 70.25% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a1d3ff3...e9c5fd2.

"""
with open(path, 'wb') as f:
inferred_compression = _get_inferred_compression(path, compression)
if inferred_compression:
Review comment (Contributor):

_get_handle will open the file, so the first statement should be enough

@jreback (Contributor) commented May 29, 2016

code looks ok, needs quite a few tests though. see here for what to test (and you can model them pretty much like this).

@goldenbull (Author):
OK, I added the test code in test_pickle.py. I will move them to compression.py tomorrow.

@jreback (Contributor) commented May 29, 2016

@goldenbull no, this should be separate from the other compression.py; it just needs to be a test class in test_pickle.py. Just model after that. It's too complicated / not necessary to combine the tests.

@@ -217,6 +217,16 @@ def python_unpickler(path):
result = python_unpickler(path)
self.compare_element(result, expected, typ)

def test_compression(self):
self.data.to_pickle('data.pkl')
@jreback reviewed May 29, 2016:

you will need to separate these tests as you can't guarantee the dependencies for the compression.

But yes the general approach is fine.
also pls use the tm.ensure_clean() to generate the paths
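For example, a sketch of one such test, separated per compression method (tm is pandas.util.testing, as used elsewhere in this file):

import pandas as pd
import pandas.util.testing as tm

def test_compression_gzip_roundtrip():
    # ensure_clean yields a temporary path and deletes the file afterwards
    with tm.ensure_clean('.gz') as path:
        df = tm.makeDataFrame()
        df.to_pickle(path, compression='gzip')
        tm.assert_frame_equal(df, pd.read_pickle(path, compression='gzip'))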

Reply (Contributor Author):

🆗

@goldenbull force-pushed the pickle_io_compression branch from a455e14 to 100476d on May 30, 2016 00:04
@goldenbull (Author):
tests added.


Parameters
----------
path
Review comment (Contributor):

explanation for the parameters pls.

@jreback (Contributor) commented Oct 6, 2016

can you rebase / update

@goldenbull (Author):
Rebased onto master branch. It's been a long time since my last commit, my memory is fading 🤕

@jorisvandenbossche (Member):
@goldenbull something went wrong with your rebase. Normally doing:

git fetch upstream
git checkout pickle_io_compression
git rebase upstream/master

should solve this

@goldenbull (Author):
Actually, I didn't use rebase because there were a lot of conflicts during rebasing. Instead I switched to origin/master, ran git merge pickle_io_compression, and then used push -f to overwrite my remote branch by force. I know push -f is a discouraged command, but I think it works for this case.
So what is the issue you found? Maybe it's a problem with push -f?

@jorisvandenbossche (Member) commented Oct 7, 2016

There is nothing per se wrong with push -f (you would have to do that as well after the rebase I showed), but the problem is that there is now a huge diff in this PR (and many included commits), which makes it impossible to review (just take a look at the "Files changed" tab on GitHub).
If the rebase is too painful and you only have a few commits, another possibility is to cherry-pick those commits onto a clean branch from current master, so you only have to resolve the merge conflicts once.
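The cherry-pick route would look roughly like this (commit hashes are placeholders):

# note the hashes of your own commits first with `git log`
git fetch upstream
git checkout -B pickle_io_compression upstream/master  # reset the branch onto a clean master
git cherry-pick <hash-1> <hash-2>                      # re-apply your few commits
git push -f origin pickle_io_compression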

@goldenbull force-pushed the pickle_io_compression branch from 8ff7829 to f187514 on October 8, 2016 00:54
@goldenbull (Author):
Oh yes, I understand now. Rebase is necessary for code review 😄

"""
Pickle (serialize) object to input file path.

Parameters
----------
path : string
File path
compression : {'infer', 'gzip', 'bz2', 'xz', None}, default 'infer'
Review comment (Contributor):

need a versionadded tag
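i.e. something like this under the parameter description (0.20.0 being the milestone this PR ended up in):

compression : {'infer', 'gzip', 'bz2', 'xz', None}, default 'infer'
    a string representing the compression to use in the output file

    .. versionadded:: 0.20.0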

@@ -285,8 +285,45 @@ def ZipFile(*args, **kwargs):
ZipFile = zipfile.ZipFile


def _get_handle(path, mode, encoding=None, compression=None, memory_map=False):
def _get_inferred_compression(filepath_or_buffer, compression):
if compression == 'infer':
Review comment (Contributor):

can you add a Parameters/Returns in a doc-string
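A sketch of the documented helper (the body is an assumption pieced together from the surrounding diff, not the merged code):

def _get_inferred_compression(filepath_or_buffer, compression):
    """
    Infer the compression method from a file extension.

    Parameters
    ----------
    filepath_or_buffer : string or file handle
        Target path; may end in .gz, .bz2, .zip, or .xz.
    compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}
        If 'infer', deduce the method from the extension; otherwise
        returned unchanged (None means no compression).

    Returns
    -------
    string or None
        The compression method to use.
    """
    if compression != 'infer':
        return compression
    if not isinstance(filepath_or_buffer, str):
        return None
    mapping = [('.gz', 'gzip'), ('.bz2', 'bz2'),
               ('.zip', 'zip'), ('.xz', 'xz')]
    for ext, method in mapping:
        if filepath_or_buffer.endswith(ext):
            return method
    return None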

encoding : string, default None
    encoding for text file
compression : string, default None
    { None, 'gzip', 'bz2', 'zip', 'xz' }
is_txt : bool, default True
Review comment (Contributor):

is_txt -> is_text

Review comment (Contributor):

document memory_map arg

inferred_compression = None
else:
inferred_compression = None
inferred_compression = _get_inferred_compression(filepath_or_buffer, kwds.get('compression'))
Review comment (Contributor):

pass compression=kwds.get('compression')

@@ -15,12 +16,18 @@ def to_pickle(obj, path):
obj : any object
path : string
File path
compression : {'infer', 'gzip', 'bz2', 'xz', None}, default 'infer'
Review comment (Contributor):

add a versionadded

"""
with open(path, 'wb') as f:
inferred_compression = _get_inferred_compression(path, compression)
Review comment (Contributor):

pass compression via kwarg

"""
with open(path, 'wb') as f:
inferred_compression = _get_inferred_compression(path, compression)
if inferred_compression:
Review comment (Contributor):

can just do _get_handle (it will open if necessary when no compression is specified)
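i.e. roughly this shape (a sketch; it assumes _get_handle returns a single open binary handle when called with is_txt=False, which may differ from the merged code):

def to_pickle(obj, path, compression='infer'):
    inferred_compression = _get_inferred_compression(path, compression)
    # _get_handle opens the file itself, so no separate open(path, 'wb') needed
    f = _get_handle(path, 'wb', compression=inferred_compression,
                    is_txt=False)
    try:
        # pkl is the pickle module (cPickle under Python 2)
        pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
    finally:
        f.close()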

@goldenbull force-pushed the pickle_io_compression branch from 7e42f8a to ccbeaa9 on January 4, 2017 11:30
@jreback (Contributor) commented Feb 27, 2017

can you rebase / update

@dhimmel (Contributor) commented Feb 28, 2017

@goldenbull just want to encourage you on this pull request. I just realized the current limitation that there is no way to do compressed pickle IO in pandas.

This would be a huge enhancement for https://github.com/cognoma/machine-learning where reading TSVs into pandas is slow, causing difficulty for our contributors. Pickles load almost instantaneously, but are too big uncompressed.

@jreback (Contributor) commented Feb 28, 2017

@dhimmel btw, nothing stopping anyone from picking this up! it's actually almost all of the way there!

@jorisvandenbossche (Member):
@jreback just for clarification, are there still comments that have to be addressed, or is it just rebasing? (I see commits after the last round of comments, but it's been a long time since I looked at the PR.)

@jreback (Contributor) commented Feb 28, 2017

yeah, I don't remember. I think it was pretty close, just needed some more testing / doc updates.

@goldenbull (Author):
I see that all tests have now been moved to the pandas\tests folder and the test scripts have changed a lot. How shall I write the tests? Is there anything I should pay attention to?
Btw, I got two errors when running nosetests on pandas\tests\io\test_pickle.py:

======================================================================
ERROR: pandas.tests.io.test_pickle.test_pickles
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
TypeError: test_pickles() missing 2 required positional arguments: 'current_pickle_data' and 'version'

======================================================================
ERROR: pandas.tests.io.test_pickle.test_round_trip_current
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
TypeError: test_round_trip_current() missing 1 required positional argument: 'current_pickle_data'

@jreback (Contributor) commented Mar 1, 2017

@goldenbull

@goldenbull (Author):
merged the latest master branch.

@jreback left a review comment:

pls change to use the parametrize style

@@ -302,3 +302,49 @@ def test_pickle_v0_15_2():
# with open(pickle_path, 'wb') as f: pickle.dump(cat, f)
#
tm.assert_categorical_equal(cat, pd.read_pickle(pickle_path))


class TestPickleCompression(object):
Review comment (Contributor):

don't use a class, follow the existing style

df2 = pd.read_pickle(path, compression=compression)
tm.assert_frame_equal(df, df2)

def test_compression_explicit(self):
Review comment (Contributor):

use parametrize
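A sketch of the suggested shape (compression values taken from the docstring above; module-level function rather than a class, per the earlier comment):

import pytest
import pandas as pd
import pandas.util.testing as tm

@pytest.mark.parametrize('compression', [None, 'gzip', 'bz2', 'xz'])
def test_compression_explicit(compression):
    with tm.ensure_clean() as path:
        df = tm.makeDataFrame()
        df.to_pickle(path, compression=compression)
        tm.assert_frame_equal(df,
                              pd.read_pickle(path, compression=compression))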

df = tm.makeDataFrame()
df.to_pickle(path, compression=compression)

def test_compression_explicit_bad(self):
Review comment (Contributor):

same

df.to_pickle(path)
tm.assert_frame_equal(df, pd.read_pickle(path))

def test_compression_infer(self):
Review comment (Contributor):

same

@goldenbull (Author):
parameterized style is easier to read. 👍

@jreback added this to the 0.20.0 milestone on Mar 6, 2017
@jreback (Contributor) commented Mar 6, 2017

@goldenbull looks good!

can you add a whatsnew sub-section (pickle has gained compression support or something like this).

show an example of writing / reading (you will need to use :suppress:, as this will create a file which you need to remove at the end).

IOW, something like

.. ipython:: python
   
   fn = 'compressed.pkl'

   df = ....
   df.to_pickle(fn, compression='gzip')
   pd.read_pickle(fn)

.. ipython:: python
   :suppress:

   import os
   os.remove(fn)

@jreback (Contributor) commented Mar 6, 2017

also have a look at the io.rst docs on pickle. maybe worth adding a small section on using compression (see how we did this for read_csv, and copy the same formatting)

for ext in extensions:
yield self.compression_infer, ext

@pytest.mark.parametrize('ext', ['', '.gz', '.bz2', '.xz', '.no_compress'])
Review comment (Contributor):

Would it make sense to test a .pkl extension, as this will be the most common (and shouldn't be compressed)?

Review comment (Contributor):

Is there anything official that says .pkl is a pickle extension? (though by definition .pkl would NOT be compressed, while .pkl.gz would be, for example). Also let's have .gzip the same as .gz. I think this is common.

Review comment (Contributor):

The extension for pickled files is not standardized. The common ones are .pkl, .p, and .pickle.

Also let's have .gzip the same as .gz

Currently, we don't infer .gzip extension as gzip compression. I think that's out of scope for this PR. Something to consider for the future. Perhaps we could outsource inference to mimetypes.guess_type's encoding detection.

Review comment (Contributor):

@dhimmel good point. can you create an issue?

Review comment (Contributor):

@jreback, I'm thinking this isn't something that I would advocate for. Here is what mimetypes does:

import mimetypes
print(mimetypes.guess_type('file-name.tsv.gz'))
print(mimetypes.guess_type('file-name.tsv.gzip'))

outputs:

('text/tab-separated-values', 'gzip')
(None, None)

So if we use mimetypes, the .gzip extension won't be recognized. We could code our compression inference to recognize .gzip and .gz, but I don't think that would be the right decision, since you could then write gzip compressed files that end in .gzip using inference. I'd rather stick with the extensions recognized by mimetypes.

If you'd still like to discuss this further via a dedicated issue, just let me know and I'll post it... with the caveat I'm not advocating for it 😼.

Review comment (Contributor):

sure, specific extensions are prob fine.

with tm.ensure_clean(get_random_path() + ext) as path:
df = tm.makeDataFrame()
df.to_pickle(path)
tm.assert_frame_equal(df, pd.read_pickle(path))
Review comment (Contributor):

These tests just test that the round trip works, but not that inference works.

Here's some pseudo-code to more explicitly test inference.

with tm.ensure_clean(get_random_path() + ext) as infer_path, \
        tm.ensure_clean(get_random_path() + ext) as explicit_path:
    df = tm.makeDataFrame()
    df.to_pickle(infer_path)
    df.to_pickle(explicit_path, compression=compression)
    # Implement: Assert files equal
    tm.assert_frame_equal(df, pd.read_pickle(explicit_path))

Review comment (Contributor):

maybe should add a .pkl.gz file to travis itself and test with it?

Reply (Contributor Author):

Yes, actually all write-then-read operations should split into three steps:

  1. write to file1, compressed or uncompressed
  2. compress or decompress file1 into file2 using an external utility or a standalone piece of code
  3. read file2

Then compare the content read from file2 with what was written to file1. Seems I need to re-write all the tests.
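A minimal sketch of step 2 for gzip (p1 and p2 as in the surrounding tests; the stdlib does the compression instead of pandas):

import gzip
import shutil

# compress the uncompressed pickle at p1 into p2 without going through pandas
with open(p1, 'rb') as src, gzip.open(p2, 'wb') as dst:
    shutil.copyfileobj(src, dst)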

@jreback left a review comment:

code / tests lgtm. comments are about docs now.

# read compressed file by inferred compression method
df2 = pd.read_pickle(p2)
tm.assert_frame_equal(df, df2)

Review comment (Contributor):

can you remove here and below

Reply (Contributor Author):

My mistake, forgot to delete before commit.

df = tm.makeDataFrame()
# write to uncompressed file
df.to_pickle(p1, compression=None)
# compress
Review comment (Contributor):

can you put a blank line before comments (easier to read)

@@ -99,6 +99,34 @@ support for bz2 compression in the python 2 c-engine improved (:issue:`14874`).

.. _whatsnew_0200.enhancements.uint64_support:
Review comment (Contributor):

this tag should go after this section. Then add a similar type tag for this.

.. _whatsnew_0200.enhancements.pickle_compression: or something.

Pickle file I/O now supports compression
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``read_pickle`` and ``to_pickle`` can now read from and write to compressed
Review comment (Contributor):

use :func:`read_pickle` and :meth:`DataFrame.to_pickle`

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``read_pickle`` and ``to_pickle`` can now read from and write to compressed
pickle files. Compression methods can be explicit parameter or be inferred
Review comment (Contributor):

can be an explicit parameter

@@ -2908,6 +2908,38 @@ any pickled pandas object (or any other pickled object) from file:
import os
os.remove('foo.pkl')

Review comment (Contributor):

add this as a sub-section (of pickle section), with a ref-link.

Review comment (Contributor):

add a versionadded tag

@@ -2908,6 +2908,38 @@ any pickled pandas object (or any other pickled object) from file:
import os
os.remove('foo.pkl')

The ``to_pickle`` and ``read_pickle`` methods can read and write compressed pickle files.
For ``read_pickle`` method, ``compression`` parameter can be one of
{``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``None``}, default ``'infer'``.
Review comment (Contributor):

don't talk about the compression parameter, rather about the types of compression supported (e.g. gzip, bz2, zip, xz)

If 'infer', then use gzip, bz2, zip, or xz if filename ends in '.gz', '.bz2', '.zip', or
'.xz', respectively. If using 'zip', the ZIP file must contain only one data file to be
read in. Set to ``None`` for no decompression.
``to_pickle`` works in a similar way, except that 'zip' format is not supported. If the
Review comment (Contributor):

give a paragraph break and talk about infer

'A': np.random.randn(1000),
'B': np.random.randn(1000),
'C': np.random.randn(1000)})
df.to_pickle("data.pkl.xz")
Review comment (Contributor):

instead of showing 3 methods, I would show 1, then show using the infer kw.

df.to_pickle("data.pkl.compress", compression="gzip")
df["A"].to_pickle("s1.pkl.bz2")

df = pd.read_pickle("data.pkl.xz")
Review comment (Contributor):

similar comment to above about what to show.

@goldenbull force-pushed the pickle_io_compression branch from 332b462 to e9c5fd2 on March 9, 2017 09:04
@jreback closed this in 0cfc950 on Mar 9, 2017
@jreback (Contributor) commented Mar 9, 2017

thanks @goldenbull !

note that I changed the tests (put them into a class, which you can do with pytest as long as it inherits from object, NOT tm.TestCase), and the docs a bit: 5667a3a

@dhimmel (Contributor) commented Mar 9, 2017

@goldenbull congrats on this important addition! Really excited for 0.20.

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#11666

Author: goldenbull <goldenbull@gmail.com>
Author: Chen Jinniu <goldenbull@users.noreply.github.com>

Closes pandas-dev#13317 from goldenbull/pickle_io_compression and squashes the following commits:

e9c5fd2 [goldenbull] docs update
d50e430 [goldenbull] update docs. re-write all tests to avoid round-trip read/write comparison.
86afd25 [goldenbull] change test to new pytest parameterized style
945e7bb [goldenbull] Merge remote-tracking branch 'origin/master' into pickle_io_compression
ccbeaa9 [goldenbull] move pickle compression tests into a new class
9a07250 [goldenbull] Remove prepared compressed data. _get_handle will take care of compressed I/O
1cb810b [goldenbull] add zip decompression support. refactor using lambda.
b8c4175 [goldenbull] add compressed pickle data file to io/tests
6df6611 [goldenbull] pickle compression code update
81d55a0 [Chen Jinniu] Merge branch 'master' into pickle_io_compression
025a0cd [goldenbull] add compression support for pickle
Labels: Compat (pandas objects compatibility with Numpy or Python functions), Enhancement, IO (Data IO issues that don't fit into a more specific label)
Successfully merging this pull request may close these issues.

ENH: add gzip/bz2 compression to read_pickle() (and perhaps other read_*() methods)
5 participants