
add compression support for 'read_pickle' and 'to_pickle' #13317

Closed
goldenbull wants to merge 11 commits into pandas-dev:master from goldenbull:pickle_io_compression

Conversation

@goldenbull (Contributor) commented May 29, 2016

closes #11666
My code is not pythonic enough and may need some refactoring. Any comments are welcome.

@jreback (Contributor) commented May 29, 2016

tests!

@jreback added the IO label (Data IO issues that don't fit into a more specific label) on May 29, 2016
return inferred_compression


def _get_handle(path, mode, encoding=None, compression=None, is_txt=True):
Review comment (Contributor):

what is is_txt for?

Reply (Contributor Author):

The _get_handle function was originally for CSV file I/O, which wraps the file handle in a TextIOWrapper. Pickle needs a binary file handle, so I added the parameter to bypass the TextIOWrapper.
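A minimal sketch of the idea (not the exact diff; compression handling abridged to gzip only, encoding handling omitted):

import gzip
from io import TextIOWrapper

def _get_handle(path, mode, encoding=None, compression=None, is_txt=True):
    # open the raw (binary) handle, honoring compression
    if compression == 'gzip':
        f = gzip.open(path, mode)
    else:
        f = open(path, mode)

    # CSV readers want str, so compressed handles get wrapped in a
    # TextIOWrapper; pickle needs raw bytes, so is_txt=False skips this
    if compression and is_txt:
        f = TextIOWrapper(f, encoding=encoding)
    return f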

Review comment (Contributor):

hmm, ok, let's call this is_text=True then

and pls add a doc-string to _get_handle (it should have been there already, but since you're now adding params)...

thanks

Reply (Contributor Author):

doc string updated

@codecov-io commented May 29, 2016

Codecov Report

❗ No coverage uploaded for pull request base (master@a1d3ff3).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master   #13317   +/-   ##
=========================================
  Coverage          ?      91%           
=========================================
  Files             ?      143           
  Lines             ?    49312           
  Branches          ?        0           
=========================================
  Hits              ?    44874           
  Misses            ?     4438           
  Partials          ?        0
Impacted Files Coverage Δ
pandas/io/pickle.py 79.54% <100%> (ø)
pandas/core/generic.py 96.25% <100%> (ø)
pandas/io/common.py 70.25% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a1d3ff3...e9c5fd2.

"""
with open(path, 'wb') as f:
inferred_compression = _get_inferred_compression(path, compression)
if inferred_compression:
Review comment (Contributor):

_get_handle will open the file, so the first statement should be enough

@jreback (Contributor) commented May 29, 2016

code looks ok, needs quite a few tests though. see here for what to test (and you can model them pretty much like this).

@goldenbull (Author):
OK, I added the test code in test_pickle.py. I will move them to compression.py tomorrow.

@jreback (Contributor) commented May 29, 2016

@goldenbull no, this should be separate from the other compression.py; it just needs to be a test class in test_pickle.py. Just model after that. It's too complicated / not necessary to combine the tests.

@@ -217,6 +217,16 @@ def python_unpickler(path):
result = python_unpickler(path)
self.compare_element(result, expected, typ)

def test_compression(self):
self.data.to_pickle('data.pkl')
@jreback reviewed May 29, 2016:

you will need to separate these tests as you can't guarantee the dependencies for the compression.

But yes the general approach is fine.
also pls use the tm.ensure_clean() to generate the paths
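For example, a sketch of one such test, separated per compression method (tm is pandas.util.testing, as used elsewhere in this file):

import pandas as pd
import pandas.util.testing as tm

def test_compression_gzip_roundtrip():
    # ensure_clean yields a temporary path and deletes the file afterwards
    with tm.ensure_clean('.gz') as path:
        df = tm.makeDataFrame()
        df.to_pickle(path, compression='gzip')
        tm.assert_frame_equal(df, pd.read_pickle(path, compression='gzip'))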

Reply (Contributor Author):

🆗

@goldenbull force-pushed the pickle_io_compression branch from a455e14 to 100476d on May 30, 2016 00:04
@goldenbull (Author):
tests added.


Parameters
----------
path
Review comment (Contributor):

explanation for the parameters pls.

@jreback (Contributor) commented Oct 6, 2016

can you rebase / update

@goldenbull (Author):
Rebased onto master branch. It's been a long time since my last commit, my memory is fading 🤕

@jorisvandenbossche (Member):
@goldenbull something went wrong with your rebase. Normally doing:

git fetch upstream
git checkout pickle_io_compression
git rebase upstream/master

should solve this

@goldenbull (Author):
Actually, I didn't use rebase because there were a lot of conflicts during rebasing. Instead I switched to origin/master, ran git merge pickle_io_compression, and then used push -f to overwrite my remote branch by force. I know push -f is a discouraged command, but I think it works for this case.
So what is the issue you found? Maybe it's a problem with push -f?

@jorisvandenbossche (Member) commented Oct 7, 2016

There is nothing per se wrong with push -f (you would have to do that as well after the rebase I showed), but the problem is that there is now a huge diff in this PR (and many included commits), which makes it impossible to review (just take a look at the "Files changed" tab on GitHub).
If the rebase is too painful and you only have a few commits, another possibility is to cherry-pick those commits onto a clean branch from current master, so you only have to resolve the merge conflicts once.
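The cherry-pick route would look roughly like this (commit hashes are placeholders):

# note the hashes of your own commits first with `git log`
git fetch upstream
git checkout -B pickle_io_compression upstream/master  # reset the branch onto a clean master
git cherry-pick <hash-1> <hash-2>                      # re-apply your few commits
git push -f origin pickle_io_compression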

@goldenbull force-pushed the pickle_io_compression branch from 8ff7829 to f187514 on October 8, 2016 00:54
@goldenbull (Author):
Oh yes, I understand now. Rebase is necessary for code review 😄

"""
Pickle (serialize) object to input file path.

Parameters
----------
path : string
File path
compression : {'infer', 'gzip', 'bz2', 'xz', None}, default 'infer'
Review comment (Contributor):

need a versionadded tag
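i.e. something like this under the parameter description (0.20.0 being the milestone this PR ended up in):

compression : {'infer', 'gzip', 'bz2', 'xz', None}, default 'infer'
    a string representing the compression to use in the output file

    .. versionadded:: 0.20.0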

@@ -285,8 +285,45 @@ def ZipFile(*args, **kwargs):
ZipFile = zipfile.ZipFile


def _get_handle(path, mode, encoding=None, compression=None, memory_map=False):
def _get_inferred_compression(filepath_or_buffer, compression):
if compression == 'infer':
Review comment (Contributor):

can you add a Parameters/Returns in a doc-string
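A sketch of the documented helper (the body is an assumption pieced together from the surrounding diff, not the merged code):

def _get_inferred_compression(filepath_or_buffer, compression):
    """
    Infer the compression method from a file extension.

    Parameters
    ----------
    filepath_or_buffer : string or file handle
        Target path; may end in .gz, .bz2, .zip, or .xz.
    compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}
        If 'infer', deduce the method from the extension; otherwise
        returned unchanged (None means no compression).

    Returns
    -------
    string or None
        The compression method to use.
    """
    if compression != 'infer':
        return compression
    if not isinstance(filepath_or_buffer, str):
        return None
    mapping = [('.gz', 'gzip'), ('.bz2', 'bz2'),
               ('.zip', 'zip'), ('.xz', 'xz')]
    for ext, method in mapping:
        if filepath_or_buffer.endswith(ext):
            return method
    return None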

encoding : string, default None
    encoding for text file
compression : string, default None
    { None, 'gzip', 'bz2', 'zip', 'xz' }
is_txt : bool, default True
Review comment (Contributor):

is_txt -> is_text

Review comment (Contributor):

document memory_map arg

inferred_compression = None
else:
inferred_compression = None
inferred_compression = _get_inferred_compression(filepath_or_buffer, kwds.get('compression'))
Review comment (Contributor):

pass compression=kwds.get('compression')

@@ -15,12 +16,18 @@ def to_pickle(obj, path):
obj : any object
path : string
File path
compression : {'infer', 'gzip', 'bz2', 'xz', None}, default 'infer'
Review comment (Contributor):

add a versionadded

"""
with open(path, 'wb') as f:
inferred_compression = _get_inferred_compression(path, compression)
Review comment (Contributor):

pass compression via kwarg

"""
with open(path, 'wb') as f:
inferred_compression = _get_inferred_compression(path, compression)
if inferred_compression:
Review comment (Contributor):

can just do _get_handle (it will open if necessary when no compression is specified)
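i.e. roughly this shape (a sketch; it assumes _get_handle returns a single open binary handle when called with is_txt=False, which may differ from the merged code):

def to_pickle(obj, path, compression='infer'):
    inferred_compression = _get_inferred_compression(path, compression)
    # _get_handle opens the file itself, so no separate open(path, 'wb') needed
    f = _get_handle(path, 'wb', compression=inferred_compression,
                    is_txt=False)
    try:
        # pkl is the pickle module (cPickle under Python 2)
        pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
    finally:
        f.close()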

@goldenbull force-pushed the pickle_io_compression branch from 7e42f8a to ccbeaa9 on January 4, 2017 11:30
@jreback (Contributor) commented Feb 27, 2017

can you rebase / update

@dhimmel (Contributor) commented Feb 28, 2017

@goldenbull just want to encourage you on this pull request. I just realized the current limitation that there is no way to do compressed pickle IO in pandas.

This would be a huge enhancement for https://github.com/cognoma/machine-learning where reading TSVs into pandas is slow, causing difficulty for our contributors. Pickles load almost instantaneously, but are too big uncompressed.

@jreback (Contributor) commented Feb 28, 2017

@dhimmel btw, nothing stopping anyone from picking this up! it's actually almost all of the way there!

@jorisvandenbossche (Member):
@jreback just for clarification, are there still comments that have to be addressed, or is it just rebasing? (I see commits after the last round of comments, but it's been a long time since I looked at the PR.)

@jreback (Contributor) commented Feb 28, 2017

yeah, I don't remember. I think it was pretty close, just needed some more testing / doc updates.

@goldenbull (Author):
I see that all tests have now been moved to the pandas\tests folder and the test scripts have changed a lot. How shall I write the tests? Is there anything I should pay attention to?
Btw, I got two errors when running nosetests on pandas\tests\io\test_pickle.py:

======================================================================
ERROR: pandas.tests.io.test_pickle.test_pickles
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
TypeError: test_pickles() missing 2 required positional arguments: 'current_pickle_data' and 'version'

======================================================================
ERROR: pandas.tests.io.test_pickle.test_round_trip_current
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
TypeError: test_round_trip_current() missing 1 required positional argument: 'current_pickle_data'

@jreback (Contributor) commented Mar 1, 2017

@goldenbull

@goldenbull (Author):
merged the latest master branch.

@jreback left a review comment:

pls change to use the parametrize style

@@ -302,3 +302,49 @@ def test_pickle_v0_15_2():
# with open(pickle_path, 'wb') as f: pickle.dump(cat, f)
#
tm.assert_categorical_equal(cat, pd.read_pickle(pickle_path))


class TestPickleCompression(object):
Review comment (Contributor):

don't use a class, follow the existing style

df2 = pd.read_pickle(path, compression=compression)
tm.assert_frame_equal(df, df2)

def test_compression_explicit(self):
Review comment (Contributor):

use parametrize
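A sketch of the suggested shape (compression values taken from the docstring above; module-level function rather than a class, per the earlier comment):

import pytest
import pandas as pd
import pandas.util.testing as tm

@pytest.mark.parametrize('compression', [None, 'gzip', 'bz2', 'xz'])
def test_compression_explicit(compression):
    with tm.ensure_clean() as path:
        df = tm.makeDataFrame()
        df.to_pickle(path, compression=compression)
        tm.assert_frame_equal(df,
                              pd.read_pickle(path, compression=compression))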

df = tm.makeDataFrame()
df.to_pickle(path, compression=compression)

def test_compression_explicit_bad(self):
Review comment (Contributor):

same

df.to_pickle(path)
tm.assert_frame_equal(df, pd.read_pickle(path))

def test_compression_infer(self):
Review comment (Contributor):

same

@goldenbull (Author):
parameterized style is easier to read. 👍

@jreback added this to the 0.20.0 milestone on Mar 6, 2017
@jreback (Contributor) commented Mar 6, 2017

@goldenbull looks good!

can you add a whatsnew sub-section (pickle has gained compression support or something like this).

show an example of writing / reading (you will need to use :suppress:, as this will create a file which you need to remove at the end).

IOW, something like

.. ipython:: python
   
   fn = 'compressed.pkl'

   df = ....
   df.to_pickle(fn, compression='gzip')
   pd.read_pickle(fn)

.. ipython:: python
   :suppress:

   import os
   os.remove(fn)

@jreback (Contributor) commented Mar 6, 2017

also have a look at the io.rst docs on pickle. maybe worth adding a small section on using compression (see how we did this for read_csv, and copy the same formatting)

for ext in extensions:
yield self.compression_infer, ext

@pytest.mark.parametrize('ext', ['', '.gz', '.bz2', '.xz', '.no_compress'])
Review comment (Contributor):

Would it make sense to test a .pkl extension, as this will be the most common (and shouldn't be compressed)?

Review comment (Contributor):

Is there anything official that says .pkl is a pickle extension? (though by definition .pkl would NOT be compressed, while .pkl.gz would be, for example). Also let's have .gzip the same as .gz. I think this is common.

Review comment (Contributor):

The extension for pickled files is not standardized. The common ones are .pkl, .p, and .pickle.

Also let's have .gzip the same as .gz

Currently, we don't infer .gzip extension as gzip compression. I think that's out of scope for this PR. Something to consider for the future. Perhaps we could outsource inference to mimetypes.guess_type's encoding detection.

Review comment (Contributor):

@dhimmel good point. can you create an issue?

Review comment (Contributor):

@jreback, I'm thinking this isn't something that I would advocate for. Here is what mimetypes does:

import mimetypes
print(mimetypes.guess_type('file-name.tsv.gz'))
print(mimetypes.guess_type('file-name.tsv.gzip'))

outputs:

('text/tab-separated-values', 'gzip')
(None, None)

So if we use mimetypes, the .gzip extension won't be recognized. We could code our compression inference to recognize .gzip and .gz, but I don't think that would be the right decision, since you could then write gzip compressed files that end in .gzip using inference. I'd rather stick with the extensions recognized by mimetypes.

If you'd still like to discuss this further via a dedicated issue, just let me know and I'll post it... with the caveat I'm not advocating for it 😼.

Review comment (Contributor):

sure, specific extensions are prob fine.

with tm.ensure_clean(get_random_path() + ext) as path:
df = tm.makeDataFrame()
df.to_pickle(path)
tm.assert_frame_equal(df, pd.read_pickle(path))
Review comment (Contributor):

These tests just test that the round trip works, but not that inference works.

Here's some pseudo-code to more explicitly test inference.

with tm.ensure_clean(get_random_path() + ext) as infer_path, \
        tm.ensure_clean(get_random_path() + ext) as explicit_path:
    df = tm.makeDataFrame()
    df.to_pickle(infer_path)
    df.to_pickle(explicit_path, compression=compression)
    # Implement: Assert files equal
    tm.assert_frame_equal(df, pd.read_pickle(explicit_path))

Review comment (Contributor):

maybe should add a .pkl.gz file to travis itself and test with it?

Reply (Contributor Author):

Yes, actually all write-then-read operations should split into three steps:

  1. write to file1, compressed or uncompressed
  2. compress or decompress file1 into file2 using an external utility or a standalone piece of code
  3. read file2

Then compare the content read from file2 with what was written to file1. Seems I need to re-write all the tests.
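A minimal sketch of step 2 for gzip (p1 and p2 as in the surrounding tests; the stdlib does the compression instead of pandas):

import gzip
import shutil

# compress the uncompressed pickle at p1 into p2 without going through pandas
with open(p1, 'rb') as src, gzip.open(p2, 'wb') as dst:
    shutil.copyfileobj(src, dst)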

@jreback left a review comment:

code / tests lgtm. comments are about docs now.

# read compressed file by inferred compression method
df2 = pd.read_pickle(p2)
tm.assert_frame_equal(df, df2)

Review comment (Contributor):

can you remove here and below

Reply (Contributor Author):

My mistake, forgot to delete before commit.

df = tm.makeDataFrame()
# write to uncompressed file
df.to_pickle(p1, compression=None)
# compress
Review comment (Contributor):

can you put a blank line before comments (easier to read)

@@ -99,6 +99,34 @@ support for bz2 compression in the python 2 c-engine improved (:issue:`14874`).

.. _whatsnew_0200.enhancements.uint64_support:
Review comment (Contributor):

this tag should go after this section. Then add a similar type tag for this.

.. _whatsnew_0200.enhancements.pickle_compression: or something.

Pickle file I/O now supports compression
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``read_pickle`` and ``to_pickle`` can now read from and write to compressed
Review comment (Contributor):

use :func:`read_pickle` and :meth:`DataFrame.to_pickle`

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``read_pickle`` and ``to_pickle`` can now read from and write to compressed
pickle files. Compression methods can be explicit parameter or be inferred
Review comment (Contributor):

can be an explicit parameter

@@ -2908,6 +2908,38 @@ any pickled pandas object (or any other pickled object) from file:
import os
os.remove('foo.pkl')

Review comment (Contributor):

add this as a sub-section (of pickle section), with a ref-link.

Review comment (Contributor):

add a versionadded tag

@@ -2908,6 +2908,38 @@ any pickled pandas object (or any other pickled object) from file:
import os
os.remove('foo.pkl')

The ``to_pickle`` and ``read_pickle`` methods can read and write compressed pickle files.
For ``read_pickle`` method, ``compression`` parameter can be one of
{``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``None``}, default ``'infer'``.
Review comment (Contributor):

don't talk about the compression parameter, rather about the types of compression supported (e.g. gzip, bz2, zip, xz)

If 'infer', then use gzip, bz2, zip, or xz if filename ends in '.gz', '.bz2', '.zip', or
'.xz', respectively. If using 'zip', the ZIP file must contain only one data file to be
read in. Set to ``None`` for no decompression.
``to_pickle`` works in a similar way, except that 'zip' format is not supported. If the
Review comment (Contributor):

give a paragraph break and talk about infer

'A': np.random.randn(1000),
'B': np.random.randn(1000),
'C': np.random.randn(1000)})
df.to_pickle("data.pkl.xz")
Review comment (Contributor):

instead of showing 3 methods, I would show 1, then show using the infer kw.

df.to_pickle("data.pkl.compress", compression="gzip")
df["A"].to_pickle("s1.pkl.bz2")

df = pd.read_pickle("data.pkl.xz")
Review comment (Contributor):

similar comment to above about what to show.

@goldenbull force-pushed the pickle_io_compression branch from 332b462 to e9c5fd2 on March 9, 2017 09:04
@jreback closed this in 0cfc950 on Mar 9, 2017
@jreback (Contributor) commented Mar 9, 2017

thanks @goldenbull !

note that I changed the tests (put them into a class, which you can do with pytest as long as it inherits from object, NOT tm.TestCase), and the docs a bit: 5667a3a

@dhimmel (Contributor) commented Mar 9, 2017

@goldenbull congrats on this important addition! Really excited for 0.20.

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#11666

Author: goldenbull <goldenbull@gmail.com>
Author: Chen Jinniu <goldenbull@users.noreply.github.com>

Closes pandas-dev#13317 from goldenbull/pickle_io_compression and squashes the following commits:

e9c5fd2 [goldenbull] docs update
d50e430 [goldenbull] update docs. re-write all tests to avoid round-trip read/write comparison.
86afd25 [goldenbull] change test to new pytest parameterized style
945e7bb [goldenbull] Merge remote-tracking branch 'origin/master' into pickle_io_compression
ccbeaa9 [goldenbull] move pickle compression tests into a new class
9a07250 [goldenbull] Remove prepared compressed data. _get_handle will take care of compressed I/O
1cb810b [goldenbull] add zip decompression support. refactor using lambda.
b8c4175 [goldenbull] add compressed pickle data file to io/tests
6df6611 [goldenbull] pickle compression code update
81d55a0 [Chen Jinniu] Merge branch 'master' into pickle_io_compression
025a0cd [goldenbull] add compression support for pickle
Labels: Compat (pandas objects compatibility with Numpy or Python functions), Enhancement, IO (Data IO issues that don't fit into a more specific label)
Successfully merging this pull request may close these issues.

ENH: add gzip/bz2 compression to read_pickle() (and perhaps other read_*() methods)
5 participants