Support for partition_cols in to_parquet #23321

anjsudh · 2018-10-24T18:17:23Z

closes #23283
tests passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

pep8speaks · 2018-10-24T18:17:25Z

Hello @anjsudh! Thanks for updating the PR.

There are no PEP8 issues in the file pandas/core/frame.py !
There are no PEP8 issues in the file pandas/io/parquet.py !
There are no PEP8 issues in the file pandas/tests/io/test_parquet.py !
There are no PEP8 issues in the file pandas/tests/util/test_testing.py !
There are no PEP8 issues in the file pandas/util/testing.py !

Comment last updated on November 05, 2018 at 19:08 Hours UTC

WillAyd

Thanks for the PR! Please make sure that you always have tests first and foremost with any submission though.

Once added, feel free to ping back and I can take another look.

codecov · 2018-10-25T06:56:50Z

Codecov Report

❗ No coverage uploaded for pull request base (master@11c0d28).
The diff coverage is 90%.

@@            Coverage Diff            @@
##             master   #23321   +/-   ##
=========================================
  Coverage          ?   92.25%           
=========================================
  Files             ?      161           
  Lines             ?    51277           
  Branches          ?        0           
=========================================
  Hits              ?    47305           
  Misses            ?     3972           
  Partials          ?        0

Flag	Coverage Δ
#multiple	`90.63% <90%> (?)`
#single	`42.32% <15%> (?)`

Impacted Files	Coverage Δ
pandas/core/frame.py	`97.03% <ø> (ø)`
pandas/io/parquet.py	`84.76% <100%> (ø)`
pandas/util/testing.py	`86.71% <77.77%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 11c0d28...8b45547.

TomAugspurger

Could you add a release note in 0.24.0 under enhancements? We'll also need to update the docstring for fname. IIUC, this will create a directory, rather than a single file, when partition_cols is included.

pandas/io/parquet.py

pandas/tests/io/test_parquet.py

TomAugspurger · 2018-10-25T20:33:28Z

The error in https://travis-ci.org/pandas-dev/pandas/jobs/446310514#L2577 suggests that we'll need to set a minimum pyarrow version to support this behavior. Can you check what version that is, and raise an ImportError with a nice error message if pyarrow is too old?

TomAugspurger

We should mention somewhere that

doc/source/whatsnew/v0.24.0.txt

pandas/io/parquet.py

pandas/tests/io/test_parquet.py

anjsudh · 2018-10-27T11:10:18Z

@WillAyd Hi, hope you can review the diff now? Have added the necessary tests

xhochy · 2018-10-27T11:53:28Z

From a pyarrow perspective this is LGTM.

jreback

also pls update the documentation in io.rst

pandas/io/parquet.py

pandas/tests/io/test_parquet.py

datapythonista

Suggested a couple of improvements to the docstring. Also, if you can run ./scripts/validate_docstrings.py pandas.DataFrame.to_parquet to see that everything else is correct in it, that would be great.

pandas/io/parquet.py

TomAugspurger · 2018-11-05T12:00:33Z

doc/source/io.rst

@@ -4574,6 +4574,8 @@ Several caveats.
 * Categorical dtypes can be serialized to parquet, but will de-serialize as ``object`` dtype.
 * Non supported types include ``Period`` and actual Python object types. These will raise a helpful error message
  on an attempt at serialization.
+* ``partition_cols`` will be used for partitioning the dataset, where the dataset will be written to multiple


The rest of the items in this list feel more like limitations of pandas / these engines. Requiring that path be a directory when partition_cols is set doesn't seem to fit here.

I think this is important / different enough to deserve a new small section below "Handling Indexes", with

- A description of what partition_cols requires (list of column names, directory for file path)
- A description of why you might want to use partition_cols
- A small example.
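To make the directory requirement concrete, here is a standard-library sketch of the hive-style `column=value` layout that a partitioned write produces (the same shape as the `a=0` / `a=1` tree quoted elsewhere in this thread). It is not pandas or pyarrow code: `partition_paths`, the row dictionaries, and the `part-0.parquet` file name are illustrative assumptions, and real writers generate their own file names.

```python
import os
from collections import defaultdict


def partition_paths(rows, partition_cols, root):
    """Group rows by their partition-column values and map each group to
    a hive-style directory such as root/a=0/b=x (illustrative only)."""
    groups = defaultdict(list)
    for row in rows:
        # Directory levels follow the order in which the columns are given.
        subdir = os.path.join(
            root, *["{}={}".format(col, row[col]) for col in partition_cols]
        )
        # The partition values live in the path, so this sketch drops
        # those columns from the data stored in each file.
        data = {k: v for k, v in row.items() if k not in partition_cols}
        groups[subdir].append(data)
    return {os.path.join(d, "part-0.parquet"): data for d, data in groups.items()}


rows = [
    {"a": 0, "b": "x", "value": 1.0},
    {"a": 0, "b": "y", "value": 2.0},
    {"a": 1, "b": "x", "value": 3.0},
]
layout = partition_paths(rows, ["a"], "test")
for path in sorted(layout):
    print(path, layout[path])
```

Because every distinct value combination gets its own subdirectory, the target has to be a directory rather than a single file, which is the point the new documentation section needs to make.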

TomAugspurger · 2018-11-05T12:01:18Z

doc/source/whatsnew/v0.24.0.txt

@@ -235,6 +235,7 @@ Other Enhancements
 - New attribute :attr:`__git_version__` will return git commit sha of current build (:issue:`21295`).
 - Compatibility with Matplotlib 3.0 (:issue:`22790`).
 - Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`)
+- :func:`~DataFrame.to_parquet` now supports writing a DataFrame as a directory of parquet files partitioned by a subset of the columns. (:issue:`23283`).


Probably best to mention "with the pyarrow engine (this was previously supported with fastparquet)."

TomAugspurger · 2018-11-05T12:02:27Z

pandas/core/frame.py

+            Column names by which to partition the dataset
+            Columns are partitioned in the order they are given
+            The behaviour applies only to pyarrow >= 0.7.0 and fastparquet
+            For other versions, this argument will be ignored.


Is it actually ignored for older pyarrows? I would have hoped it would raise when pyarrow gets the unrecognized argument.

Actually it seems like we raise. Could you update this?

pandas/io/parquet.py

Fix "Should raise error on using partition_cols and partition_on together"
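The check named in that commit title can be sketched as follows. fastparquet's own keyword for partitioning is `partition_on`, so a call that passes it alongside `partition_cols` is ambiguous. The function name `resolve_partition_cols` and the error wording are assumptions for illustration, not necessarily the merged code.

```python
def resolve_partition_cols(partition_cols=None, **kwargs):
    """Return (columns, remaining kwargs), rejecting an ambiguous call."""
    if "partition_on" in kwargs and partition_cols is not None:
        raise ValueError(
            "Cannot use both partition_on and partition_cols; "
            "use partition_cols for partitioning data"
        )
    if "partition_on" in kwargs:
        # Accept the engine-specific spelling, but normalise it.
        partition_cols = kwargs.pop("partition_on")
    return partition_cols, kwargs


# Either spelling on its own is fine; both together raise ValueError.
print(resolve_partition_cols(["a"]))
print(resolve_partition_cols(partition_on=["b"], compression="snappy"))
```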

TomAugspurger · 2018-11-05T22:33:59Z

Merge conflict in io/parquet.py, if you could fix that up I think this will be good to go.

doc/source/io.rst

jreback · 2018-11-06T03:59:13Z

doc/source/whatsnew/v0.24.0.txt

@@ -235,6 +235,7 @@ Other Enhancements
 - New attribute :attr:`__git_version__` will return git commit sha of current build (:issue:`21295`).
 - Compatibility with Matplotlib 3.0 (:issue:`22790`).
 - Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`)
+- With the pyarrow engine, :func:`~DataFrame.to_parquet` now supports writing a DataFrame as a directory of parquet files partitioned by a subset of the columns. (:issue:`23283`).


double backticks around DataFrame. say engine='pyarrow' (in double backticks)

pandas/core/frame.py

pandas/io/parquet.py

doc/source/io.rst

anjsudh · 2018-11-08T07:44:52Z

@jreback @WillAyd hope you can have a look?

jreback

off topic comments, lgtm. @WillAyd

jreback · 2018-11-08T13:12:29Z

pandas/core/frame.py

@@ -1984,7 +1984,10 @@ def to_parquet(self, fname, engine='auto', compression='snappy',
        Parameters
        ----------
        fname : str


side issue. we use path elsewhere for IO routines. We should change this as well (out of scope here). would have to deprecate (the name) unfortunately.

we actually use path on the top-level .to_parquet, not sure how this is named this way.
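Renaming `fname` to `path` while keeping existing callers working would need a deprecation shim of roughly this shape. This is a standard-library sketch: the decorator name `rename_kwarg` and the stub `to_parquet` below are hypothetical and are not pandas' own helper or method.

```python
import functools
import warnings


def rename_kwarg(old, new):
    """Accept keyword `old` as a deprecated alias for `new`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if old in kwargs:
                if new in kwargs:
                    raise TypeError(
                        "pass only one of {!r} and {!r}".format(old, new)
                    )
                warnings.warn(
                    "the {!r} keyword is deprecated, use {!r} instead".format(
                        old, new
                    ),
                    FutureWarning,
                    stacklevel=2,
                )
                kwargs[new] = kwargs.pop(old)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@rename_kwarg("fname", "path")
def to_parquet(path, engine="auto"):
    # Stub standing in for the real writer; it only echoes its arguments.
    return path, engine
```

Calling `to_parquet(fname="out.parquet")` still works under this sketch but emits a `FutureWarning`, which is the cost jreback refers to when he says the name would have to be deprecated.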

doc/source/io.rst

doc/source/whatsnew/v0.24.0.txt

pandas/core/frame.py

pandas/io/parquet.py

TomAugspurger · 2018-11-08T16:34:15Z

Nope. I don't really care about bugs in shutil :)

On Thu, Nov 8, 2018 at 10:31 AM William Ayd ***@***.***> wrote:

> In doc/source/io.rst:
>
> > +
> > +    test
> > +    ├── a=0
> > +    │   ├── 0bac803e32dc42ae83fddfd029cbdebc.parquet
> > +    │   └── ...
> > +    └── a=1
> > +        ├── e6ab24a4f45147b49b54a662f0c412a3.parquet
> > +        └── ...
> > +
> > +.. ipython:: python
> > +   :suppress:
> > +
> > +   from shutil import rmtree
> > +   try:
> > +       rmtree('test')
> > +   except Exception:
>
> Hmm OK. No concern around it catching exceptions that it shouldn't though?
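On the concern about `except Exception` catching too much: one narrower option, sketched here under the assumption that only a missing directory should be tolerated, is to catch `FileNotFoundError` alone. The helper name `remove_if_present` is hypothetical, and `FileNotFoundError` exists only on Python 3, so this would not have been a drop-in change for a codebase that still supported Python 2.

```python
import os
import shutil
import tempfile


def remove_if_present(path):
    """Delete a directory tree; ignore only the 'already gone' case."""
    try:
        shutil.rmtree(path)
    except FileNotFoundError:
        # Nothing to clean up. Other errors (e.g. permissions) still propagate.
        return False
    return True


# Demonstration on a throwaway directory shaped like the docs example.
target = os.path.join(tempfile.mkdtemp(), "test")
os.makedirs(os.path.join(target, "a=0"))
print(remove_if_present(target))  # the directory existed and was removed
print(remove_if_present(target))  # nothing left to remove on the second call
```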

TomAugspurger · 2018-11-10T12:11:39Z

Looks good. Nice work @anjsudh!

anjsudh force-pushed the master branch from c9838ab to 7b17aa9 (October 24, 2018 18:28)

WillAyd requested changes Oct 25, 2018

WillAyd added the IO Parquet parquet, feather label Oct 25, 2018

TomAugspurger reviewed Oct 25, 2018

pandas/io/parquet.py (outdated)

pandas/tests/io/test_parquet.py (outdated)

pandas/tests/io/test_parquet.py (outdated)

anjsudh force-pushed the master branch 10 times, most recently from d2ec124 to c18c99c (October 26, 2018 18:21)

TomAugspurger reviewed Oct 26, 2018

anjsudh force-pushed the master branch 2 times, most recently from d4d6969 to 02fd984 (October 27, 2018 05:10)

anjsudh mentioned this pull request Oct 27, 2018

tm.ensure_clean() does not create Temporary Directory. It only creates temporary files. #23373

Closed

anjsudh force-pushed the master branch 2 times, most recently from 2cae2fe to 41c2828 (October 27, 2018 06:02)

closes pandas-dev#23283

41c2828

anjsudh force-pushed the master branch from a577102 to 0d9f878 (October 27, 2018 08:41)

Fix linting issue

0d9f878

jreback requested changes Oct 28, 2018

pandas/io/parquet.py

pandas/tests/io/test_parquet.py (outdated)

pandas/tests/io/test_parquet.py (outdated)

pandas/tests/io/test_parquet.py (outdated)

datapythonista reviewed Oct 28, 2018

pandas/io/parquet.py (outdated)

pandas/io/parquet.py (outdated)

pandas/io/parquet.py (outdated)

anjsudh added 3 commits November 5, 2018 02:36

Merge remote-tracking branch 'upstream/master'

6cb196d

Merge remote-tracking branch 'upstream/master'

6e06646

fix failing codecheck

a5164b8

TomAugspurger reviewed Nov 5, 2018

anjsudh force-pushed the master branch 2 times, most recently from d5ee5ec to 1f0978f (November 5, 2018 19:08)

Updated documentation

1f0978f

Fix "Should raise error on using partition_cols and partition_on together"

anjsudh and others added 2 commits November 6, 2018 09:14

Merge branch 'master' into master

ddfa789

Removed < 0.7.0 documentation for pyarrow in partition support code

ee7707f

jreback requested changes Nov 6, 2018

documentation changes for version change

79f1615

jreback requested changes Nov 6, 2018

doc/source/io.rst (outdated)

Cleanup file in while generating doc

514c5c0

TomAugspurger approved these changes Nov 6, 2018

jreback approved these changes Nov 8, 2018

WillAyd requested changes Nov 8, 2018

doc/source/io.rst

doc/source/whatsnew/v0.24.0.txt (outdated)

pandas/core/frame.py

pandas/core/frame.py

pandas/io/parquet.py (outdated)

anjsudh mentioned this pull request Nov 8, 2018

Change Name of fname Parameter to path in Parquet IO Methods #23574

Closed

anjsudh added 2 commits November 9, 2018 00:04

Text changes, Style changes

eb86de0

added empty line after versionadded

8b45547

TomAugspurger merged commit 8ed92ef into pandas-dev:master Nov 10, 2018

thoo added a commit to thoo/pandas that referenced this pull request Nov 10, 2018

Merge remote-tracking branch 'upstream/master' into order_of_parameter

84e5ba7

* upstream/master: ENH: Support for partition_cols in to_parquet (pandas-dev#23321)

JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018

ENH: Support for partition_cols in to_parquet (pandas-dev#23321)

eefb76e

* closes pandas-dev#23283

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018

ENH: Support for partition_cols in to_parquet (pandas-dev#23321)

a634a9a

* closes pandas-dev#23283

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

ENH: Support for partition_cols in to_parquet (pandas-dev#23321)

55c259d

* closes pandas-dev#23283

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

ENH: Support for partition_cols in to_parquet (pandas-dev#23321)

a8f3abe

* closes pandas-dev#23283
