-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for partition_cols in to_parquet #23321
Conversation
anjsudh
commented
Oct 24, 2018
•
edited
Loading
edited
- closes #23283
- tests passed
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry
Hello @anjsudh! Thanks for updating the PR.
Comment last updated on November 05, 2018 at 19:08 Hours UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Please make sure that you always have tests first and foremost with any submission though.
Once added feel free to ping back and can take another look
Codecov Report
@@ Coverage Diff @@
## master #23321 +/- ##
=========================================
Coverage ? 92.25%
=========================================
Files ? 161
Lines ? 51277
Branches ? 0
=========================================
Hits ? 47305
Misses ? 3972
Partials ? 0
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a release note in 0.24.0 under enhancements? We'll also need to update the docstring for fname
. IIUC, this will create a directory, rather than a single file, when partition_cols
is included.
The error in https://travis-ci.org/pandas-dev/pandas/jobs/446310514#L2577 suggests that we'll need to put a minimum pyarrow version to support this behavior. Can you check what version that is, and raise a with an ImportError with a nice error message if the pyarrow is too old? |
d2ec124
to
c18c99c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should mention somewhere that
d4d6969
to
02fd984
Compare
2cae2fe
to
41c2828
Compare
@WillAyd Hi, hope you can review the diff now? Have added the necessary tests |
From a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also pls update the documentatio in io.rst
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested couple of improvements to the docstring. Also, if you can run ./scripts/validate_docstrings.py pandas.DataFrame.to_parquet
to see that everything else is correct in it, that would be great.
doc/source/io.rst
Outdated
@@ -4574,6 +4574,8 @@ Several caveats. | |||
* Categorical dtypes can be serialized to parquet, but will de-serialize as ``object`` dtype. | |||
* Non supported types include ``Period`` and actual Python object types. These will raise a helpful error message | |||
on an attempt at serialization. | |||
* ``partition_cols`` will be used for partitioning the dataset, where the dataset will be written to multiple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest of the items in this lists feel more like limitations of pandas / these engines. Requiring that path
be a directory when partition_cols
is set doesn't seem to fit here.
I think this is important / different enough to deserve a new small section below "Handling Indexes", with
- A description of what
partition_cols
requires (list of column names, directory for file path) - A description of why you might want to use partition_cols
- A small example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -235,6 +235,7 @@ Other Enhancements | |||
- New attribute :attr:`__git_version__` will return git commit sha of current build (:issue:`21295`). | |||
- Compatibility with Matplotlib 3.0 (:issue:`22790`). | |||
- Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`) | |||
- :func:`~DataFrame.to_parquet` now supports writing a DataFrame as a directory of parquet files partitioned by a subset of the columns. (:issue:`23283`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably best to mention "with the pyarrow engine (this was previously supported with fastparquet)."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/core/frame.py
Outdated
Column names by which to partition the dataset | ||
Columns are partitioned in the order they are given | ||
The behaviour applies only to pyarrow >= 0.7.0 and fastparquet | ||
For other versions, this argument will be ignored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it actually ignored for older pyarrows? I would have hoped it would raise when pyarrow gets the unrecognized argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it seems like we raise. Could you update this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
d5ee5ec
to
1f0978f
Compare
Fix"Should raise error on using partition_cols and partition_on together"
Merge conflict in |
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -235,6 +235,7 @@ Other Enhancements | |||
- New attribute :attr:`__git_version__` will return git commit sha of current build (:issue:`21295`). | |||
- Compatibility with Matplotlib 3.0 (:issue:`22790`). | |||
- Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`) | |||
- With the pyarrow engine, :func:`~DataFrame.to_parquet` now supports writing a DataFrame as a directory of parquet files partitioned by a subset of the columns. (:issue:`23283`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double backticks around DataFrame. say engine='pyarrow'
(in double backticks)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
off topic comments, lgtm. @WillAyd
@@ -1984,7 +1984,10 @@ def to_parquet(self, fname, engine='auto', compression='snappy', | |||
Parameters | |||
---------- | |||
fname : str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
side issue. we use path
elsewhere for IO routines. We should change this as well (out of scope here). would have to deprecate (the name) unfortunately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we actually use path on the top-level .to_parquet
, not sure how this is named this way.
Nope. I don't really care about bugs in shutil :)
…On Thu, Nov 8, 2018 at 10:31 AM William Ayd ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In doc/source/io.rst
<#23321 (comment)>:
> +
+ test
+ ├── a=0
+ │ ├── 0bac803e32dc42ae83fddfd029cbdebc.parquet
+ │ └── ...
+ └── a=1
+ ├── e6ab24a4f45147b49b54a662f0c412a3.parquet
+ └── ...
+
+.. ipython:: python
+ :suppress:
+
+ from shutil import rmtree
+ try:
+ rmtree('test')
+ except Exception:
Hmm OK. No concern around it catching exceptions that it shouldn't though?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23321 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIjUauolVWoYkrRq48bKY5stxCe4eks5utFxdgaJpZM4X4g_n>
.
|
Looks good. Nice work @anjsudh! |
* upstream/master: ENH: Support for partition_cols in to_parquet (pandas-dev#23321)
…fixed * upstream/master: DOC: Fixes to docstring to add validation to CI (pandas-dev#23560) DOC: Remove incorrect periods at the end of parameter types (pandas-dev#23600) MAINT: tm.assert_raises_regex --> pytest.raises (pandas-dev#23592) DOC: Updating Series.resample and DataFrame.resample docstrings (pandas-dev#23197) ENH: Support for partition_cols in to_parquet (pandas-dev#23321) TST: Use intp as expected dtype in IntervalIndex indexing tests (pandas-dev#23609)