Support for partition_cols in to_parquet #23321
@@ -213,6 +213,7 @@ Other Enhancements
 - New attribute :attr:`__git_version__` will return git commit sha of current build (:issue:`21295`).
 - Compatibility with Matplotlib 3.0 (:issue:`22790`).
 - Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`)
+- :func:`~DataFrame.to_parquet` now supports writing a DataFrame as a directory of parquet files partitioned by a subset of the columns. (:issue:`23283`).
 - :meth:`Timestamp.tz_localize`, :meth:`DatetimeIndex.tz_localize`, and :meth:`Series.tz_localize` have gained the ``nonexistent`` argument for alternative handling of nonexistent times. See :ref:`timeseries.timezone_nonexsistent` (:issue:`8917`)

 .. _whatsnew_0240.api_breaking:

Review comment: Probably best to mention "with the pyarrow engine (this was previously supported with fastparquet)."
Reply: done
@@ -1970,7 +1970,7 @@ def to_feather(self, fname):
         to_feather(self, fname)

     def to_parquet(self, fname, engine='auto', compression='snappy',
-                   index=None, **kwargs):
+                   index=None, partition_cols=None, **kwargs):
         """
         Write a DataFrame to the binary parquet format.
@@ -1984,7 +1984,8 @@ def to_parquet(self, fname, engine='auto', compression='snappy',
         Parameters
         ----------
         fname : str
-            String file path.
+            File path or Root Directory path. Will be used as Root Directory
+            path while writing a partitioned dataset.
         engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
             Parquet library to use. If 'auto', then the option
             ``io.parquet.engine`` is used. The default ``io.parquet.engine``

Review comment: side issue. we use […]
Reply: we actually use path on the top-level
@@ -1998,6 +1999,12 @@ def to_parquet(self, fname, engine='auto', compression='snappy',
             the behavior depends on the chosen engine.

             .. versionadded:: 0.24.0

+        partition_cols : list, optional, default None
+            Column names by which to partition the dataset
+            Columns are partitioned in the order they are given
+            The behaviour applies only to pyarrow >= 0.7.0 and fastparquet
+            For other versions, this argument will be ignored.
+
+            .. versionadded:: 0.24.0

         **kwargs
             Additional arguments passed to the parquet library. See

Review comment: Is it actually ignored for older pyarrows? I would have hoped it would raise when pyarrow gets the unrecognized argument.
Review comment: Actually it seems like we raise. Could you update this?
Reply: done
@@ -2027,7 +2034,8 @@ def to_parquet(self, fname, engine='auto', compression='snappy',
         """
         from pandas.io.parquet import to_parquet
         to_parquet(self, fname, engine,
-                   compression=compression, index=index, **kwargs)
+                   compression=compression, index=index,
+                   partition_cols=partition_cols, **kwargs)

     @Substitution(header='Write out the column names. If a list of strings '
                   'is given, it is assumed to be aliases for the '
Review comment: The rest of the items in this list feel more like limitations of pandas / these engines. Requiring that ``path`` be a directory when ``partition_cols`` is set doesn't seem to fit here. I think this is important / different enough to deserve a new small section below "Handling Indexes", covering what ``partition_cols`` requires (a list of column names, and a directory for the file path).
Reply: done