
Commit d5ee5ec

Updated documentation
Fix"Should raise error on using partition_cols and partition_on together"
anjsudh committed Nov 5, 2018
1 parent a5164b8 commit d5ee5ec
Showing 5 changed files with 49 additions and 8 deletions.
29 changes: 27 additions & 2 deletions doc/source/io.rst
@@ -4574,8 +4574,6 @@ Several caveats.
 * Categorical dtypes can be serialized to parquet, but will de-serialize as ``object`` dtype.
 * Unsupported types include ``Period`` and actual Python object types. These will raise a helpful error message
   on an attempt at serialization.
-* ``partition_cols`` will be used for partitioning the dataset, where the dataset will be written to multiple
-  files in the path specified. Therefore, the path specified must be a directory path.
 
 You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, or ``fastparquet``, or ``auto``.
 If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``,
@@ -4670,6 +4668,33 @@ Passing ``index=True`` will *always* write the index, even if that's not the
 underlying engine's default behavior.
 
+
+Partitioning Parquet files
+''''''''''''''''''''''''''
+
+Parquet supports partitioning of data based on the values of one or more columns.
+
+.. ipython:: python
+
+    df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1]})
+    df.to_parquet(fname='test', engine='pyarrow',
+                  partition_cols=['a'], compression=None)
+
+``fname`` specifies the parent directory to which the data will be saved, and
+``partition_cols`` gives the column names by which the dataset will be
+partitioned. Columns are partitioned in the order they are given, and the
+partition splits are determined by the unique values in the partition columns.
+The above example creates a partitioned dataset that may look like:
+
+::
+
+    test/
+        a=0/
+            0bac803e32dc42ae83fddfd029cbdebc.parquet
+            ...
+        a=1/
+            e6ab24a4f45147b49b54a662f0c412a3.parquet
+            ...

 .. _io.sql:
 
 SQL Queries
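
Round-tripping the ``partition_cols`` example from the io.rst addition above,
as a sketch (assuming pyarrow is installed; the ``'test'`` path matches the
example)::

    import pandas as pd

    df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1]})
    df.to_parquet(fname='test', engine='pyarrow',
                  partition_cols=['a'], compression=None)

    # Reading the directory reassembles the partitions; with pyarrow the
    # partition column 'a' typically comes back as a categorical dtype.
    roundtrip = pd.read_parquet('test', engine='pyarrow')
    print(roundtrip.sort_values('b'))
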
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.24.0.txt
@@ -235,7 +235,7 @@ Other Enhancements
 - New attribute :attr:`__git_version__` will return git commit sha of current build (:issue:`21295`).
 - Compatibility with Matplotlib 3.0 (:issue:`22790`).
 - Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`)
-- :func:`~DataFrame.to_parquet` now supports writing a DataFrame as a directory of parquet files partitioned by a subset of the columns. (:issue:`23283`).
+- With the pyarrow engine, :func:`~DataFrame.to_parquet` now supports writing a DataFrame as a directory of parquet files partitioned by a subset of the columns. (:issue:`23283`).
 - :meth:`Timestamp.tz_localize`, :meth:`DatetimeIndex.tz_localize`, and :meth:`Series.tz_localize` have gained the ``nonexistent`` argument for alternative handling of nonexistent times. See :ref:`timeseries.timezone_nonexsistent` (:issue:`8917`)
 
 .. _whatsnew_0240.api_breaking:
4 changes: 2 additions & 2 deletions pandas/core/frame.py
@@ -2002,8 +2002,8 @@ def to_parquet(self, fname, engine='auto', compression='snappy',
         partition_cols : list, optional, default None
             Column names by which to partition the dataset
             Columns are partitioned in the order they are given
-            The behaviour applies only to pyarrow >= 0.7.0 and fastparquet
-            For other versions, this argument will be ignored.
+            The behaviour applies only to pyarrow >= 0.7.0 and fastparquet.
+            Raises a ValueError for other versions.
 
             .. versionadded:: 0.24.0
 
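
A minimal sketch of the documented parameter (the frame, path, and column
names are illustrative; fastparquet or pyarrow >= 0.7.0 must be installed)::

    import pandas as pd

    df = pd.DataFrame({'region': ['east', 'west', 'east'],
                       'value': [1, 2, 3]})

    # partition_cols writes one subdirectory per unique value in the column,
    # e.g. out/region=east/... and out/region=west/...
    df.to_parquet('out', engine='fastparquet', compression=None,
                  partition_cols=['region'])
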
11 changes: 8 additions & 3 deletions pandas/io/parquet.py
@@ -227,7 +227,12 @@ def write(self, df, path, compression='snappy', index=None,
         # Use tobytes() instead.
 
         if 'partition_on' in kwargs:
-            partition_cols = kwargs.pop('partition_on')
+            if partition_cols is None:
+                partition_cols = kwargs.pop('partition_on')
+            else:
+                raise ValueError("Cannot use both partition_on and "
+                                 "partition_cols. Use partition_cols for "
+                                 "partitioning data")
 
         if partition_cols is not None:
             kwargs['file_scheme'] = 'hive'
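
Seen from the caller's side, the new branch turns a silent preference into an
explicit error; a sketch (the frame and path are illustrative)::

    import pandas as pd

    df = pd.DataFrame({'a': [0, 1], 'b': [0, 1]})

    # Passing both keywords now raises instead of silently preferring one.
    try:
        df.to_parquet('out', engine='fastparquet', compression=None,
                      partition_on=['a'], partition_cols=['a'])
    except ValueError as err:
        print(err)  # Cannot use both partition_on and partition_cols. ...
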
@@ -290,8 +295,8 @@ def to_parquet(df, path, engine='auto', compression='snappy', index=None,
     partition_cols : list, optional
         Column names by which to partition the dataset
         Columns are partitioned in the order they are given
-        The behaviour applies only to pyarrow >= 0.7.0 and fastparquet
-        For other versions, this argument will be ignored.
+        The behaviour applies only to pyarrow >= 0.7.0 and fastparquet.
+        Raises a ValueError for other versions.
         .. versionadded:: 0.24.0
     kwargs
         Additional keyword arguments passed to the engine
11 changes: 11 additions & 0 deletions pandas/tests/io/test_parquet.py
@@ -589,3 +589,14 @@ def test_partition_on_supported(self, fp, df_full):
         import fastparquet
         actual_partition_cols = fastparquet.ParquetFile(path, False).cats
         assert len(actual_partition_cols) == 2
+
+    def test_error_on_using_partition_cols_and_partition_on(self, fp, df_full):
+        # GH #23283
+        partition_cols = ['bool', 'int']
+        df = df_full
+        with pytest.raises(ValueError):
+            with tm.ensure_clean_dir() as path:
+                df.to_parquet(path, engine="fastparquet", compression=None,
+                              partition_on=partition_cols,
+                              partition_cols=partition_cols)

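A slightly tighter variant (hypothetical, not part of this commit) would also
pin the error text with ``pytest.raises(match=...)``::

    def test_error_on_partition_clash_message(self, fp, df_full):
        # Hypothetical tightening: also assert on the message text.
        with pytest.raises(ValueError, match="Cannot use both partition_on"):
            with tm.ensure_clean_dir() as path:
                df_full.to_parquet(path, engine="fastparquet",
                                   compression=None,
                                   partition_on=['bool'],
                                   partition_cols=['bool'])
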