Support Encoding Parquet Columns in Parallel #4871
Conversation
@@ -558,10 +564,25 @@ pub(crate) struct LevelInfo {

    /// The maximum repetition for this leaf column
    max_rep_level: i16,

    /// The arrow array
    array: ArrayRef,
This is the major meat of the PR: by bundling the leaf array with the corresponding metadata, we get an API that is very amenable to parallelization.
pub struct ArrowLeafColumn(ArrayLevels);

/// Computes the [`ArrowLeafColumn`] for a given potentially nested [`ArrayRef`]
pub fn compute_leaves(field: &Field, array: &ArrayRef) -> Result<Vec<ArrowLeafColumn>> {
This is the API that allows data to be encoded in parallel. This method takes a single array so that:
- We could theoretically also parallelise the level computation
- We preserve the ability to write arrays with stricter nullability than in the file schema (Only require compatible batch schema in ArrowWriter #4027)
}
}
}

/// Encodes [`RecordBatch`] to a parquet row group
-pub struct ArrowRowGroupWriter {
-    writers: Vec<(SharedColumnChunk, ArrowColumnWriter)>,
+struct ArrowRowGroupWriter {
This no longer needs to be public. As this hasn't been released yet, I opted to partially revert the change in #4850.
Working on adding an example
@@ -115,8 +115,7 @@ pub type OnCloseRowGroup<'a> = Box<
        Vec<Option<ColumnIndex>>,
        Vec<Option<OffsetIndex>>,
    ) -> Result<()>
-        + 'a
-        + Send,
+        + 'a,
This unreleased change from #4850 is no longer necessary
This looks great! I will take a pass at updating apache/datafusion#7655 to use this to test it out and report back.
This worked like a charm! I updated apache/datafusion#7655 to use this branch. Your example comment was very helpful as well. I believe the datafusion PR should be handling nested columns correctly now. Performance metrics are within a margin of error vs. the previous API.
This is really neat @tustvold and @devinjdangelo
I had some suggestions that might make the example and the APIs a little easier to work with for mere mortals like myself 😅 -- but nothing that couldn't be done as a follow on either.
I took a shot at using this API to make an example program that writes parquet in parallel (both row groups and columns), both as a way to document how to use it and make it hopefully more discoverable, as well as figuring it out myself.
I hope to make a follow on PR with this example and some additional documentation suggestions
///
/// // Spawn work to encode columns
/// let mut worker_iter = workers.iter_mut();
/// for (a, f) in to_write.iter().zip(&schema.fields) {
I know single letter variable names are concise, but I would find this much easier to follow if it had names like
- /// for (a, f) in to_write.iter().zip(&schema.fields) {
+ /// for (arr, field) in to_write.iter().zip(&schema.fields) {
/// // Spawn work to encode columns
/// let mut worker_iter = workers.iter_mut();
/// for (a, f) in to_write.iter().zip(&schema.fields) {
///     for c in compute_leaves(f, a).unwrap() {
- /// for c in compute_leaves(f, a).unwrap() {
+ /// for leaves in compute_leaves(f, a).unwrap() {
/// for (handle, send) in workers {
///     drop(send); // Drop send side to signal termination
///     let (chunk, result) = handle.join().unwrap().unwrap();
///     row_group.append_column(&chunk, result).unwrap();
This will result in copying the chunk data, but I suppose that is inevitable, as this code is effectively encoding column chunks into memory buffers somewhere that then need to be copied directly into the destination parquet file.
It will just write the bytes to the `Write` implementation. In this case that is a `Vec`, but if it were a `File` it "technically" wouldn't be a copy... FWIW this is the same as master
Yes, sorry - I should have been clear that I don't see anything wrong. I was just observing / validating my understanding.
}

/// Close this column returning the [`ArrowColumnChunk`] and [`ColumnCloseResult`]
pub fn close(self) -> Result<(ArrowColumnChunk, ColumnCloseResult)> {
While I was playing with this API, I found this combination of `(ArrowColumnChunk, ColumnCloseResult)` awkward to use because you always need both of them, but the API requires passing them around as a bare pair.
What would you think about wrapping them in a named struct, something like this, perhaps?
`EncodedRowGroup((ArrowColumnChunk, ColumnCloseResult))`
This wouldn't be usable with append_column unless you first exploded it into parts, at which point...
Perhaps there could be a light wrapper around append_column that accepts EncodedRowGroup as @alamb defined it?
It isn't a big deal, but I agree with @alamb that it is confusing at first, since the consumer of this API does not really need to concern themselves with either of these two structs individually. They are more internal implementation details of arrow-rs. `EncodedRowGroup` would be clearer.
I tried to address this in a4c17a9, PTAL.
Can we also implement
Which issue does this PR close?
Related to: #1718
Enables: apache/datafusion#7655
Closes #4871
Rationale for this change
Inspired by #4859, but exposing a slightly different API.
I have confirmed this does not appear to have any impact on benchmarks, likely because it doesn't alter any of the "hot" loops.
What changes are included in this PR?
Are there any user-facing changes?