
Allow Setting Minimum Parallelism with RowCount Based Demuxer #7841

Merged · 6 commits · Oct 21, 2023

Conversation

@devinjdangelo (Contributor):

Which issue does this PR close?

Addresses the performance regression of #7791.

Rationale for this change

#7791 introduced a row-count-targeting, execution-time partitioning strategy for DataSinks. The initial implementation writes only a single file at a time, which guarantees that at most 1 file is ever written with fewer than soft_max_rows_per_output_file rows and all others have at least soft_max_rows_per_output_file. This PR introduces a new setting, minimum_parallel_output_files, which writes that many files in parallel, each targeting soft_max_rows_per_output_file rows. This allows the user to configure the balance between parallelism and achieving the desired file size.

The behavior of this PR is identical to #7791 if minimum_parallel_output_files is set to 1.

What changes are included in this PR?

  • Adds minimum_parallel_output_files config setting
  • Creates new file writers on-demand as batches arrive, so if there is only 1 batch, only 1 file will be written regardless of the minimum_parallel_output_files setting (see the sketch after this list)
  • Updates tests to account for this new setting
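
A minimal, self-contained sketch of this demux behavior. The Writer struct, the demux signature, and modeling each batch as just a row count are illustrative stand-ins, not the actual code in datafusion/core/src/datasource/file_format/write.rs:

```rust
// Simplified model: each "batch" is just its row count, and a Writer only
// tracks which file it is and how many rows it has received. Assumes
// minimum_parallel_output_files >= 1.
struct Writer {
    file_index: usize,
    rows_written: usize,
}

fn demux(
    batch_row_counts: &[usize],
    minimum_parallel_output_files: usize,
    soft_max_rows_per_output_file: usize,
) -> Vec<Writer> {
    let mut open: Vec<Writer> = Vec::new();
    let mut finished: Vec<Writer> = Vec::new();
    let mut next_file_index = 0;
    let mut rr_cursor = 0; // round-robin position over the open writers

    for &rows in batch_row_counts {
        // Writers are opened lazily as batches arrive, so a stream with a
        // single batch produces a single file no matter how large
        // minimum_parallel_output_files is.
        if open.len() < minimum_parallel_output_files {
            open.push(Writer { file_index: next_file_index, rows_written: 0 });
            next_file_index += 1;
        }
        let idx = rr_cursor % open.len();
        rr_cursor += 1;
        open[idx].rows_written += rows;

        // Once a writer passes the soft limit, close it and open a fresh file
        // in its slot so the configured parallelism is maintained.
        if open[idx].rows_written >= soft_max_rows_per_output_file {
            let done = std::mem::replace(
                &mut open[idx],
                Writer { file_index: next_file_index, rows_written: 0 },
            );
            next_file_index += 1;
            finished.push(done);
        }
    }
    finished.extend(open.into_iter().filter(|w| w.rows_written > 0));
    finished
}

fn main() {
    // 10 batches of 1,000 rows with at least 4 parallel files and a soft
    // maximum of 3,000 rows per file yields 4 files (3000/3000/2000/2000).
    for w in demux(&[1000; 10], 4, 3000) {
        println!("file {} -> {} rows", w.file_index, w.rows_written);
    }
}
```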

Are these changes tested?

Yes, by existing tests.

Are there any user-facing changes?

Default behavior is now to output at least 4 files in parallel even if soft_max_rows_per_output_file is not reached.
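
For example, a user who prefers the original single-file behavior can lower the setting back to 1. A minimal sketch, assuming the standard SessionConfig setters and that the option is exposed under the usual datafusion.execution.* key:

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn main() {
    // Opting back into the pre-#7841 behavior: a single writer, so every
    // output file except the last reaches soft_max_rows_per_output_file.
    let config = SessionConfig::new()
        .set_usize("datafusion.execution.minimum_parallel_output_files", 1);
    let _ctx = SessionContext::with_config(config);
}
```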

github-actions bot added the core (Core DataFusion crate) and sqllogictest (SQL Logic Tests (.slt)) labels on Oct 16, 2023
devinjdangelo marked this pull request as ready for review on October 18, 2023 at 12:09
@devinjdangelo (Contributor, Author):

@alamb @metesynnada This PR and #7801 are rebased and ready for review when you have a chance. This one is smaller and addresses the performance regression, so probably best to prioritize this one.

@alamb (Contributor) commented on Oct 18, 2023:

Thank you @devinjdangelo -- I have been accumulating quite a review backlog while working on some other writing projects lol -- I hope to make a dent in this backlog tomorrow

[Screenshot of the review backlog, 2023-10-18 5:21 PM]

@alamb (Contributor) left a review:

Thank you @devinjdangelo -- this is (another) really nice PR.

I reran the test from #7791 (review)

I confirm this PR goes much faster. The fact that the setting is configurable also means users can trade off buffering against producing fewer, more compacted files, which is very nice.

It is also really nice that this PR still doesn't make empty files if there are no batches to send.

The only thing I think this PR needs prior to merge is some sort of test (perhaps you could set minimum_parallel_output_files to 1 and demonstrate that a single file is created, and then set minimum_parallel_output_files to 3 and demonstrate that more than one file is created).
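
A rough sketch of the shape such a test could take. The out_dir name is a placeholder and the write step is omitted, since the actual tests added for this PR are not shown here:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Count regular files under `dir`.
fn count_output_files(dir: &Path) -> io::Result<usize> {
    Ok(fs::read_dir(dir)?
        .filter_map(Result::ok)
        .filter(|entry| entry.path().is_file())
        .count())
}

fn main() -> io::Result<()> {
    // `out_dir` would first be populated by a COPY/INSERT run with a given
    // minimum_parallel_output_files value (write step omitted here).
    let out_dir = Path::new("out_dir");
    if out_dir.exists() {
        // With minimum_parallel_output_files = 1, expect exactly one file;
        // with 3 (and enough input batches), expect more than one.
        println!("{} output files", count_output_files(out_dir)?);
    }
    Ok(())
}
```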

Resolved review threads: datafusion/common/src/config.rs, datafusion/core/src/datasource/file_format/write.rs

The thread below discusses this snippet from datafusion/common/src/config.rs:
/// RecordBatches will be distributed in round robin fashion to each
/// parallel writer. Each writer is closed and a new file opened once
/// soft_max_rows_per_output_file is reached.
pub minimum_parallel_output_files: usize, default = 4
Contributor:

What do you think about defaulting to the number of cores (maybe if this was 0)?

Contributor Author:

The returns from additional cores seem to decline very quickly beyond 4 tasks in my testing. I believe this is because with ~4 parallel serialization tasks, serialization no longer bottlenecks the end-to-end execution plan. Going beyond 4 tasks mostly yields higher memory usage and smaller output files for little benefit.

My testing was mostly on a 32-core system. I have not tested on enough different configurations to know whether core_count/8 is a reasonable default or whether a static 4 tasks is a decent default.

It will also depend a lot on the actual execution plan. If you are writing a pre-cached, in-memory dataset, then you definitely want 1 task/output file per core.
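
For reference, if the 0-means-core-count convention were adopted, the resolution might look roughly like this (illustrative only; not part of this PR):

```rust
use std::thread::available_parallelism;

/// Hypothetical resolution if 0 meant "use the number of cores"; this
/// convention was only discussed in review, not implemented in this PR.
fn resolve_min_parallel_output_files(configured: usize) -> usize {
    if configured == 0 {
        available_parallelism().map(|n| n.get()).unwrap_or(4)
    } else {
        configured
    }
}

fn main() {
    println!("0 resolves to {}", resolve_min_parallel_output_files(0));
    println!("4 resolves to {}", resolve_min_parallel_output_files(4));
}
```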

Contributor Author:

I plan to work on a statement-level option soon, so you could easily do:

copy my_in_memory_table to my_dir (format parquet, output_files 32);

to boost the parallelism for specific plans that benefit from it.

Contributor:

Makes sense to me

@alamb (Contributor) commented on Oct 20, 2023:

This PR has a small conflict, but I am pretty sure once that is fixed it will be ready to go

(Commit co-authored by: Andrew Lamb <andrew@nerdnetworks.org>)
@devinjdangelo (Contributor, Author):

> This PR has a small conflict, but I am pretty sure once that is fixed it will be ready to go

I'll sort this out today, and see if I can improve the tests as you suggested.

@alamb (Contributor) commented on Oct 21, 2023:

LGTM -- thanks again @devinjdangelo

alamb merged commit 9fde5c4 into apache:main on Oct 21, 2023. 23 checks passed.