
FilePartition and PartitionedFile for scanning flexibility #932

Merged 9 commits into apache:master on Aug 30, 2021

Conversation

@yjshen (Member) commented Aug 23, 2021

Which issue does this PR close?

Closes #946.

Rationale for this change

  1. To enable finer-grained readers that can parallelize the reading of even a single file, and to balance workload between scanning threads when input file sizes vary greatly. As I quote @andygrove here:

One of the current issues IMO with DataFusion is that we use "file" as the default unit of partitioning. We would be able to scale better if we had finer-grained readers such as reading Parquet row groups instead. This way we can have multiple threads reading from the same file concurrently and avoid the need to repartition first to increase concurrency.

  2. Refactor the logic in ParquetExec and the parquet datasource. It is strange to call ParquetExec::try_from_path to get planning-related metadata.

What changes are included in this PR?

  1. PartitionedFile -> a single file (for the moment) or, later, part of a file (a subset of its row groups or rows). We may even extend this to carry partition values and a partition schema to support partitioned tables:
    /path/to/table/root/p_date=20210813/p_hour=1200/xxxxx.parquet

  2. FilePartition -> the basic unit of parallel processing; each task is responsible for processing one FilePartition, which is composed of several PartitionedFiles.

  3. Update the ballista protocol as well as the serdes to use the new abstraction.

  4. Separate the planning-related code from ParquetExec.
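The two abstractions above can be sketched roughly as follows. This is a simplified illustration, not the exact DataFusion definitions: the field names and the `repartition` helper (which packs files into partitions by size to balance scanning work) are hypothetical.

```rust
// Hypothetical sketch of the two abstractions (not the actual PR code).
#[derive(Debug, Clone)]
struct PartitionedFile {
    file_path: String,
    size: u64, // used to balance work across partitions
}

#[derive(Debug)]
struct FilePartition {
    index: usize,
    files: Vec<PartitionedFile>,
}

// Greedily pack files into `n` partitions by size, largest first, so
// each scanning task gets a comparable amount of work even when input
// file sizes vary greatly.
fn repartition(mut files: Vec<PartitionedFile>, n: usize) -> Vec<FilePartition> {
    files.sort_by(|a, b| b.size.cmp(&a.size));
    let mut parts: Vec<FilePartition> = (0..n)
        .map(|index| FilePartition { index, files: vec![] })
        .collect();
    for f in files {
        // add to the partition with the smallest total size so far
        let p = parts
            .iter_mut()
            .min_by_key(|p| p.files.iter().map(|f| f.size).sum::<u64>())
            .unwrap();
        p.files.push(f);
    }
    parts
}

fn main() {
    let files = vec![
        PartitionedFile { file_path: "a.parquet".into(), size: 100 },
        PartitionedFile { file_path: "b.parquet".into(), size: 60 },
        PartitionedFile { file_path: "c.parquet".into(), size: 40 },
    ];
    let parts = repartition(files, 2);
    assert_eq!(parts.len(), 2);
    // the largest file sits alone; the two smaller ones share a partition
    assert_eq!(parts[0].files.len(), 1);
    assert_eq!(parts[1].files.len(), 2);
}
```

Later, a PartitionedFile could point at a row-group range within a file instead of a whole file, without changing the FilePartition side of the abstraction.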

Are there any user-facing changes?

No.

@github-actions bot added the ballista and datafusion (Changes in the datafusion crate) labels Aug 23, 2021
@yjshen yjshen changed the title FilePartition and partitionedFile for scanning flexibility FilePartition and PartitionedFile for scanning flexibility Aug 23, 2021
@github-actions bot added the sql (SQL Planner) label Aug 25, 2021
@yjshen yjshen marked this pull request as ready for review August 25, 2021 14:47
@yjshen (Member, Author) commented Aug 25, 2021

cc @houqp @alamb @andygrove for review

@houqp (Member) left a comment

Left a couple of minor comments; the rest looks good to me!

ballista/rust/core/src/serde/logical_plan/from_proto.rs (resolved)
ballista/rust/core/src/serde/logical_plan/to_proto.rs (resolved)
.collect(),
schema: Some(parquet_desc.schema().as_ref().into()),
partitions: vec![FilePartitionMetadata {
filename: vec![path],
Member:

I remember we discussed this in the original PR. After taking a second look at the code, I am still not fully following the change here. The old behavior set FilePartitionMetadata.filename to a vector of file paths returned from a directory listing, while the new behavior here always sets filename to a vector with a single entry whose value is the root path of the table.

Shouldn't we use parquet_desc.descriptor.descriptor to build the filename vector here instead?

@yjshen (Member, Author) commented Aug 29, 2021

I changed it to a vector of all the files.

However, after searching the project for a while, I find this method may not actually be used; the intention of this RPC is hard to understand as well. Perhaps it's deprecated and we should remove it later?

 rpc GetFileMetadata (GetFileMetadataParams) returns (GetFileMetadataResult) {}

Member:

I had the same question when I was going through the code base yesterday; I noticed it's only mentioned in ballista/docs/architecture.md. @andygrove do you know if this RPC method is still needed?

@houqp (Member) left a comment

Great refactor @yjshen!

@houqp houqp added api change Changes the API exposed to users of the crate enhancement New feature or request labels Aug 29, 2021
@alamb (Contributor) left a comment

This is looking great @yjshen -- thank you for persevering. I think this PR looks great other than the addition of filter to the LogicalPlanBuilder::scan (see comments on that).

I didn't review the ballista changes, but I assume they are mostly mechanical.

Again, thank you so much and sorry for the long review cycle

pub file_path: String,
/// Statistics of the file
pub statistics: Statistics,
// Values of partition columns to be appended to each row
Contributor:

I think that in order to take full advantage of partition values (which might span multiple columns, for example), more information about the partitioning scheme will be needed (e.g. which expression is used to generate the partitioning values). Adding partitioning support to DataFusion's planning / execution is probably worth its own discussion.

(that is to say I agree with postponing adding anything partition specific)
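For illustration, hive-style partition values like the p_date=20210813/p_hour=1200 path from the PR description could be recovered from a file path with something like the sketch below. This is hypothetical code, not part of this PR; the `partition_values` helper is invented to show what a PartitionedFile might eventually carry and append to each row.

```rust
// Hypothetical sketch: extract hive-style `key=value` partition
// segments from a file path, e.g.
// /path/to/table/root/p_date=20210813/p_hour=1200/part-0.parquet
fn partition_values(path: &str) -> Vec<(String, String)> {
    path.split('/')
        .filter_map(|seg| {
            // only segments of the form `key=value` are partition values
            let (k, v) = seg.split_once('=')?;
            Some((k.to_string(), v.to_string()))
        })
        .collect()
}

fn main() {
    let vals =
        partition_values("/path/to/table/root/p_date=20210813/p_hour=1200/part-0.parquet");
    assert_eq!(vals, vec![
        ("p_date".to_string(), "20210813".to_string()),
        ("p_hour".to_string(), "1200".to_string()),
    ]);
}
```

As the comment above notes, a full design would also need the partition schema (column types) and the expression that produced the values, which is why the PR postpones anything partition specific.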

datafusion/src/datasource/mod.rs (multiple review comments, resolved)
@@ -27,14 +27,14 @@ use crate::{
logical_plan::{Column, Expr},
physical_optimizer::pruning::{PruningPredicate, PruningStatistics},
physical_plan::{
common, DisplayFormatType, ExecutionPlan, Partitioning, SendableRecordBatchStream,
Contributor:

I really like how the statistics and schema related code has been moved out of physical_plan and into datasource

}

/// Convert a table provider into a builder with a TableScan
pub fn scan(
table_name: impl Into<String>,
provider: Arc<dyn TableProvider>,
projection: Option<Vec<usize>>,
filters: Option<Vec<Expr>>,
Contributor:

I think this argument is likely going to be confusing to users and it should be removed.

For example, as a user of LogicalPlanBuilder I would probably assume that the following plan would return only rows where a < 5:

  // Build a plan that looks like it would return only rows with `a < 5`
  let plan = builder.scan("table", provider, None, Some(vec![col("a").lt(lit(5))]));

However, I am pretty sure it could (and often would) return rows with a >= 5. This is because filters added to a TableScan node are optional (in the sense that the provider might not filter out rows that fail the predicate, but is not required to). Indeed, even for the parquet provider, the filters are only used for row group pruning, which may or may not be able to filter rows.

I think we could solve this with:

  1. Leave the scan signature alone and rely on the predicate pushdown optimization to push filters appropriately down to the scan (my preference, as it is simpler for users).
  2. Rename this argument to something like 'optional_filters_for_performance' and document what it does more carefully. I think it would be challenging to explain, as it might or might not do anything depending on how the data is laid out.
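To make the pitfall concrete, here is a toy model of why a scan-level filter is only advisory. None of this is the DataFusion API: `RowGroup`, `scan`, and the min/max pruning rule are invented for illustration, mimicking how Parquet row-group statistics allow pruning whole groups without guaranteeing per-row filtering.

```rust
// A "row group" with min/max statistics, as in a Parquet footer.
struct RowGroup {
    min: i64,
    max: i64,
    rows: Vec<i64>,
}

// Scan with an optional pushed-down predicate `a < limit`: prune whole
// row groups whose statistics prove no row can match, but pass through
// every row of the groups that survive (no per-row filtering).
fn scan(groups: &[RowGroup], pushed_limit: Option<i64>) -> Vec<i64> {
    groups
        .iter()
        .filter(|g| pushed_limit.map_or(true, |l| g.min < l))
        .flat_map(|g| g.rows.iter().copied())
        .collect()
}

fn main() {
    let groups = vec![
        RowGroup { min: 1, max: 10, rows: vec![1, 7, 10] }, // mixed: kept whole
        RowGroup { min: 20, max: 30, rows: vec![20, 30] },  // provably all >= 5: pruned
    ];
    // Pushdown alone still returns 7 and 10, which do NOT satisfy a < 5.
    let scanned = scan(&groups, Some(5));
    assert_eq!(scanned, vec![1, 7, 10]);
    // A plan-level Filter on top is what guarantees the semantics.
    let filtered: Vec<i64> = scanned.into_iter().filter(|a| *a < 5).collect();
    assert_eq!(filtered, vec![1]);
}
```

This is why option 1 is attractive: with a Filter node in the plan, the optimizer can still push the predicate into the scan as a performance hint, while correctness never depends on the scan honoring it.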

Member Author:

Removed, and I kept the filters not deserialized for ballista, as before.

@yjshen (Member, Author) commented Aug 29, 2021

@houqp @alamb I've resolved the comments, PTAL, thanks.

@houqp (Member) left a comment

LGTM!

@alamb (Contributor) left a comment

Thanks @yjshen !

@alamb alamb merged commit 8a085fc into apache:master Aug 30, 2021
@yjshen yjshen deleted the pf_only branch August 30, 2021 13:27
@yjshen (Member, Author) commented Aug 30, 2021

Thanks @houqp @alamb for your great help!

@houqp (Member) commented Aug 30, 2021

Thank you @yjshen for being patient and driving through this big change step by step :)

Labels
api change (Changes the API exposed to users of the crate), datafusion (Changes in the datafusion crate), enhancement (New feature or request), sql (SQL Planner)
Successfully merging this pull request may close these issues.

PartitionedFile abstraction for flexible table scan
3 participants