Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor to support recursive unnest in physical plan #11577

Merged

Conversation

duongcongtoai
Copy link
Contributor

@duongcongtoai duongcongtoai commented Jul 20, 2024

Which issue does this PR close?

Closes #11198 and #11689

Rationale for this change

We are not properly handling unnest where there are multiple unnest exprs at different level involved in the select exprs

What changes are included in this PR?

Implement RecursiveUnnestRewriter

It can transform a composite expr of unnest(unnest(unnest(columnA))) into a logical plan unnest(columnA, depth=3)

This implementation also fixed #11198, by the usage of push_projection_dedupl, meaning during the traversal of the complex expr, if a column is already projected, it should not be projected again (check #11198 (comment) for further context)

Support recursive unnest at physical layer by calling list_unnest_at_level multiple times

To fix #11689, i took a look how DuckDB and Postgres behave, where in the query there are multiple unnesting exprs, but with different recursion level.
This implementation follows DuckDB's behavior. For more context on the behavior difference, check this comment

This function is used to execute the unnesting on multiple columns all at once, but one level at a time, and is called n times, where n is the highest recursion level among the unnest exprs in the query.

For example giving the following query:

select unnest(colA, max_depth:=3) as P1, unnest(colA,max_depth:=2) as P2, unnest(colB, max_depth:=1) as P3 from temp;

Then the total times this function being called is 3

It needs to be aware of which level the current unnesting is, because if there exists multiple unnesting on the same column, but with different recursion levels, say unnest(colA, max_depth:=3) and unnest(colA, max_depth:=2), then the unnesting of expr unnest(colA, max_depth:=3) will start at level 3, while unnesting for expr unnest(colA, max_depth:=2) has to start at level 2

Set colA as a 3-dimension columns and colB as an array (1-dimension). As stated, this function is called with the descending order of recursion depth

Depth = 3

  • colA(3-dimension) unnest into temp column temp_P1(2_dimension) (unnesting of P1 starts from this level)
  • colA(3-dimension) having indices repeated by the unnesting operation above
  • colB(1-dimension) having indices repeated by the unnesting operation above

Depth = 2

  • temp_P1(2-dimension) unnest into temp column temp_P1(1-dimension)
  • colA(3-dimension) unnest into temp column temp_P2(2-dimension) (unnesting of P2 starts from this level)
  • colB(1-dimension) having indices repeated by the unnesting operation above

Depth = 1

  • temp_P1(1-dimension) unnest into P1
  • temp_P2(2-dimension) unnest into P2
  • colB(1-dimension) unnest into P3 (unnesting of P3 starts from this level)

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Jul 20, 2024
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Jul 30, 2024
@github-actions github-actions bot added the optimizer Optimizer rules label Aug 3, 2024
@github-actions github-actions bot added the core Core DataFusion crate label Aug 4, 2024
@github-actions github-actions bot added the physical-expr Physical Expressions label Aug 12, 2024
@duongcongtoai duongcongtoai marked this pull request as ready for review September 11, 2024 20:41
@duongcongtoai duongcongtoai marked this pull request as draft September 11, 2024 21:00
@@ -165,6 +165,34 @@ select unnest(column1), column1 from unnest_table;
6 [6]
12 [12]

# unnest at different level at the same time
query II
select unnest([1,2,3]), unnest(unnest([[1,2,3]]));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix #11689

@duongcongtoai duongcongtoai marked this pull request as ready for review September 14, 2024 08:20
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @duongcongtoai -- I started looking through this PR and it looks good so far -- I need to find a few more minutes to complete the review but I left some comments for now

}
}

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please add documentation to this struct about what it means (what computation it represents)? In the context of this PR it is clear, but once we merge I think the context will be lost

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, comment addressed

@@ -3049,17 +3051,54 @@ pub enum Partitioning {
DistributeBy(Vec<Expr>),
}

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum ColumnUnnestType {
List(Vec<ColumnUnnestList>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add documentation to List about what it is (ideally with an example of when it is used and how it is different to Inferred?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment addressed

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @duongcongtoai -- this is an epic PR. Very nice 👏

I think it is worth considering the public API (aka adding fields to UnnestOptions rather than new methods), but otherwise this is looking really nice.

It is well tested and well commented

I left a few other comments that would be nice to address too -- comments and adding a few more tests.

Thanks again

cc @jayzhan211

/// if one column is a list type, it can be recursively and simultaneously
/// unnested into the desired recursion levels
/// e.g select unnest(list_col,depth=1), unnest(list_col,depth=2)
pub fn unnest_columns_recursive_with_options(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternate API would be to add the ColumnUnnestType as a field on UnnestOptions

(for that matter, maybe we should just add columns: Vec<(Column, ColumnUnnestType)>, as a field 🤔

unnest_with_options(input, unnestings, UnnestOptions::default())
}

pub fn get_unnested_list_datatype_recursive(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this needs to be pub i think it should have some doc strings

}
}

pub fn get_struct_unnested_columns(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise here

preserve_nulls: true,
},
)?;
let actual =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the future, you an use assert_batches_eq for this kind of comparison

assert_eq!(
unnest_placeholder_columns,
vec!["UNNEST(struct_col[matrix])"]
// TODO: add a test case where
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still something you mean to do?

@@ -165,6 +165,34 @@ select unnest(column1), column1 from unnest_table;
6 [6]
12 [12]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any .slt tests for unnesting structures? these tests seem to be on arrays (Lists) rather than Structs

expr_in_unnest.clone().alias(placeholder_name.clone()),
);

// let post_unnest_column = Column::from_name(post_unnest_name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like leftover

// Full context, we are trying to plan the execution as InnerProjection->Unnest->OuterProjection
// inside unnest execution, each column inside the inner projection
// will be transformed into new columns. Thus we need to keep track of these placeholding column names
// let placeholder_name = unnest_expr.display_name()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like leftover

@alamb
Copy link
Contributor

alamb commented Sep 25, 2024

I think we can handle the follow ons / comments above as a follow on PR. Let's merge this one to keep things flowing.

@duongcongtoai let me know if you plan to work on the suggestions.

Thanks again. 🚀 onwards. 🙏 @jayzhan211

@alamb alamb merged commit b35e720 into apache:main Sep 25, 2024
24 checks passed
@toaiduongdh
Copy link

Yes thank you for the reviews, i'll create a follow up Pr

bgjackma pushed a commit to bgjackma/datafusion that referenced this pull request Sep 25, 2024
* chore: poc

* fix unnest struct

* UT for memoization

* remove unnessary projection

* chore: temp test case

* multi depth unnest supported

* chore: add map of original column and transformed col

* transformation map to physical layer

* prototype for recursive array length

* chore: some compile err

* finalize input type in physical layer

* chore: refactor unnest builder

* add unnesting type inferred

* fix compile err

* fail test in builder

* Compile err

* chore: detect some bugs

* some work

* support recursive unnest in physical layer

* UT for new build batch function

* compile err

* fix unnesting into empty arrays

* some comment

* fix unnest struct

* some note

* chore: fix all test failure

* fix projection pushdown

* custom rewriter for recursive unnest

* simplify

* rm unnecessary projection

* chore: better comments

* more comments

* chore: better comments

* remove breaking api

* rename

* more unit test

* remove debug

* clean up

* fix proto

* fix dataframe

* fix clippy

* cargo fmt

* fix some test

* fix all test

* fix unnest in join

* fix doc and tests

* chore: better doc

* better doc

* tune comment

* rm todo

* refactor

* chore: reserve test

* add a basic test

* chore: more document

* doc on ColumnUnnestType List

* chore: add partialord to new types
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions proto Related to proto crate sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
4 participants