Move wildcard expansions to the analyzer #11681

Merged
alamb merged 41 commits into apache:main from feature/11639-add-expand-rule
Aug 13, 2024

Conversation

goldmedal
Contributor

@goldmedal goldmedal commented Jul 27, 2024

Which issue does this PR close?

Closes #11639.

Rationale for this change

As discussed in #11639 (comment), this PR addresses the TODO about moving wildcard expansion into the analyzer:

// TODO: move it into analyzer
let input_schema = plan.schema();
let mut projected_expr = vec![];
for e in expr {
    let e = e.into();
    match e {
        Expr::Wildcard { qualifier: None } => {
            projected_expr.extend(expand_wildcard(input_schema, &plan, None)?)
        }
        Expr::Wildcard {
            qualifier: Some(qualifier),
        } => projected_expr.extend(expand_qualified_wildcard(
            &qualifier,
            input_schema,
            None,
        )?),

This PR keeps the wildcard expression unexpanded during SQL planning and implements ExpandWildcardRule to expand it in the analyzer.
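To illustrate the idea of expanding wildcards in a separate pass, here is a minimal sketch. The `Expr` enum, the `Field` alias, and `expand_wildcards` are simplified stand-ins invented for this example; they are not DataFusion's real types or the actual ExpandWildcardRule implementation.

```rust
// Simplified stand-ins; these are NOT DataFusion's real types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// A reference to a single column, optionally qualified by a table name.
    Column { qualifier: Option<String>, name: String },
    /// `*` (qualifier: None) or `table.*` (qualifier: Some).
    Wildcard { qualifier: Option<String> },
}

/// A schema field: (table qualifier, column name).
type Field = (String, String);

/// Replace every wildcard in `exprs` with the columns it stands for.
/// Returns an error when a wildcard matches no column of the schema.
fn expand_wildcards(exprs: Vec<Expr>, schema: &[Field]) -> Result<Vec<Expr>, String> {
    let mut out = Vec::new();
    for e in exprs {
        match e {
            Expr::Wildcard { qualifier } => {
                let before = out.len();
                for (table, name) in schema {
                    // An unqualified `*` matches every field; `t.*` only fields of `t`.
                    if qualifier.as_ref().map_or(true, |q| q == table) {
                        out.push(Expr::Column {
                            qualifier: Some(table.clone()),
                            name: name.clone(),
                        });
                    }
                }
                if out.len() == before {
                    return Err(format!("wildcard {:?} matched no columns", qualifier));
                }
            }
            // Non-wildcard expressions pass through unchanged.
            other => out.push(other),
        }
    }
    Ok(out)
}
```

Under these assumptions, the planner can emit `Expr::Wildcard` as-is and a later pass calls `expand_wildcards` once the input schema is known.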

What changes are included in this PR?

This change touches many parts of the codebase. I made the required modifications to each of them.

Expanding expressions to fields for the schema of the Projection plan

When planning SQL, we rely on the plan's schema to perform multiple validations and optimizations. So even though the expansion of Expr::Wildcard moves to the analyzer, the schema must still contain the actual column information instead of a wildcard field.
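A small sketch of that requirement: the expression list keeps `*` unexpanded, while the schema helpers resolve it against the input columns. The function names echo utils::exprlist_len and utils::exprlist_to_fields from this PR, but the types and signatures below are simplified assumptions, not the real ones.

```rust
// Toy model (not DataFusion's real API): a projection keeps `*` unexpanded,
// but its schema is still computed as if the wildcard were expanded.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Wildcard,
}

/// Number of output columns the expression list will produce.
fn exprlist_len(exprs: &[Expr], input_columns: &[String]) -> usize {
    exprs
        .iter()
        .map(|e| match e {
            Expr::Wildcard => input_columns.len(),
            Expr::Column(_) => 1,
        })
        .sum()
}

/// Output field names, with wildcards resolved against the input schema.
fn exprlist_to_fields(exprs: &[Expr], input_columns: &[String]) -> Vec<String> {
    exprs
        .iter()
        .flat_map(|e| match e {
            Expr::Wildcard => input_columns.to_vec(),
            Expr::Column(name) => vec![name.clone()],
        })
        .collect()
}
```

With input columns `a, b` and the list `[*, a]`, this sketch reports three output fields even though the list itself still holds the wildcard.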

We should be careful with calc_func_dependencies_for_project. functional_dependencies is related to the implementation of constraints (#7040) and also affects the implicit grouping keys optimization (#11903 (comment)).

Handle WildcardAdditionalOptions for replace, exclude, except, ...

DataFusion supports options for wildcard expressions. Because the expansion moves to the analyzer, WildcardOptions must be carried within Expr::Wildcard, and these options need to be considered wherever the wildcard is expanded:

  • ExpandWildcardRule
  • utils::exprlist_to_fields
  • utils::exprlist_len
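A rough sketch of how such options might be applied during expansion. The `Options` struct and `expand_with_options` are hypothetical and far simpler than DataFusion's WildcardOptions; expressions are represented as plain strings to keep the example short.

```rust
// Hypothetical, simplified options; DataFusion's WildcardOptions carries more
// (and richer) fields than this sketch.
#[derive(Debug, Default)]
struct Options {
    /// Column names to drop from the expansion (EXCLUDE / EXCEPT).
    exclude: Vec<String>,
    /// (column name, replacement expression text) pairs (REPLACE).
    replace: Vec<(String, String)>,
}

/// Expand `*` over `columns`, returning (expression text, output name) pairs.
fn expand_with_options(columns: &[&str], opts: &Options) -> Vec<(String, String)> {
    let mut out = Vec::new();
    for &col in columns {
        // Excluded columns never reach the output.
        if opts.exclude.iter().any(|x| x.as_str() == col) {
            continue;
        }
        // A replaced column keeps its name but takes the replacement expression.
        let expr = opts
            .replace
            .iter()
            .find(|(name, _)| name.as_str() == col)
            .map(|(_, e)| e.clone())
            .unwrap_or_else(|| col.to_string());
        out.push((expr, col.to_string()));
    }
    out
}
```

The same logic has to be mirrored in every place that computes fields or lengths from an unexpanded wildcard, which is why the three call sites above are listed together.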

Type_coercion for union

Previously, type coercion was done when planning the union. We need to do it again after expanding the wildcard expression.
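To show why coercion has to run after expansion, here is a toy sketch. The `DataType` enum and its "wider type wins" ordering are assumptions made for the example; DataFusion's real coercion rules are considerably more involved.

```rust
// Toy type lattice: variants are declared from narrowest to widest, so the
// derived ordering can stand in for "which type is wider". This is an
// assumption of the sketch, not DataFusion's coercion logic.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum DataType {
    Int32,
    Int64,
    Float64,
}

/// Common column types for the two sides of a UNION, or an error when the
/// column counts differ. Both checks need the real columns, which are only
/// known once wildcards have been expanded.
fn coerce_union(left: &[DataType], right: &[DataType]) -> Result<Vec<DataType>, String> {
    if left.len() != right.len() {
        return Err(format!(
            "UNION inputs have {} and {} columns",
            left.len(),
            right.len()
        ));
    }
    Ok(left
        .iter()
        .zip(right)
        .map(|(l, r)| (*l).max(*r))
        .collect())
}
```

If one side of the union is still `*`, neither the column count nor the per-column types are available, so a coercion pass over the unexpanded plan cannot be final.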

Unparsing Expr::Wildcard

I also implemented unparsing for Expr::Wildcard to pass the tests. However, I left a TODO for unparsing WildcardOptions.
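A minimal sketch of what unparsing a wildcard amounts to. `unparse_wildcard` is a hypothetical helper written for this example; it ignores identifier quoting and, like the PR, leaves the options out.

```rust
/// Render a wildcard back to SQL text: `*` or `<qualifier>.*`.
/// Hypothetical helper; identifier quoting and wildcard options are ignored.
fn unparse_wildcard(qualifier: Option<&str>) -> String {
    match qualifier {
        Some(q) => format!("{}.*", q),
        None => "*".to_string(),
    }
}
```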

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules labels Jul 27, 2024
@goldmedal goldmedal force-pushed the feature/11639-add-expand-rule branch from bdf474e to 59cf620 Compare August 5, 2024 16:06
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Aug 7, 2024
@goldmedal
Contributor Author

We perform many optimizations and validations based on the expanded expressions when planning SQL. Ideally, these should move behind ExpandWildcardRule to avoid additional expansions for the schema fields.

Comment on lines +67 to +74
fn apply_required_rule(logical_plan: LogicalPlan) -> Result<LogicalPlan> {
    let options = ConfigOptions::default();
    Analyzer::with_rules(vec![Arc::new(ExpandWildcardRule::new())]).execute_and_check(
        logical_plan,
        &options,
        |_, _| {},
    )
}
Contributor Author

To generate the correct schema, we need to apply ExpandWildcardRule to the view's plan first.

Comment on lines +628 to +633
if let Some(replace) = options.opt_replace {
    let replace_expr = replace
        .items
        .iter()
        .map(|item| {
            Ok(self.sql_select_to_rex(
Contributor Author

We need to plan the replace expressions first and then expand them in ExpandWildcardRule.

@goldmedal goldmedal marked this pull request as ready for review August 12, 2024 17:50
@goldmedal
Contributor Author

Hi @jayzhan211 @alamb
I ran into several issues, so it took me longer than expected to finish this PR, and the changes ended up being a lot more extensive than I initially thought. I’ve tried to keep the original SQL behavior as much as possible, but I think we should ideally move more of the logic to the Analyzer.
If you could take a look at this PR, I’d appreciate it. Thanks!

Contributor

@jayzhan211 jayzhan211 left a comment

Overall looks good to me!


#[derive(Clone, PartialEq, Eq, Hash, Debug, Default)]
pub struct PlannedReplaceSelectItem {
    pub items: Vec<Box<ReplaceSelectElement>>,
Contributor

I don't think we need Box here; Vec<ReplaceSelectElement> should be enough.

    .map(|c| c.flat_name())
    .collect();
Ok::<_, DataFusionError>(
    (0..wildcard_schema.fields().len())
Contributor

I think you can use wildcard_schema.field_names()

    .zip(replace.expressions().iter())
    .find(|(item, _)| item.column_name.value == *name)
{
    *expr = Expr::Alias(Alias {
Contributor

You can try new_expr.alias(name)

datafusion/sqllogictest/test_files/window.slt (resolved)
    items: replace.items,
    planned_expressions: replace_expr,
};
Ok(WildcardOptions {
Contributor

Maybe it is worth having a with_replace function, so that we only need options.with_replace(planned_option).
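One way such a builder-style method could look, as a sketch only. The structs below are cut-down, hypothetical versions of the ones in this PR (string fields instead of planned expressions), and the merged signature may differ.

```rust
// Cut-down, hypothetical versions of the structs discussed in this thread.
#[derive(Debug, Clone, Default, PartialEq)]
struct PlannedReplace {
    columns: Vec<String>,
    planned_expressions: Vec<String>,
}

#[derive(Debug, Clone, Default, PartialEq)]
struct WildcardOptions {
    exclude: Option<Vec<String>>,
    replace: Option<PlannedReplace>,
}

impl WildcardOptions {
    /// Return a copy of the options with `replace` set, leaving every other
    /// option untouched.
    fn with_replace(self, replace: PlannedReplace) -> Self {
        WildcardOptions {
            replace: Some(replace),
            ..self
        }
    }
}
```

The struct-update syntax (`..self`) means the caller no longer has to copy each remaining option by hand when only the replace item changes.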

@@ -61,6 +64,15 @@ impl ViewTable {
        Ok(view)
    }

    fn apply_required_rule(logical_plan: LogicalPlan) -> Result<LogicalPlan> {
Contributor

👍

@@ -970,6 +976,60 @@ impl GroupingSet {
    }
}

#[derive(Clone, PartialEq, Eq, Hash, Debug, Default)]
pub struct WildcardOptions {
    pub opt_ilike: Option<IlikeSelectItem>,
Contributor

nit: I think the opt_ prefix (opt_xxx) is redundant.

@github-actions github-actions bot added the proto Related to proto crate label Aug 13, 2024
Contributor

@alamb alamb left a comment

This looks pretty epic -- thank you @goldmedal and @jayzhan211

@@ -970,6 +976,72 @@ impl GroupingSet {
    }
}

#[derive(Clone, PartialEq, Eq, Hash, Debug, Default)]
pub struct WildcardOptions {
Contributor

As a follow-on PR, can we add some doc comments to this struct explaining why this structure is needed and how to interpret it?

It seems like the core rationale is that wildcards have different semantics depending on what part of the query they appear in?

Contributor Author

Sure, I added descriptions for the purpose and source of each option.

Comment on lines 141 to 142
/// use datafusion_common::TableReference;
/// use datafusion_expr::{qualified_wildcard};
Contributor

If you put a # in front of lines in a doc example, they are not shown in the rendered docs but are still compiled.

Suggested change
- /// use datafusion_common::TableReference;
- /// use datafusion_expr::{qualified_wildcard};
+ /// # use datafusion_common::TableReference;
+ /// # use datafusion_expr::{qualified_wildcard};

@alamb
Contributor

alamb commented Aug 13, 2024

🚀

Thanks again @goldmedal and @jayzhan211

@alamb alamb merged commit 3438b35 into apache:main Aug 13, 2024
24 checks passed
@goldmedal
Contributor Author

Thanks @alamb @jayzhan211 ❤️

@goldmedal goldmedal deleted the feature/11639-add-expand-rule branch August 14, 2024 01:46
@alamb
Contributor

alamb commented Aug 14, 2024

Man I could spend all day reviewing DataFusion PRs these days. There is so much good stuff going on

Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules proto Related to proto crate sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Allow custom planning behavior for selecting wildcard expression
3 participants