Fix the schema mismatch between logical and physical for aggregate function, add `AggregateUDFImpl::is_null` #11989

jayzhan211 · 2024-08-14T15:28:18Z

Which issue does this PR close?

Closes #.
Part of #11782. It would be nice to clean up the schema before fighting with the physical name.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

jayzhan211 · 2024-08-14T15:28:38Z

datafusion/core/src/physical_planner.rs

+                let physical_input_schema_from_logical: Arc<Schema> =
+                    logical_input_schema.as_ref().clone().into();
+
+                debug_assert_eq!(physical_input_schema_from_logical, physical_input_schema, "Physical input schema should be the same as the one converted from logical input schema. Please file an issue or send the PR");


The main goal of the change is to ensure they are the same. We also pass physical_input_schema through the functions that require the input's schema.

Nice!

Did you consider making this function return an internal_error rather than debug_assert?

If we are concerned about breaking existing tests, we could add a config setting like datafusion.optimizer.skip_failed_rules to let users bypass the check

The objective here is to ensure that the logical schema from ExprSchemable and the physical schema from ExecutionPlan.schema() are equivalent. If they are not, it indicates a potential schema mismatch issue. This is also why most of the code changes in this PR fix schema-related things. They are all required, so I don't think we should let users bypass the check 🤔

If we encounter inconsistent schemas, it raises an important question: Which schema should we use?

Did you consider making this function return an internal_error rather than debug_assert?

It looks good to me
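To make the `internal_error` suggestion concrete, here is a minimal std-only sketch. `Field`, `Schema`, `check_input_schemas`, and the `skip_check` flag are all hypothetical stand-ins invented for illustration, not the Arrow or DataFusion types, and the bypass flag only mirrors the idea of a config setting raised above.

```rust
/// Simplified stand-in for an Arrow field (assumption, not the real type).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: String,
    pub nullable: bool,
}

/// Simplified stand-in for an Arrow schema (assumption, not the real type).
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<Field>,
}

/// Returns an error in every build profile when the two schemas differ,
/// instead of only asserting in debug builds.
/// `skip_check` is a hypothetical opt-out, standing in for a config setting.
pub fn check_input_schemas(
    from_logical: &Schema,
    physical: &Schema,
    skip_check: bool,
) -> Result<(), String> {
    if skip_check || from_logical == physical {
        return Ok(());
    }
    Err(format!(
        "Physical input schema should be the same as the one converted from \
         logical input schema. logical: {:?}, physical: {:?}",
        from_logical, physical
    ))
}
```

Unlike `debug_assert_eq!`, this check also runs in release builds, and including both schemas in the message makes a mismatch easier to report.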

jayzhan211 · 2024-08-14T15:29:59Z

datafusion/core/src/physical_planner.rs

@@ -1599,11 +1603,10 @@ pub fn create_aggregate_expr_with_name_and_maybe_filter(
                let ordering_reqs: Vec<PhysicalSortExpr> =
                    physical_sort_exprs.clone().unwrap_or(vec![]);

-                let schema: Schema = logical_input_schema.clone().into();


workaround cleanup

jayzhan211 · 2024-08-14T15:30:29Z

datafusion/expr/src/expr_schema.rs

+                    WindowFunctionDefinition::AggregateUDF(func) => {
+                        // TODO: UDF should be able to customize nullability
+                        if func.name() == "count" {
+                            // TODO: there is issue unsolved for count with window, should return false


I'm not so familiar with window functions yet, so I'm leaving it as a TODO.

Perhaps we can file a ticket to track this -- ideally it would eventually be part of the window function definition itself rather than relying on names

jayzhan211 · 2024-08-14T15:32:29Z

datafusion/physical-plan/src/windows/utils.rs

+use datafusion_physical_expr::window::WindowExpr;
+use std::sync::Arc;
+
+pub(crate) fn create_schema(


Move the common function to utils. The logic is the same.

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

alamb

This seems like a good change to me, but I don't fully understand how it is all connected. Thank you for taking this on @jayzhan211

I am quite concerned about the use of unsafe but otherwise I think all this PR needs is some TODOs with ticket references and it would be good to go from my perspective.

alamb · 2024-08-16T15:24:50Z

datafusion/core/src/physical_planner.rs

+                let physical_input_schema_from_logical: Arc<Schema> =
+                    logical_input_schema.as_ref().clone().into();
+
+                debug_assert_eq!(physical_input_schema_from_logical, physical_input_schema, "Physical input schema should be the same as the one converted from logical input schema. Please file an issue or send the PR");


Nice!

Did you consider making this function return an internal_error rather than debug_assert?

If we are concerned about breaking existing tests, we could add a config setting like datafusion.optimizer.skip_failed_rules to let users bypass the check

alamb · 2024-08-16T15:25:59Z

datafusion/expr/src/expr_schema.rs

+                    WindowFunctionDefinition::AggregateUDF(func) => {
+                        // TODO: UDF should be able to customize nullability
+                        if func.name() == "count" {
+                            // TODO: there is issue unsolved for count with window, should return false


Perhaps we can file a ticket to track this -- ideally it would eventually be part of the window function definition itself rather than relying on names

alamb · 2024-08-16T15:26:16Z

datafusion/expr/src/expr_schema.rs

@@ -328,10 +328,45 @@ impl ExprSchemable for Expr {
                    Ok(true)
                }
            }
+            Expr::WindowFunction(WindowFunction { fun, .. }) => {


Is this change required for this PR or is it a "drive by" improvement?

alamb · 2024-08-16T15:28:43Z

datafusion/expr/src/expr_schema.rs

+                }
+            }
+            Expr::ScalarFunction(ScalarFunction { func, args }) => {
+                // If all the element in coalesce is non-null, the result is non-null


We should probably add an API to ScalarUDFImpl to signal its nullability (as a follow-on PR) instead of hard-coding this function name, for example:

func.is_nullable(args)
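A rough sketch of what such an API could look like, using only std. The `ScalarFunc` trait and its `is_nullable` signature are hypothetical simplifications, not the real `ScalarUDFImpl`. The `Coalesce` rule follows the comment in the diff above: if every argument is non-null, the result is non-null.

```rust
/// Hypothetical, simplified scalar function trait (not the real `ScalarUDFImpl`).
pub trait ScalarFunc {
    fn name(&self) -> &str;

    /// `arg_nullable[i]` says whether argument `i` may be null.
    /// The default is conservative: the result may be null.
    fn is_nullable(&self, _arg_nullable: &[bool]) -> bool {
        true
    }
}

pub struct Coalesce;

impl ScalarFunc for Coalesce {
    fn name(&self) -> &str {
        "coalesce"
    }

    /// Rule taken from the diff comment: when all arguments are non-null,
    /// the result is non-null; otherwise treat it as nullable.
    fn is_nullable(&self, arg_nullable: &[bool]) -> bool {
        arg_nullable.iter().any(|&nullable| nullable)
    }
}

/// A function that keeps the conservative default.
pub struct Upper;

impl ScalarFunc for Upper {
    fn name(&self) -> &str {
        "upper"
    }
}
```

With this shape the planner asks the function itself, so no `func.name() == "coalesce"` comparison is needed.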

alamb · 2024-08-16T15:30:53Z

datafusion/expr/src/udaf.rs

@@ -196,6 +196,10 @@ impl AggregateUDF {
        self.inner.state_fields(args)
    }

+    pub fn fields(&self, args: StateFieldsArgs) -> Result<Field> {


Could we document this function and what it is for (also in AggregateUDFImpl)?

Also, the name is strange to me -- it is fields but it returns a single Field and the corresponding function on AggregateUDFImpl is called field (no s) 🤔

alamb · 2024-08-16T15:31:45Z

datafusion/functions-aggregate-common/src/aggregate.rs

@@ -171,6 +171,9 @@ pub trait AggregateExpr: Send + Sync + Debug + PartialEq<dyn Any> {
    fn get_minmax_desc(&self) -> Option<(Field, bool)> {
        None
    }
+
+    /// Get function's name, for example `count(x)` returns `count`
+    fn func_name(&self) -> &str;


Is there a reason this isn't name()? func_name is fine, it just seems inconsistent with the rest of the code

This is to identify the function (i.e. count). There is name() already, but it includes the arguments (i.e. count(x)), which is not what I want.
An alternative is to introduce nullable() for AggregateUDF, so we don't need name checking. Maybe I should have done that before this PR.
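The alternative mentioned here could look roughly like the following std-only sketch. `NullableAggregate`, `Count`, `Sum`, and `output_nullable` are hypothetical names for illustration and do not reflect the actual `AggregateUDFImpl` signatures.

```rust
/// Hypothetical, simplified aggregate trait (not the real `AggregateUDFImpl`).
pub trait NullableAggregate {
    /// Bare function name, e.g. `count` rather than `count(x)`.
    fn name(&self) -> &str;

    /// Whether the aggregate result can be null. Defaults to nullable.
    fn is_nullable(&self) -> bool {
        true
    }
}

pub struct Count;

impl NullableAggregate for Count {
    fn name(&self) -> &str {
        "count"
    }

    /// `count` always produces a number, so it opts out of nullability.
    fn is_nullable(&self) -> bool {
        false
    }
}

pub struct Sum;

impl NullableAggregate for Sum {
    fn name(&self) -> &str {
        "sum"
    }
}

/// The output field's nullability comes from the function itself,
/// with no string comparison on the name.
pub fn output_nullable(func: &dyn NullableAggregate) -> bool {
    func.is_nullable()
}
```

Once nullability lives on the trait, a separate `func_name` accessor is no longer needed for this purpose.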

alamb · 2024-08-16T15:33:51Z

datafusion/optimizer/src/analyzer/type_coercion.rs

-            *union_nullable = *union_nullable || plan_field.is_nullable();
+
+        // Safety: Length is checked
+        unsafe {


I think this unsafe block is unnecessary -- this isn't a performance-critical piece of code. I think izip or just manually zipping three times would be better
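As a sketch of the safe alternative, the following std-only function zips the accumulators with plain iterators after an explicit length check. The function name and the tuple representation of a plan field are assumptions made for illustration; the real code in type_coercion.rs works on Arrow fields.

```rust
use std::collections::HashMap;

/// Merges one plan's fields into the running union accumulators.
/// Each plan field is modelled as `(nullable, metadata)` (an assumption).
pub fn merge_union_fields(
    union_nullable: &mut [bool],
    union_metadata: &mut [HashMap<String, String>],
    plan_fields: &[(bool, HashMap<String, String>)],
) -> Result<(), String> {
    // Explicit length check replaces the "Safety: Length is checked" contract.
    if union_nullable.len() != plan_fields.len()
        || union_metadata.len() != plan_fields.len()
    {
        return Err(format!(
            "Union field count mismatch: {} nullable flags, {} metadata maps, {} plan fields",
            union_nullable.len(),
            union_metadata.len(),
            plan_fields.len()
        ));
    }

    // Nested `zip` walks all three sequences in lockstep without indexing.
    for ((nullable, metadata), (plan_nullable, plan_metadata)) in union_nullable
        .iter_mut()
        .zip(union_metadata.iter_mut())
        .zip(plan_fields.iter())
    {
        *nullable = *nullable || *plan_nullable;
        metadata.extend(plan_metadata.clone());
    }
    Ok(())
}
```

`itertools::izip!` would flatten the nested tuple pattern, at the cost of an extra dependency in the sketch.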

alamb · 2024-08-16T15:34:32Z

datafusion/physical-expr/src/window/aggregate.rs

@@ -80,6 +80,14 @@ impl WindowExpr for PlainAggregateWindowExpr {
    }

    fn field(&self) -> Result<Field> {
+        // TODO: Fix window function to always return non-null for count


I don't understand this comment -- can we please file a ticket to track it (and add the ticket reference to the comments)?

alamb · 2024-08-16T15:34:58Z

datafusion/physical-expr/src/window/built_in.rs

@@ -97,6 +97,10 @@ impl BuiltInWindowExpr {
 }

 impl WindowExpr for BuiltInWindowExpr {
+    fn func_name(&self) -> Result<&str> {
+        not_impl_err!("function name not determined")


why wouldn't we implement func_name for a built in window function 🤔

Because I don't need it: func_name is only used for the name check in nullable.

jayzhan211 · 2024-08-17T00:59:53Z

I think func_name is indeed another workaround. I would like to get rid of it before this PR is merged.

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

jayzhan211 · 2024-08-17T09:11:10Z

datafusion/physical-expr-functions-aggregate/src/aggregate.rs

@@ -435,6 +459,10 @@ impl AggregateExpr for AggregateFunctionExpr {
            .is_descending()
            .and_then(|flag| self.field().ok().map(|f| (f, flag)))
    }
+
+    fn default_value(&self, data_type: &DataType) -> Result<ScalarValue> {


I need to touch 4 places to add a new function; there might be room to improve 🤔

Maybe we don't need AggregateExpr, since there is only one implementation. I think a trait is useful if at least 2 implementations share similar functions. Similar idea to #11810

berkaysynnada · 2024-08-20T08:12:33Z

datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs

+use std::collections::{HashMap, VecDeque};
+use std::pin::Pin;
+use std::sync::Arc;
+use std::task::{Context, Poll};



Can you move them back to the top?

berkaysynnada · 2024-08-20T08:13:23Z

datafusion/physical-expr/src/window/built_in.rs

@@ -35,6 +30,9 @@ use datafusion_common::utils::evaluate_partition_ranges;
 use datafusion_common::{Result, ScalarValue};
 use datafusion_expr::window_state::{WindowAggState, WindowFrameContext};
 use datafusion_expr::WindowFrame;
+use std::any::Any;
+use std::ops::Range;
+use std::sync::Arc;



Can you move them back to the top?

alamb · 2024-08-20T17:07:50Z

datafusion/expr/src/udaf.rs

+    /// while `count` returns 0 if input is Null
+    fn default_value(&self, data_type: &DataType) -> Result<ScalarValue> {
+        ScalarValue::try_from(data_type)
+    }
 }


Maybe we can improve the documentation?
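To illustrate the behaviour the doc comment describes, here is a small std-only sketch. `Value`, `AggregateWithDefault`, `Count`, and `Sum` are hypothetical stand-ins; the real method takes a `&DataType` and returns `Result<ScalarValue>`, which is simplified away here.

```rust
/// Simplified stand-in for a scalar value (assumption, not `ScalarValue`).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Null,
    Int64(i64),
}

/// Hypothetical, simplified aggregate trait (not the real `AggregateUDFImpl`).
pub trait AggregateWithDefault {
    /// Value reported when the input is null.
    /// Most aggregates report null, so that is the default.
    fn default_value(&self) -> Value {
        Value::Null
    }
}

pub struct Count;

impl AggregateWithDefault for Count {
    /// `count` reports 0 rather than null when the input is null.
    fn default_value(&self) -> Value {
        Value::Int64(0)
    }
}

pub struct Sum;

/// `sum` keeps the default and reports null.
impl AggregateWithDefault for Sum {}
```

A doc comment that states both halves (the null default and the `count` exception) would cover what this sketch shows.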

…ma-fix

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

jayzhan211 · 2024-08-21T00:47:53Z

Thanks @alamb @berkaysynnada @ozankabak

phillipleblanc · 2024-09-19T10:23:58Z

@jayzhan211 I'm in the process of upgrading spiceai to use DataFusion 42 and I'm running into the schema mismatch error from this PR:

Internal("Physical input schema should be the same as the one converted from logical input schema.")

I have a custom TableProvider and I get the error when running a SELECT COUNT(1) FROM my_table. This is what the explain plan looked like on DF 41:

let expected_plan = [
        "+---------------+--------------------------------------------------------------------------------+",
        "| plan_type     | plan                                                                           |",
        "+---------------+--------------------------------------------------------------------------------+",
        "| logical_plan  | Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]                              |",
        "|               |   BytesProcessedNode                                                           |",
        "|               |     TableScan: non_federated_abc projection=[]                                 |",
        "| physical_plan | AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]                      |",
        "|               |   CoalescePartitionsExec                                                       |",
        "|               |     AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]                |",
        "|               |       BytesProcessedExec                                                       |",
        "|               |         SchemaCastScanExec                                                     |",
        "|               |           RepartitionExec: partitioning=RoundRobinBatch(3), input_partitions=1 |",
        "|               |             SqlExec sql=SELECT \"id\", \"created_at\" FROM non_federated_abc       |",
        "|               |                                                                                |",
        "+---------------+--------------------------------------------------------------------------------+",
    ];

(BytesProcessedNode and BytesProcessedExec are custom operators that we inject for tracking the number of bytes processed. I don't think it's relevant to this bug, but I initially had a similar schema check for it that I ended up removing for the reason below.)

My assumption of what is going on here is that logically no columns are required for the logical plan to come up with the count of the number of rows, but the TableProvider has to return all of the columns because it needs the rows to perform the count aggregation. But it ends up throwing away the columns because they get erased in the aggregation. Thus the check that the physical schema and the logical schema are equal is not strictly needed for this plan. Does that sound right?

alamb · 2024-09-19T10:50:10Z

@itsjunetime is also having some issues related to this ticket during our upgrade of DataFusion; I am not sure if they are related

jayzhan211 · 2024-09-19T10:53:59Z

What is the logical schema and physical schema you have (the error)?

I think they should be consistent even for a count(*) statement. They (logical and physical) should either both have all the columns or both have no columns.

phillipleblanc · 2024-09-19T12:12:13Z

I added dbg! statements to both, and I see one has the fields and the one from the logical plan is empty:

[/Users/phillip/code/apache/datafusion/datafusion/core/src/physical_planner.rs:676:21] &physical_input_schema = Schema {
    fields: [
        Field {
            name: "id",
            data_type: Utf8,
            nullable: true,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {},
        },
        Field {
            name: "created_at",
            data_type: Timestamp(
                Nanosecond,
                Some(
                    "+00:00",
                ),
            ),
            nullable: true,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {},
        },
    ],
    metadata: {},
}
[/Users/phillip/code/apache/datafusion/datafusion/core/src/physical_planner.rs:677:21] &physical_input_schema_from_logical = Schema {
    fields: [],
    metadata: {},
}
thread 'acceleration::query_push_down::acceleration_with_and_without_federation' panicked at crates/runtime/tests/acceleration/query_push_down.rs:218:10:
collect working: Internal("Physical input schema should be the same as the one converted from logical input schema.")

jayzhan211 · 2024-09-19T13:45:21Z

I think the ideal way is to have something like a wildcard field for both logical and physical but this requires modification of Schema 🤔

logically no columns are required for the logical plan to come up with the count of the number of rows,

Could we keep all the fields for the logical plan to make them consistent?

phillipleblanc · 2024-09-19T13:51:55Z

Could we keep all the fields for the logical plan to make them consistent?

That seems fine to me.

itsjunetime · 2024-09-19T17:42:05Z

What is the logical schema and physical schema you have (the error)?

I'm getting issues where both schemas are equivalent except for the metadata on the fields. I think (correct me if I'm wrong) that field metadata doesn't actually need to be equivalent for the invariant that this error is trying to catch to be upheld. I think we could comfortably switch from directly comparing the schemas with PartialEq to using equivalent_names_and_types or something like that. I can file a ticket and reproducer/fix for this.

I'm also seeing issues that don't propagate with this exact error (but rather complain that count(*) is a non-nullable column that contains null values), so once I've figured out more specifics for this issue I can file another ticket for it.
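On the metadata point above, the difference between strict `PartialEq` and a names-and-types comparison can be sketched with std only. `Field` and `same_names_and_types` are hypothetical stand-ins for illustration and are not the DataFusion implementation of `equivalent_names_and_types`.

```rust
use std::collections::HashMap;

/// Simplified stand-in for an Arrow field, including metadata (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: String,
    pub nullable: bool,
    pub metadata: HashMap<String, String>,
}

/// Relaxed comparison: only names and types must match, position by position.
/// Nullability and metadata are deliberately ignored in this sketch.
pub fn same_names_and_types(a: &[Field], b: &[Field]) -> bool {
    a.len() == b.len()
        && a.iter()
            .zip(b)
            .all(|(x, y)| x.name == y.name && x.data_type == y.data_type)
}
```

Two fields that differ only in metadata compare unequal under `==` but pass the relaxed check, which is exactly the case that would stop raising the internal error if the comparison were loosened.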

jayzhan211 · 2024-09-20T00:32:40Z

I think (correct me if I'm wrong) that field metadata doesn't actually need to be equivalent for the invariant that this error is trying to catch to be upheld

If the metadata is mismatched, it indicates we lost the metadata somewhere when passing through the schema info, so I think it makes sense to check the metadata too. Maybe we should figure out the reason why metadata is mismatched first

…nction, add `AggregateUDFImpl::is_null` (apache#11989)

* schema assertion and fix the mismatch from logical and physical
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* add more msg
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* cleanup
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* rm test1
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* nullable for scalar func
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* nullable
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* rm field
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* rm unsafe block and use internal error
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* rm func_name
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* rm nullable option
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* add test
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* add more msg
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* fix test
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* rm row number
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* Update datafusion/expr/src/udaf.rs
  Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* Update datafusion/expr/src/udaf.rs
  Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* fix failed test from apache#12050
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* cleanup
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* add doc
  Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

---------

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

findepi · 2024-10-23T10:32:04Z

thank you!

jayzhan211 added 2 commits August 14, 2024 23:25

schema assertion and fix the mismatch from logical and physical

aed01f0

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

add more msg

cbfefc6

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

github-actions bot added the sql, logical-expr, physical-expr, optimizer, core, and functions labels Aug 14, 2024

jayzhan211 commented Aug 14, 2024

cleanup

b3fc2c8

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

github-actions bot added the sqllogictest label Aug 14, 2024

rm test1

20d0a5f

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

github-actions bot removed the sqllogictest label Aug 14, 2024

jayzhan211 marked this pull request as ready for review August 15, 2024 00:55

jayzhan211 requested a review from alamb August 16, 2024 12:08

alamb reviewed Aug 16, 2024

jayzhan211 marked this pull request as draft August 17, 2024 01:00

jayzhan211 added 6 commits August 17, 2024 12:19

nullable for scalar func

1132686

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

nullable

611092e

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

Merge branch 'custom-nullable' into schema-fix

e732adc

rm field

ab38a5a

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

rm unsafe block and use internal error

1d299eb

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

rm func_name

19a1ac7

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

github-actions bot added the sqllogictest and proto labels Aug 17, 2024

jayzhan211 commented Aug 17, 2024

berkaysynnada reviewed Aug 20, 2024

alamb approved these changes Aug 20, 2024

jayzhan211 added 2 commits August 21, 2024 07:55

Merge branch 'main' of https://github.com/apache/datafusion into sche…

356faa8

…ma-fix

add doc

043c332

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

jayzhan211 mentioned this pull request Aug 21, 2024

A simple count() query caused Internal Error in PhysicalOptimizer (SQLancer) #12077

Closed

jayzhan211 merged commit 6786f15 into apache:main Aug 21, 2024
24 checks passed

jayzhan211 deleted the schema-fix branch August 21, 2024 00:47

This was referenced Sep 20, 2024

chore: upgrade to datafusion 43 delta-io/delta-rs#2886

Open

Potential regression in Schema / nullability calculations after upgrade to 42.0.0 #12560

Closed

wiedld mentioned this pull request Sep 28, 2024

Possible reproducer of schema metadata bug. #12658

Closed

alamb mentioned this pull request Sep 30, 2024

Field Metadata Lost on COUNT DISTINCT queries resulting in Internal Error: Physical input schema should be the same as the one converted from logical input schema #12687

Closed

This was referenced Oct 1, 2024

Provide field and schema metadata missing on distinct aggregations. #12691

Merged

Provide field and schema metadata missing on cross joins, and union with null fields. #12729

Merged

alamb mentioned this pull request Oct 3, 2024

[EPIC] Schema metadata handling / bugs #12733

Open

8 tasks

alamb mentioned this pull request Oct 22, 2024

Option to disable Physical input schema should be the same as the one converted from logical schema error #13065

Closed

findepi mentioned this pull request Oct 23, 2024

Fix tests on 41 branch and temporarily disable those unfixable sdf-labs/arrow-datafusion#71

Merged

alamb mentioned this pull request Oct 29, 2024

Add config option skip_physical_aggregate_schema_check #13176

Merged

eejbyfeldt mentioned this pull request Oct 30, 2024

nullable Expr being constant fold to value can cause schema change and internal error #13190

Open
