Add input_nullable for UDAF args StateField and Accumulator #11063

Closed
wants to merge 1 commit into from

eejbyfeldt
Contributor

@eejbyfeldt eejbyfeldt commented Jun 22, 2024

Which issue does this PR close?

Part of #8708

Rationale for this change

This is needed when moving array_agg (#11045) to a UDAF, where the nullability of one of
the state fields depends on the nullability of the input.

What changes are included in this PR?

This follows how it is done for input_type and only provides a single value.
It might need to be changed into a Vec in the future.
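
A minimal sketch of the idea, using simplified stand-in types rather than DataFusion's real args structs. The names `SimpleField`, `StateArgs`, and `array_agg_state_field` are hypothetical and only illustrate how a single `input_nullable` flag could drive the nullability of a state field:

```rust
/// Simplified stand-in for an Arrow field: a name plus a nullability flag.
/// (Hypothetical type, not the real arrow `Field`.)
#[derive(Debug, Clone, PartialEq)]
pub struct SimpleField {
    pub name: String,
    pub nullable: bool,
}

/// Hypothetical, pared-down version of the args struct handed to a UDAF
/// when it is asked for its state fields.
pub struct StateArgs<'a> {
    /// Name of the aggregate expression.
    pub name: &'a str,
    /// Whether the (single) input expression is nullable.
    pub input_nullable: bool,
}

/// State field for an array_agg-like aggregate: the collected items are
/// nullable exactly when the input is nullable.
pub fn array_agg_state_field(args: &StateArgs) -> SimpleField {
    SimpleField {
        name: format!("{}[items]", args.name),
        nullable: args.input_nullable,
    }
}
```

Like `input_type`, the flag is a single value here; an aggregate with several inputs would presumably need one flag per input, which is why a Vec may be needed later.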

Are these changes tested?

Existing tests.

Are there any user-facing changes?

New fields in the args struct in UDAF APIs.

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Jun 22, 2024
@@ -83,6 +83,9 @@ pub struct AccumulatorArgs<'a> {
/// The input type of the aggregate function.
pub input_type: &'a DataType,

/// If the input type is nullable.
pub input_nullable: bool,
Contributor

I think we only need nullable for state_field 🤔

Contributor Author

Yes, this is not used in the PR. I just thought it made sense to make the API more similar but will revert.

But I also noticed this is not enough to resolve the array_agg regression. There are two more limitations in the current UDAF API. Firstly, here
https://github.com/eejbyfeldt/datafusion/blob/18042fd69138e19613844580408a71a200ea6caa/datafusion/physical-expr-common/src/aggregate/mod.rs#L287-L289
the nullability of the returned field is hardcoded to true and is not controllable from AggregateUDFImpl. What is the desired way to fix this?

Should the API be changed to instead implement a method fn field?

Or should we add a return_nullable method with a default false implementation?
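
A rough sketch of the second option, using a simplified stand-in trait rather than the real AggregateUDFImpl. The names `AggregateImpl`, `ArrayAgg`, `Sum`, and `output_field` are hypothetical. It assumes array_agg never returns null while sum can:

```rust
/// Hypothetical, simplified stand-in for an aggregate UDF trait.
pub trait AggregateImpl {
    fn name(&self) -> &str;

    /// Whether the aggregate's result can be null.
    /// The `false` default is the one floated in the question above;
    /// it is an assumption of this sketch, not the current behavior.
    fn return_nullable(&self) -> bool {
        false
    }
}

pub struct ArrayAgg;
pub struct Sum;

impl AggregateImpl for ArrayAgg {
    fn name(&self) -> &str {
        "array_agg"
    }
    // Relies on the default: the result is assumed to never be null.
}

impl AggregateImpl for Sum {
    fn name(&self) -> &str {
        "sum"
    }

    // Overrides the default: sum is assumed to yield null for empty input.
    fn return_nullable(&self) -> bool {
        true
    }
}

/// Builds the (name, nullable) pair describing the output column,
/// instead of hardcoding the nullability.
pub fn output_field(agg: &dyn AggregateImpl) -> (String, bool) {
    (agg.name().to_string(), agg.return_nullable())
}
```

With a default method, existing implementations keep compiling, but any aggregate that can return null has to remember to override it.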

I also noticed that the current implementation for array_agg does not propagate the nullability of the input to the field in the returned array. This is probably because the return_type method does not have access to nullability, but it is probably something we want to be able to resolve in the long run.

Contributor

@jayzhan211 jayzhan211 Jun 22, 2024

> the nullability of the returned field is hardcoded to true

I think we can use nullable for field().

    fn field(&self) -> Result<Field> {
        Ok(Field::new(&self.name, self.data_type.clone(), self.nullable))
    }

You can get input_nullable in create_aggregate_expr

    let expr = input_phy_exprs[0].clone();
    let input_nullable = expr.nullable(schema)?;
    Ok(Arc::new(AggregateFunctionExpr {
        fun: fun.clone(),
        args: input_phy_exprs.to_vec(),
        logical_args: input_exprs.to_vec(),
        data_type: fun.return_type(&input_exprs_types)?,
        name: name.into(),
        schema: schema.clone(),
        sort_exprs: sort_exprs.to_vec(),
        ordering_req: ordering_req.to_vec(),
        ignore_nulls,
        ordering_fields,
        is_distinct,
        input_type: input_exprs_types[0].clone(),
        input_nullable,
    }))

> I also noticed that the current implementation for array_agg does not propagate the nullability of the input to the field in the returned array. This is probably because the return_type method does not have access to nullability, but it is probably something we want to be able to resolve in the long run.

I think nullable is set in both state_field and field, so the returned array should match their schema. 🤔

Contributor Author

@eejbyfeldt eejbyfeldt Jun 22, 2024

> I think we can use nullable for field().

Looking closer at the code, I see that this will maintain the behavior of the old code. But it seems wrong to me that we assume, in general, that the aggregate maintains the nullability of the input type. Consider the aggregate array_agg: there are two "nullable" fields in the return value, the "top level" value and the "field inside" the returned array. I think our array_agg (or at least a possible array_agg) will return an empty array when there are no values. This means that the nullability of the "top level" field should always be false regardless of input nullability, and the nullability that depends on the input is that of the "field inside" the array. Note that I think the existing code also does not implement this correctly.

I tried out the suggested fix and it breaks existing code, probably because it is wrong for some existing aggregates like sum, which might return null even if the input is not nullable. So that is a further indication that this is not the correct way to go.
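
To illustrate why output nullability cannot simply be copied from the input, here is a small sketch over plain, non-nullable i64 values. The functions `sum` and `array_agg` are simplified stand-ins written for this example, and the empty-array behavior of array_agg is the assumption stated above, not a confirmed property of the existing implementation:

```rust
/// SQL-style sum over non-nullable inputs: the result is still `None`
/// (null) when there are no rows to aggregate.
pub fn sum(values: &[i64]) -> Option<i64> {
    if values.is_empty() {
        None
    } else {
        Some(values.iter().sum())
    }
}

/// An array_agg that is assumed to yield an empty array, never null,
/// when there are no rows. The top-level result is therefore not nullable.
pub fn array_agg(values: &[i64]) -> Vec<i64> {
    values.to_vec()
}
```

Both functions take the same non-nullable input, yet only one of them can produce a null result, so the output nullability is a property of the aggregate rather than of its input.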

This comment was marked as outdated.

Contributor

The nullability was introduced in #8055.
There might be another way to fix #8055 🤔

Contributor

@jayzhan211 jayzhan211 Jun 23, 2024

I tested the code in #8032, and there is no error after I changed the "top level null" back to false 🤔

Contributor

@jayzhan211 jayzhan211 Jun 23, 2024

Ideally, the field for array_agg should look like this, where nullable is the nullability of the element, not the nullability of the List:

    fn field(&self) -> Result<Field> {
        Ok(Field::new_list(
            &self.name,
            // This should be the same as return type of AggregateFunction::ArrayAgg
            Field::new("item", self.input_data_type.clone(), self.nullable),
            false,
        ))
    }
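
The same shape can be sketched without any Arrow dependency. The `DataType` and `Field` types below are simplified, hypothetical stand-ins for the Arrow ones, and `array_agg_field` is a name invented for this example. It only shows where each of the two nullability flags lives:

```rust
/// Simplified stand-in for a data type: a scalar or a list of items.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    /// A list whose element is described by a nested field.
    List(Box<Field>),
}

/// Simplified stand-in for a field: name, type, and nullability.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
}

/// Output field of an array_agg-like aggregate.
/// The inner "item" field inherits the input's nullability, while the
/// outer list is never null (assuming empty input yields an empty list).
pub fn array_agg_field(name: &str, input_type: DataType, input_nullable: bool) -> Field {
    let item = Field {
        name: "item".to_string(),
        data_type: input_type,
        nullable: input_nullable,
    };
    Field {
        name: name.to_string(),
        data_type: DataType::List(Box::new(item)),
        nullable: false,
    }
}
```

Calling `array_agg_field("agg", DataType::Int64, true)` gives a non-nullable list whose "item" field is nullable.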

Contributor Author

Yes, that was what I was trying to explain. I created PR #11093, which fixes that; it required some other changes to make the fix possible.

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Aug 24, 2024
@github-actions github-actions bot closed this Aug 31, 2024