doc: why nullable of list item is set to true #11626

jcsherin · 2024-07-23T18:22:25Z

Which issue does this PR close?

Rationale for this change

While working on issues related to #8708, multiple PRs have dealt with the nullability of the list item in accumulator state. This doc patch makes the reasoning behind the existing code explicit.

What changes are included in this PR?

Only doc comments are added. There are no code changes.

Aggregate functions which use the data type of the first argument:

- ArrayAgg
- NthValueAgg
- Count

Aggregate functions which use the data type of the returned value:

- BitwiseOperation
- Sum

Are these changes tested?

Are there any user-facing changes?

jcsherin · 2024-07-23T18:31:29Z

Notes:

- I felt inline comments made more sense here than adding this to the docs for `state_fields`. This is primarily intended for the reader of the code rather than the user of the library.
- The comment text is repeated across the aggregate functions.

Please recommend changes to copy or critique this approach.

comphead

tbh I would be putting this to trait object instead of copying through methods.

alamb · 2024-07-23T21:56:53Z

> tbh I would be putting this to trait object instead of copying through methods.

Do you mean add the documentation to https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.AggregateUDFImpl.html#method.state_fields ?

comphead · 2024-07-23T22:08:29Z

> > tbh I would be putting this to trait object instead of copying through methods.
>
> Do you mean add the documentation to https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.AggregateUDFImpl.html#method.state_fields ?

yep, it might be a better place imho, especially if it relates to all function implementations

review again

jcsherin · 2024-07-24T10:40:35Z

> tbh I would be putting this to trait object instead of copying through methods.

This does not belong in trait object because this affects only a few aggregate functions, not all of them.

For some aggregate functions, the intermediate accumulator state often has:

- a list of items, and
- an item type that is the same as that of the first argument or the returned value

Field::new_list( 
    format_state_name(args.name, "distinct_array_agg"), 
    Field::new("item", args.input_type.clone(), true), // [1] should always be true
    true,  // [2] or false
)

At first glance it looked like the nullability of the list item should be configurable. There were also multiple PRs in this direction before we realized it was all unnecessary.
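To make the two flags marked [1] and [2] concrete, here is a minimal std-only sketch. `ListStateField`, `accepts`, and `state_field` are hypothetical stand-ins invented for illustration; they are not the arrow or DataFusion API. The rule it models (a non-nullable item cannot hold a null, a non-nullable list cannot be null itself) is an assumption about what the flags mean, not something this thread spells out.

```rust
/// Hypothetical stand-in for the schema of a list-typed state field.
/// This is NOT the arrow `Field` type; it keeps only the two flags
/// marked [1] and [2] in the snippet above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ListStateField {
    /// [1] whether individual list items may be null
    pub item_nullable: bool,
    /// [2] whether the list value itself may be null
    pub list_nullable: bool,
}

impl ListStateField {
    /// Returns true if `state` is representable under this schema.
    /// `None` models a null list; `Some(items)` models a list whose
    /// items may individually be null (`None`).
    pub fn accepts(&self, state: Option<&[Option<i64>]>) -> bool {
        match state {
            None => self.list_nullable,
            Some(items) => self.item_nullable || items.iter().all(|v| v.is_some()),
        }
    }
}

/// Builds a schema with the item flag fixed to `true` (marker [1]);
/// only the list flag (marker [2]) varies per aggregate.
pub fn state_field(list_nullable: bool) -> ListStateField {
    ListStateField { item_nullable: true, list_nullable }
}
```

Under this model, a field built with `state_field(false)` accepts lists with or without null items, whereas a field with `item_nullable: false` rejects any list that contains a null. That is one way to see why hardcoding `true` for the item needs no extra configuration.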

To get rid of the comment duplication, I'll instead move them to a markdown doc within functions-aggregate and link to it from the comments.

alamb

Looks like an improvement to me -- thank you @jcsherin and @jayzhan211 and @comphead

I also marked this PR ready for review as it looks good to me

alamb · 2024-07-24T23:52:06Z

datafusion/functions-aggregate/src/bit_and_or_xor.rs

@@ -203,6 +203,7 @@ impl AggregateUDFImpl for BitwiseOperation {
                    args.name,
                    format!("{} distinct", self.name()).as_str(),
                ),
+                // See COMMENTS.md to understand why nullable is set to true


alamb · 2024-07-24T23:53:51Z

datafusion/functions-aggregate/COMMENTS.md

+
+## Computing Intermediate State
+
+By setting `nullable` to be always `true` like this we ensure that the


Is another rationale that the intermediate results need to be able to represent "saw no rows" (e.g., that partition had no values)?

For the nth_value accumulator, when no rows are present in the partition, no values are added to the intermediate state.

I haven't checked the other aggregates though. So I don't know for certain if this is the case always. I'll verify and make a follow-on PR if any differences exist. I think we've only looked deeper into nth_value and array_agg (by @jayzhan211) at the moment.

Makes sense -- I vaguely remember that the null was needed in one of the aggregators to distinguish between:

- only empty lists had been seen (`[]`)
- no lists at all had been seen (`NULL`)
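The distinction between "only empty lists seen" and "no lists seen" can be sketched with `Option<Vec<_>>`. This is a hypothetical helper, not DataFusion code: `merge_states` and the `i64` item type are assumptions chosen for illustration, and the merge rule is only one plausible way an accumulator could combine partial states.

```rust
/// Merges two partial list states (hypothetical helper for illustration).
/// `None` models NULL: no list was seen at all.
/// `Some(vec![])` models []: only empty lists were seen.
pub fn merge_states(a: Option<Vec<i64>>, b: Option<Vec<i64>>) -> Option<Vec<i64>> {
    match (a, b) {
        // Neither partition saw a list, so the merged state stays NULL.
        (None, None) => None,
        // Only one side saw a list; keep it as is, even if it is empty.
        (Some(v), None) | (None, Some(v)) => Some(v),
        // Both sides saw lists; concatenate their items.
        (Some(mut x), Some(y)) => {
            x.extend(y);
            Some(x)
        }
    }
}
```

If the list were not nullable, the first two cases would collapse into the same empty value and the "saw no rows" information would be lost.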

@alamb Thanks for the pointer. I'll keep this in mind while making a pass through the aggregates next time.

I made a minor copy change to disambiguate that the "Computing Intermediate State" section is talking about the nullability of the list item rather than the nullability of the list container.

Sorry for the confusion. I was not clear earlier.

jcsherin · 2024-07-25T18:26:50Z

I pushed a minor copy edit and CI failed. Looking at the error logs it looks to me like it is not related to this change.

jcsherin · 2024-07-25T18:36:14Z

> I pushed a minor copy edit and CI failed. Looking at the error logs it looks to me like it is not related to this change.

The clippy errors in CI are being tracked here - #11651.

alamb · 2024-07-25T20:16:14Z

I merged up from main to get the fix for the clippy errors

jcsherin · 2024-07-25T20:18:04Z

> I merged up from main to get the fix for the clippy errors

@alamb Thank you.

In `array_agg` the list is nullable, so changed the example to `nth_value` where the list is not nullable to be correct.

alamb · 2024-07-25T21:49:31Z

Thanks @jcsherin

alamb · 2024-07-26T14:58:56Z

Thanks again - we can iterate on the docs in follow on PRs if there is more to do

jcsherin · 2024-07-26T15:05:05Z

Thanks for the review feedback - @alamb, @comphead and for prior discussions @jayzhan211.

doc: why nullable of list item is set to true

adf502d

jcsherin marked this pull request as ready for review July 23, 2024 18:32

comphead reviewed Jul 23, 2024

This comment was marked as outdated.

jcsherin marked this pull request as draft July 24, 2024 10:41

jcsherin added 6 commits July 24, 2024 21:05

Adds an external doc to avoid repeating text

e2c1bf1

rewrite

d46f01c

redirects to external doc

c339eb0

Adds ASF license

02dc91f

Minor: formatting fixes

f9c1b1e

Minor: copy edits

23b3b63

alamb approved these changes Jul 24, 2024

alamb added the documentation label Jul 24, 2024

alamb marked this pull request as ready for review July 24, 2024 23:54

github-actions bot removed the documentation label Jul 25, 2024

Retrigger CI

039ea6c

Merge remote-tracking branch 'apache/main' into doc-list-nullable-in-…

9ee664e

…state-fields

jcsherin added 2 commits July 26, 2024 02:21

Fixes: name of aggregation in example

f518117

In `array_agg` the list is nullable, so changed the example to `nth_value` where the list is not nullable to be correct.

Disambiguates list item nullability in copy

1601390

jcsherin requested review from comphead and jayzhan211 July 25, 2024 20:56

alamb approved these changes Jul 25, 2024

alamb merged commit d6e016e into apache:main Jul 26, 2024
24 checks passed

jcsherin deleted the doc-list-nullable-in-state-fields branch July 26, 2024 15:05
