Fix count(null) and count(distinct null) #8511

joroKr21 · 2023-12-12T09:42:06Z

Use logical_nulls when the array data type is Null.

Which issue does this PR close?

Closes #8509.

Rationale for this change

The semantics of NullArray in Arrow are confusing: apache/arrow-rs#4838
So we have to handle the Null data type in a special way.

What changes are included in this PR?

Patches in count and count distinct accumulators to handle the Null data type or use logical_nulls when appropriate.
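To make the distinction concrete, here is a minimal sketch of why a Null-typed column needs special handling. It uses a hypothetical `ToyArray` type, not the arrow-rs API, and assumes (as the rationale above suggests) that a Null-typed array carries no physical validity bitmap even though every slot is logically null.

```rust
/// Toy stand-in for an Arrow array. Hypothetical, NOT the arrow-rs API.
#[derive(Debug, Clone)]
enum ToyArray {
    /// A typed column with an optional physical validity bitmap (true = valid).
    Int64 { values: Vec<i64>, validity: Option<Vec<bool>> },
    /// A column of the Null type: only a length, no validity bitmap at all.
    Null { len: usize },
}

impl ToyArray {
    fn len(&self) -> usize {
        match self {
            ToyArray::Int64 { values, .. } => values.len(),
            ToyArray::Null { len } => *len,
        }
    }

    /// Nulls recorded in the physical bitmap; always 0 for the Null type.
    fn physical_null_count(&self) -> usize {
        match self {
            ToyArray::Int64 { validity, .. } => validity
                .as_ref()
                .map_or(0, |v| v.iter().filter(|valid| !**valid).count()),
            ToyArray::Null { .. } => 0,
        }
    }

    /// Nulls as SQL sees them: every slot of a Null-typed column is null.
    fn logical_null_count(&self) -> usize {
        match self {
            ToyArray::Null { len } => *len,
            other => other.physical_null_count(),
        }
    }
}

/// COUNT(expr) counts only the rows that are not logically null.
fn count_non_null(array: &ToyArray) -> usize {
    array.len() - array.logical_null_count()
}
```

In this model, counting with `physical_null_count` would report 4 for `ToyArray::Null { len: 4 }`, while `count_non_null` reports 0, which is the behavior the PR is after.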

Are these changes tested?

Yes, added SQL tests.

Are there any user-facing changes?

Yes, count and count distinct now behave consistently wrt null regardless of the data type.

Dandandan · 2023-12-12T09:45:25Z

datafusion/physical-expr/src/aggregate/count.rs

+        } else {
+            accumulate_indices(
+                group_indices,
+                values.nulls(),


I think we can use logical_nulls for all cases.

logical_nulls returns an owned value, so in most cases it will clone, that's why I added a branch

Hm that's a bit unfortunate 🤔
The clone is relatively cheap though, as the buffer holding the bitmap is wrapped in Arc.

Oh ok, then we can use it I guess. I assumed that's why they don't want to change it in Arrow.
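The point about the clone being cheap can be illustrated with a small sketch. `ToyNullBuffer` is a hypothetical stand-in whose bytes sit behind an `Arc`; the real arrow-rs layout may differ, so treat the field names as assumptions.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a null bitmap whose bytes live behind an `Arc`.
#[derive(Clone)]
struct ToyNullBuffer {
    bits: Arc<Vec<u8>>, // shared allocation; `clone` only bumps the refcount
    len: usize,         // number of slots described by the bitmap
    null_count: usize,  // cached count of null slots
}

/// Returns true when both handles point at the same heap allocation.
fn shares_allocation(a: &ToyNullBuffer, b: &ToyNullBuffer) -> bool {
    Arc::ptr_eq(&a.bits, &b.bits)
}
```

Cloning a `ToyNullBuffer` copies two `usize` fields and increments one reference count; the bitmap bytes themselves are never duplicated, so `shares_allocation(&original, &original.clone())` holds.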

Dandandan · 2023-12-12T09:48:38Z

datafusion/physical-expr/src/aggregate/count.rs

+        if values.data_type() == &DataType::Null {
+            values.len()
+        } else {
+            values.null_count()


Could be all based on logical_nulls as well

alamb

Thank you very much for this contribution @joroKr21 -- the basic idea is great. I am just worried about the use of the logical_nulls method as it copies things under the covers that may cause performance slowdowns.

I offered an alternate suggestion -- let me know what you think

alamb · 2023-12-12T22:30:58Z

datafusion/physical-expr/src/aggregate/count.rs

@@ -198,16 +198,18 @@ fn null_count_for_multiple_cols(values: &[ArrayRef]) -> usize {
    if values.len() > 1 {
        let result_bool_buf: Option<BooleanBuffer> = values
            .iter()
-            .map(|a| a.nulls())
+            .map(|a| a.logical_nulls())


I am slightly worried about the need to allocate a new null buffer each time, even for arrays where we could just use the existing one

This is particularly concerning given this is on the critical path in aggregates

I reviewed the logical_nulls method --
https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.logical_nulls and I see the issue is that it returns an owned Option

What would you think about implementing a method in DataFusion that avoids the copy if it is not necessary, like

fn logical_nulls(arr: &ArrayRef) -> Cow<'_, Option<BooleanBuffer>> { }

That only creates the null buffer for NullArrays?

Then we can propose upstreaming that back to arrow-rs to avoid the potential performance issue

I know the Cow thing is not always the easiest to make happen -- if you need help I can try and find time to help code it up.
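For readers unfamiliar with the pattern, a rough sketch of the `Cow` idea follows. It is written against hypothetical toy types (`ToyArray`, `ToyBitmap`) rather than `ArrayRef` and `BooleanBuffer`, so it only shows the borrow-or-allocate shape of the suggestion, not how it would be wired into DataFusion or arrow-rs.

```rust
use std::borrow::Cow;

/// Toy validity bitmap (true = valid); a stand-in, not arrow's buffer type.
type ToyBitmap = Vec<bool>;

/// Hypothetical array model: typed columns may own a bitmap, Null columns never do.
enum ToyArray {
    Typed { validity: Option<ToyBitmap> },
    Null { len: usize },
}

/// Borrow the existing bitmap when there is one; allocate only for Null columns.
fn logical_nulls(arr: &ToyArray) -> Cow<'_, Option<ToyBitmap>> {
    match arr {
        // No copy: hand back a reference to whatever the array already holds.
        ToyArray::Typed { validity, .. } => Cow::Borrowed(validity),
        // Null columns have no bitmap, so build an all-null one on demand.
        ToyArray::Null { len } => Cow::Owned(Some(vec![false; *len])),
    }
}
```

Callers can dereference the result the same way in both cases; only the `Null` branch pays for an allocation.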

I'm not sure how I would implement this outside of the Array trait while ensuring that all cases are covered. Originally I had some branching logic based on the datatype but removed it after the discussion here: #8511 (comment)

Yeah @alamb in the end NullBuffer has Arc<Bytes> so it mostly clones this + a few usizes etc. While not ideal I don't think it will be very expensive?
https://arrow.apache.org/rust/arrow_buffer/buffer/immutable/struct.Buffer.html

But I like the suggestion of returning a reference or Cow in arrow-rs.

I am not opposed to this PR, but I would prefer to have the Cow thing. Let me see if I can whip it up quickly

Here is my proposal for improvement: coralogix#221

alamb

Thank you @joroKr21 and @Dandandan

It would be great to do some benchmark runs to show this doesn't impact performance, but I think I am over worrying this extra copy of the NullBuffer -- it is 48 bytes and one increment

alamb · 2023-12-14T16:25:13Z

FWIW I want to be clear I think we can revert cecc493 unless it shows a performance improvement. I am sorry for all the noise

joroKr21 · 2023-12-14T16:27:56Z

Oh sorry, I saw your comment too late. I will force push in that case.

joroKr21 · 2023-12-14T16:32:07Z

I don't know how long it takes to run the benchmarks. I could probably do a run during the weekend.

alamb · 2023-12-14T17:14:50Z

> I don't know how long it takes to run the benchmarks. I could probably do a run during the weekend.

I'll run some now as I have it all setup. I'll post them here when ready

alamb · 2023-12-14T20:19:04Z

I ran benchmarks and my conclusion is that this branch doesn't change the performance and any changes are within the level of noise

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ bugfix_count-null ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     3.64ms │            3.74ms │     no change │
│ QQuery 1     │   135.02ms │          138.12ms │     no change │
│ QQuery 2     │   299.58ms │          312.29ms │     no change │
│ QQuery 3     │   280.64ms │          293.30ms │     no change │
│ QQuery 4     │  2783.49ms │         2773.07ms │     no change │
│ QQuery 5     │  4050.48ms │         3897.47ms │     no change │
│ QQuery 6     │   125.78ms │          123.34ms │     no change │
│ QQuery 7     │   140.97ms │          141.36ms │     no change │
│ QQuery 8     │  4433.68ms │         4286.59ms │     no change │
│ QQuery 9     │  7768.12ms │         7501.32ms │     no change │
│ QQuery 10    │  1120.87ms │         1073.25ms │     no change │
│ QQuery 11    │  1206.52ms │         1178.95ms │     no change │
│ QQuery 12    │  3528.50ms │         3434.86ms │     no change │
│ QQuery 13    │  6754.81ms │         6778.36ms │     no change │
│ QQuery 14    │  3859.19ms │         3913.48ms │     no change │
│ QQuery 15    │  3078.75ms │         3047.18ms │     no change │
│ QQuery 16    │  8303.34ms │         8266.43ms │     no change │
│ QQuery 17    │  7910.97ms │         7943.00ms │     no change │
│ QQuery 18    │ 16263.82ms │        15770.30ms │     no change │
│ QQuery 19    │   235.28ms │          228.10ms │     no change │
│ QQuery 20    │  3989.39ms │         3658.45ms │ +1.09x faster │
│ QQuery 21    │  4972.73ms │         4778.85ms │     no change │
│ QQuery 22    │ 13516.67ms │        13320.12ms │     no change │
│ QQuery 23    │ 35451.65ms │        33649.33ms │ +1.05x faster │
│ QQuery 24    │  1885.83ms │         1919.50ms │     no change │
│ QQuery 25    │  1630.22ms │         1604.27ms │     no change │
│ QQuery 26    │  2034.69ms │         2042.98ms │     no change │
│ QQuery 27    │  5236.29ms │         5144.58ms │     no change │
│ QQuery 28    │ 42455.40ms │        41939.68ms │     no change │
│ QQuery 29    │  1493.39ms │         1500.64ms │     no change │
│ QQuery 30    │  3408.16ms │         3666.76ms │  1.08x slower │
│ QQuery 31    │  4357.88ms │         4430.19ms │     no change │
│ QQuery 32    │ 23382.55ms │        22710.19ms │     no change │
│ QQuery 33    │ 18094.43ms │        17377.25ms │     no change │
│ QQuery 34    │ 19085.58ms │        18796.15ms │     no change │
│ QQuery 35    │  5181.19ms │         5264.16ms │     no change │
│ QQuery 36    │   697.38ms │          730.89ms │     no change │
│ QQuery 37    │   368.22ms │          392.02ms │  1.06x slower │
│ QQuery 38    │   324.24ms │          325.51ms │     no change │
│ QQuery 39    │  1643.98ms │         1595.20ms │     no change │
│ QQuery 40    │   186.05ms │          176.14ms │ +1.06x faster │
│ QQuery 41    │   165.92ms │          146.29ms │ +1.13x faster │
│ QQuery 42    │   174.23ms │          171.96ms │     no change │
└──────────────┴────────────┴───────────────────┴───────────────┘

--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ bugfix_count-null ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  345.57ms │          337.26ms │     no change │
│ QQuery 2     │   85.16ms │           80.29ms │ +1.06x faster │
│ QQuery 3     │  193.96ms │          190.57ms │     no change │
│ QQuery 4     │  197.21ms │          201.03ms │     no change │
│ QQuery 5     │  333.08ms │          339.23ms │     no change │
│ QQuery 6     │   37.63ms │           35.96ms │     no change │
│ QQuery 7     │  890.08ms │          881.96ms │     no change │
│ QQuery 8     │  172.86ms │          167.49ms │     no change │
│ QQuery 9     │  192.89ms │          200.49ms │     no change │
│ QQuery 10    │  393.93ms │          403.95ms │     no change │
│ QQuery 11    │   66.91ms │           65.34ms │     no change │
│ QQuery 12    │  231.26ms │          239.59ms │     no change │
│ QQuery 13    │  173.97ms │          172.41ms │     no change │
│ QQuery 14    │   71.21ms │           71.20ms │     no change │
│ QQuery 15    │  224.79ms │          225.91ms │     no change │
│ QQuery 16    │   79.26ms │           76.84ms │     no change │
│ QQuery 17    │  228.44ms │          238.25ms │     no change │
│ QQuery 18    │  762.62ms │          806.16ms │  1.06x slower │
│ QQuery 19    │  114.11ms │          122.09ms │  1.07x slower │
│ QQuery 20    │  267.40ms │          271.21ms │     no change │
│ QQuery 21    │ 1017.97ms │         1032.86ms │     no change │
│ QQuery 22    │   49.09ms │           56.83ms │  1.16x slower │
└──────────────┴───────────┴───────────────────┴───────────────┘
alamb@aal-dev:~/datafusion-benchmarking$
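As a reading aid for the Change column: the rows above are consistent with a plain ratio of the two timings and a noise band of roughly 5%. The helper below is a hypothetical sketch built on that inferred assumption; it is not the benchmark comparison script's actual code.

```rust
/// Hypothetical helper: label a baseline/branch timing pair the way the
/// Change column reads, ASSUMING a 5% noise band inferred from the tables.
fn classify_change(baseline_ms: f64, branch_ms: f64) -> String {
    const NOISE_THRESHOLD: f64 = 0.05;
    // Always express the ratio as slower-time / faster-time, so it is >= 1.
    let (faster, ratio) = if branch_ms <= baseline_ms {
        (true, baseline_ms / branch_ms)
    } else {
        (false, branch_ms / baseline_ms)
    };
    if ratio - 1.0 < NOISE_THRESHOLD {
        "no change".to_string()
    } else if faster {
        format!("+{:.2}x faster", ratio)
    } else {
        format!("{:.2}x slower", ratio)
    }
}
```

For example, `classify_change(3989.39, 3658.45)` reproduces the `+1.09x faster` label for clickbench query 20, and `classify_change(4050.48, 3897.47)` falls inside the band and yields `no change`, matching query 5.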

alamb · 2023-12-14T20:19:27Z

Thanks @joroKr21

github-actions bot added physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Dec 12, 2023

Dandandan reviewed Dec 12, 2023

joroKr21 force-pushed the bugfix/count-null branch from 170f528 to 9d3a3e9 Compare December 12, 2023 09:45

Dandandan reviewed Dec 12, 2023

joroKr21 force-pushed the bugfix/count-null branch 2 times, most recently from e7fadf6 to 23097f5 Compare December 12, 2023 10:12

alamb mentioned this pull request Dec 12, 2023

DataFusion weekly project plan (Andrew Lamb) - Dec 11, 2023 #8490

Closed

8 tasks

alamb reviewed Dec 12, 2023

This was referenced Dec 14, 2023

Avoid copy in count / logical nulls coralogix/arrow-datafusion#221

Merged

Avoid forced copy in Array::logical_nulls apache/arrow-rs#5208

Open

alamb approved these changes Dec 14, 2023

Fix count(null) and count(distinct null)

9fbe554

Use `logical_nulls` when the array data type is `Null`.

joroKr21 force-pushed the bugfix/count-null branch from cecc493 to 9fbe554 Compare December 14, 2023 16:30

alamb merged commit 06d3bcc into apache:main Dec 14, 2023
22 checks passed

joroKr21 deleted the bugfix/count-null branch December 14, 2023 21:35

joroKr21 restored the bugfix/count-null branch December 14, 2023 21:35

matthewgapp mentioned this pull request Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed

joroKr21 deleted the bugfix/count-null branch February 1, 2024 06:08

findepi mentioned this pull request Oct 21, 2024

Fix count on all null VALUES clause #13029

Merged
