Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix count(null) and count(distinct null) #8511

Merged
merged 1 commit into from
Dec 14, 2023

Conversation

joroKr21
Copy link
Contributor

Use logical_nulls when the array data type is Null.

Which issue does this PR close?

Closes #8509.

Rationale for this change

The semantics of NullArray in Arrow are confusing: apache/arrow-rs#4838
So we have to handle the Null data type in a special way.

What changes are included in this PR?

Patches in count and count distinct accumulators to handle the Null data type or use logical_nulls when appropriate.

Are these changes tested?

Yes, added SQL tests.

Are there any user-facing changes?

Yes, count and count distinct now behave consistently wrt null regardless of the data type.

@github-actions github-actions bot added physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Dec 12, 2023
} else {
accumulate_indices(
group_indices,
values.nulls(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use logical_nulls for all cases.

Copy link
Contributor Author

@joroKr21 joroKr21 Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logical_nulls returns an owned value, so in most cases it will clone, that's why I added a branch

Copy link
Contributor

@Dandandan Dandandan Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm that's a bit unfortunate 🤔
The clone is relatively cheap though, as the buffer holding the bitmap is wrapped in Arc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh ok, then we can use it I guess. I assumed that's why they don't want to change it in Arrow.

if values.data_type() == &DataType::Null {
values.len()
} else {
values.null_count()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be all based on logical_nulls as well

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this contribution @joroKr21 -- the basic idea is great. I am just worried about the use of the logical_nulls method as it copies things under the covers that may cause performance slowdowns.

I offered an alternate suggestion -- let me know what you think

@@ -198,16 +198,18 @@ fn null_count_for_multiple_cols(values: &[ArrayRef]) -> usize {
if values.len() > 1 {
let result_bool_buf: Option<BooleanBuffer> = values
.iter()
.map(|a| a.nulls())
.map(|a| a.logical_nulls())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am slightly worried about the need to allocate a new null buffer each time, even for arrays when we could just use the exising one

This is particularly concerning given this is on the critical path in aggregates

I reviewed the logical_nulls method --
https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.logical_nulls and I see the issue is that it returns an owned Option

What would you think about implemeting a method in DataFusion that avoids the copy if it is not necessary, like

fn logical_nulls(arr: &ArrayRef) -> Cow<'_, Option<BooleanBuffer>> {
  
}

That only creates the nulll buffer for NullArrays?

Then we can propose upstreaming that back to arrow-rs to avoid the potential performance issue

I know the Cow thing is not always the easiest to make happen -- if you need help I can try and find time to help code it up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how I would implement this outside of the Array trait while ensuring that all cases are covered. Originally I had some branching logic based on the datatype but removed it after the discussion here: #8511 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah @alamb in the end NullBuffer has Arc<Bytes> so it mostly clones this + a few usizes etc. While not ideal I don't think it will be very expensive?
https://arrow.apache.org/rust/arrow_buffer/buffer/immutable/struct.Buffer.html

But I like the suggestion of returning a reference or Cow in arrow-rs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not opposed to this PR, but I would prefer to have the Cow thing. Let me see if I can whip it up quickly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is my proposal for improvement: coralogix#221

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @joroKr21 and @Dandandan

It would be great to do some benchmark runs to show this doesn't impact performance, but I think I am over worrying this extra copy of the NullBuffer -- it is 48 bytes and one increment

@alamb
Copy link
Contributor

alamb commented Dec 14, 2023

FWIW I want to be clear I think we can revert cecc493 unless it shows a performance improvement. I am sorry for all the noise

@joroKr21
Copy link
Contributor Author

Oh sorry, I saw your comment too late. I will force push in that case.

Use `logical_nulls` when the array data type is `Null`.
@joroKr21
Copy link
Contributor Author

I don't know how long it takes to run the benchmarks. I could probably do a run during the weekend.

@alamb
Copy link
Contributor

alamb commented Dec 14, 2023

I don't know how long it takes to run the benchmarks. I could probably do a run during the weekend.

I'll run some now as I have it all setup. I'll post them here when ready

@alamb
Copy link
Contributor

alamb commented Dec 14, 2023

I ran benchmarks and my conclusion is that this branch doesn't change the performance and any changes are within the level of noise

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ bugfix_count-null ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     3.64ms │            3.74ms │     no change │
│ QQuery 1     │   135.02ms │          138.12ms │     no change │
│ QQuery 2     │   299.58ms │          312.29ms │     no change │
│ QQuery 3     │   280.64ms │          293.30ms │     no change │
│ QQuery 4     │  2783.49ms │         2773.07ms │     no change │
│ QQuery 5     │  4050.48ms │         3897.47ms │     no change │
│ QQuery 6     │   125.78ms │          123.34ms │     no change │
│ QQuery 7     │   140.97ms │          141.36ms │     no change │
│ QQuery 8     │  4433.68ms │         4286.59ms │     no change │
│ QQuery 9     │  7768.12ms │         7501.32ms │     no change │
│ QQuery 10    │  1120.87ms │         1073.25ms │     no change │
│ QQuery 11    │  1206.52ms │         1178.95ms │     no change │
│ QQuery 12    │  3528.50ms │         3434.86ms │     no change │
│ QQuery 13    │  6754.81ms │         6778.36ms │     no change │
│ QQuery 14    │  3859.19ms │         3913.48ms │     no change │
│ QQuery 15    │  3078.75ms │         3047.18ms │     no change │
│ QQuery 16    │  8303.34ms │         8266.43ms │     no change │
│ QQuery 17    │  7910.97ms │         7943.00ms │     no change │
│ QQuery 18    │ 16263.82ms │        15770.30ms │     no change │
│ QQuery 19    │   235.28ms │          228.10ms │     no change │
│ QQuery 20    │  3989.39ms │         3658.45ms │ +1.09x faster │
│ QQuery 21    │  4972.73ms │         4778.85ms │     no change │
│ QQuery 22    │ 13516.67ms │        13320.12ms │     no change │
│ QQuery 23    │ 35451.65ms │        33649.33ms │ +1.05x faster │
│ QQuery 24    │  1885.83ms │         1919.50ms │     no change │
│ QQuery 25    │  1630.22ms │         1604.27ms │     no change │
│ QQuery 26    │  2034.69ms │         2042.98ms │     no change │
│ QQuery 27    │  5236.29ms │         5144.58ms │     no change │
│ QQuery 28    │ 42455.40ms │        41939.68ms │     no change │
│ QQuery 29    │  1493.39ms │         1500.64ms │     no change │
│ QQuery 30    │  3408.16ms │         3666.76ms │  1.08x slower │
│ QQuery 31    │  4357.88ms │         4430.19ms │     no change │
│ QQuery 32    │ 23382.55ms │        22710.19ms │     no change │
│ QQuery 33    │ 18094.43ms │        17377.25ms │     no change │
│ QQuery 34    │ 19085.58ms │        18796.15ms │     no change │
│ QQuery 35    │  5181.19ms │         5264.16ms │     no change │
│ QQuery 36    │   697.38ms │          730.89ms │     no change │
│ QQuery 37    │   368.22ms │          392.02ms │  1.06x slower │
│ QQuery 38    │   324.24ms │          325.51ms │     no change │
│ QQuery 39    │  1643.98ms │         1595.20ms │     no change │
│ QQuery 40    │   186.05ms │          176.14ms │ +1.06x faster │
│ QQuery 41    │   165.92ms │          146.29ms │ +1.13x faster │
│ QQuery 42    │   174.23ms │          171.96ms │     no change │
└──────────────┴────────────┴───────────────────┴───────────────┘
--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ bugfix_count-null ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  345.57ms │          337.26ms │     no change │
│ QQuery 2     │   85.16ms │           80.29ms │ +1.06x faster │
│ QQuery 3     │  193.96ms │          190.57ms │     no change │
│ QQuery 4     │  197.21ms │          201.03ms │     no change │
│ QQuery 5     │  333.08ms │          339.23ms │     no change │
│ QQuery 6     │   37.63ms │           35.96ms │     no change │
│ QQuery 7     │  890.08ms │          881.96ms │     no change │
│ QQuery 8     │  172.86ms │          167.49ms │     no change │
│ QQuery 9     │  192.89ms │          200.49ms │     no change │
│ QQuery 10    │  393.93ms │          403.95ms │     no change │
│ QQuery 11    │   66.91ms │           65.34ms │     no change │
│ QQuery 12    │  231.26ms │          239.59ms │     no change │
│ QQuery 13    │  173.97ms │          172.41ms │     no change │
│ QQuery 14    │   71.21ms │           71.20ms │     no change │
│ QQuery 15    │  224.79ms │          225.91ms │     no change │
│ QQuery 16    │   79.26ms │           76.84ms │     no change │
│ QQuery 17    │  228.44ms │          238.25ms │     no change │
│ QQuery 18    │  762.62ms │          806.16ms │  1.06x slower │
│ QQuery 19    │  114.11ms │          122.09ms │  1.07x slower │
│ QQuery 20    │  267.40ms │          271.21ms │     no change │
│ QQuery 21    │ 1017.97ms │         1032.86ms │     no change │
│ QQuery 22    │   49.09ms │           56.83ms │  1.16x slower │
└──────────────┴───────────┴───────────────────┴───────────────┘
alamb@aal-dev:~/datafusion-benchmarking$

@alamb alamb merged commit 06d3bcc into apache:main Dec 14, 2023
22 checks passed
@alamb
Copy link
Contributor

alamb commented Dec 14, 2023

Thanks @joroKr21

@joroKr21 joroKr21 deleted the bugfix/count-null branch December 14, 2023 21:35
@joroKr21 joroKr21 restored the bugfix/count-null branch December 14, 2023 21:35
@joroKr21 joroKr21 deleted the bugfix/count-null branch February 1, 2024 06:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

COUNT(NULL) should always return 0
3 participants