
Produce correct answers for Group BY NULL (Option 1) #793

Merged · 2 commits · Aug 2, 2021

Conversation

alamb (Contributor) commented Jul 29, 2021

Which issue does this PR close?

This PR closes #781 and #782.

Built on #786 so review that first.

Rationale for this change

  1. Grouping on columns that contain NULLs today produces incorrect results
  2. This is what I think is the minimum change required to produce correct results

This is a version of the "Alternative" approach described in #790 which I think is the minimum change to GroupByHash to produce the correct answers when grouping on columns that contain nulls. Thanks to @jhorstmann and @Dandandan for the ideas leading to this PR

It will likely reduce the speed of grouping as well as require more memory than the current implementation (though it does get correct answers!)

I created this PR to have it available for comparison and as a fallback in case I run into trouble or run out of time trying to implement #790, which I expect will take longer to code and review.

What changes are included in this PR?

  1. Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy #786
  2. Include a "null byte" for each column when creating the key to the hash table
  3. Fix some bugs related to NULL handling in ScalarValue
  4. Tests

On master, keys are created like this:

                            string len                   0x1234
{                          (as usize le)      "foo"    (as u16 le)
  k1: "foo"         ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
  k2: 0x1234u16     │03│00│00│00│00│00│00│00│"f│"o│"o│34│12│
}                   └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
                    0  1  2  3  4  5  6  7  8  9  10 11 12

After this PR, the keys are created as follows (note the two extra bytes, one for each grouping column):

Example of a key without any nulls:

                       0xFF byte at the start of each column
                          signifies the value is non-null
                                         │

                     ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ┐

                     │        string len                 │  0x1234
{                    ▼       (as usize le)      "foo"    ▼(as u16 le)
  k1: "foo"        ╔ ═┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──╦ ═┌──┬──┐
  k2: 0x1234u16     FF║03│00│00│00│00│00│00│00│"f│"o│"o│FF║34│12│
}                  ╚ ═└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──╩ ═└──┴──┘
                    0  1  2  3  4  5  6  7  8  9  10 11 12 13 14

Example of a key with NULL values:

                        0xFE byte at the start of k1 column
                    ┌ ─     signifies the value is NULL

                    └ ┐
                             0x1234
{                     ▼    (as u16 le)
  k1: NULL          ╔ ═╔ ═┌──┬──┐
  k2: 0x1234u16      FE║FF║34│12│
}                   ╚ ═╚ ═└──┴──┘
                      0  1  2  3
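
To make the diagrams concrete, here is a minimal self-contained sketch of the encoding scheme (hypothetical enum and helper names, toy types; the PR's actual create_key operates on Arrow ArrayRef columns):

```rust
// Hypothetical, standalone illustration of the key layout above --
// the real implementation walks Arrow arrays, not these toy enums.
enum GroupValue {
    Utf8(Option<String>),
    UInt16(Option<u16>),
}

// Append one column: a 0xFF marker plus the value bytes when non-null,
// a lone 0xFE marker (and no value bytes at all) when NULL.
fn encode_col(value: &GroupValue, key: &mut Vec<u8>) {
    match value {
        GroupValue::Utf8(Some(s)) => {
            key.push(0xFF); // non-null marker
            key.extend_from_slice(&s.len().to_le_bytes()); // string len as usize LE (8 bytes on 64-bit)
            key.extend_from_slice(s.as_bytes());
        }
        GroupValue::UInt16(Some(v)) => {
            key.push(0xFF); // non-null marker
            key.extend_from_slice(&v.to_le_bytes()); // u16 LE
        }
        GroupValue::Utf8(None) | GroupValue::UInt16(None) => key.push(0xFE), // NULL marker
    }
}

fn encode_key(row: &[GroupValue]) -> Vec<u8> {
    let mut key = Vec::new();
    for col in row {
        encode_col(col, &mut key);
    }
    key
}

fn main() {
    // Non-null example from the diagram: { k1: "foo", k2: 0x1234u16 } -> 15 bytes
    let key = encode_key(&[
        GroupValue::Utf8(Some("foo".to_string())),
        GroupValue::UInt16(Some(0x1234)),
    ]);
    assert_eq!(key.len(), 15);
    assert_eq!(key[12], 0xFF); // k2's non-null marker
    assert_eq!(key[13..], [0x34, 0x12]); // 0x1234 little-endian

    // NULL example from the diagram: { k1: NULL, k2: 0x1234u16 } -> only 4 bytes
    let null_key = encode_key(&[GroupValue::Utf8(None), GroupValue::UInt16(Some(0x1234))]);
    assert_eq!(null_key, [0xFE, 0xFF, 0x34, 0x12]);
}
```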

Are there any user-facing changes?

Correct answers!

Benchmark results

The benchmarks show a slight slowdown, which is not unexpected given that there is now more work being done:

group                                                gby_null_alternative                   master
-----                                                --------------------                   ------
aggregate_query_group_by                             1.16      3.6±0.14ms        ? ?/sec    1.00      3.1±0.10ms        ? ?/sec
aggregate_query_group_by_u64 15 12                   1.06      3.7±0.09ms        ? ?/sec    1.00      3.5±0.04ms        ? ?/sec
aggregate_query_group_by_with_filter                 1.06      2.5±0.05ms        ? ?/sec    1.00      2.3±0.04ms        ? ?/sec
aggregate_query_group_by_with_filter_u64 15 12       1.02      2.4±0.10ms        ? ?/sec    1.00      2.4±0.04ms        ? ?/sec
aggregate_query_no_group_by 15 12                    1.00  1152.2±30.38µs        ? ?/sec    1.00  1155.2±28.60µs        ? ?/sec
aggregate_query_no_group_by_count_distinct_narrow    1.14      6.1±0.14ms        ? ?/sec    1.00      5.4±0.05ms        ? ?/sec
aggregate_query_no_group_by_count_distinct_wide      1.18      8.8±0.30ms        ? ?/sec    1.00      7.4±0.10ms        ? ?/sec
aggregate_query_no_group_by_min_max_f64              1.06  1225.6±27.57µs        ? ?/sec    1.00  1160.5±29.22µs        ? ?/sec

The github-actions bot added the datafusion (Changes in the datafusion crate) label on Jul 29, 2021.
@alamb changed the title from "Produce correct ansers for Group BY NULL (Option 1)" to "Produce correct answers for Group BY NULL (Option 1)" on Jul 29, 2021.
"+-----------------+----+",
"| COUNT(UInt8(1)) | c1 |",
"+-----------------+----+",
"| 1 | |",

A contributor commented:
👍

alamb commented Jul 29, 2021

Clippy error is unrelated -- see fix in #794

pub(crate) fn create_key(
    group_by_keys: &[ArrayRef],
    row: usize,
    vec: &mut Vec<u8>,
) -> Result<()> {
    vec.clear();
    for col in group_by_keys {
        create_key_for_col(col, row, vec)?
    }
    Ok(())
}

// ... inside create_key_for_col, the per-row validity check:
if !col.is_valid(row) {

A contributor commented:
Not sure if it makes sense to improve performance here, but an optimization might be to check null_count() == 0 outside of this function, to avoid the is_valid call and just always add a 0xFF.
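
For illustration, a structural sketch of that suggestion (hypothetical function names, value-encoding bodies elided; the real create_key_for_col dispatches on the column's data type):

```rust
use arrow::array::{Array, ArrayRef};
use datafusion::error::Result;

// Sketch: decide once per column whether per-row null checks are needed,
// instead of consulting the validity bitmap for every single row.
fn create_key_for_col_no_nulls(_col: &ArrayRef, _row: usize, vec: &mut Vec<u8>) -> Result<()> {
    vec.push(0xFF); // column is known to have no nulls: marker is always 0xFF
    // ... append the value bytes for the row (elided) ...
    Ok(())
}

fn create_key_for_col_nullable(col: &ArrayRef, row: usize, vec: &mut Vec<u8>) -> Result<()> {
    if !col.is_valid(row) {
        vec.push(0xFE); // NULL marker, no value bytes
    } else {
        vec.push(0xFF);
        // ... append the value bytes for the row (elided) ...
    }
    Ok(())
}

pub(crate) fn create_key(group_by_keys: &[ArrayRef], row: usize, vec: &mut Vec<u8>) -> Result<()> {
    vec.clear();
    for col in group_by_keys {
        // The null_count() == 0 check happens once per column, outside
        // the per-row encoding path.
        if col.null_count() == 0 {
            create_key_for_col_no_nulls(col, row, vec)?;
        } else {
            create_key_for_col_nullable(col, row, vec)?;
        }
    }
    Ok(())
}
```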

alamb (Contributor, Author) replied:
Thank you for the suggestion.

If you don't mind, I would like to spend time on #790 which, if successful, I expect will remove all of this code.

I will attempt to add that optimization at a later date.

fn scalar_try_from_dict_datatype() {
    let data_type =
        DataType::Dictionary(Box::new(DataType::Int8), Box::new(DataType::Utf8));
    let data_type = &data_type;
    // ...
A contributor commented:

🥳

alamb (Contributor, Author) replied:
Amusingly, supporting this behavior ended up causing a test to fail when I brought the code into IOx, and I think I traced the problem to an issue in parquet file statistics: apache/arrow-rs#641 🤣. This was not a side effect I had anticipated.

"+-----------------+----+-----+",
"| 1 | | |",
"| 2 | | bar |",
"| 3 | 0 | |",

A contributor commented:
👍

// any newly added enum variant will require editing this list
// or else face a compile error
match (self, other) {
    (Boolean(v1), Boolean(v2)) => v1.eq(v2),

A contributor commented:
You could also use == instead?

@alamb force-pushed the alamb/gby_null_alternative branch from ddf2298 to b0d834a on July 29, 2021 at 19:23.
jhorstmann (Contributor) commented:

Looks good. I was trying to come up with an example where two distinct keys would end up as the same encoding but could not find any because any variable length types include the length prefix. 🚀

alamb commented Jul 30, 2021

Thanks for the reviews @Dandandan and @jhorstmann ! I plan to wait another day or two to see if anyone else has feedback on this approach, but what I am thinking of doing is merging this PR (after addressing comments) so at least we get correct answers and then work on the more sophisticated implementation in parallel.

> Looks good. I was trying to come up with an example where two distinct keys would end up as the same encoding but could not find any because any variable length types include the length prefix. 🚀

Thanks for double-checking -- this worried me a lot too, so I am glad to hear someone else verified it.

I convinced myself that since each key has entries for the same columns in the same order, there is no way to concoct the same bytes from different column values.
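
For the variable-length worry specifically, a tiny hypothetical check (standalone Rust, same encoding idea as the sketch above): without the length prefix, the rows ("ab", "c") and ("a", "bc") would serialize to identical bytes; with it, they cannot collide.

```rust
// Hypothetical standalone check: the length prefix keeps variable-length
// columns from running into each other in the concatenated key.
fn encode_str_col(s: &str, key: &mut Vec<u8>) {
    key.push(0xFF); // non-null marker
    key.extend_from_slice(&s.len().to_le_bytes()); // usize length prefix (LE)
    key.extend_from_slice(s.as_bytes());
}

fn main() {
    let (mut a, mut b) = (Vec::new(), Vec::new());
    for s in ["ab", "c"] {
        encode_str_col(s, &mut a);
    }
    for s in ["a", "bc"] {
        encode_str_col(s, &mut b);
    }
    // Same characters overall, but the length prefixes differ,
    // so the two keys are distinct byte strings.
    assert_ne!(a, b);
}
```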

jorgecarleitao (Member) left a comment:

Went through this carefully, and it looks great! Great work, @alamb!

alamb commented Jul 30, 2021

Rebased now that #786 is merged, so this PR just shows the delta.

@alamb force-pushed the alamb/gby_null_alternative branch from b0d834a to b6c6a3c on July 30, 2021 at 17:42.
Co-authored-by: Daniël Heres <danielheres@gmail.com>
alamb commented Aug 2, 2021

#808 contains the PR that should give us back any performance we lost in this one

@alamb alamb merged commit 2bcf040 into apache:master Aug 2, 2021
@alamb alamb deleted the alamb/gby_null_alternative branch August 2, 2021 11:23
@houqp added the bug (Something isn't working) label on Aug 3, 2021.
igorcalabria added a commit to igorcalabria/arrow-datafusion that referenced this pull request on Oct 9, 2023: "takes the relevant part out of apache#793 which was ignored by cube maintainers"

cfms3 pushed a commit to inloco/arrow-datafusion that referenced this pull request on May 31, 2024: "takes the relevant part out of apache#793 which was ignored by cube maintainers"
Labels: bug (Something isn't working), datafusion (Changes in the datafusion crate)

Successfully merging this pull request may close these issues: Wrong results when grouping with dictionary arrays with nulls

5 participants