Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong pre aggregation result for boolean dimension containing NULL values #7197

Open
igorcalabria opened this issue Oct 5, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@igorcalabria
Copy link

igorcalabria commented Oct 5, 2023

Describe the bug
Hi, we noticed that some of our preAggregations are returning inconsistent values for boolean dimensions. At first, we believed that this might be a import issue and cubestore was simply reporting the data it had, but we queried it directly and confirmed that something strange is happening. With a direct connection to cubestore on mysql port we did the following:

SELECT 
  my_model__bool_col,
  COUNT(*) AS total_entries 
FROM prod_pre_aggregations.my_model_pre_agg GROUP BY 1;

The query returned

+--------------------+---------------+
| my_model__bool_col | total_entries |
+--------------------+---------------+
| true               |          9393 |
| false              |       3888043 |
+--------------------+---------------+

true and false values are reported correctly but it's missing the NULLs. Strangely, we can do the following

SELECT COUNT(*) AS total_entries 
FROM prod_pre_aggregations.my_model_pre_agg 
WHERE my_model__bool_col IS NULL;

And it reports the correct NULL values

+---------------+
| total_entries |
+---------------+
|        464018 |
+---------------+

We're using a deployment with separate workers and a router, but couldn't find anything wrong with our configuration. Besides that, every query that's not aggregating on a boolean dimension seems to work fine, so it seems unlikely that this is caused by a miss configuration.

I even patched cube to log the partial data on each worker before it is returned to the main one and individually each worker already responded with a wrong aggregation (missing the nulls, even tough the partitions contained it)

To Reproduce

I didn't manage to reproduce this with artificial data, but I'll update the issue as soon as I have. I'm opening it now in case somebody has experienced the same thing

Version:
0.34.0

@igorcalabria
Copy link
Author

To reproduce this (works on cubecloud too). Run these on cubestore:

CREATE SCHEMA test;
CREATE TABLE test.null_test (id int, bool_col boolean);
INSERT INTO test.null_test (id, bool_col) VALUES (0, false), (1, true), (2, NULL), (3, NULL), (4, true), (5, false);
MySQL [(none)]>  SELECT bool_col, count(*) FROM test.null_test GROUP BY 1;
+----------+-----------------+
| bool_col | COUNT(UInt8(1)) |
+----------+-----------------+
| false    |               4 |
| true     |               2 |
+----------+-----------------+
MySQL [(none)]>  SELECT count(*) FROM test.null_test WHERE bool_col IS NULL;
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
|               2 |
+-----------------+

This issue goes beyond just dropping NULL values from aggregations, depending on the order, it may return wrong results. For example, if NULLS come first with adjacent False columns.

INSERT INTO test.null_test (id, bool_col) VALUES (0, NULL), (1, NULL), (2, NULL), (3, false), (4, true), (5, true);
MySQL [(none)]>  SELECT bool_col, count(*) FROM test.null_test GROUP BY 1;
+----------+-----------------+
| bool_col | COUNT(UInt8(1)) |
+----------+-----------------+
| NULL     |               4 |
| true     |               2 |
+----------+-----------------+

There are only 3 NULL columns, but it returned 4. Probably because the engine slurped the false as NULL too.

This seems releated to apache/datafusion#782 which was patched upstream a couple years ago. Cube also applied a similar patch cube-js/arrow-datafusion@e524173 but it's not the same as the one applied upstream. Using cubestore version of datafusion and querying the same parquet file used by cube we get the same result so this points to a problem with cube's version. Using the newest version of datafusion produces the correct results

@igorcalabria
Copy link
Author

I think I've found the problem. https://github.com/cube-js/arrow-datafusion/blob/cube/datafusion/src/physical_plan/hash_aggregate.rs#L588 Does not deal with NULLs, só any type where NULL looks like an actual value will be grouped together. This is most apparent with booleans since there are only 2 values (NULL = 0, False = 0, True = 1) but any other type is also susceptible to this bug. For example, I just tested with a INT table and 0 and NULL where grouped together too.

A possible patch is just to use the same strategy that upstream arrow adopted at first in apache/datafusion#793

diff --git a/datafusion/src/physical_plan/hash_aggregate.rs b/datafusion/src/physical_plan/hash_aggregate.rs
index fc0bf4af8..357b3779a 100644
--- a/datafusion/src/physical_plan/hash_aggregate.rs
+++ b/datafusion/src/physical_plan/hash_aggregate.rs
@@ -783,7 +783,12 @@ pub(crate) fn create_key(
 ) -> Result<()> {
     vec.clear();
     for col in group_by_keys {
-        create_key_for_col(col, row, vec)?
+        if !col.is_valid(row) {
+            vec.push(0xFE);
+        } else {
+            vec.push(0xFF);
+            create_key_for_col(col, row, vec)?
+        }
     }
     Ok(())
 }

This strategy was later dropped for a more optimized approach.

@paveltiunov paveltiunov added the bug Something isn't working label Oct 11, 2023
@paveltiunov
Copy link
Member

Hey @igorcalabria ! Thanks for posting this one! If you found the cause, please don't hesitate to provide PR.

igorcalabria added a commit to inloco/cube that referenced this issue Nov 24, 2023
cfms3 pushed a commit to inloco/cube that referenced this issue Mar 5, 2024
cfms3 pushed a commit to inloco/cube that referenced this issue May 30, 2024
cfms3 pushed a commit to inloco/cube that referenced this issue Jun 1, 2024
cfms3 pushed a commit to inloco/cube that referenced this issue Jun 2, 2024
cfms3 pushed a commit to inloco/cube that referenced this issue Jun 5, 2024
cfms3 pushed a commit to inloco/cube that referenced this issue Jun 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants