-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrong pre aggregation result for boolean dimension containing NULL values #7197
Comments
To reproduce this (works on cubecloud too). Run these on cubestore: CREATE SCHEMA test; CREATE TABLE test.null_test (id int, bool_col boolean); INSERT INTO test.null_test (id, bool_col) VALUES (0, false), (1, true), (2, NULL), (3, NULL), (4, true), (5, false); MySQL [(none)]> SELECT bool_col, count(*) FROM test.null_test GROUP BY 1;
MySQL [(none)]> SELECT count(*) FROM test.null_test WHERE bool_col IS NULL;
This issue goes beyond just dropping NULL values from aggregations, depending on the order, it may return wrong results. For example, if INSERT INTO test.null_test (id, bool_col) VALUES (0, NULL), (1, NULL), (2, NULL), (3, false), (4, true), (5, true);
There are only 3 NULL columns, but it returned 4. Probably because the engine slurped the false as NULL too. This seems releated to apache/datafusion#782 which was patched upstream a couple years ago. Cube also applied a similar patch cube-js/arrow-datafusion@e524173 but it's not the same as the one applied upstream. Using cubestore version of datafusion and querying the same parquet file used by cube we get the same result so this points to a problem with cube's version. Using the newest version of datafusion produces the correct results |
I think I've found the problem. https://github.com/cube-js/arrow-datafusion/blob/cube/datafusion/src/physical_plan/hash_aggregate.rs#L588 Does not deal with NULLs, só any type where NULL looks like an actual value will be grouped together. This is most apparent with booleans since there are only 2 values (NULL = 0, False = 0, True = 1) but any other type is also susceptible to this bug. For example, I just tested with a INT table and 0 and NULL where grouped together too. A possible patch is just to use the same strategy that upstream arrow adopted at first in apache/datafusion#793 diff --git a/datafusion/src/physical_plan/hash_aggregate.rs b/datafusion/src/physical_plan/hash_aggregate.rs
index fc0bf4af8..357b3779a 100644
--- a/datafusion/src/physical_plan/hash_aggregate.rs
+++ b/datafusion/src/physical_plan/hash_aggregate.rs
@@ -783,7 +783,12 @@ pub(crate) fn create_key(
) -> Result<()> {
vec.clear();
for col in group_by_keys {
- create_key_for_col(col, row, vec)?
+ if !col.is_valid(row) {
+ vec.push(0xFE);
+ } else {
+ vec.push(0xFF);
+ create_key_for_col(col, row, vec)?
+ }
}
Ok(())
}
This strategy was later dropped for a more optimized approach. |
Hey @igorcalabria ! Thanks for posting this one! If you found the cause, please don't hesitate to provide PR. |
Describe the bug
Hi, we noticed that some of our preAggregations are returning inconsistent values for boolean dimensions. At first, we believed that this might be a import issue and cubestore was simply reporting the data it had, but we queried it directly and confirmed that something strange is happening. With a direct connection to cubestore on mysql port we did the following:
The query returned
true
andfalse
values are reported correctly but it's missing the NULLs. Strangely, we can do the followingAnd it reports the correct NULL values
We're using a deployment with separate workers and a router, but couldn't find anything wrong with our configuration. Besides that, every query that's not aggregating on a boolean dimension seems to work fine, so it seems unlikely that this is caused by a miss configuration.
I even patched cube to log the partial data on each worker before it is returned to the main one and individually each worker already responded with a wrong aggregation (missing the nulls, even tough the partitions contained it)
To Reproduce
I didn't manage to reproduce this with artificial data, but I'll update the issue as soon as I have. I'm opening it now in case somebody has experienced the same thing
Version:
0.34.0
The text was updated successfully, but these errors were encountered: