Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible bug in Parquet pruning code? #8309

Closed
maxburke opened this issue Nov 22, 2023 · 7 comments
Closed

Possible bug in Parquet pruning code? #8309

maxburke opened this issue Nov 22, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@maxburke
Copy link
Contributor

maxburke commented Nov 22, 2023

Describe the bug

I'm running into a weird issue with a datafusion query on a parquet file where if I select with a condition testing certain values, I get no results back.

To Reproduce

For example:

❯ create external table t0 stored as parquet location 'a7546b6b206d882e928a1325f8cbcce4.parquet';
0 rows in set. Query took 0.019 seconds.
❯ select distinct direction from t0;
+-----------+
| direction |
+-----------+
| Merged    |
| Two Way   |
| Outgoing  |
| Incoming  |
+-----------+
4 rows in set. Query took 0.015 seconds.
❯ select * from t0 where "direction" = 'Outgoing';
+--------------------------+-----------+-------+
| ul_node_id               | direction | adt   |
+--------------------------+-----------+-------+
| XRLDMlrMwNT4jrCE2hzvbA== | Outgoing  | 222   |
| uuzYwO69bCHsouZHOeG8Wg== | Outgoing  | 178   |
...snip...
| YlaylK0fQqkfvKbIONzkJQ== | Outgoing  | 115   |
| AUsoU3ojLVrmgRJhuypwMA== | Outgoing  | 649   |
+--------------------------+-----------+-------+
100 rows in set. Query took 0.013 seconds.
❯ select * from t0 where "direction" = 'Merged';
+--------------------------+-----------+------+
| ul_node_id               | direction | adt  |
+--------------------------+-----------+------+
| YxRwadNzW7uJht9xp4g46Q== | Merged    | 5605 |
+--------------------------+-----------+------+
1 row in set. Query took 0.010 seconds.
❯ select * from t0 where "direction" = 'Incoming';
0 rows in set. Query took 0.007 seconds.
❯ select * from t0 where "direction" = 'Two Way';
0 rows in set. Query took 0.006 seconds.

I've checked to see if there's trailing whitespace in the values "Incoming" and "Two Way in the table and there isn't. pandas seems to be able to do similar queries just fine:

>>> df[df['direction'] == "Two Way"]
                   ul_node_id direction    adt
1    XRLDMlrMwNT4jrCE2hzvbA==   Two Way    420
5    uuzYwO69bCHsouZHOeG8Wg==   Two Way    337
8    crHf6lzksh5sVFw8/Sf5Ew==   Two Way    324
11   OSS+4pM008wnvNH/c19uHg==   Two Way   4674
12   NF20BBZ0OjDvIbcGxQuhwQ==   Two Way  23116
..                        ...       ...    ...
288  kvQm8I+WxR9C1Il1XxyvbQ==   Two Way    204
291  RCYo+XLiUXnvL7tb8pml3w==   Two Way   5777
294  YTsSJkf0qT7tvaosC0gRhA==   Two Way   4409
297  YlaylK0fQqkfvKbIONzkJQ==   Two Way    336
298  AUsoU3ojLVrmgRJhuypwMA==   Two Way   1049

[101 rows x 3 columns]

The parquet file above is attached.
a7546b6b206d882e928a1325f8cbcce4.parquet.zip

Expected behavior

The query should return values.

Additional context

Andy Grove demonstrated that disabling Parquet pruning causes the query to return the correct values: https://the-asf.slack.com/archives/C01QUFS30TD/p1700671664714069

$ export DATAFUSION_EXECUTION_PARQUET_PRUNING=false

❯ select count(*) from "test.parquet" where direction = 'Incoming';
+----------+
| COUNT(*) |
+----------+
| 99       |
+----------+
1 row in set. Query took 0.024 seconds.

$ export DATAFUSION_EXECUTION_PARQUET_PRUNING=true

❯ select count(*) from "test.parquet" where direction = 'Incoming';
+----------+
| COUNT(*) |
+----------+
| 0        |
+----------+
1 row in set. Query took 0.023 seconds.
@maxburke maxburke added the bug Something isn't working label Nov 22, 2023
@alamb
Copy link
Contributor

alamb commented Nov 22, 2023

Added to #8227

I agree this looks like a bug in the pruning code.

@andygrove
Copy link
Member

I don't know if this helps, but the min and max values seem to be incorrect in the parquet file (or maybe I have a bug in bdt).

$ bdt view-parquet-meta test.parquet 
+------------+------------+
| Key        | Value      |
+------------+------------+
| Version    | 1          |
| Created By | UrbanLogiq |
| Rows       | 301        |
| Row Groups | 1          |
+------------+------------+

Row Group 0 of 1 contains 301 rows and has 3384 bytes:

+-------------+--------------+---------------+-----------------+-------+--------------------------+--------------------------+
| Column Name | Logical Type | Physical Type | Distinct Values | Nulls | Min                      | Max                      |
+-------------+--------------+---------------+-----------------+-------+--------------------------+--------------------------+
| ul_node_id  | String       | BYTE_ARRAY    | N/A             | 0     | /ehIvdei+UGfkQ4Gy5fr1w== | /ehIvdei+UGfkQ4Gy5fr1w== |
| direction   | String       | BYTE_ARRAY    | N/A             | 0     | Merged                   | Merged                   |
| adt         | N/A          | INT32         | N/A             | 0     | 15                       | 23116                    |
+-------------+--------------+---------------+-----------------+-------+--------------------------+--------------------------+

@andygrove
Copy link
Member

nm, it is a bug in bdt

@andygrove
Copy link
Member

Here is the corrected version:

$ bdt view-parquet-meta test.parquet 
+------------+------------+
| Key        | Value      |
+------------+------------+
| Version    | 1          |
| Created By | UrbanLogiq |
| Rows       | 301        |
| Row Groups | 1          |
+------------+------------+

Row Group 0 of 1 contains 301 rows and has 3384 bytes:

+-------------+--------------+---------------+-----------------+-------+--------------------------+--------------------------+
| Column Name | Logical Type | Physical Type | Distinct Values | Nulls | Min                      | Max                      |
+-------------+--------------+---------------+-----------------+-------+--------------------------+--------------------------+
| ul_node_id  | String       | BYTE_ARRAY    | N/A             | 0     | /ehIvdei+UGfkQ4Gy5fr1w== | zThqpswvY6fa3VHF4BKWfw== |
| direction   | String       | BYTE_ARRAY    | N/A             | 0     | Merged                   | Outgoing                 |
| adt         | N/A          | INT32         | N/A             | 0     | 15                       | 23116                    |
+-------------+--------------+---------------+-----------------+-------+--------------------------+--------------------------+

@andygrove
Copy link
Member

This still seems incorrect. I would expect min of Incoming and max of Two Way.

+-----------+
| direction |
+-----------+
| Merged    |
| Two Way   |
| Outgoing  |
| Incoming  |
+-----------+

@alamb
Copy link
Contributor

alamb commented Nov 22, 2023

I used parquet-tools to verify that metadata in the file appears to claim that direction has a min value of Merged and a max value of Outgoing 🤔

alamb@MacBook-Pro-8:~/Downloads$ parquet-tools meta  a7546b6b206d882e928a1325f8cbcce4.parquet
file:        file:/Users/alamb/Downloads/a7546b6b206d882e928a1325f8cbcce4.parquet
creator:     UrbanLogiq
extra:       ARROW:schema = /////+gAAAAQAAAAAAAKAA4ADAALAAQACgAAABQAAAAAAAABBAAKAAwAAAAIAAQACgAAAAgAAAAIAAAAAAAAAAMAAAB8AAAAPAAAAAQAAACg////GAAAACAAAAAAAAACHAAAAAgADAAEAAsACAAAACAAAAAAAAABAAAAAAMAAABhZHQA1P///xQAAAAMAAAAAAAABQwAAAAAAAAAxP///wkAAABkaXJlY3Rpb24AAAAQABQAEAAAAA8ABAAAAAgAEAAAABgAAAAMAAAAAAAABRAAAAAAAAAABAAEAAQAAAAKAAAAdWxfbm9kZV9pZAAA

file schema: arrow_schema
--------------------------------------------------------------------------------
ul_node_id:  REQUIRED BINARY L:STRING R:0 D:0
direction:   REQUIRED BINARY L:STRING R:0 D:0
adt:         REQUIRED INT32 R:0 D:0

row group 1: RC:301 TS:3384 OFFSET:4
--------------------------------------------------------------------------------
ul_node_id:   BINARY ZSTD DO:4 FPO:1796 SZ:2143/3187/1.49 VC:301 ENC:RLE_DICTIONARY,RLE,PLAIN ST:[min: /ehIvdei+UGfkQ4Gy5fr1w==, max: zThqpswvY6fa3VHF4BKWfw==, num_nulls not defined]
direction:    BINARY ZSTD DO:2243 FPO:2311 SZ:195/177/0.91 VC:301 ENC:RLE_DICTIONARY,RLE,PLAIN ST:[min: Merged, max: Outgoing, num_nulls not defined]
adt:          INT32 ZSTD DO:2500 FPO:3159 SZ:1046/1503/1.44 VC:301 ENC:RLE_DICTIONARY,RLE,PLAIN ST:[min: 15, max: 23116, num_nulls not defined]

@alamb
Copy link
Contributor

alamb commented Nov 22, 2023

However, the calculated min / maxes are different (they are consistent with what @andygrove expects in #8309 (comment))

❯ select min(direction), max(direction) from 'a7546b6b206d882e928a1325f8cbcce4.parquet';
+---------------------------------------------------------+---------------------------------------------------------+
| MIN(a7546b6b206d882e928a1325f8cbcce4.parquet.direction) | MAX(a7546b6b206d882e928a1325f8cbcce4.parquet.direction) |
+---------------------------------------------------------+---------------------------------------------------------+
| Incoming                                                | Two Way                                                 |
+---------------------------------------------------------+---------------------------------------------------------+
1 row in set. Query took 0.003 seconds.
❯ select arrow_typeof(direction) from 'a7546b6b206d882e928a1325f8cbcce4.parquet';
+------------------------------------------------------------------+
| arrow_typeof(a7546b6b206d882e928a1325f8cbcce4.parquet.direction) |
+------------------------------------------------------------------+
| Utf8                                                             |
| Utf8                                                             |
...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants