[Bug] Pyiceberg row filter expression "In" takes longer to query than using "EqualTo" #1295

jouwenl · 2024-11-05T19:26:48Z

Apache Iceberg version

0.7.1 (latest release)

Please describe the bug 🐞

Setup: Using PyIceberg and DuckDB to query an Iceberg table (S3, AWS Glue catalog)

When doing a table scan with a row filter, I noticed that the "In" PyIceberg expression performs much worse than the "EqualTo" expression when querying Iceberg.

Using In():

from pyiceberg.expressions import In

scan = table.scan(
    row_filter=In("column1", ["value1", "value2"]),
    selected_fields=("column1", "column2"),
)

versus using EqualTo():

from pyiceberg.expressions import EqualTo, Or

scan = table.scan(
    row_filter=Or(EqualTo("column1", "value1"), EqualTo("column1", "value2")),
    selected_fields=("column1", "column2"),
)

Using mitmproxy, we checked the S3 requests made against the Iceberg table and noticed that the query using the In() row filter fetched many row groups from the Parquet files that were irrelevant to the query, while the query using the EqualTo() row filter did not. This results in significantly worse query performance.
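
To illustrate the pruning that is expected here: Parquet row groups typically carry per-column min/max statistics, and a reader can skip any row group whose range cannot contain a requested value. The sketch below is a simplified model of that idea, not PyIceberg or PyArrow code; RowGroupStats, may_contain, and row_groups_to_read are hypothetical names, and the statistics are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RowGroupStats:
    """Min/max statistics for one column in one row group (hypothetical model)."""
    min_value: str
    max_value: str


def may_contain(stats: RowGroupStats, value: str) -> bool:
    # A row group can only hold `value` if it falls inside the [min, max] range.
    return stats.min_value <= value <= stats.max_value


def row_groups_to_read(row_groups: Sequence[RowGroupStats], wanted: Sequence[str]) -> List[int]:
    # Keep a row group if ANY wanted value might be in it. This is the same
    # decision for In(col, wanted) and for Or(EqualTo(col, v) for v in wanted).
    return [
        index
        for index, stats in enumerate(row_groups)
        if any(may_contain(stats, value) for value in wanted)
    ]


# Made-up statistics for three row groups of "column1".
row_groups = [
    RowGroupStats("aaa", "fff"),
    RowGroupStats("ggg", "mmm"),
    RowGroupStats("nnn", "zzz"),
]
selected = row_groups_to_read(row_groups, ["value1", "value2"])
```

Under this model both filter forms select the same row groups, so the extra requests seen with In() would come from how the filter is evaluated by the reader rather than from the predicate itself.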

After running

scan.to_duckdb(table_name="my_table", connection=duckdb_con)
results = duckdb_con.execute("SELECT * FROM my_table")

The number of rows returned in the results is the same when using "In" and "EqualTo" in the row filter. Somehow using the "In" expression in the row filter makes more requests to get the same result.

kevinjqliu · 2024-11-06T03:53:41Z

This one's interesting. We ultimately push the filter down to PyArrow:

iceberg-python/pyiceberg/io/pyarrow.py

Line 1251 in 36e4de6

filter=pyarrow_filter if not positional_deletes else None,

And from reading the issue apache/arrow#36283, it seems that "in" is slower than "or" when reading Parquet.
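
Given that, one possible workaround is to expand a list of values into a chain of equality predicates instead of using In(). The helper below is only a sketch: in_as_or is a hypothetical name, and the constructors are passed in as arguments so that the example runs without PyIceberg installed (the fake_* functions are stand-ins, not library APIs).

```python
from functools import reduce
from typing import Any, Callable, Sequence


def in_as_or(
    column: str,
    values: Sequence[Any],
    equal_to: Callable[[str, Any], Any],
    or_: Callable[[Any, Any], Any],
) -> Any:
    """Build or_(equal_to(column, v1), equal_to(column, v2)), ... from a list of values."""
    if not values:
        raise ValueError("values must not be empty")
    # One equality predicate per value, in the order given.
    predicates = [equal_to(column, value) for value in values]
    # Fold left: or_(or_(p1, p2), p3), ...; a single value yields just p1.
    return reduce(or_, predicates)


# Stand-in constructors (assumptions for illustration only).
def fake_equal_to(column: str, value: Any) -> tuple:
    return ("EqualTo", column, value)


def fake_or(left: Any, right: Any) -> tuple:
    return ("Or", left, right)


row_filter = in_as_or("column1", ["value1", "value2"], fake_equal_to, fake_or)
```

With PyIceberg, passing EqualTo and Or from pyiceberg.expressions as the last two arguments should produce the same expression as the Or(EqualTo(...), EqualTo(...)) filter from the original report. Whether this stays faster for long value lists has not been measured here.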

Fokko mentioned this issue Nov 8, 2024

feat(table): Initial implementation of Reading Data apache/iceberg-go#185

Merged
