[Bug] Pyiceberg row filter expression "In" takes longer to query than using "EqualTo" #1295

Open
jouwenl opened this issue Nov 5, 2024 · 1 comment

jouwenl commented Nov 5, 2024

Apache Iceberg version

0.7.1 (latest release)

Please describe the bug 🐞

Setup: Using PyIceberg and DuckDB to query an Iceberg table (S3, AWS Glue catalog)

When doing a table scan with a row filter, I noticed that the PyIceberg "In" expression performs much worse than the "EqualTo" expression.

Using In():

from pyiceberg.expressions import In

scan = table.scan(
    row_filter=In("column1", ["value1", "value2"]),
    selected_fields=("column1", "column2"),
)

versus using EqualTo():

from pyiceberg.expressions import EqualTo, Or

scan = table.scan(
    row_filter=Or(EqualTo("column1", "value1"), EqualTo("column1", "value2")),
    selected_fields=("column1", "column2"),
)

Using mitmproxy, we inspected the S3 requests made during each scan. The scan with the In() row filter fetched many row groups from the Parquet files that were irrelevant to the query, while the scan with the EqualTo() row filter did not. This results in significantly worse query performance.

After running

scan.to_duckdb(table_name="my_table", connection=duckdb_con)
results = duckdb_con.execute("SELECT * FROM my_table")

The number of rows returned is the same whether the row filter uses "In" or "EqualTo". Somehow the "In" expression makes more requests to get the same result.

@kevinjqliu
Contributor

This one's interesting. We ultimately push the filter down to PyArrow:

filter=pyarrow_filter if not positional_deletes else None,

And from reading apache/arrow#36283, it seems like "in" is slower than "or" when reading Parquet.
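
Until that is resolved, one possible workaround is to expand the value list into a chain of equality predicates before scanning, since the Or(EqualTo(...), ...) form was the faster one in the report above. This is only a minimal sketch: the helper name in_as_or_chain is hypothetical and not part of PyIceberg, and the predicate constructors are passed in as arguments so the helper itself has no dependency on any library. It also assumes the values are hashable (true for strings and numbers).

```python
from functools import reduce
from typing import Any, Callable, Iterable


def in_as_or_chain(
    term: str,
    values: Iterable[Any],
    equal_to: Callable[[str, Any], Any],
    or_: Callable[[Any, Any], Any],
) -> Any:
    """Combine equal_to(term, v) for each distinct value using or_.

    Hypothetical helper, not a PyIceberg API. A single value yields a
    bare equal_to(...) predicate with no or_ wrapper.
    """
    # Drop duplicate values while keeping their original order.
    unique_values = list(dict.fromkeys(values))
    if not unique_values:
        raise ValueError("at least one value is required")
    predicates = [equal_to(term, value) for value in unique_values]
    # Left-fold: or_(or_(p1, p2), p3) ...
    return reduce(or_, predicates)


# Assumed usage with PyIceberg (needs pyiceberg installed and a loaded table):
# from pyiceberg.expressions import EqualTo, Or
# row_filter = in_as_or_chain("column1", ["value1", "value2"], EqualTo, Or)
# scan = table.scan(row_filter=row_filter, selected_fields=("column1", "column2"))
```

For very long value lists this builds a deeply nested expression, so whether it still beats In() would need to be measured on the actual table.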
