Setup: using PyIceberg and DuckDB to query an Iceberg table (S3, AWS Glue catalog)
When doing a table scan with a row filter, I noticed that the "In" PyIceberg expression performs much worse than the "EqualTo" expression when querying Iceberg.
Using mitmproxy, we inspected the S3 requests made for the Iceberg table and noticed that the query with the In() row filter fetched many row groups from the Parquet files that were irrelevant to the query, while the query with the EqualTo() row filter did not. This results in significantly worse query performance.
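To make the expectation concrete, here is a minimal sketch of how per-row-group min/max statistics could prune row groups for both kinds of predicate. It is not PyIceberg's or PyArrow's actual implementation. `RowGroupStats`, `may_contain_eq`, `may_contain_in`, and the sample ranges are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RowGroupStats:
    """Min/max statistics of one column in one (hypothetical) Parquet row group."""
    lower: int
    upper: int


def may_contain_eq(stats: RowGroupStats, value: int) -> bool:
    # Equality predicate: the row group can only match if value lies in [lower, upper].
    return stats.lower <= value <= stats.upper


def may_contain_in(stats: RowGroupStats, values: Iterable[int]) -> bool:
    # Set-membership predicate: the row group can only match if at least one value is in range.
    return any(may_contain_eq(stats, v) for v in values)


# Hypothetical row groups whose value ranges do not overlap.
row_groups = [RowGroupStats(0, 99), RowGroupStats(100, 199), RowGroupStats(200, 299)]

# Indices of the row groups that would have to be fetched in each case.
eq_hits = [i for i, rg in enumerate(row_groups) if may_contain_eq(rg, 150)]
in_hits = [i for i, rg in enumerate(row_groups) if may_contain_in(rg, {150, 160})]

# What a reader fetches when the predicate is not applied at the row-group level.
no_pruning = list(range(len(row_groups)))

print(eq_hits, in_hits, no_pruning)
```

In this sketch both predicates select the same single row group, while the unpruned case reads all three. The extra S3 requests we saw with In() look more like the unpruned case.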
After running
scan.to_duckdb(table_name="my_table", connection=duckdb_con)
results = duckdb_con.execute("SELECT * FROM my_table")
the number of rows returned is the same whether the row filter uses "In" or "EqualTo", yet the "In" expression makes more requests to produce the same result.
Apache Iceberg version
0.7.1 (latest release)