-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return TableProviderFilterPushDown::Exact when Parquet Pushdown Enabled #4028
Comments
This is a good proposal I think -- it would skip unecessary filtering and likely make plans faster. It will become more useful (maybe even necessary) when predicate pushdown is enabled by default -- #3463 |
I think it is a good idea. I confirm it would help improve performance by 30x in #5404 (comment). I am thinking to pick this up but seems the work is not as trivial as I thought. Following are my two cents' thoughts. Related code to improve is following
This is trivial, we can
This setting could come from However, when Idea: shall we add
is it making sure the |
I wonder if it might be possible to always return exact for Parquet files, and to just manually insert a FilterExec for any predicates that can't be pushed down Tagging @alamb and @crepererum who are more familiar with this part of the codebase |
I'll try and look into this in more detail tomorrow |
That's what we want to do for InfluxDB IOx as well: https://github.com/influxdata/influxdb_iox/issues/7408 |
I am going to find time sometime this week |
Background ReadingHere is a background article about parquet predicate pushdown: https://www.influxdata.com/blog/querying-parquet-millisecond-latency/#heading5 Specifically the section on Code links for Listing TableSupports filter pushdown: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html#impl-TableProvider-for-ListingTable Conditions:Here are the conditions @tustvold lists above, and some links about what they mean
This refers to the
This should be relatively straightforward to determine
"pushdown" refers to this code in ParquetExec (there is some background information in the parquet RowFilter API) datafusion/datafusion/core/src/datasource/physical_plan/parquet/row_filter.rs Lines 43 to 70 in 16a3557
|
I think there are likely two things needed for this PR:
|
FWIW I think @itsjunetime is looking into this ticket |
Yes, sorry - I'm still somewhat new to this codebase, so I'm taking a good second to read through the associated docs and comments to figure it out first. |
No need to rush. I think @alamb's comment was mostly meant as an assignment signal, so that nobody else starts to work on it and we end up wasting resources 🙂 |
Yes this is what I meant. Sorry about not being clear. |
I filed #12115 to track some additional testing that I would like to do for this feature. |
There is a nice PR here for this feature: #12135 |
…ed (#12135) * feat: Preemptively filter for pushdown-preventing columns in ListingTable * Fix behavior to make all previous tests work and lay groundwork for future tests * fix: Add some more tests and fix small issue with pushdown specificity * test: Revive unneccesarily removed test * ci: Fix CI issues with different combinations of exprs * fix: run fmt * Fix doc publicity issues * Add ::new fn for PushdownChecker * Remove unnecessary 'pub' qualifier * Fix naming and doc comment of non_pushdown_columns to reflect what it actually does (the opposite) and add back useful comments * fmt * Extend FileFormat trait to allow library users to define formats which support pushdown * fmt * fix: reference real fn in doc to fix CI * Minor: Add tests for using FilterExec when parquet was pushed down * Update datafusion/core/src/datasource/file_format/mod.rs * Pipe schema information through to TableScan and ParquetExec to facilitate unnecessary FilterExec removal * - Remove collect::<(_, _)> to satisfy msrv - Remove expect(_) attr to satisfy msrv - Update comments with more accurate details and explanations * Add more details in comments for `map_partial_batch` Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Remove reference to issue #4028 as it will be closed * Convert normal comments to doc-comments Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Clarify meaning of word `projected` in comment Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Clarify more how `table_schema` is used differently from `projected_table_schema` Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Finish partially-written comment about SchemaMapping struct --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
(This section helps Arrow developers understand the context and why for this feature, in addition to the what)
Currently even when parquet predicate pushdown is enabled, and the predicate can be fully pushed down, the physical plan still contains a
FilterExec
when usingListingTable
Describe the solution you'd like
ListingTable::supports_filter_pushdown
should returnTableProviderFilterPushDown::Exact
whenDescribe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.
The text was updated successfully, but these errors were encountered: