[WIP] Support deep schema pruning and projection #11747
Closed
Which issue does this PR close?
Closes #11745
What changes are included in this PR?
This PR is currently opened to support discussion of this feature.
A new projection_deep parameter is added after each existing projection parameter. This parameter needs to be propagated all the way down to the physical data source, in our case Parquet. The deep projection is represented as a HashMap<usize, Vec<String>>. The key is the index of the top-level column, while the Vec<String> is a list of "paths". Each path describes how to navigate inside a nested field, with segments separated by dots, like top_level_struct.subfield1.*.subfield2.
Current problems / questions
Both the projection and projection_deep parameters are passed, even though we could pass only the second one, whose keys already carry the information currently in projection.
field[0]["some_sub_field"]["other_sub_field"] -> 0: Vec<String>{"*.some_sub_field.other_sub_field"}
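To make the path representation concrete, here is a minimal, self-contained sketch of how a HashMap<usize, Vec<String>> could drive pruning of a nested schema. It is not the code in this PR: the Ty enum, prune, tails, prune_columns and example_columns are hypothetical names that stand in for Arrow data types and the real propagation logic. It assumes paths are relative to the top-level column (as in the mapping above), that * steps into a list element, and that an empty path list keeps the whole column.

```rust
use std::collections::HashMap;

/// Toy nested type standing in for an Arrow data type (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Leaf,
    List(Box<Ty>),
    Struct(Vec<(String, Ty)>),
}

/// Keep only the parts of `ty` reachable through at least one path.
/// No paths, or a fully consumed path, keeps the whole subtree.
fn prune(ty: &Ty, paths: &[Vec<&str>]) -> Ty {
    if paths.is_empty() || paths.iter().any(|p| p.is_empty()) {
        return ty.clone();
    }
    match ty {
        Ty::Leaf => Ty::Leaf,
        // Assumption: `*` is the segment that steps into a list element.
        Ty::List(elem) => Ty::List(Box::new(prune(elem, &tails(paths, "*")))),
        // A struct field survives only if some path starts with its name.
        Ty::Struct(fields) => Ty::Struct(
            fields
                .iter()
                .filter_map(|(name, child)| {
                    let rest = tails(paths, name);
                    if rest.is_empty() {
                        None
                    } else {
                        Some((name.clone(), prune(child, &rest)))
                    }
                })
                .collect(),
        ),
    }
}

/// Remaining segments of every path whose first segment equals `head`.
/// Callers guarantee that no path is empty.
fn tails<'a>(paths: &[Vec<&'a str>], head: &str) -> Vec<Vec<&'a str>> {
    paths
        .iter()
        .filter(|p| p[0] == head)
        .map(|p| p[1..].to_vec())
        .collect()
}

/// Apply a deep projection: keys select top-level columns by index,
/// values are dot-separated paths inside that column.
fn prune_columns(
    columns: &[(String, Ty)],
    projection_deep: &HashMap<usize, Vec<String>>,
) -> Vec<(String, Ty)> {
    columns
        .iter()
        .enumerate()
        .filter_map(|(idx, (name, ty))| {
            let paths = projection_deep.get(&idx)?;
            let split: Vec<Vec<&str>> =
                paths.iter().map(|p| p.split('.').collect()).collect();
            Some((name.clone(), prune(ty, &split)))
        })
        .collect()
}

/// Example schema: column 0 is a list of structs, column 1 is a plain leaf.
fn example_columns() -> Vec<(String, Ty)> {
    let inner = Ty::Struct(vec![
        ("other_sub_field".to_string(), Ty::Leaf),
        ("unused".to_string(), Ty::Leaf),
    ]);
    let elem = Ty::Struct(vec![
        ("some_sub_field".to_string(), inner),
        ("extra".to_string(), Ty::Leaf),
    ]);
    vec![
        ("field".to_string(), Ty::List(Box::new(elem))),
        ("other".to_string(), Ty::Leaf),
    ]
}
```

Calling prune_columns on example_columns() with the map {0: ["*.some_sub_field.other_sub_field"]} keeps only column 0, reduced to a list of structs holding some_sub_field.other_sub_field. A path that does not match the type it is applied to keeps the whole subtree in this sketch, which is a conservative choice rather than something the PR specifies.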
Current problems / questions
Resolving a["b"]["c"] to compute a path like b.c, or *.c, or b.* is ambiguous: the ["b"] could mean either a map access or a substruct access, and the same holds for ["c"].
* - so we also introduce a magic string here :(
scan_deep function in the TableProvider
Current problems / questions
Are these changes tested?
A version of this is tested on top of DataFusion 40; the rebase on main has not been tested. The PR is opened to support discussion, but the patch applied mostly cleanly apart from some refactorings in DataFusion.
Are there any user-facing changes?
Possible API change for implementers of TableProvider.