[WIP] Support deep schema pruning and projection #11747

Closed
wants to merge 4 commits

Conversation

@adragomir commented Jul 31, 2024

Which issue does this PR close?

Closes #11745

What changes are included in this PR?

This PR is currently opened as a basis for discussion of this feature.

  1. We need a way to add the deep projection. For ease of merging, this is currently represented as an extra parameter, projection_deep, added alongside the existing projection parameter. This parameter needs to be propagated all the way down to the physical data source, in our case parquet.

The deep projection is represented as a HashMap<usize, Vec<String>>. The key is the index of the top-level column, while the Vec<String> is a list of "paths". Each path describes how to navigate inside a deep field, with segments separated by dots, like top_level_struct.subfield1.*.subfield2.
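For illustration, a minimal sketch of this representation (the field names are the example ones from above, not code from the PR):

```rust
use std::collections::HashMap;

// Key: index of the top-level column in the table schema.
// Value: dot-separated paths into that column; `*` stands for a list
// element or map entry whose concrete name we don't care about.
fn example_deep_projection() -> HashMap<usize, Vec<String>> {
    let mut projection_deep: HashMap<usize, Vec<String>> = HashMap::new();
    // Select only subfield2 inside each element of top_level_struct.subfield1,
    // assuming (per the later example) the path does not repeat the
    // top-level column name itself.
    projection_deep.insert(0, vec!["subfield1.*.subfield2".to_string()]);
    projection_deep
}
```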

Current problems / questions

  • Duplicated information: for ease of merging we currently pass both the projection and projection_deep parameters, even though the second alone would suffice, since its keys carry the same information as projection
  • Untyped representation for paths: we may need a better / safer way to represent the deep projections. Due to unfamiliarity and for ease of development a path is currently just a String, but perhaps it should be a typed structure (see the sketch after this list)
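As one hypothetical direction for the "typed thing" question, the magic * string could become an enum variant; none of these names exist in the PR:

```rust
/// One possible typed replacement for the raw String paths.
#[derive(Debug, Clone, PartialEq, Eq)]
enum PathSegment {
    /// A named struct subfield, e.g. `subfield1`.
    Field(String),
    /// A list element or map entry whose concrete name is irrelevant (`*`).
    Wildcard,
}

type DeepPath = Vec<PathSegment>;

/// Parse a dot-separated path like "subfield1.*.subfield2".
fn parse_path(s: &str) -> DeepPath {
    s.split('.')
        .map(|seg| match seg {
            "*" => PathSegment::Wildcard,
            name => PathSegment::Field(name.to_string()),
        })
        .collect()
}
```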
  2. We have a set of functions that can rewrite schemas and record batches from an input to an output schema, and rewrite an input schema into a (potentially) much smaller output schema that contains ONLY the necessary columns. This is the first patch in the PR. These functions can be tested in isolation; we still need to add more unit tests.

Current problems / questions

  • We have a lot of seemingly duplicated code that recurses through schemas. We tried making it more generic, but that turned out to be quite complicated, so for the moment some duplication remains and we are not sure how to improve it
  • Deep schema test cases: creating deep schemas in code is very verbose (see the sketch after this list), so we don't have enough test cases. One option would be to use pre-generated parquet files created with other tools and take the schemas from those
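To illustrate the verbosity problem: building even one nested column by hand with the arrow_schema crate already takes this much code (the column mirrors the top_level_struct.subfield1.*.subfield2 example above):

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field, Fields, Schema};

// A single column: top_level_struct { subfield1: List<Struct { subfield2 }> }
fn deep_test_schema() -> Schema {
    let subfield2 = Field::new("subfield2", DataType::Int64, true);
    let element = Field::new(
        "element",
        DataType::Struct(Fields::from(vec![subfield2])),
        true,
    );
    let subfield1 = Field::new("subfield1", DataType::List(Arc::new(element)), true);
    let top_level_struct = Field::new(
        "top_level_struct",
        DataType::Struct(Fields::from(vec![subfield1])),
        true,
    );
    Schema::new(vec![top_level_struct])
}
```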
  3. We need to detect the deep projections from the input logical plan. That means navigating the expressions down to the columns and transforming index and field access expressions into map entries, e.g. field[0]["some_sub_field"]["other_sub_field"] -> 0: Vec<String>{"*.some_sub_field.other_sub_field"} (a sketch follows the list below).

Current problems / questions

  • Uniform field accesses require two passes over the fields:
    • At the moment, we recurse through input expressions like a["b"]["c"] to compute a path like b.c, *.c, or b.*. The ["b"] could mean either a map access or a sub-struct access, and the same goes for ["c"]
    • So we go through the paths twice: once to create them, and once more WITH the input schema, so we can detect map and list accesses and replace the corresponding field names with *, which also introduces a magic string here :(
  • Implementing a new scan_deep function in the TableProvider
    • We can probably get rid of this. It is done this way for now so that we don't break existing TableProvider implementations
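A self-contained sketch of the two passes described above, using a toy expression type rather than DataFusion's real Expr (all names here are illustrative):

```rust
// Toy stand-in for a column / field-access expression tree.
enum AccessExpr {
    Column(String),
    FieldAccess { base: Box<AccessExpr>, name: String },
}

// Pass 1: flatten a["b"]["c"] into ["a", "b", "c"], without knowing yet
// whether each segment is a struct subfield or a map/list access.
fn collect_path(e: &AccessExpr, out: &mut Vec<String>) {
    match e {
        AccessExpr::Column(name) => out.push(name.clone()),
        AccessExpr::FieldAccess { base, name } => {
            collect_path(base, out);
            out.push(name.clone());
        }
    }
}

// Pass 2: with the input schema available, segments that actually address a
// map entry or list element become the * wildcard. `is_container` stands in
// for the schema lookup the PR performs.
fn normalize(path: Vec<String>, is_container: impl Fn(usize) -> bool) -> Vec<String> {
    path.into_iter()
        .enumerate()
        .map(|(i, seg)| if is_container(i) { "*".to_string() } else { seg })
        .collect()
}
```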
  4. We need to push these through the physical layer and finally implement the reading in the parquet layer.

Current problems / questions

  • Named field in map key value or list element
    • In the parquet schema, a list has an inner field with a specific name. However, we don't have this name anywhere except at the parquet layer. As far as we can tell, the arrow reader loads these fields and replaces their names with standard ones ("element" in the list case, "key_value.key" and "key_value.value" in the map case). BUT, if the arrow loader behavior ever changes, this functionality will break. At the moment we detect these hardcoded names and replace them with a wildcard "*" that signifies we don't care about the name (see the sketch below). We also do this in the INPUT paths, because we don't have the actual names at that level.
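The hardcoded-name handling could look roughly like this (a sketch, assuming the standard names the arrow reader currently assigns):

```rust
// Collapse the names the arrow reader assigns to list/map inner levels into
// the * wildcard; if the reader ever changes these names, this breaks.
fn normalize_segment(seg: &str) -> &str {
    match seg {
        "element" | "key_value" | "key" | "value" => "*",
        other => other,
    }
}
```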

Are these changes tested?

A version of this was tested on top of DataFusion 40; the rebase on main has not been tested. The PR is opened as a basis for discussion, but the patch applied mostly cleanly apart from some refactorings in DataFusion.

Are there any user-facing changes?

Possible API change for implementers of TableProvider
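Purely for illustration, the API change might take roughly this shape; Session, Expr, and ExecutionPlan below are local stand-ins for DataFusion's types, and the exact signature is not settled in this PR:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-ins so the sketch is self-contained; the real trait would use
// DataFusion's Session, Expr, ExecutionPlan, and Result types.
struct Session;
struct Expr;
struct ExecutionPlan;
type Result<T> = std::result::Result<T, String>;

trait TableProviderDeep {
    // Mirrors scan(), but additionally receives the deep projection map.
    fn scan_deep(
        &self,
        state: &Session,
        projection: Option<&Vec<usize>>,
        projection_deep: Option<&HashMap<usize, Vec<String>>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<ExecutionPlan>>;
}
```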

@github-actions bot added labels: logical-expr (Logical plan and expressions), optimizer (Optimizer rules), core (Core DataFusion crate) on Jul 31, 2024
@github-actions bot added labels: physical-expr (Physical Expressions), catalog (Related to the catalog crate), common (Related to common crate), proto (Related to proto crate) on Aug 23, 2024
@adragomir (Author)

Closing this for now, we will retest on main and reopen

@adragomir closed this Aug 29, 2024