[WIP] Support deep schema pruning and projection #11747

Closed
wants to merge 4 commits

Conversation

@adragomir commented Jul 31, 2024

Which issue does this PR close?

Closes #11745

What changes are included in this PR?

This PR is currently opened as a basis for discussion of this feature.

  1. We need a way to add the deep projection. For ease of merging, this is currently represented as an extra parameter, projection_deep, added alongside the existing projection parameter. This parameter needs to be propagated all the way down to the physical data source, in our case parquet.

The deep projection is represented as a HashMap<usize, Vec<String>>. The key is the index of the top-level column, while the Vec<String> is a list of "paths". Each path describes how to navigate inside a deep field, with segments separated by dots, like top_level_struct.subfield1.*.subfield2.
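For illustration, a minimal sketch of this representation (the field names are the example ones from above, not code from the PR):

```rust
use std::collections::HashMap;

// Key: index of the top-level column in the table schema.
// Value: dot-separated paths into that column; `*` stands for a list
// element or map entry whose concrete name we don't care about.
fn example_deep_projection() -> HashMap<usize, Vec<String>> {
    let mut projection_deep: HashMap<usize, Vec<String>> = HashMap::new();
    // Select only subfield2 inside each element of top_level_struct.subfield1,
    // assuming (per the later example) the path does not repeat the
    // top-level column name itself.
    projection_deep.insert(0, vec!["subfield1.*.subfield2".to_string()]);
    projection_deep
}
```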

Current problems / questions

  • Duplicated information: for ease of merging we currently pass both the projection and projection_deep parameters, even though the second alone would suffice, since its keys carry the same information as projection
  • Untyped representation for paths: we may need a better / safer way to represent the deep projections. Due to unfamiliarity and for ease of development a path is currently just a String, but perhaps it should be a typed structure (see the sketch after this list)
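As one hypothetical direction for the "typed thing" question, the magic * string could become an enum variant; none of these names exist in the PR:

```rust
/// One possible typed replacement for the raw String paths.
#[derive(Debug, Clone, PartialEq, Eq)]
enum PathSegment {
    /// A named struct subfield, e.g. `subfield1`.
    Field(String),
    /// A list element or map entry whose concrete name is irrelevant (`*`).
    Wildcard,
}

type DeepPath = Vec<PathSegment>;

/// Parse a dot-separated path like "subfield1.*.subfield2".
fn parse_path(s: &str) -> DeepPath {
    s.split('.')
        .map(|seg| match seg {
            "*" => PathSegment::Wildcard,
            name => PathSegment::Field(name.to_string()),
        })
        .collect()
}
```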
  2. We have a set of functions that can rewrite schemas and record batches from an input to an output schema, and rewrite an input schema into a (potentially) much smaller output schema that contains ONLY the necessary columns. This is the first patch in the PR. These functions can be tested in isolation; we still need to add more unit tests.

Current problems / questions

  • We have a lot of seemingly duplicated code that recurses through schemas. We tried making it more generic, but that turned out to be quite complicated, so for the moment some duplication remains and we are not sure how to improve it
  • Deep schema test cases: creating deep schemas in code is very verbose (see the sketch after this list), so we don't have enough test cases. One option would be to use pre-generated parquet files created with other tools and take the schemas from those
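To illustrate the verbosity problem: building even one nested column by hand with the arrow_schema crate already takes this much code (the column mirrors the top_level_struct.subfield1.*.subfield2 example above):

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field, Fields, Schema};

// A single column: top_level_struct { subfield1: List<Struct { subfield2 }> }
fn deep_test_schema() -> Schema {
    let subfield2 = Field::new("subfield2", DataType::Int64, true);
    let element = Field::new(
        "element",
        DataType::Struct(Fields::from(vec![subfield2])),
        true,
    );
    let subfield1 = Field::new("subfield1", DataType::List(Arc::new(element)), true);
    let top_level_struct = Field::new(
        "top_level_struct",
        DataType::Struct(Fields::from(vec![subfield1])),
        true,
    );
    Schema::new(vec![top_level_struct])
}
```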
  3. We need to detect the deep projections from the input logical plan. That means navigating the expressions down to the columns and transforming index and field access expressions into map entries, e.g. field[0]["some_sub_field"]["other_sub_field"] -> 0: Vec<String>{"*.some_sub_field.other_sub_field"} (a sketch follows the list below).

Current problems / questions

  • Uniform field accesses require two passes over the fields:
    • At the moment, we recurse through input expressions like a["b"]["c"] to compute a path like b.c, *.c, or b.*. The ["b"] could mean either a map access or a sub-struct access, and the same goes for ["c"]
    • So we go through the paths twice: once to create them, and once more WITH the input schema, so we can detect map and list accesses and replace the corresponding field names with *, which also introduces a magic string here :(
  • Implementing a new scan_deep function in the TableProvider
    • We can probably get rid of this. It is done this way for now so that we don't break existing TableProvider implementations
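A self-contained sketch of the two passes described above, using a toy expression type rather than DataFusion's real Expr (all names here are illustrative):

```rust
// Toy stand-in for a column / field-access expression tree.
enum AccessExpr {
    Column(String),
    FieldAccess { base: Box<AccessExpr>, name: String },
}

// Pass 1: flatten a["b"]["c"] into ["a", "b", "c"], without knowing yet
// whether each segment is a struct subfield or a map/list access.
fn collect_path(e: &AccessExpr, out: &mut Vec<String>) {
    match e {
        AccessExpr::Column(name) => out.push(name.clone()),
        AccessExpr::FieldAccess { base, name } => {
            collect_path(base, out);
            out.push(name.clone());
        }
    }
}

// Pass 2: with the input schema available, segments that actually address a
// map entry or list element become the * wildcard. `is_container` stands in
// for the schema lookup the PR performs.
fn normalize(path: Vec<String>, is_container: impl Fn(usize) -> bool) -> Vec<String> {
    path.into_iter()
        .enumerate()
        .map(|(i, seg)| if is_container(i) { "*".to_string() } else { seg })
        .collect()
}
```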
  4. We need to push these through the physical layer and finally implement the reading in the parquet layer.

Current problems / questions

  • Named field in map key value or list element
    • In the parquet schema, a list has an inner field with a specific name. However, we don't have this name anywhere except at the parquet layer. As far as we can tell, the arrow reader loads these fields and replaces their names with standard ones ("element" in the list case, "key_value.key" and "key_value.value" in the map case). BUT, if the arrow loader behavior ever changes, this functionality will break. At the moment we detect these hardcoded names and replace them with a wildcard "*" that signifies we don't care about the name (see the sketch below). We also do this in the INPUT paths, because we don't have the actual names at that level.
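The hardcoded-name handling could look roughly like this (a sketch, assuming the standard names the arrow reader currently assigns):

```rust
// Collapse the names the arrow reader assigns to list/map inner levels into
// the * wildcard; if the reader ever changes these names, this breaks.
fn normalize_segment(seg: &str) -> &str {
    match seg {
        "element" | "key_value" | "key" | "value" => "*",
        other => other,
    }
}
```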

Are these changes tested?

A version of this was tested on top of DataFusion 40; the rebase on main has not been tested. The PR is opened as a basis for discussion, but the patch applied mostly cleanly apart from some refactorings in DataFusion.

Are there any user-facing changes?

Possible API change for implementers of TableProvider
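Purely for illustration, the API change might take roughly this shape; Session, Expr, and ExecutionPlan below are local stand-ins for DataFusion's types, and the exact signature is not settled in this PR:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-ins so the sketch is self-contained; the real trait would use
// DataFusion's Session, Expr, ExecutionPlan, and Result types.
struct Session;
struct Expr;
struct ExecutionPlan;
type Result<T> = std::result::Result<T, String>;

trait TableProviderDeep {
    // Mirrors scan(), but additionally receives the deep projection map.
    fn scan_deep(
        &self,
        state: &Session,
        projection: Option<&Vec<usize>>,
        projection_deep: Option<&HashMap<usize, Vec<String>>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<ExecutionPlan>>;
}
```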

@github-actions bot added labels: logical-expr (Logical plan and expressions), optimizer (Optimizer rules), core (Core DataFusion crate) on Jul 31, 2024
@github-actions bot added labels: physical-expr (Physical Expressions), catalog (Related to the catalog crate), common (Related to common crate), proto (Related to proto crate) on Aug 23, 2024
@adragomir (Author)

Closing this for now, we will retest on main and reopen

@adragomir closed this Aug 29, 2024