Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOC] Supported ops page differentiate type supported in key vs passed along #2102

Closed
tgravescs opened this issue Apr 8, 2021 · 4 comments · Fixed by #3183
Closed

[DOC] Supported ops page differentiate type supported in key vs passed along #2102

tgravescs opened this issue Apr 8, 2021 · 4 comments · Fixed by #3183
Assignees
Labels
documentation Improvements or additions to documentation

Comments

@tgravescs
Copy link
Collaborator

Report incorrect documentation

The supported ops page doesn't seem to indicate if a type is supposed as the main key (like joining on) vs just being able to be passed along as another columns. Perhaps we should add a note for this.

@tgravescs tgravescs added documentation Improvements or additions to documentation ? - Needs Triage Need team to review and classify labels Apr 8, 2021
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Apr 13, 2021
@andygrove andygrove self-assigned this Jul 22, 2021
@andygrove andygrove added this to the July 19 - July 30 milestone Jul 22, 2021
@andygrove
Copy link
Contributor

PR #3011 adds this information for types that cannot be usued as join keys in sort-merge, hash, and broadcast joins.

@tgravescs is that sufficient to address this issue or are there other types that need clarification?

@tgravescs
Copy link
Collaborator Author

so that sounds sufficient for joins but I wonder about other operations like hashAggregate, sorting, union, etc...

@revans2
Copy link
Collaborator

revans2 commented Jul 26, 2021

We can do the same thing for all of them. It is mostly a question of should the framework have a more generic way to do this, or should we add in one off solutions.

SparkPlan/Scala gives us a productIterator that we might be able to use to filter it to just expressions/sequences of expressions and then match those up with a pattern, similar to what we do when matching parameters to an Expression. That to me feels like a fairly generic way to make this work. But it is a lot of code compared to what is happening in #3011 along with the already existing check in the code.

    val keyDataTypes = (leftKeys ++ rightKeys).map(_.dataType)
    if (keyDataTypes.exists(dType =>
      dType.isInstanceOf[ArrayType] || dType.isInstanceOf[MapType])) {
      meta.willNotWorkOnGpu("ArrayType or MapType in join keys are not supported")
    }

So The difference would be something like

BroadcastHashJoinExec Implementation of join using broadcast data None S ... S NS NS PS* (Cannot be used as join key; missing nested BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT) PS* (Cannot be used as join key; missing nested BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT) PS* (Cannot be used as join key; missing nested BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT) NS

vs

BroadcastHashJoinExec Implementation of join using broadcast data LeftKeys None S ... S NS NS NS NS NS NS
BroadcastHashJoinExec Implementation of join using broadcast data RightKeys None S ... S NS NS NS NS NS NS
BroadcastHashJoinExec Implementation of join using broadcast data Output None S ... S NS NS PS* (missing nested BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT) PS* (missing nested BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT) PS* (missing nested BINARY, CALENDAR, ARRAY, MAP, STRUCT, UDT) NS

@sameerz
Copy link
Collaborator

sameerz commented Jul 28, 2021

The latter / more generic approach that @revans2 mentions above seems like the better way to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants