-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: normalize the join types #662
feat: normalize the join types #662
Conversation
e344aeb
to
6d85e06
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm generally ok with this but we could clean the language up a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think that we just need to update description of single input. I proposed a description that will hopefully be correct.
| Anti | Return records from the left input. These are returned only if the records do not have a join partner on the right side. | | ||
| Single | Returns one join partner per entry on the left input. If more than one join partner exists, there are two valid semantics. 1) Only the first match is returned. 2) The system throws an error. If there is no match between the left and right inputs, NULL is returned. | | ||
| Right Anti | Return records from the right input. These are returned only if the records do not have a join partner on the left side. | | ||
| Single | Return all records from the left input. For each cross input match, return a record including the data from both sides. For any remaining non-matching records from the left input, return the left record along with nulls for the right input. If a record from the left input has more than one match in the right input then an error must be raised. Useful for nested sub-queries where we need exactly one record in output (or throw exception). See Section 3.2 of https://15721.courses.cs.cmu.edu/spring2018/papers/16-optimizer2/hyperjoins-btw2017.pdf for more information. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Single | Return all records from the left input. For each cross input match, return a record including the data from both sides. For any remaining non-matching records from the left input, return the left record along with nulls for the right input. If a record from the left input has more than one match in the right input then an error must be raised. Useful for nested sub-queries where we need exactly one record in output (or throw exception). See Section 3.2 of https://15721.courses.cs.cmu.edu/spring2018/papers/16-optimizer2/hyperjoins-btw2017.pdf for more information. | | |
| Single | Return all records from the primary (left) input with no join expansion. If at least one record from the secondary (right) input matches the primary, return one arbitrary matching record from the secondary (right) input. For any primary (left) records without matching secondary (right) records, return the primary (left) record along with nulls for the secondary (right) input. Similar to a left outer join but only returns one right match at most. Useful for nested sub-queries where we need exactly one record in output (or throw exception). See Section 3.2 of https://15721.courses.cs.cmu.edu/spring2018/papers/16-optimizer2/hyperjoins-btw2017.pdf for more information. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with this. Not a huge fan of introducing new terminology (primary, secondary) and its confusing to me how this actually solves the nested subquery problem (since it doesn't raise an error) but I agree that we should document how engines are actually working when they are in disagreement with the paper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I introduced it since the other one referred to primary but it wasn't defined above.
The more I think about this I think we should call this Left Single and Right Single. That would be more consistent with other names. Having Single and Right Single is just weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it should be left single and right single. But for the purposes of this PR I just want to normalize the types available in the two join types (essentially a non-breaking change). A rename could break code so I would rather accomplish that in a later step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A rename could break code so I would rather accomplish that in a later step.
I'm inclined to fix it now while we're already fixing the other stuff instead of punting it. If we're changing the definition of what SINGLE means (since it was vague before), that's basically a breaking spec change even if the protobuf change itself isn't breaking.
Who wants to spend the time to come back and fix that? If we need to fix something, I'd prefer to just fix it while we're in that exact code already.
Co-authored-by: Jacques Nadeau <jacquesnadeau@gmail.com>
I've updated the PR to rename SINGLE to LEFT_SINGLE and have removed primary/secondary from the text. There's still a looming breaking change in the future which is replacing all of the enums with a single version with the same values but that's best handled in a major release with a number of other major breaking changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor change around primary wording.
also, would add to the breaking change that the change is only breaking at the name of the field/json plans. Protobuf plans will continue to be cross compatible.
| Anti | Return records from the left input. These are returned only if the records do not have a join partner on the right side. | | ||
| Single | Returns one join partner per entry on the left input. If more than one join partner exists, there are two valid semantics. 1) Only the first match is returned. 2) The system throws an error. If there is no match between the left and right inputs, NULL is returned. | | ||
| Right Anti | Return records from the right input. These are returned only if the records do not have a join partner on the left side. | | ||
| Left Single | Return all records from the left input with no join expansion. If at least one record from the right input matches the primary, return one arbitrary matching record from the right input. For any left records without matching right records, return the primary (left) record along with nulls for the right input. Similar to a left outer join but only returns one right match at most. Useful for nested sub-queries where we need exactly one record in output (or throw exception). See Section 3.2 of https://15721.courses.cs.cmu.edu/spring2018/papers/16-optimizer2/hyperjoins-btw2017.pdf for more information. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missed changing some of the primary references.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed those. Didn't touch the original use of primary in the set portion of the documentation.
| Single | Returns one join partner per entry on the left input. If more than one join partner exists, there are two valid semantics. 1) Only the first match is returned. 2) The system throws an error. If there is no match between the left and right inputs, NULL is returned. | | ||
| Type | Description | | ||
| ----- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| Inner | Return records from the left side only if they match the right side. Return records from the right side only when they match the left side. For each cross input match, return a record including the data from both sides. Non-matching records are ignored. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you avoid the reformatting. makes it really hard to see what is changed in diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, also renamed ANTI and SEMI to LEFT_ANTI and LEFT_SEMI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for pushing this along @EpsilonPrime !
There are two groups of join types in the definition with differing enums.
This PR leaves JoinRel's SEMI and ANTI as the canonical names for
LEFT_SEMI and LEFT_ANTI. Aliases are not allowed due to JSON (and
text) serialization behavior.
This PR also adds RIGHT_SEMI and RIGHT_ANTI to JoinRel's JoinType.
RIGHT_SINGLE is added to all types. The PR correspondingly renames
SINGLE to LEFT_SINGLE, ANTI TO LEFT_ANTI, and SEMI to LEFT_SEMI.
Finally this PR adds LEFT_SINGLE to all of the other join types.
BREAKING CHANGE: JoinRel's type enum now has LEFT_SINGLE
instead of SINGLE. Similarly there is now LEFT_ANTI and LEFT_SEMI.
Other values are available in all join type enums. This affects JSON and
text formats only (binary plans -- the interoperable part of Substrait --
will still be compatible before and after this change).