Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: normalize the join types #662

Merged
merged 12 commits into from
Aug 8, 2024

Conversation

EpsilonPrime
Copy link
Member

@EpsilonPrime EpsilonPrime commented Jul 11, 2024

There are two groups of join types in the definition with differing enums.

This PR leaves JoinRel's SEMI and ANTI as the canonical names for
LEFT_SEMI and LEFT_ANTI. Aliases are not allowed due to JSON (and
text) serialization behavior.

This PR also adds RIGHT_SEMI and RIGHT_ANTI to JoinRel's JoinType.

RIGHT_SINGLE is added to all types. The PR correspondingly renames
SINGLE to LEFT_SINGLE, ANTI TO LEFT_ANTI, and SEMI to LEFT_SEMI.
Finally this PR adds LEFT_SINGLE to all of the other join types.

BREAKING CHANGE: JoinRel's type enum now has LEFT_SINGLE
instead of SINGLE. Similarly there is now LEFT_ANTI and LEFT_SEMI.
Other values are available in all join type enums. This affects JSON and
text formats only (binary plans -- the interoperable part of Substrait --
will still be compatible before and after this change).

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally ok with this but we could clean the language up a bit.

site/docs/relations/logical_relations.md Show resolved Hide resolved
proto/substrait/algebra.proto Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think that we just need to update description of single input. I proposed a description that will hopefully be correct.

| Anti | Return records from the left input. These are returned only if the records do not have a join partner on the right side. |
| Single | Returns one join partner per entry on the left input. If more than one join partner exists, there are two valid semantics. 1) Only the first match is returned. 2) The system throws an error. If there is no match between the left and right inputs, NULL is returned. |
| Right Anti | Return records from the right input. These are returned only if the records do not have a join partner on the left side. |
| Single | Return all records from the left input. For each cross input match, return a record including the data from both sides. For any remaining non-matching records from the left input, return the left record along with nulls for the right input. If a record from the left input has more than one match in the right input then an error must be raised. Useful for nested sub-queries where we need exactly one record in output (or throw exception). See Section 3.2 of https://15721.courses.cs.cmu.edu/spring2018/papers/16-optimizer2/hyperjoins-btw2017.pdf for more information. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| Single | Return all records from the left input. For each cross input match, return a record including the data from both sides. For any remaining non-matching records from the left input, return the left record along with nulls for the right input. If a record from the left input has more than one match in the right input then an error must be raised. Useful for nested sub-queries where we need exactly one record in output (or throw exception). See Section 3.2 of https://15721.courses.cs.cmu.edu/spring2018/papers/16-optimizer2/hyperjoins-btw2017.pdf for more information. |
| Single | Return all records from the primary (left) input with no join expansion. If at least one record from the secondary (right) input matches the primary, return one arbitrary matching record from the secondary (right) input. For any primary (left) records without matching secondary (right) records, return the primary (left) record along with nulls for the secondary (right) input. Similar to a left outer join but only returns one right match at most. Useful for nested sub-queries where we need exactly one record in output (or throw exception). See Section 3.2 of https://15721.courses.cs.cmu.edu/spring2018/papers/16-optimizer2/hyperjoins-btw2017.pdf for more information. |

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with this. Not a huge fan of introducing new terminology (primary, secondary) and its confusing to me how this actually solves the nested subquery problem (since it doesn't raise an error) but I agree that we should document how engines are actually working when they are in disagreement with the paper.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I introduced it since the other one referred to primary but it wasn't defined above.

The more I think about this I think we should call this Left Single and Right Single. That would be more consistent with other names. Having Single and Right Single is just weird.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it should be left single and right single. But for the purposes of this PR I just want to normalize the types available in the two join types (essentially a non-breaking change). A rename could break code so I would rather accomplish that in a later step.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A rename could break code so I would rather accomplish that in a later step.

I'm inclined to fix it now while we're already fixing the other stuff instead of punting it. If we're changing the definition of what SINGLE means (since it was vague before), that's basically a breaking spec change even if the protobuf change itself isn't breaking.

Who wants to spend the time to come back and fix that? If we need to fix something, I'd prefer to just fix it while we're in that exact code already.

@jacques-n jacques-n added the awaiting-user-input This issue is waiting on further input from users label Aug 1, 2024
@EpsilonPrime
Copy link
Member Author

I've updated the PR to rename SINGLE to LEFT_SINGLE and have removed primary/secondary from the text.

There's still a looming breaking change in the future which is replacing all of the enums with a single version with the same values but that's best handled in a major release with a number of other major breaking changes.

Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor change around primary wording.

also, would add to the breaking change that the change is only breaking at the name of the field/json plans. Protobuf plans will continue to be cross compatible.

| Anti | Return records from the left input. These are returned only if the records do not have a join partner on the right side. |
| Single | Returns one join partner per entry on the left input. If more than one join partner exists, there are two valid semantics. 1) Only the first match is returned. 2) The system throws an error. If there is no match between the left and right inputs, NULL is returned. |
| Right Anti | Return records from the right input. These are returned only if the records do not have a join partner on the left side. |
| Left Single | Return all records from the left input with no join expansion. If at least one record from the right input matches the primary, return one arbitrary matching record from the right input. For any left records without matching right records, return the primary (left) record along with nulls for the right input. Similar to a left outer join but only returns one right match at most. Useful for nested sub-queries where we need exactly one record in output (or throw exception). See Section 3.2 of https://15721.courses.cs.cmu.edu/spring2018/papers/16-optimizer2/hyperjoins-btw2017.pdf for more information. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missed changing some of the primary references.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed those. Didn't touch the original use of primary in the set portion of the documentation.

@EpsilonPrime EpsilonPrime removed the awaiting-user-input This issue is waiting on further input from users label Aug 8, 2024
| Single | Returns one join partner per entry on the left input. If more than one join partner exists, there are two valid semantics. 1) Only the first match is returned. 2) The system throws an error. If there is no match between the left and right inputs, NULL is returned. |
| Type | Description |
| ----- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Inner | Return records from the left side only if they match the right side. Return records from the right side only when they match the left side. For each cross input match, return a record including the data from both sides. Non-matching records are ignored. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you avoid the reformatting. makes it really hard to see what is changed in diff.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, also renamed ANTI and SEMI to LEFT_ANTI and LEFT_SEMI.

Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for pushing this along @EpsilonPrime !

@jacques-n jacques-n merged commit bed84ec into substrait-io:main Aug 8, 2024
15 checks passed
@EpsilonPrime EpsilonPrime deleted the join_type_standardization branch September 26, 2024 04:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants