Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add ALL/DISTINCT modifier for all set operation types #708

Merged
merged 9 commits into from
Sep 27, 2024

Conversation

kadinrabo
Copy link
Contributor

@kadinrabo kadinrabo commented Sep 18, 2024

Today the SetRel spec contains Intersection and Minus supporting 1 or more secondary inputs, but it doesn’t consider distinct/all.

Union distinct/all were considered common enough operations to be included, but this PR is necessary to support sql intersect/except distinct/all without performing the set op and distinct/all as separate stages.

Eventually this would lead to the deprecation of union distinct/all. My initial inquiry

@CLAassistant
Copy link

CLAassistant commented Sep 18, 2024

CLA assistant check
All committers have signed the CLA.

Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The protobuf changes look good to me, but we should updated the documentation for SetRel as well

## Set Operation
The set operation encompasses several set-level operations that support combining datasets, possibly excluding records based on various types of record level matching.
| Signature | Value |
| -------------------- | ------------------------------------------------------------ |
| Inputs | 2 or more |
| Outputs | 1 |
| Property Maintenance | Maintains distribution if all inputs have the same ordinal distribution. Orderedness is not maintained. |
| Direct Output Order | The field order of the inputs. All inputs must have identical field *types*, but field nullabilities may vary. |
### Set Properties
| Property | Description | Required |
| ------------------ | --------------------------------- | --------------------- |
| Primary Input | The primary input of the dataset. | Required |
| Secondary Inputs | One or more relational inputs. | At least one required |
| Set Operation Type | From list below. | Required |
### Set Operation Types
The set operation type determines both the records that are emitted and the type of the output record.
| Property | Description | Output Nullability
| ----------------------- | ------------------------------------------------------------------------------------------------------------- | ----------------------------- |
| Minus (Primary) | Returns all records from the primary input excluding any matching records from secondary inputs. | The same as the primary input.
| Minus (Multiset) | Returns all records from the primary input excluding any records that are included in *all* secondary inputs. | The same as the primary input.
| Intersection (Primary) | Returns all records from the primary input that match at least one record from *any* secondary inputs. | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output.
| Intersection (Multiset) | Returns all records from the primary input that match at least one record from *all* secondary inputs. | If a field is required in any of the inputs, it is required in the output.
| Union Distinct | Returns all the records from each set, removing any rows that are duplicated (within or across sets). | If a field is nullable in any of the inputs, it is nullable in the output.
| Union All | Returns all records from each set, allowing duplicates. | If a field is nullable in any of the inputs, it is nullable in the output. |

@vbarua vbarua changed the title feat: add a SetMode option to the SetRel spec to allow encoding of distinct/all SQL modifiers feat: add ALL/DISTINCT modifier for all set operation types Sep 18, 2024
@jacques-n
Copy link
Contributor

A few thoughts:

  • I'm not a fan of refactoring where it isn't necessary (e.g. creating something that needs to impls to handle compatibility cases). In this case I would prefer to keep the existing enums and just add more. That means everything in existence already would work without changes (and no compat handling code).
  • Because MINUS/EXCEPT and INTERSECT in SQL are deduplicating by convention (unless specifically marked all), I believe the intention for the existing ones should be the same.
  • The spec update should drive the protobuf update. (What @vbarua said about lets make sure the spec is clear.)
  • I do wonder if we should define a Distinct relation, as well. Doesn't need to be addressed here but the Aggregate with all columns and no measures seems excessive.

@EpsilonPrime
Copy link
Member

Spark Connect puts the distinctness on the function instead of the relation:

https://github.com/apache/spark/blob/branch-3.5/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto#L236

@westonpace
Copy link
Member

westonpace commented Sep 19, 2024

Spark Connect puts the distinctness on the function instead of the relation:

We already have distinct on the function (

AggregationInvocation invocation = 6;
). I think @jacques-n point was more that a "distinct with no functions" is a common enough use case to warrant a relation.

Also, it doesn't help with set ops since they aren't functions.

Because MINUS/EXCEPT and INTERSECT in SQL are deduplicating by convention (unless specifically marked all), I believe the intention for the existing ones should be the same.

This would be good to document as it would have been a surprise to me (though I just double checked and that does match the postgres definition). The current description (as I interpret it) is NOT deduplicating (at least, it wouldn't deduplicate the primary). We should be make sure Substrait is clear enough to not require knowledge of the SQL spec.

Union distinct/all were considered common enough operations to be included, but this PR is necessary to support sql intersect/except distinct/all without performing the set op and distinct/all as separate stages.

What does INTERSECT ALL do in the presence of duplicates? It seems ill defined in postgres. Given the table:

a b
3 1
3 1
2 2
2 3
2 2

If I issue the query SELECT a FROM table INTERSECT ALL SELECT b from table I get:

RESULT
3
2
2

Why does 2 show up twice and 3 only shows up once? If I were to consider all combinations I think I would have two combinations of (3, 3) and six combinations of (2, 2).

(and indeed, the query SELECT t1.a, t2.b FROM table as t1 CROSS JOIN table as t2 WHERE t1.a = t2.b has 8 results)

@kadinrabo
Copy link
Contributor Author

Agree on keeping existing things working and less refactoring. It will be a pretty small change then

SET_OP_MINUS_DISTINCT = 7;
SET_OP_INTERSECTION_DISTINCT = 8;

Why does 2 show up twice and 3 only shows up once? If I were to consider all combinations I think I would have two combinations of (3, 3) and six combinations of (2, 2).

@westonpace You might be thinking of Cartesian product! I'm pretty sure for INTERSECT ALL, the number of times a value appears in the final output is determined by the column with the fewer occurrences of that value in both tables. It looks at all occurrences of the values in both tables, but the output is limited by the smallest number of matches between the two. For example for this:

a b
1 1
3 1
2 2
2 3
2 2

we would get:

RESULT
1
3
2
2

because although 1 has two combinations the minimum times it appears in both tables is once so only one 1 in the output. For distinct, duplicates are just removed after:

RESULT
1
3
2

@westonpace
Copy link
Member

It looks at all occurrences of the values in both tables, but the output is limited by the smallest number of matches between the two

Thanks, I agree that matches my testing. Let's make sure to include that in the description when the spec part is updated. I appreciate the explanation!

@vbarua
Copy link
Member

vbarua commented Sep 19, 2024

Because MINUS/EXCEPT and INTERSECT in SQL are deduplicating by convention (unless specifically marked all), I believe the intention for the existing ones should be the same.

Like for @westonpace, this is was also a surprise for me based on my reading of the Substrait spec. For example for Minus (Primary)

Returns all records from the primary input excluding any matching records from secondary inputs.

I assumed all records here included dupes (i.e ALL not DISTINCT).

If we want to treat SET_OP_MINUS_PRIMARY, SET_OP_MINUS_MULTISET, SET_OP_INTERSECTION_PRIMARY and SET_OP_INTERSECTION_MULTISET as deduplicating, that sounds reasonable to me but we should make it more explicit in the spec because as written I don't think that's how they behave.

@kadinrabo TIL about the INTERSECT ALL behaviour. Per the spec as you indicated:
https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt

            b) If a set operator is specified, then the result of applying
              the set operator is a table containing the following rows:

              i) Let R be a row that is a duplicate of some row in T1 or of
                 some row in T2 or both. Let m be the number of duplicates
                 of R in T1 and let n be the number of duplicates of R in
                 T2, where m � 0 and n � 0.

            ...            

            iii) If ALL is specified, then

                 Case:

                 1) If UNION is specified, then the number of duplicates of
                   R that T contains is (m + n).

                 2) If EXCEPT is specified, then the number of duplicates of
                   R that T contains is the maximum of (m - n) and 0.

                 3) If INTERSECT is specified, then the number of duplicates
                   of R that T contains is the minimum of m and n.

we should probably make sure the spec docs match the EXCEPT ALL and INTERSECT ALL behaviour described here.

In this case I would prefer to keep the existing enums and just add more.

@jacques-n that sounds reasonable. I was hoping to avoid a future were we end up with something like:

  enum SetOp {
    SET_OP_UNSPECIFIED = 0;
    SET_OP_MINUS_PRIMARY = 1;
    SET_OP_MINUS_PRIMARY_ALL = ???;
    SET_OP_MINUS_MULTISET = 2;
    SET_OP_MINUS_MULTISET_ALL = ???;
    SET_OP_INTERSECTION_PRIMARY = 3;
    SET_OP_INTERSECTION_PRIMARY_ALL = ???;
    SET_OP_INTERSECTION_MULTISET = 4;
    SET_OP_INTERSECTION_MULTISET_ALL = ???;
    SET_OP_UNION_DISTINCT = 5;
    SET_OP_UNION_ALL = 6;
  }

with two enum values per set operation, but that may not be a huge deal and also means we can defer defining the meaning of SET_OP_MINUS_PRIMARY_ALL and SET_OP_INTERSECTION_MULTISET_ALL to a future time.

@@ -338,6 +338,8 @@ message SetRel {
SET_OP_INTERSECTION_MULTISET = 4;
SET_OP_UNION_DISTINCT = 5;
SET_OP_UNION_ALL = 6;
SET_OP_MINUS_DISTINCT = 7;
SET_OP_INTERSECTION_DISTINCT = 8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to me, the changes here and below are the opposite of what we were saying.

If default is deduplicating, then you should be introducing MINUS ALL, not MINUS DISTINCT, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defining it how @vbarua suggested makes most sense to me too I was just trying to minimize the effect of my changes because I also initially interpreted Minus (Primary) to include duplicates according to the current spec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm 👍 for treating SET_OP_MINUS_PRIMARY, SET_OP_MINUS_MULTISET, SET_OP_INTERSECTION_PRIMARY and SET_OP_INTERSECTION_MULTISET as deduplicating that makes more sense although it's longer

@jacques-n
Copy link
Contributor

@jacques-n point was more that a "distinct with no functions" is a common enough use case to warrant a relation

Yep

We should be make sure Substrait is clear enough to not require knowledge of the SQL spec.

Agreed

@jacques-n
Copy link
Contributor

I was hoping to avoid a future were we end up with something like... [long list ensues]

@vbarua , if we were starting from scratch I would agree. Since we already have a bunch of people writing this, I'm inclined to stick with the current pattern. Besides, enums are cheap, right? :D

@kadinrabo
Copy link
Contributor Author

I added the new enum values at the end to keep the existing ones unchanged for people already referencing them

Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check something, does it make sense for our versions of INTERSECT ALL and MINUS ALL to follow from the SQL spec?

cc: @jacques-n @westonpace @EpsilonPrime

Intersection All looks at all occurrences of values in both tables, but the output is limited by the smallest number of matches between the two:
```
{1, 3, 2, 2, 2} INTERSECTION ALL {1, 1, 2, 3, 2} === {1, 3, 2, 2}
{1, 3, 2, 2, 2} INTERSECTION DISTINCT {1, 1, 2, 3, 2} === {1, 3, 2}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion this should be inlined into the description, and we should also have updated descriptions for

  • Minus All
  • Minus Multiset All
  • Intersection All
  • Intersection Multiset All

Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks really good. I noticed one small bit of weirdness around the INTERSECTION ops. Left a big comment about it, but would also be interested to see if @jacques-n @westonpace @EpsilonPrime have any additional input on it.

site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved
site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved
| Minus (Primary) | Returns all records from the primary input excluding any matching rows from secondary inputs, removing duplicate rows within or across sets.<br/>Each value is treated as a unique member of the set, so duplicates in the first set don’t affect the result. | MINUS<br/>&nbsp;&nbsp;m: {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2}<br/>&nbsp;&nbsp;n2: {3}<br/>YIELDS<br/>{4} | The same as the primary input.
| Minus (Primary All) | Returns all records from the primary input excluding any matching records from secondary inputs.<br/>For each specific record returned, the output contains max(0, m - sum(n1, n2, …, n)) copies. | MINUS ALL<br/>&nbsp;&nbsp;m: {1, 2, 2, 3, 3, 3, 3}<br/>&nbsp;&nbsp;n1: {1, 2, 3, 4}<br/>&nbsp;&nbsp;n2: {3}<br/>YIELDS<br/>{2, 3, 3} | The same as the primary input.
| Minus (Multiset) | Returns all records from the primary input excluding any records that are included in *all* secondary inputs. | MINUS MULTISET<br/>&nbsp;&nbsp;m: {1, 2, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2}<br/>&nbsp;&nbsp;n2: {1, 2, 3}<br/>YIELDS<br/>{4} | The same as the primary input.
| Intersection (Primary) | Returns all records from the primary input that are present in every secondary input, removing duplicate rows within or across sets. | INTERSECT<br/>&nbsp;&nbsp;m: {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2, 3, 5}<br/>&nbsp;&nbsp;n2: {2, 3, 6}<br/>YIELDS<br/>{2, 3} | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns all records from the primary input that are present in every secondary input.

This used to be any secondary input and not every secondary input, so this changes the definition beyond just making deduplication explicit.

I'm also realising that while

<A> INTERSECT <B>

can map to SET_OP_INTERSECTION_PRIMARY

<A> INTERSECT <B> INTERSECT <C>`

doesn't, because the behaviour of chained intersects like this is consistent with SET_OP_INTERSECTION_MULTISET. SQL INTERSECT should generally map to this.

Taking this further,

<A> INTERSECT ALL <B> INTERSECT ALL <C>

would need to be mapped to something like

SET_OP_INTERSECTION_MULTISET_ALL

instead of

SET_OP_INTERSECTION_PRIMARY_ALL

I think it might make sense to define

SET_OP_INTERSECTION_MULTISET_ALL

and not

SET_OP_INTERSECTION_PRIMARY_ALL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the new definition of Intersection (Primary) is incorrect.

Why aren't we adding both SET_OP_INTERSECTION_MULTISET_ALL and SET_OP_INTERSECTION_PRIMARY_ALL

I agree with your logic that, if we are adding only one, MULTISET_ALL makes more sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW: Ah, I think I see now, the duplicate handling behavior of SET_OP_INTERSECTION_PRIMARY_ALL is ambiguous. I could interpret it as:

min(m, max(n1, n2, ..., n))

or...

min(m, sum(n1, n2, ..., n))

Copy link
Member

@vbarua vbarua Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to not handle SET_OP_INTERSECTION_PRIMARY_ALL as part of this PR because I think figuring out which of those 2 behaviours is the "correct" one seems like it's own thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree but to confirm: the only new additions will be SET_OP_MINUS_PRIMARY_ALL and SET_OP_INTERSECTION_MULTISET_ALL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree but to confirm: the only new additions will be SET_OP_MINUS_PRIMARY_ALL and SET_OP_INTERSECTION_MULTISET_ALL

Yes, we should leave the other variants for their own PR (if anyone ends up needing/wanting them).

site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved
site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @vbarua 's logic. Let's fix the definition of Intersection (Primary) and change Intersection (Primary All) into Intersection (Multiset All).

site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved
| Minus (Primary) | Returns all records from the primary input excluding any matching rows from secondary inputs, removing duplicate rows within or across sets.<br/>Each value is treated as a unique member of the set, so duplicates in the first set don’t affect the result. | MINUS<br/>&nbsp;&nbsp;m: {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2}<br/>&nbsp;&nbsp;n2: {3}<br/>YIELDS<br/>{4} | The same as the primary input.
| Minus (Primary All) | Returns all records from the primary input excluding any matching records from secondary inputs.<br/>For each specific record returned, the output contains max(0, m - sum(n1, n2, …, n)) copies. | MINUS ALL<br/>&nbsp;&nbsp;m: {1, 2, 2, 3, 3, 3, 3}<br/>&nbsp;&nbsp;n1: {1, 2, 3, 4}<br/>&nbsp;&nbsp;n2: {3}<br/>YIELDS<br/>{2, 3, 3} | The same as the primary input.
| Minus (Multiset) | Returns all records from the primary input excluding any records that are included in *all* secondary inputs. | MINUS MULTISET<br/>&nbsp;&nbsp;m: {1, 2, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2}<br/>&nbsp;&nbsp;n2: {1, 2, 3}<br/>YIELDS<br/>{4} | The same as the primary input.
| Intersection (Primary) | Returns all records from the primary input that are present in every secondary input, removing duplicate rows within or across sets. | INTERSECT<br/>&nbsp;&nbsp;m: {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2, 3, 5}<br/>&nbsp;&nbsp;n2: {2, 3, 6}<br/>YIELDS<br/>{2, 3} | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the new definition of Intersection (Primary) is incorrect.

Why aren't we adding both SET_OP_INTERSECTION_MULTISET_ALL and SET_OP_INTERSECTION_PRIMARY_ALL

I agree with your logic that, if we are adding only one, MULTISET_ALL makes more sense.

site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved
| Minus (Primary) | Returns all records from the primary input excluding any matching rows from secondary inputs, removing duplicate rows within or across sets.<br/>Each value is treated as a unique member of the set, so duplicates in the first set don’t affect the result. | MINUS<br/>&nbsp;&nbsp;m: {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2}<br/>&nbsp;&nbsp;n2: {3}<br/>YIELDS<br/>{4} | The same as the primary input.
| Minus (Primary All) | Returns all records from the primary input excluding any matching records from secondary inputs.<br/>For each specific record returned, the output contains max(0, m - sum(n1, n2, …, n)) copies. | MINUS ALL<br/>&nbsp;&nbsp;m: {1, 2, 2, 3, 3, 3, 3}<br/>&nbsp;&nbsp;n1: {1, 2, 3, 4}<br/>&nbsp;&nbsp;n2: {3}<br/>YIELDS<br/>{2, 3, 3} | The same as the primary input.
| Minus (Multiset) | Returns all records from the primary input excluding any records that are included in *all* secondary inputs. | MINUS MULTISET<br/>&nbsp;&nbsp;m: {1, 2, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2}<br/>&nbsp;&nbsp;n2: {1, 2, 3}<br/>YIELDS<br/>{4} | The same as the primary input.
| Intersection (Primary) | Returns all records from the primary input that are present in every secondary input, removing duplicate rows within or across sets. | INTERSECT<br/>&nbsp;&nbsp;m: {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2, 3, 5}<br/>&nbsp;&nbsp;n2: {2, 3, 6}<br/>YIELDS<br/>{2, 3} | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW: Ah, I think I see now, the duplicate handling behavior of SET_OP_INTERSECTION_PRIMARY_ALL is ambiguous. I could interpret it as:

min(m, max(n1, n2, ..., n))

or...

min(m, sum(n1, n2, ..., n))

vbarua
vbarua previously approved these changes Sep 27, 2024
Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments and one nullability change, but otherwise this looks good to me.

vbarua
vbarua previously approved these changes Sep 27, 2024
Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick updated @kadinrabo

westonpace
westonpace previously approved these changes Sep 27, 2024
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small nit but otherwise looks good. Thanks for doing this!

site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved
Co-authored-by: Weston Pace <weston.pace@gmail.com>
@kadinrabo kadinrabo dismissed stale reviews from westonpace and vbarua via 63f5e16 September 27, 2024 20:17
Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks for the helpful documentation!

@vbarua vbarua merged commit f796521 into substrait-io:main Sep 27, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants