Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Precise definition of equality of grouping columns in the aggregate relation #700

Open
ingomueller-net opened this issue Sep 5, 2024 · 10 comments

Comments

@ingomueller-net
Copy link
Contributor

This issue summarizes two threads in the Slack channel, which will be behind a paywall at some point.

The aggregate relation uses a definition of expression equality that affects its semantics, including its output schema. The current definition is ambiguous and based on protobuf message equality. The first is a problem because it makes the precise semantics of the relation ambiguous; the second is problematic because the protobuf definition should be derived from the spec and not the other way around.

This is the relevant part of the spec (emphasis mine):

It’s possible to specify multiple grouping sets in a single aggregate operation. The grouping sets behave more or less independently, with each returned record belonging to one of the grouping sets. The values for the grouping expression columns that are not part of the grouping set for a particular record will be set to null. Two grouping expressions will be returned using the same column if the protobuf messages describing the expressions are equal.

To elaborate the problem, let's consider an example. Let's say we have the following three grouping sets, where a and b are two arbitrary but different expressions:

(a, b)
(a)
(b)

Then the set of grouping expressions will be (a, b) and the output schema will have a column for each each of a and b.

The challenge arises during the deserialization of the protobuf messages. At that point, we do not have a and b yet, but four different messages:

(m1, m2)
(m3)
(m4)

In order to get to the representation above, we need to establish that m1 == m3, m2 == m4, and !(m1 == m2). For different definitions of ==, we may either come to the interpretation above, i.e., that the grouping set expressions is (m1, m2), or any other subset of (m1, m2, m3, m4), most of which have different semantics and, thus, break the common understanding between systems that don't agree on ==.

The current formulation of the spec allows for at least the following interpretations:

  • The protobuf messages are considered equal if their underlying buffers are byte-wise equal.
    • This doesn't make much sense since the fields of a protobuf can be serialized in any order.
  • They are considered equal if their field values are recursively equal. This is what MessageDifferencer implements.
    There are still multiple questions with this approach:
    • Are the advanced_extensions fields of relations relevant for equality?
    • Are deprecated fields part of equality if the newer fields are present?
    • Are type variations relevant for equality?
    • Are function references equal if they have different anchors but the two referred to extensions are equivalent?
  • They are considered equal if the expressions the deserialize into are equal ("the same set of expressions with the same arguments for each").
    • I believe that many of the questions of the previous approach still hold.

Note that expressions may contain sub-queries so pretty much all of the spec is affected and we have to come up with a definition that works unambiguously for all of that.

For reference, #92/#100 introduced the support for arbitrary grouping expressions instead of just field references. This simplified producers because it allowed them to store the grouping expressions directly in the aggregate relation rather than having to emit a project relation with the expressions and refer to those. The spec was then updated in #258/#260 to reflect the grouping expressions and introduced the current definition of equality.

@ingomueller-net
Copy link
Contributor Author

One solution is to come up with a more precise definition of expressions equality. That may be the best trade-off eventually.

However, I have the feeling that that is relatively brittle and can be difficult to implement (at least for consumers, see below). I thus proposed an alternative design to the aggregate relation on Slack that I think we should consider. The idea is to avoid repeating equal expressions altogether and express the grouping sets as references into a list of grouping expressions.

The relation would then look something like this:

message AggregateRel {
  // ...

  // A list of one or more grouping expression sets that the aggregation measures should be calculated for.
  // Required if there are no measures.
  repeated Grouping groupings = 3;

  // The grouping expressions that the grouping sets refer to.
  repeated Expression grouping_expressions = 5;   // <--- this is new

  // ...

  message Grouping {
    // References into `AggregateRel.grouping_expressions`.
    repeated int32 grouping_expressions = 1;  // <--- this is now a reference instead of an `Expression`
  }

  // ...
}

This avoids the definition of equality and, as a minor added benefit, compresses the plan in case of multiple grouping sets by avoiding redundant copies of grouping expressions.

However, there are also a couple of downsides:

  • This would be a breaking change.
  • It may be cumbersome for systems to implement this representation.
    Personally, I am not really convinced of this argument yet. Anyways, I think it's helpful to look at producers and consumers separately:
    • Producers may need to do the deduplication on their end with the definition. I understand that that may be more work on their end but I think it's fair to argue that it's their job to decide which of the expressions are equal.
    • Consumers, or at least engines, on the other hand, need to establish expression equality eventually because it's part of the semantics of the operation. I can't see why providing that equality makes their life any harder; in fact, I expect the opposite. Among possibly other systems, DataFusion parses grouping sets from Substrait into such a representation right away.

@jacques-n
Copy link
Contributor

It may be cumbersome for systems to implement this representation.

This was probably from a comment I made on Slack. I made the comment largely because most implementers start with a single grouping set (the most common case). Additionally, a lot of newer implementers might not even be aware of the concept of grouping sets. Having them normalize expressions in this way would be surprising for them. I don't think that is a showstopper but I thought it worth noting.

@jacques-n
Copy link
Contributor

I agree we need to be more specific on this. I don't think we need to have the most generous definition of equality. So to concretely answer your questions:

They are considered equal if their field values are recursively equal. This is what [MessageDifferencer]...

Are the advanced_extensions fields of relations relevant for equality?
yes. must be equal.
Are deprecated fields part of equality if the newer fields are present?
yes. must be equal
Are type variations relevant for equality?
yes. must be equal.
Are function references equal if they have different anchors but the two referred to extensions are equivalent?
no. must be same anchor.

Maybe this "simple" definition of equality may be enough for the next several years?...

@ingomueller-net
Copy link
Contributor Author

However, there are also a couple of downsides:
[...]

  • It may be cumbersome for systems to implement this representation.
    Personally, I am not really convinced of this argument yet. [...]

One argument that came up on Slack (which seems to go into the direction of @jacques-n's comment) is that Substraits high-level philosophy is to "allow for dumb consumers" (see this slide deck). In the context of the current issue, one could argue that sparing consumers to deduplicate expressions would allow to make them simpler, and that, if we have to decide whether producers or consumers need to do that, we'd rather give the producers this responsibility.

@ingomueller-net
Copy link
Contributor Author

I agree we need to be more specific on this. I don't think we need to have the most generous definition of equality. So to concretely answer your questions:
[...]

Are the advanced_extensions fields of relations relevant for equality?
yes. must be equal.

This may be more complex than it appears on a first look. A first (solvable) issue is the fact that part of AdvancedExtension is explicitly optional. I think it would be surprising that a performance optimization that can otherwise be ignored influences the number of output column of a relation.

message AdvancedExtension {
  // An optimization is helpful information that don't influence semantics. May
  // be ignored by a consumer.
  repeated google.protobuf.Any optimization = 1;
  // ...
}

The more complex issue appears here and in quite a number of other places I hadn't previously thought of: how do we compare two protobuf.Any messages? These are not part of the spec so it seems like we have to resort to one variant or the other of "message equality" -- with all the ambiguities mentioned above.

I looked a bit into MessageDifferencer handles this case. I believe it can't do what we are looking for. It is somehow able to compare messages of unknown message types but it either (1) considers all unknown fields different (if in EQUIVALENT mode), in which case even two byte-wise equal messages would be considered different, or (2) (if not using Equivalent, considers two messages different if one uses the implicit default value and the other one sets the same value explicitly, which we almost certainly also don't want.

I hate to be a pain in the neck here but I hope it is in everybody's interest to bring up this issue now rather than running into incompatibilities later on, when the spec may be widely adopted and considered final.

@ingomueller-net
Copy link
Contributor Author

I thought about this a bit more. The gravity of this problem is somewhat lower than I thought because, in many cases where Protobuf.Any messages are used, the consumer cannot understand the plan anyways. Those include.

  • ReadRel.ExtensionTable.detail
  • ReadRel.LocalFiles.FileOrFiles.file_format.extension
  • ExtensionSingleRel.detail
  • ExtensionLeafRel.detail
  • ExtensionMultiRel.detail
  • ExtensionObject.detail
  • Literal.UserDefined.value

However, there are a few instances where it would actually be nice if consumers not understanding the Any messages could just move them around as blackboxes. Those include:

  • ExchangeRel.ExchangeTarget.target_type.extended
  • AdvancedExtension.optimization
  • (One could argue that the whole previous list should be included here: there may be cases where a consumer may want to translate a plan into its internal representation even though it doesn't understand some parts of it.)

TL;DR: We can probably classify each of these into "can be ignored" and "cannot be ignored" but I maintain my argument that this makes consumers more complex than necessary.

@jacques-n
Copy link
Contributor

jacques-n commented Sep 6, 2024

"allow for dumb consumers"...

Yeah, good point.

One of the painful things here is that I don't see any clean way to support backwards compatibility if we want to move to the new representation. We might as well just introduce AggregatelRel2 or something as repeated fields don't work well with oneofs. (It begs the question of whether moving forward we might want to always wrap repeated fields in a message to avoid this kind of compatibility pain.) I think a lot of systems have implemented AggregatelRel already so I don't think we can do a hard breaking changes here like we might be able to do on edge cases that are rarely implemented.

@EpsilonPrime
Copy link
Member

EpsilonPrime commented Sep 7, 2024

We could repurpose groupings to be the ordered list of grouping expressions that we refer to and add a GroupingDetail? message that (if present) is the new grouping information with indexes to the expressions. It would be ugly to require that the groupings only contain one expression though.

@jacques-n
Copy link
Contributor

We could repurpose groupings to be the ordered list of grouping expressions that we refer to and add a GroupingDetail? message that (if present) is the new grouping information with indexes to the expressions. It would be ugly to require that the groupings only contain one expression though.

It'd be awkward since groupings is a two dimensional array of expressions and we need a one dimensional array.

ingomueller-net added a commit to ingomueller-net/substrait that referenced this issue Sep 13, 2024
This PR changes the definition of grouping sets in `AggregateRel` to
consist of references into a list of grouping expressions instead of
consisting of expressions directly. As discussed in more detail in substrait-io#700,
the previous version is problematic because it requires consumers to
deduplicate these expressions, which, in turn, requires to parse and
understand 100% of these expression even in cases where that
understanding is otherwise optional. The new version avoids that problem
and, thus, allows consumers to be simpler.

Signed-off-by: Ingo Müller <ingomueller@google.com>
ingomueller-net added a commit to ingomueller-net/substrait that referenced this issue Sep 13, 2024
BREAKING CHANGE: This PR changes the definition of grouping sets in
`AggregateRel` to consist of references into a list of grouping
expressions instead of consisting of expressions directly. As discussed
in more detail in substrait-io#700, the previous version is problematic because it
requires consumers to deduplicate these expressions, which, in turn,
requires to parse and understand 100% of these expression even in cases
where that understanding is otherwise optional. The new version avoids
that problem and, thus, allows consumers to be simpler.

Signed-off-by: Ingo Müller <ingomueller@google.com>
@ingomueller-net
Copy link
Contributor Author

There seems to be an general openess to following my proposal -- if there are still arguments against it please let me know. I thus went ahead and made a concrete proposal: #706. I think it avoids the problem that @EpsilonPrime and @jacques-n have been discussing.

ingomueller-net added a commit to ingomueller-net/substrait that referenced this issue Sep 13, 2024
BREAKING CHANGE: This PR changes the definition of grouping sets in
`AggregateRel` to consist of references into a list of grouping
expressions instead of consisting of expressions directly. As discussed
in more detail in substrait-io#700, the previous version is problematic because it
requires consumers to deduplicate these expressions, which, in turn,
requires to parse and understand 100% of these expression even in cases
where that understanding is otherwise optional. The new version avoids
that problem and, thus, allows consumers to be simpler.

Signed-off-by: Ingo Müller <ingomueller@google.com>
ingomueller-net added a commit to ingomueller-net/substrait that referenced this issue Sep 23, 2024
BREAKING CHANGE: This PR changes the definition of grouping sets in
`AggregateRel` to consist of references into a list of grouping
expressions instead of consisting of expressions directly. As discussed
in more detail in substrait-io#700, the previous version is problematic because it
requires consumers to deduplicate these expressions, which, in turn,
requires to parse and understand 100% of these expression even in cases
where that understanding is otherwise optional. The new version avoids
that problem and, thus, allows consumers to be simpler.

Signed-off-by: Ingo Müller <ingomueller@google.com>
jacques-n pushed a commit that referenced this issue Sep 27, 2024
BREAKING CHANGE: This PR changes the definition of grouping sets in
`AggregateRel` to consist of references into a list of grouping
expressions instead of consisting of expressions directly.

With the previous definition, consumers had to deduplicate the
expressions in the grouping sets in order to execute the query or even
derive the output schema (which is problematic, as explained below).
With this change, the responsibility of deduplicating expressions is now
on the producer. Concretely, consumers are now expected to be simpler:
The list of grouping expressions immediately provides the information
needed to derive the output schema and the list of grouping sets
explicitly and unambiguously provides the equality of grouping
expressions. Producers now have to specify the grouping sets explicitly.
If their internal representation of grouping sets consists of full
grouping expressions (rather than references), then they must
deduplicate these expressions according to their internal notion of
expression equality in order to produce grouping sets consisting of
references to these deduplicated expressions.

If the previous format is desired, it can be obtained from the new
format by (1) deduplicating the grouping expressions (according to the
previously applicable definition of expression equality), (2)
re-establishing the duplicates using the emit clause, and (3)
"dereferencing" the references in the grouping sets, i.e., by replacing
each reference in the grouping sets with the expression it refers to.

The previous version was problematic because it required the *consumers*
to deduplicate the expressions from the grouping sets. This, in turn,
requires to parse and understand 100% of these expression even in cases
where that understanding is otherwise optional, which is in opposition
to the general philosophy of allowing for simple-minded consumers. The
new version avoids that problem and, thus, allows consumers to be
simpler. The issues are discussed in even more detail in #700.

---------

Signed-off-by: Ingo Müller <ingomueller@google.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants