feat: allow naming/aliasing relations #649

Blizzara · 2024-06-11T07:25:00Z

Same goal as in #648 - to support what Spark's and DataFusion's SubqueryAlias relation does.

As Substrait mostly only referes to columns by their index, there is no inherent need for table name/qualifiers within Substrait. However, some consumers, e.g. DataFusion, require column names to be either unique or qualified for joins, which is troublesome w/o the possibility to qualify relations.

Also, for debugging failed plans and for roundtrip testing of X -> Substrait -> X conversions, it would be convenient to have proper, human-readable names to refer to.

Closes #571

westonpace

How would you feel about making this something like:

map<string, string> metadata = 6;

Then you could have {"alias": "catalog.table.rel_alias"}. We could define some canonical keys (e.g. alias) so that different producers / consumers have interoperability.

I ask this because there have been different asks for this kind of "relation metadata" information. For example, some users would like to attach "the SQL representation of the relation" so that they can use it for debugging. Others have wanted to attach a "unique id", again, that they can use for debugging.

Rather than add a bunch of optional properties would it be better to have one dictionary of optimal metadata that can be attached to relations?

westonpace · 2024-06-11T14:02:47Z

proto/substrait/algebra.proto

+  // Optional name (alias) for this relation, if provided, should be unique across the root.
+  // Safe to ignore, or can be used for qualifying the relation or debugging.
+  // Repeated to allow for multiple levels of naming, e.g. catalog/schema/table
+  repeated string rel_names = 5;


Is this an alias for the relation? Or for the fields in the relation? If it is an alias for the relation should this maybe just be repeated string alias = 5;?

Alias for the relation. Yeah, naming is hard 😅 I tried to align with the "names" in ReadTable rel, but yes I'd be happy with just "alias".

Blizzara · 2024-06-11T14:26:33Z

How would you feel about making this something like:
map<string, string> metadata = 6;
Then you could have {"alias": "catalog.table.rel_alias"}. We could define some canonical keys (e.g. alias) so that different producers / consumers have interoperability.

I ask this because there have been different asks for this kind of "relation metadata" information. For example, some users would like to attach "the SQL representation of the relation" so that they can use it for debugging. Others have wanted to attach a "unique id", again, that they can use for debugging.

Rather than add a bunch of optional properties would it be better to have one dictionary of optimal metadata that can be attached to relations?

That'd work for me! I can see the benefit of having custom metadata as well. Just for my usecase I'd really need the "alias" key to be canonical given I need it in both Spark and DataFusion - so I wonder if the best way to define those canonical keys would still be to have some protobuf message, with the canonical keys + the extension point (like the map you suggested)? Or is there some reason not to have it in the message / some better place to have it? I guess technically one could also wrap it all in some SimpleExtension kind of thing, and then provide the canonical keys as extensions in this repo 🤔

westonpace · 2024-06-12T02:51:54Z

so I wonder if the best way to define those canonical keys would still be to have some protobuf message, with the canonical keys + the extension point (like the map you suggested)? Or is there some reason not to have it in the message / some better place to have it?

I think my main motivation here is for users to be able to add new metadata fields for experimentation and trials without requiring a spec change. By canonical keys + extension point do you mean something like...

message Metadata {
  repeated string names;
  ...
  map<string, string> other_metadata;
}

I'm not opposed to it. However...

I guess technically one could also wrap it all in some SimpleExtension kind of thing, and then provide the canonical keys as extensions in this repo 🤔

What if there were a section in the site that listed the canonical keys. This way the protobuf doesn't need to change when new keys are added:

Blizzara · 2024-06-12T12:46:23Z

I think my main motivation here is for users to be able to add new metadata fields for experimentation and trials without requiring a spec change. By canonical keys + extension point do you mean something like...

Exactly - I modified this PR to show that. I feel like that'd be a good compromise between allowing experimentation while also providing some guarantees of structure. (I'm not strict about the specific names, so if there's better ideas happy to rename!)

What if there were a section in the site that listed the canonical keys. This way the protobuf doesn't need to change when new keys are added:

I feel this could end up being too fragile - it might be easier to break the list of keys on the website, people might use keys for one reason before the same key might be added to the website for other reasons, etc. I'd prefer this proposal where there's a clear divide between optiona-but-defined-metadata and totally-custom-metadata.

It could still work, I guess, but would not be my preferred option.

EpsilonPrime · 2024-06-12T19:51:58Z

proto/substrait/algebra.proto

@@ -52,6 +53,22 @@ message RelCommon {
      substrait.extensions.AdvancedExtension advanced_extension = 10;
    }
  }
+
+  // Optional metadata that the producer may attach to the relation.
+  // Consumer should hold no expectations about the existance metadata.


How about: "The consumer should not rely on the metadata for the evaluation of the plan."

Might even want to mention that both the consumer and the producer shouldn't rely on this.

In other words:

As a consumer, that wants to handle plans from multiple producers, be prepared to handle plans that are missing this metadata

As a producer, that wants plans to be run by multiple consumers, understand that your consumer might ignore this metadata

EpsilonPrime · 2024-06-12T19:52:48Z

proto/substrait/algebra.proto

+    // Any custom metadata that the producer wants to attach to the relation.
+    // Consumer should hold no expectations about the existence, or lack of,
+    // or meaning of, of any specific key, unless the producer is known.
+    map<string, string> custom = 2;


Having string values is probably the right choice here as it means you can't stuff arbitrary protobufs here (that's the job on a protobuf extension anyway).

EpsilonPrime · 2024-06-12T19:53:55Z

proto/substrait/algebra.proto

+  // Consumer should hold no expectations about the existance metadata.
+  message Metadata {
+    // Name (alias) for this relation.
+    // If provided, must be unique across the relations in the plan/root.


Technically it only needs to be unique within the subplan that consumes it. But this is a safer qualification.

I hate to bike shed on names but if it must be unique across all relations then would id be a better name to use here than alias?

I'd have a slight preference for "alias" rather than "id", as "alias" is the name used in SQL (eg. https://www.w3schools.com/sql/sql_alias.asp) and also "id" sounds like something people might expect to be able to refer to later on in the (substrait) plan, which we probably don't want. But I can be convinced otherwise :D

Let's stick with alias then.

EpsilonPrime · 2024-06-12T19:59:18Z

proto/substrait/algebra.proto

+    // Safe to ignore, or can be used for e.g. qualifying the relation (see e.g.
+    // Spark's SubqueryAlias), or debugging.
+    // Repeated to allow for multiple levels of naming, e.g. catalog/schema/table.
+    repeated string alias = 1;


Having this separate is the only part left of the PR that could be contentious. By allowing this to qualify the relation this is participating in the evaluation of the plan. Naming a relation isn't going to be enough to fix Datafusion's duplicate field errors -- the equivalent of SELECT x+2, x+2 will still trigger them.

vbarua · 2024-06-12T23:54:02Z

To play devil's advocate somewhat.

I ask this because there have been different asks for this kind of "relation metadata" information. For example, some users would like to attach "the SQL representation of the relation" so that they can use it for debugging. Others have wanted to attach a "unique id", again, that they can use for debugging.

With the map<string, string> metadata; field, we're introducing a second mechanism for introducing limited amounts of arbitrary data. The current mechanism we have for data like this are Advanced Extensions. Those are currently limited to optimisations and enhancements, which I don't think aliases map to, but it could potentially be expanded to include arbitrary metadata with a new field for this kind of information.

Having string values is probably the right choice here as it means you can't stuff arbitrary protobufs here

To David's point, giving users an extension point to add arbitrary strings into the system feels like it starts with string names and string uuids and ends with string encoded binary messages.

I do like the idea of having canonical metadata information, like aliases, but maybe they could be baked into advanced extensions like:

message AdvancedExtension {
  // An optimization is helpful information that don't influence semantics. May
  // be ignored by a consumer.
  google.protobuf.Any optimization = 1;

  // An enhancement alter semantics. Cannot be ignored by a consumer.
  google.protobuf.Any enhancement = 2;

  // metadata provides additional information which must not alter semantics
  // metadata can assist with debugging and overall plan readability
  // can be ignore by consumers 
  repeated google.protobuf.Any metadata = 2;
}

and then we can provide messages in the core spec like

message AliasMetadata {
  // Name (alias) for this relation.
  // If provided, must be unique across the relations in the plan/root.

  // Safe to ignore, or can be used for e.g. qualifying the relation (see e.g.
  // Spark's SubqueryAlias), or debugging.
  // Repeated to allow for multiple levels of naming, e.g. catalog/schema/table.
  repeated string alias = 1;
}

Which let's us provide simple metadata but then also allows users (and us) to do things like:

message DebuggingMetadata {
   // Complex debugging info
  ...
}

This also avoid the issues of breaking key changes because there are no keys involved, just messages.

westonpace · 2024-06-13T13:42:47Z

I do like @vbarua 's suggestion. I would argue that we should couple it with changes in the site docs (add a "standard advanced extensions" section or "standard metadata" section). Otherwise it is not clear from the protobuf that AdvancedExtension::metadata and AliasMetadata are related (admittedly, the name is a good indicator, but messages tend to get scattered throughout the proto file and it is very large so it may be hard to link the two)

EpsilonPrime · 2024-06-15T00:21:29Z

One potential issue I see with advanced extensions the way they are currently defined is that we only support one advanced extension and one of the three underlying types. We probably should have repeated advanced extensions included wherever we currently have one.

richtia · 2024-06-17T20:06:49Z

@Blizzara will the datafusion consumer eventually also support substrait plans that do not include the naming/aliasing? If not, it seems like there will still be a big interoperability gap.

westonpace · 2024-06-17T21:35:06Z

@Blizzara will the datafusion consumer eventually also support substrait plans that do not include the naming/aliasing? If not, it seems like there will still be a big interoperability gap.

It should support that already? So the substrait -> datafusion path works ok. It's the datafusion -> substrait -> datafusion path that loses fidelity.

richtia · 2024-06-17T21:44:12Z

@Blizzara will the datafusion consumer eventually also support substrait plans that do not include the naming/aliasing? If not, it seems like there will still be a big interoperability gap.

It should support that already? So the substrait -> datafusion path works ok. It's the datafusion -> substrait -> datafusion path that loses fidelity.

I was thinking more along the lines of the duckdb -> substrait -> datafusion path.

Blizzara · 2024-06-19T10:09:11Z

the equivalent of SELECT x+2, x+2 will still trigger them
@EpsilonPrime true, but a) that will also fail if the sink is e.g. Parquet, which doesn't allow duplicate column names, so I think it's just a fair difference in behavior, and b) the entity creating the plan can instead decide to write the plan as SELECT x+2 as col1, x+2 as col2 and that'll work in DataFusion as well (there might be plans where it's not that easy, since column names internal to the plan are not maintained), wrt. this current issue where currently the entity creating the plan cannot do anything to make the plan work in DataFusion.

@Blizzara will the datafusion consumer eventually also support substrait plans that do not include the naming/aliasing? If not, it seems like there will still be a big interoperability gap.

@richtia this is only a problem in some specific and not-that-common cases (mainly virtual tables with same column names and self-joins iirc). Plans not involving those would be supported w/o the additional metadata also in the future.

That said, I've been thinking about this and I think I want to propose changes into DataFusion to handle these cases w/o the aliasing.

I still think the aliasing would be useful feature for metadata/logging/debugging reasons, but if I manage to make DF more resilient that decouples this discussion from my immediate needs, so we can take all the time here to find a good solution :)

Blizzara · 2024-06-19T10:10:06Z

Re having the metadata be an AdvancedExtension, that sounds reasonable to me - however I'm not very familiar with those, and I had the same concern as @EpsilonPrime:

One potential issue I see with advanced extensions the way they are currently defined is that we only support one advanced extension and one of the three underlying types.

westonpace · 2024-06-19T13:51:34Z

One potential issue I see with advanced extensions the way they are currently defined is that we only support one advanced extension and one of the three underlying types.

This is a concern for optimization and enhancement but not for metadata (as proposed by @vbarua) since Victor added the repeated field for metadata.

I agree that optimization and enhancement should probably be repeated but that seems like a separate PR (to avoid breaking changes in this one and isolate breaking changes in their own PR).

@richtia this is only a problem in some specific and not-that-common cases (mainly virtual tables with same column names and self-joins iirc). Plans not involving those would be supported w/o the additional metadata also in the future.

That said, I've been thinking about this and I think I want to propose changes into DataFusion to handle these cases w/o the aliasing.

Ok, I understand now. Yes, I agree that DataFusion needs to be able to handle self-joins (and other scenarios) without requiring the incoming plan have an alias.

westonpace · 2024-06-19T20:31:58Z

This PR was discussed for quite a while in the community meeting (maybe we can link the video once its up on YouTube). I'll try and summarize (based on my biased and imperfect memory):

We definitely want this
Where in the plan to put alias info?
- The comments in the PR currently suggest a new field (metadata) placed in AdvancedExtension. @jacques-n mentioned that we already have optimization and enhancement and we should try and reuse optimization. There should only ever be two categories ("can consumer ignore this? Yes/no"). There is no need for a third "the consumer can ignore this" thing.
Which advanced extension?
- There are several extension points on a Rel (let's call it T). There is T::advanced_extension, T::common::advanced_extension, T::common::hint::advanced_extension, T::common::hint::constraint::advanced_extension, and finally T::common::hint::stats::advanced_extension.
Should we make name-based resolution a first-class and required concept in Substrait (followed by some discussion from last meeting about resolving parquet names and the ReadRel::base_schema). The general consensus seemed to be "probably not but let's discuss another time and not worry about it for this feature"
Should we add map<string,string> metadata? If so should it be a first-class structured thing (e.g. something like T::common::hint::metadata or an advanced extension? Or should we add alias by itself, without metadata, also should that be a first-class structured item (e.g. something like T::common::hint::alias)? Or should we make alias be an advanced extension all by itself? (we could even add both metadata AND alias)
Should we use repeated string for an alias or just string?
- Do SQL aliases every have nested names?
- Even if they did could we just use a dotted name?
- One point brought up is that Spark does this but probably just so there is a common "column identifier" type that aliases can share.
Should we make AdvancedExtension::optimization a repeated structure? Yes
Should we make AdvancedExtension::enhancement a repeated structure? No

westonpace · 2024-06-19T20:37:47Z

My personal opinion is split between several approaches

We should make a new field RelCommon::metadata which is map<string, string> and let alias be "canonical metadata"
We should make a new field RelCommon::hint::alias which is string and let alias be "first class structured"
We should make Alias a message like message Alias { string alias = 1; } and place it in RelCommon::hint::advanced_extension::optimization

Regardless of which approach we take I think we should make AdvancedExtension::optimization a repeated field.

Blizzara · 2024-06-21T16:39:28Z

Should we use repeated string for an alias or just string?
Do SQL aliases every have nested names?

Both Spark and DataFusion allow composite name (Spark iirc has a list, DataFusion has three fields for catalog/schema/name.
From what I see, DuckDB's alias seems to take only a single string. Dunno about others

That said, those make sense when coming from a ReadRel, but dunno if anyone would ever use those within the plan. I'd expect using just a single string would be okay for all usages.

Even if they did could we just use a dotted name?

I'd assume so, yes.

One point brought up is that Spark does this but probably just so there is a common "column identifier" type that aliases can share.

Yep, same for DataFusion. I used the repeated string here since it's just more generalizable (a consumer can always combine the list into dotted name, for example), but I'm happy to go with a single string as well - maybe that makes things simpler.

Blizzara · 2024-06-21T16:43:02Z

We should make a new field RelCommon::hint::alias which is string and let alias be "first class structured"
We should make Alias a message like message Alias { string alias = 1; } and place it in RelCommon::hint::advanced_extension::optimization

I'd vote for one of these. I think having it in hint would make things easier for this purpose, and given that ALIAS is a concept in probably every DB/query system it doesn't feel that wrong, but I also acknowledge that we may not want to include everything inside hint so there is a point in having it in the extension as well 🤷

…ring

Blizzara · 2024-06-21T16:46:57Z

I updated the PR to be just the hint version, since it sounds like there's no use in adding the extraneous metadata message anyways.

Also, FWIW, apache/datafusion#11049 should resolve the original problem I had on DataFusion side.

westonpace

This style works for me. We don't want extra fields growing without bounds but I think this property is common enough to be justified.

westonpace · 2024-07-02T16:09:54Z

@jacques-n / @vbarua / @EpsilonPrime can one of you take another look at this PR?

jacques-n

LGTM +1

vbarua

LGTM +1

feat: allow naming/aliasing relations

5316143

Blizzara requested review from jacques-n, cpcloud, westonpace, EpsilonPrime and vbarua as code owners June 11, 2024 07:25

westonpace mentioned this pull request Jun 11, 2024

feat: add SubqueryAliasRel #648

Closed

westonpace reviewed Jun 11, 2024

View reviewed changes

Blizzara added 2 commits June 12, 2024 14:39

move into a Metadata message and add a map for additional metadata

b6789f3

move some docs to metadata

b2ba49f

EpsilonPrime reviewed Jun 12, 2024

View reviewed changes

remove Metadata message, move alias into Hint and make it a simple st…

f38b9c6

…ring

Blizzara mentioned this pull request Jun 26, 2024

Substrait: Add SubqueryAlias support apache/datafusion#6867

Open

vbarua mentioned this pull request Jun 26, 2024

feat: make optimization a repeated field #653

Merged

westonpace approved these changes Jun 28, 2024

View reviewed changes

EpsilonPrime approved these changes Jul 2, 2024

View reviewed changes

jacques-n approved these changes Jul 3, 2024

View reviewed changes

vbarua approved these changes Jul 3, 2024

View reviewed changes

EpsilonPrime merged commit 4cf8108 into substrait-io:main Jul 3, 2024
14 of 15 checks passed

Blizzara deleted the avo/add-name-for-relation branch July 8, 2024 09:14

westonpace mentioned this pull request Jul 24, 2024

feat: add output_schema to ExpandRel message #661

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: allow naming/aliasing relations #649

feat: allow naming/aliasing relations #649

Blizzara commented Jun 11, 2024 •

edited

Loading

westonpace left a comment

westonpace Jun 11, 2024

Blizzara Jun 11, 2024

Blizzara commented Jun 11, 2024

westonpace commented Jun 12, 2024

Blizzara commented Jun 12, 2024

EpsilonPrime Jun 12, 2024

westonpace Jun 13, 2024

EpsilonPrime Jun 12, 2024

EpsilonPrime Jun 12, 2024

westonpace Jun 13, 2024

Blizzara Jun 19, 2024

westonpace Jun 19, 2024

EpsilonPrime Jun 12, 2024

vbarua commented Jun 12, 2024

westonpace commented Jun 13, 2024 •

edited

Loading

EpsilonPrime commented Jun 15, 2024

richtia commented Jun 17, 2024

westonpace commented Jun 17, 2024

richtia commented Jun 17, 2024

Blizzara commented Jun 19, 2024 •

edited

Loading

Blizzara commented Jun 19, 2024

westonpace commented Jun 19, 2024

westonpace commented Jun 19, 2024 •

edited

Loading

westonpace commented Jun 19, 2024

Blizzara commented Jun 21, 2024

Blizzara commented Jun 21, 2024

Blizzara commented Jun 21, 2024 •

edited

Loading

westonpace left a comment

westonpace commented Jul 2, 2024

jacques-n left a comment

vbarua left a comment

feat: allow naming/aliasing relations #649

feat: allow naming/aliasing relations #649

Conversation

Blizzara commented Jun 11, 2024 • edited Loading

westonpace left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Blizzara commented Jun 11, 2024

westonpace commented Jun 12, 2024

Blizzara commented Jun 12, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vbarua commented Jun 12, 2024

westonpace commented Jun 13, 2024 • edited Loading

EpsilonPrime commented Jun 15, 2024

richtia commented Jun 17, 2024

westonpace commented Jun 17, 2024

richtia commented Jun 17, 2024

Blizzara commented Jun 19, 2024 • edited Loading

Blizzara commented Jun 19, 2024

westonpace commented Jun 19, 2024

westonpace commented Jun 19, 2024 • edited Loading

westonpace commented Jun 19, 2024

Blizzara commented Jun 21, 2024

Blizzara commented Jun 21, 2024

Blizzara commented Jun 21, 2024 • edited Loading

westonpace left a comment

Choose a reason for hiding this comment

westonpace commented Jul 2, 2024

jacques-n left a comment

Choose a reason for hiding this comment

vbarua left a comment

Choose a reason for hiding this comment

Blizzara commented Jun 11, 2024 •

edited

Loading

westonpace commented Jun 13, 2024 •

edited

Loading

Blizzara commented Jun 19, 2024 •

edited

Loading

westonpace commented Jun 19, 2024 •

edited

Loading

Blizzara commented Jun 21, 2024 •

edited

Loading