Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allow naming/aliasing relations #649

Merged
merged 4 commits into from
Jul 3, 2024

Conversation

Blizzara
Copy link
Contributor

@Blizzara Blizzara commented Jun 11, 2024

Same goal as in #648 - to support what Spark's and DataFusion's SubqueryAlias relation does.

As Substrait mostly only referes to columns by their index, there is no inherent need for table name/qualifiers within Substrait. However, some consumers, e.g. DataFusion, require column names to be either unique or qualified for joins, which is troublesome w/o the possibility to qualify relations.

Also, for debugging failed plans and for roundtrip testing of X -> Substrait -> X conversions, it would be convenient to have proper, human-readable names to refer to.

Closes #571

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you feel about making this something like:

map<string, string> metadata = 6;

Then you could have {"alias": "catalog.table.rel_alias"}. We could define some canonical keys (e.g. alias) so that different producers / consumers have interoperability.

I ask this because there have been different asks for this kind of "relation metadata" information. For example, some users would like to attach "the SQL representation of the relation" so that they can use it for debugging. Others have wanted to attach a "unique id", again, that they can use for debugging.

Rather than add a bunch of optional properties would it be better to have one dictionary of optimal metadata that can be attached to relations?

// Optional name (alias) for this relation, if provided, should be unique across the root.
// Safe to ignore, or can be used for qualifying the relation or debugging.
// Repeated to allow for multiple levels of naming, e.g. catalog/schema/table
repeated string rel_names = 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an alias for the relation? Or for the fields in the relation? If it is an alias for the relation should this maybe just be repeated string alias = 5;?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alias for the relation. Yeah, naming is hard 😅 I tried to align with the "names" in ReadTable rel, but yes I'd be happy with just "alias".

@Blizzara
Copy link
Contributor Author

How would you feel about making this something like:

map<string, string> metadata = 6;

Then you could have {"alias": "catalog.table.rel_alias"}. We could define some canonical keys (e.g. alias) so that different producers / consumers have interoperability.

I ask this because there have been different asks for this kind of "relation metadata" information. For example, some users would like to attach "the SQL representation of the relation" so that they can use it for debugging. Others have wanted to attach a "unique id", again, that they can use for debugging.

Rather than add a bunch of optional properties would it be better to have one dictionary of optimal metadata that can be attached to relations?

That'd work for me! I can see the benefit of having custom metadata as well. Just for my usecase I'd really need the "alias" key to be canonical given I need it in both Spark and DataFusion - so I wonder if the best way to define those canonical keys would still be to have some protobuf message, with the canonical keys + the extension point (like the map you suggested)? Or is there some reason not to have it in the message / some better place to have it? I guess technically one could also wrap it all in some SimpleExtension kind of thing, and then provide the canonical keys as extensions in this repo 🤔

@westonpace
Copy link
Member

so I wonder if the best way to define those canonical keys would still be to have some protobuf message, with the canonical keys + the extension point (like the map you suggested)? Or is there some reason not to have it in the message / some better place to have it?

I think my main motivation here is for users to be able to add new metadata fields for experimentation and trials without requiring a spec change. By canonical keys + extension point do you mean something like...

message Metadata {
  repeated string names;
  ...
  map<string, string> other_metadata;
}

I'm not opposed to it. However...

I guess technically one could also wrap it all in some SimpleExtension kind of thing, and then provide the canonical keys as extensions in this repo 🤔

What if there were a section in the site that listed the canonical keys. This way the protobuf doesn't need to change when new keys are added:

Substrait-Relation-Metadata

@Blizzara
Copy link
Contributor Author

I think my main motivation here is for users to be able to add new metadata fields for experimentation and trials without requiring a spec change. By canonical keys + extension point do you mean something like...

Exactly - I modified this PR to show that. I feel like that'd be a good compromise between allowing experimentation while also providing some guarantees of structure. (I'm not strict about the specific names, so if there's better ideas happy to rename!)

What if there were a section in the site that listed the canonical keys. This way the protobuf doesn't need to change when new keys are added:

I feel this could end up being too fragile - it might be easier to break the list of keys on the website, people might use keys for one reason before the same key might be added to the website for other reasons, etc. I'd prefer this proposal where there's a clear divide between optiona-but-defined-metadata and totally-custom-metadata.

It could still work, I guess, but would not be my preferred option.

@@ -52,6 +53,22 @@ message RelCommon {
substrait.extensions.AdvancedExtension advanced_extension = 10;
}
}

// Optional metadata that the producer may attach to the relation.
// Consumer should hold no expectations about the existance metadata.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about: "The consumer should not rely on the metadata for the evaluation of the plan."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might even want to mention that both the consumer and the producer shouldn't rely on this.

In other words:

  • As a consumer, that wants to handle plans from multiple producers, be prepared to handle plans that are missing this metadata
  • As a producer, that wants plans to be run by multiple consumers, understand that your consumer might ignore this metadata

// Any custom metadata that the producer wants to attach to the relation.
// Consumer should hold no expectations about the existence, or lack of,
// or meaning of, of any specific key, unless the producer is known.
map<string, string> custom = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having string values is probably the right choice here as it means you can't stuff arbitrary protobufs here (that's the job on a protobuf extension anyway).

// Consumer should hold no expectations about the existance metadata.
message Metadata {
// Name (alias) for this relation.
// If provided, must be unique across the relations in the plan/root.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically it only needs to be unique within the subplan that consumes it. But this is a safer qualification.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate to bike shed on names but if it must be unique across all relations then would id be a better name to use here than alias?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have a slight preference for "alias" rather than "id", as "alias" is the name used in SQL (eg. https://www.w3schools.com/sql/sql_alias.asp) and also "id" sounds like something people might expect to be able to refer to later on in the (substrait) plan, which we probably don't want. But I can be convinced otherwise :D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's stick with alias then.

// Safe to ignore, or can be used for e.g. qualifying the relation (see e.g.
// Spark's SubqueryAlias), or debugging.
// Repeated to allow for multiple levels of naming, e.g. catalog/schema/table.
repeated string alias = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having this separate is the only part left of the PR that could be contentious. By allowing this to qualify the relation this is participating in the evaluation of the plan. Naming a relation isn't going to be enough to fix Datafusion's duplicate field errors -- the equivalent of SELECT x+2, x+2 will still trigger them.

@vbarua
Copy link
Member

vbarua commented Jun 12, 2024

To play devil's advocate somewhat.

I ask this because there have been different asks for this kind of "relation metadata" information. For example, some users would like to attach "the SQL representation of the relation" so that they can use it for debugging. Others have wanted to attach a "unique id", again, that they can use for debugging.

With the map<string, string> metadata; field, we're introducing a second mechanism for introducing limited amounts of arbitrary data. The current mechanism we have for data like this are Advanced Extensions. Those are currently limited to optimisations and enhancements, which I don't think aliases map to, but it could potentially be expanded to include arbitrary metadata with a new field for this kind of information.

Having string values is probably the right choice here as it means you can't stuff arbitrary protobufs here

To David's point, giving users an extension point to add arbitrary strings into the system feels like it starts with string names and string uuids and ends with string encoded binary messages.

I do like the idea of having canonical metadata information, like aliases, but maybe they could be baked into advanced extensions like:

message AdvancedExtension {
  // An optimization is helpful information that don't influence semantics. May
  // be ignored by a consumer.
  google.protobuf.Any optimization = 1;

  // An enhancement alter semantics. Cannot be ignored by a consumer.
  google.protobuf.Any enhancement = 2;

  // metadata provides additional information which must not alter semantics
  // metadata can assist with debugging and overall plan readability
  // can be ignore by consumers 
  repeated google.protobuf.Any metadata = 2;
}

and then we can provide messages in the core spec like

message AliasMetadata {
  // Name (alias) for this relation.
  // If provided, must be unique across the relations in the plan/root.

  // Safe to ignore, or can be used for e.g. qualifying the relation (see e.g.
  // Spark's SubqueryAlias), or debugging.
  // Repeated to allow for multiple levels of naming, e.g. catalog/schema/table.
  repeated string alias = 1;
}

Which let's us provide simple metadata but then also allows users (and us) to do things like:

message DebuggingMetadata {
   // Complex debugging info
  ...
}

This also avoid the issues of breaking key changes because there are no keys involved, just messages.

@westonpace
Copy link
Member

westonpace commented Jun 13, 2024

I do like @vbarua 's suggestion. I would argue that we should couple it with changes in the site docs (add a "standard advanced extensions" section or "standard metadata" section). Otherwise it is not clear from the protobuf that AdvancedExtension::metadata and AliasMetadata are related (admittedly, the name is a good indicator, but messages tend to get scattered throughout the proto file and it is very large so it may be hard to link the two)

@EpsilonPrime
Copy link
Member

One potential issue I see with advanced extensions the way they are currently defined is that we only support one advanced extension and one of the three underlying types. We probably should have repeated advanced extensions included wherever we currently have one.

@richtia
Copy link
Member

richtia commented Jun 17, 2024

@Blizzara will the datafusion consumer eventually also support substrait plans that do not include the naming/aliasing? If not, it seems like there will still be a big interoperability gap.

@westonpace
Copy link
Member

@Blizzara will the datafusion consumer eventually also support substrait plans that do not include the naming/aliasing? If not, it seems like there will still be a big interoperability gap.

It should support that already? So the substrait -> datafusion path works ok. It's the datafusion -> substrait -> datafusion path that loses fidelity.

@richtia
Copy link
Member

richtia commented Jun 17, 2024

@Blizzara will the datafusion consumer eventually also support substrait plans that do not include the naming/aliasing? If not, it seems like there will still be a big interoperability gap.

It should support that already? So the substrait -> datafusion path works ok. It's the datafusion -> substrait -> datafusion path that loses fidelity.

I was thinking more along the lines of the duckdb -> substrait -> datafusion path.

@Blizzara
Copy link
Contributor Author

Blizzara commented Jun 19, 2024

the equivalent of SELECT x+2, x+2 will still trigger them
@EpsilonPrime true, but a) that will also fail if the sink is e.g. Parquet, which doesn't allow duplicate column names, so I think it's just a fair difference in behavior, and b) the entity creating the plan can instead decide to write the plan as SELECT x+2 as col1, x+2 as col2 and that'll work in DataFusion as well (there might be plans where it's not that easy, since column names internal to the plan are not maintained), wrt. this current issue where currently the entity creating the plan cannot do anything to make the plan work in DataFusion.

@Blizzara will the datafusion consumer eventually also support substrait plans that do not include the naming/aliasing? If not, it seems like there will still be a big interoperability gap.

@richtia this is only a problem in some specific and not-that-common cases (mainly virtual tables with same column names and self-joins iirc). Plans not involving those would be supported w/o the additional metadata also in the future.

That said, I've been thinking about this and I think I want to propose changes into DataFusion to handle these cases w/o the aliasing.

I still think the aliasing would be useful feature for metadata/logging/debugging reasons, but if I manage to make DF more resilient that decouples this discussion from my immediate needs, so we can take all the time here to find a good solution :)

@Blizzara
Copy link
Contributor Author

Re having the metadata be an AdvancedExtension, that sounds reasonable to me - however I'm not very familiar with those, and I had the same concern as @EpsilonPrime:

One potential issue I see with advanced extensions the way they are currently defined is that we only support one advanced extension and one of the three underlying types.

@westonpace
Copy link
Member

One potential issue I see with advanced extensions the way they are currently defined is that we only support one advanced extension and one of the three underlying types.

This is a concern for optimization and enhancement but not for metadata (as proposed by @vbarua) since Victor added the repeated field for metadata.

I agree that optimization and enhancement should probably be repeated but that seems like a separate PR (to avoid breaking changes in this one and isolate breaking changes in their own PR).

@richtia this is only a problem in some specific and not-that-common cases (mainly virtual tables with same column names and self-joins iirc). Plans not involving those would be supported w/o the additional metadata also in the future.

That said, I've been thinking about this and I think I want to propose changes into DataFusion to handle these cases w/o the aliasing.

Ok, I understand now. Yes, I agree that DataFusion needs to be able to handle self-joins (and other scenarios) without requiring the incoming plan have an alias.

@westonpace
Copy link
Member

westonpace commented Jun 19, 2024

This PR was discussed for quite a while in the community meeting (maybe we can link the video once its up on YouTube). I'll try and summarize (based on my biased and imperfect memory):

  • We definitely want this
  • Where in the plan to put alias info?
    • The comments in the PR currently suggest a new field (metadata) placed in AdvancedExtension. @jacques-n mentioned that we already have optimization and enhancement and we should try and reuse optimization. There should only ever be two categories ("can consumer ignore this? Yes/no"). There is no need for a third "the consumer can ignore this" thing.
  • Which advanced extension?
    • There are several extension points on a Rel (let's call it T). There is T::advanced_extension, T::common::advanced_extension, T::common::hint::advanced_extension, T::common::hint::constraint::advanced_extension, and finally T::common::hint::stats::advanced_extension.
  • Should we make name-based resolution a first-class and required concept in Substrait (followed by some discussion from last meeting about resolving parquet names and the ReadRel::base_schema). The general consensus seemed to be "probably not but let's discuss another time and not worry about it for this feature"
  • Should we add map<string,string> metadata? If so should it be a first-class structured thing (e.g. something like T::common::hint::metadata or an advanced extension? Or should we add alias by itself, without metadata, also should that be a first-class structured item (e.g. something like T::common::hint::alias)? Or should we make alias be an advanced extension all by itself? (we could even add both metadata AND alias)
  • Should we use repeated string for an alias or just string?
    • Do SQL aliases every have nested names?
    • Even if they did could we just use a dotted name?
    • One point brought up is that Spark does this but probably just so there is a common "column identifier" type that aliases can share.
  • Should we make AdvancedExtension::optimization a repeated structure? Yes
  • Should we make AdvancedExtension::enhancement a repeated structure? No

@westonpace
Copy link
Member

My personal opinion is split between several approaches

  • We should make a new field RelCommon::metadata which is map<string, string> and let alias be "canonical metadata"
  • We should make a new field RelCommon::hint::alias which is string and let alias be "first class structured"
  • We should make Alias a message like message Alias { string alias = 1; } and place it in RelCommon::hint::advanced_extension::optimization

Regardless of which approach we take I think we should make AdvancedExtension::optimization a repeated field.

@Blizzara
Copy link
Contributor Author

Should we use repeated string for an alias or just string?
Do SQL aliases every have nested names?

Both Spark and DataFusion allow composite name (Spark iirc has a list, DataFusion has three fields for catalog/schema/name.
From what I see, DuckDB's alias seems to take only a single string. Dunno about others

That said, those make sense when coming from a ReadRel, but dunno if anyone would ever use those within the plan. I'd expect using just a single string would be okay for all usages.

Even if they did could we just use a dotted name?

I'd assume so, yes.

One point brought up is that Spark does this but probably just so there is a common "column identifier" type that aliases can share.

Yep, same for DataFusion. I used the repeated string here since it's just more generalizable (a consumer can always combine the list into dotted name, for example), but I'm happy to go with a single string as well - maybe that makes things simpler.

@Blizzara
Copy link
Contributor Author

We should make a new field RelCommon::hint::alias which is string and let alias be "first class structured"
We should make Alias a message like message Alias { string alias = 1; } and place it in RelCommon::hint::advanced_extension::optimization

I'd vote for one of these. I think having it in hint would make things easier for this purpose, and given that ALIAS is a concept in probably every DB/query system it doesn't feel that wrong, but I also acknowledge that we may not want to include everything inside hint so there is a point in having it in the extension as well 🤷

@Blizzara
Copy link
Contributor Author

Blizzara commented Jun 21, 2024

I updated the PR to be just the hint version, since it sounds like there's no use in adding the extraneous metadata message anyways.

Also, FWIW, apache/datafusion#11049 should resolve the original problem I had on DataFusion side.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This style works for me. We don't want extra fields growing without bounds but I think this property is common enough to be justified.

@westonpace
Copy link
Member

@jacques-n / @vbarua / @EpsilonPrime can one of you take another look at this PR?

Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1

Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1

@EpsilonPrime EpsilonPrime merged commit 4cf8108 into substrait-io:main Jul 3, 2024
14 of 15 checks passed
@Blizzara Blizzara deleted the avo/add-name-for-relation branch July 8, 2024 09:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support alias for rel
6 participants