-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add optional metadata containing field names to RelCommon #696
feat: add optional metadata containing field names to RelCommon #696
Conversation
proto/substrait/algebra.proto
Outdated
@@ -32,6 +34,11 @@ message RelCommon { | |||
repeated int32 output_mapping = 1; | |||
} | |||
|
|||
message Metadata { | |||
// Sets (or resets) the output field names for any relation. | |||
repeated string output_names = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe just add this in Hint (since it's something a consumer can ignore), next to string alias = 3
which serves a similar purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either would work for me. The only reason I avoided this was because of the comment:
// Changes to the operation that can influence efficiency/performance but
// should not impact correctness.
In the case of ExpandRel
in Spark, it does impact the correctness.
Other than that, I'm happy to go with whatever the maintainers prefer :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, how does it affect correctness? If the expected names are not supplied, you should be able to pass in some generated names, and the plan's correctness should be intact, right? (TBH I don't know how exactly Expand works so I might be missing something, but for all I know #661 (comment) is correct in that only the names at the end of the plan should matter for correctness and all intermediary names can be whatever. (That said, for debugging/readability I do like having correct names also in the middle of the plan, so in general I'm totally for this change!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the Spark implementation uses the names to connect up to other parts of the plan then it is using the new field for correctness. If however it is only using them to ensure the roundtrip works (i.e. giving names to something that wouldn't otherwise be there) then it is metadata. I would hope that the Spark implementation would use the second approach (and if it does it would fall under metadata).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done some further testing, and I think I agree this is an artefact of the round-trip test mechanism that the spark
module is using.
I've moved this into the Hint
message, as suggested.
9197b45
to
298fcf1
Compare
proto/substrait/algebra.proto
Outdated
@@ -41,6 +41,8 @@ message RelCommon { | |||
// Name (alias) for this relation. Can be used for e.g. qualifying the relation (see e.g. | |||
// Spark's SubqueryAlias), or debugging. | |||
string alias = 3; | |||
// Sets (or resets) the output field names for any relation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make it clear if the names here are alternative names or a single dotted name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, although I'm not sure what you mean by 'single dotted name'. Would the following comment explain it better?
// Assigns alternative output field names for any relation. For example, if the relation outputs
// three fields (columns) then their names can be changed by assigning each new name to an
// instance of output_names, in the same order in which they are emitted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, so this is not the name of the subquery but its outputs. Then it should work exactly like root names (so you can name struct subfields as well). A reference to relroot's names field should be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. So perhaps...
// Assigns alternative output field names for any relation. Equivalent to the names field
// in RelRoot, but allows the option of naming the fields of any relation within the structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Assigns alternative output field names for any relation. Equivalent to the names field
// in RelRoot but applies to the output of the relation this RelCommon is attached to.
I like that better. Here's an attempt written on my phone.
Allows the output field names of any relation to be set or reset Signed-off-by: andrew-coleman <andrew_coleman@uk.ibm.com> Signed-off-by: Andrew Coleman <andrew_coleman@uk.ibm.com>
298fcf1
to
cd03365
Compare
Does this need a second reviewer, or can it be merged now? |
Thanks for the patience @andrew-coleman ! |
Following the discussion and suggestion proposed in PR #661, this PR introduces a new metadata field in the
RelCommon
proto definition which contains the output field names for any relation.As previously discussed, this is required to support the implementation of the
Expand
relation in thespark
module.