feat: add optional metadata containing field names to RelCommon #696

andrew-coleman · 2024-08-28T10:47:37Z

Following the discussion and suggestion proposed in PR #661, this PR introduces a new metadata field in the RelCommon proto definition which contains the output field names for any relation.

As previously discussed, this is required to support the implementation of the Expand relation in the spark module.

Blizzara · 2024-08-28T10:50:33Z

proto/substrait/algebra.proto

@@ -32,6 +34,11 @@ message RelCommon {
    repeated int32 output_mapping = 1;
  }

+  message Metadata {
+    // Sets (or resets) the output field names for any relation.
+    repeated string output_names = 1;


maybe just add this in Hint (since it's something a consumer can ignore), next to string alias = 3 which serves a similar purpose?

Either would work for me. The only reason I avoided this was because of the comment:

// Changes to the operation that can influence efficiency/performance but
// should not impact correctness.

In the case of ExpandRel in Spark, it does impact the correctness.

Other than that, I'm happy to go with whatever the maintainers prefer :)

Hm, how does it affect correctness? If the expected names are not supplied, you should be able to pass in some generated names, and the plan's correctness should be intact, right? (TBH I don't know exactly how Expand works, so I might be missing something, but as far as I know #661 (comment) is correct in that only the names at the end of the plan should matter for correctness, and all intermediary names can be whatever.) That said, for debugging/readability I do like having correct names in the middle of the plan as well, so in general I'm totally for this change!
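To make the "generated names" fallback concrete, here is a minimal sketch of how a consumer might treat the names as optional. It is not taken from any Substrait implementation: `resolve_output_names`, the `col{i}` naming scheme, and the assumption of a flat (non-nested) output schema are all hypothetical, chosen only for illustration.

```python
from typing import List, Optional, Sequence


def resolve_output_names(
    hint_names: Optional[Sequence[str]], field_count: int
) -> List[str]:
    """Pick names for a relation's output fields (hypothetical helper).

    Assumes a flat schema, so one name per output field. The hinted names
    are used only when they are present and match the field count;
    otherwise placeholder names are generated, since intermediate names
    should not affect the correctness of the plan.
    """
    if hint_names and len(hint_names) == field_count:
        return list(hint_names)
    # Fallback: generated names (the "col{i}" scheme is an assumption).
    return [f"col{i}" for i in range(field_count)]
```

With this shape, a plan that carries no names still converts; the names only improve readability and round-tripping.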

If the Spark implementation uses the names to connect up to other parts of the plan then it is using the new field for correctness. If however it is only using them to ensure the roundtrip works (i.e. giving names to something that wouldn't otherwise be there) then it is metadata. I would hope that the Spark implementation would use the second approach (and if it does it would fall under metadata).

I've done some further testing, and I think I agree this is an artefact of the round-trip test mechanism that the spark module is using.
I've moved this into the Hint message, as suggested.

EpsilonPrime · 2024-09-03T23:14:36Z

proto/substrait/algebra.proto

@@ -41,6 +41,8 @@ message RelCommon {
    // Name (alias) for this relation. Can be used for e.g. qualifying the relation (see e.g.
    // Spark's SubqueryAlias), or debugging.
    string alias = 3;
+    // Sets (or resets) the output field names for any relation.


Could we make it clear if the names here are alternative names or a single dotted name?

Sure, although I'm not sure what you mean by 'single dotted name'. Would the following comment explain it better?

// Assigns alternative output field names for any relation. For example, if the relation outputs
// three fields (columns) then their names can be changed by assigning each new name to an
// instance of output_names, in the same order in which they are emitted.

Oh, so this is not the name of the subquery but its outputs. Then it should work exactly like root names (so you can name struct subfields as well). A reference to relroot's names field should be enough.
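As an illustration of "works exactly like root names": RelRoot's names are listed in depth-first order, so struct subfields take their names immediately after the struct itself. The sketch below shows that ordering on a toy schema. `Field`, `NamedField`, and `apply_names` are hypothetical stand-ins written for this example, not Substrait APIs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """Hypothetical field type; `children` is non-empty only for structs."""
    kind: str
    children: List["Field"] = field(default_factory=list)


@dataclass
class NamedField:
    """A field paired with the name assigned to it."""
    name: str
    kind: str
    children: List["NamedField"] = field(default_factory=list)


def apply_names(fields: List[Field], names: List[str]) -> List[NamedField]:
    """Assign a flat list of names to a nested schema in depth-first order."""
    remaining = iter(names)

    def walk(f: Field) -> NamedField:
        # The parent takes the next name, then each child in turn.
        try:
            name = next(remaining)
        except StopIteration:
            raise ValueError("too few names for schema") from None
        return NamedField(name, f.kind, [walk(c) for c in f.children])

    named = [walk(f) for f in fields]
    if next(remaining, None) is not None:
        raise ValueError("too many names for schema")
    return named


# Usage: a struct column consumes one name for itself plus one per subfield.
schema = [
    Field("i64"),
    Field("struct", [Field("string"), Field("i32")]),
    Field("bool"),
]
named = apply_names(schema, ["id", "address", "street", "zip", "active"])
```

Here `address` names the struct, `street` and `zip` name its subfields, and `active` names the final column.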

Makes sense. So perhaps...

// Assigns alternative output field names for any relation. Equivalent to the names field
// in RelRoot, but allows the option of naming the fields of any relation within the structure.

// Assigns alternative output field names for any relation. Equivalent to the names field
// in RelRoot but applies to the output of the relation this RelCommon is attached to.

I like that better. Here's an attempt written on my phone.

andrew-coleman · 2024-09-09T09:38:17Z

Does this need a second reviewer, or can it be merged now?
Many thanks!

jacques-n · 2024-09-11T23:11:26Z

Thanks for the patience @andrew-coleman !

andrew-coleman requested review from jacques-n, cpcloud, westonpace, EpsilonPrime and vbarua as code owners August 28, 2024 10:47

Blizzara reviewed Aug 28, 2024

This was referenced Aug 28, 2024

feat: add output_schema to ExpandRel message #661

Closed

feat: add 'first' function #697

Closed

andrew-coleman force-pushed the relcommon_metadata branch from 9197b45 to 298fcf1 Compare August 30, 2024 11:55

EpsilonPrime previously approved these changes Sep 3, 2024

feat: add hint containing field names to RelCommon

cd03365

Allows the output field names of any relation to be set or reset

Signed-off-by: andrew-coleman <andrew_coleman@uk.ibm.com>
Signed-off-by: Andrew Coleman <andrew_coleman@uk.ibm.com>

andrew-coleman dismissed EpsilonPrime’s stale review via cd03365 September 4, 2024 17:16

andrew-coleman force-pushed the relcommon_metadata branch from 298fcf1 to cd03365 Compare September 4, 2024 17:16

andrew-coleman requested review from EpsilonPrime and Blizzara September 4, 2024 17:17

EpsilonPrime approved these changes Sep 4, 2024

jacques-n approved these changes Sep 11, 2024

vbarua approved these changes Sep 11, 2024

jacques-n merged commit 5a73281 into substrait-io:main Sep 11, 2024
13 checks passed

andrew-coleman deleted the relcommon_metadata branch September 12, 2024 07:52
