Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add optional metadata containing field names to RelCommon #696

Merged
merged 1 commit into from
Sep 11, 2024

Conversation

andrew-coleman
Copy link
Contributor

Following the discussion and suggestion proposed in PR #661, this PR introduces a new metadata field in the RelCommon proto definition which contains the output field names for any relation.

As previously discussed, this is required to support the implementation of the Expand relation in the spark module.

@@ -32,6 +34,11 @@ message RelCommon {
repeated int32 output_mapping = 1;
}

message Metadata {
// Sets (or resets) the output field names for any relation.
repeated string output_names = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just add this in Hint (since it's something a consumer can ignore), next to string alias = 3 which serves a similar purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either would work for me. The only reason I avoided this was because of the comment:

// Changes to the operation that can influence efficiency/performance but
// should not impact correctness.

In the case of ExpandRel in Spark, it does impact the correctness.

Other than that, I'm happy to go with whatever the maintainers prefer :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, how does it affect correctness? If the expected names are not supplied, you should be able to pass in some generated names, and the plan's correctness should be intact, right? (TBH I don't know how exactly Expand works so I might be missing something, but for all I know #661 (comment) is correct in that only the names at the end of the plan should matter for correctness and all intermediary names can be whatever. (That said, for debugging/readability I do like having correct names also in the middle of the plan, so in general I'm totally for this change!)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the Spark implementation uses the names to connect up to other parts of the plan then it is using the new field for correctness. If however it is only using them to ensure the roundtrip works (i.e. giving names to something that wouldn't otherwise be there) then it is metadata. I would hope that the Spark implementation would use the second approach (and if it does it would fall under metadata).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done some further testing, and I think I agree this is an artefact of the round-trip test mechanism that the spark module is using.
I've moved this into the Hint message, as suggested.

EpsilonPrime
EpsilonPrime previously approved these changes Sep 3, 2024
@@ -41,6 +41,8 @@ message RelCommon {
// Name (alias) for this relation. Can be used for e.g. qualifying the relation (see e.g.
// Spark's SubqueryAlias), or debugging.
string alias = 3;
// Sets (or resets) the output field names for any relation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make it clear if the names here are alternative names or a single dotted name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, although I'm not sure what you mean by 'single dotted name'. Would the following comment explain it better?

// Assigns alternative output field names for any relation.  For example, if the relation outputs
// three fields (columns) then their names can be changed by assigning each new name to an 
// instance of output_names, in the same order in which they are emitted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, so this is not the name of the subquery but its outputs. Then it should work exactly like root names (so you can name struct subfields as well). A reference to relroot's names field should be enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. So perhaps...

// Assigns alternative output field names for any relation.  Equivalent to the names field
// in RelRoot, but allows the option of naming the fields of any relation within the structure.

Copy link
Member

@EpsilonPrime EpsilonPrime Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Assigns alternative output field names for any relation. Equivalent to the names field
// in RelRoot but applies to the output of the relation this RelCommon is attached to.

I like that better. Here's an attempt written on my phone.

Allows the output field names of any relation to be set or reset

Signed-off-by: andrew-coleman <andrew_coleman@uk.ibm.com>
Signed-off-by: Andrew Coleman <andrew_coleman@uk.ibm.com>
@andrew-coleman
Copy link
Contributor Author

Does this need a second reviewer, or can it be merged now?
Many thanks!

@jacques-n jacques-n merged commit 5a73281 into substrait-io:main Sep 11, 2024
13 checks passed
@jacques-n
Copy link
Contributor

Thanks for the patience @andrew-coleman !

@andrew-coleman andrew-coleman deleted the relcommon_metadata branch September 12, 2024 07:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants