feat: add output_schema to ExpandRel message #661
Conversation
In order to support the conversion of Expand to and from Spark logical plans, the schema describing the resultant columns is required.

Signed-off-by: andrew-coleman <andrew_coleman@uk.ibm.com>
Isn't it possible to calculate the output schema from the message itself? E.g. in pseudocode...
Yes, we can derive the type information from the field expressions, but we can't determine the names that Spark gives the new columns. That's the reason for adding an `output_schema`.
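To illustrate the point being made here, the following is a hypothetical sketch (the class names, fields, and helper functions are invented for illustration and are not Substrait or Spark APIs): the output *types* of an Expand can be computed by walking its field expressions against the input schema, but the column *names* that Spark assigns exist only on the Spark side and cannot be recovered from the message.

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-ins for Substrait expression nodes.
@dataclass
class FieldRef:
    index: int  # refers to a column of the input relation

@dataclass
class Literal:
    value: object

def expr_type(expr, input_types):
    """Derive the type of a single projection expression."""
    if isinstance(expr, FieldRef):
        return input_types[expr.index]
    if isinstance(expr, Literal):
        return type(expr.value).__name__
    raise TypeError(f"unhandled expression: {expr!r}")

def expand_output_types(projections, input_types):
    # All projection lists in an Expand must agree on types,
    # so deriving them from the first list is enough here.
    return [expr_type(e, input_types) for e in projections[0]]

# Types are derivable from the expressions in the message...
projs = [[FieldRef(0), Literal(1)], [FieldRef(1), Literal(None)]]
print(expand_output_types(projs, ["str", "int"]))  # ['str', 'int']
# ...but the names Spark gives these columns are not in the message at all,
# so they must be supplied separately.
```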
Just wondering if there has been any discussion on this?
There has not, but this ping reminded me to revisit. Substrait has no concept of field names. Spark does. I think I see two options (there are probably more; this is just top-of-my-head):
I think I'd prefer the first approach (easier for non-Spark consumers to ignore). You might use #649 as inspiration.
The only names that truly matter are the ones that are emitted by the plan; the root's names allow you to specify these. One shouldn't need to keep track of any of the intermediate names. For the text version of the Substrait plan, I ended up automatically generating names and using those generated names as references later in the plan. Since they don't matter, the round trip from binary to text and back was just fine. That said, the intermediate names would change if I went from text to binary and back, which is the same problem as with Spark. But as they're intermediate, I'd argue it doesn't really matter what they are.

If we do add the names, they shouldn't be required for execution to succeed, so something like the root names mechanism that provides intermediate names as a metadata item seems like the right approach. There are also some similarities to the emit logic that we may be able to leverage.
I'm going to close this ticket, as the specific approach seems to be not in line with how we should solve the underlying problem in Substrait. I suggest @andrew-coleman open a new PR that introduces optional metadata for this, per @westonpace's comments.
Many thanks for the suggestion and apologies for the delay in responding (holiday season!). I have opened a new PR #696, which I hope is consistent with your suggestion. |
Background: I’m currently working on improving the test pass rate of the TPC-DS suite for the `spark` module in `substrait-java`. One of the relations that is currently not supported by the spark translator is the `Expand` relation. It’s not implemented in the `core` module either. The Spark catalyst query optimiser injects `Expand` into the logical plan when it encounters distinct aggregations, so I’m implementing `Expand` in `substrait-java` to support this scenario (and fix a number of test cases).

However, the Spark Expand object requires an extra parameter that is currently not available in the Substrait Expand protobuf message. This extra parameter defines the schema of the output that gets generated by applying each of the projections.
This PR proposes an addition to the proto message that would support this.
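For illustration only, the shape of such an addition might look like the sketch below. This is hypothetical: the field names, numbers, and surrounding message definition are illustrative and do not reflect the actual `ExpandRel` definition in the Substrait spec (and this PR was ultimately closed in favour of a metadata-based approach).

```protobuf
// Hypothetical sketch, not the real ExpandRel definition.
message ExpandRel {
  RelCommon common = 1;
  Rel input = 2;
  repeated ExpandField fields = 3;
  // Proposed addition: the schema (in particular, the column names)
  // that the producer, e.g. Spark, assigns to the expanded output.
  NamedStruct output_schema = 4;
}
```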