
feat: add output_schema to ExpandRel message #661

Closed

Conversation

andrew-coleman
Contributor

Background: I’m currently working on improving the test pass rate of the TPC-DS suite for the spark module in substrait-java. One of the relations that is currently not supported by the spark translator is the Expand relation. It’s not implemented in the core module either. The Spark catalyst query optimiser injects Expand into the logical plan when it encounters distinct aggregations, so I’m implementing Expand in substrait-java to support this scenario (and fix a number of test cases).

However, the Spark Expand object requires an extra parameter that is currently not available in the Substrait Expand protobuf message. This extra parameter defines the schema of the output that gets generated by applying each of the projections.

This PR proposes an addition to the proto message that would support this.
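For illustration, the addition might look something like the sketch below. This is hypothetical: the field number and surrounding message layout are made up here, and the actual change is in the PR diff.

```protobuf
// Hypothetical sketch of the proposed change; field numbers are illustrative.
message ExpandRel {
  RelCommon common = 1;
  Rel input = 2;
  repeated ExpandField fields = 3;

  // Proposed: the schema (including column names) of the rows produced
  // by applying each of the projections.
  NamedStruct output_schema = 4;
}
```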

In order to support the conversion of Expand to and from Spark logical plans, the schema describing the resultant columns is required.

Signed-off-by: andrew-coleman <andrew_coleman@uk.ibm.com>
@westonpace
Member

Isn't it possible to calculate the output schema from the message itself?

E.g. in pseudocode...

def calculate_output_schema(input, expand_msg):
    output_types = []
    for field in expand_msg.fields:
        if isinstance(field, SwitchingField):
            # A switching field is nullable if any of its duplicates is nullable
            is_nullable = False
            output_type = None
            for duplicate in field.duplicates:
                is_nullable |= duplicate.output_type.is_nullable
                output_type = duplicate.output_type
            output_type.is_nullable = is_nullable
            output_types.append(output_type)
        else:
            # A consistent field contributes its expression's type directly
            output_types.append(field.output_type)
    return output_types

@andrew-coleman
Contributor Author

Isn't it possible to calculate the output schema from the message itself?

Yes, we can derive the type information from the field expressions, but we can't determine the names that spark gives the new columns. That's the reason for adding a NamedStruct rather than a Struct (which, as you say, would have been redundant). It needs to contain the information for building a spark AttributeReference, similar to a ReadRel message.
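To sketch the point (all names below are illustrative stand-ins, not the substrait-java or Spark API): the types are derivable from the field expressions, but the column names are extra information that only the NamedStruct would carry.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for illustration; not the actual Substrait/Spark classes.
@dataclass
class NamedStruct:
    names: list   # column names, e.g. what Spark's AttributeReference needs
    types: list   # the corresponding (derivable) types

def attribute_references(named_struct):
    """Pair each derived type with the name Spark assigned to the column.

    The types alone (derivable from the ExpandRel field expressions) are not
    enough to rebuild Spark's Expand node; the names must come from somewhere.
    """
    return list(zip(named_struct.names, named_struct.types))

schema = NamedStruct(names=["gid", "spark_grouping_id"], types=["i64", "i64"])
print(attribute_references(schema))
# → [('gid', 'i64'), ('spark_grouping_id', 'i64')]
```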

@andrew-coleman
Contributor Author

Just wondering if there has been any discussion on this?

@westonpace
Member

Just wondering if there has been any discussion on this?

There has not but this ping reminded me to revisit.

Substrait has no concept of field names. Spark does. I don't think ExpandRel is the correct place to solve this. For example, there is no place in ProjectRel to specify the names of the new columns either. This will also be a problem for Spark.

I see two options (there are probably more, this is just top-of-my-head):

  • Introduce metadata on RelCommon that sets (or resets) the output field names for any relation.
  • Introduce a new AliasRel which renames fields.

I think I'd prefer the first approach (easier for non-spark consumers to ignore). You might use #649 as inspiration.
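A rough sketch of the first option, purely for illustration: the message and field names below are made up and are not the actual Substrait definition or what was ultimately adopted.

```protobuf
// Hypothetical sketch only: optional, ignorable metadata on RelCommon that
// sets (or resets) the output field names of any relation.
message RelCommon {
  // ... existing fields elided ...

  message Hint {
    // Consumers that don't care about field names can simply ignore this.
    repeated string output_names = 10;  // field number is illustrative
  }
}
```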

@EpsilonPrime
Member

The only names that truly matter are the ones that are emitted by the plan. The root's names allow you to specify these.

For the intermediate names, one shouldn't need to keep track of them. For the text version of the Substrait plan I ended up automatically generating names and using those generated names as references later in the plan. Since the names don't matter, the round trip from binary to text and back was just fine. That said, the intermediate names would change if I went from text to binary and back, which is the same problem Spark has. But as they're intermediate, I'd argue it doesn't really matter what they are.

If we do add the names they shouldn't be required for the execution to succeed so having something like root names to provide intermediate names as a metadata item seems like the right approach. There are also some similarities to the emit logic that we may be able to leverage.

@jacques-n
Contributor

I'm going to close this ticket, as the specific approach doesn't seem to be in line with how we should solve the underlying problem in Substrait. I suggest @andrew-coleman open a new PR that introduces optional metadata for this, per @westonpace's comments.

@andrew-coleman
Contributor Author

Many thanks for the suggestion and apologies for the delay in responding (holiday season!).

I have opened a new PR #696, which I hope is consistent with your suggestion.
