
feat: add output_schema to ExpandRel message #661

Closed

Conversation

andrew-coleman
Contributor

Background: I’m currently working on improving the test pass rate of the TPC-DS suite for the spark module in substrait-java. One of the relations that is currently not supported by the spark translator is the Expand relation. It’s not implemented in the core module either. The Spark catalyst query optimiser injects Expand into the logical plan when it encounters distinct aggregations, so I’m implementing Expand in substrait-java to support this scenario (and fix a number of test cases).

However, the Spark Expand object requires an extra parameter that is currently not available in the Substrait Expand protobuf message. This extra parameter defines the schema of the output that gets generated by applying each of the projections.

This PR proposes an addition to the proto message that would support this.
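For illustration, the addition might look something like the sketch below. This is hypothetical: the field number and surrounding message layout are made up here, and the actual change is in the PR diff.

```protobuf
// Hypothetical sketch of the proposed change; field numbers are illustrative.
message ExpandRel {
  RelCommon common = 1;
  Rel input = 2;
  repeated ExpandField fields = 3;

  // Proposed: the schema (including column names) of the rows produced
  // by applying each of the projections.
  NamedStruct output_schema = 4;
}
```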

In order to support the conversion of Expand to and from Spark logical plans, the schema describing the resultant columns is required.

Signed-off-by: andrew-coleman <andrew_coleman@uk.ibm.com>
@westonpace
Member

Isn't it possible to calculate the output schema from the message itself?

E.g. in pseudocode...

def calculate_output_schema(input, expand_msg):
    output_types = []
    for field in expand_msg.fields:
        if isinstance(field, SwitchingField):
            # A switching field is nullable if any of its duplicates is nullable
            is_nullable = False
            output_type = None
            for duplicate in field.duplicates:
                is_nullable |= duplicate.output_type.is_nullable
                output_type = duplicate.output_type
            output_type.is_nullable = is_nullable
            output_types.append(output_type)
        else:
            # A consistent field contributes its expression's type directly
            output_types.append(field.output_type)
    return output_types

@andrew-coleman
Contributor Author

Isn't it possible to calculate the output schema from the message itself?

Yes, we can derive the type information from the field expressions, but we can't determine the names that spark gives the new columns. That's the reason for adding a NamedStruct rather than a Struct (which, as you say, would have been redundant). It needs to contain the information for building a spark AttributeReference, similar to a ReadRel message.
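To sketch the point (all names below are illustrative stand-ins, not the substrait-java or Spark API): the types are derivable from the field expressions, but the column names are extra information that only the NamedStruct would carry.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for illustration; not the actual Substrait/Spark classes.
@dataclass
class NamedStruct:
    names: list   # column names, e.g. what Spark's AttributeReference needs
    types: list   # the corresponding (derivable) types

def attribute_references(named_struct):
    """Pair each derived type with the name Spark assigned to the column.

    The types alone (derivable from the ExpandRel field expressions) are not
    enough to rebuild Spark's Expand node; the names must come from somewhere.
    """
    return list(zip(named_struct.names, named_struct.types))

schema = NamedStruct(names=["gid", "spark_grouping_id"], types=["i64", "i64"])
print(attribute_references(schema))
# → [('gid', 'i64'), ('spark_grouping_id', 'i64')]
```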

@andrew-coleman
Contributor Author

Just wondering if there has been any discussion on this?

@westonpace
Member

Just wondering if there has been any discussion on this?

There has not but this ping reminded me to revisit.

Substrait has no concept of field names. Spark does. I don't think ExpandRel is the correct place to solve this. For example, there is no place in ProjectRel to specify the names of the new columns either. This will also be a problem for Spark.

I see two options (there are probably more, this is just top-of-my-head):

  • Introduce metadata on RelCommon that sets (or resets) the output field names for any relation.
  • Introduce a new AliasRel which renames fields.

I think I'd prefer the first approach (easier for non-spark consumers to ignore). You might use #649 as inspiration.
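A rough sketch of the first option, purely for illustration: the message and field names below are made up and are not the actual Substrait definition or what was ultimately adopted.

```protobuf
// Hypothetical sketch only: optional, ignorable metadata on RelCommon that
// sets (or resets) the output field names of any relation.
message RelCommon {
  // ... existing fields elided ...

  message Hint {
    // Consumers that don't care about field names can simply ignore this.
    repeated string output_names = 10;  // field number is illustrative
  }
}
```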

@EpsilonPrime
Member

The only names that truly matter are the ones that are emitted by the plan. The root's names allow you to specify these.

For the intermediate names, one shouldn't need to keep track of them. For the text version of the Substrait plan I ended up automatically generating names and using those generated names as references later in the plan. Since the names don't matter, the round trip from binary to text and back was just fine. That said, the intermediate names would change if I went from text to binary and back, which is the same problem Spark has. But as they're intermediate, I'd argue it doesn't really matter what they are.

If we do add the names they shouldn't be required for the execution to succeed so having something like root names to provide intermediate names as a metadata item seems like the right approach. There are also some similarities to the emit logic that we may be able to leverage.

@jacques-n
Contributor

I'm going to close this ticket, as the specific approach doesn't seem to be in line with how we should solve the underlying problem in Substrait. I suggest @andrew-coleman open a new PR that introduces optional metadata for this, per @westonpace's comments.

@andrew-coleman
Contributor Author

Many thanks for the suggestion and apologies for the delay in responding (holiday season!).

I have opened a new PR #696, which I hope is consistent with your suggestion.
