Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative approach to logical type variations. #25

Merged
merged 3 commits into from
Sep 14, 2021

Conversation

jacques-n
Copy link
Contributor

@jacques-n jacques-n commented Sep 11, 2021

Add new type variation concept with short description
Add sketch example variation definitions
Remove some old unsigned types (cleanup)


This change is Reviewable

@jacques-n jacques-n requested a review from a team September 11, 2021 18:24
enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
message Variation {
Copy link
Contributor

@emkornfield emkornfield Sep 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A possible more natural way of modeling this in protobuf is to use have different physical type systems define extensions on either the Type or Pair message Did you consider that as an alternative? (I don't think this needs to hold up the PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really see much benefit to the extensions approach in this case. Generally, it is an interesting question whether to expose some objects with extensions but I generally feel like any kind of pure black box that an extension would provide would weaken the spec. In the pattern here the spec still has information about how the extended functionality works, even if declared beyond the core project.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extensions provide the ability to more strongly model variations. There would still be two classes:

  1. In spec (those checked into this repo) and an extension range would be reserved for these).
  2. Those not checked into the repo (blackbox) and would live in a separate range.

I was thinking have one extension for each system type. e.g. ArrowVarationExtension or AvroVariationExtension. At least for Arrow variations seem like they might stack (largestring and dictionary encoding).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't really seem like extensions provide a stronger model. For example, when trying to decide whether a particular function can be bound to a particular type:

  • With the approach provided in the PR, the rules about binding are clear. If the variation inherits from logical, the functions can be bound to any function declared against the base logical type. It also fully defines what a complete type definition is.
  • If using extensions, when trying to bind, can you bind to a function definition for the base logical type against a type that has one or more extensions properties defined?

I also haven't seen extensions successfully in a project in this way (it seems like they are only a bucket for arbitrary as opposed to structured information).

@jacques-n
Copy link
Contributor Author

@westonpace, any thoughts on this approach?

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I agree on "function behavior" but I wonder too if this might be a topic that is best to revisit once we've finished later parts of the spec such as function support and write/read operations.

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm no protobuf expert but can variation be made optional? What is the equivalent of SYSTEM_DEFAULT from before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In proto 3, all items are optional. So system default would be an undefined variation (at least that was my thinking).

| Parent Type | The parent type this variation belongs to. Variations can only be expressed for simple types and wild-carded compound types (e.g. i8 or varchar(*)). |
| Name | The name used to reference this type. Should be unique within type variations for this parent type within an organization. |
| Description | A human description of the purpose of this type variation |
| Function Behavior | **Inherits** or **Independent**: Whether this variation supports functions using the canonical variation or whether functions should be resolved independently. For example if one has the function `add(i8,i8)` defined and then defines an i8 variation, can the i8 variation field be bound to the base `add` operation (inherits) or does a specialized version of `add` need to be defined specifically for this type variation (independent). Defaults to inherits. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If "function behavior" is independent then I would argue it is not a variation but a separate logical type at that point. I may be misunderstanding the definition here though. My thinking of "inherits" means "these two types have the same semantic meaning". So "independent" would mean that the type has a different semantic meaning.

In other words. I see two possible interpretations (neither of which are very appealing)

  • Let there be a function F. The term inherits means that if x AS VARIATION v1 == x AS VARIATION v2 and F(x) => y then y AS VARIATION v1 == y AS VARIATION v2.
    • I don't like this interpretation because I think this should be true for every variation no matter what.
  • If a consumer supports function F on type X as variation v1 then it MUST support function F on type X as variation v2
    • I'm not a fan of this one because I think there will always be function signatures that aren't supported by one particular engine or another and the fact is unrelated to variations. Plus, a query planner could always discover this via function introspection and presumably cast to a supported variation (or the consumer itself could support this).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imagine a situation where you have a physical implementation function that only works for dictionary encoded strings (but does not work for the "system default" strings). Both values are logically strings but only a subset of operations may be available for one physical representation. That is the purpose of independent. If you're using inherits, it means that a function defined for the base type would be available for the variation.

To me, part of having physical variations is that there will likely be some operations that only work against one of the variations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I agree those situations may exist. I just don't think Substrait should concern itself with defining which situations are valid or invalid. Here are two concrete examples:

  • Arithmetic functions in Arrow work on regular numbers as well as dictionary encoded numbers. This is an example where we support functions for all physical representations.
  • The system default timestamp representation may be a 64 bit integer. An alternative is a packed struct with year/month/day/... Field extraction (e.g. get_hour) on the former is expensive. Field extraction on the latter is trivial. Users typically want multiple fields (e.g. it's rare to just ask for the hour, often you want hour, minutes, and seconds). So, an engine may choose to only implement the field extraction kernel for the packed struct representation, forcing a cast.

Is it an error, in Substrait's eyes, if an engine fails to support arithmetic functions on dictionary numbers? Is it an error if an engine chooses to allow field extraction on the system default representation?

I don't think Substrait should care.


| Property | Description |
| ----------------- | ------------------------------------------------------------ |
| Parent Type | The parent type this variation belongs to. Variations can only be expressed for simple types and wild-carded compound types (e.g. i8 or varchar(*)). |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe Logical Type instead of Parent Type? When I see Parent I think tree but there is only one level of nesting here. That being said, I don't feel too strongly about it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to avoid using logical since it is overloaded in relational algebra. Maybe "base" instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to base. lmk your thoughts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Base works for me

Copy link
Contributor Author

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I figure we can revisit when we start discussing functions in earnest. @westonpace and @emkornfield you good for this to be merged as is?

Reviewable status: 0 of 8 files reviewed, 4 unresolved discussions (waiting on @emkornfield and @westonpace)

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's definitely an improvement. I'm fine waiting until we start talking more in depth about functions to worry about the inherits/independent issue.

@jacques-n jacques-n merged commit 0374f1e into substrait-io:main Sep 14, 2021
@jacques-n jacques-n mentioned this pull request Sep 14, 2021
@jacques-n jacques-n deleted the type_variations_alternative branch September 17, 2021 20:51
rkondakov pushed a commit to rkondakov/substrait that referenced this pull request Nov 21, 2023
…t-io#25)

* Fix bug causing aggregate function invocation to be dropped
* Add visitor to perform deep traversal across child relations
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants