Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative approach to logical type variations. #25

Merged
merged 3 commits into from
Sep 14, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
197 changes: 27 additions & 170 deletions binary/type.proto
Original file line number Diff line number Diff line change
Expand Up @@ -30,242 +30,104 @@ message Type {
}

message I8 {
PhysicalType physical_type = 1;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
}

message U8 {
PhysicalType physical_type = 1;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm no protobuf expert but can variation be made optional? What is the equivalent of SYSTEM_DEFAULT from before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In proto 3, all items are optional. So system default would be an undefined variation (at least that was my thinking).

}

message I16 {
PhysicalType physical_type = 1;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
}

message U16 {
PhysicalType physical_type = 1;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 1;
}

message I32 {
PhysicalType physical_type = 1;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
}

message U32 {
PhysicalType physical_type = 1;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 1;
}

message I64 {
PhysicalType physical_type = 1;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
}

message U64 {
PhysicalType physical_type = 1;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
}

message FP16 {
PhysicalType physical_type = 1;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 1;
}

message FP32 {
PhysicalType physical_type = 1;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 1;
}

message FP64 {
PhysicalType physical_type = 1;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 1;
}

message String {
PhysicalType physical_type = 1;
bool dictionary_encoded = 2;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
ARROW_LARGE_STRING = 1;
}
Variation variation = 1;
}

message Binary {
PhysicalType physical_type = 1;
bool dictionary_encoded = 2;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
ARROW_LARGE_BINARY = 1;
}
Variation variation = 1;
}

message Timestamp {
PhysicalType physical_type = 1;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 1;
}

message Date {
PhysicalType physical_type = 1;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 1;
}

message Time {
PhysicalType physical_type = 1;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 1;
}

message IntervalYear {
PhysicalType physical_type = 1;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 1;
}

message IntervalDay {
PhysicalType physical_type = 1;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
ARROW_MONTH_DAY_NANO = 1;
}
Variation variation = 1;
}

// Start compound types.
message FixedChar {
int32 length = 1;
PhysicalType physical_type = 2;
bool dictionary_encoded = 3;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 2;
}

message VarChar {
int32 length = 1;
PhysicalType physical_type = 2;
bool dictionary_encoded = 3;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 2;
}

message FixedBinary {
int32 length = 1;
PhysicalType physical_type = 2;
bool dictionary_encoded = 3;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 2;
}

message Decimal {
int32 scale = 1;
int32 precision = 2;
PhysicalType physical_type = 3;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
ARROW_128 = 1;
}
Variation variation = 3;
}

message Struct {
repeated Type types = 1;
PhysicalType physical_type = 2;
bool dictionary_encoded = 3;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 2;
}

message NamedStruct {

repeated Pair pairs = 1;
PhysicalType physical_type = 2;
bool dictionary_encoded = 3;
Variation variation = 2;

message Pair {
string name = 1;
Type type = 2;
}

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
}

message List {
Type type = 1;
PhysicalType physical_type = 2;
bool dictionary_encoded = 3;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 2;
}

message Map {

repeated KeyValue key_values = 1;
PhysicalType physical_type = 2;
bool dictionary_encoded = 3;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
UTF8_ORDERED_KEYS = 1;
}
Variation variation = 2;

message KeyValue {
Type key = 1;
Expand All @@ -275,22 +137,17 @@ message Type {

message TimestampMicroTZ {
string timezone = 1;

PhysicalType physical_type = 2;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
Variation variation = 2;
}

message TimestampTZ {
string timezone = 1;
Variation variation = 2;
}

PhysicalType physical_type = 2;

enum PhysicalType {
SYSTEM_DEFAULT = 0;
}
message Variation {
Copy link
Contributor

@emkornfield emkornfield Sep 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A possible more natural way of modeling this in protobuf is to use have different physical type systems define extensions on either the Type or Pair message Did you consider that as an alternative? (I don't think this needs to hold up the PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really see much benefit to the extensions approach in this case. Generally, it is an interesting question whether to expose some objects with extensions but I generally feel like any kind of pure black box that an extension would provide would weaken the spec. In the pattern here the spec still has information about how the extended functionality works, even if declared beyond the core project.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extensions provide the ability to more strongly model variations. There would still be two classes:

  1. In spec (those checked into this repo) and an extension range would be reserved for these).
  2. Those not checked into the repo (blackbox) and would live in a separate range.

I was thinking have one extension for each system type. e.g. ArrowVarationExtension or AvroVariationExtension. At least for Arrow variations seem like they might stack (largestring and dictionary encoding).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't really seem like extensions provide a stronger model. For example, when trying to decide whether a particular function can be bound to a particular type:

  • With the approach provided in the PR, the rules about binding are clear. If the variation inherits from logical, the functions can be bound to any function declared against the base logical type. It also fully defines what a complete type definition is.
  • If using extensions, when trying to bind, can you bind to a function definition for the base logical type against a type that has one or more extensions properties defined?

I also haven't seen extensions successfully in a project in this way (it seems like they are only a bucket for arbitrary as opposed to structured information).

int32 organization = 1;
string name = 2;
}

message UserDefined {
Expand Down
18 changes: 18 additions & 0 deletions extensions/type_variations.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
types:
- string:
- name: dict4
description: a four-byte dictionary encoded string
functions: inherits
- name: bigoffset
description: The arrow large string representation of strings, still restricted to the default string size defined in Substrait.
functions: separate
- struct:
- name: avro
description: an avro encoded struct
functions: separate
- name: cstruct
description: a cstruct representation of the struct
functions: separate
- name: dict2
description: a 2-byte dictionary encoded string.
functions: inherit
6 changes: 3 additions & 3 deletions site/docs/spec/specification.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,9 +19,9 @@ Once all portions of the specification have been moved to commit (or eliminated)

| Priority | Status | Section | Description |
| -------- | ------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| 1 | sketch | [Simple Logical Types](/types/simple_logical_types) | A way to describe the set of basic types that will be operated on within a plan. Only includes simple types such as integers and doubles (nothing configurable or compound). |
| | sketch | [Compound Logical Types](/types/compound_logical_types) | Expression of types that go beyond simple scalar values. Key concepts here include: configurable types such as fixed length and numeric types as well as compound types such as structs, maps, lists, etc. |
| | sketch | [Physical Types](/types/physical_types) | Physical extensions to logical types. |
| 1 | sketch | [Simple Types](/types/simple_logical_types) | A way to describe the set of basic types that will be operated on within a plan. Only includes simple types such as integers and doubles (nothing configurable or compound). |
| | sketch | [Compound Types](/types/compound_logical_types) | Expression of types that go beyond simple scalar values. Key concepts here include: configurable types such as fixed length and numeric types as well as compound types such as structs, maps, lists, etc. |
| | sketch | [Type Variations](/types/type_variations.md) | Physical variations to base types. |
| | sketch | [User Defined Types](/types/user_defined_types) | Extensions that can be defined for specific IR producers/consumers. |
| 2 | sketch | [Field References](/expressions/field_references) | Expressions to identify which portions of a record should be |
| 3 | sketch | [Scalar Functions](/expressions/scalar_functions) | Description of how functions are specified. Concepts include arguments, variadic functions, output type derivation, etc. |
Expand Down
2 changes: 1 addition & 1 deletion site/docs/types/_config
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
arrange:
- simple_logical_types.md
- compound_logical_types.md
- physical_types.md
- type_variations.md
- user_defined_types.md
2 changes: 1 addition & 1 deletion site/docs/types/compound_logical_types.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Logical Compound Types
# Compound Types

Compound types include any type that is configurable including complex types as well as configurable scalar types.

Expand Down
37 changes: 0 additions & 37 deletions site/docs/types/physical_types.md

This file was deleted.

2 changes: 1 addition & 1 deletion site/docs/types/simple_logical_types.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Simple Logical Types
# Simple Types

Substrait tries to cover the most common types used in data manipulation. Simple types are those that don't support any form of configuration. For simplicity, any generic type that has only a small number of discrete implementations is declared directly (as opposed to via configuration).

Expand Down
11 changes: 11 additions & 0 deletions site/docs/types/type_variations.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
# Type Variations

Since Substrait is designed to work in both logical and physical contexts, there is a need to support extended attributes in the physical context. Different consumers may have multiple ways to present the same logical type. For example, an engine might support dictionary encoding a string or using either a row-wise or columnar representation of a struct. As such, there is the facility for specification users to express additional type variations for each logical type. These variations are expected to have the same logical properties as the canonical variation and are defined for each organization. The key properties of these variations are:

| Property | Description |
| ----------------- | ------------------------------------------------------------ |
| Base Type | The base type this variation belongs to. Variations can only be expressed for simple types and wild-carded compound types (e.g. i8 or varchar(*)). |
| Name | The name used to reference this type. Should be unique within type variations for this parent type within an organization. |
| Description | A human description of the purpose of this type variation |
| Function Behavior | **Inherits** or **Independent**: Whether this variation supports functions using the canonical variation or whether functions should be resolved independently. For example if one has the function `add(i8,i8)` defined and then defines an i8 variation, can the i8 variation field be bound to the base `add` operation (inherits) or does a specialized version of `add` need to be defined specifically for this type variation (independent). Defaults to inherits. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If "function behavior" is independent then I would argue it is not a variation but a separate logical type at that point. I may be misunderstanding the definition here though. My thinking of "inherits" means "these two types have the same semantic meaning". So "independent" would mean that the type has a different semantic meaning.

In other words. I see two possible interpretations (neither of which are very appealing)

  • Let there be a function F. The term inherits means that if x AS VARIATION v1 == x AS VARIATION v2 and F(x) => y then y AS VARIATION v1 == y AS VARIATION v2.
    • I don't like this interpretation because I think this should be true for every variation no matter what.
  • If a consumer supports function F on type X as variation v1 then it MUST support function F on type X as variation v2
    • I'm not a fan of this one because I think there will always be function signatures that aren't supported by one particular engine or another and the fact is unrelated to variations. Plus, a query planner could always discover this via function introspection and presumably cast to a supported variation (or the consumer itself could support this).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imagine a situation where you have a physical implementation function that only works for dictionary encoded strings (but does not work for the "system default" strings). Both values are logically strings but only a subset of operations may be available for one physical representation. That is the purpose of independent. If you're using inherits, it means that a function defined for the base type would be available for the variation.

To me, part of having physical variations is that there will likely be some operations that only work against one of the variations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I agree those situations may exist. I just don't think Substrait should concern itself with defining which situations are valid or invalid. Here are two concrete examples:

  • Arithmetic functions in Arrow work on regular numbers as well as dictionary encoded numbers. This is an example where we support functions for all physical representations.
  • The system default timestamp representation may be a 64 bit integer. An alternative is a packed struct with year/month/day/... Field extraction (e.g. get_hour) on the former is expensive. Field extraction on the latter is trivial. Users typically want multiple fields (e.g. it's rare to just ask for the hour, often you want hour, minutes, and seconds). So, an engine may choose to only implement the field extraction kernel for the packed struct representation, forcing a cast.

Is it an error, in Substrait's eyes, if an engine fails to support arithmetic functions on dictionary numbers? Is it an error if an engine chooses to allow field extraction on the system default representation?

I don't think Substrait should care.