Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propose update for simple output type expressions. #40

Merged
merged 11 commits into from
Oct 2, 2021

Conversation

jacques-n
Copy link
Contributor

@jacques-n jacques-n commented Sep 22, 2021

Proposal for simple type expressions to support type parameterization for output type derivation.


This change is Reviewable


### Direct Return Types

A direct return type is one that is fully known at function definition type. Example simple direct return types would be things such as i32, fp32. For compound types, a direct return type must be fully declared. Example fully defined types: VARCHAR(20), DECIMAL(25,5)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A direct return type is one that is fully known at function definition type.

Should the last type be time?

Also, I find this language confusing. A function definition with a parameterized type is fully known at function definition time. It won't necessarily have been monomorphized at definition time, is that what you're referring to here? Maybe I'm being a bit pedantic but my thinking is that we shouldn't invent new words for things :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to propose alternative language here.

Copy link
Member

@westonpace westonpace Sep 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is function definition time? The point when the function is written down in the spec? Or the point when the query plan is created?

If the former, perhaps simply:

A direct return type is one that does not depend on any parameterized input type.

If the latter:

A direct return type is one that is not know until the query plan is materialized.

I assume from the context that we are talking about the former.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the former. I'll update in next patch.

* `equal(Type a, Type b) => boolean`
* `if(boolean a) then (integer) else (integer)`
* `if(boolean a) then (type) else (type)`
* `not_bindable()`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there's !bindable and not_bindable(). Are these the same thing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, is not_bindable() what is produced by the syntax !bindable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of this is syntax, just concepts. Syntax would be delegated to the serialization formats. In the abbreviated section I try to make it easy to understand what the ops are. The section below is blown out to better describe the parameters, etc.

* `multiply(integer a, integer b) => integer`
* `divide(integer a, integer b) => integer`
* `add(integer a, integer b) => integer`
* `substract(integer a, integer b) => integer`
Copy link
Contributor

@cpcloud cpcloud Sep 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to be that guy and suggest that we don't leave integer ambiguous here and give it a specific integer type. I'd vote for i64 since it's the widest type, supporting the most use cases. Hopefully this isn't an area of the spec where we need to be extremely concerned about integer sizes :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, substract has an extra s in it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I could make some kind of witty joke that anything sub* in the substrait project must be subst* and that is why I mispelled... :D .fixed now.

Wrt to i64, I agree and actually already specified that just above.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partly I worry we are over-learning the decimal promotion case. My gut tells me that there will be more complex logic than we are going to be able to express in a simple type expression. That being said, I don't think they harm anything.

Or, to put it another way, we express the semantics of add (what should happen to the values) in prose. I think there will be cases where type derivation is complex enough that prose is the best we can do and that should be ok.

I'm not sure these changes are extensive enough to capture what I had in mind. I've made a PR to try and express my thoughts but even that is just a rough start. I've got a pretty good test case (I will describe more in discussions) for this language. I might run through things this weekend to see if I can create something that satisfies my test case.

site/docs/expressions/scalar_functions.md Show resolved Hide resolved
Comment on lines 102 to 103
| Add item to list | `add(<List<T>, T>) => List<T>` |
| Decimal Division | `divide(Decimal(P1,S1), Decimal(P2,S2)) => Decimal(P1 -S1 + S2 + MAX(6, S1 + P2 + 1), MAX(6, S1 + P2 + 1))` |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it List<T> (angle brackets) but Decimal(P1,S1) (parentheses). Is there a significance here? Or is it purely stylistic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the common way these are expressed in SQL land. Brackets for type parameters. Parenthesis for integer parameters.

Comment on lines 22 to 24
oneof derivation_type {
Type direct = 9;
CustomDerivationPolicy derivation_policy = 10;
}
TypeExpression output_type = 9;

repeated Implementation implementations = 11;
repeated Implementation implementations = 10;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is flexible enough to represent something like f(K,K) => boolean

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it will be. You would declare use of the same parameter name on each of the args and then a return type of boolean.

binary/function.proto Show resolved Hide resolved
@westonpace
Copy link
Member

westonpace commented Sep 24, 2021

(all syntax in this comment is pseduocode / for example purposes only)

I think one thing missing for me at the moment is the ability to specify a range of types for an input argument. For a simple scalar function example consider LIST_VALUE_LENGTH<T>(T)=>uint32 which accepts list arrays and returns the length of each element in the input array. In this case T is only valid for "lists".

Initially my thinking was that I didn't need type constraints because I could simply list out all possible permutations. This works well for the numeric kernels (e.g. ADD) because there are a finite number of numeric types. Unfortunately, there are infinitely many list types (e.g. lists of structs). Some possible ways to represent this:

  • Let an argument be a type expression
    • E.g. LIST_VALUE_LENGTH<T>(List<T>)=>uint32
  • Allow type paremeters to take a "type condition expression" which evaluates to a boolean
    • E.g. LIST_VALUE_LENGTH<T>(T)=>uint32 : where T satisfies IS_LIST(T)

Copy link
Contributor Author

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(all syntax in this comment is pseduocode / for example purposes only)

I think the biggest thing missing for me at the moment is the ability to specify a range of types for an input argument. For a simple scalar function example consider LIST_VALUE_LENGTH(T)=>uint32 and returns the length of each element in the input array. In this case T is only valid for "lists".

Initially my thinking was that I didn't need type constraints because I could simply list out all possible permutations. This works well for the numeric kernels (e.g. ADD) because there are a finite number of numeric types. Unfortunately, there are infinitely many list types (e.g. lists of structs). Some possible ways to represent this:

Let an argument be a type expression
E.g. LIST_VALUE_LENGTH(List)=>uint32
Allow type paremeters to take a "type condition expression" which evaluates to a boolean
E.g. LIST_VALUE_LENGTH(T)=>uint32 : where T satisfies IS_LIST(T)

In my proposal, we allow type parameters in arguments and type expressions for output derivation. So in your example LIST_VALUE_LENGTH<T>(List<T>)=>uint32.

Comment on lines 22 to 24
oneof derivation_type {
Type direct = 9;
CustomDerivationPolicy derivation_policy = 10;
}
TypeExpression output_type = 9;

repeated Implementation implementations = 11;
repeated Implementation implementations = 10;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it will be. You would declare use of the same parameter name on each of the args and then a return type of boolean.

binary/function.proto Show resolved Hide resolved

### Direct Return Types

A direct return type is one that is fully known at function definition type. Example simple direct return types would be things such as i32, fp32. For compound types, a direct return type must be fully declared. Example fully defined types: VARCHAR(20), DECIMAL(25,5)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the former. I'll update in next patch.

site/docs/expressions/scalar_functions.md Show resolved Hide resolved
Comment on lines 102 to 103
| Add item to list | `add(<List<T>, T>) => List<T>` |
| Decimal Division | `divide(Decimal(P1,S1), Decimal(P2,S2)) => Decimal(P1 -S1 + S2 + MAX(6, S1 + P2 + 1), MAX(6, S1 + P2 + 1))` |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the common way these are expressed in SQL land. Brackets for type parameters. Parenthesis for integer parameters.

Copy link
Contributor

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 3 files at r1, 1 of 1 files at r6.
Reviewable status: 2 of 3 files reviewed, 18 unresolved discussions (waiting on @cpcloud, @emkornfield, @jacques-n, @rymurr, and @westonpace)


binary/type.proto, line 37 at r6 (raw file):

        // A special type that can only be used for parameterized function arguments.
        // This allows a user to express things like List<T>
        Parameter parameter = 33;

Putting Parameter and TypeExpression here seems like it would allow folks to generate invalid IR, such as UDF that returns a type parameter or one that returns a type expression. Is that actually intended?


binary/type.proto, line 172 at r6 (raw file):

        oneof integer_type {
            int32 literal = 1;
            String parameter_name = 2;

Curious why this isn't a plain string


binary/type.proto, line 215 at r6 (raw file):

        enum OpType {
            UNKNOWN = 0;

Is UNKNOWN a useful variant? Isn't it already defined by simply being anything that isn't in the OpType enum?


site/docs/expressions/scalar_functions.md, line 58 at r1 (raw file):

Previously, jacques-n (Jacques Nadeau) wrote…

Yes, the former. I'll update in next patch.

Should we unify the language here and use "concrete type" for both input and output types?


site/docs/expressions/scalar_functions.md, line 10 at r6 (raw file):

| ------------------ | ------------------------------------------------------------ | ----------------------------------- |
| Name               | One or more user friendly utf8 strings that are used to reference this function in languages | At least one value is required.     |
| List of arguments  | Argument properties are defined below. Arguments can be fully defined, parameterized or wildcarded. See further details below. | Optional, defaults to niladic.      |

Should wildcarded still be here?


site/docs/expressions/scalar_functions.md, line 52 at r6 (raw file):

Input arguments can be declared in one of two ways: materialized or parameterized.

* Materialized Type: Materialized types are either simple types or compound types with all parameters fully defined (without any parameters). Examples include i32, fp32, VARCHAR(20), List&lt;fp32&gt;, etc.

How about concrete type instead of materialized? How about "Concrete types are types that have no unbound type parameters. Examples include ..."

Type direct = 9;
CustomDerivationPolicy derivation_policy = 10;
}
TypeExpression output_type = 9;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be helpful to add a flag that says whether function has default null behavior, e.g. null in any input is guaranteed to produce a null result.

Type direct = 9;
CustomDerivationPolicy derivation_policy = 10;
}
TypeExpression output_type = 9;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason intermediate_type is a concrete type rather than TypeExpression?

Also, what is max_set? In general, how do you feel about adding comments to describe various fields here to help the reader?

// This allows a user to express things like List<T>
Parameter parameter = 33;

// A special type that can only be used in function return type declarations.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What makes it special? Can you give an example?


message IfElse {
TypeExpression if_condition = 1;
TypeExpression if_return = 2;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the type of "then" expressions? Why is it not called then_return? Shouldn't the type of "then" and "else" expressions be the same? If so, why do we need both if_return and else_return?

Copy link
Contributor Author

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dismissed @cpcloud and @westonpace from 7 discussions.
Reviewable status: 2 of 3 files reviewed, 16 unresolved discussions (waiting on @cpcloud, @emkornfield, @jacques-n, @mbasmanova, @rymurr, and @westonpace)


binary/function.proto, line 22 at r7 (raw file):

Previously, mbasmanova (Maria Basmanova) wrote…

It might be helpful to add a flag that says whether function has default null behavior, e.g. null in any input is guaranteed to produce a null result.

Great question. I'm torn on this one. I actually had that in an earlier version. I removed it as it felt like an function semantics detail as opposed to something that the plan itself needs to express. What do you think a substrait producer or consumer could do with this additional information that they wouldn't have already from another source? (For example, I'd expect that engines would already know this when they were doing code generation against the plan based on properties of their particular implementation).

My biggest argument for it is that probably makes sense for any kind of common structured description of a function semantic to be expressed just to make things easier to understand/validate. For example, I also had a error behavior property original that could be: [overflow, nullify, throw]. It also felt like something outside the realm of substrait and so I removed it but it probably should be added back if we add a null-if-null concept.


binary/type.proto, line 37 at r6 (raw file):

Previously, cpcloud (Phillip Cloud) wrote…

Putting Parameter and TypeExpression here seems like it would allow folks to generate invalid IR, such as UDF that returns a type parameter or one that returns a type expression. Is that actually intended?

Yeah, I'm torn on this. The other option is to create separate idl object hierachies for the two other categories of types.

Concrete Type
Parameterized Type
Type Expression

Since you can embed parameters in the pieces of a concrete type and the same for type expression in parameterized types, I took the shorter path of just having a single representation that will then be validated later. The problems is things like List<Struct<foo:K, bar:string>>. Either we implement three impls of list and struct: one that only accepts concrete, one that accepts concrete + parameter and one that accepts concrete + parameter + type expression or we do this pattern.

As I wrote more about it, maybe it isn't so bad to just have three separate unions. I was think all types would need to be duplicated but there are only a small number of types that actually need duplication (the ones with type parameters). Definitely makes the idl safer/reduces the needs for additional validation. Further thoughts?


binary/type.proto, line 172 at r6 (raw file):

Previously, cpcloud (Phillip Cloud) wrote…

Curious why this isn't a plain string

Yeah, weird. I'm confused on how it even compiled :/


binary/type.proto, line 215 at r6 (raw file):

Previously, cpcloud (Phillip Cloud) wrote…

Is UNKNOWN a useful variant? Isn't it already defined by simply being anything that isn't in the OpType enum?

Protobuf3ism- unknown is the same as unset. Not useful directly but clarifies if the data was missing. If you use it for something proper you can't tell whether the field was forgotten or was set to that value.


binary/type.proto, line 39 at r7 (raw file):

Previously, mbasmanova (Maria Basmanova) wrote…

What makes it special? Can you give an example?

This is used in return types to do things like

fn(DECIMAL(P,S)) => DECIMAL(min(38,p+1))

Basically the output type is a calculation based on the input types.


binary/type.proto, line 192 at r7 (raw file):

Previously, mbasmanova (Maria Basmanova) wrote…

Is this the type of "then" expressions? Why is it not called then_return? Shouldn't the type of "then" and "else" expressions be the same? If so, why do we need both if_return and else_return?

I'm not sure I'm understanding the question/thought. An example here might be:

Imagine fn(DECIMAL(P, S)) => IF P < 18 THEN DECIMAL(18,S) ELSE DECIMAL(38,S).

Not a great example but the return type of the two branches are different. So in this case:

if_condition = P < 18
if_return = DECIMAL(18,S)
else_return = DECIMAL(38,S)


site/docs/expressions/scalar_functions.md, line 53 at r1 (raw file):

Previously, jacques-n (Jacques Nadeau) wrote…

I've updated to remove wildcards.

addressed in patch.


site/docs/expressions/scalar_functions.md, line 54 at r1 (raw file):

Previously, jacques-n (Jacques Nadeau) wrote…

Good point on the equal. I've updated to remove the distinction between wildcard and parameterized.

addressed in patch.


site/docs/expressions/scalar_functions.md, line 58 at r1 (raw file):

Previously, cpcloud (Phillip Cloud) wrote…

Should we unify the language here and use "concrete type" for both input and output types?

Will do.


site/docs/expressions/scalar_functions.md, line 10 at r6 (raw file):

Previously, cpcloud (Phillip Cloud) wrote…

Should wildcarded still be here?

will fix


site/docs/expressions/scalar_functions.md, line 52 at r6 (raw file):

Previously, cpcloud (Phillip Cloud) wrote…

How about concrete type instead of materialized? How about "Concrete types are types that have no unbound type parameters. Examples include ..."

sounds good, will update.


site/docs/expressions/scalar_functions.md, line 103 at r6 (raw file):

Previously, jacques-n (Jacques Nadeau) wrote…

This is the common way these are expressed in SQL land. Brackets for type parameters. Parenthesis for integer parameters.

any particular thoughts on the answer here @westonpace ?

Copy link
Contributor Author

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 3 files reviewed, 16 unresolved discussions (waiting on @cpcloud, @emkornfield, @jacques-n, @mbasmanova, @rymurr, and @westonpace)


binary/function.proto, line 36 at r7 (raw file):

Previously, mbasmanova (Maria Basmanova) wrote…

Any particular reason intermediate_type is a concrete type rather than TypeExpression?

Also, what is max_set? In general, how do you feel about adding comments to describe various fields here to help the reader?

You're right, intermediate type should be a type expression. Will update. max_set is the maximum number of values this function will accept. E.g. if you have LIST_AGG and state that you only allow up to 1000 items in each list, you'd set this 1000.

Yes, we should definitely add comments. It's currently pre-consensus so we've been focused more on the spec than the specific serialization (although I've been trying to keep the proto up to date so people can see how might be serialized). Once we get the core spec components better clarified, we plan on doing some consensus work around the serialization formats so I wasn't working super hard on making the proto consumable outside as a learning aid against the spec.

@mbasmanova
Copy link

Reviewable status: 2 of 3 files reviewed, 16 unresolved discussions (waiting on @cpcloud, @emkornfield, @jacques-n, @mbasmanova, @rymurr, and @westonpace)

binary/function.proto, line 36 at r7 (raw file):

Previously, mbasmanova (Maria Basmanova) wrote…
You're right, intermediate type should be a type expression. Will update. max_set is the maximum number of values this function will accept. E.g. if you have LIST_AGG and state that you only allow up to 1000 items in each list, you'd set this 1000.

Yes, we should definitely add comments. It's currently pre-consensus so we've been focused more on the spec than the specific serialization (although I've been trying to keep the proto up to date so people can see how might be serialized). Once we get the core spec components better clarified, we plan on doing some consensus work around the serialization formats so I wasn't working super hard on making the proto consumable outside as a learning aid against the spec.

Thank you for a quick reply. I was hoping to compile this into C++ and try to map to Velox's PlanNode representation to see what's missing if anything. Would that be possible at all at this stage?

@jacques-n
Copy link
Contributor Author

Thank you for a quick reply. I was hoping to compile this into C++ and try to map to Velox's PlanNode representation to see what's missing if anything. Would that be possible at all at this stage?

It should be possible. I'd expect most queries to be expressible. It would also be a lot of help to validate the design, find weaknesses, etc. I'm sure there are things missing and we're still changing things but for sure worth doing. FYI, I'm exploring doing the other side: using Calcite to produce a substrait plan. It really helps to exercise/better evaluate design decisions.

@jacques-n
Copy link
Contributor Author

Thank you for a quick reply. I was hoping to compile this into C++ and try to map to Velox's PlanNode representation to see what's missing if anything. Would that be possible at all at this stage?

It should be possible. I'd expect most queries to be expressible. It would also be a lot of help to validate the design, find weaknesses, etc. I'm sure there are things missing and we're still changing things but for sure worth doing. FYI, I'm exploring doing the other side: using Calcite to produce a substrait plan. It really helps to exercise/better evaluate design decisions.

I just remembered that the proto sketch for relations is pretty minimal right now. (The spec is more complete.) I'll add some more sketches there.

Update type system to add bounds
Separate type, parameterized type and type expressions to increase clarity of use.
Define a better centralized structure for extensions and use for types and functions.
Also add nullability to types and basic concept of a "plan".
westonpace
westonpace previously approved these changes Sep 30, 2021
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a clear improvement in my mind but I don't know that it solves everything just yet. I didn't review the proto files very much as I'd rather review that with actual usage.

site/docs/expressions/scalar_functions.md Outdated Show resolved Hide resolved
Substrait supports a number of functions. A scalar function signature includes the following properties:
| Property | Description | Required |
| ---------------------- | ------------------------------------------------------------ | ----------------------------------- |
| Name | One or more user friendly utf8 strings that are used to reference this function in languages | At least one value is required. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean if there are multiple names? Are they aliases of each other?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, they are aliases.

site/docs/expressions/scalar_functions.md Outdated Show resolved Hide resolved
Comment on lines +9 to +10
| Deterministic | Whether this function is expected to reproduce the same output when it is invoked multiple times with the same input. This informs a plan consumer on whether it can constant reduce the defined function. An example would be a random() function, which is typically expected to be evaluated repeatedly despite having the same set of inputs. | Optional, defaults to true. |
| Session Dependent | Whether this function is influenced by the session context it is invoked within. For example, a function may be influenced by a user who is invoking the function, the time zone of a session, or some other non-obvious parameter. This can inform caching systems on whether a particular function is cacheable. | Optional, defaults to false. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naive question. Do we really need both session dependent and deterministic or can session dependent just be one particular case of non-deterministic? Will a system be able to optimize more efficiently knowing a function is deterministic but not session dependent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They play in different spaces/influence different optimization opportunities.

Deterministic describes whether or not a value could be reduced to a constant in some expressions. If it is non-deterministic, you must execute for each record. Select rand(), a,b,c from foo is a good example. You wouldn't expect rand() to be the same value on any record. If you did the same thing as add(1,1), you could replace with the literal 2.

Session dependent describes whether or not a plan's interpretation is impacted by session properties and thus means you can't share something like a plan or data cache between different sessions (without also keying session properties). A function that returns username is a prime example.

site/docs/expressions/scalar_functions.md Outdated Show resolved Hide resolved
Comment on lines +83 to +101
Fully defined with argument types:

* `type_parameter(string name) => type`
* `integer_parameter(string name) => integer`
* `not(boolean x) => boolean`
* `and(boolean a, boolean b) => boolean`
* `or(boolean a, boolean b) => boolean`
* `multiply(integer a, integer b) => integer`
* `divide(integer a, integer b) => integer`
* `add(integer a, integer b) => integer`
* `subtract(integer a, integer b) => integer`
* `min(integer a, integer b) => integer`
* `max(integer a, integer b) => integer`
* `equal(integer a, integer b) => boolean`
* `greater_than(integer a, integer b) => boolean`
* `less_than(integer a, integer b) => boolean`
* `covers(Type a, Type b) => boolean`Covers means that type b matches type A for as much as type B is defined. For example, if type A is `VARCHAR(20)` and type B is `VARCHAR(N)`, type B would be considered covering. Similarlily if type A was `List<Struct<a:f32, b:f32>>`and type B was `List<Struct<>>`, it would be considered covering. Note that this is directional "as in B covers A" or "B can be further enhanced to match the definition A".
* `if(boolean a) then (integer) else (integer)`
* `if(boolean a) then (type) else (type)`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the purpose of this section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It declares the argument inputs/outputs for the available operations. The lines 77-80 were to make it easy to get a gist. The lines 83-101 were trying to be full definitions. (That was the intention, anyway.)

site/docs/expressions/scalar_functions.md Outdated Show resolved Hide resolved
site/docs/expressions/scalar_functions.md Outdated Show resolved Hide resolved
site/docs/expressions/scalar_functions.md Outdated Show resolved Hide resolved
| Decimal Division | `divide(Decimal(P1,S1), Decimal(P2,S2)) => Decimal(P1 -S1 + S2 + MAX(6, S1 + P2 + 1), MAX(6, S1 + P2 + 1))` |
| Do regex on only string maps to return values. | `extract_values(Map<K,V>) => List<V> WHERE K IN [STRING, VARCHAR(N), FIXEDCHAR(N)]` |
| Concatenate two fixed sized character strings | `concat(FIXEDCHAR(A), FIXEDCHAR(B)) => CHAR(A+B)` |
| Make a struct of a set of fields and a struct definition. | `make_struct(<type> T, K...) => T` |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are the field names expressed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the type T

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of K... here? Does it mean every K[0] up to K[n - 1] are possibly distinct types? Or does it mean that every argument must be the same type K?

Also, what is the <type> T syntax? Isn't that covered by declaring type parameters in the function signature like make_struct<T, K>(T, K...)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of K... here?
I called this out elsewhere in the doc. In a variadic case, the last argument declaration can be either marked as consistent or inconsistent (I need to add this to the protobuf). If consistent, must resolve the same. If inconsistent, each instance can be a different value.

Also, what is the T syntax?

It's all pseudo syntax to help explain the concepts for the spec. We need to define an actual syntax.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 8 files reviewed, 38 unresolved discussions (waiting on @cpcloud, @emkornfield, @jacques-n, @mbasmanova, @rymurr, and @westonpace)


binary/function.proto, line 22 at r7 (raw file):

Previously, jacques-n (Jacques Nadeau) wrote…

Great question. I'm torn on this one. I actually had that in an earlier version. I removed it as it felt like an function semantics detail as opposed to something that the plan itself needs to express. What do you think a substrait producer or consumer could do with this additional information that they wouldn't have already from another source? (For example, I'd expect that engines would already know this when they were doing code generation against the plan based on properties of their particular implementation).

My biggest argument for it is that probably makes sense for any kind of common structured description of a function semantic to be expressed just to make things easier to understand/validate. For example, I also had a error behavior property original that could be: [overflow, nullify, throw]. It also felt like something outside the realm of substrait and so I removed it but it probably should be added back if we add a null-if-null concept.

It would certainly be useful in automated property-based testing of query engines but I agree that may not fall into goals here. I do worry a bit it would be a slippery slope. For example, in addition to your "error behavior" there is "all nulls -> null" (e.g. maybe some nulls doesn't guarantee null but all nulls does) and "transitive", "never errors", etc.


site/docs/expressions/scalar_functions.md, line 103 at r6 (raw file):

Previously, jacques-n (Jacques Nadeau) wrote…

any particular thoughts on the answer here @westonpace ?

I don't like it but I won't fight it :) I think I'd rather it be all angle brackets to avoid confusion with functions.

Co-authored-by: Weston Pace <weston.pace@gmail.com>
@jacques-n jacques-n merged commit eb2cc62 into substrait-io:main Oct 2, 2021
@jacques-n jacques-n deleted the function_updates branch October 2, 2021 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants