Propose update for simple output type expressions. #40

jacques-n · 2021-09-22T20:05:15Z

Proposal for simple type expressions to support type parameterization for output type derivation.

This change is

site/docs/expressions/scalar_functions.md

cpcloud · 2021-09-22T21:00:01Z

site/docs/expressions/scalar_functions.md

+
+### Direct Return Types
+
+A direct return type is one that is fully known at function definition type. Example simple direct return types would be things such as i32, fp32. For compound types, a direct return type must be fully declared. Example fully defined types: VARCHAR(20), DECIMAL(25,5)


A direct return type is one that is fully known at function definition type.

Should the last type be time?

Also, I find this language confusing. A function definition with a parameterized type is fully known at function definition time. It won't necessarily have been monomorphized at definition time, is that what you're referring to here? Maybe I'm being a bit pedantic but my thinking is that we shouldn't invent new words for things :)

Feel free to propose alternative language here.

What is function definition time? The point when the function is written down in the spec? Or the point when the query plan is created?

If the former, perhaps simply:

A direct return type is one that does not depend on any parameterized input type.

If the latter:

A direct return type is one that is not know until the query plan is materialized.

I assume from the context that we are talking about the former.

Yes, the former. I'll update in next patch.

site/docs/expressions/scalar_functions.md

cpcloud · 2021-09-23T12:32:31Z

site/docs/expressions/scalar_functions.md

+* `equal(Type a, Type b) => boolean`
+* `if(boolean a) then (integer) else (integer)`
+* `if(boolean a) then (type) else (type)`
+* `not_bindable()`


It looks like there's !bindable and not_bindable(). Are these the same thing?

Oh, is not_bindable() what is produced by the syntax !bindable?

None of this is syntax, just concepts. Syntax would be delegated to the serialization formats. In the abbreviated section I try to make it easy to understand what the ops are. The section below is blown out to better describe the parameters, etc.

cpcloud · 2021-09-23T12:36:15Z

site/docs/expressions/scalar_functions.md

+* `multiply(integer a, integer b) => integer`
+* `divide(integer a, integer b) => integer`
+* `add(integer a, integer b) => integer`
+* `substract(integer a, integer b) => integer`


I'm going to be that guy and suggest that we don't leave integer ambiguous here and give it a specific integer type. I'd vote for i64 since it's the widest type, supporting the most use cases. Hopefully this isn't an area of the spec where we need to be extremely concerned about integer sizes :)

Also, substract has an extra s in it.

I guess I could make some kind of witty joke that anything sub* in the substrait project must be subst* and that is why I mispelled... :D .fixed now.

Wrt to i64, I agree and actually already specified that just above.

westonpace

Partly I worry we are over-learning the decimal promotion case. My gut tells me that there will be more complex logic than we are going to be able to express in a simple type expression. That being said, I don't think they harm anything.

Or, to put it another way, we express the semantics of add (what should happen to the values) in prose. I think there will be cases where type derivation is complex enough that prose is the best we can do and that should be ok.

I'm not sure these changes are extensive enough to capture what I had in mind. I've made a PR to try and express my thoughts but even that is just a rough start. I've got a pretty good test case (I will describe more in discussions) for this language. I might run through things this weekend to see if I can create something that satisfies my test case.

site/docs/expressions/scalar_functions.md

westonpace · 2021-09-24T02:10:10Z

site/docs/expressions/scalar_functions.md

+| Add item to list                              | `add(<List<T>, T>) => List<T>`                               |
+| Decimal Division                              | `divide(Decimal(P1,S1), Decimal(P2,S2)) => Decimal(P1 -S1 + S2 + MAX(6, S1 + P2 + 1), MAX(6, S1 + P2 + 1))` |


Why is it List<T> (angle brackets) but Decimal(P1,S1) (parentheses). Is there a significance here? Or is it purely stylistic?

This is the common way these are expressed in SQL land. Brackets for type parameters. Parenthesis for integer parameters.

westonpace · 2021-09-24T02:16:19Z

binary/function.proto

-        oneof derivation_type {
-            Type direct = 9;
-            CustomDerivationPolicy derivation_policy = 10;
-        }
+        TypeExpression output_type = 9;

-        repeated Implementation implementations = 11;
+        repeated Implementation implementations = 10;


I'm not sure this is flexible enough to represent something like f(K,K) => boolean

I believe it will be. You would declare use of the same parameter name on each of the args and then a return type of boolean.

binary/function.proto

westonpace · 2021-09-24T12:21:44Z

(all syntax in this comment is pseduocode / for example purposes only)

I think one thing missing for me at the moment is the ability to specify a range of types for an input argument. For a simple scalar function example consider LIST_VALUE_LENGTH<T>(T)=>uint32 which accepts list arrays and returns the length of each element in the input array. In this case T is only valid for "lists".

Initially my thinking was that I didn't need type constraints because I could simply list out all possible permutations. This works well for the numeric kernels (e.g. ADD) because there are a finite number of numeric types. Unfortunately, there are infinitely many list types (e.g. lists of structs). Some possible ways to represent this:

Let an argument be a type expression
- E.g. LIST_VALUE_LENGTH<T>(List<T>)=>uint32
Allow type paremeters to take a "type condition expression" which evaluates to a boolean
- E.g. LIST_VALUE_LENGTH<T>(T)=>uint32 : where T satisfies IS_LIST(T)

jacques-n

(all syntax in this comment is pseduocode / for example purposes only)

I think the biggest thing missing for me at the moment is the ability to specify a range of types for an input argument. For a simple scalar function example consider LIST_VALUE_LENGTH(T)=>uint32 and returns the length of each element in the input array. In this case T is only valid for "lists".

Initially my thinking was that I didn't need type constraints because I could simply list out all possible permutations. This works well for the numeric kernels (e.g. ADD) because there are a finite number of numeric types. Unfortunately, there are infinitely many list types (e.g. lists of structs). Some possible ways to represent this:

Let an argument be a type expression
E.g. LIST_VALUE_LENGTH(List)=>uint32
Allow type paremeters to take a "type condition expression" which evaluates to a boolean
E.g. LIST_VALUE_LENGTH(T)=>uint32 : where T satisfies IS_LIST(T)

In my proposal, we allow type parameters in arguments and type expressions for output derivation. So in your example LIST_VALUE_LENGTH<T>(List<T>)=>uint32.

jacques-n · 2021-09-24T17:25:36Z

binary/function.proto

-        oneof derivation_type {
-            Type direct = 9;
-            CustomDerivationPolicy derivation_policy = 10;
-        }
+        TypeExpression output_type = 9;

-        repeated Implementation implementations = 11;
+        repeated Implementation implementations = 10;


I believe it will be. You would declare use of the same parameter name on each of the args and then a return type of boolean.

binary/function.proto

jacques-n · 2021-09-24T17:27:13Z

site/docs/expressions/scalar_functions.md

+
+### Direct Return Types
+
+A direct return type is one that is fully known at function definition type. Example simple direct return types would be things such as i32, fp32. For compound types, a direct return type must be fully declared. Example fully defined types: VARCHAR(20), DECIMAL(25,5)


Yes, the former. I'll update in next patch.

site/docs/expressions/scalar_functions.md

jacques-n · 2021-09-24T17:32:15Z

site/docs/expressions/scalar_functions.md

+| Add item to list                              | `add(<List<T>, T>) => List<T>`                               |
+| Decimal Division                              | `divide(Decimal(P1,S1), Decimal(P2,S2)) => Decimal(P1 -S1 + S2 + MAX(6, S1 + P2 + 1), MAX(6, S1 + P2 + 1))` |


This is the common way these are expressed in SQL land. Brackets for type parameters. Parenthesis for integer parameters.

cpcloud

Reviewed 2 of 3 files at r1, 1 of 1 files at r6.
Reviewable status: 2 of 3 files reviewed, 18 unresolved discussions (waiting on @cpcloud, @emkornfield, @jacques-n, @rymurr, and @westonpace)

binary/type.proto, line 37 at r6 (raw file):

        // A special type that can only be used for parameterized function arguments.
        // This allows a user to express things like List<T>
        Parameter parameter = 33;

Putting Parameter and TypeExpression here seems like it would allow folks to generate invalid IR, such as UDF that returns a type parameter or one that returns a type expression. Is that actually intended?

binary/type.proto, line 172 at r6 (raw file):

        oneof integer_type {
            int32 literal = 1;
            String parameter_name = 2;

Curious why this isn't a plain string

binary/type.proto, line 215 at r6 (raw file):

        enum OpType {
            UNKNOWN = 0;

Is UNKNOWN a useful variant? Isn't it already defined by simply being anything that isn't in the OpType enum?

site/docs/expressions/scalar_functions.md, line 58 at r1 (raw file):

Previously, jacques-n (Jacques Nadeau) wrote…

Yes, the former. I'll update in next patch.

Should we unify the language here and use "concrete type" for both input and output types?

site/docs/expressions/scalar_functions.md, line 10 at r6 (raw file):

| ------------------ | ------------------------------------------------------------ | ----------------------------------- |
| Name               | One or more user friendly utf8 strings that are used to reference this function in languages | At least one value is required.     |
| List of arguments  | Argument properties are defined below. Arguments can be fully defined, parameterized or wildcarded. See further details below. | Optional, defaults to niladic.      |

Should wildcarded still be here?

site/docs/expressions/scalar_functions.md, line 52 at r6 (raw file):

Input arguments can be declared in one of two ways: materialized or parameterized.

* Materialized Type: Materialized types are either simple types or compound types with all parameters fully defined (without any parameters). Examples include i32, fp32, VARCHAR(20), List&lt;fp32&gt;, etc.

How about concrete type instead of materialized? How about "Concrete types are types that have no unbound type parameters. Examples include ..."

mbasmanova · 2021-09-29T20:59:51Z

binary/function.proto

-            Type direct = 9;
-            CustomDerivationPolicy derivation_policy = 10;
-        }
+        TypeExpression output_type = 9;


It might be helpful to add a flag that says whether function has default null behavior, e.g. null in any input is guaranteed to produce a null result.

mbasmanova · 2021-09-29T21:02:12Z

binary/function.proto

-            Type direct = 9;
-            CustomDerivationPolicy derivation_policy = 10;
-        }
+        TypeExpression output_type = 9;


Any particular reason intermediate_type is a concrete type rather than TypeExpression?

Also, what is max_set? In general, how do you feel about adding comments to describe various fields here to help the reader?

mbasmanova · 2021-09-29T21:03:05Z

binary/type.proto

+        // This allows a user to express things like List<T>
+        Parameter parameter = 33;
+
+        // A special type that can only be used in function return type declarations.


What makes it special? Can you give an example?

mbasmanova · 2021-09-29T21:04:42Z

binary/type.proto

+
+    message IfElse {
+        TypeExpression if_condition = 1;
+        TypeExpression if_return = 2;


Is this the type of "then" expressions? Why is it not called then_return? Shouldn't the type of "then" and "else" expressions be the same? If so, why do we need both if_return and else_return?

jacques-n

Dismissed @cpcloud and @westonpace from 7 discussions.
Reviewable status: 2 of 3 files reviewed, 16 unresolved discussions (waiting on @cpcloud, @emkornfield, @jacques-n, @mbasmanova, @rymurr, and @westonpace)

binary/function.proto, line 22 at r7 (raw file):

Previously, mbasmanova (Maria Basmanova) wrote…

It might be helpful to add a flag that says whether function has default null behavior, e.g. null in any input is guaranteed to produce a null result.

Great question. I'm torn on this one. I actually had that in an earlier version. I removed it as it felt like an function semantics detail as opposed to something that the plan itself needs to express. What do you think a substrait producer or consumer could do with this additional information that they wouldn't have already from another source? (For example, I'd expect that engines would already know this when they were doing code generation against the plan based on properties of their particular implementation).

My biggest argument for it is that probably makes sense for any kind of common structured description of a function semantic to be expressed just to make things easier to understand/validate. For example, I also had a error behavior property original that could be: [overflow, nullify, throw]. It also felt like something outside the realm of substrait and so I removed it but it probably should be added back if we add a null-if-null concept.

binary/type.proto, line 37 at r6 (raw file):

Previously, cpcloud (Phillip Cloud) wrote…

Putting Parameter and TypeExpression here seems like it would allow folks to generate invalid IR, such as UDF that returns a type parameter or one that returns a type expression. Is that actually intended?

Yeah, I'm torn on this. The other option is to create separate idl object hierachies for the two other categories of types.

Concrete Type
Parameterized Type
Type Expression

Since you can embed parameters in the pieces of a concrete type and the same for type expression in parameterized types, I took the shorter path of just having a single representation that will then be validated later. The problems is things like List<Struct<foo:K, bar:string>>. Either we implement three impls of list and struct: one that only accepts concrete, one that accepts concrete + parameter and one that accepts concrete + parameter + type expression or we do this pattern.

As I wrote more about it, maybe it isn't so bad to just have three separate unions. I was think all types would need to be duplicated but there are only a small number of types that actually need duplication (the ones with type parameters). Definitely makes the idl safer/reduces the needs for additional validation. Further thoughts?

binary/type.proto, line 172 at r6 (raw file):

Previously, cpcloud (Phillip Cloud) wrote…

Curious why this isn't a plain string

Yeah, weird. I'm confused on how it even compiled :/

binary/type.proto, line 215 at r6 (raw file):

Previously, cpcloud (Phillip Cloud) wrote…

Is UNKNOWN a useful variant? Isn't it already defined by simply being anything that isn't in the OpType enum?

Protobuf3ism- unknown is the same as unset. Not useful directly but clarifies if the data was missing. If you use it for something proper you can't tell whether the field was forgotten or was set to that value.

binary/type.proto, line 39 at r7 (raw file):

Previously, mbasmanova (Maria Basmanova) wrote…

What makes it special? Can you give an example?

This is used in return types to do things like

fn(DECIMAL(P,S)) => DECIMAL(min(38,p+1))

Basically the output type is a calculation based on the input types.

binary/type.proto, line 192 at r7 (raw file):

Previously, mbasmanova (Maria Basmanova) wrote…

Is this the type of "then" expressions? Why is it not called then_return? Shouldn't the type of "then" and "else" expressions be the same? If so, why do we need both if_return and else_return?

I'm not sure I'm understanding the question/thought. An example here might be:

Imagine fn(DECIMAL(P, S)) => IF P < 18 THEN DECIMAL(18,S) ELSE DECIMAL(38,S).

Not a great example but the return type of the two branches are different. So in this case:

if_condition = P < 18
if_return = DECIMAL(18,S)
else_return = DECIMAL(38,S)

site/docs/expressions/scalar_functions.md, line 53 at r1 (raw file):

Previously, jacques-n (Jacques Nadeau) wrote…

I've updated to remove wildcards.

addressed in patch.

site/docs/expressions/scalar_functions.md, line 54 at r1 (raw file):

Previously, jacques-n (Jacques Nadeau) wrote…

Good point on the equal. I've updated to remove the distinction between wildcard and parameterized.

addressed in patch.

site/docs/expressions/scalar_functions.md, line 58 at r1 (raw file):

Previously, cpcloud (Phillip Cloud) wrote…

Should we unify the language here and use "concrete type" for both input and output types?

Will do.

site/docs/expressions/scalar_functions.md, line 10 at r6 (raw file):

Previously, cpcloud (Phillip Cloud) wrote…

Should wildcarded still be here?

will fix

site/docs/expressions/scalar_functions.md, line 52 at r6 (raw file):

Previously, cpcloud (Phillip Cloud) wrote…

How about concrete type instead of materialized? How about "Concrete types are types that have no unbound type parameters. Examples include ..."

sounds good, will update.

site/docs/expressions/scalar_functions.md, line 103 at r6 (raw file):

Previously, jacques-n (Jacques Nadeau) wrote…

This is the common way these are expressed in SQL land. Brackets for type parameters. Parenthesis for integer parameters.

any particular thoughts on the answer here @westonpace ?

jacques-n

Reviewable status: 2 of 3 files reviewed, 16 unresolved discussions (waiting on @cpcloud, @emkornfield, @jacques-n, @mbasmanova, @rymurr, and @westonpace)

binary/function.proto, line 36 at r7 (raw file):

Previously, mbasmanova (Maria Basmanova) wrote…

Any particular reason intermediate_type is a concrete type rather than TypeExpression?

Also, what is max_set? In general, how do you feel about adding comments to describe various fields here to help the reader?

You're right, intermediate type should be a type expression. Will update. max_set is the maximum number of values this function will accept. E.g. if you have LIST_AGG and state that you only allow up to 1000 items in each list, you'd set this 1000.

Yes, we should definitely add comments. It's currently pre-consensus so we've been focused more on the spec than the specific serialization (although I've been trying to keep the proto up to date so people can see how might be serialized). Once we get the core spec components better clarified, we plan on doing some consensus work around the serialization formats so I wasn't working super hard on making the proto consumable outside as a learning aid against the spec.

mbasmanova · 2021-09-29T21:46:22Z

Reviewable status: 2 of 3 files reviewed, 16 unresolved discussions (waiting on @cpcloud, @emkornfield, @jacques-n, @mbasmanova, @rymurr, and @westonpace)

binary/function.proto, line 36 at r7 (raw file):

Previously, mbasmanova (Maria Basmanova) wrote…
You're right, intermediate type should be a type expression. Will update. max_set is the maximum number of values this function will accept. E.g. if you have LIST_AGG and state that you only allow up to 1000 items in each list, you'd set this 1000.

Yes, we should definitely add comments. It's currently pre-consensus so we've been focused more on the spec than the specific serialization (although I've been trying to keep the proto up to date so people can see how might be serialized). Once we get the core spec components better clarified, we plan on doing some consensus work around the serialization formats so I wasn't working super hard on making the proto consumable outside as a learning aid against the spec.

Thank you for a quick reply. I was hoping to compile this into C++ and try to map to Velox's PlanNode representation to see what's missing if anything. Would that be possible at all at this stage?

jacques-n · 2021-09-29T21:56:26Z

Thank you for a quick reply. I was hoping to compile this into C++ and try to map to Velox's PlanNode representation to see what's missing if anything. Would that be possible at all at this stage?

It should be possible. I'd expect most queries to be expressible. It would also be a lot of help to validate the design, find weaknesses, etc. I'm sure there are things missing and we're still changing things but for sure worth doing. FYI, I'm exploring doing the other side: using Calcite to produce a substrait plan. It really helps to exercise/better evaluate design decisions.

jacques-n · 2021-09-29T22:42:55Z

Thank you for a quick reply. I was hoping to compile this into C++ and try to map to Velox's PlanNode representation to see what's missing if anything. Would that be possible at all at this stage?

It should be possible. I'd expect most queries to be expressible. It would also be a lot of help to validate the design, find weaknesses, etc. I'm sure there are things missing and we're still changing things but for sure worth doing. FYI, I'm exploring doing the other side: using Calcite to produce a substrait plan. It really helps to exercise/better evaluate design decisions.

I just remembered that the proto sketch for relations is pretty minimal right now. (The spec is more complete.) I'll add some more sketches there.

Update type system to add bounds Separate type, parameterized type and type expressions to increase clarity of use. Define a better centralized structure for extensions and use for types and functions. Also add nullability to types and basic concept of a "plan".

westonpace

This is a clear improvement in my mind but I don't know that it solves everything just yet. I didn't review the proto files very much as I'd rather review that with actual usage.

site/docs/expressions/scalar_functions.md

westonpace · 2021-09-30T00:09:36Z

site/docs/expressions/scalar_functions.md

-Substrait supports a number of functions. A scalar function signature includes the following properties:
+| Property               | Description                                                  | Required                            |
+| ---------------------- | ------------------------------------------------------------ | ----------------------------------- |
+| Name                   | One or more user friendly utf8 strings that are used to reference this function in languages | At least one value is required.     |


What does it mean if there are multiple names? Are they aliases of each other?

Yeah, they are aliases.

site/docs/expressions/scalar_functions.md

westonpace · 2021-09-30T00:15:25Z

site/docs/expressions/scalar_functions.md

+| Deterministic          | Whether this function is expected to reproduce the same output when it is invoked multiple times with the same input. This informs a plan consumer on whether it can constant reduce the defined function. An example would be a random() function, which is typically expected to be evaluated repeatedly despite having the same set of inputs. | Optional, defaults to true.         |
+| Session Dependent      | Whether this function is influenced by the session context it is invoked within. For example, a function may be influenced by a user who is invoking the function, the time zone of a session, or some other non-obvious parameter. This can inform caching systems on whether a particular function is cacheable. | Optional, defaults to false.        |


Naive question. Do we really need both session dependent and deterministic or can session dependent just be one particular case of non-deterministic? Will a system be able to optimize more efficiently knowing a function is deterministic but not session dependent?

They play in different spaces/influence different optimization opportunities.

Deterministic describes whether or not a value could be reduced to a constant in some expressions. If it is non-deterministic, you must execute for each record. Select rand(), a,b,c from foo is a good example. You wouldn't expect rand() to be the same value on any record. If you did the same thing as add(1,1), you could replace with the literal 2.

Session dependent describes whether or not a plan's interpretation is impacted by session properties and thus means you can't share something like a plan or data cache between different sessions (without also keying session properties). A function that returns username is a prime example.

site/docs/expressions/scalar_functions.md

westonpace · 2021-09-30T00:52:15Z

site/docs/expressions/scalar_functions.md

+Fully defined with argument types:
+
+* `type_parameter(string name) => type`
+* `integer_parameter(string name) => integer`
+* `not(boolean x) => boolean` 
+* `and(boolean a, boolean b) => boolean` 
+* `or(boolean a, boolean b) => boolean` 
+* `multiply(integer a, integer b) => integer`
+* `divide(integer a, integer b) => integer`
+* `add(integer a, integer b) => integer`
+* `subtract(integer a, integer b) => integer`
+* `min(integer a, integer b) => integer`
+* `max(integer a, integer b) => integer`
+* `equal(integer a, integer b) => boolean`
+* `greater_than(integer a, integer b) => boolean`
+* `less_than(integer a, integer b) => boolean`
+* `covers(Type a, Type b) => boolean`Covers means that type b matches type A for as much as type B is defined. For example, if type A is `VARCHAR(20)` and type B is `VARCHAR(N)`, type B would be considered covering. Similarlily if type A was `List<Struct<a:f32, b:f32>>`and type B was `List<Struct<>>`, it would be considered covering. Note that this is directional "as in B covers A" or "B can be further enhanced to match the definition A". 
+* `if(boolean a) then (integer) else (integer)`
+* `if(boolean a) then (type) else (type)`


I don't understand the purpose of this section.

It declares the argument inputs/outputs for the available operations. The lines 77-80 were to make it easy to get a gist. The lines 83-101 were trying to be full definitions. (That was the intention, anyway.)

site/docs/expressions/scalar_functions.md

westonpace · 2021-09-30T00:56:16Z

site/docs/expressions/scalar_functions.md

+| Decimal Division                                          | `divide(Decimal(P1,S1), Decimal(P2,S2)) => Decimal(P1 -S1 + S2 + MAX(6, S1 + P2 + 1), MAX(6, S1 + P2 + 1))` |
+| Do regex on only string maps to return values.            | `extract_values(Map<K,V>) => List<V> WHERE K IN [STRING, VARCHAR(N), FIXEDCHAR(N)]` |
+| Concatenate two fixed sized character strings             | `concat(FIXEDCHAR(A), FIXEDCHAR(B)) => CHAR(A+B)`            |
+| Make a struct of a set of fields and a struct definition. | `make_struct(<type> T, K...) => T`                           |


Where are the field names expressed?

In the type T

What is the meaning of K... here? Does it mean every K[0] up to K[n - 1] are possibly distinct types? Or does it mean that every argument must be the same type K?

Also, what is the <type> T syntax? Isn't that covered by declaring type parameters in the function signature like make_struct<T, K>(T, K...)?

What is the meaning of K... here?
I called this out elsewhere in the doc. In a variadic case, the last argument declaration can be either marked as consistent or inconsistent (I need to add this to the protobuf). If consistent, must resolve the same. If inconsistent, each instance can be a different value.

Also, what is the T syntax?

It's all pseudo syntax to help explain the concepts for the spec. We need to define an actual syntax.

westonpace

Reviewable status: 0 of 8 files reviewed, 38 unresolved discussions (waiting on @cpcloud, @emkornfield, @jacques-n, @mbasmanova, @rymurr, and @westonpace)

binary/function.proto, line 22 at r7 (raw file):

Previously, jacques-n (Jacques Nadeau) wrote…

Great question. I'm torn on this one. I actually had that in an earlier version. I removed it as it felt like an function semantics detail as opposed to something that the plan itself needs to express. What do you think a substrait producer or consumer could do with this additional information that they wouldn't have already from another source? (For example, I'd expect that engines would already know this when they were doing code generation against the plan based on properties of their particular implementation).

My biggest argument for it is that probably makes sense for any kind of common structured description of a function semantic to be expressed just to make things easier to understand/validate. For example, I also had a error behavior property original that could be: [overflow, nullify, throw]. It also felt like something outside the realm of substrait and so I removed it but it probably should be added back if we add a null-if-null concept.

It would certainly be useful in automated property-based testing of query engines but I agree that may not fall into goals here. I do worry a bit it would be a slippery slope. For example, in addition to your "error behavior" there is "all nulls -> null" (e.g. maybe some nulls doesn't guarantee null but all nulls does) and "transitive", "never errors", etc.

site/docs/expressions/scalar_functions.md, line 103 at r6 (raw file):

Previously, jacques-n (Jacques Nadeau) wrote…

any particular thoughts on the answer here @westonpace ?

I don't like it but I won't fight it :) I think I'd rather it be all angle brackets to avoid confusion with functions.

Co-authored-by: Weston Pace <weston.pace@gmail.com>

Propose update for simple output type expressions.

66c9771

jacques-n requested review from westonpace, cpcloud, emkornfield and rymurr September 22, 2021 20:05

cpcloud reviewed Sep 22, 2021

View reviewed changes

site/docs/expressions/scalar_functions.md Outdated Show resolved Hide resolved

cpcloud reviewed Sep 22, 2021

View reviewed changes

site/docs/expressions/scalar_functions.md Outdated Show resolved Hide resolved

cpcloud reviewed Sep 22, 2021

View reviewed changes

site/docs/expressions/scalar_functions.md Outdated Show resolved Hide resolved

cpcloud reviewed Sep 22, 2021

View reviewed changes

site/docs/expressions/scalar_functions.md Outdated Show resolved Hide resolved

cpcloud reviewed Sep 22, 2021

View reviewed changes

site/docs/expressions/scalar_functions.md Outdated Show resolved Hide resolved

cpcloud reviewed Sep 22, 2021

View reviewed changes

site/docs/expressions/scalar_functions.md Outdated Show resolved Hide resolved

jacques-n added 4 commits September 22, 2021 14:26

minor fixes

98325cd

Add more complex condition.

4145cb3

double to fix

05d7d15

Remove concept of wildcards.

f996e58

cpcloud reviewed Sep 23, 2021

View reviewed changes

Fix subtract mispelling.

242e105

westonpace requested changes Sep 24, 2021

View reviewed changes

jacques-n commented Sep 24, 2021

View reviewed changes

jacques-n added 2 commits September 28, 2021 20:06

Update post discussion about general concepts. Proto still to come

497de43

further refinements

d72320c

cpcloud reviewed Sep 29, 2021

View reviewed changes

mbasmanova reviewed Sep 29, 2021

View reviewed changes

jacques-n commented Sep 29, 2021

View reviewed changes

westonpace previously approved these changes Sep 30, 2021

View reviewed changes

westonpace reviewed Sep 30, 2021

View reviewed changes

Apply suggestions from code review

59d44ff

Co-authored-by: Weston Pace <weston.pace@gmail.com>

jacques-n dismissed westonpace’s stale review via 59d44ff September 30, 2021 17:58

Add proto support for variadic behavior with parameterized types.

0aa2db1

jacques-n merged commit eb2cc62 into substrait-io:main Oct 2, 2021

jacques-n deleted the function_updates branch October 2, 2021 00:10


		### Direct Return Types

		A direct return type is one that is fully known at function definition type. Example simple direct return types would be things such as i32, fp32. For compound types, a direct return type must be fully declared. Example fully defined types: VARCHAR(20), DECIMAL(25,5)

		\| Add item to list \| `add(<List<T>, T>) => List<T>` \|
		\| Decimal Division \| `divide(Decimal(P1,S1), Decimal(P2,S2)) => Decimal(P1 -S1 + S2 + MAX(6, S1 + P2 + 1), MAX(6, S1 + P2 + 1))` \|

		\| Deterministic \| Whether this function is expected to reproduce the same output when it is invoked multiple times with the same input. This informs a plan consumer on whether it can constant reduce the defined function. An example would be a random() function, which is typically expected to be evaluated repeatedly despite having the same set of inputs. \| Optional, defaults to true. \|
		\| Session Dependent \| Whether this function is influenced by the session context it is invoked within. For example, a function may be influenced by a user who is invoking the function, the time zone of a session, or some other non-obvious parameter. This can inform caching systems on whether a particular function is cacheable. \| Optional, defaults to false. \|

Propose update for simple output type expressions. #40

Propose update for simple output type expressions. #40

Conversation

jacques-n commented Sep 22, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

westonpace Sep 24, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cpcloud Sep 23, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

westonpace left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

westonpace commented Sep 24, 2021 • edited Loading

jacques-n left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cpcloud left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jacques-n left a comment

Choose a reason for hiding this comment

jacques-n left a comment

Choose a reason for hiding this comment

mbasmanova commented Sep 29, 2021

jacques-n commented Sep 29, 2021

jacques-n commented Sep 29, 2021

westonpace left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

westonpace left a comment

Choose a reason for hiding this comment

jacques-n commented Sep 22, 2021 •

edited

Loading

westonpace Sep 24, 2021 •

edited

Loading

cpcloud Sep 23, 2021 •

edited

Loading

westonpace left a comment •

edited

Loading

westonpace commented Sep 24, 2021 •

edited

Loading

jacques-n left a comment •

edited

Loading