What are "operations" in the vision statement #7

westonpace · 2021-09-08T22:40:15Z

westonpace
Sep 8, 2021
Maintainer

The vision statement currently reads...

Create a well-defined, cross-language specification for data compute operations. This includes a declaration of common operations, custom operations and one or more serialized representations of this specification. The spec focuses on the semantics of each operation and a consistent way to describe.

What is an "operation"? For example:

Option 1: operation => relational algebra operator

In this definition an operation would be something like "select", "project", "union", "order by", "udf", etc. Given such a definition, you could even envision a specification that doesn't have any type system at all. A type is just a tuple of (unique identifier, is_sortable, is_hashable). Cast would simply be another udf. The unique identifier could include physical representation, if needed by an implementation. This would be a smaller and simpler vision.
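To make Option 1 concrete, here is a minimal sketch of what a spec with no real type system might carry around. Every name below (`OpaqueType`, `Udf`, the example type identifiers) is hypothetical and chosen for illustration only; nothing here comes from the spec itself.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class OpaqueType:
    """A type as Option 1 sees it: an identifier plus two capability flags."""
    unique_id: str      # could encode physical representation if an engine needs it
    is_sortable: bool   # may appear in an "order by"
    is_hashable: bool   # may be used as a hash key (e.g. for grouping)


@dataclass(frozen=True)
class Udf:
    """Any function, including cast, is just an opaque name with a signature."""
    name: str
    arg_types: Tuple[OpaqueType, ...]
    return_type: OpaqueType


def can_order_by(t: OpaqueType) -> bool:
    # The relational operator only consults the capability flag.
    return t.is_sortable


# Hypothetical type identifiers, for illustration only.
INT64 = OpaqueType("int64", is_sortable=True, is_hashable=True)
STRING = OpaqueType("string", is_sortable=True, is_hashable=True)
BLOB = OpaqueType("blob", is_sortable=False, is_hashable=False)

# Cast is not special: it is one more udf from INT64 to STRING.
cast_int64_to_string = Udf("cast", (INT64,), STRING)
```

Under this sketch the spec never has to say what `cast` does to the values, only that it exists and what it maps between.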

Option 2: operation => relational algebra operator OR function

In this definition operations would include things like "add", "upper", "ucase", "lcase", "round". In such a definition you would need a type system because you would need to be able to formally define the semantics of an operation. What behavior happens if integer addition overflows? What expectations are there for the precision of floating point operations? Does F & null => null or F? Can an interval be multiplied by an integer or floating point number? Option 2 has a much broader vision.
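As a small illustration of why floating point precision would need a stated expectation, the sketch below adds the same three values in two different orders. It uses plain Python floats (IEEE 754 doubles) and is not tied to any particular engine.

```python
import math

# Two engines that add the same three values in a different order.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The two results are not bit-for-bit equal ...
results_identical = left_to_right == right_to_left

# ... but they agree within a small relative tolerance.
results_close = math.isclose(left_to_right, right_to_left, rel_tol=1e-12)
```

A spec that covers "add" on floating point values has to decide whether consumers must match exactly or only within some tolerance.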

Options 3+: ???

My interpretation is that this spec is aimed at option 2 but I wanted to clarify because it also might be somewhere in between options 1 and 2 and I think a very clear idea of the scope is going to be important.

jacques-n · 2021-09-08T23:04:11Z

jacques-n
Sep 8, 2021
Maintainer

Thanks for bringing this up!

My intention was definitely option 2+. The plus is because function is pretty generic and some kinds of things (say batch level columnar transformations) are included within what I'd like to cover but I don't think most people will think of them when they initially say "function". A good example is when working with geo, a contains operation may actually take a set, build an S2 geometry version of the data and do contains operations using that. It's kind of a function except that it has some kind of intermediate set-level state. I think that's a type of operation that should also be covered.
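A rough sketch of that "function with set-level state" shape, assuming axis-aligned bounding boxes as a stand-in for real S2 geometry. The class name `ContainsAny` and its two-phase layout are hypothetical and only meant to show the build-once, probe-per-row pattern.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


class ContainsAny:
    """Hypothetical operation that keeps intermediate set-level state."""

    def __init__(self, regions: Iterable[Box]) -> None:
        # Build phase: runs once over the whole set. A real engine might
        # build an S2 index here; this sketch only stores the boxes.
        self._regions: List[Box] = list(regions)

    def __call__(self, point: Point) -> bool:
        # Probe phase: looks like an ordinary per-row function call.
        x, y = point
        return any(
            min_x <= x <= max_x and min_y <= y <= max_y
            for min_x, min_y, max_x, max_y in self._regions
        )


contains = ContainsAny([(0, 0, 1, 1), (5, 5, 6, 6)])
results = [contains(p) for p in [(0.5, 0.5), (3, 3), (5.5, 6)]]
```

From the caller's side the probe is a plain function of one row, yet its result depends on state built from an entire set.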

Do you have some proposed language to help clarify things?

0 replies

westonpace · 2021-09-09T00:38:22Z

westonpace
Sep 9, 2021
Maintainer Author

> Do you have some proposed language to help clarify things?

No, and it probably doesn't have to be solved in the vision statement itself as long as it is expanded somewhere else on the site. Perhaps it could be covered by another use case. Something like one of...

Allow a plan to be executed against different execution engines without altering the semantics.
Reduce incompatibilities between execution engines by agreeing on a common semantic interpretation of functions, operators, and other transformations.

7 replies

cpcloud Sep 13, 2021
Maintainer

> Reduce incompatibilities between execution engines by agreeing on a common semantic interpretation of functions, operators, and other transformations.

Is this referring to a common semantic interpretation of specific functions, or one of the structure of functions, operators and transformations themselves?

jacques-n Sep 13, 2021
Maintainer

I think we should primarily be focused on the first but I believe that naturally leads to us at least defining a common vocabulary for the second (although not necessarily a formal specification). For example we may define a scalar function as add(i64, i64) => i64. In so doing we define not only the specific function but also need to define the abstract concept of a "scalar function". To me the goal is really to define the specific things and only define as many general concepts as are helpful to define the specific things.

westonpace Sep 13, 2021
Maintainer Author

I was thinking about facts like...

(function) Should AND(False, Null) be False or Null?
(function) What behavior should be expected if an ADD overflows (promote, raise, overflow predictably, undefined)?
(operator) Should a UNION operator remove duplicates (SQL UNION) or not (SQL UNION ALL)?
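To show how far apart the candidate answers are, here is a small sketch of two of those choices, with `None` standing in for Null. The function names are made up for illustration. SQL itself uses three-valued (Kleene) logic for AND, so `FALSE AND NULL` is `FALSE` there.

```python
from typing import Hashable, Iterable, List, Optional


def and_kleene(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Three-valued AND: a definite False wins over an unknown."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def and_null_propagating(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Alternative semantics: any Null input makes the result Null."""
    if a is None or b is None:
        return None
    return a and b


def union_all(left: Iterable[Hashable], right: Iterable[Hashable]) -> List[Hashable]:
    """Keeps duplicates, like SQL UNION ALL."""
    return list(left) + list(right)


def union_distinct(left: Iterable[Hashable], right: Iterable[Hashable]) -> List[Hashable]:
    """Removes duplicates, like SQL UNION (first occurrence kept)."""
    seen = set()
    out = []
    for row in union_all(left, right):
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out
```

Both versions of each function are reasonable on their own; a plan only has a fixed meaning once the spec says which one a given name refers to.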

As for general concepts like "scalar function" I'm not sure. It isn't what I was thinking. It could help to define the logical concept of a "scalar function" to group semantic properties. For example, rather than writing "if f([x_n]) => f([y_n]) then f([x_1, x_2]) => f([y_1, y_2])" for add, multiply, divide, etc., we could define the logical concept "scalar function" and then say add, multiply, divide, etc. are "scalar functions". If a consumer never had anything called "scalar function" in its implementation that would be perfectly fine (and "scalar function" need not be a discoverable property in a function registry, if such a thing exists).
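One way to read that property is that a scalar function gives the same answer whether it sees a column whole or in pieces. The sketch below checks this for `add` and shows a running sum as a counterexample. The helper names (`apply_scalar`, `running_sum`) are hypothetical.

```python
from typing import Callable, List, Sequence


def apply_scalar(f: Callable[..., int], *columns: Sequence[int]) -> List[int]:
    """Apply f row by row; each output depends only on its own input row."""
    return [f(*row) for row in zip(*columns)]


def add(a: int, b: int) -> int:
    return a + b


def running_sum(column: Sequence[int]) -> List[int]:
    """Not a scalar function: each output depends on all earlier rows."""
    total = 0
    out = []
    for value in column:
        total += value
        out.append(total)
    return out


xs = [1, 2, 3, 4]
ys = [10, 20, 30, 40]

# Scalar function: splitting the batch and concatenating changes nothing.
whole = apply_scalar(add, xs, ys)
pieces = apply_scalar(add, xs[:2], ys[:2]) + apply_scalar(add, xs[2:], ys[2:])

# Running sum: the same split changes the answer.
running_whole = running_sum(xs)
running_pieces = running_sum(xs[:2]) + running_sum(xs[2:])
```

Stating the property once for the category avoids restating it for every function that has it.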

jacques-n Sep 14, 2021
Maintainer

> (function) Should AND(False, Null) be False or Null?
> (function) What behavior should be expected if an ADD overflows (promote, raise, overflow predictably, undefined)?
> (operator) Should a UNION operator remove duplicates (SQL UNION) or not (SQL UNION ALL)?

I believe you're saying these would be things that Substrait would specify. If so, I agree. Not that a user couldn't specify alternatives, but there would be clear semantics for the things defined in Substrait. That being said, there may be alternatives that are closely related. It is very likely, for example, that you might have two definitions of add(i64,i64), one which overflows and one which will error. A producer would decide which behavior they want and then pick the specific one that has that behavior. As of yet, I haven't sketched a formal relationship to achieve this in the spec. We should consider whether it makes sense to build this concept in as first class in the context of formalizing the functions section once we get there.
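As a sketch of what "two definitions of add(i64,i64)" could look like, here is one variant that wraps on overflow and one that raises. The variant keys `"wrap"` and `"error"` and the lookup table are hypothetical; the spec does not define such a mechanism at this point.

```python
I64_MIN = -(2 ** 63)
I64_MAX = 2 ** 63 - 1


def add_i64_wrapping(a: int, b: int) -> int:
    """Two's-complement wraparound into the signed 64-bit range."""
    return (a + b - I64_MIN) % (2 ** 64) + I64_MIN


def add_i64_checked(a: int, b: int) -> int:
    """Raises instead of returning an out-of-range value."""
    result = a + b
    if not I64_MIN <= result <= I64_MAX:
        raise OverflowError("add(i64, i64) overflowed")
    return result


# Hypothetical lookup: the producer names the behavior it wants and gets
# the specific definition that has it.
ADD_I64_VARIANTS = {
    "wrap": add_i64_wrapping,
    "error": add_i64_checked,
}

wrapped = ADD_I64_VARIANTS["wrap"](I64_MAX, 1)

try:
    ADD_I64_VARIANTS["error"](I64_MAX, 1)
    checked_raised = False
except OverflowError:
    checked_raised = True
```

Both variants agree on every input that does not overflow, which is what makes them "closely related" rather than unrelated functions.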

> As for general concepts like "scalar function" I'm not sure...

While I don't really care about the specific name, there are classes of functions that are allowed or disallowed in specific contexts. For example, I would expect that we would define rules that state one can use scalar functions in a join condition but cannot use aggregate functions in a join condition. The class or category of a function is important for the contexts where its semantics make sense.
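A minimal sketch of such a rule, assuming a function's class is recorded alongside its name. The registry contents and the helper `join_condition_violations` are hypothetical and exist only to illustrate the idea of context-dependent validity.

```python
from enum import Enum
from typing import Iterable, List


class FunctionClass(Enum):
    SCALAR = "scalar"
    AGGREGATE = "aggregate"


# Hypothetical registry mapping function names to their class.
FUNCTION_CLASS = {
    "add": FunctionClass.SCALAR,
    "equal": FunctionClass.SCALAR,
    "sum": FunctionClass.AGGREGATE,
}

# The rule from the comment above: only scalar functions in a join condition.
ALLOWED_IN_JOIN_CONDITION = {FunctionClass.SCALAR}


def join_condition_violations(function_names: Iterable[str]) -> List[str]:
    """Return the functions that may not appear in a join condition."""
    return [
        name
        for name in function_names
        if FUNCTION_CLASS[name] not in ALLOWED_IN_JOIN_CONDITION
    ]


valid_condition = join_condition_violations(["equal", "add"])
invalid_condition = join_condition_violations(["equal", "sum"])
```

The check never looks at what `sum` computes, only at the category it belongs to.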

westonpace Sep 15, 2021
Maintainer Author

Forgot to reply but yes, I agree with what you have said here. I did not mean to imply there is going to be a single simple definition. There may be many variations. Also, I see what you mean by "scalar function" now. That makes sense to me.
