StatefulFunctions #340

Open
alamb opened this issue May 14, 2021 · 1 comment
Labels
datafusion (Changes in the datafusion crate), enhancement (New feature or request)

Comments

alamb (Contributor) commented May 14, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
On a PR that added what Postgres would term a stable function (something that is not the same from transaction to transaction, but is not a function of its inputs either), namely now(), @jorgecarleitao suggested adding the concept of a StatefulFunction for functions that need state, unlike ScalarFunction, which is designed to be stateless.

There is a lot of discussion on #288 (comment), and I will try to summarize much of it:

@jorgecarleitao :

AFAIK current_* are all derived from now; imo the differentiator aspect here is that there is some state X that is being shared.

It seems to me that the use-case here is that we want to preserve state across nodes, so that their execution depends on said state. NOW is an example, but in reality, random is also an example; we "cheated" a bit by not allowing users to select a seed. If they want that, we hit the same problem as NOW.

IMO a natural construct here is something like struct StatefulFunction<T: Send + Sync>, where T is the state, and Arc is inside of it, and that implements PhysicalExpr. During planning, the initial state is passed to it from the planner, and we are ready to fly.
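A minimal sketch of what such a construct might look like, under stated assumptions: `SimpleExpr` is a hypothetical, much-reduced stand-in for DataFusion's `PhysicalExpr` (it returns plain `i64` values rather than Arrow arrays), and `QueryState`, `capture_query_state`, and `now_expr` are illustrative names that do not appear in the discussion.

```rust
use std::sync::Arc;
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical, simplified stand-in for `PhysicalExpr`:
/// an expression yields one i64 per input row.
pub trait SimpleExpr: Send + Sync {
    fn evaluate(&self, num_rows: usize) -> Vec<i64>;
}

/// A function whose result depends on shared state `T`
/// rather than only on its arguments.
pub struct StatefulFunction<T: Send + Sync> {
    state: Arc<T>,
    func: fn(&T) -> i64,
}

impl<T: Send + Sync> StatefulFunction<T> {
    /// The planner would pass in the initial state here.
    pub fn new(state: Arc<T>, func: fn(&T) -> i64) -> Self {
        Self { state, func }
    }
}

impl<T: Send + Sync> SimpleExpr for StatefulFunction<T> {
    fn evaluate(&self, num_rows: usize) -> Vec<i64> {
        // Compute the value once from the shared state and repeat it per row.
        let value = (self.func)(self.state.as_ref());
        vec![value; num_rows]
    }
}

/// Hypothetical per-query state captured once during planning.
pub struct QueryState {
    pub start_nanos: i64,
}

/// Reads the wall clock once; every expression sharing the result sees the same value.
pub fn capture_query_state() -> QueryState {
    let start_nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos() as i64)
        .unwrap_or(0);
    QueryState { start_nanos }
}

/// Builds a `now()`-like expression from the shared query state.
pub fn now_expr(state: Arc<QueryState>) -> StatefulFunction<QueryState> {
    StatefulFunction::new(state, |s: &QueryState| s.start_nanos)
}
```

In this sketch, two expressions built with `now_expr` from clones of the same `Arc<QueryState>` return the same timestamp for every row, which is the sharing behaviour described above. A seeded `random` could plausibly follow the same shape with the seed as the state.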

The ScalarFunction construct was meant to be stateless because that makes it very easy to develop, and it also makes it obvious that it is stateless. Trying to couple execution state to them is imo going beyond its scope.

@returnString

In Postgres, this sort of corresponds to the function volatility categories (https://www.postgresql.org/docs/13/xfunc-volatility.html) which might be a useful basis for any future definition of different function types.

immutable: pure function, can only use arguments and internal constants (example: basic math ops). Optimiser can do lots here
stable: can refer to shared state but must return the same value for the same arguments within a given statement (example: now). Optimiser is allowed to unify all references into one call per unique set of arguments
volatile: no rules, no optimiser potential! Must always be evaluated exactly as initially planned (example: random)
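The three categories could be modelled roughly as follows. This is only a sketch: the enum, its helper methods, and the `volatility_of` lookup are hypothetical names chosen for illustration, and the function names in the lookup are just the examples mentioned above plus two basic math functions.

```rust
/// Sketch of the Postgres-style volatility categories.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Volatility {
    /// Pure: depends only on arguments and internal constants.
    Immutable,
    /// May read shared state, but returns the same value for the
    /// same arguments within a given statement.
    Stable,
    /// No guarantees: must be evaluated exactly as planned.
    Volatile,
}

impl Volatility {
    /// Whether calls with identical arguments inside one statement
    /// may be collapsed into a single call.
    pub fn can_unify_calls_within_statement(self) -> bool {
        matches!(self, Volatility::Immutable | Volatility::Stable)
    }

    /// Whether a call with constant arguments can be evaluated once
    /// at plan time, independent of any particular statement.
    pub fn can_constant_fold_at_plan_time(self) -> bool {
        matches!(self, Volatility::Immutable)
    }
}

/// Hypothetical registry lookup for a few example functions.
pub fn volatility_of(name: &str) -> Option<Volatility> {
    match name.to_ascii_lowercase().as_str() {
        "abs" | "sqrt" => Some(Volatility::Immutable),
        "now" => Some(Volatility::Stable),
        "random" => Some(Volatility::Volatile),
        _ => None,
    }
}
```

Under these assumptions, `now` may be unified within a statement but not folded at plan time, while `random` permits neither.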

...
Off the top of my head I think it'll open up some potential for generalised optimisation passes over function usage in queries according to function class, i.e. the optimiser rule used for the initial implementation of this PR but applicable to arbitrary functions provided they indicate themselves to be "stable".

cc @returnString @jorgecarleitao @msathis @Dandandan

@alamb alamb added the enhancement New feature or request label May 14, 2021
alamb (Contributor, Author) commented May 14, 2021

My personal take is that adding some way to mark a ScalarFunction as immutable, stable, or volatile would be valuable for query optimization (e.g. we could inline/fold immutable functions in logical plans, inline/fold stable functions in physical plans, and never inline volatile functions).
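A rough sketch of how that rule might look as a folding pass, assuming a toy expression tree over `i64` values. `Expr`, `Phase`, `may_fold`, `fold`, and the `eval` callback are all hypothetical and are not DataFusion APIs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Volatility {
    Immutable,
    Stable,
    Volatile,
}

/// Which planning stage the pass runs in.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Logical,
    Physical,
}

/// Toy expression tree (hypothetical; integers only).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Literal(i64),
    Column(String),
    Call {
        name: String,
        volatility: Volatility,
        args: Vec<Expr>,
    },
}

/// Immutable functions fold in either phase, stable functions only in
/// the physical phase, volatile functions never.
pub fn may_fold(volatility: Volatility, phase: Phase) -> bool {
    match (volatility, phase) {
        (Volatility::Immutable, _) => true,
        (Volatility::Stable, Phase::Physical) => true,
        _ => false,
    }
}

/// Folds calls whose arguments are all literals, where the rule allows it.
/// `eval` is an assumed callback that evaluates a named function.
pub fn fold(expr: Expr, phase: Phase, eval: &dyn Fn(&str, &[i64]) -> Option<i64>) -> Expr {
    match expr {
        Expr::Call { name, volatility, args } => {
            // Fold the arguments first so nested calls can collapse.
            let args: Vec<Expr> = args.into_iter().map(|a| fold(a, phase, eval)).collect();
            // `Some(values)` only if every argument is now a literal.
            let literals: Option<Vec<i64>> = args
                .iter()
                .map(|a| match a {
                    Expr::Literal(v) => Some(*v),
                    _ => None,
                })
                .collect();
            if may_fold(volatility, phase) {
                if let Some(value) = literals.and_then(|vals| eval(name.as_str(), vals.as_slice())) {
                    return Expr::Literal(value);
                }
            }
            Expr::Call { name, volatility, args }
        }
        other => other,
    }
}
```

With this sketch, a stable call such as `now()` is left untouched when `fold` runs with `Phase::Logical` and is replaced by a literal with `Phase::Physical`, while a volatile call is preserved in both phases.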

@alamb alamb added the datafusion Changes in the datafusion crate label May 14, 2021