Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(udf): support function that returns multiple columns #8644

Merged
merged 11 commits into from
Mar 27, 2023
Merged

Conversation

wangrunji0408
Copy link
Contributor

@wangrunji0408 wangrunji0408 commented Mar 20, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

This PR adds returning multiple columns support for both UDF and UDTF.

For scalar functions, the result type should be defined as struct. The python function can return a tuple:

@udf(input_types=['BINARY'], result_type='STRUCT<src_ip VARCHAR, dst_ip VARCHAR, src_port SMALLINT, dst_port SMALLINT>')
def extract_tcp_info(tcp_packet: bytes):
    # ...
    return src_addr, dst_addr, src_port, dst_port
# returns a single column when called in select list
dev=> select extract_tcp_info(E'\\x45000034a8a8400040065b8ac0a8000ec0a80001035d20b6d971b900000000080020200493310000020405b4' :: bytea);
          extract_tcp_info           
-------------------------------------
 (192.168.0.14,192.168.0.1,861,8374)
(1 row)

# returns multiple columns with `select ().*`
dev=> select (extract_tcp_info(E'\\x45000034a8a8400040065b8ac0a8000ec0a80001035d20b6d971b900000000080020200493310000020405b4' :: bytea)).*;
    src_ip    |   dst_ip    | src_port | dst_port 
--------------+-------------+----------+----------
 192.168.0.14 | 192.168.0.1 |      861 |     8374
(1 row)

For table functions:

@udtf(input_types=['BINARY'], result_types=['VARCHAR', 'VARCHAR', 'SMALLINT', 'SMALLINT'])
def extract_tcp_info(tcp_packet: bytes):
    # ...
    yield src_addr, dst_addr, src_port, dst_port
# returns a single column when called in select list
dev=> select extract_tcp_info(E'\\x45000034a8a8400040065b8ac0a8000ec0a80001035d20b6d971b900000000080020200493310000020405b4' :: bytea);
          extract_tcp_info           
-------------------------------------
 (192.168.0.14,192.168.0.1,861,8374)
(1 row)

# returns multiple columns when called in from-clause
dev=> select * from extract_tcp_info(E'\\x45000034a8a8400040065b8ac0a8000ec0a80001035d20b6d971b900000000080020200493310000020405b4' :: bytea);
    src_ip    |   dst_ip    | src_port | dst_port 
--------------+-------------+----------+----------
 192.168.0.14 | 192.168.0.1 |      861 |     8374
(1 row)

Some discussions:

  • Should extract_tcp_info be defined as a table function? or a scalar function that returns struct type?
  • Can the current implementation be called table function? or are they just set returning functions?

Checklist For Contributors

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • I have demonstrated that backward compatibility is not broken by breaking changes and created issues to track deprecated features to be removed in the future. (Please refer to the issue)
  • All checks passed in ./risedev check (or alias, ./risedev c)

Checklist For Reviewers

  • I have requested macro/micro-benchmarks as this PR can affect performance substantially, and the results are shown.

Documentation

Click here for Documentation

Types of user-facing changes

Please keep the types that apply to your changes, and remove the others.

  • SQL commands, functions, and operators

Release note

User-defined table functions can return multiple columns now.

Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
@github-actions github-actions bot added type/feature user-facing-changes Contains changes that are visible to users labels Mar 20, 2023
Signed-off-by: Runji Wang <wangrunji0408@163.com>
@codecov
Copy link

codecov bot commented Mar 20, 2023

Codecov Report

Merging #8644 (c26fcf2) into main (f111bfb) will decrease coverage by 0.01%.
The diff coverage is 54.76%.

@@            Coverage Diff             @@
##             main    #8644      +/-   ##
==========================================
- Coverage   71.23%   71.22%   -0.01%     
==========================================
  Files        1155     1155              
  Lines      191247   191271      +24     
==========================================
+ Hits       136231   136240       +9     
- Misses      55016    55031      +15     
Flag Coverage Δ
rust 71.22% <54.76%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/common/src/array/data_chunk.rs 95.71% <0.00%> (-1.22%) ⬇️
src/common/src/catalog/schema.rs 73.48% <0.00%> (-3.85%) ⬇️
src/expr/src/table_function/user_defined.rs 0.00% <0.00%> (ø)
src/sqlparser/src/parser.rs 90.80% <0.00%> (-0.07%) ⬇️
src/common/src/array/arrow.rs 63.88% <25.00%> (+0.05%) ⬆️
src/frontend/src/expr/table_function.rs 46.36% <50.00%> (ø)
src/frontend/src/binder/relation/mod.rs 74.69% <76.31%> (-0.23%) ⬇️
src/batch/src/executor/table_function.rs 72.97% <85.71%> (+0.24%) ⬆️
.../src/optimizer/plan_node/logical_table_function.rs 64.06% <87.50%> (-0.46%) ⬇️
src/common/src/array/struct_array.rs 87.21% <100.00%> (ø)

... and 7 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@st1page
Copy link
Contributor

st1page commented Mar 20, 2023

Should extract_tcp_info be defined as a table function? or a scalar function that returns struct type?

I think so, knowing the cardinality does not changes, the optimizer can optimize the Project better than ProjectSet. Maybe we should note this to the user in the doc.

Can the current implementation be called table function? or are they just set returning functions?

I think it is the set returning functions. Because our current table function implementation does not give a way to notice the input is over.

@xxchan
Copy link
Member

xxchan commented Mar 20, 2023

Can the current implementation be called table function? or are they just set returning functions?

I think it is the set returning functions. Because our current table function implementation does not give a way to notice the input is over.

How do you define "table function" here? In my dictionary, TF = SRF 😅

Signed-off-by: Runji Wang <wangrunji0408@163.com>
@st1page
Copy link
Contributor

st1page commented Mar 20, 2023

Can the current implementation be called table function? or are they just set returning functions?

I think it is the set returning functions. Because our current table function implementation does not give a way to notice the input is over.

How do you define "table function" here? In my dictionary, TF = SRF sweat_smile

just like,
https://www.postgresql.org/docs/7.3/xfunc-tablefunctions.html
https://cloud.google.com/bigquery/docs/reference/standard-sql/table-functions
and the "M to N" table function described by https://docs.snowflake.com/en/sql-reference/functions-table#functions-in-which-each-output-row-depends-upon-multiple-input-rows

in that case, the output rows can depend on any row of the input, so it might can only output the result when all the input comes. e,g, implement a table function that can sort the original input relation with some keys and assign the rank to a new column. it means that

  1. it is not friendly for our current UDF design, the udf processing can not be pipelined parallelized and the UDF server should buffer all input data.
  2. the function can only be defined in limited input and we must notice the function if the input has been over. So our streaming execution can not (maybe with a watermark in the future)

@xxchan
Copy link
Member

xxchan commented Mar 20, 2023

Latest PG doesn't have the term TF any more. Bigquery's TF looks the same as PG SRF to me. Therefore, I still think it's only an alias.

"M to N" is indeed a different case, but I think it's "1-N" TF vs "M-N" TF, instead of TF vs SRF.

(Correct me if I'm wrong.)

I sincerely hope to have unified term usages, bacause it confused me a lot. 🥹

@xxchan
Copy link
Member

xxchan commented Mar 20, 2023

Didn't think much about it, but "M-N" TF sounds just like OVER windowing?

Signed-off-by: Runji Wang <wangrunji0408@163.com>
@st1page
Copy link
Contributor

st1page commented Mar 20, 2023

I sincerely hope to have unified term usages, bacause it confused me a lot.

me too. so my starting point just is that because TF looks a vague concept but we all know what is SRF, maybe use SRF instead of TF is better?

@st1page
Copy link
Contributor

st1page commented Mar 20, 2023

Didn't think much about it, but "M-N" TF sounds just like OVER windowing?

I think no, at least the over window function will give the same number of output rows with the input 🤔 but the m-n TF do not have that restriction

@xxchan
Copy link
Member

xxchan commented Mar 20, 2023

TF looks a vague concept but we all know what is SRF, maybe use SRF instead of TF is better?

"we all know what is SRF" I don't think so:

  • I still need to verify what is SRF in PG from time to time. I can't confidently say I know what is SRF..
  • Literally speaking, isn't "Table (Valued) Function" exactly the same as "Set Returning Function"?

Initially I called it TF when implementing ProjectSet because it sounds slightly better for people without any background 🤪. And it seems more commonly used in other systems than PG. SRF seems not used outside PG.

But being consistent with PG might be an accepatable argument to me. So I'm OK if someone insists on that.


One nit BTW, shouldn't it be "Bag Returning" instead of "Set Returning"?? 🤣


Another point: PG's sytax is

CREATE FUNCTION RETURNS [rettype | TABLE]

, instead of RETURNS SET 🤔

https://www.postgresql.org/docs/current/sql-createfunction.html

But mentioned in the doc, there's also a RETURNS SETOF:

  • The SETOF modifier is a PostgreSQL extension. (is a rettype)

Considering this, "Table function" is a better name. Don't know whether the name "SRF" is a historical issue.

@xxchan
Copy link
Member

xxchan commented Mar 20, 2023

Didn't think much about it, but "M-N" TF sounds just like OVER windowing?

I think no, at least the over window function will give the same number of output rows with the input 🤔 but the m-n TF do not have that restriction

Do you have examples for m-n TF?

@xxchan
Copy link
Member

xxchan commented Mar 20, 2023

BTW, snowflake UDTF is only 1-n 🤔

A tabular function (UDTF) returns a tabular value for each input row. In the handler for a UDTF, you write methods that conform to an interface required by Snowflake.

https://docs.snowflake.com/en/sql-reference/udf-overview#scalar-and-tabular-functions

@xxchan
Copy link
Member

xxchan commented Mar 20, 2023

BTW, our discussion might be off-topic in the PR 🤣

@st1page
Copy link
Contributor

st1page commented Mar 20, 2023

well, let us maintain the Table Funtion name before we really need to implement the m-n table function

@st1page
Copy link
Contributor

st1page commented Mar 20, 2023

Do you have examples for m-n TF?

In fact, Any standard or not standard relational operator. after some investigation I think all system I know implement it with sql 🤔

Signed-off-by: Runji Wang <wangrunji0408@163.com>
@wangrunji0408 wangrunji0408 changed the title feat(udf): support table function that returns multiple columns feat(udf): support functions that returns multiple columns Mar 21, 2023
@wangrunji0408 wangrunji0408 changed the title feat(udf): support functions that returns multiple columns feat(udf): support function that returns multiple columns Mar 21, 2023
@@ -22,16 +24,25 @@ def gcd3(x: int, y: int, z: int) -> int:
return gcd(gcd(x, y), z)


@udf(input_types=['BINARY'], result_type='STRUCT<src_ip VARCHAR, dst_ip VARCHAR, src_port SMALLINT, dst_port SMALLINT>')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Is it possible to use Python3 type hints?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to infer SQL type from function signatures?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, just like how @dataclass do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea. It seems feasible.

I can think of one more benefit: a python function can be compatible with multiple SQL functions.
For example: def gcd(x: int, y: int): int can be used as SQL functions:

  • gcd(INT2, INT2) -> INT2
  • gcd(INT4, INT4) -> INT4
  • gcd(INT8, INT8) -> INT8
  • ...

@wangrunji0408 wangrunji0408 added this pull request to the merge queue Mar 27, 2023
Merged via the queue into main with commit 43c81b9 Mar 27, 2023
@wangrunji0408 wangrunji0408 deleted the wrj/udtf branch March 27, 2023 09:12
@CharlieSYH CharlieSYH added the 📖✓ Covered or will be covered in the user docs. label Apr 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/feature user-facing-changes Contains changes that are visible to users 📖✓ Covered or will be covered in the user docs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants