Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(udf): add initial support for WASM-based Rust UDF #14271

Merged
merged 35 commits into from
Jan 12, 2024
Merged

Conversation

wangrunji0408
Copy link
Contributor

@wangrunji0408 wangrunji0408 commented Dec 29, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

This PR adds an initial support for Rust UDF, based on @xxchan's #10910.

Please see src/expr/udf/README.md for usage and current limitations.

The core implementation of UDF is maintained in an independent crate arrow-udf. It depends on arrow-rs only and is intended to be shared by the Rust community. However, its API design is mostly inherited from our built-in functions and will be kept as consistent as possible. Example:

#[function("gcd(int, int) -> int")]
fn gcd(mut a: i32, mut b: i32) -> i32 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

On RW side, we provide two ways to create functions from WASM module:

  1. You can embed WASM binaries into SQL using base64 encoding. RW will upload the binary into object store with the path defined in the parameter wasm_storage_url.
CREATE FUNCTION gcd(int, int) RETURNS int LANGUAGE wasm
USING BASE64 '$encoded';
  1. You can also directly specify a path in object store, if you want to reuse a previous uploaded module or upload it by yourself.
CREATE FUNCTION gcd(int, int) RETURNS int LANGUAGE wasm
USING LINK 's3://bucket/path/to/wasm';

You may have noticed that AS <identifier> is no longer required. Because this time we define a unique identifier for each combination of function name and data types. These identifiers are base64 encoded as function symbols, and can be decoded by UDF runtime when loading WASM modules.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

Add experimental support for Rust UDF.

See src/expr/udf/README.md for a draft document.

xxchan and others added 17 commits October 31, 2023 10:09
commit b52a004
Author: xxchan <xxchan22f@gmail.com>
Date:   Thu Oct 26 14:25:18 2023 +0800

    update arrow-ipc

commit e94feeb
Author: xxchan <xxchan@users.noreply.github.com>
Date:   Thu Oct 26 06:21:34 2023 +0000

    Fix "cargo-hakari"

commit 08a5601
Merge: 56e6fc4 942e99d
Author: xxchan <xxchan22f@gmail.com>
Date:   Thu Oct 26 14:19:34 2023 +0800

    Merge branch 'main' into xxchan/wasm-udf

commit 942e99d
Author: Yufan Song <33971064+yufansong@users.noreply.github.com>
Date:   Wed Oct 25 22:10:31 2023 -0700

    fix(nats-connector): change stream into optional string, add replace stream name logic (#13024)

commit 90fb4a3
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Thu Oct 26 04:25:11 2023 +0000

    chore(deps): Bump comfy-table from 7.0.1 to 7.1.0 (#13049)

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit b724be7
Author: jinser <aimer@purejs.icu>
Date:   Thu Oct 26 00:26:15 2023 +0800

    feat: add `comment on` clause support (#12849)

    Co-authored-by: Richard Chien <stdrc@outlook.com>
    Co-authored-by: August <pin@singularity-data.com>

commit 7f791d6
Author: August <pin@singularity-data.com>
Date:   Wed Oct 25 20:29:16 2023 +0800

    feat: move model_v2 and model_migration into a separate crates (#13058)

commit 7f82929
Author: Noel Kwan <47273164+kwannoel@users.noreply.github.com>
Date:   Wed Oct 25 16:57:45 2023 +0800

    fix(meta): persist internal tables of `CREATE TABLE` (#13039)

commit 09a67ab
Author: Noel Kwan <47273164+kwannoel@users.noreply.github.com>
Date:   Wed Oct 25 16:49:08 2023 +0800

    fix: `WAIT` should return error if timeout (#13045)

commit e48547d
Author: Runji Wang <wangrunji0408@163.com>
Date:   Wed Oct 25 16:41:16 2023 +0800

    refactor(type): switch jsonb to flat representation (#12952)

    Signed-off-by: Runji Wang <wangrunji0408@163.com>

commit 56e6fc4
Author: xxchan <xxchan22f@gmail.com>
Date:   Wed Oct 25 15:33:36 2023 +0800

    fix merge issue

commit c644361
Merge: fcd6992 2d428b1
Author: xxchan <xxchan22f@gmail.com>
Date:   Wed Oct 25 15:23:44 2023 +0800

    Merge remote-tracking branch 'origin/main' into xxchan/wasm-udf

commit fcd6992
Author: xxchan <xxchan22f@gmail.com>
Date:   Wed Oct 25 14:28:53 2023 +0800

    fix s3 stuck

commit 21e9740
Author: xxchan <xxchan22f@gmail.com>
Date:   Wed Oct 25 12:47:24 2023 +0800

    Revert "fix s3 stuck (why?)"

    This reverts commit f19a6b4.

commit f19a6b4
Author: xxchan <xxchan22f@gmail.com>
Date:   Wed Sep 13 14:32:28 2023 +0800

    fix s3 stuck (why?)

commit 019f309
Author: xxchan <xxchan22f@gmail.com>
Date:   Tue Sep 12 15:29:52 2023 +0800

    ON_ERROR_STOP=1

commit 6e4ee3c
Author: xxchan <xxchan22f@gmail.com>
Date:   Tue Sep 12 15:09:58 2023 +0800

    generate-config

commit b63a1c3
Merge: 2b0cc96 53611bf
Author: xxchan <xxchan22f@gmail.com>
Date:   Tue Sep 12 14:53:10 2023 +0800

    Merge remote-tracking branch 'origin/main' into xxchan/wasm-udf

commit 2b0cc96
Author: xxchan <xxchan22f@gmail.com>
Date:   Sat Sep 9 23:49:43 2023 +0800

    fix conflicts

commit 6b13fe3
Author: xxchan <xxchan22f@gmail.com>
Date:   Sat Sep 9 23:35:50 2023 +0800

    update system param default

commit a273943
Merge: cc34bfe f649aa6
Author: xxchan <xxchan22f@gmail.com>
Date:   Sat Sep 9 23:33:38 2023 +0800

    Merge remote-tracking branch 'origin/main' into xxchan/wasm-udf

commit cc34bfe
Author: xxchan <xxchan22f@gmail.com>
Date:   Tue Aug 1 17:47:42 2023 +0200

    use count_char as the example

commit f913f63
Merge: 53bf8e0 2637dbd
Author: xxchan <xxchan22f@gmail.com>
Date:   Tue Aug 1 17:22:13 2023 +0200

    Merge branch 'main' into xxchan/wasm-udf

commit 53bf8e0
Author: xxchan <xxchan22f@gmail.com>
Date:   Mon Jul 31 14:20:07 2023 +0200

    minor update

commit 70cee42
Author: xxchan <xxchan22f@gmail.com>
Date:   Mon Jul 17 14:53:29 2023 +0200

    fix arrow_schema into -> try_into

commit a7d172d
Author: xxchan <xxchan22f@gmail.com>
Date:   Fri Jul 14 16:31:20 2023 +0200

    buf format

commit 43a3290
Author: xxchan <xxchan22f@gmail.com>
Date:   Thu Jul 13 23:04:16 2023 +0200

    add tinygo example & turn on wasi support

commit 61a4998
Author: xxchan <xxchan22f@gmail.com>
Date:   Wed Jul 12 11:40:56 2023 +0200

    cleanup

commit 165d4d9
Author: xxchan <xxchan22f@gmail.com>
Date:   Wed Jul 12 11:02:44 2023 +0200

    use object store to store wasm

commit 88979e4
Author: xxchan <xxchan22f@gmail.com>
Date:   Tue Jul 11 15:32:52 2023 +0200

    add wasm_storage_url system param

commit a897320
Author: xxchan <xxchan22f@gmail.com>
Date:   Thu Jul 6 20:04:45 2023 +0200

    Load compiled wasm module in expr 🚀🚀🚀

commit 63b3523
Author: xxchan <xxchan22f@gmail.com>
Date:   Sun Jul 2 19:27:22 2023 +0200

    it works (although very slow)
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
@@ -182,3 +182,4 @@ backup_storage_url = "memory"
backup_storage_directory = "backup"
max_concurrent_creating_streaming_jobs = 1
pause_on_next_bootstrap = false
wasm_storage_url = "fs://.risingwave/data"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what the default path should be in a production environment. 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default to none?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path .risingwave/data should be specific to RiseDev. 🤔 Maybe you want to set it with risedev d?

Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Comment on lines +252 to +253
/// Runtimes returned by this function are cached inside for at least 60 seconds.
/// Later calls with the same link will reuse the same runtime.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we essentially trying to reuse the same runtime for different actors/tasks for the same job? BTW, what's the benefits or overheads for reusing or not?

Copy link
Contributor Author

@wangrunji0408 wangrunji0408 Dec 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The runtime supports parallel calls from multiple threads. A WASM instance can only be called from a single thread (for now, multi-thread support is on the way), but we can use an instance pool to share and recycle instances across threads. The benefits of reuse are primarily saving memory, reducing the overhead of duplicate JITs and repeated downloads from object store.

src/expr/core/src/expr/expr_udf.rs Outdated Show resolved Hide resolved
Signed-off-by: Runji Wang <wangrunji0408@163.com>
@stdrc stdrc requested review from jetjinser and stdrc January 2, 2024 03:51
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
@TennyZhuang
Copy link
Contributor

I prefer to merge it without documenting it now.

@wangrunji0408 wangrunji0408 added this pull request to the merge queue Jan 12, 2024
@wangrunji0408
Copy link
Contributor Author

I prefer to merge it without documenting it now.

So we will maintain the doc inside this repo but won't publish it into risingwave.dev?

Merged via the queue into main with commit e8f1eb9 Jan 12, 2024
29 of 30 checks passed
@wangrunji0408 wangrunji0408 deleted the wrj/wasm-udf branch January 12, 2024 06:29
@fuyufjh
Copy link
Member

fuyufjh commented Jan 12, 2024

You can embed WASM binaries into SQL using base64 encoding. RW will upload the binary into object store with the path defined in the parameter wasm_storage_url.

Why not just storing them in meta store? In most cases the WASM won't be very big and meta store (will be migrated to RDBMS e.g. PostgresQL) should be good enough. I prefer to make it simple and avoid introduce new dependent systems.

A related discussion: #12982. Also recommend meta store for it in my opinion.

@wangrunji0408
Copy link
Contributor Author

The wasm binary in e2e test is about 1.5MB after strip, 400KB after compression. It sounds okay to store them into meta.

@fuyufjh
Copy link
Member

fuyufjh commented Jan 15, 2024

The wasm binary in e2e test is about 1.5MB after strip, 400KB after compression. It sounds okay to store them into meta.

Let me note it down as an issue as a reminder.

Little-Wallace pushed a commit that referenced this pull request Jan 20, 2024
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Co-authored-by: xxchan <xxchan22f@gmail.com>
Co-authored-by: wangrunji0408 <wangrunji0408@users.noreply.github.com>
wangrunji0408 added a commit that referenced this pull request Jan 22, 2024
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Co-authored-by: xxchan <xxchan22f@gmail.com>
Co-authored-by: wangrunji0408 <wangrunji0408@users.noreply.github.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
wangrunji0408 added a commit that referenced this pull request Jan 23, 2024
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Co-authored-by: xxchan <xxchan22f@gmail.com>
Co-authored-by: wangrunji0408 <wangrunji0408@users.noreply.github.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
wangrunji0408 added a commit that referenced this pull request Jan 23, 2024
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Co-authored-by: xxchan <xxchan22f@gmail.com>
Co-authored-by: wangrunji0408 <wangrunji0408@users.noreply.github.com>
}

/// Convert a data type to string used in identifier.
fn datatype_name(ty: &DataType) -> String {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why didn't we use DataType's Display? 🤪

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type names defined in WASM module are slightly different than RW's type names, as the former follows Arrow specification while the latter follows postgres'.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think arrow doesn't specified things like T[], and thus the "string representation of a datatype" is still kind of arbitrary IMO.

So does the signature matter in the arrow-udf-wasm crate, or it can be decided by the caller?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think arrow doesn't specified things like T[], and thus the "string representation of a datatype" is still kind of arbitrary IMO.

Sure. T[] is still pg style and should be changed to list<T> in my opinion.

So does the signature matter in the arrow-udf-wasm crate, or it can be decided by the caller?

The signature format is a part of arrow-udf API and may vary between versions. (e.g. from API version 2 to 3, the type name of bigint was changed from int8 to int64, in order to be consistent with Arrow and avoid ambiguity) The caller needs to use the correct name according to the API version declared by the WASM module.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants