expression plugin function with multiple columns as input #49

Closed
Macfly opened this issue Nov 26, 2023 · 7 comments · Fixed by #69

Comments

@Macfly

Macfly commented Nov 26, 2023

I've looked at the examples, but the haversine one is a bit confusing.
It looks like the Python example passes 5 parameters to the Rust function, so the last one is ignored, as the function only takes 4 parameters.

What is the best way to implement a function that takes multiple columns as parameters and still use all the parallelism and optimization from Polars?

@oscar6echo

I concur.

The definition of the column is haversine=pl.col("floats").dist.haversine("floats", "floats", "floats", "floats") with sample column floats [5.6, -1245.8, 242.224].
And the unpacking of the data is:

#[polars_expr(output_type_func=haversine_output)]
fn haversine(inputs: &[Series]) -> PolarsResult<Series> {
    let out = match inputs[0].dtype() {
        DataType::Float32 => {
            let start_lat = inputs[0].f32().unwrap();
            let start_long = inputs[1].f32().unwrap();
            let end_lat = inputs[2].f32().unwrap();
            let end_long = inputs[3].f32().unwrap();
// etc

So, judging from the other examples, inputs[0] is the column on which the plugin function is applied, inputs[1], inputs[2] and inputs[3] are the first three columns passed as arguments, and the last argument is never used. If this is correct, it is a bit misleading, isn't it?
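To make the point concrete, here is a minimal plain-Python model of the indexing described above. It is only a sketch and does not use the Polars plugin API: `haversine_km` and `haversine_plugin_model` are hypothetical names, and the kilometre unit and Earth radius are assumptions. Like the Rust snippet, it reads only the first four entries of `inputs`, so a fifth column changes nothing.

```python
import math


def haversine_km(start_lat, start_long, end_lat, end_long):
    """Great-circle distance in km between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km (assumed unit)
    phi1, phi2 = math.radians(start_lat), math.radians(end_lat)
    d_phi = phi2 - phi1
    d_lambda = math.radians(end_long - start_long)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * r * math.asin(math.sqrt(a))


def haversine_plugin_model(inputs):
    """Model of the Rust function: only inputs[0]..inputs[3] are read."""
    start_lat, start_long, end_lat, end_long = inputs[:4]
    rows = zip(start_lat, start_long, end_lat, end_long)
    return [haversine_km(*row) for row in rows]


floats = [5.6, -1245.8, 242.224]
# inputs[0] is the column the expression is applied to, the rest are arguments
four = haversine_plugin_model([floats] * 4)
five = haversine_plugin_model([floats] * 5)  # the fifth column is silently ignored
```

With the same column used for every argument, start and end points coincide, so every distance is zero whether four or five columns are passed.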

The pig_latinnify and pig_latinnify_with_parallelism make sense, work as expected (I cannot judge the parallelism) and are didactic demos, and so is the kwargs example.

But I would argue that a very generic sample function with several columns as input and several columns as output would be more helpful and, in terms of API, it would be quite readable to have the inputs and outputs as pl.struct.

Here is the Python shell, super generic:

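import polars as pl

# `lib` is assumed to be defined elsewhere as the path to the compiled plugin library
@pl.api.register_expr_namespace("demo")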
class Demo:
    def __init__(self, expr: pl.Expr):
        self._expr = expr

    def demo_calc(self) -> pl.Expr:
        return self._expr._register_plugin(
            lib=lib,
            symbol="demo_calc",
            is_elementwise=True,
            kwargs={},
        )

A loosely sketched Rust counterpart would be:

// struct input column
struct FuncStructIn {
    a_float: f64,
    a_int: i64,
    a_string: String,
}

// struct output column
struct FuncStructOut {
    a_valid: bool,
    a_one: f64,
    a_two: f64,
}

// output type is a struct with schema FuncStructOut
#[polars_expr(output_type=FuncStructOut)]
fn demo_calc(inputs: &[Series], kwargs: PigLatinKwargs) -> PolarsResult<Series> {
    
    // unpack struct to individual variables, with all the relevant error checking
    // obviously this does not work but you get the idea
    let s_in: FuncStructIn = inputs[0].struct_()?;

    let a_float = s_in.a_float;
    let a_int = s_in.a_int;
    let a_string = s_in.a_string;

    // impl_demo_calc contains the actual processing
    let s_out: FuncStructOut = impl_demo_calc(a_float, a_int, a_string)?;

    // returns result
    // again, the syntax is very approximate
    Ok(s_out.into_series())
}

// approx syntax
pub(super) fn impl_demo_calc(
    a_float: &ChunkedArray<Float64Type>,
    a_int: &ChunkedArray<Int64Type>,
    a_string: &ChunkedArray<Utf8Type>,
) -> PolarsResult<ChunkedArray<FuncStructOut>> {
    // can be anything but return a FuncStructOut shaped struct
    // unless compute error due to bad input
}

// once this is done, a _with_parallelism version can probably be derived in a second stage.

Hopefully the intention is clear.

Unfortunately, I don't know enough about polars-rs, or to be honest Rust in general, at the moment.
The unpacking/repacking in particular is obscure to me.

If there is no serious drawback to this approach (e.g. the performance cost of creating the input struct), then once the unpacking/repacking and error management are demonstrated, it can become a useful boilerplate to start from for many use cases and people.

Whether maintainers are willing to improve this part of the doc or not, I would be grateful for pointers to continue exploring.

@Macfly
Author

Macfly commented Feb 17, 2024

looks like the dev has been done:
870da1e

But I'm still not sure how we can pass multiple columns to a function.

@oscar6echo

> looks like the dev has been done:
> 870da1e

I think this solves a use case brought forward by polars extension plugin author Ion Koutsouris in this issue #58.

In the meantime I've learnt some Rust and done some research, and existing community plugins are the best source of information. There I could find examples with a single input column of type list or struct, and a struct as output. Aggregating several columns into a column of type struct is easy (and cheap in lazy mode, it seems). The same goes for unpacking a struct into several columns.
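The pack / compute / unpack pattern described here can be modelled in plain Python as below. This is only a sketch of the data flow, not Polars code: `pack`, `unpack` and the body of `demo_calc` are hypothetical, and the field names are borrowed from the Rust draft earlier in the thread. In Polars itself the packing step would typically be done with pl.struct and the unpacking with unnest.

```python
from dataclasses import dataclass


@dataclass
class FuncStructIn:
    a_float: float
    a_int: int
    a_string: str


@dataclass
class FuncStructOut:
    a_valid: bool
    a_one: float
    a_two: float


def pack(a_float, a_int, a_string):
    """Zip several equal-length columns into one column of struct rows."""
    return [FuncStructIn(f, i, s) for f, i, s in zip(a_float, a_int, a_string)]


def demo_calc(row):
    """Hypothetical elementwise computation: one struct in, one struct out."""
    valid = row.a_string != ""
    return FuncStructOut(valid, row.a_float * row.a_int, row.a_float + len(row.a_string))


def unpack(rows):
    """Split a column of output structs back into one list per field."""
    return {
        "a_valid": [r.a_valid for r in rows],
        "a_one": [r.a_one for r in rows],
        "a_two": [r.a_two for r in rows],
    }


struct_col = pack([1.5, 2.0], [2, 3], ["ab", ""])
out = unpack([demo_calc(r) for r in struct_col])
```

The function itself only ever sees a single struct-typed column, which is what keeps the plugin signature simple regardless of how many logical inputs and outputs there are.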

I am not far from testing it and if confirmed, then my naive question above will be answered.

@MarcoGorelli
Contributor

MarcoGorelli commented Feb 17, 2024

Does this help? https://marcogorelli.github.io/polars-plugins-tutorial/sum/

I think you're right about the haversine example though, will make a pr to clear it up

@Macfly
Author

Macfly commented Feb 18, 2024

> Does this help? https://marcogorelli.github.io/polars-plugins-tutorial/sum/
>
> I think you're right about the haversine example though, will make a pr to clear it up

Yes, thanks a lot.
One more thing is not clear: it looks like "pig_latinnify_with_paralellism" is not used anywhere, just defined. When do we need a library like Rayon to speed things up, given that one of the advantages of shared library plugins is that "Parallelism and optimizations (are) managed by the default polars runtime"?

@MarcoGorelli
Contributor

> it looks like "pig_latinnify_with_paralellism" is not used anywhere, just defined

I have the same question :)

@oscar6echo

> Does this help? https://marcogorelli.github.io/polars-plugins-tutorial/sum/

YES it does !
I read through it all and implemented along the way - not done the suggested extra work yet.
But it is a very valuable help 🙏

For somebody intermediate in Rust, a beginner in Rust Polars, and proficient in Python Polars, the following sections in particular got my attention:
