Export minimum C API and examples for C, Ruby and Python #2622

Closed
wants to merge 5 commits into from


kou
Member

@kou kou commented May 26, 2022

Which issue does this PR close?

Closes #1113.

Rationale for this change

See #1113.

What changes are included in this PR?

This exports a minimal C API, enough to write the following Rust code in C:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
  // create the execution context
  let mut ctx = ExecutionContext::new();

  // create a plan to run a SQL query
  let df = ctx.sql("SELECT 1").await?;

  // execute and print results
  df.show().await?;
  Ok(())
}

See datafusion/c/examples/sql.c for the C version. You can build and run
it with the following commands:

$ cargo build
$ cc -o target/debug/sql datafusion/c/examples/sql.c -Idatafusion/c/include -Ltarget/debug -Wl,--rpath=target/debug -ldatafusion_c
$ target/debug/sql
+----------+
| Int64(1) |
+----------+
| 1        |
+----------+

Like datafusion-python, this implementation doesn't export Future.
Async functions are block_on()-ed inside the exported API, but I think
we can export Future in a follow-up task.
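
A minimal sketch of the "block_on()-ed in the exported API" idea follows. It is not the PR's code: the PR uses `tokio::runtime::Runtime::block_on`, while this sketch swaps in a small std-only executor so it compiles without dependencies. The names `demo_count_statements` and `count_statements` are hypothetical stand-ins for an exported function and for an async call such as `ctx.sql(...)`.

```rust
use std::ffi::CStr;
use std::future::Future;
use std::os::raw::c_char;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Unparks the blocked thread when the future can make progress.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Std-only stand-in (assumption) for `tokio::runtime::Runtime::block_on`.
fn block_on<F: Future>(future: F) -> F::Output {
    let mut future = Box::pin(future);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match future.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical async operation standing in for `ctx.sql(...)`.
async fn count_statements(sql: &str) -> i64 {
    sql.split(';').filter(|s| !s.trim().is_empty()).count() as i64
}

/// Hypothetical exported function: the C caller sees an ordinary blocking call.
/// Returns -1 if `sql` is NULL or not valid UTF-8.
/// A real cdylib would also mark this `#[no_mangle]`
/// (spelled `#[unsafe(no_mangle)]` in the 2024 edition).
pub unsafe extern "C" fn demo_count_statements(sql: *const c_char) -> i64 {
    if sql.is_null() {
        return -1;
    }
    // SAFETY: the caller promises `sql` points to a NUL-terminated string.
    let sql = unsafe { CStr::from_ptr(sql) };
    match sql.to_str() {
        Ok(sql) => block_on(count_statements(sql)),
        Err(_) => -1,
    }
}
```

The async machinery never crosses the FFI boundary here, which is why no Future type has to be exported; the cost is that the calling thread is blocked for the duration of the call.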

Follow-up tasks:

  • Add support for testing with "cargo test"
  • Add support for building and running examples with "cargo ..."
  • Add support for installing datafusion.h
  • Add documentation

Are there any user-facing changes?

Users can use DataFusion from C and/or FFI.

Does this PR break compatibility with Ballista?

No.

@kou kou changed the title from "Add minimum C API" to "Export minimum C API" May 26, 2022
@github-actions github-actions bot added the datafusion Changes in the datafusion crate label May 26, 2022
@kou kou mentioned this pull request May 26, 2022
@kou
Member Author

kou commented May 27, 2022

I've added examples that use the C API from Python and Ruby through their FFI libraries.

@andygrove
Member

Hi @kou and thanks for the contribution!

This looks really interesting but I would like to understand more about the motivation and context for this. Making DataFusion accessible from C makes sense but I am wondering if we should create a separate repository for this in https://github.com/datafusion-contrib/. This is where we have the Java and Python bindings for DataFusion. I'm also curious why we would want to go from Python -> C -> Rust rather than just Python -> Rust directly as we do in https://github.com/datafusion-contrib/datafusion-python

I am also concerned that adding C code as part of the default build may be problematic for some users. I assume there are some minimum requirements for having this work on all platforms?

@alamb alamb changed the title from "Export minimum C API" to "Export minimum C API and examples for C, Ruby and Python" May 28, 2022
Contributor

@alamb alamb left a comment

I looked over the code and tried it out. Thank you very much @kou .

(arrow_dev) alamb@MacBook-Pro-6:~/Software/arrow-datafusion$ cc   -o target/debug/sql   -I datafusion/c/include   datafusion/c/examples/sql.c   -L target/debug     -ldatafusion_c
(arrow_dev) alamb@MacBook-Pro-6:~/Software/arrow-datafusion$ ./target/debug/sql 
+----------+
| Int64(1) |
+----------+
| 1        |
+----------+

I agree with @andygrove that this code could also reasonably live in another crate / repo rather than the core datafusion one.

Some suggestions:

  1. It would be nice to put a minimal readme in datafusion/c (saying, for example, that the directory contains the C API, see examples/README.md for more details). Or maybe we could move datafusion/c/examples/README.md to datafusion/c/README.md to make it more discoverable
  2. I think it would be a good idea to track planned follow on items as individual tasks

> API. But I think that we can export Future in follow-up tasks.

I am not familiar with how to interface Rust async functions with C -- I would assume it involves callbacks somehow, but I can imagine it getting very tricky very quickly

cc @houqp and @jimexist

        Ok(value) => Some(value),
        Err(e) => {
            if !error.is_null() {
                let c_string_message = match CString::new(format!("{}", e)) {
Suggested change
- let c_string_message = match CString::new(format!("{}", e)) {
+ let c_string_message = match CString::new(e.to_string()) {
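
The hunk above converts a Rust `Result` into a C-style error out-parameter. A simplified sketch of that pattern is below. It assumes the error is a plain `char *` message, whereas the PR uses its own error type; `ok_or_set_error` and `demo_error_free` are hypothetical names.

```rust
use std::ffi::CString;
use std::os::raw::c_char;

/// Converts a `Result` into the C convention: `Some(value)` on success,
/// `None` plus a heap-allocated message stored in `*error` on failure.
/// `error` may be NULL when the caller does not want the message.
unsafe fn ok_or_set_error<T, E: ToString>(
    result: Result<T, E>,
    error: *mut *mut c_char,
) -> Option<T> {
    match result {
        Ok(value) => Some(value),
        Err(e) => {
            if !error.is_null() {
                // CString::new fails on interior NUL bytes; fall back to a fixed message.
                let message = CString::new(e.to_string())
                    .unwrap_or_else(|_| CString::new("invalid error message").unwrap());
                // SAFETY: the caller promises `error` is valid for writes.
                unsafe { *error = message.into_raw() };
            }
            None
        }
    }
}

/// Releases a message produced by `ok_or_set_error` (hypothetical helper).
pub unsafe extern "C" fn demo_error_free(error: *mut c_char) {
    if !error.is_null() {
        // SAFETY: the pointer came from `CString::into_raw` above.
        drop(unsafe { CString::from_raw(error) });
    }
}
```

Ownership of the message passes to the caller through `into_raw`, so the caller has to hand it back to a Rust-side free function rather than calling C's `free`.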

}

fn block_on<F: Future>(future: F) -> F::Output {
    tokio::runtime::Runtime::new().unwrap().block_on(future)
Suggested change
- tokio::runtime::Runtime::new().unwrap().block_on(future)
+ tokio::runtime::Runtime::new().expect("Can not create tokio runtime").block_on(future)

@kou
Member Author

kou commented May 30, 2022

Regarding the repository, I'm OK with moving this. Could you create https://github.com/datafusion-contrib/datafusion-c or something similar? Or should I create my own repository for it?

> I'm also curious why we would want to go from Python -> C -> Rust rather than just Python -> Rust directly as we do in https://github.com/datafusion-contrib/datafusion-python

Sorry for the confusion. I didn't mean to suggest that we should use this for the Python bindings. I just wanted to show, as a use case for this C API, that bindings can be built with an FFI library. I should have used Julia or something else rather than Python.

> • It would be nice to put a minimal readme in datafusion/c (saying, for example, that the directory contains the C API, see examples/README.md for more details). Or maybe we could move datafusion/c/examples/README.md to datafusion/c/README.md to make it more discoverable
>
> • I think it would be a good idea to track planned follow on items as individual tasks

They make sense. I'll do them in the new repository.

> I am not familiar how to interface Rust async functions with C -- I would assume it looks like callbacks somehow, but I can imagine how it gets very tricky very quickly

I will not use callbacks. I'm thinking of a future-based API like the following:

#include <stdio.h>
#include <stdlib.h>

#include <datafusion.h>

int
main(void)
{
  DFSessionContext *context = df_session_context_new();
  DFError *error = NULL;

  DFDataFrameFuture *sql_future =
    df_session_context_sql_async(context, "SELECT 1;", &error);
  if (error) {
    printf("failed to start SQL: %s\n", df_error_get_message(error));
    df_error_free(error);
    df_session_context_free(context);
    return EXIT_FAILURE;
  }

  DFDataFrame *data_frame = df_data_frame_future_await(sql_future, &error);
  if (error) {
    printf("failed to run SQL: %s\n", df_error_get_message(error));
    df_error_free(error);
    df_session_context_free(context);
    return EXIT_FAILURE;
  }

  DFFuture *show_future = df_data_frame_show_async(data_frame, &error);
  if (error) {
    printf("failed to start showing data frame: %s\n",
           df_error_get_message(error));
    df_error_free(error);
    df_data_frame_free(data_frame);
    df_session_context_free(context);
    return EXIT_FAILURE;
  }

  df_future_await(show_future, &error);
  if (error) {
    printf("failed to show data frame: %s\n",
           df_error_get_message(error));
    df_error_free(error);
    df_data_frame_free(data_frame);
    df_session_context_free(context);
    return EXIT_FAILURE;
  }

  df_data_frame_free(data_frame);
  df_session_context_free(context);
  return EXIT_SUCCESS;
}

But I may change my mind.
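
For illustration, here is one way the Rust side of such an `*_async` / `*_await` pair could look. This is only a sketch under stated assumptions, not the design kou settled on: `DemoFuture`, `demo_answer_async` and `demo_future_await` are hypothetical names, the future's output is fixed to `i64`, and a std-only executor replaces tokio.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

type BoxedFuture = Pin<Box<dyn Future<Output = i64>>>;

/// Opaque handle handed to C; it owns a future that has not been awaited yet.
pub struct DemoFuture {
    inner: BoxedFuture,
}

/// Unparks the awaiting thread when the future can make progress.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Drives the future to completion on the calling thread.
fn block_on(mut future: BoxedFuture) -> i64 {
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match future.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical `*_async` entry point: packages the work and returns a handle.
/// Nothing runs until the handle is awaited.
pub extern "C" fn demo_answer_async() -> *mut DemoFuture {
    let future = DemoFuture {
        inner: Box::pin(async { 40_i64 + 2 }),
    };
    Box::into_raw(Box::new(future))
}

/// Hypothetical `*_await` entry point: consumes the handle and blocks until done.
pub unsafe extern "C" fn demo_future_await(future: *mut DemoFuture) -> i64 {
    // SAFETY: the caller passes a pointer obtained from `demo_answer_async`
    // that has not been awaited or freed before.
    let boxed = unsafe { Box::from_raw(future) };
    let handle = *boxed;
    block_on(handle.inner)
}
```

In this sketch the await call takes ownership of the handle, so the C caller does not free the future separately. A design that lets a caller abandon a future without awaiting it would also need an explicit free function.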

Thanks for the suggestions on my Rust code! This is my first Rust program, so suggestions are very welcome. :-)

@andygrove
Member

Thanks @kou. I created a new repo https://github.com/datafusion-contrib/datafusion-c where you can PR this work. Thanks again for adding another language binding!

@jimexist
Member

I'm fine with either creating a separate repo or merging as is. Thanks for the good work! I guess the Java binding can now use this C interface, possibly with the new jextract tool as well.

@loic-sharma
Contributor

loic-sharma commented May 30, 2022

Hello, I've been watching DataFusion from the sidelines and am interested in using it in a Zig project. As a disclaimer, I'm new to Rust and am not that familiar with its async implementation.

DataFusion's "query engine as a library" is similar to SQLite's "database as a library". SQLite's success is in part due to how easy it is to integrate it into any project, regardless of the language it uses. It'd be wonderful if DataFusion was also as easy to use in all projects!

Today, a key challenge is Rust's async. While using blocking tricks is a great short-term solution, it is inefficient (it requires multiple threads) and can cause deadlock issues. Callbacks are also not a perfect solution (what if my language runtime requires stack unwinding?).

In my opinion this would be helped by:

  1. Introducing sync APIs - Rust's async is a barrier to integrating with other languages. A subset of features, like querying data that is entirely in-memory, would be supported without requiring async or sync-over-async. This would mean avoiding async where possible, due to the virality of async APIs.
  2. An excellent C API - Most languages have tooling to integrate with C, so a C API makes it easier to use DataFusion in other languages. These C APIs would leverage DataFusion's sync APIs where possible.

@houqp
Member

houqp commented May 30, 2022

Very cool demo @kou ! Agree that it would be better to manage it through the datafusion-c repo in the contrib github org.

With regard to async, I think we can use the tokio runtime's block_on to hide all the async APIs behind a set of sync APIs, similar to what we do in our Python binding.

@alamb
Contributor

alamb commented May 31, 2022

> Introducing sync APIs - Rust's async is a barrier to integrating with other languages. A subset of features, like querying data that is entirely in-memory, would be supported without requiring async or sync-over-async. This would mean avoiding async where possible, due to the virality of async APIs.

Yes, I think this is the most reasonable suggestion -- don't expose any async APIs and have DataFusion do its thread pool / IO management internally. If people want the additional performance or resource control, they can use the Rust APIs directly.
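
As a rough illustration of a sync-only surface that manages its own threads, the sketch below has the facade own a worker thread while callers see plain blocking methods. It is an assumption-laden toy, not DataFusion code: `SyncEngine` and `run_query` are hypothetical, and the "query engine" only understands `SELECT 1`.

```rust
use std::sync::mpsc;
use std::thread;

/// One unit of work sent to the worker, with a channel for the reply.
struct Job {
    query: String,
    reply: mpsc::Sender<Result<String, String>>,
}

/// Hypothetical sync facade: owns its worker thread, exposes no async types.
pub struct SyncEngine {
    jobs: Option<mpsc::Sender<Job>>,
    worker: Option<thread::JoinHandle<()>>,
}

impl SyncEngine {
    pub fn new() -> Self {
        let (jobs, inbox) = mpsc::channel::<Job>();
        let worker = thread::spawn(move || {
            // Runs until every sender has been dropped.
            for job in inbox {
                let _ = job.reply.send(run_query(&job.query));
            }
        });
        SyncEngine {
            jobs: Some(jobs),
            worker: Some(worker),
        }
    }

    /// Blocking call: hands the query to the worker and waits for the answer.
    pub fn sql(&self, query: &str) -> Result<String, String> {
        let (reply, answer) = mpsc::channel();
        let job = Job {
            query: query.to_string(),
            reply,
        };
        self.jobs
            .as_ref()
            .expect("engine is running")
            .send(job)
            .map_err(|_| "worker stopped".to_string())?;
        answer
            .recv()
            .map_err(|_| "worker dropped the reply".to_string())?
    }
}

impl Drop for SyncEngine {
    fn drop(&mut self) {
        // Closing the channel ends the worker loop; then wait for it to exit.
        drop(self.jobs.take());
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}

/// Toy stand-in for query execution (assumption): only `SELECT 1` is supported.
fn run_query(query: &str) -> Result<String, String> {
    let normalized = query.trim().trim_end_matches(';').trim();
    if normalized.eq_ignore_ascii_case("SELECT 1") {
        Ok("1".to_string())
    } else {
        Err(format!("unsupported query: {query}"))
    }
}
```

A caller would write `SyncEngine::new().sql("SELECT 1;")` and get `Ok("1")` back. Thread lifetime is tied to the facade's `Drop`, which maps naturally onto a C-style `*_free` function.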

@andygrove andygrove marked this pull request as draft May 31, 2022 16:03
@andygrove
Member

Moving this to draft to avoid accidental merge

@kou
Member Author

kou commented Jun 3, 2022

I'm closing this in favor of datafusion-contrib/datafusion-c#1.

@kou kou closed this Jun 3, 2022
@kou kou deleted the c-api branch June 3, 2022 08:13
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Export C API
6 participants