arrow::util::pretty::pretty_format_batches missing #769

Closed
yuribudilov opened this issue Jul 22, 2021 · 7 comments · Fixed by #772

Comments

@yuribudilov

Hello,
My apologies for a novice Arrow question.
I am not able to compile the code sample because the "pretty" module is missing from arrow::util.
I am using Rust 1.53.0 stable.
My Cargo.toml is:
[package]
name = "test_arrow"
version = "0.1.0"
edition = "2018"
[dependencies]
arrow = "5.0.0"
datafusion = "4.0.0"
tokio = "1.8.2"

// compilation can not find this:
use arrow::util::pretty::print_batches;
// also this fails to compile:
let pretty_results = arrow::util::pretty::pretty_format_batches(&results)?;

Error: cannot find 'pretty' in util.

What am I doing wrong please?

thank you very much

@alamb
Contributor

alamb commented Jul 22, 2021

Hi @yuribudilov -- you need to enable the "prettyprint" feature for arrow.

So instead of

arrow = "5.0.0"

try using

arrow = { version = "5.0", features = ["prettyprint"] }
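For anyone wondering why the module simply vanishes without the feature: Cargo features usually gate code with `#[cfg(feature = "...")]`, so a gated module is not compiled at all unless the feature is switched on. Below is a minimal std-only sketch of that mechanism. It is not arrow's actual source, and `my_util`, `pretty`, `format_rows`, and `pretty_available` are hypothetical names used only for illustration.

```rust
// Hypothetical stand-in for a crate that hides a module behind a Cargo feature.
mod my_util {
    // Compiled only when the crate is built with the `prettyprint` feature.
    #[cfg(feature = "prettyprint")]
    pub mod pretty {
        pub fn format_rows(rows: &[&str]) -> String {
            rows.join("\n")
        }
    }
}

// Reports whether the (hypothetical) feature was enabled at compile time.
fn pretty_available() -> bool {
    cfg!(feature = "prettyprint")
}

fn main() {
    // With the feature on, the gated module exists and can be called.
    #[cfg(feature = "prettyprint")]
    println!("{}", my_util::pretty::format_rows(&["a", "b"]));

    // With the feature off, any reference to `my_util::pretty` would fail
    // with a "cannot find `pretty`" style error, so this branch avoids it.
    #[cfg(not(feature = "prettyprint"))]
    println!("`my_util::pretty` was not compiled in");

    println!("feature enabled: {}", pretty_available());
}
```

Built without the feature, the sketch takes the second branch, which mirrors the "cannot find 'pretty' in util" situation reported above.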

@yuribudilov
Author

thank you.

One compilation error is now gone, but it has been replaced by two others: one step forward, two steps back.

Repro:

The page at https://github.com/apache/arrow-datafusion gives a Rust code sample (quoted below), which does not compile:

use arrow::record_batch::RecordBatch;
use arrow::util::pretty::print_batches;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // register the table
    let mut ctx = ExecutionContext::new();
    ctx.register_csv("example", "tests/example.csv", CsvReadOptions::new())?;

    // create a plan to run a SQL query
    let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a LIMIT 100")?;

    // execute and print results
    let results: Vec<RecordBatch> = df.collect().await?; // error 1 here
    print_batches(&results)?; // error 2 here
    Ok(())
}

The TOML on that page only shows one line: datafusion = "4.0.0-SNAPSHOT"

That TOML does not work on its own because it lists neither an arrow nor a tokio dependency.
So I added those myself.

Here is what I have now, which still does not work:
[package]
name = "test_arrow"
version = "0.1.0"
edition = "2018"
[dependencies]
datafusion = "4.0.0"
tokio = "1.8.2"
arrow = { version = "5.0", features = ["prettyprint"] }

With the above, I still get two compilation errors:

15 | let results: Vec<RecordBatch> = df.collect().await?; // error 1
| ^^^^^^^^^^^^^^^^^^^ expected struct arrow::record_batch::RecordBatch, found a different struct arrow::record_batch::RecordBatch
|
= note: expected struct Vec<arrow::record_batch::RecordBatch> (struct arrow::record_batch::RecordBatch)
found struct Vec<arrow::record_batch::RecordBatch> (struct arrow::record_batch::RecordBatch)
= note: perhaps two different versions of crate arrow are being used?
note: return type inferred to be Vec<arrow::record_batch::RecordBatch> here
--> src\main.rs:9:5
|
9 | ctx.register_csv("example", "tests/example.csv", CsvReadOptions::new())?;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0277]: ? couldn't convert the error to DataFusionError
--> src\main.rs:16:28
|
16 | print_batches(&results)?; // error 2
| ^ the trait From<arrow::error::ArrowError> is not implemented for DataFusionError
|
= note: the question mark operation (?) implicitly performs a conversion on the error value using the From trait
= help: the following implementations were found:
<DataFusionError as From<arrow::error::ArrowError>>
<DataFusionError as From<parquet::errors::ParquetError>>
<DataFusionError as From<sqlparser::parser::ParserError>>
<DataFusionError as From<std::io::Error>>
= note: required by from

error: aborting due to 2 previous errors

The first one can be "covered up" by letting Rust infer the data type, like so (which is very odd, given that it appears to infer the same Vec!):
let results = df.collect().await?;
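A note on why inference "covers up" error 1: two versions of one crate define distinct types, even when the paths and fields look identical. Here is a minimal std-only sketch of that situation. The modules `arrow_v4` and `arrow_v5` and the function `collect` are hypothetical stand-ins, not the real crates or the real DataFusion API.

```rust
// Two stand-in modules playing the role of two versions of the same crate.
mod arrow_v4 {
    #[derive(Debug, PartialEq)]
    pub struct RecordBatch {
        pub rows: usize,
    }
}

mod arrow_v5 {
    #[derive(Debug, PartialEq)]
    pub struct RecordBatch {
        pub rows: usize,
    }
}

// Stand-in for a library that was built against the older version.
fn collect() -> Vec<arrow_v4::RecordBatch> {
    vec![arrow_v4::RecordBatch { rows: 3 }]
}

fn main() {
    // Does not compile (mismatched types): same name and fields, distinct type.
    // let results: Vec<arrow_v5::RecordBatch> = collect();

    // Compiles: inference picks the type the library actually returns.
    let results = collect();
    println!("{:?}", results);

    // The newer type still exists, but it cannot be mixed with the older one.
    let newer = arrow_v5::RecordBatch { rows: 3 };
    println!("{:?}", newer);
}
```

So the inferred type is not really "the same Vec": it is a Vec of the older version's RecordBatch, which is why the annotation naming the newer one was rejected.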

The second error indicates that something is wrong with the documented TOML:
print_batches(&results)?;
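Error 2 looks contradictory because the compiler lists `From<arrow::error::ArrowError>` as implemented while also saying it is missing. A likely reading is that the two `ArrowError` types come from different arrow versions. The sketch below reproduces that shape with std only. All names in it (`arrow_v4`, `arrow_v5`, `print_v4`, `print_v5`, `run`, and this simplified `DataFusionError`) are hypothetical stand-ins, not the real crates.

```rust
// Stand-ins for two versions of the same crate (hypothetical names).
mod arrow_v4 {
    #[derive(Debug)]
    pub struct ArrowError(pub String);
}

mod arrow_v5 {
    #[derive(Debug)]
    pub struct ArrowError(pub String);
}

// Simplified stand-in for the library's error type.
#[derive(Debug)]
enum DataFusionError {
    Arrow(arrow_v4::ArrowError),
    Other(String),
}

// The library can only convert the error type it was built against.
impl From<arrow_v4::ArrowError> for DataFusionError {
    fn from(e: arrow_v4::ArrowError) -> Self {
        DataFusionError::Arrow(e)
    }
}

fn print_v4(fail: bool) -> Result<(), arrow_v4::ArrowError> {
    if fail {
        Err(arrow_v4::ArrowError("v4 failed".to_string()))
    } else {
        Ok(())
    }
}

fn print_v5(fail: bool) -> Result<(), arrow_v5::ArrowError> {
    if fail {
        Err(arrow_v5::ArrowError("v5 failed".to_string()))
    } else {
        Ok(())
    }
}

fn run(fail_v4: bool, fail_v5: bool) -> Result<(), DataFusionError> {
    // Fine: `?` finds From<arrow_v4::ArrowError>.
    print_v4(fail_v4)?;

    // Would not compile: there is no From<arrow_v5::ArrowError>.
    // print_v5(fail_v5)?;

    // An explicit conversion compiles, but it only papers over the mismatch.
    print_v5(fail_v5).map_err(|e| DataFusionError::Other(e.0))?;
    Ok(())
}

fn main() {
    println!("{:?}", run(false, false));
    println!("{:?}", run(true, false));
    println!("{:?}", run(false, true));
}
```

The `map_err` line is shown only to make the mechanics visible. Depending on a single arrow version that matches the library is the cleaner fix.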

Can you please point me to documentation on how to use this product from Rust?
Many thanks.

@yuribudilov
Author

OK, I fixed it, thanks to the Rust compiler (what a fantastic language!!)

The Rust errors suggested that different versions of the arrow crate were being used.

So I tried using an earlier arrow version in TOML:

arrow = { version = "4.4.0", features = ["prettyprint", "default"] }

This compiles, builds, and runs correctly! Phew! Happy days.

May I humbly suggest that there is likely to be a buglet in datafusion 4.0.0, in arrow 5.0, or in both?

May I also suggest updating the datafusion documentation to list more complete TOML dependencies? Those of us who are new to arrow/datafusion and would like to learn could use more help, and reliable, accessible documentation is all we have.

Many thanks for reading thus far, it looks like a fantastic product you have been building!
Please feel free to close this issue.

@alamb
Contributor

alamb commented Jul 23, 2021

> arrow = { version = "4.4.0", features = ["prettyprint", "default"] }

Yes, this is the version of arrow that the (released) datafusion version 4.0 works with. 👍

The fact that we haven't released a new version of datafusion to crates.io that works with arrow 5 is a problem which we should rectify.

DataFusion (at least on master) also includes a "public export" of its arrow dependency, so perhaps we should change the example from

use arrow::record_batch::RecordBatch;
use arrow::util::pretty::print_batches;

to

use datafusion::arrow::record_batch::RecordBatch;
use datafusion::arrow::util::pretty::print_batches;
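To illustrate why importing through the re-export sidesteps the version mismatch, here is a minimal std-only sketch. The modules `arrow_v4` and `datafusion` and the function `collect` are hypothetical stand-ins that only model the idea of a library publicly re-exporting its own dependency. They are not the real crates.

```rust
// Stand-in for the arrow version that the library was built against.
pub mod arrow_v4 {
    pub mod record_batch {
        #[derive(Debug, PartialEq)]
        pub struct RecordBatch {
            pub rows: usize,
        }
    }
}

// Stand-in for a library that publicly re-exports its own dependency.
pub mod datafusion {
    pub use super::arrow_v4 as arrow;

    pub fn collect() -> Vec<arrow::record_batch::RecordBatch> {
        vec![arrow::record_batch::RecordBatch { rows: 2 }]
    }
}

// Importing through the re-export always names the library's own version,
// so the caller does not have to pick a matching version separately.
use datafusion::arrow::record_batch::RecordBatch;

fn main() {
    // The annotation now agrees with what the library returns.
    let results: Vec<RecordBatch> = datafusion::collect();
    println!("{:?}", results);
}
```

With this pattern the explicit `Vec<RecordBatch>` annotation from the original example type-checks, because both sides refer to one and the same type.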

> Many thanks for reading thus far, it looks like a fantastic product you have been building!

Thanks! Kudos go to the whole team (there are many people whose work goes into making it)

@alamb
Contributor

alamb commented Jul 23, 2021

I made #772 to try and improve the docs a little bit

@yuribudilov
Author

I appreciate your support, wonderful and quick!
FWIW, I have used Apache Spark heavily for a couple of years, and I am of the opinion that a Rust implementation of the great "Spark concept" should be the new ideal for the future. Most of the Spark issues I faced were related to the JVM: OO memory overheads, vast memory bloat, and many job crashes due to memory exhaustion and GC-related issues. Performance was often far from great, too. All of those issues should, in theory, disappear when Rust/Arrow/Datafusion/Ballista is running the Spark show. Bring it on. Thank you.

@alamb
Contributor

alamb commented Jul 24, 2021

> All of those issues should, in theory, disappear when Rust/Arrow/Datafusion/Ballista is running the Spark show

Indeed! I think this is @andygrove's vision as well.

Thanks for the kind words.
