Library Guide: Add Using the DataFrame API #8319

Veeupup · 2023-11-25T09:13:37Z

Signed-off-by: veeupup <code@tanweime.com>

Which issue does this PR close?

Closes #7305

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: veeupup <code@tanweime.com>

docs/source/library-user-guide/using-the-dataframe-api.md

alamb

Thank you @Veeupup -- this is a great addition. I left a few suggestions, but I also think we could make those changes in a follow-on PR.

alamb · 2023-11-28T11:11:29Z

docs/source/library-user-guide/using-the-dataframe-api.md

+
+You can also serialize `DataFrame` to a file. For now, `Datafusion` supports write `DataFrame` to `csv`, `json` and `parquet`.
+
+Before writing to a file, it will call collect to calculate all the results of the DataFrame and then write to file.


I don't think the DataFrame API calls collect -- instead I think it uses the streaming APIs

Suggested change

Before writing to a file, it will call collect to calculate all the results of the DataFrame and then write to file.

When writing a file, DataFusion will execute the DataFrame and stream the results to a file.
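To illustrate the distinction in general terms, here is a minimal std-only Rust sketch of streaming rows to a sink as they are produced, rather than collecting them first. It is not DataFusion code: the function `write_streaming`, the `(u32, String)` row shape, and the `id,name` header are hypothetical and exist only for this example.

```rust
use std::io::{self, Write};

/// Toy illustration only (not DataFusion's writer): rows are written to the
/// sink as they come out of the iterator, so the full result set is never
/// buffered in memory the way a collect-then-write approach would require.
fn write_streaming<W: Write>(
    rows: impl Iterator<Item = (u32, String)>,
    out: &mut W,
) -> io::Result<usize> {
    // Hypothetical CSV header for the two-column toy rows.
    writeln!(out, "id,name")?;
    let mut written = 0;
    for (id, name) in rows {
        // Each row is formatted and handed to the sink immediately.
        writeln!(out, "{},{}", id, name)?;
        written += 1;
    }
    // Report how many data rows were written (the header is not counted).
    Ok(written)
}
```

Any `Write` implementor can serve as the sink here, such as a `std::fs::File` or an in-memory `Vec<u8>`. Peak memory stays at one row regardless of how many rows the iterator yields.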

alamb · 2023-11-28T11:12:14Z

docs/source/library-user-guide/using-the-dataframe-api.md

+
+## Transform between LogicalPlan and DataFrame
+
+As it is showed above, `DataFrame` is just a very thin wrapper of `LogicalPlan`, so you can easily go back and forth between them.


Suggested change

As it is showed above, `DataFrame` is just a very thin wrapper of `LogicalPlan`, so you can easily go back and forth between them.

As shown above, `DataFrame` is just a very thin wrapper of `LogicalPlan`, so you can easily go back and forth between them.

alamb · 2023-11-28T11:27:19Z

docs/source/library-user-guide/using-the-dataframe-api.md

-Coming Soon
+## What is a DataFrame
+
+`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over LogicalPlan.


Suggested change

`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over LogicalPlan.

`DataFrame` in `DataFusion` is modeled after the Pandas DataFrame interface, and is a thin wrapper over `LogicalPlan` that adds functionality for building and executing those plans.

alamb · 2023-11-28T11:28:29Z

docs/source/library-user-guide/using-the-dataframe-api.md

+}
+```
+
+For both `DataFrame` and `LogicalPlan`, you can build the query manually, such as:


Suggested change

For both `DataFrame` and `LogicalPlan`, you can build the query manually, such as:

You can build up `DataFrame`s using its methods, similarly to building `LogicalPlan`s using `LogicalPlanBuilder`:

alamb · 2023-11-28T11:29:09Z

docs/source/library-user-guide/using-the-dataframe-api.md

+```rust
+let df = ctx.table("users").await?;
+
+let new_df = df.select(vec![col("id"), col("bank_account")])?


Suggested change

let new_df = df.select(vec![col("id"), col("bank_account")])?

// Create a new DataFrame containing only the `id` and `bank_account` columns

let new_df = df.select(vec![col("id"), col("bank_account")])?

alamb · 2023-11-28T11:30:47Z

docs/source/library-user-guide/using-the-dataframe-api.md

+
+You can manually call the `DataFrame` API or automatically generate a `DataFrame` through the SQL query planner just like:
+
+use `sql` to construct `DataFrame`:


Suggested change

use `sql` to construct `DataFrame`:

For example, to use `sql` to construct `DataFrame`:

alamb · 2023-11-28T11:31:05Z

docs/source/library-user-guide/using-the-dataframe-api.md

+let dataframe = ctx.sql("SELECT * FROM users;").await?;
+```
+
+construct `DataFrame` manually


Suggested change

construct `DataFrame` manually

To construct `DataFrame` using the API:

alamb · 2023-11-28T11:35:17Z

docs/source/library-user-guide/using-the-dataframe-api.md

+
+## Collect / Streaming Exec
+
+When you have a `DataFrame`, you may want to access the results of the internal `LogicalPlan`. You can do this by using `collect` to retrieve all outputs at once, or `streaming_exec` to obtain a `SendableRecordBatchStream`.


Suggested change

When you have a `DataFrame`, you may want to access the results of the internal `LogicalPlan`. You can do this by using `collect` to retrieve all outputs at once, or `streaming_exec` to obtain a `SendableRecordBatchStream`.

DataFusion `DataFrame`s are "lazy", meaning they do not do any processing until they are executed, which allows for additional optimizations.

When you have a `DataFrame`, you can run it in one of three ways:

1. `collect`, which executes the query and buffers all the output into a `Vec<RecordBatch>`

2. `streaming_exec`, which begins execution and returns a `SendableRecordBatchStream` that incrementally computes output on each call to `next()`

3. `cache`, which executes the query and buffers the output into a new in-memory DataFrame.
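The lazy model and the three execution styles can be sketched with a toy, std-only Rust example. This is not DataFusion's implementation: `Plan`, `Frame`, `execute`, and every method name below are hypothetical stand-ins. The sketch only shows that a frame is a thin wrapper over a plan that does no work until it is run.

```rust
/// Toy logical plan (hypothetical, not DataFusion's `LogicalPlan`):
/// a description of work, not the work itself.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Produce the integers `0..n`.
    Range(u32),
    /// Produce already-materialized values.
    Values(Vec<u32>),
    /// Keep only the even values of the input plan.
    FilterEven(Box<Plan>),
}

/// Toy frame: a thin wrapper over `Plan`, standing in for a `DataFrame`.
#[derive(Debug, Clone)]
struct Frame {
    plan: Plan,
}

/// Turn a plan into a lazy iterator; values are computed only when pulled.
fn execute(plan: &Plan) -> Box<dyn Iterator<Item = u32>> {
    match plan {
        Plan::Range(n) => Box::new(0..*n),
        Plan::Values(v) => Box::new(v.clone().into_iter()),
        Plan::FilterEven(input) => Box::new(execute(input).filter(|x| x % 2 == 0)),
    }
}

impl Frame {
    fn range(n: u32) -> Self {
        Frame { plan: Plan::Range(n) }
    }

    /// Building a new frame only wraps the plan; nothing is executed here.
    fn filter_even(self) -> Self {
        Frame { plan: Plan::FilterEven(Box::new(self.plan)) }
    }

    /// Stream-style execution: output is produced on each call to `next()`.
    fn stream(&self) -> Box<dyn Iterator<Item = u32>> {
        execute(&self.plan)
    }

    /// Collect-style execution: run the plan and buffer all of the output.
    fn collect(&self) -> Vec<u32> {
        self.stream().collect()
    }

    /// Cache-style execution: run the plan once and return a new frame
    /// whose plan is just the buffered, in-memory result.
    fn cache(&self) -> Frame {
        Frame { plan: Plan::Values(self.collect()) }
    }
}
```

In this toy model, `Frame::range(6).filter_even()` performs no computation. Work happens only when `collect`, `stream`, or `cache` is called, which is where a real engine gets the chance to optimize the plan first.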

alamb · 2023-11-28T11:35:45Z

docs/source/library-user-guide/using-the-dataframe-api.md

+let batches = df.collect().await?;
+```
+
+You can also use stream output to iterate the `RecordBatch`


Suggested change

You can also use stream output to iterate the `RecordBatch`

You can also use stream output to incrementally generate output one `RecordBatch` at a time

alamb · 2023-11-28T11:36:19Z

docs/source/library-user-guide/using-the-dataframe-api.md

+
+Before writing to a file, it will call collect to calculate all the results of the DataFrame and then write to file.
+
+For example, if you write it to a csv_file


Suggested change

For example, if you write it to a csv_file

For example, to write a csv_file

Signed-off-by: veeupup <code@tanweime.com>

Veeupup · 2023-11-28T11:50:06Z

Thank you for your detailed comments! Very helpful! @alamb

I have updated the content per your comments :)

alamb · 2023-11-28T12:56:03Z

Thanks again @Veeupup

Library Guide: Add Using the DataFrame API

aaf1d1c

Signed-off-by: veeupup <code@tanweime.com>

Veeupup force-pushed the doc_dataframe branch from dfc074d to aaf1d1c November 25, 2023 09:15

andygrove reviewed Nov 26, 2023

docs/source/library-user-guide/using-the-dataframe-api.md (outdated, resolved)

Weijun-H reviewed Nov 26, 2023

docs/source/library-user-guide/using-the-dataframe-api.md (outdated, resolved)

fix comments

a2e0748

Veeupup requested a review from andygrove November 28, 2023 11:29

alamb approved these changes Nov 28, 2023


alamb mentioned this pull request Nov 28, 2023

DataFusion weekly project plan (Andrew Lamb) - Nov 27, 2023 #8329

Closed

8 tasks

fix comments

d252143

Signed-off-by: veeupup <code@tanweime.com>

alamb added the documentation and devrel labels Nov 28, 2023

alamb merged commit f1dbb2d into apache:main Nov 28, 2023
4 checks passed

matthewgapp mentioned this pull request Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed



