Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: 📝 Add expected answers to DataFrame method examples #12564

Merged
merged 3 commits into from
Sep 22, 2024

Conversation

Eason0729
Copy link
Contributor

Which issue does this PR close?

Closes #12527.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Sep 21, 2024
@@ -225,7 +225,12 @@ impl DataFrame {
/// # async fn main() -> Result<()> {
/// let ctx = SessionContext::new();
/// let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
/// let df = df.select_columns(&["a", "b"])?;
/// df.select_columns(&["a", "b"])?.show().await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think about using assert_batches_eq instead so that the output is automatically checked as part of the doc tests rather than relying on us manually keeping it up to date?

It is of similar readability I think. Here is an example:

let results = ctx
.sql("SELECT * FROM test WHERE c1 = $1")
.await?
.with_param_values(vec![ScalarValue::from(3i32)])?
.collect()
.await?;
let expected = vec![
"+----+----+-------+",
"| c1 | c2 | c3 |",
"+----+----+-------+",
"| 3 | 1 | false |",
"| 3 | 10 | true |",
"| 3 | 2 | true |",
"| 3 | 3 | false |",
"| 3 | 4 | true |",
"| 3 | 5 | false |",
"| 3 | 6 | true |",
"| 3 | 7 | false |",
"| 3 | 8 | true |",
"| 3 | 9 | false |",
"+----+----+-------+",
];
assert_batches_sorted_eq!(expected, &results);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original thought is to simplify large long column on doctest(less readable), like this: https://github.com/Eason0729/datafusion/blob/4dd44c5a2b0d9810d7e9163689afab227c58d542/datafusion/core/src/dataframe/mod.rs#L734

Therefore, I simplify example_long.csv in the next commit to the point which is just enough to showcase most method on dataframe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the expect behavior on describe method is more complex, which might require dedicated csv, so I decided to leave as it is.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @Eason0729 for this contribution 🙏

I have started the CI to check this code

I think the PR would be even better if the output was verified (so it is guaranteed to stay in sync). I left a suggestion on how to do this. Let me know what you think

@@ -184,7 +184,7 @@ impl DataFrame {
}

/// Creates logical expression from a SQL query text.
/// The expression is created and processed againt the current schema.
/// The expression is created and processed against the current schema.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@comphead
Copy link
Contributor

Thanks @Eason0729 for your contribution.
I'm feeling we need to have something to create DF from rows rather in addition to creating DF from data files.
Example can be

let schema = 
DataFrame::from(schema, data)

Underneath the method can call ctx.read_batch(record_batch). The batch can be created with RecordBatch::try_from_iter

Co-authored-by: Oleks V <comphead@users.noreply.github.com>
@alamb
Copy link
Contributor

alamb commented Sep 22, 2024

I'm feeling we need to have something to create DF from rows rather in addition to creating DF from data files.

This seems reasonable to me -- perhaps you can file a ticket to track the idea and we can as our resident DataFrame experts like @timsaucer and @Omega359

@alamb
Copy link
Contributor

alamb commented Sep 22, 2024

Thanks again @Eason0729

@comphead comphead merged commit 4549168 into apache:main Sep 22, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Sep 23, 2024

📝

bgjackma pushed a commit to bgjackma/datafusion that referenced this pull request Sep 25, 2024
…2564)

* docs: 📝 Add expected answers to `DataFrame` method examples

* test: 📝 use assert_batches_sorted_eq and simplify example_long.csv

* Update datafusion/core/src/dataframe/mod.rs

Co-authored-by: Oleks V <comphead@users.noreply.github.com>

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Oleks V <comphead@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add expected answers to DataFrame method examples
3 participants