Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CsvExecBuilder for creating CsvExec #11633

Merged
merged 5 commits into from
Jul 25, 2024
Merged

Conversation

connec
Copy link
Contributor

@connec connec commented Jul 24, 2024

Which issue does this PR close?

Closes #11565.

Rationale for this change

As part of #11533, the number of arguments required for CsvExec::new grew long enough (8) to trigger a clippy lint. In the interests of getting the PR merged we just suppressed the lint, with the intention of replacing it with a builder in a follow-up PR (👋).

What changes are included in this PR?

This PR introduces a CsvExecBuilder, along with plumbing similar to the ParquetExecBuilder. CsvExec::new has been marked as deprecated, and all usages of it in the repo have been replaced with CsvExec::builder.

Are these changes tested?

A document test has been added to CsvExec which exercises (some of) the builder public API. This is analogous to the doc test for ParquetExec.

Additionally, CsvExec::new was used itself in several tests and these have all been migrated to the builder.

Are there any user-facing changes?

  • Add public struct datafusion::datasource::physical_plan::CsvExecBuilder.
  • Add public method datafusion::datasource::physical_plan::CsvExec::builder.
  • Deprecate public method datafusion::datasource::physical_plan::CsvExec::new.

This adds the `CsvExecBuilder` struct for building a `CsvExec` instance,
and deprecates the `CsvExec::new` method which has grown too large.

There are some `TODO`s related to the duplication of formatting options
and their defaults coming from multiple places. Uses of the deprecated
`new` method have not been updated yet.
@github-actions github-actions bot added the core Core DataFusion crate label Jul 24, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so nice -- thank you @connec

I left a response to the TODO in the code. We could also address it as a follow on PR as well

tumblr_5e7d753668e703ca050fe577c07e3f58_bc74ad9b_500

@@ -49,7 +49,27 @@ use object_store::{GetOptions, GetResultPayload, ObjectStore};
use tokio::io::AsyncWriteExt;
use tokio::task::JoinSet;

/// Execution plan for scanning a CSV file
/// Execution plan for scanning a CSV file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️


impl CsvExec {
/// Create a new CSV reader execution plan provided base and specific configurations
#[deprecated(since = "41.0.0", note = "use `CsvExec::builder` or `CsvExecBuilder`")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 for keeping (and deprecating the old API) to make migration / upgrade easier.

false,
file_compression_type.to_owned(),
);
let csv = CsvExec::builder(config)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍 that is so much nicer 👍

pub fn new(file_scan_config: FileScanConfig) -> Self {
Self {
file_scan_config,
// TODO: these defaults are duplicated from `CsvOptions` - should they be computed?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other thing we could do is to add a test that ensures the options are the same rather than having to compute them each time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in, a test that makes sure that the default settings in a CsvExec matches the default CsvOptions? I'm happy to add that. I likely won't get to it in the next ~24h so happy to either hold this PR or follow-up after.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in, a test that makes sure that the default settings in a CsvExec matches the default CsvOptions? I'm happy to add that. I likely won't get to it in the next ~24h so happy to either hold this PR or follow-up after.

Yes, that is what I was thinking -- that way if the default CsvOptions are ever changed the test will fail until we also update the deafult values in CsvExecBuilder

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in, a test that makes sure that the default settings in a CsvExec matches the default CsvOptions? I'm happy to add that. I likely won't get to it in the next ~24h so happy to either hold this PR or follow-up after.

Thanks @connec -- no worries! I actually had this PR open locally so I took the liberty of writing the test in 99d158d and pushing that to the branch

@connec
Copy link
Contributor Author

connec commented Jul 24, 2024

Thank you @alamb!

There's one other TODO in the code that I wondered if you have thoughts on:

// TODO: it seems like these format options could be reused across all the various CSV config
has_header: bool,
delimiter: u8,
quote: u8,
escape: Option<u8>,
comment: Option<u8>,
newlines_in_values: bool,

I imagine this may be at least somewhat desirable/intentional as part of ensuring crate boundaries? An alternative could be to, e.g., use CsvOptions (or csv::Format, or...) rather than "spreading" their fields into so many places. It's just boilerplate at the end of the day though, so I'm also happy to just remove that TODO.

pub struct CsvExecBuilder {
file_scan_config: FileScanConfig,
file_compression_type: FileCompressionType,
// TODO: it seems like these format options could be reused across all the various CSV config
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could possibly use https://docs.rs/datafusion/latest/datafusion/config/struct.CsvOptions.html directly

That seems to be what the parquet writer does https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExec.html#method.table_parquet_options (aka the ParquetExec directly uses the options struct)

Perhaps as a follow on PR

@alamb alamb changed the title CSV exec builder Add CsvExecBuilder for creating CsvExec Jul 24, 2024
@alamb alamb merged commit fab7e23 into apache:main Jul 25, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 25, 2024

Thanks again @connec

@connec connec deleted the csv-exec-builder branch July 25, 2024 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix clippy lint for the number of arguments to CsvExec::new()
2 participants