-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CsvExecBuilder
for creating CsvExec
#11633
Conversation
This adds the `CsvExecBuilder` struct for building a `CsvExec` instance, and deprecates the `CsvExec::new` method which has grown too large. There are some `TODO`s related to the duplication of formatting options and their defaults coming from multiple places. Uses of the deprecated `new` method have not been updated yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so nice -- thank you @connec
I left a response to the TODO in the code. We could also address it as a follow on PR as well
@@ -49,7 +49,27 @@ use object_store::{GetOptions, GetResultPayload, ObjectStore}; | |||
use tokio::io::AsyncWriteExt; | |||
use tokio::task::JoinSet; | |||
|
|||
/// Execution plan for scanning a CSV file | |||
/// Execution plan for scanning a CSV file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
|
||
impl CsvExec { | ||
/// Create a new CSV reader execution plan provided base and specific configurations | ||
#[deprecated(since = "41.0.0", note = "use `CsvExec::builder` or `CsvExecBuilder`")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 for keeping (and deprecating the old API) to make migration / upgrade easier.
false, | ||
file_compression_type.to_owned(), | ||
); | ||
let csv = CsvExec::builder(config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍 that is so much nicer 👍
pub fn new(file_scan_config: FileScanConfig) -> Self { | ||
Self { | ||
file_scan_config, | ||
// TODO: these defaults are duplicated from `CsvOptions` - should they be computed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other thing we could do is to add a test that ensures the options are the same rather than having to compute them each time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in, a test that makes sure that the default settings in a CsvExec
matches the default CsvOptions
? I'm happy to add that. I likely won't get to it in the next ~24h so happy to either hold this PR or follow-up after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in, a test that makes sure that the default settings in a CsvExec matches the default CsvOptions? I'm happy to add that. I likely won't get to it in the next ~24h so happy to either hold this PR or follow-up after.
Yes, that is what I was thinking -- that way if the default CsvOptions
are ever changed the test will fail until we also update the deafult values in CsvExecBuilder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in, a test that makes sure that the default settings in a CsvExec matches the default CsvOptions? I'm happy to add that. I likely won't get to it in the next ~24h so happy to either hold this PR or follow-up after.
Thanks @connec -- no worries! I actually had this PR open locally so I took the liberty of writing the test in 99d158d and pushing that to the branch
Thank you @alamb! There's one other datafusion/datafusion/core/src/datasource/physical_plan/csv.rs Lines 97 to 103 in debfd19
I imagine this may be at least somewhat desirable/intentional as part of ensuring crate boundaries? An alternative could be to, e.g., use |
pub struct CsvExecBuilder { | ||
file_scan_config: FileScanConfig, | ||
file_compression_type: FileCompressionType, | ||
// TODO: it seems like these format options could be reused across all the various CSV config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could possibly use https://docs.rs/datafusion/latest/datafusion/config/struct.CsvOptions.html directly
That seems to be what the parquet writer does https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExec.html#method.table_parquet_options (aka the ParquetExec directly uses the options struct)
Perhaps as a follow on PR
Thanks again @connec |
Which issue does this PR close?
Closes #11565.
Rationale for this change
As part of #11533, the number of arguments required for
CsvExec::new
grew long enough (8) to trigger a clippy lint. In the interests of getting the PR merged we just suppressed the lint, with the intention of replacing it with a builder in a follow-up PR (👋).What changes are included in this PR?
This PR introduces a
CsvExecBuilder
, along with plumbing similar to theParquetExecBuilder
.CsvExec::new
has been marked as deprecated, and all usages of it in the repo have been replaced withCsvExec::builder
.Are these changes tested?
A document test has been added to
CsvExec
which exercises (some of) the builder public API. This is analogous to the doc test forParquetExec
.Additionally,
CsvExec::new
was used itself in several tests and these have all been migrated to the builder.Are there any user-facing changes?
datafusion::datasource::physical_plan::CsvExecBuilder
.datafusion::datasource::physical_plan::CsvExec::builder
.datafusion::datasource::physical_plan::CsvExec::new
.