Expose parquet reader settings using normal DataFusion ConfigOptions #3822

Merged
merged 3 commits into from
Oct 19, 2022

@alamb alamb commented Oct 13, 2022

Which issue does this PR close?

Re #3821

Rationale for this change

I want to test out the parquet filter pushdown on real datasets using datafusion-cli so we can enable it by default -- #3463

I also want an easy way to disable the feature if users find they are getting wrong results.

I want to be able to do so via datafusion-cli like:

$ target/debug/datafusion-cli
DataFusion CLI v13.0.0
❯ show all;
+-------------------------------------------------+---------+
| name                                            | setting |
+-------------------------------------------------+---------+
| datafusion.execution.time_zone                  | UTC     |
| datafusion.execution.parquet.pushdown_filters   | false   | <---- Note the option is now visible here
| datafusion.explain.physical_plan_only           | false   |
| datafusion.execution.coalesce_target_batch_size | 4096    |
| datafusion.execution.batch_size                 | 8192    |
| datafusion.execution.coalesce_batches           | true    |
| datafusion.explain.logical_plan_only            | false   |
| datafusion.optimizer.skip_failed_rules          | true    |
| datafusion.optimizer.filter_null_join_keys      | false   |
+-------------------------------------------------+---------+

And then set them like:

$ DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS=true target/debug/datafusion-cli
DataFusion CLI v13.0.0
❯ show all;
+-------------------------------------------------+---------+
| name                                            | setting |
+-------------------------------------------------+---------+
| datafusion.execution.batch_size                 | 8192    |
| datafusion.execution.coalesce_batches           | true    |
| datafusion.explain.logical_plan_only            | false   |
| datafusion.optimizer.filter_null_join_keys      | false   |
| datafusion.execution.parquet.enable_page_index  | false   |
| datafusion.optimizer.skip_failed_rules          | true    |
| datafusion.explain.physical_plan_only           | false   |
| datafusion.execution.time_zone                  | UTC     |
| datafusion.execution.coalesce_target_batch_size | 4096    |
| datafusion.execution.parquet.pushdown_filters   | true    | <---- Note the option is set to true here!!!!
| datafusion.execution.parquet.reorder_filters    | false   |
+-------------------------------------------------+---------+
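
The two examples in this PR (`DATAFUSION_EXECUTION_BATCH_SIZE` and `DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS`) suggest that the environment variable name is the config key upper-cased with `.` replaced by `_`. A minimal sketch of that mapping, assuming this rule holds for every key; `env_var_name` and `bool_setting` are hypothetical helpers, not DataFusion APIs:

```rust
/// Derive the environment variable name for a config key by upper-casing
/// it and replacing each `.` with `_` (rule inferred from the examples in
/// this PR, not taken from the DataFusion source).
fn env_var_name(key: &str) -> String {
    key.to_uppercase().replace('.', "_")
}

/// Parse a boolean setting from an optional raw string, falling back to
/// `default` when the value is unset or is not `true` / `false`.
fn bool_setting(raw: Option<&str>, default: bool) -> bool {
    raw.and_then(|v| v.trim().parse::<bool>().ok())
        .unwrap_or(default)
}
```

A caller could pass `std::env::var(env_var_name(key)).ok().as_deref()` as `raw` to pick up an override from the environment.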

What changes are included in this PR?

  1. Add three new config settings to ConfigOptions
  2. Thread ConfigOptions down into the FileScanConfig
  3. Remove ParquetScanOptions in favor of these new configs (will comment on the rationale here)

Are there any user-facing changes?

YES: If you used ParquetScanOptions (which I know @thinkharderdev does) the API has changed.

Also, the settings are now visible in the user level documentation

@github-actions github-actions bot added the core Core DataFusion crate label Oct 13, 2022
@alamb alamb changed the title from "Expose parquet reader settings as DataFusion config settings" to "Expose parquet reader settings using normal DataFusion ConfigOptions" Oct 13, 2022
@@ -109,6 +111,13 @@ async fn main() -> Result<()> {
Ok(())
}

#[derive(Debug, Clone)]
struct ParquetScanOptions {
Contributor Author

this was replicated for the benchmark code as I felt such a struct was the easiest to understand for this matrix strategy

4096,
),
ConfigDefinition::new_string(
Contributor Author

I moved this to be with the other settings

/// Create new ConfigOptions struct, taking values from
/// environment variables where possible.
///
/// For example, setting `DATAFUSION_EXECUTION_BATCH_SIZE` will
Contributor Author

I will add some documentation about this to the datafusion-cli docs as I couldn't find it when I was looking

@@ -69,43 +72,6 @@ use parquet::file::{
};
use parquet::schema::types::ColumnDescriptor;

#[derive(Debug, Clone, Default)]
/// Specify options for the parquet scan
pub struct ParquetScanOptions {
Contributor Author

The key change to this PR is removing this structure and instead reading the values from a ConfigOptions that is threaded down.

You can see in this PR there is already a structure for configuring parquet reading (ParquetReadOptions) so I actually think this will make the code less confusing to work with going forward.
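
As a rough illustration of reading the scan settings from a threaded-down config instead of a dedicated options struct, here is a minimal sketch. It assumes a plain string key/value map stands in for `ConfigOptions`; `ScanSettings`, `read_bool`, and `scan_settings` are hypothetical names, and only the key strings and their `false` defaults come from the `show all` output above:

```rust
use std::collections::HashMap;

// Key names as they appear in the `show all` output of this PR.
const PUSHDOWN_FILTERS: &str = "datafusion.execution.parquet.pushdown_filters";
const REORDER_FILTERS: &str = "datafusion.execution.parquet.reorder_filters";
const ENABLE_PAGE_INDEX: &str = "datafusion.execution.parquet.enable_page_index";

/// Hypothetical bundle of the three parquet reader flags.
#[derive(Debug, PartialEq)]
struct ScanSettings {
    pushdown_filters: bool,
    reorder_filters: bool,
    enable_page_index: bool,
}

/// Look up a boolean setting; missing or unparsable values count as `false`,
/// matching the defaults shown in the table.
fn read_bool(config: &HashMap<String, String>, key: &str) -> bool {
    config
        .get(key)
        .and_then(|v| v.parse::<bool>().ok())
        .unwrap_or(false)
}

/// Resolve all three flags from the shared config when the scan is built.
fn scan_settings(config: &HashMap<String, String>) -> ScanSettings {
    ScanSettings {
        pushdown_filters: read_bool(config, PUSHDOWN_FILTERS),
        reorder_filters: read_bool(config, REORDER_FILTERS),
        enable_page_index: read_bool(config, ENABLE_PAGE_INDEX),
    }
}
```

The point of the design is that the reader has a single source of truth: whatever is visible through `SHOW ALL` is what the scan uses.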

/// `ParquetRecordBatchStream`. These filters are applied by the
/// parquet decoder to skip unnecessarily decoding other columns
/// which would not pass the predicate. Defaults to false
pub fn with_pushdown_filters(self, pushdown_filters: bool) -> Self {
Contributor Author

This just uses the slightly messier config API to get/set the settings

let reorder_predicates = self.scan_options.reorder_predicates;
let pushdown_filters = self.scan_options.pushdown_filters;
let enable_page_index = self.scan_options.enable_page_index;
let reorder_predicates = self.reorder_filters;
Contributor Author

I also took the opportunity to consistently use the word filters rather than a mix of filters and predicates

.with_pushdown_filters(true)
.with_reorder_predicates(true),
);
parquet_exec = parquet_exec
Contributor Author

Here is a good illustration of how the API changes (I think for the better)

@@ -160,10 +170,12 @@ pub struct ParquetReadOptions<'a> {
pub table_partition_cols: Vec<String>,
/// Should DataFusion parquet reader use the predicate to prune data,
/// overridden by value on execution::context::SessionConfig
// TODO move this into ConfigOptions
Contributor Author

I will do this as a follow on PR

@thinkharderdev
Contributor

Nice! I will review this this evening or tomorrow depending on how the day goes

@andygrove andygrove added the api change Changes the API exposed to users of the crate label Oct 14, 2022
Contributor

@thinkharderdev thinkharderdev left a comment

Had a few comments about why we need a shareable ConfigOptions down at the exec layer but overall this is a great improvement

/// minimize the cost of filter evaluation by reordering the
/// predicate [`Expr`]s. If false, the predicates are applied in
/// the same order as specified in the query. Defaults to false.
pub fn with_reorder_filters(self, reorder_filters: bool) -> Self {
Contributor

See above, wrapping the options in an Arc<RwLock<_>> seems strange since this is already essentially an owned value.

@@ -698,6 +699,7 @@ mod tests {
projection,
statistics,
table_partition_cols,
config_options: ConfigOptions::new().into_shareable(),
Contributor

Not sure I understand why we need the into_shareable here. Seems like this should just be an owned ConfigOptions

Contributor Author

I agree it is non-obvious -- ConfigOptions was changed to be shareable by @waitingkuo in #3455

Basically the use case there was for the configuration to be owned by SessionContext while other parts could read it if necessary -- initially, information_schema.df_settings / SHOW ALL. This PR extends the concept so that the settings can also be read by the parquet reader.

What would you think about moving the mutability handling into ConfigOptions so this looks like:

            config_options: ConfigOptions::new(),

That would hide the details of shared ownership more nicely I think

Contributor

Yeah, that makes sense. If ConfigOptions is meant to be shareable then it can just hold an Arc<RwLock<HashMap<_>>>
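
A minimal sketch of what hiding the shared ownership inside the config type could look like. `SharedConfig` is a hypothetical stand-in, not the actual `ConfigOptions` implementation, and it stores plain strings where the real type stores typed values:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical config type that owns its sharing and locking internally,
/// so callers never see the `Arc<RwLock<_>>`.
#[derive(Clone, Default)]
struct SharedConfig {
    inner: Arc<RwLock<HashMap<String, String>>>,
}

impl SharedConfig {
    fn new() -> Self {
        Self::default()
    }

    /// Set a value; every clone of this config observes the change.
    fn set(&self, key: &str, value: &str) {
        self.inner
            .write()
            .expect("config lock poisoned")
            .insert(key.to_string(), value.to_string());
    }

    /// Read a boolean value, or `None` if the key is unset or not a bool.
    fn get_bool(&self, key: &str) -> Option<bool> {
        self.inner
            .read()
            .expect("config lock poisoned")
            .get(key)
            .and_then(|v| v.parse::<bool>().ok())
    }
}
```

With this shape, cloning is cheap (it only bumps the `Arc` reference count), so a session can keep one handle and give another to the reader without any `into_shareable` call at the use site.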

Contributor Author

Will do as a follow on PR

Contributor Author

Filed #3886

Contributor Author

PR to clean up: #3909

@alamb
Contributor Author

alamb commented Oct 18, 2022

I plan to merge this after it passes CI to keep the process going. I really like the ConfigOptions structure

Member

@andygrove andygrove left a comment

LGTM. Thanks @alamb

@alamb alamb force-pushed the alamb/expose_parquet_settings branch 2 times, most recently from cc6c3db to 900e15f Compare October 18, 2022 17:18
@alamb alamb merged commit 6e0097d into apache:master Oct 19, 2022
@alamb alamb deleted the alamb/expose_parquet_settings branch November 5, 2022 11:17