Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Structify ConfigOptions (#4517) #4771

Merged
merged 14 commits into from
Jan 3, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Dec 29, 2022

Which issue does this PR close?

Closes #4517
Closes #3500
Closes #4752
Relates to #3887

Rationale for this change

The current approach is:

  • Not type-safe
  • Not documented by rustdoc
  • Not understood by code completion
  • Expensive to clone
  • Defaulting logic is not centralised
  • Inconsistently serializes to TaskContext

What changes are included in this PR?

This reworks ConfigOptions to instead be a struct, using a declarative macro to auto-generate serialization logic

Are these changes tested?

Are there any user-facing changes?

Yes

@github-actions github-actions bot added the core Core DataFusion crate label Dec 29, 2022
@@ -236,9 +230,7 @@ impl ParquetExec {
/// Return the value described in [`Self::with_enable_page_index`]
fn enable_page_index(&self, config_options: &ConfigOptions) -> bool {
self.enable_page_index
.or_else(|| config_options.get_bool(OPT_PARQUET_ENABLE_PAGE_INDEX))
// default to false
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer have implicit defaults all over the place 😄

.finish()
/// A type-safe container for [`ConfigExtension`]
#[derive(Debug, Default, Clone)]
pub struct Extensions(BTreeMap<&'static str, ExtensionBox>);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This formulation is specifically designed to be able to replace the AnyMap currently on SessionConfig. As an added bonus it no longer requires types to be interior-mutable

pub round_robin_repartition: bool, default = true

/// Parquet options
pub parquet: ParquetOptions, default = Default::default()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This nesting is a source of non-trivial additional complexity, but I made it work so at least we have support for this in the future.

@@ -1489,6 +1381,15 @@ impl SessionConfig {
}
}

impl From<ConfigOptions> for SessionConfig {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lays the ground-word for eventually removing SessionConfig altogether #3887

self.config_options().get(variable)
// TOOD: Move ConfigOptions into common crate
match variable {
"datafusion.execution.time_zone" => self
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nasty hack, I hope to move ConfigOptions into core in a subsequent PR so that it can be used directly instead of via ContextProvider

@tustvold tustvold marked this pull request as ready for review December 29, 2022 18:52
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 29, 2022
@@ -103,9 +103,10 @@ query R
SHOW ALL
----
datafusion.catalog.create_default_catalog_and_schema true
datafusion.catalog.format NULL
datafusion.catalog.has_header false
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This option was missing from the definitions

datafusion.catalog.information_schema true
datafusion.catalog.location NULL
datafusion.catalog.type NULL
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type is a reserved keyword in rust, and the name was pretty overloaded anyway, so I opted to rename this to format. Amusingly this was already what the local variables were named as

let config = SessionConfig::new()
.with_batch_size(TEST_BATCH_SIZE)
.set_u64(
"datafusion.execution.coalesce_target_batch_size",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was missed in #4757 the new config returns an error for invalid keys and so caught this

self.config_options
.get_bool(OPT_ENABLE_ROUND_ROBIN_REPARTITION)
.unwrap_or_default()
map
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two were added in #4757. They're no longer necessary and so I just opted to remove them

};
let mut config = ConfigOptions::new();
for (k, v) in task_props {
let _ = config.set(&k, &v);
Copy link
Contributor Author

@tustvold tustvold Dec 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shows deserializing from a HashMap, as an added bonus this will now work for all keys and not just a specific subset

Longer-term we probably want to change this signature to accept ConfigOptions directly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related to #4758

self.set(key, ScalarValue::Utf8(Some(value.into())))
}
/// Returns the [`ConfigEntry`] stored within this [`ConfigOptions`]
pub fn entries(&self) -> Vec<ConfigEntry> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the mechanism by which ConfigOptions can be both serialized and documentation generated

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like how the options are now typed / type checked at compile time rather than runtime.

cc @yahoNanJing @andygrove and @thinkharderdev

@@ -63,9 +64,10 @@ async fn main() -> Result<()> {

async fn group_by(opt: &GroupBy) -> Result<()> {
let path = opt.path.to_str().unwrap();
let config = SessionConfig::from_env().with_batch_size(65535);
let mut config = ConfigOptions::from_env()?;
config.built_in.execution.batch_size = 65535;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is kind of cool that it is a struct format that is checked by the compiler

datafusion/core/src/physical_optimizer/enforcement.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love it

The created structure works well in my editor.

Thank you @tustvold . This is the one missing

It think we should leave this open at least a full work day after approval (aka don't merge until next Monday)

Screenshot 2022-12-30 at 10 13 01 AM

@@ -1153,7 +1103,7 @@ pub struct SessionConfig {
/// due to `resolve_table_ref` which passes back references)
default_schema: String,
/// Configuration options
config_options: ConfigOptions,
options: ConfigOptions,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👎 for the change to the more overloaded options name, even though I realize it is more concise

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, fustfmt was being obtuse and formatting things over 7 lines and this made it happy. SessionConfig isn't long for this world so I think it is fine

/// replaced by [`config_options`].
///
/// [`config_options`]: SessionContext::config_option
pub fn to_props(&self) -> HashMap<String, String> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be related to #4758 from @yahoNanJing

};
let mut config = ConfigOptions::new();
for (k, v) in task_props {
let _ = config.set(&k, &v);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related to #4758

| datafusion.optimizer.repartition_windows | Boolean | true | Should DataFusion repartition data using the partitions keys to execute window functions in parallel using the provided `target_partitions` level |
| datafusion.optimizer.skip_failed_rules | Boolean | true | When set to true, the logical plan optimizer will produce warning messages if any optimization rules produce errors and then proceed to the next rule. When set to false, any rules that produce errors will cause the query to fail. |
| datafusion.optimizer.top_down_join_key_reordering | Boolean | true | When set to true, the physical plan optimizer will run a top down process to reorder the join keys. Defaults to true |
| key | default | description |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The major difference here seems to be the type column is no longer present. Was that intended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the fields are now Rust typed not datatype. I could include this, but it seemed unnecessary?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably clear from the names if something is boolean or numeric. I don't feel strongly

use std::collections::BTreeMap;
use std::fmt::Display;

macro_rules! config_namespace {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

I think some introduction doc comments here would help -- specifically some examples of what this macro is creating would be nice

@tustvold
Copy link
Contributor Author

tustvold commented Jan 2, 2023

I plan to merge this tomorrow morning (GMT) unless anyone needs more time to review

@tustvold tustvold merged commit 21169b9 into apache:master Jan 3, 2023
@ursabot
Copy link

ursabot commented Jan 3, 2023

Benchmark runs are scheduled for baseline = 6c81c10 and contender = 21169b9. 21169b9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
3 participants