-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add configurable normalization for configuration options and preserve case for S3 paths #13576
Conversation
Actually, I think it should work differently - you should specify whether a value is case-sensitive per item. For example, an AWS password would be case-sensitive, but CREATE EXTERNAL TABLE test
STORED AS CSV
LOCATION 'data.csv'
OPTIONS ('has_header' 'TRUE');
Error parsing TRUE as bool
caused by
External error: provided string was not `true` or `false` Will change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels the right direction.
Can you add tests showing why this matters?
Perhaps some tests with parsing non-lowercase s3 paths?
i agree but it doesn't have to be part of the SQL parser at all. alternatively, we could have some sort of option registry where options could have a type declared, so that parsing is consistent, but that would be a much bigger change |
a4490dd
to
7c2b3fe
Compare
When I see this PR, I remember a previous discussion: #11650. I have also tried some approaches for it earlier: #11330 (comment). Isn't it better to implement a future-proof and general solution, like the one I proposed, instead of toggling the configuration up and down? Do you have time to review the proposed approach and its intended behavior? Is it feasible to implement it without much effort? I have forgotten some of the implementation details, so perhaps you can provide a more reasonable perspective. |
hey @berkaysynnada, by future-proof solution do you mean synnada-ai@341b484#diff-2f697c310af1a48521829d8bd665cf64b6415fbf88edd370efa30b1ed686354dR1655? I agree that fine-grained control is better, but isn't it similar to how I implemented it? e.g. datafusion/datafusion/common/src/config.rs Line 442 in 06c013d
|
Sorry, I have missed your reply. After looking more in detail, you did actually what I referred. We have talked with @ozankabak, and the most optimal solution is that we can accommodate this config setting such that it will enables/disables all kind of normalizations. For example, if normalization is disabled, then all given inputs are needed to be written as how they are exactly written. If it is enabled, then the hardcoded normalization functions become activated, and the inputs are normalized accordingly. With all of these,I think the default value should be true, "enable normalizations". |
I feel we should deprecate and remove the global config, in favor of per-value normalization. |
I know it will most probably be a super-rare use case, but assume we add a hardcoded normalization for an option, and user somehow needs to not normalize his input. Does he need to change the code? |
If the option has hard-coded normalization (eg lowercasing), it means there is no difference between certain values (eg those differing in case). Why would a user be concerned about that? |
If an option accepts a file name and is coded to normalize the input to snake_case (which applies to most data file naming conventions), but for a specific file type snake_case normalization is not preferable, the user might want to retain the original naming style. Can that be a valid example? I am not insisting that we should keep it, I am just playing devil's advocate 😅 |
I like this example, thank you @berkaysynnada. |
The config option is basically an escape hatch for users in case they find themselves in a situation where the normalization we enforce is inapplicable in their (possibly edge) case. |
Feels like the right move is to add a deprecation to that option, see if we get any requests or issues, and if not, just remove it completely. I also don’t think this option should exist. Each value is either case-sensitive or not - we know that for sure for each value. If we’re missing some edge case, we should fix it instead of normalizing the value. Normalizing might solve one problem but could also break other stuff, like aws integration. If you’re good with that, @ozankabak @berkaysynnada, I’ll deprecate the option |
I am OK with removing it. We can add it later if it turns out to be necessary |
|
97590fe
to
66c3cd5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @blaginin
This seems like a very nice improvement to me and moves the choice of when to lowercase the options to the individual options
@ozankabak / @berkaysynnada can you confirm the change of behavior in this PR will work for your usecase?
/// When set to true, SQL parser will normalize options value (convert value to lowercase). | ||
/// Note that this option is ignored and will be removed in the future. All case-insensitive values | ||
/// are normalized automatically. | ||
pub enable_options_value_normalization: bool, warn = "`enable_options_value_normalization` is deprecated and ignored", default = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the use of the warning is quite cool
However, it isn't shown by default in datafusion-cli FWIW
> set datafusion.sql_parser.enable_options_value_normalization = true;
0 row(s) fetched.
Elapsed 0.001 seconds.
I had to run it with RUST_LOG=info
RUST_LOG=info cargo run
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.13s
Running `target/debug/datafusion-cli`
DataFusion CLI v43.0.0
> set datafusion.sql_parser.enable_options_value_normalization = true;
[2024-12-19T23:23:40Z WARN datafusion_common::config] `enable_options_value_normalization` is deprecated and ignored
0 row(s) fetched.
Elapsed 0.007 seconds
@@ -161,70 +160,6 @@ fn parse_ident_normalization() { | |||
} | |||
} | |||
|
|||
#[test] | |||
fn test_parse_options_value_normalization() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please put this test back and updated it so that it passes? That way we can reason about / more easily see the change in behavior introduced by this PR
Thank you @findepi for the reviews |
Yes, we can merge this. If additional needs arise, we can request/submit follow-up PRs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀 🚀
Thanks @blaginin @findepi @comphead @berkaysynnada and @ozankabak for sticking with this. It is a nice improvement. |
Which issue does this PR close?
Closes #11650
Related to #13456
Rationale for this change
Surprisingly, even the simplest S3 example doesn't work 😱
That's because s3 keys are case-sensitive, but now we always lowercase them.
I see you had the same issue for HuggingFace so I really think the default behaviour should be NOT normalizing.
What changes are included in this PR?
Change the default value, added transformation for some fields where it makes sense (enums, bools, etc)
Are these changes tested?
I updated the tests
Are there any user-facing changes?
Yes, changed the default value