Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add configurable normalization for configuration options and preserve case for S3 paths #13576

Merged
merged 16 commits into from
Dec 20, 2024

Conversation

blaginin
Copy link
Contributor

@blaginin blaginin commented Nov 26, 2024

Which issue does this PR close?

Closes #11650
Related to #13456

Rationale for this change

Surprisingly, even the simplest S3 example doesn't work 😱

> CREATE EXTERNAL TABLE test
STORED AS CSV
OPTIONS(
    'aws.access_key_id' '************',
    'aws.secret_access_key' '**********',
    'aws.region' 'eu-west-1'
)
LOCATION 's3://blaginin/playground/titanic.csv';


Object Store error: The operation lacked the necessary privileges to complete for path playground/titanic.csv: Client error with status 403 Forbidden: No Body

That's because s3 keys are case-sensitive, but now we always lowercase them.

I see you had the same issue for HuggingFace so I really think the default behaviour should be NOT normalizing.

What changes are included in this PR?

Change the default value, added transformation for some fields where it makes sense (enums, bools, etc)

Are these changes tested?

I updated the tests

Are there any user-facing changes?

Yes, changed the default value

@github-actions github-actions bot added sql SQL Planner common Related to common crate labels Nov 26, 2024
@blaginin blaginin changed the title Fix S3 in CLI: Do not normalize option values Fix S3 in CLI: Do not normalize options values Nov 26, 2024
@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) labels Nov 26, 2024
@blaginin blaginin marked this pull request as ready for review November 28, 2024 12:53
@blaginin blaginin marked this pull request as draft November 28, 2024 12:57
@blaginin
Copy link
Contributor Author

blaginin commented Dec 1, 2024

Actually, I think it should work differently - you should specify whether a value is case-sensitive per item. For example, an AWS password would be case-sensitive, but has_header would be case-insensitive. Otherwise this doesn't work:

CREATE EXTERNAL TABLE test
STORED AS CSV
LOCATION 'data.csv'
OPTIONS ('has_header' 'TRUE');

Error parsing TRUE as bool
caused by
External error: provided string was not `true` or `false`

Will change

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels the right direction.
Can you add tests showing why this matters?
Perhaps some tests with parsing non-lowercase s3 paths?

@findepi
Copy link
Member

findepi commented Dec 2, 2024

you should specify whether a value is case-sensitive per item.

i agree

but it doesn't have to be part of the SQL parser at all.
It can simply be applied when pulling values from the option map (the moment when strings are interpreted into something meaningful)

alternatively, we could have some sort of option registry where options could have a type declared, so that parsing is consistent, but that would be a much bigger change

@github-actions github-actions bot added the core Core DataFusion crate label Dec 3, 2024
@blaginin blaginin force-pushed the bugfix/do-not-normalize-values branch from a4490dd to 7c2b3fe Compare December 4, 2024 23:39
@blaginin blaginin marked this pull request as ready for review December 4, 2024 23:49
@blaginin blaginin requested a review from findepi December 6, 2024 13:28
@berkaysynnada
Copy link
Contributor

When I see this PR, I remember a previous discussion: #11650. I have also tried some approaches for it earlier: #11330 (comment).

Isn't it better to implement a future-proof and general solution, like the one I proposed, instead of toggling the configuration up and down?

Do you have time to review the proposed approach and its intended behavior? Is it feasible to implement it without much effort? I have forgotten some of the implementation details, so perhaps you can provide a more reasonable perspective.

@blaginin
Copy link
Contributor Author

blaginin commented Dec 10, 2024

hey @berkaysynnada, by future-proof solution do you mean synnada-ai@341b484#diff-2f697c310af1a48521829d8bd665cf64b6415fbf88edd370efa30b1ed686354dR1655? I agree that fine-grained control is better, but isn't it similar to how I implemented it? e.g.

pub compression: Option<String>, transform = str::to_lowercase, default = Some("zstd(3)".into())

@blaginin blaginin requested a review from findepi December 10, 2024 17:59
@comphead
Copy link
Contributor

Thanks @blaginin and @findepi for the review.

@berkaysynnada
Copy link
Contributor

berkaysynnada commented Dec 12, 2024

hey @berkaysynnada, by future-proof solution do you mean synnada-ai@341b484#diff-2f697c310af1a48521829d8bd665cf64b6415fbf88edd370efa30b1ed686354dR1655? I agree that fine-grained control is better, but isn't it similar to how I implemented it? e.g.

pub compression: Option<String>, transform = str::to_lowercase, default = Some("zstd(3)".into())

Sorry, I have missed your reply. After looking more in detail, you did actually what I referred.
I just wonder can we remove that config setting at all 🤔 because now every option can modify the input as intended, and the users should write knowing that behavior -- for example an option is case sensitive, you would know that and write accordingly. But if it's not a specific option and just takes some elements from a set, and then you would know they are all normalized. In which scenarios does that config setting get a meaning?

We have talked with @ozankabak, and the most optimal solution is that we can accommodate this config setting such that it will enables/disables all kind of normalizations. For example, if normalization is disabled, then all given inputs are needed to be written as how they are exactly written. If it is enabled, then the hardcoded normalization functions become activated, and the inputs are normalized accordingly. With all of these,I think the default value should be true, "enable normalizations".

@findepi
Copy link
Member

findepi commented Dec 12, 2024

I feel we should deprecate and remove the global config, in favor of per-value normalization.
@berkaysynnada What's the use-case to have per-value normalization and a global config?

@berkaysynnada
Copy link
Contributor

I feel we should deprecate and remove the global config, in favor of per-value normalization. @berkaysynnada What's the use-case to have per-value normalization and a global config?

I know it will most probably be a super-rare use case, but assume we add a hardcoded normalization for an option, and user somehow needs to not normalize his input. Does he need to change the code?

@findepi
Copy link
Member

findepi commented Dec 12, 2024

If the option has hard-coded normalization (eg lowercasing), it means there is no difference between certain values (eg those differing in case). Why would a user be concerned about that?
It feels you didn't describe a use-case fully, or I didn't quite understand it.

@berkaysynnada
Copy link
Contributor

berkaysynnada commented Dec 12, 2024

If the option has hard-coded normalization (eg lowercasing), it means there is no difference between certain values (eg those differing in case). Why would a user be concerned about that? It feels you didn't describe a use-case fully, or I didn't quite understand it.

If an option accepts a file name and is coded to normalize the input to snake_case (which applies to most data file naming conventions), but for a specific file type snake_case normalization is not preferable, the user might want to retain the original naming style. Can that be a valid example?

I am not insisting that we should keep it, I am just playing devil's advocate 😅

@findepi
Copy link
Member

findepi commented Dec 12, 2024

I like this example, thank you @berkaysynnada.
If an option accepts a file name (eg to write data to), would we want to normalize it? Maybe we just shouldn't.

@ozankabak
Copy link
Contributor

The config option is basically an escape hatch for users in case they find themselves in a situation where the normalization we enforce is inapplicable in their (possibly edge) case.

@blaginin
Copy link
Contributor Author

blaginin commented Dec 12, 2024

Feels like the right move is to add a deprecation to that option, see if we get any requests or issues, and if not, just remove it completely.

I also don’t think this option should exist. Each value is either case-sensitive or not - we know that for sure for each value. If we’re missing some edge case, we should fix it instead of normalizing the value. Normalizing might solve one problem but could also break other stuff, like aws integration.

If you’re good with that, @ozankabak @berkaysynnada, I’ll deprecate the option

@ozankabak
Copy link
Contributor

I am OK with removing it. We can add it later if it turns out to be necessary

@blaginin
Copy link
Contributor Author

> set datafusion.sql_parser.enable_options_value_normalization to true;
[2024-12-13T12:44:10Z WARN  datafusion_common::config] `enable_options_value_normalization` is deprecated and ignored

@blaginin blaginin force-pushed the bugfix/do-not-normalize-values branch from 97590fe to 66c3cd5 Compare December 13, 2024 14:18
@blaginin blaginin requested review from comphead and alamb December 13, 2024 17:34
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @blaginin

This seems like a very nice improvement to me and moves the choice of when to lowercase the options to the individual options

@ozankabak / @berkaysynnada can you confirm the change of behavior in this PR will work for your usecase?

/// When set to true, SQL parser will normalize options value (convert value to lowercase).
/// Note that this option is ignored and will be removed in the future. All case-insensitive values
/// are normalized automatically.
pub enable_options_value_normalization: bool, warn = "`enable_options_value_normalization` is deprecated and ignored", default = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the use of the warning is quite cool

However, it isn't shown by default in datafusion-cli FWIW

> set datafusion.sql_parser.enable_options_value_normalization = true;
0 row(s) fetched.
Elapsed 0.001 seconds.

I had to run it with RUST_LOG=info

RUST_LOG=info cargo run
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.13s
     Running `target/debug/datafusion-cli`
DataFusion CLI v43.0.0
> set datafusion.sql_parser.enable_options_value_normalization = true;
[2024-12-19T23:23:40Z WARN  datafusion_common::config] `enable_options_value_normalization` is deprecated and ignored
0 row(s) fetched.
Elapsed 0.007 seconds

@@ -161,70 +160,6 @@ fn parse_ident_normalization() {
}
}

#[test]
fn test_parse_options_value_normalization() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please put this test back and updated it so that it passes? That way we can reason about / more easily see the change in behavior introduced by this PR

@alamb
Copy link
Contributor

alamb commented Dec 19, 2024

Thank you @findepi for the reviews

@berkaysynnada
Copy link
Contributor

Thanks @blaginin

This seems like a very nice improvement to me and moves the choice of when to lowercase the options to the individual options

@ozankabak / @berkaysynnada can you confirm the change of behavior in this PR will work for your usecase?

Yes, we can merge this. If additional needs arise, we can request/submit follow-up PRs.

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 🚀

@alamb alamb merged commit 31acf45 into apache:main Dec 20, 2024
28 checks passed
@alamb
Copy link
Contributor

alamb commented Dec 20, 2024

Thanks @blaginin @findepi @comphead @berkaysynnada and @ozankabak for sticking with this. It is a nice improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate documentation Improvements or additions to documentation sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support per-option value normalization
6 participants