Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Internal error in regexp_replace() for some StringView input #12203

Merged
merged 8 commits into from
Sep 12, 2024

Conversation

devanbenz
Copy link
Contributor

@devanbenz devanbenz commented Aug 28, 2024

Which issue does this PR close?

Closes #12150 & closes #11912

Rationale for this change

What changes are included in this PR?

Currently the StringView data type cannot be cast toas_generic_string_array. The changes here resolve that: https://github.com/apache/datafusion/pull/12203/files#diff-0007996e12eb1b2e63363974424f13330f85a7fc48e0c55381f8a30a0f372931R206

Another point of issue was in the return_type match statement when using a function within the query on Utf8View that appears to be returning a ScalarArray of StringArray

For example:

SELECT
  REGEXP_REPLACE(column1_utf8view, lower(column1_utf8view), 'bar') AS k
FROM test;

lower(column1_utf8view) would return a StringArray instead of a StringViewArray which would cause the result of the query to return a StringArray. For now I am implementing a conversion step https://github.com/apache/datafusion/pull/12203/files#diff-0007996e12eb1b2e63363974424f13330f85a7fc48e0c55381f8a30a0f372931R516 to coerce the return value in to being a StringView as required by the signatures.

I think this is okay. Would love to work on optimizing/making this code better if possible though. For now this appears to be the path forward for the error I was seeing.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions labels Aug 28, 2024
@devanbenz devanbenz marked this pull request as ready for review August 28, 2024 02:38
@Omega359
Copy link
Contributor

Omega359 commented Aug 28, 2024

There are a few things I noted when I reviewed this pull request, some of which may be good to either fix here or in other tickets/PRs:

  • The return type includes Utf8View, but the code doesn't do that
  • The return type tests for Binary/LargeBinary/BinaryView/Dictionary but those are unsupported signature types
  • The signature is missing Exact(vec![Utf8View, Utf8, Utf8, Utf8]),
  • I don't see anywhere that the args are actually checked to have 3 or 4 elements anymore
  • There are a bunch of macros in the tests that as best as I can see don't seem to be used at all. At least when I remove them the tests in that file all run. Given that I don't think the tests cover anything but StringArray as arg[0] Correction, the macros write out tests, my mistake.

@devanbenz
Copy link
Contributor Author

There are a few things I noted when I reviewed this pull request, some of which may be good to either fix here or in other tickets/PRs:

  • The return type includes Utf8View, but the code doesn't do that
  • The return type tests for Binary/LargeBinary/BinaryView/Dictionary but those are unsupported signature types
  • The signature is missing Exact(vec![Utf8View, Utf8, Utf8, Utf8]),
  • I don't see anywhere that the args are actually checked to have 3 or 4 elements anymore
  • There are a bunch of macros in the tests that as best as I can see don't seem to be used at all. At least when I remove them the tests in that file all run. Given that I don't think the tests cover anything but StringArray as arg[0] Correction, the macros write out tests, my mistake.

I've added some additional sqllogictests for flags & modified the type signature for

The signature is missing Exact(vec![Utf8View, Utf8, Utf8, Utf8]),

Could you please elaborate more on the return types in the first bullet?

Thanks!

@Omega359
Copy link
Contributor

Could you please elaborate more on the return types in the first bullet?

Sure. That was one I saw that wasn't related to anything in this PR but that I think might be improved.

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
        use DataType::*;
        Ok(match &arg_types[0] {
            LargeUtf8 | LargeBinary => LargeUtf8,
            Utf8 | Binary => Utf8,
            Utf8View | BinaryView => Utf8View,
            Null => Null,
            Dictionary(_, t) => match **t {
                LargeUtf8 | LargeBinary => LargeUtf8,
                Utf8 | Binary => Utf8,
                Null => Null,
                _ => {
                    return plan_err!(
                        "the regexp_replace can only accept strings but got {:?}",
                        **t
                    );
                }
            },
            other => {
                return plan_err!(
                    "The regexp_replace function can only accept strings. Got {other}"
                );
            }
        })
    }

Unless I'm mistaken that return type method's args are the same as what is passed to invoke (and what is declared in the signature) and that is just Utf8, LargeUtf8 and Utf8View (and the signature doesn't even cover the LargeUtf8 case). Basically, I think it needs to be cleaned up wrt types to be consistent.

@Omega359
Copy link
Contributor

Note that the above comment could (should?) be another PR - it was just noted by me when looking at the code for this.

@devanbenz
Copy link
Contributor Author

@alamb not sure if you have eyes on this but its a StringView related bug 🫡

@alamb
Copy link
Contributor

alamb commented Sep 11, 2024

I ran the regexp benchmarks on this PR and verified that there are no changes for the existing types

++ critcmp main fix/12150-regexprep-err-2
group                  main
-----                  ----
regexp_like_1000       1.00      3.8±0.01ms        ? ?/sec
regexp_match_1000      1.00      4.6±0.01ms        ? ?/sec
regexp_replace_1000    1.00      3.5±0.01ms        ? ?/sec

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @devanbenz and @Omega359

datafusion/functions/src/regex/regexpreplace.rs Outdated Show resolved Hide resolved
datafusion/functions/src/regex/regexpreplace.rs Outdated Show resolved Hide resolved
@alamb alamb merged commit 04e8e53 into apache:main Sep 12, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Sep 12, 2024

🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
3 participants