[BUG] Split by regular expressions with ? and * repetition are not consistent with Spark #4884

Closed
NVnavkumar opened this issue Mar 1, 2022 · 4 comments · Fixed by #6959
Labels
bug Something isn't working cudf_dependency An issue or PR with this label depends on a new feature in cudf P2 Not required for release

Comments

@NVnavkumar
Collaborator

Describe the bug
We currently fall back to the CPU for split when the pattern uses the repetition quantifiers ? and *, because the GPU behavior is not consistent with Spark.

Steps/Code to reproduce bug
Example:
For the input string 31313 and the pattern 4?, split will produce ['3','1','3','1','3'] on CPU, and ['','3','1','3','1','3'] on the GPU.

Expected behavior
The behavior should be consistent with Spark so we can enable this on GPU.

Also see #4468 for a related issue regarding regexp_replace.

@NVnavkumar NVnavkumar added bug Something isn't working ? - Needs Triage Need team to review and classify labels Mar 1, 2022
@NVnavkumar
Collaborator Author

Note that the same behavior also occurs with the {0,} and {0,n} repetitions.
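What these quantifiers have in common is that each can match the empty string, so each produces a zero-width match at the start of the input when the character is absent. The sketch below illustrates this with Python's `re` module only; it is not the Spark or cuDF code path. The helper name `matches_empty_at_start` is hypothetical, and `{0,2}` stands in for `{0,n}` with an arbitrary n.

```python
import re

# Quantifiers discussed in this issue; {0,2} is an arbitrary instance of {0,n}.
patterns = ["4?", "4*", "4{0,}", "4{0,2}"]


def matches_empty_at_start(pattern: str, text: str) -> bool:
    """Return True if the pattern matches with zero width at index 0 of text."""
    m = re.match(pattern, text)
    return m is not None and m.end() == 0


# "31313" contains no '4', so every pattern above can only match emptily.
results = {p: matches_empty_at_start(p, "31313") for p in patterns}
print(results)
```

By contrast, `4+` cannot match the empty string, so it never hits this edge case.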

@sameerz sameerz added P2 Not required for release cudf_dependency An issue or PR with this label depends on a new feature in cudf and removed ? - Needs Triage Need team to review and classify labels Mar 1, 2022
@NVnavkumar NVnavkumar self-assigned this Oct 28, 2022
@NVnavkumar
Collaborator Author

NVnavkumar commented Oct 28, 2022

Solving this requires understanding the JDK behavior. Here is the relevant passage from the documentation for Pattern.split(...):

 When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

Currently cuDF emulates Python behavior, which always includes an empty leading substring when there is a zero-width match at the beginning of the input.
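The difference between the two rules can be sketched in Python (3.7 or later, where `re.split` splits on empty matches). The helper `jdk_style_split` is hypothetical: it only approximates the documented Pattern.split rule quoted above, plus the JDK's removal of trailing empty strings when the limit is zero. It is not the JDK, Spark, or cuDF implementation, and Python and Java regex engines may differ in other ways.

```python
import re


def jdk_style_split(pattern: str, text: str, keep_trailing_empty: bool = False) -> list:
    """Approximate the documented JDK split rule on top of Python's re (assumption)."""
    pieces = []
    index = 0
    for m in re.finditer(pattern, text):
        # JDK rule: a zero-width match at the beginning never produces
        # an empty leading substring, so skip it.
        if index == 0 and m.start() == 0 and m.end() == 0:
            continue
        pieces.append(text[index:m.start()])
        index = m.end()
    if index == 0:
        # No usable match was found: the whole input is the only element.
        return [text]
    pieces.append(text[index:])
    if not keep_trailing_empty:
        # With a limit of zero, the JDK discards trailing empty strings.
        while pieces and pieces[-1] == "":
            pieces.pop()
    return pieces


python_result = re.split("4?", "31313")          # ['', '3', '1', '3', '1', '3', '']
jdk_like_result = jdk_style_split("4?", "31313")  # ['3', '1', '3', '1', '3']
print(python_result)
print(jdk_like_result)
```

A positive-width match at the beginning still yields the empty leading substring under this rule: `jdk_style_split("3", "31313")` returns `['', '1', '1']`.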

@NVnavkumar
Collaborator Author

I will file another issue specific to zero-width matching to capture this JDK edge case, so that we can enable these quantifiers on the GPU in other circumstances.

@NVnavkumar
Collaborator Author

Filed #6958 to handle the zero-width match edge case. In other circumstances, we should enable * and ? on the GPU.
