Make regexp pattern [^a] consistent with Spark for multiline strings #4255

Merged
andygrove merged 5 commits into NVIDIA:branch-22.02 from the neg-class-newline branch
Dec 6, 2021

Conversation

andygrove
Contributor

@andygrove andygrove commented Dec 1, 2021

Signed-off-by: Andy Grove <andygrove@nvidia.com>

Closes #4229

The following code comment from this PR explains the transpiler change that makes GPU behavior consistent with the CPU (Spark, which uses Java regular expressions) for negated character classes such as `[^a]`.

// There are differences between cuDF and Java handling of newlines
// for negative character matches. The expression `[^a]` will match
// `\r` and `\n` in Java but not in cuDF, so we replace `[^a]` with
// `(?:[\r\n]|[^a])`. We also have to take into account whether any
// newline characters are included in the character range.
//
// Examples:
//
// `[^a]`     => `(?:[\r\n]|[^a])`
// `[^a\r]`   => `(?:[\n]|[^a])`
// `[^a\n]`   => `(?:[\r]|[^a])`
// `[^a\r\n]` => `[^a]`
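
The four examples above follow one rule: any newline character that the negated class does not already exclude is matched explicitly in an alternation, and newline characters are dropped from the negated class itself. The following is a minimal Python sketch of that rule, not the transpiler code from this PR. The function name `rewrite_negated_class` is hypothetical, and the sketch assumes the class contains only literal characters (no ranges, escapes, or regex metacharacters) and at least one non-newline character.

```python
def rewrite_negated_class(members: str) -> str:
    """Rewrite the body of a negated class `[^...]` following the PR's examples.

    `members` holds the literal characters inside the brackets, e.g. "a\r".
    Hypothetical helper for illustration only; it does not handle ranges,
    escapes, or metacharacters.
    """
    # Newline characters the original class would still match in Java,
    # because they are not listed among the excluded characters.
    still_matched = [nl for nl in ("\r", "\n") if nl not in members]

    # Per the comment above, cuDF's negated class does not match \r or \n,
    # so listing them inside the class is redundant and they are dropped.
    rest = "".join(c for c in members if c not in "\r\n")
    negated = "[^" + rest + "]"

    if not still_matched:
        return negated

    # Emit the regex escape sequences rather than raw control characters.
    escapes = {"\r": "\\r", "\n": "\\n"}
    explicit = "".join(escapes[c] for c in still_matched)
    return "(?:[" + explicit + "]|" + negated + ")"


if __name__ == "__main__":
    for body in ("a", "a\r", "a\n", "a\r\n"):
        print(repr(body), "=>", rewrite_negated_class(body))
```

Calling the helper with `"a"`, `"a\r"`, `"a\n"`, and `"a\r\n"` reproduces the four rewritten patterns listed in the comment.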

…epect to newline characters

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove added this to the Nov 30 - Dec 10 milestone Dec 1, 2021
@andygrove andygrove self-assigned this Dec 1, 2021
@sameerz sameerz added the task Work required that improves the product but is not user facing label Dec 1, 2021
@andygrove andygrove changed the title WIP: Make regexp pattern [^a] consistent with Spark for multiline strings Make regexp pattern [^a] consistent with Spark for multiline strings Dec 3, 2021
@andygrove andygrove marked this pull request as ready for review December 3, 2021 20:59
@andygrove
Contributor Author

build

@jlowe
Member

jlowe commented Dec 6, 2021

build

@andygrove andygrove merged commit d3c5847 into NVIDIA:branch-22.02 Dec 6, 2021
@andygrove andygrove deleted the neg-class-newline branch December 6, 2021 18:29
Labels
task Work required that improves the product but is not user facing
Development

Successfully merging this pull request may close these issues.

[BUG] regexp_replace [^a] has different behavior between CPU and GPU for multiline strings
3 participants