RLike: Fall back to CPU for regex that would produce incorrect results #4044

andygrove · 2021-11-05T21:03:42Z

This PR introduces a lightweight regular expression parser that allows us to inspect patterns to determine if they can be supported on GPU or not so that we can fall back to CPU in those cases. In most cases, this is necessary to handle edge cases that would cause cuDF to throw an invalid regex: nothing to match exception. Examples include:

Possessive quantifiers: a*+
Empty groups: a()?
Choice where one side is potentially empty: ^|a or a*|b

There are other cases where Java has support for advanced regex features that are not available in cuDF:

Complex character class usage such as [a-d[m-p]]
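
To make the fallback idea concrete, here is a minimal sketch of the kind of check described above. It is not the parser in this PR: `FallbackSketch` and `unsupportedReason` are hypothetical names, and the scanner is deliberately simplified. It works on raw characters rather than an AST, ignores `{n,m}` quantifiers, and does not detect a choice with a potentially empty side.

```scala
object FallbackSketch {
  // Hypothetical helper (not the PR's RegexParser): returns a reason to
  // fall back to CPU if the pattern uses one of a few unsupported constructs.
  def unsupportedReason(pattern: String): Option[String] = {
    var i = 0
    var inClass = false          // true while scanning inside [...]
    var prevQuantifier = false   // true if the previous token was *, + or ?
    while (i < pattern.length) {
      val c = pattern.charAt(i)
      if (c == '\\') {
        // skip the escaped character, whatever it is
        i += 2
        prevQuantifier = false
      } else if (inClass) {
        if (c == '[') return Some("nested character class")
        if (c == ']') inClass = false
        prevQuantifier = false
        i += 1
      } else {
        c match {
          case '[' =>
            inClass = true
            prevQuantifier = false
          case '+' if prevQuantifier =>
            return Some("possessive quantifier")
          case '*' | '+' | '?' =>
            prevQuantifier = true
          case '(' if i + 1 < pattern.length && pattern.charAt(i + 1) == ')' =>
            return Some("empty group")
          case _ =>
            prevQuantifier = false
        }
        i += 1
      }
    }
    None
  }
}
```

A caller would treat `Some(reason)` as "run this RLIKE on CPU" and `None` as "no known problem found by this simplified check".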

There is also the beginning of a transpiler so that we can alter the pattern before passing it to cuDF. So far there is only one trivial example of this: escaping a hyphen (-) that appears within a character class as a literal character rather than as part of a character range, as in [abc-].
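
The hyphen rewrite can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the transpiler code from this PR: `HyphenEscapeSketch` and `escapeLiteralHyphens` are hypothetical names, nested character classes are not handled, and a hyphen is treated as literal only when it is the first or last character of a class.

```scala
object HyphenEscapeSketch {
  // Hypothetical helper: escape '-' when it is the first or last character
  // of a character class, where it cannot be part of a range.
  def escapeLiteralHyphens(pattern: String): String = {
    val out = new StringBuilder
    var i = 0
    // index of the first character inside the current class, or -1 when outside
    var classStart = -1
    while (i < pattern.length) {
      val c = pattern.charAt(i)
      if (c == '\\' && i + 1 < pattern.length) {
        // copy an existing escape sequence through unchanged
        out.append(c).append(pattern.charAt(i + 1))
        i += 2
      } else {
        if (classStart < 0) {
          if (c == '[') {
            // a leading '^' negates the class and is not a member of it
            val negated = i + 1 < pattern.length && pattern.charAt(i + 1) == '^'
            classStart = if (negated) i + 2 else i + 1
          }
          out.append(c)
        } else if (c == ']') {
          classStart = -1
          out.append(c)
        } else if (c == '-' &&
            (i == classStart || (i + 1 < pattern.length && pattern.charAt(i + 1) == ']'))) {
          out.append("\\-")
        } else {
          out.append(c)
        }
        i += 1
      }
    }
    out.toString
  }
}
```

With this sketch, [abc-] becomes [abc\-], while a genuine range such as [a-c] is left unchanged.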

This is a large PR with a lot of new functionality and I have been leaning heavily on the fuzzing approach to find differences between CPU and GPU. The fuzz tests are included as part of the new unit test suite.
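
For readers unfamiliar with the approach, differential fuzzing of this kind can be sketched as below. This is not the fuzz test included in the PR: `RegexFuzzSketch`, `randomPattern`, `javaFind`, and `findMismatches` are hypothetical names, the alphabet is an arbitrary choice, and the GPU side is represented by a plain function argument so the sketch runs without cuDF. It assumes `find` semantics for the Java reference result.

```scala
import java.util.regex.Pattern
import scala.util.{Random, Try}

object RegexFuzzSketch {
  // Small alphabet that mixes literals with regex metacharacters (an assumption)
  private val alphabet = "ab|()[]*+?^$-"

  def randomPattern(rng: Random, maxLen: Int): String = {
    val len = 1 + rng.nextInt(maxLen)
    (0 until len).map(_ => alphabet.charAt(rng.nextInt(alphabet.length))).mkString
  }

  // Reference result: Some(found) if Java accepts the pattern, None if it fails
  def javaFind(pattern: String, input: String): Option[Boolean] =
    Try(Pattern.compile(pattern).matcher(input).find()).toOption

  // Returns every (pattern, input) pair where the candidate disagrees with Java.
  // The candidate stands in for the GPU implementation under test.
  def findMismatches(
      candidate: (String, String) => Option[Boolean],
      inputs: Seq[String],
      iterations: Int,
      seed: Long): Seq[(String, String)] = {
    val rng = new Random(seed)
    for {
      _ <- 0 until iterations
      pattern = randomPattern(rng, 8)
      input <- inputs
      if candidate(pattern, input) != javaFind(pattern, input)
    } yield (pattern, input)
  }
}
```

Each reported pair is a candidate for either a fix or a new fallback rule. Using a fixed seed keeps any failure reproducible.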

andygrove · 2021-11-05T21:12:21Z

build

docs/compatibility.md

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala

tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionParserSuite.scala

revans2

This is really great.

revans2 · 2021-11-09T14:25:23Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala

+    regex.toRegexString
+  }
+
+  private def validate(regex: RegexAST): Unit = {


nit: Do we want to look at rules that could rewrite some of these? For example, if we see "()" as the regular expression, can we replace it with ".*"? I honestly don't know if that would even work, because I don't remember what Java does in this case. This should probably be follow-on work if we do want to look into it, because I don't want to hold this up from going in.

I do think we should do this but I wanted to take this one step at a time and start off with simply falling back to CPU and then follow up with optimizations so that this PR doesn't become overwhelming to review. Ideally, I think we should follow up with one PR per specific optimization, so we can make sure that each one has comprehensive tests.

That is fine with me. Being customer driven on what we pull in sounds good. After all, most of these are corner cases, they should be rare, and if we do add a modification step we need a lot of testing to really be sure it is doing the right thing.

revans2 · 2021-11-09T14:37:20Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala

+    // parse the source regular expression
+    val regex = new RegexParser(pattern).parse()
+    // validate that the regex is supported by cuDF
+    validate(regex)


nit: Do we want to try to validate the size/complexity of the regular expression? I don't know exactly what cuDF does to figure out if it needs a small, medium, large, or crazy big stack/memory, but it looks like we could do something similar and fall back to the CPU if it is too large. The main reason for this is that we just dropped the default for spark.rapids.memory.gpu.reserve from 1 GiB to 256 MiB. The reason we set it at 1 GiB was because of hard-coded regular expressions that we used. If we are going to fully support arbitrary regular expressions, it would be nice to tie these two together in some way so that we fall back to the CPU if there is not enough reserved memory, or we let users opt into larger regular expressions but tell them in the instructions that they need to increase the reserved memory accordingly.

Again, this would probably be better as follow-on work.

That sounds like a great idea. I will file a follow-on issue for this.

Filed as #4061

jlowe · 2021-11-09T15:33:21Z

build

revans2 · 2021-11-09T16:52:20Z

build

andygrove · 2021-11-09T17:24:10Z

build failed with

Unable to find image '***' locally
docker: Error response from daemon: unauthorized: access to the requested resource is not authorized.
See 'docker run --help'.
exit status 125
Error: Process completed with exit code 255.

andygrove · 2021-11-09T17:24:30Z

build

andygrove added 3 commits November 5, 2021 13:24

Implement regexp parser to detect when we need to fall back to CPU fo…

1ebbee7

…r RLIKE Signed-off-by: Andy Grove <andygrove@nvidia.com>

update compatibility docs

c5fcb28

code cleanup and documentation

62ea0e0

andygrove added this to the Nov 1 - Nov 12 milestone Nov 5, 2021

andygrove self-assigned this Nov 5, 2021

remove adhoc test

b4e4fd4

sameerz added the task Work required that improves the product but is not user facing label Nov 5, 2021

andygrove added 2 commits November 5, 2021 15:24

more parser tests

a00215c

revert accidental docs change

5d0d8d7

jlowe reviewed Nov 8, 2021

andygrove and others added 8 commits November 8, 2021 17:15

Move regular expression parser to new source file

3f8d2ae

Signed-off-by: Andy Grove <andygrove@nvidia.com>

Update docs/compatibility.md

95695e3

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

Merge branch 'rlike-support-more-regex' of github.com:andygrove/spark…

49cc39c

…-rapids into rlike-support-more-regex

more fully implement hex and octal parsing and address other PR feedback

aeca310

make some regex validation less specific

9967f05

remove redundant check

8e63f54

add parser test for complex expression

8d32bdd

remove comment

65bad0f

revans2 reviewed Nov 9, 2021

revans2 previously approved these changes Nov 9, 2021

jlowe previously approved these changes Nov 9, 2021

revert removing check that was not redundant after all

62992b1

andygrove dismissed stale reviews from jlowe and revans2 via 62992b1 November 9, 2021 16:49

revans2 approved these changes Nov 9, 2021

andygrove mentioned this pull request Nov 9, 2021

[FEA] Validate the size/complexity of regular expressions #4061

Closed

andygrove merged commit d951ffa into NVIDIA:branch-21.12 Nov 9, 2021

andygrove deleted the rlike-support-more-regex branch November 9, 2021 21:24

andygrove mentioned this pull request Nov 10, 2021

Enable test_rlike_multi_line and update documentation #3988

Closed

andygrove linked an issue Nov 10, 2021 that may be closed by this pull request

[FEA] Support null characters in regular expressions in RLIKE #3962

Closed

andygrove mentioned this pull request Nov 10, 2021

[FEA] Support null characters in regular expressions in RLIKE #3962

Closed
