-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RLike: Fall back to CPU for regex that would produce incorrect results #4044
RLike: Fall back to CPU for regex that would produce incorrect results #4044
Conversation
…r RLIKE Signed-off-by: Andy Grove <andygrove@nvidia.com>
build |
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Outdated
Show resolved
Hide resolved
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionParserSuite.scala
Show resolved
Hide resolved
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
…-rapids into rlike-support-more-regex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really great.
regex.toRegexString | ||
} | ||
|
||
private def validate(regex: RegexAST): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Do we want to possibly look at some rules that can re-write some of these? like if we see "()" as the regular expression can we replace it with ".*"? I honestly don't know if that even would work, because I don't remember what java does in this case. This should probably be follow on work if we do want to look into this, because I don't want to hold this up from going in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think we should do this but I wanted to take this one step at a time and start off with simply falling back to CPU and then follow up with optimizations so that this PR doesn't become overwhelming to review. Ideally, I think we should follow up with one PR per specific optimization, so we can make sure that each one has comprehensive tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is fine with me. Being customer driven on what we pull in sounds good. After all most of these are corner cases, hey should be rare, and if we do add a modification step we need a lot of testing to really be sure it is doing the right thing.
// parse the source regular expression | ||
val regex = new RegexParser(pattern).parse() | ||
// validate that the regex is supported by cuDF | ||
validate(regex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Do we want to try and validate the size/complexity of the regular expression? I don't know exactly what CUDF does to figure out if it needs a small, medium, large, or crazy big stack/memory, but it looks like we could do something similar, and fall back to the CPU if it is too large. The main reason for this is because we just dropped the default for spark.rapids.memory.gpu.reserve from 1GiB to 256 MiB. The reason we set it at 1GiB was because of hard coded regular expressions that we used. If we are going to fully support arbitrary regular expressions it would be nice to try and tie these two together in some way so we fall back to the CPU if there is not enough reserved memory, or we let users opt into larger regular expressions, but in the instructions we tell them that they need to increase the reserved memory accordingly.
Again this would probably be better as follow on work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like a great idea. I will file a follow-on issue for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed as #4061
build |
build |
build failed with
|
build |
Closes #3797
This PR introduces a lightweight regular expression parser that allows us to inspect patterns to determine if they can be supported on GPU or not so that we can fall back to CPU in those cases. In most cases, this is necessary to handle edge cases that would cause cuDF to throw an
invalid regex: nothing to match
exception. Examples include:a*+
a()?
^|a
ora*|b
There are other cases where Java has support for advanced regex features that are not available in cuDF:
[a-d[m-p]]
There is also the beginning of a transpiler so that we can alter the pattern before passing it to cuDF. So far there is only one trivial example of this and that is escaping
-
if it appears within a character class to represent the character-
rather than being used to specify a character range, as in[abc-]
.This is a large PR with a lot of new functionality and I have been leaning heavily on the fuzzing approach to find differences between CPU and GPU. The fuzz tests are included as part of the new unit test suite.