[BUG] Inconsistent handling of word boundaries \b and \B with StringSplit for regular expressions #5478
Comments
Depends on rapidsai/cudf#11102
There are still a number of inconsistencies, particularly when the pattern consists entirely of word boundaries. cuDF/Python returns an extra empty token in the array produced by string split, which Spark does not. Examples:
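As a point of reference for the CPU side only, here is a minimal sketch of how a plain JVM regex split treats word-boundary patterns. It assumes the Spark CPU path follows `java.util.regex` split semantics with a limit of -1; the object and helper names (`JvmSplitDemo`, `jvmSplit`) are hypothetical, and no cuDF output is shown or implied.

```scala
object JvmSplitDemo {
  // Hypothetical helper: delegates to java.lang.String.split(regex, limit).
  // A limit of -1 (assumed here) keeps trailing empty tokens.
  def jvmSplit(input: String, pattern: String): List[String] =
    input.split(pattern, -1).toList

  def main(args: Array[String]): Unit = {
    // On Java 8+, a zero-width match at the start of the input does not
    // produce an empty leading token.
    println(jvmSplit("hello world", "\\b"))
    // \B only matches between the two word characters here.
    println(jvmSplit("ab", "\\B"))
  }
}
```

With these inputs the JVM split yields no empty leading token for `\b`, which is the kind of difference an engine that emits one at position 0 would have to account for.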
It is possible to use a workaround for this though. We can check if the pattern consists entirely of word or non-word boundaries:

def isEntirely(pattern: RegexAST, component: RegexAST): Boolean = {
  pattern match {
    case RegexSequence(parts) => parts.forall(isEntirely(_, component))
    case RegexGroup(_, term) => isEntirely(term, component)
    case RegexChoice(l, r) => isEntirely(l, component) || isEntirely(r, component)
    case `component` => true
    case _ => false
  }
}
val isEntirelyWordBoundary = isEntirely(ast, RegexEscaped('b')) || isEntirely(ast, RegexEscaped('B'))

Let ... For ... Checking all these conditions on the GPU and then doing the removal from a list column vector would be extremely complicated, so I don't see a good way to implement this functionality as of now.
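To try the check above in isolation, the following sketch pairs it with simplified stand-in AST types. The case classes here are assumptions made for illustration (including the hypothetical `RegexChar` node and the `List`-based `RegexSequence`); they are not the project's actual definitions, and only the `isEntirely` logic is taken from the comment.

```scala
object IsEntirelyDemo {
  // Simplified stand-ins for the transpiler's AST (assumed shapes).
  sealed trait RegexAST
  case class RegexSequence(parts: List[RegexAST]) extends RegexAST
  case class RegexGroup(capture: Boolean, term: RegexAST) extends RegexAST
  case class RegexChoice(l: RegexAST, r: RegexAST) extends RegexAST
  case class RegexEscaped(c: Char) extends RegexAST
  case class RegexChar(c: Char) extends RegexAST

  // Same logic as in the comment: true when every leaf reached through
  // sequences and groups equals `component`.
  def isEntirely(pattern: RegexAST, component: RegexAST): Boolean =
    pattern match {
      case RegexSequence(parts) => parts.forall(isEntirely(_, component))
      case RegexGroup(_, term) => isEntirely(term, component)
      case RegexChoice(l, r) => isEntirely(l, component) || isEntirely(r, component)
      case `component` => true // stable-identifier pattern: compares with ==
      case _ => false
    }

  def isEntirelyWordBoundary(ast: RegexAST): Boolean =
    isEntirely(ast, RegexEscaped('b')) || isEntirely(ast, RegexEscaped('B'))

  def main(args: Array[String]): Unit = {
    val onlyBoundaries = RegexSequence(List(RegexEscaped('b'), RegexGroup(true, RegexEscaped('b'))))
    val mixed = RegexSequence(List(RegexEscaped('b'), RegexChar('a')))
    println(isEntirelyWordBoundary(onlyBoundaries))
    println(isEntirelyWordBoundary(mixed))
  }
}
```

Note that, as written, a `RegexChoice` counts as "entirely" the component when either branch does, so a pattern such as `\b|a` would also pass the check; whether that is the intended behaviour is not stated in the thread.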
Describe the bug
We currently fall back to the CPU for regular expressions that include \b or \B in string split mode, since the behaviour is not consistent with Spark.
Steps/Code to reproduce bug
If we enable word boundaries in RegexSplitMode, the test "string split fuzz" in RegularExpressionTranspilerSuite produces this output:
Expected behavior
The CPU and GPU should be consistent.
Environment details (please complete the following information)
None
Additional context
Related PR: #5479