Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Inconsistent handling of word boundaries \b and \B with StringSplit for regular expressions #5478

Open
anthony-chang opened this issue May 12, 2022 · 2 comments
Labels
bug Something isn't working

Comments

@anthony-chang
Copy link
Contributor

anthony-chang commented May 12, 2022

Describe the bug
We currently fallback to CPU for regular expressions that include \b or \B in string split mode since the behaviour is not consistent with Spark.

Steps/Code to reproduce bug
If we enable word boundaries in RegexSplitMode, the test "string split fuzz" in RegularExpressionTranspilerSuite produces this output:

string split fuzz *** FAILED ***
  string_split pattern=((\b)) isRegex=true data=xb\nB limit=-2 
  CPU [4]: xb, \n, B,  
  GPU [4]: , xb, \n, B 

Expected behavior
The CPU and GPU should be consistent.

Environment details (please complete the following information)
None

Additional context
Related PR: #5479

@anthony-chang anthony-chang added bug Something isn't working ? - Needs Triage Need team to review and classify labels May 12, 2022
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label May 17, 2022
@anthony-chang anthony-chang self-assigned this Jun 7, 2022
@anthony-chang
Copy link
Contributor Author

Depends on rapidsai/cudf#11102

@anthony-chang
Copy link
Contributor Author

There are still a number of inconsistencies, particularly when the pattern consists entirely of word boundaries. cuDF/python has an extra empty token in the array returned by string split which Spark does not. Examples:

- \b *** FAILED ***
  string_split pattern=\b isRegex=true data=a limit=-1 
  CPU [2]: a,  
  GPU [3]: , a, (RegularExpressionTranspilerSuite.scala:764)

- \b\b *** FAILED ***
  string_split pattern=\b\b isRegex=true data=a limit=-1 
  CPU [2]: a,  
  GPU [3]: , a, (RegularExpressionTranspilerSuite.scala:764)

- \B\B *** FAILED ***
  string_split pattern=\B\B isRegex=true data=-+ limit=-1 
  CPU [3]: -, +,  
  GPU [4]: , -, +, (RegularExpressionTranspilerSuite.scala:764)

It is possible to use a workaround for this though. We can check if the pattern consists of entirely word or non-word boundaries:

def isEntirely(pattern: RegexAST, component: RegexAST): Boolean = {
  pattern match {
    case RegexSequence(parts) => parts.forall(isEntirely(_, component))
    case RegexGroup(_, term) => isEntirely(term, component)
    case RegexChoice(l, r) => isEntirely(l, component) || isEntirely(r, component)
    case `component` => true
    case _ => false
  }
}

val isEntirelyWordBoundary = isEntirely(ast, RegexEscaped('b')) || isEntirely(ast, RegexEscaped('B'))

Let arr be the string array we get from cuDF's string split. If isEntirelyWordBoundary && arr.length > 1 && arr.head = "" is true, then we can remove the first element in arr and it will match the Spark string split. This method passed the Scala fuzz tests for limit <= 1.

For limit > 1 this wouldn't work since removing an element would be like removing one of the splits. We could try to work around this by running the split with limit = limit + 1 but the conditions for when to do this are even more complicated.

Checking all these conditions on GPU and then doing the removal from a list column vector will be extremely complicated so I don't see a good way to implement this functionality as of now.

@anthony-chang anthony-chang removed their assignment Jul 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants