[BUG] Inconsistent handling of word boundaries `\b` and `\B` with StringSplit for regular expressions #5478

anthony-chang · 2022-05-12T22:50:30Z

Describe the bug
We currently fallback to CPU for regular expressions that include \b or \B in string split mode since the behaviour is not consistent with Spark.

Steps/Code to reproduce bug
If we enable word boundaries in RegexSplitMode, the test "string split fuzz" in RegularExpressionTranspilerSuite produces this output:

string split fuzz *** FAILED ***
  string_split pattern=((\b)) isRegex=true data=xb\nB limit=-2 
  CPU [4]: xb, \n, B,  
  GPU [4]: , xb, \n, B

Expected behavior
The CPU and GPU should be consistent.

Environment details (please complete the following information)
None

Additional context
Related PR: #5479

The text was updated successfully, but these errors were encountered:

anthony-chang · 2022-06-22T20:14:07Z

Depends on rapidsai/cudf#11102

anthony-chang · 2022-07-19T22:53:09Z

There are still a number of inconsistencies, particularly when the pattern consists entirely of word boundaries. cuDF/python has an extra empty token in the array returned by string split which Spark does not. Examples:

- \b *** FAILED ***
  string_split pattern=\b isRegex=true data=a limit=-1 
  CPU [2]: a,  
  GPU [3]: , a, (RegularExpressionTranspilerSuite.scala:764)

- \b\b *** FAILED ***
  string_split pattern=\b\b isRegex=true data=a limit=-1 
  CPU [2]: a,  
  GPU [3]: , a, (RegularExpressionTranspilerSuite.scala:764)

- \B\B *** FAILED ***
  string_split pattern=\B\B isRegex=true data=-+ limit=-1 
  CPU [3]: -, +,  
  GPU [4]: , -, +, (RegularExpressionTranspilerSuite.scala:764)

It is possible to use a workaround for this though. We can check if the pattern consists of entirely word or non-word boundaries:

def isEntirely(pattern: RegexAST, component: RegexAST): Boolean = {
  pattern match {
    case RegexSequence(parts) => parts.forall(isEntirely(_, component))
    case RegexGroup(_, term) => isEntirely(term, component)
    case RegexChoice(l, r) => isEntirely(l, component) || isEntirely(r, component)
    case `component` => true
    case _ => false
  }
}

val isEntirelyWordBoundary = isEntirely(ast, RegexEscaped('b')) || isEntirely(ast, RegexEscaped('B'))

Let arr be the string array we get from cuDF's string split. If isEntirelyWordBoundary && arr.length > 1 && arr.head = "" is true, then we can remove the first element in arr and it will match the Spark string split. This method passed the Scala fuzz tests for limit <= 1.

For limit > 1 this wouldn't work since removing an element would be like removing one of the splits. We could try to work around this by running the split with limit = limit + 1 but the conditions for when to do this are even more complicated.

Checking all these conditions on GPU and then doing the removal from a list column vector will be extremely complicated so I don't see a good way to implement this functionality as of now.

anthony-chang added bug Something isn't working ? - Needs Triage Need team to review and classify labels May 12, 2022

anthony-chang mentioned this issue May 12, 2022

Add support for word boundaries \b and \B #5479

Merged

sameerz removed the ? - Needs Triage Need team to review and classify label May 17, 2022

sameerz mentioned this issue May 17, 2022

[FEA] Enable regular expressions by default #4509

Open

61 tasks

anthony-chang self-assigned this Jun 7, 2022

anthony-chang removed their assignment Jul 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Inconsistent handling of word boundaries `\b` and `\B` with StringSplit for regular expressions #5478

[BUG] Inconsistent handling of word boundaries `\b` and `\B` with StringSplit for regular expressions #5478

anthony-chang commented May 12, 2022 •

edited

Loading

anthony-chang commented Jun 22, 2022

anthony-chang commented Jul 19, 2022

[BUG] Inconsistent handling of word boundaries \b and \B with StringSplit for regular expressions #5478

[BUG] Inconsistent handling of word boundaries \b and \B with StringSplit for regular expressions #5478

Comments

anthony-chang commented May 12, 2022 • edited Loading

anthony-chang commented Jun 22, 2022

anthony-chang commented Jul 19, 2022

[BUG] Inconsistent handling of word boundaries `\b` and `\B` with StringSplit for regular expressions #5478

[BUG] Inconsistent handling of word boundaries `\b` and `\B` with StringSplit for regular expressions #5478

anthony-chang commented May 12, 2022 •

edited

Loading