RLike: Fall back to CPU for regex that would produce incorrect results (#4044)

* Implement regexp parser to detect when we need to fall back to CPU for RLIKE

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* update compatibility docs

* code cleanup and documentation

* remove adhoc test

* more parser tests

* revert accidental docs change

* Move regular expression parser to new source file

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Update docs/compatibility.md

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* more fully implement hex and octal parsing and address other PR feedback

* make some regex validation less specific

* remove redundant check

* add parser test for complex expression

* remove comment

* revert removing check that was not redundant after all

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
andygrove and jlowe authored Nov 9, 2021
1 parent c6d9e3c commit d951ffa
Showing 6 changed files with 1,155 additions and 69 deletions.
78 changes: 21 additions & 57 deletions docs/compatibility.md
@@ -257,14 +257,32 @@ The plugin supports reading `uncompressed`, `snappy` and `gzip` Parquet files an
fall back to the CPU when reading an unsupported compression format, and will error out in that
case.

## Regular Expressions
The RAPIDS Accelerator for Apache Spark currently supports string literal matches, not wildcard
matches.
## LIKE

If a null char '\0' is in a string that is being matched by a regular expression, `LIKE` sees it as
the end of the string. This will be fixed in a future release. The issue is
[here](https://github.com/NVIDIA/spark-rapids/issues/119).

## Regular Expressions

### regexp_replace

For the `regexp_replace` function, the RAPIDS Accelerator for Apache Spark currently supports only
string literal matches, not wildcard matches, and will fall back to the CPU if a regular expression
pattern is provided.

### RLike

The GPU implementation of `RLike` has the following known issues where behavior is not consistent with Apache Spark and
this expression is disabled by default. It can be enabled by setting `spark.rapids.sql.expression.RLike=true`.

- `.` matches `\r` on the GPU but not on the CPU ([cuDF issue #9619](https://github.com/rapidsai/cudf/issues/9619))
- `$` does not match the end of string if the string ends with a line-terminator
([cuDF issue #9620](https://github.com/rapidsai/cudf/issues/9620))

`RLike` will fall back to CPU if any regular expressions are detected that are not supported on the GPU
or would produce different results on the GPU.

## Timestamps

Spark stores timestamps internally relative to the JVM time zone. Converting an arbitrary timestamp
@@ -569,60 +587,6 @@ distribution. Because the results are not bit-for-bit identical with the Apache
`approximate_percentile`, this feature is disabled by default and can be enabled by setting
`spark.rapids.sql.expression.ApproximatePercentile=true`.

## RLike

The GPU implementation of RLike has a number of known issues where behavior is not consistent with Apache Spark and
this expression is disabled by default. It can be enabled setting `spark.rapids.sql.expression.RLike=true`.

A summary of known issues is shown below but this is not intended to be a comprehensive list. We recommend that you
do your own testing to verify whether the GPU implementation of `RLike` is suitable for your use case.

We plan on improving the RLike functionality over time to make it more compatible with Spark so this feature should
be used at your own risk with the expectation that the behavior will change in future releases.

### Multi-line handling

The GPU implementation of RLike supports `^` and `$` to represent the start and end of lines within a string but
Spark uses `^` and `$` to refer to the start and end of the entire string (equivalent to `\A` and `\Z`).

| Pattern | Input | Spark on CPU | Spark on GPU |
|---------|--------|--------------|--------------|
| `^A` | `A\nB` | Match | Match |
| `A$` | `A\nB` | No Match | Match |
| `^B` | `A\nB` | No Match | Match |
| `B$` | `A\nB` | Match | Match |

As a workaround, `\A` and `\Z` can be used instead of `^` and `$`.

### Null support

The GPU implementation of RLike supports null characters in the input but does not support null characters in
the regular expression and will fall back to the CPU in this case.

### Qualifiers with nothing to repeat

Spark supports qualifiers in cases where there is nothing to repeat. For example, Spark supports `a*+` and this
will match all inputs. The GPU implementation of RLike does not support this syntax and will throw an exception with
the message `nothing to repeat at position 0`.

### Stricter escaping requirements

The GPU implementation of RLike has stricter requirements around escaping special characters in some cases.

| Pattern | Input | Spark on CPU | Spark on GPU |
|-----------|--------|--------------|--------------|
| `a[-+]` | `a-` | Match | No Match |
| `a[\-\+]` | `a-` | Match | Match |

### Empty groups

The GPU implementation of RLike does not support empty groups correctly.

| Pattern | Input | Spark on CPU | Spark on GPU |
|-----------|--------|--------------|--------------|
| `z()?` | `a` | No Match | Match |
| `z()*` | `a` | No Match | Match |

## Conditionals and operations with side effects (ANSI mode)

In Apache Spark condition operations like `if`, `coalesce`, and `case/when` lazily evaluate
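The compatibility notes above state that `RLike` is disabled by default and is turned on with `spark.rapids.sql.expression.RLike=true`. As a minimal sketch, the setting could be passed to `spark-submit` as a `--conf` pair. The helper name `to_submit_args` is hypothetical and only illustrates the key/value format; it is not part of the plugin.

```python
# Configuration key taken from the compatibility notes in this commit.
RLIKE_CONF = {"spark.rapids.sql.expression.RLike": "true"}


def to_submit_args(conf):
    """Turn a dict of Spark settings into `--conf key=value` arguments.

    Hypothetical helper for illustration only.
    """
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the extra arguments one could append to a spark-submit command.
    print(" ".join(to_submit_args(RLIKE_CONF)))
```

The integration tests in this commit pass the same key through their `conf=` argument rather than on the command line.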
24 changes: 17 additions & 7 deletions integration_tests/src/main/python/string_test.py
@@ -492,22 +492,30 @@ def test_rlike_embedded_null():
        conf={'spark.rapids.sql.expression.RLike': 'true'})

@allow_non_gpu('ProjectExec', 'RLike')
def test_rlike_null_pattern():
def test_rlike_fallback_null_pattern():
    gen = mk_str_gen('[abcd]{1,3}')
    assert_gpu_fallback_collect(
        lambda spark: unary_op_df(spark, gen).selectExpr(
            'a rlike "a\u0000"'),
        'RLike',
        conf={'spark.rapids.sql.expression.RLike': 'true'})

@allow_non_gpu('ProjectExec', 'RLike')
def test_rlike_fallback_empty_group():
    gen = mk_str_gen('[abcd]{1,3}')
    assert_gpu_fallback_collect(
        lambda spark: unary_op_df(spark, gen).selectExpr(
            'a rlike "a()?"'),
        'RLike',
        conf={'spark.rapids.sql.expression.RLike': 'true'})

def test_rlike_escape():
    gen = mk_str_gen('[ab]{0,2}[\\-\\+]{0,2}')
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, gen).selectExpr(
            'a rlike "a[\\\\-]"'),
        conf={'spark.rapids.sql.expression.RLike': 'true'})

@pytest.mark.xfail(reason='cuDF supports multiline by default but Spark does not - https://github.com/rapidsai/cudf/issues/9439')
def test_rlike_multi_line():
    gen = mk_str_gen('[abc]\n[def]')
    assert_gpu_and_cpu_are_equal_collect(
@@ -518,18 +526,20 @@ def test_rlike_multi_line():
            'a rlike "e$"'),
        conf={'spark.rapids.sql.expression.RLike': 'true'})

@pytest.mark.xfail(reason='cuDF has stricter requirements around escaping - https://github.com/rapidsai/cudf/issues/9434')
def test_rlike_missing_escape():
    gen = mk_str_gen('a[\\-\\+]')
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, gen).selectExpr(
            'a rlike "a[-]"'),
            'a rlike "a[-]"',
            'a rlike "a[+-]"',
            'a rlike "a[a-b-]"'),
        conf={'spark.rapids.sql.expression.RLike': 'true'})

@pytest.mark.xfail(reason='cuDF does not support qualifier with nothing to repeat - https://github.com/rapidsai/cudf/issues/9434')
def test_rlike_nothing_to_repeat():
@allow_non_gpu('ProjectExec', 'RLike')
def test_rlike_fallback_possessive_quantifier():
    gen = mk_str_gen('(\u20ac|\\w){0,3}a[|b*.$\r\n]{0,2}c\\w{0,3}')
    assert_gpu_and_cpu_are_equal_collect(
    assert_gpu_fallback_collect(
        lambda spark: unary_op_df(spark, gen).selectExpr(
            'a rlike "a*+"'),
        'RLike',
        conf={'spark.rapids.sql.expression.RLike': 'true'})
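
The three fallback tests above exercise patterns containing a null character (`a\u0000`), an empty group (`a()?`), and a possessive quantifier (`a*+`). The sketch below shows one way such a pre-check could look. It is not the regular expression parser added by this commit, which lives in the plugin's own source file; the function name `needs_cpu_fallback` and the substring-based checks are assumptions made for illustration, covering only these three cases.

```python
import re

# Hypothetical, simplified pre-check. It is NOT the plugin's parser and
# covers only the three fallback cases tested in this commit.

# A quantifier (*, +, ?, or a closing brace) immediately followed by '+'.
_POSSESSIVE = re.compile(r"[*+?}]\+")
# An opening parenthesis immediately followed by a closing one.
_EMPTY_GROUP = re.compile(r"\(\)")


def needs_cpu_fallback(pattern: str) -> bool:
    """Return True if the pattern should be handed to the CPU.

    The check is deliberately conservative: it does not track escapes or
    character classes, so it may flag harmless patterns such as `[*+]`.
    A false positive only costs performance, not correctness.
    """
    if "\u0000" in pattern:
        return True  # null character in the pattern
    if _EMPTY_GROUP.search(pattern):
        return True  # empty group, e.g. a()?
    if _POSSESSIVE.search(pattern):
        return True  # possessive quantifier, e.g. a*+
    return False


if __name__ == "__main__":
    for p in ['a\u0000', 'a()?', 'a*+', 'a[\\-]', 'e$']:
        print(repr(p), needs_cpu_fallback(p))
```

A substring check like this is easy to reason about but over-approximates; a real parser can tell `[*+]` (a character class) from `a*+` (a possessive quantifier) and avoid unnecessary fallback.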