
RLike: Fall back to CPU for regex that would produce incorrect results #4044

Merged (15 commits) on Nov 9, 2021
Changes from all commits
78 changes: 21 additions & 57 deletions docs/compatibility.md
@@ -257,14 +257,32 @@ The plugin supports reading `uncompressed`, `snappy` and `gzip` Parquet files and will not
fall back to the CPU when reading an unsupported compression format, and will error out in that
case.

## Regular Expressions
The RAPIDS Accelerator for Apache Spark currently supports string literal matches, not wildcard
matches.
## LIKE

If a null char `'\0'` is in a string that is being matched by a regular expression, `LIKE` sees it as
the end of the string. This will be fixed in a future release. The issue is tracked
[here](https://github.com/NVIDIA/spark-rapids/issues/119).

## Regular Expressions

### regexp_replace

For the `regexp_replace` function, the RAPIDS Accelerator for Apache Spark currently supports
string literal matches only, not wildcard matches, and will fall back to the CPU if a regular
expression pattern is provided.

### RLike

The GPU implementation of `RLike` has the following known issues where behavior is not consistent with Apache Spark,
so this expression is disabled by default. It can be enabled by setting `spark.rapids.sql.expression.RLike=true`.

- `.` matches `\r` on the GPU but not on the CPU ([cuDF issue #9619](https://github.com/rapidsai/cudf/issues/9619))
- `$` does not match the end of string if the string ends with a line-terminator
([cuDF issue #9620](https://github.com/rapidsai/cudf/issues/9620))

`RLike` will fall back to CPU if any regular expressions are detected that are not supported on the GPU
or would produce different results on the GPU.
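
Since the GPU implementation is opt-in, it must be enabled explicitly. A minimal `spark-defaults.conf` fragment is shown below purely as an illustration; the same property can equally be passed with `--conf` on `spark-submit`:

```properties
# Opt in to the GPU implementation of RLike.
# Patterns that are unsupported on the GPU, or that would produce
# different results there, still fall back to the CPU automatically.
spark.rapids.sql.expression.RLike=true
```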

## Timestamps

Spark stores timestamps internally relative to the JVM time zone. Converting an arbitrary timestamp
@@ -569,60 +587,6 @@ distribution. Because the results are not bit-for-bit identical with the Apache
`approximate_percentile`, this feature is disabled by default and can be enabled by setting
`spark.rapids.sql.expression.ApproximatePercentile=true`.

## RLike

The GPU implementation of RLike has a number of known issues where behavior is not consistent with Apache Spark,
so this expression is disabled by default. It can be enabled by setting `spark.rapids.sql.expression.RLike=true`.

A summary of known issues is shown below but this is not intended to be a comprehensive list. We recommend that you
do your own testing to verify whether the GPU implementation of `RLike` is suitable for your use case.

We plan on improving the RLike functionality over time to make it more compatible with Spark, so this feature should
be used at your own risk with the expectation that the behavior will change in future releases.

### Multi-line handling

The GPU implementation of RLike supports `^` and `$` to represent the start and end of lines within a string but
Spark uses `^` and `$` to refer to the start and end of the entire string (equivalent to `\A` and `\Z`).

| Pattern | Input | Spark on CPU | Spark on GPU |
|---------|--------|--------------|--------------|
| `^A` | `A\nB` | Match | Match |
| `A$` | `A\nB` | No Match | Match |
| `^B` | `A\nB` | No Match | Match |
| `B$` | `A\nB` | Match | Match |

As a workaround, `\A` and `\Z` can be used instead of `^` and `$`.
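
The table above can be reproduced outside of Spark with Python's `re` module, purely as an illustration: its default mode mirrors the Spark-on-CPU behavior (anchors bind to the whole string) and its `re.MULTILINE` flag mirrors the GPU behavior (anchors bind to each line). Spark itself uses Java's regex engine, so this is a stand-in sketch, not the actual code path.

```python
import re

s = "A\nB"
for pattern in ["^A", "A$", "^B", "B$"]:
    cpu_like = bool(re.search(pattern, s))                # whole-string anchors, like Spark on CPU
    gpu_like = bool(re.search(pattern, s, re.MULTILINE))  # per-line anchors, like the GPU
    print(f"{pattern}: cpu={cpu_like} gpu={gpu_like}")
# A$ and ^B are the rows where the two modes disagree, matching the table.
```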

### Null support

The GPU implementation of RLike supports null characters in the input but does not support null characters in
the regular expression and will fall back to the CPU in this case.

### Quantifiers with nothing to repeat

Spark supports quantifiers in cases where there is nothing to repeat. For example, Spark supports the
possessive pattern `a*+`, which will match all inputs. The GPU implementation of RLike does not support this
syntax and will throw an exception with the message `nothing to repeat at position 0`.

### Stricter escaping requirements

The GPU implementation of RLike has stricter requirements around escaping special characters in some cases.

| Pattern | Input | Spark on CPU | Spark on GPU |
|-----------|--------|--------------|--------------|
| `a[-+]` | `a-` | Match | No Match |
| `a[\-\+]` | `a-` | Match | Match |
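
Python's `re` module agrees with Spark on the CPU for these two cases (again a stand-in illustration; Spark uses Java regex), and shows that the explicitly escaped form is accepted either way, making it the portable spelling:

```python
import re

# The CPU path accepts an unescaped '-' at the edge of a character class:
assert re.search(r"a[-+]", "a-")
# Both paths accept the explicitly escaped form, so prefer it for GPU compatibility:
assert re.search(r"a[\-\+]", "a-")
```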

### Empty groups

The GPU implementation of RLike does not support empty groups correctly.

| Pattern | Input | Spark on CPU | Spark on GPU |
|-----------|--------|--------------|--------------|
| `z()?` | `a` | No Match | Match |
| `z()*` | `a` | No Match | Match |
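
Python's `re` module shows the CPU-side expectation here (only a stand-in for Java's engine): the literal `z` is still required, so the input `a` should not match either pattern.

```python
import re

# A trailing empty group does not make the required literal 'z' optional:
assert re.search(r"z()?", "a") is None
assert re.search(r"z()*", "a") is None
```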

## Conditionals and operations with side effects (ANSI mode)

In Apache Spark condition operations like `if`, `coalesce`, and `case/when` lazily evaluate
24 changes: 17 additions & 7 deletions integration_tests/src/main/python/string_test.py
@@ -492,22 +492,30 @@ def test_rlike_embedded_null():
        conf={'spark.rapids.sql.expression.RLike': 'true'})

@allow_non_gpu('ProjectExec', 'RLike')
def test_rlike_fallback_null_pattern():
    gen = mk_str_gen('[abcd]{1,3}')
    assert_gpu_fallback_collect(
        lambda spark: unary_op_df(spark, gen).selectExpr(
            'a rlike "a\u0000"'),
        'RLike',
        conf={'spark.rapids.sql.expression.RLike': 'true'})

@allow_non_gpu('ProjectExec', 'RLike')
def test_rlike_fallback_empty_group():
    gen = mk_str_gen('[abcd]{1,3}')
    assert_gpu_fallback_collect(
        lambda spark: unary_op_df(spark, gen).selectExpr(
            'a rlike "a()?"'),
        'RLike',
        conf={'spark.rapids.sql.expression.RLike': 'true'})

def test_rlike_escape():
    gen = mk_str_gen('[ab]{0,2}[\\-\\+]{0,2}')
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, gen).selectExpr(
            'a rlike "a[\\\\-]"'),
        conf={'spark.rapids.sql.expression.RLike': 'true'})

@pytest.mark.xfail(reason='cuDF supports multiline by default but Spark does not - https://github.com/rapidsai/cudf/issues/9439')
def test_rlike_multi_line():
    gen = mk_str_gen('[abc]\n[def]')
    assert_gpu_and_cpu_are_equal_collect(
@@ -518,18 +526,20 @@ def test_rlike_multi_line():
            'a rlike "e$"'),
        conf={'spark.rapids.sql.expression.RLike': 'true'})

@pytest.mark.xfail(reason='cuDF has stricter requirements around escaping - https://github.com/rapidsai/cudf/issues/9434')
def test_rlike_missing_escape():
    gen = mk_str_gen('a[\\-\\+]')
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, gen).selectExpr(
            'a rlike "a[-]"',
            'a rlike "a[+-]"',
            'a rlike "a[a-b-]"'),
        conf={'spark.rapids.sql.expression.RLike': 'true'})

@allow_non_gpu('ProjectExec', 'RLike')
def test_rlike_fallback_possessive_quantifier():
    gen = mk_str_gen('(\u20ac|\\w){0,3}a[|b*.$\r\n]{0,2}c\\w{0,3}')
    assert_gpu_fallback_collect(
        lambda spark: unary_op_df(spark, gen).selectExpr(
            'a rlike "a*+"'),
        'RLike',
        conf={'spark.rapids.sql.expression.RLike': 'true'})