Transpile simple choice-type regular expressions into lists of choices to use with string replace multi #7967

NVnavkumar · 2023-03-29T05:29:56Z

This uses the updated stringReplace(Multi) API from rapidsai/cudf#12858 and rapidsai/cudf#12979 to optimize scenarios that involve simple choices in regular expressions. For example, the regular expression aa|bb can be transpiled to the list ["aa", "bb"], which can be converted to a ColumnVector and passed to the new stringReplace(Multi) API without using any regex. This results in a better speedup, especially for large strings.
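The idea can be illustrated with a minimal pure-Python sketch. It is not the plugin's Scala implementation and does not call cuDF. The names choices_from_regex, replace_multi and regexp_replace_sketch, and the metacharacter set, are assumptions made up for this illustration.

```python
import re

# Characters that make a pattern more than a plain list of literals.
# This set is an assumption for the sketch, not the plugin's actual list.
_REGEX_META = set(r"\^$.*+?()[]{}")


def choices_from_regex(pattern):
    """Return the literal alternatives of a simple choice regex, or None."""
    choices = pattern.split("|")
    for choice in choices:
        # An empty alternative or any metacharacter means "not a simple choice".
        if not choice or any(ch in _REGEX_META for ch in choice):
            return None
    return choices


def replace_multi(text, targets, replacement):
    """Replace every occurrence of any target, scanning left to right."""
    out = []
    i = 0
    while i < len(text):
        for target in targets:
            # Try the targets in order, like regex alternation does.
            if text.startswith(target, i):
                out.append(replacement)
                i += len(target)
                break
        else:
            # No target starts here: copy one character and move on.
            out.append(text[i])
            i += 1
    return "".join(out)


def regexp_replace_sketch(text, pattern, replacement):
    """Use the list-of-choices path when possible, else fall back to regex."""
    choices = choices_from_regex(pattern)
    # A backslash in the replacement may be a Python-style backreference,
    # so the regex path is kept for that case.
    if choices is None or "\\" in replacement:
        return re.sub(pattern, replacement, text)
    return replace_multi(text, choices, replacement)
```

For a pattern such as aa|bb the sketch never touches the regex engine, while a pattern such as a+|b falls back to re.sub.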

Some performance numbers:

String Length	GPU Speedup (Original)	GPU Speedup (Multi)
10	0.8200795371	0.7981159001
25	0.8328584473	0.7882058449
100	0.975628728	0.9526859394
200	0.8060741012	0.8109263682
400	1.02905312	1.045528894
800	1.44341563	1.475320584
1600	1.233090129	1.460763825
2000	1.677035523	1.658073974
3000	2.040379172	2.309269186
4000	2.296538473	2.506593343
5000	2.230454971	2.373438022
6000	2.550672573	3.261460184
7000	2.388433052	3.50422513
8000	2.810501651	2.896154833
9000	3.155946873	3.947336732
10000	2.780049845	3.173682502
11000	3.285384808	4.632360605
12000	2.684904315	3.93549146
13000	3.549764365	4.779788331
14000	3.013913874	3.761636145
15000	3.259679957	4.418292353
16000	3.071592244	4.162214575
17000	3.782882687	4.88841021
18000	3.568119919	4.160343195
19000	3.515424049	5.011750753
20000	3.597816177	5.249714386
21000	3.564485739	4.653738049
22000	3.470913154	4.506029904
23000	3.861551689	5.093572397
24000	3.340880507	4.611218396
25000	3.620928017	4.787056352
26000	3.769597192	4.690256848
27000	3.780631681	5.370580964
28000	3.841283666	5.518636173
29000	4.038020021	5.073208157
30000	4.073250526	5.277706773

This test created a Parquet file with 4096 rows of a single string column for each of the string lengths. It then ran regexp_replace on the dataframe through SQL, using a regular expression that was a simple choice (e.g. aaaaa|bbbbb), and wrote the result back to Parquet. Parquet was used because keeping the entire dataset in memory results in OOM (running out of heap space on the JVM) on the CPU as the string length increases. The effective speedup is posted in the table above. You can see this in the graph here:

One interesting result is that there is no obvious inflection point in terms of string length. In general, it looks like the multi-replace optimization performs about the same for short strings, and as the string size in the dataframe increases, the speedup gap widens between the GPU regexp_replace and the optimized version, which transpiles the regex choice into a list of strings that is passed to the stringReplace(Multi) API and executed in parallel on the GPU.
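One way to read the trend is to divide the multi-replace speedup by the original speedup for a few string lengths. The sketch below does that for five rows copied from the table. The helper name relative_gain is made up for this illustration.

```python
# (string length, speedup with the original regex path, speedup with multi-replace)
# Rows copied from the table in this PR description.
rows = [
    (10, 0.8200795371, 0.7981159001),
    (1600, 1.233090129, 1.460763825),
    (10000, 2.780049845, 3.173682502),
    (20000, 3.597816177, 5.249714386),
    (30000, 4.073250526, 5.277706773),
]


def relative_gain(original, multi):
    """Ratio of the multi-replace speedup to the original regex speedup."""
    return multi / original


gains = {length: relative_gain(orig, multi) for length, orig, multi in rows}
for length, gain in gains.items():
    print(f"{length:>6}: {gain:.2f}x")
```

A ratio near 1 means the two paths perform about the same, and a ratio above 1 means the multi-replace path is ahead.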

Source Code of Benchmark

Signed-off-by: Navin Kumar <navink@nvidia.com>

NVnavkumar · 2023-03-29T05:50:12Z

build

integration_tests/src/main/python/regexp_test.py

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala

gerashegalov

great improvements

gerashegalov · 2023-03-31T03:05:02Z

integration_tests/src/main/python/regexp_test.py

+def test_regexp_replace_fallback():
+    gen = mk_str_gen('[abcdef]{0,2}')
+
+    conf = { 'spark.rapids.sql.regexp.enabled': 'false' }


nit, here and other places, could use typed constants

Suggested change

conf = { 'spark.rapids.sql.regexp.enabled': 'false' }

conf = { 'spark.rapids.sql.regexp.enabled': False }

gerashegalov · 2023-03-31T03:39:59Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala

@@ -593,6 +593,11 @@ object GpuOverrides extends Logging {
    lit.value == null
  }

+  def isSupportedStringReplacePattern(strLit: String): Boolean = {
+    // check for regex special characters, except for \u0000 which we can support
+    !regexList.filterNot(_ == "\u0000").exists(pattern => strLit.contains(pattern))


I think it's more readable in a conjunctive form, but it is not part of your PR, so very optional:

Suggested change

!regexList.filterNot(_ == "\u0000").exists(pattern => strLit.contains(pattern))

!regexList.exists(pattern => pattern != "\u0000" && strLit.contains(pattern))
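As a sanity check that the two forms agree, here is a Python analogue of both predicates. The contents of regex_list are a hypothetical stand-in for the real regexList in GpuOverrides, and both function names are made up for this sketch.

```python
# Hypothetical stand-in for regexList; the real list lives in GpuOverrides.
regex_list = ["\\", "\u0000", "^", "$", ".", "*", "+", "?", "|",
              "(", ")", "[", "]", "{", "}"]


def is_supported_filter_form(str_lit):
    # Original shape: drop "\u0000" first, then look for any remaining special.
    allowed = [p for p in regex_list if p != "\u0000"]
    return not any(p in str_lit for p in allowed)


def is_supported_conjunctive_form(str_lit):
    # Suggested shape: one pass, skipping "\u0000" inside the predicate.
    return not any(p != "\u0000" and p in str_lit for p in regex_list)


samples = ["abc", "a\u0000b", "a.b", "aa|bb", ""]
for s in samples:
    assert is_supported_filter_form(s) == is_supported_conjunctive_form(s)
```

Both versions accept a literal that contains \u0000 and reject one that contains any other entry of the list, so the choice between them is about readability only.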

NVnavkumar · 2023-04-01T01:41:33Z

build

NVnavkumar · 2023-04-01T01:52:22Z

build

andygrove

LGTM

NVnavkumar added 5 commits March 19, 2023 16:43

WIP: String replace multi transpile for regexp

c24e3c6

Incorporate string replace multi for simple versions of regex choices

348ebea

Signed-off-by: Navin Kumar <navink@nvidia.com>

Merge branch 'branch-23.04' of github.com:NVIDIA/spark-rapids into re…

61a6c2a

…-transpile-replace-multi

Merge branch 'branch-23.04' of github.com:NVIDIA/spark-rapids into re…

4e1445e

…-transpile-replace-multi

Fix logic issue: regexp implmentation should always be used when ther…

7d7db6f

…e are backrefs Signed-off-by: Navin Kumar <navink@nvidia.com>

NVnavkumar requested review from andygrove and revans2 March 29, 2023 05:29

Fix scalastyle issue

5271605

Signed-off-by: Navin Kumar <navink@nvidia.com>

mattahrens assigned NVnavkumar Mar 29, 2023

andygrove reviewed Mar 30, 2023

integration_tests/src/main/python/regexp_test.py

andygrove reviewed Mar 30, 2023

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala

gerashegalov reviewed Mar 31, 2023

NVnavkumar added 3 commits March 31, 2023 09:41

Add support for choices inside capture groups

0e9ee4c

Signed-off-by: Navin Kumar <navink@nvidia.com>

Add unit test for getChoicesFromRegex

31fa6ed

Signed-off-by: Navin Kumar <navink@nvidia.com>

Merge branch 'branch-23.04' of github.com:NVIDIA/spark-rapids into re…

8d64e17

…-transpile-replace-multi

fix scalastyle

c0ab137

Signed-off-by: Navin Kumar <navink@nvidia.com>

sameerz added the performance A performance related task/issue label Apr 2, 2023

andygrove approved these changes Apr 3, 2023

NVnavkumar merged commit ade76be into NVIDIA:branch-23.04 Apr 3, 2023

NVnavkumar mentioned this pull request Apr 4, 2023

Fix regexp_replace integration test that should fallback when unicode is disabled #8031

Merged

revans2 mentioned this pull request Mar 15, 2024

[FEA] Find ways to convert regular expressions into faster operations #10600

Open
