Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transpile simple choice-type regular expressions into lists of choices to use with string replace multi #7967

Merged
merged 10 commits into from
Apr 3, 2023

Conversation

NVnavkumar
Copy link
Collaborator

Fixes #7907.

This uses the updated stringReplace(Multi) API from rapidsai/cudf#12858 and rapidsai/cudf#12979 to optimize scenarios that involve simple choices in regular expressions. For example, the regular expressions aa|bb can be transpiled to a list ["aa", "bb"] which can be converted to a ColumnVector and passed to the new stringReplace(Multi) API without using any regex. This results in an improved speedup (especially for large strings).

Some performance numbers:

String Length GPU Speedup (Original) GPU Speedup (Multi)
10 0.8200795371 0.7981159001
25 0.8328584473 0.7882058449
100 0.975628728 0.9526859394
200 0.8060741012 0.8109263682
400 1.02905312 1.045528894
800 1.44341563 1.475320584
1600 1.233090129 1.460763825
2000 1.677035523 1.658073974
3000 2.040379172 2.309269186
4000 2.296538473 2.506593343
5000 2.230454971 2.373438022
6000 2.550672573 3.261460184
7000 2.388433052 3.50422513
8000 2.810501651 2.896154833
9000 3.155946873 3.947336732
10000 2.780049845 3.173682502
11000 3.285384808 4.632360605
12000 2.684904315 3.93549146
13000 3.549764365 4.779788331
14000 3.013913874 3.761636145
15000 3.259679957 4.418292353
16000 3.071592244 4.162214575
17000 3.782882687 4.88841021
18000 3.568119919 4.160343195
19000 3.515424049 5.011750753
20000 3.597816177 5.249714386
21000 3.564485739 4.653738049
22000 3.470913154 4.506029904
23000 3.861551689 5.093572397
24000 3.340880507 4.611218396
25000 3.620928017 4.787056352
26000 3.769597192 4.690256848
27000 3.780631681 5.370580964
28000 3.841283666 5.518636173
29000 4.038020021 5.073208157
30000 4.073250526 5.277706773

This test created a Parquet file with 4096 rows of a single string column for each of string lengths. It then used a regular expression that was a simple choice (e.g. aaaaa|bbbbb), and then called regexp_replace on the dataframe using SQL, and then writes the result back to Parquet. This Parquet method was used due to the fact that keeping the entire dataset in memory results in OOM (running out of heap space on the JVM) on the CPU as the string length is increased. Posted is the effective speedup in a table. You can see this in the graph here:

image

One interesting result is that there is no obvious inflection point in terms of string length. In general, it looks like the multi-replace optimization will either perform about the same, and as the string size increases in the dataframe, the speedup difference between the GPU regexp_replace and the optimized version that transpiles the regex choice into a list of strings which can then be passed to the stringReplace(Multi) API and executed in parallel on the GPU.

Source Code of Benchmark

Signed-off-by: Navin Kumar <navink@nvidia.com>
@NVnavkumar
Copy link
Collaborator Author

build

Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great improvements

def test_regexp_replace_fallback():
gen = mk_str_gen('[abcdef]{0,2}')

conf = { 'spark.rapids.sql.regexp.enabled': 'false' }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, here and other places, could use typed constants

Suggested change
conf = { 'spark.rapids.sql.regexp.enabled': 'false' }
conf = { 'spark.rapids.sql.regexp.enabled': False }

@@ -593,6 +593,11 @@ object GpuOverrides extends Logging {
lit.value == null
}

def isSupportedStringReplacePattern(strLit: String): Boolean = {
// check for regex special characters, except for \u0000 which we can support
!regexList.filterNot(_ == "\u0000").exists(pattern => strLit.contains(pattern))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more readable in a conjunctive form, but it is not part of your PR, so very optional:

Suggested change
!regexList.filterNot(_ == "\u0000").exists(pattern => strLit.contains(pattern))
!regexList.exists(pattern => pattern != "\u0000" && strLit.contains(pattern))

Signed-off-by: Navin Kumar <navink@nvidia.com>
Signed-off-by: Navin Kumar <navink@nvidia.com>
@NVnavkumar
Copy link
Collaborator Author

build

Signed-off-by: Navin Kumar <navink@nvidia.com>
@NVnavkumar
Copy link
Collaborator Author

build

@sameerz sameerz added the performance A performance related task/issue label Apr 2, 2023
Copy link
Contributor

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance A performance related task/issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Optimize regexp_replace in multi-replace scenarios
4 participants