Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Implement extract_all_re function #9856

Closed
andygrove opened this issue Dec 7, 2021 · 6 comments · Fixed by #9909
Closed

[FEA] Implement extract_all_re function #9856

andygrove opened this issue Dec 7, 2021 · 6 comments · Fixed by #9909
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

@andygrove
Copy link
Contributor

Is your feature request related to a problem? Please describe.
I would like to be able to implement a GPU version of Spark's regexp_extract_all function. Here is an example usage from the documentation.

SELECT regexp_extract_all('100-200, 300-400', '(\\d+)-(\\d+)', 1);
 ["100","300"]

Describe the solution you'd like
I would like a cuDF function extract_all_re which is similar to extract_re but returns columns of type List<String> containing all matches for each group.

Additionally, it would be nice if I could optionally pass cuDF a group index so that we don't always have to extract for all of the groups in the pattern. There is a related issue #9855 asking for this capability for extract_re.

Describe alternatives you've considered
None

Additional context
None

@andygrove andygrove added feature request New feature or request Needs Triage Need team to review and classify labels Dec 7, 2021
@davidwendt davidwendt self-assigned this Dec 7, 2021
@davidwendt davidwendt added libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Dec 8, 2021
@davidwendt
Copy link
Contributor

Looking a bit closer, this does appear to be similar to extract_record but the all means continue match/extracting after the first match is found? Also, I'm guessing the example in the description includes an index feature that is the subject of a separate issue #9855 and therefore is a bit confusing here. Ignoring indexing, this is what I'm looking at implementing so far:

s = [ "100-200 300-400", "500-600", "700-800", "900" ]
r =  extract_all( s, "'(\d+)-(\d+)" )
r is a LIST column of strings that looks like this:

[ [ "100", "200", "300", "400" ], // 2 sets of matches
  [ "500", "600" ], // 1 match
  [ "700", "800" ], // 1 match
  NULL              // no match
]

The 1st string "100-200 300-400" had 2 matches to (\d+)-(\d+) and so both sets of groups are returned.
The 2nd and 3rd strings each had one match and included those in the output rows.
The 4th string "900" does not match the pattern and so the output for this row is NULL.

@andygrove
Copy link
Contributor Author

andygrove commented Dec 10, 2021

Yes, the all here means continue matching. Your example looks correct to me, ignoring indexing. For reference, here is the equivalent example in Spark:

scala> val df = Seq("100-200 300-400", "500-600", "700-800", "900").toDF("str")
scala> df.createOrReplaceTempView("strings")

scala> spark.sql("SELECT regexp_extract_all(str, '(\\\\d+)-(\\\\d+)', 0) FROM strings").show
+---------------------------------------+
|regexp_extract_all(str, (\d+)-(\d+), 0)|
+---------------------------------------+
|                     [100-200, 300-400]|
|                              [500-600]|
|                              [700-800]|
|                                     []|
+---------------------------------------+

scala> spark.sql("SELECT regexp_extract_all(str, '(\\\\d+)-(\\\\d+)', 1) FROM strings").show
+---------------------------------------+
|regexp_extract_all(str, (\d+)-(\d+), 1)|
+---------------------------------------+
|                             [100, 300]|
|                                  [500]|
|                                  [700]|
|                                     []|
+---------------------------------------+

scala> spark.sql("SELECT regexp_extract_all(str, '(\\\\d+)-(\\\\d+)', 2) FROM strings").show
+---------------------------------------+
|regexp_extract_all(str, (\d+)-(\d+), 2)|
+---------------------------------------+
|                             [200, 400]|
|                                  [600]|
|                                  [800]|
|                                     []|
+---------------------------------------+

@davidwendt
Copy link
Contributor

Please do not include indexing examples here.

I don't understand the first example.

+---------------------------------------+
|regexp_extract_all(str, (\d+)-(\d+), 0)|
+---------------------------------------+
|                     [100-200, 300-400]|
|                              [500-600]|
|                              [700-800]|
|                                     []|
+---------------------------------------+

This does not look like extract to me.
Extract will return the values found inside the groups ( ). Like [ "100", "200" ] in my previous comment post.
Your example now appears to return the entire match result instead. Like [ "100-200" ]

@andygrove
Copy link
Contributor Author

The Spark function is confusing. A zero index is a special case and returns all of the strings that match the entire regexp pattern. I will see if the existing cuDF functions can already support this case and will file a new issue if not.

For the "normal" indexing case I think I can implement based on the proposed results that you previously posted.

@davidwendt
Copy link
Contributor

I think the closest behavior we have is findall
But I suspect you will need _record version of this API to return a lists column instead of a table. This would match the result in this comment.

Do you still need an extract_all as I've described in my comment?
Or perhaps an extract_record that would be similar to this but would only return the first matched group set.

@andygrove
Copy link
Contributor Author

Do you still need an extract_all as I've described in my comment?

Yes, I believe that this would give me what I need.

@beckernick beckernick removed the Needs Triage Need team to review and classify label Dec 16, 2021
@rapids-bot rapids-bot bot closed this as completed in #9909 Jan 5, 2022
rapids-bot bot pushed a commit that referenced this issue Jan 5, 2022
Closes #9856 

Adds a new `cudf::strings::extract_all` API that returns a LIST column of extracted strings given a regex pattern.

This is similar to nvstrings version of `extract` called `extract_record` but returns groups from all matches in each string instead of just the first match. Here is pseudo code of it's behavior on various strings input:
```
s = [ "ABC-200 DEF-400", "GHI-60", "JK-800", "900", NULL ]
r =  extract_all( s, "'(\w+)-(\d+)" )
r is a LIST column of strings that looks like this:

[ [ "ABC", "200", "DEF", "400" ], // 2 matches
  [ "GHI", "60" ], // 1 match
  [ "JK", "800" ], // 1 match
  NULL,            // no match
  NULL
]
```
Each match results in two groups as specified in the regex pattern.

Also reorganized the extract source code into `src/strings/extract` directory.
The match-counting has been factored out into new `count_matches.cuh` since it will become common code used with `findall_record` in a follow on PR.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Bradley Dice (https://github.com/bdice)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #9909
rapids-bot bot pushed a commit that referenced this issue Jan 27, 2022
Reference #9856 specifically #9856 (comment)

Adds `cudf::strings::findall_record` which was initially implemented in nvstrings but not ported over since LIST column types did not exist at the time and returning a vector of small columns was very inefficient. This API should also allow using the current python function `cudf.str.findall()` with the `expand=False` parameter more effectively. A follow-on PR will address these python changes.

This PR reorganizes the libcudf strings _find_ source files into the `cpp/src/strings/search` subdirectory as well. Also, `findall()` has only a regex version so the `_re` suffix is dropped from the name in the libcudf implementation. The python changes in this PR address only the name change and the addition of the new API in the cython interface.

Depends on #9909 -- shares the `cudf::strings::detail::count_matches()` utility function.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9911
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants