[FEA] Implement extract_all_re function #9856
Comments
Looking a bit closer, this does appear to be similar to
The 1st string
Yes, the
Please do not include indexing examples here. I don't understand the first example.
This does not look like extract to me.
The Spark function is confusing. A zero index is a special case and returns all of the strings that match the entire regexp pattern. I will see if the existing cuDF functions can already support this case and will file a new issue if not. For the "normal" indexing case I think I can implement based on the proposed results that you previously posted.
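To make the index behavior discussed above concrete, here is a minimal Python sketch that emulates it with the standard `re` module. It is an illustration only, not Spark or cuDF code: the function body is hypothetical, Python and Java regex dialects differ in places, and mapping a non-participating group to an empty string is an assumption. The default index of 1 follows the Spark documentation.

```python
import re
from typing import List, Optional


def regexp_extract_all(s: Optional[str], pattern: str, idx: int = 1) -> Optional[List[str]]:
    """Emulate the group-index semantics on a single string.

    idx == 0 returns the text of every whole match;
    idx == k returns capture group k from every match.
    """
    if s is None:
        return None  # a null input row stays null
    compiled = re.compile(pattern)
    if idx < 0 or idx > compiled.groups:
        raise ValueError(f"group index {idx} out of range for pattern {pattern!r}")
    # Assumption: a group that did not participate in a match becomes "".
    return [m.group(idx) or "" for m in compiled.finditer(s)]


text = "100-200, 300-400"
whole = regexp_extract_all(text, r"(\d+)-(\d+)", 0)   # zero index: whole matches
first = regexp_extract_all(text, r"(\d+)-(\d+)", 1)   # group 1 of each match
second = regexp_extract_all(text, r"(\d+)-(\d+)", 2)  # group 2 of each match
```

With index 0 the result is the list of whole matches; with a positive index it is one group per match, which is the "normal" indexing case.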
I think the closest behavior we have is Do you still need an
Yes, I believe that this would give me what I need.
Closes #9856

Adds a new `cudf::strings::extract_all` API that returns a LIST column of extracted strings given a regex pattern. This is similar to the nvstrings version of `extract` called `extract_record`, but it returns groups from all matches in each string instead of just the first match.

Here is pseudocode of its behavior on various input strings:

```
s = [ "ABC-200 DEF-400", "GHI-60", "JK-800", "900", NULL ]
r = extract_all( s, "(\w+)-(\d+)" )
r is a LIST column of strings that looks like this:
[ [ "ABC", "200", "DEF", "400" ], // 2 matches
  [ "GHI", "60" ],                // 1 match
  [ "JK", "800" ],                // 1 match
  NULL,                           // no match
  NULL ]
```

Each match results in two groups as specified in the regex pattern.

Also reorganized the extract source code into the `src/strings/extract` directory. The match-counting has been factored out into a new `count_matches.cuh` since it will become common code used with `findall_record` in a follow-on PR.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Bradley Dice (https://github.com/bdice)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: #9909
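The behavior described in this PR can be approximated on the CPU with a short Python sketch. This is not the libcudf implementation; the helper below is a hypothetical stand-in built on Python's `re` module, using the pattern `(\w+)-(\d+)` and modeling NULL as `None`.

```python
import re
from typing import List, Optional


def extract_all(strings: List[Optional[str]], pattern: str) -> List[Optional[List[Optional[str]]]]:
    """Return, per input string, the groups of every match flattened into one list.

    Rows that are null or have no match become None, mirroring the NULL
    rows described in the PR text.
    """
    compiled = re.compile(pattern)
    result = []
    for s in strings:
        if s is None:
            result.append(None)  # null input row
            continue
        # Concatenate the capture groups of each successive match.
        groups = [g for m in compiled.finditer(s) for g in m.groups()]
        result.append(groups if groups else None)  # no match -> null row
    return result


s = ["ABC-200 DEF-400", "GHI-60", "JK-800", "900", None]
r = extract_all(s, r"(\w+)-(\d+)")
```

Each row's list length is the number of matches times the number of groups in the pattern, so the first row holds four strings from two matches.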
Reference #9856, specifically #9856 (comment)

Adds `cudf::strings::findall_record`, which was initially implemented in nvstrings but not ported over since LIST column types did not exist at the time and returning a vector of small columns was very inefficient. This API should also allow using the current python function `cudf.str.findall()` with the `expand=False` parameter more effectively. A follow-on PR will address these python changes.

This PR reorganizes the libcudf strings _find_ source files into the `cpp/src/strings/search` subdirectory as well. Also, `findall()` has only a regex version, so the `_re` suffix is dropped from the name in the libcudf implementation. The python changes in this PR address only the name change and the addition of the new API in the cython interface.

Depends on #9909 -- shares the `cudf::strings::detail::count_matches()` utility function.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #9911
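As a rough illustration of why a shared match-counting step is useful when producing a LIST column, here is a hedged Python sketch of a count-then-fill pass. It is not the libcudf code: the function names are hypothetical, the offsets-plus-child layout is a simplified Arrow-style list representation, and the null mask is omitted for brevity.

```python
import re
from itertools import accumulate
from typing import List, Optional, Tuple


def count_matches(strings: List[Optional[str]], compiled: "re.Pattern[str]") -> List[int]:
    """First pass: count matches per row (null rows count as zero)."""
    return [0 if s is None else sum(1 for _ in compiled.finditer(s)) for s in strings]


def findall_as_list_column(strings: List[Optional[str]], pattern: str) -> Tuple[List[int], List[str]]:
    """Second pass: size the output from the counts, then fill it in place."""
    compiled = re.compile(pattern)
    counts = count_matches(strings, compiled)
    offsets = [0, *accumulate(counts)]   # row i owns child[offsets[i]:offsets[i + 1]]
    child = [""] * offsets[-1]           # flat storage for every matched string
    for row, s in enumerate(strings):
        if s is None:
            continue
        for k, m in enumerate(compiled.finditer(s)):
            child[offsets[row] + k] = m.group(0)  # whole match, not groups
    return offsets, child


offsets, child = findall_as_list_column(["bunny", "rabbit", "hare", "dog", None], "[ab]")
```

Because the counts fix every row's position in the flat child buffer up front, each row can be filled independently, which is the property a parallel implementation would rely on.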
Is your feature request related to a problem? Please describe.
I would like to be able to implement a GPU version of Spark's regexp_extract_all function. Here is an example usage from the documentation.
Describe the solution you'd like
I would like a cuDF function `extract_all_re` which is similar to `extract_re` but returns columns of type `List<String>` containing all matches for each group.
Additionally, it would be nice if I could optionally pass cuDF a group index so that we don't always have to extract for all of the groups in the pattern. There is a related issue #9855 asking for this capability for `extract_re`.
Describe alternatives you've considered
None
Additional context
None