[FEA] Add version of extract_re that takes an index #9855
Comments
Why include groups in the pattern that will not return anything? If you only care about the 2nd group, just change the pattern to remove the groups you don't need.
I cannot argue with the logic here, but this is just how the Spark function works and we have no control over how people invoke it. We could potentially rewrite the regexp pattern in the plugin to remove unreferenced groups, but I would be nervous about going that far.
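The rewriting idea mentioned above could be sketched roughly as follows: demote every capture group the caller does not reference to a non-capturing `(?:...)` group, so the regex engine never materializes the unused matches. This is a minimal sketch with a hypothetical helper name, and it deliberately ignores the edge cases (nested groups, named groups, character classes containing `(`) that make this risky to do in a real plugin.

```python
import re

# Hypothetical sketch: rewrite a pattern so that only the capture group the
# caller asked for stays capturing; all other groups become non-capturing.
# Assumes a flat pattern with no nested groups or '(' inside character classes.
def demote_unused_groups(pattern: str, keep_index: int) -> str:
    result = []
    group_no = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            result.append(pattern[i:i + 2])  # keep escaped characters verbatim
            i += 2
            continue
        if ch == "(" and not pattern.startswith("(?", i):
            group_no += 1
            # keep only the requested group as a capturing group
            result.append("(" if group_no == keep_index else "(?:")
        else:
            result.append(ch)
        i += 1
    return "".join(result)

rewritten = demote_unused_groups(r"(\d+)-(\d+)", 2)
print(rewritten)  # (?:\d+)-(\d+)
```

Note that after the rewrite the kept group's index changes (it becomes group 1), so the caller's index would also have to be remapped, which is one more reason the comment above expresses caution.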
Is the goal to leverage information downstream in the DAG that indicates some of the capture groups weren't actually necessary for this execution? Could you share a bit about the use cases where this comes up? For example, I can imagine it coming up in exploratory data analysis. Curious to understand how common/significant this is.
I am also curious to understand this problem. From the original description it doesn't seem like the request is stemming from some complex workflow where some groups are "discovered" to be unnecessary, but rather to optimize user workflows even when users invoke the function with suboptimal regexes. Naively this seems like a use case where we should be aiming to educate users as to best practices for performance rather than optimizing additional code paths, i.e. we should be documenting that including unnecessary groups in the regex will impact performance. That would be consistent with a lot of our messaging around cuDF Python, where there are often ways to do things with pandas that would translate to slow cuDF solutions, and we try to document and socialize knowledge about faster ways to do those things in cuDF without trying to optimize all the slower ways. Maybe I'm missing a key reason that this use case is different, though.
To add some more context to this request: Spark provides a SQL function regexp_extract with the signature regexp_extract(str, regexp, idx), which extracts a single group from the match rather than all of them.
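For illustration, the semantics of Spark's regexp_extract can be approximated with Python's re module. This is a sketch of the behavior being described, not Spark or cuDF code, and it glosses over details such as Spark's exact regex dialect:

```python
import re

# Approximate sketch of Spark SQL's regexp_extract(str, regexp, idx):
# return the text matched by capture group `idx`, or "" when there is no match.
def regexp_extract(s: str, pattern: str, idx: int) -> str:
    m = re.search(pattern, s)
    return m.group(idx) if m else ""

print(regexp_extract("100-200", r"(\d+)-(\d+)", 2))  # 200
```

The point of the request is that even though the user supplies a multi-group pattern, only the single group named by `idx` is ever needed downstream.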
Is your feature request related to a problem? Please describe.
From Spark, when we call extract_re we often are only interested in extracting a single group rather than all the groups in the pattern. We currently call extract_re, which returns a Table, and we then get the column we are interested in and discard the others. It would be more efficient if we could pass the column index to cuDF so that only one column needs instantiating.

Describe the solution you'd like
I would like a signature something like extract_re(pattern, index).

Describe alternatives you've considered
None

Additional context
None
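The difference between the current and requested API shapes can be sketched with plain Python lists standing in for cuDF columns. The function names extract_all_groups and extract_one_group are hypothetical illustrations, not cuDF APIs:

```python
import re

# Current shape: materialize a column for every capture group, even though
# the caller will keep only one and discard the rest.
def extract_all_groups(strings, pattern):
    compiled = re.compile(pattern)
    columns = [[] for _ in range(compiled.groups)]
    for s in strings:
        m = compiled.search(s)
        for g in range(compiled.groups):
            columns[g].append(m.group(g + 1) if m else None)
    return columns

# Requested shape: build only the single column the caller asked for.
def extract_one_group(strings, pattern, index):
    compiled = re.compile(pattern)
    out = []
    for s in strings:
        m = compiled.search(s)
        out.append(m.group(index) if m else None)
    return out

data = ["100-200", "300-400"]
print(extract_all_groups(data, r"(\d+)-(\d+)")[1])  # ['200', '400']
print(extract_one_group(data, r"(\d+)-(\d+)", 2))   # ['200', '400']
```

Both produce the same answer; the proposed extract_re(pattern, index) would simply avoid allocating and populating the columns that are thrown away.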