[FEA] Regex: Provide option for '.' to not match '\r' #9619

Closed
andygrove opened this issue Nov 5, 2021 · 4 comments

@andygrove

Is your feature request related to a problem? Please describe.
In Python and cuDF, . matches every character except line feed (\n) by default, so it matches \r. In Java (as used by Spark), . does not match \r by default, which makes the two incompatible.

cuDF example:

>>> cudf.Series(['a\rb']).str.contains('a.', regex=True)
0    True
dtype: bool

>>> cudf.Series(['a\nb']).str.contains('a.', regex=True)
0    False
dtype: bool

Spark example:

scala> val df = Seq("a\rb", "a\nb").toDF("c0")
scala> df.createOrReplaceTempView("t1")
scala> spark.sql("select c0 rlike 'a.' from t1").show
+-----------+
|c0 RLIKE a.|
+-----------+
|      false|
|      false|
+-----------+

Describe the solution you'd like
I would like an option to specify whether . should match \r, so that we can match the default Java behavior.

Describe alternatives you've considered
None

Additional context
None

@andygrove andygrove added feature request New feature or request Needs Triage Need team to review and classify labels Nov 5, 2021
@davidwendt davidwendt self-assigned this Nov 5, 2021
@davidwendt

I'm reluctant to add specific support for the \r character since it could create confusion in other places where \n is handled, such as ^ and $ and the built-in classes \W and \D.
Could you replace the \r with \n in the input string? Since both are single byte characters, this could be done in-place on the chars column.

@andygrove

I think the best workaround available to me is to check whether the regex pattern contains . and, if it does, inspect the input for \r and fall back to the CPU when it is present. Replacing \r with \n could have side effects, depending on what the regex pattern is.
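
A minimal sketch of that check in plain Python, not taken from cuDF or the Spark plugin. The helper name needs_cpu_fallback is hypothetical, and the test for . is deliberately conservative: it also fires on an escaped \. or on a . inside a character class, which only costs an unnecessary fallback.

```python
from typing import Iterable


def needs_cpu_fallback(pattern: str, values: Iterable[str]) -> bool:
    """Return True when '.' could behave differently on the two engines.

    Hypothetical helper: the semantics only diverge when the pattern
    uses '.' and at least one input string contains a carriage return.
    """
    if "." not in pattern:
        # No '.' at all, so \r handling cannot differ.
        return False
    return any("\r" in value for value in values)


print(needs_cpu_fallback("a.", ["a\rb", "a\nb"]))
print(needs_cpu_fallback("a.", ["a\nb"]))
```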

@davidwendt

Another option is to use [^\r\n] instead of .
This would match any character but \r or \n.
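
As an illustration, here is the difference on the strings from the issue, using Python's standard re module as a stand-in for cuDF (both treat . as "anything but \n" by default). Note that Java's default . also excludes a few other line terminators (\u0085, \u2028, \u2029), which this class does not cover.

```python
import re

samples = ["a\rb", "a\nb", "ab"]

# Python's default '.' excludes only \n, so it matches the \r in the first sample.
default_dot = [bool(re.search(r"a.", s)) for s in samples]

# The explicit class excludes both \r and \n.
explicit_class = [bool(re.search(r"a[^\r\n]", s)) for s in samples]

print(default_dot)     # [True, False, True]
print(explicit_class)  # [False, False, True]
```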

@andygrove

@davidwendt Thanks for the suggestion. We have gone with the approach of transpiling the regular expressions and replacing . with [^\r\n], so I am closing this issue.
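
For readers curious what such a rewrite might look like, here is a rough sketch in Python. It is not the transpiler actually used; replace_dot is a hypothetical name, and the sketch assumes simple patterns: it leaves escape sequences and character classes untouched but does not handle edge cases such as a literal ] placed first in a class.

```python
import re


def replace_dot(pattern: str) -> str:
    """Replace each unescaped '.' outside a character class with [^\\r\\n]."""
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Copy an escape sequence such as \. or \d unchanged.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if in_class:
            # Inside [...] a '.' is already a literal dot.
            if ch == "]":
                in_class = False
            out.append(ch)
        elif ch == "[":
            in_class = True
            out.append(ch)
        elif ch == ".":
            out.append(r"[^\r\n]")
        else:
            out.append(ch)
        i += 1
    return "".join(out)


rewritten = replace_dot("a.")
print(rewritten)
print(bool(re.search(rewritten, "a\rb")))
```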

@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024