I'm reluctant to add specific support for the \r character, since it could create confusion in other places where \n is handled, such as ^ and $ as well as the built-in classes \W and \D.
Could you replace the \r with \n in the input string? Since both are single byte characters, this could be done in-place on the chars column.
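For reference, the defaults being discussed can be checked with Python's built-in re module. This is only a sketch: re is used as a stand-in for the cuDF regex engine (an assumption, the two are not identical), and the byte swap imitates the suggested in-place replacement on a plain bytes value rather than on a cuDF chars column.

```python
import re

# Python default: '.' matches '\r' but does not match '\n'.
dot_matches_cr = re.fullmatch(r".", "\r") is not None
dot_matches_lf = re.fullmatch(r".", "\n") is not None

# '$' treats a trailing '\n' specially, but not a trailing '\r'.
dollar_before_lf = re.search(r"a$", "a\n") is not None
dollar_before_cr = re.search(r"a$", "a\r") is not None

# Suggested workaround: swap '\r' for '\n' before matching.
# Both are a single byte in UTF-8, so the data length is unchanged.
raw = b"a\rb"
swapped = raw.replace(b"\r", b"\n")
same_length = len(raw) == len(swapped)

# After the swap, '.' no longer matches the middle character.
dot_matches_after_swap = re.fullmatch(rb"a.b", swapped) is not None
```

With these defaults, only the \r case differs from Java, which is why swapping it for \n makes . behave the Java way for that character.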
I think the best workaround I can offer is this: if the regex pattern contains ., inspect the input first to see whether it contains \r, and fall back to the CPU in that case. Replacing \r with \n could have side effects, but it all depends on what the regex pattern is.
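A rough sketch of that pre-check, plus one example of the side effect, might look like the following. The function name needs_cpu_fallback is hypothetical, and plain Python strings stand in for the real column data.

```python
import re
from typing import Iterable


def needs_cpu_fallback(pattern: str, values: Iterable[str]) -> bool:
    """Hypothetical pre-check: True if the pattern has '.' and any input has a carriage return."""
    # Deliberately naive: any '.' in the pattern text counts, including
    # escaped dots and dots inside character classes, so this errs on the
    # side of falling back too often rather than too rarely.
    return "." in pattern and any("\r" in v for v in values)


# Side effect of rewriting the input instead: a pattern that looks for
# '\r' itself stops matching once '\r' has been replaced with '\n'.
original = "a\rb"
rewritten = original.replace("\r", "\n")
found_before = re.search(r"\r", original) is not None
found_after = re.search(r"\r", rewritten) is not None
```

The check scans every input value, so it costs an extra pass over the data whenever the pattern contains a dot.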
@davidwendt Thanks for the suggestion. We have gone with the approach of transpiling the regular expressions and replacing . with [^\r\n], so I am closing this issue.
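To illustrate the idea, here is a minimal sketch of such a rewrite in Python. It is not the actual transpiler: transpile_dot is a hypothetical helper that only skips escaped characters and simple character classes, and it ignores edge cases such as a literal ] placed first inside a class.

```python
import re


def transpile_dot(pattern: str) -> str:
    """Replace each unescaped '.' outside a character class with [^\\r\\n]."""
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Copy an escape sequence such as '\.' through unchanged.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        if ch == "." and not in_class:
            out.append(r"[^\r\n]")
        else:
            out.append(ch)
        i += 1
    return "".join(out)


rewritten = transpile_dot("a.b")
matches_cr = re.fullmatch(rewritten, "a\rb") is not None
matches_x = re.fullmatch(rewritten, "axb") is not None
```

Rewriting the pattern leaves the input data untouched, so it avoids the side effects of replacing \r in the strings themselves.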
Is your feature request related to a problem? Please describe.
In Python and cuDF, the default behavior is that . will match all characters except for line-feeds, so this will match on \r. However, the default behavior in Java (as used by Spark) is that . will not match \r so this leads to an incompatibility.
cuDF example:
Spark example:
Describe the solution you'd like
I would like an option to specify whether . should match \r so that we can match the default Java behavior.
Describe alternatives you've considered
None
Additional context
None