Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add functions for splitting strings #346

Merged
merged 2 commits into from
Nov 1, 2022

Conversation

richtia
Copy link
Member

@richtia richtia commented Sep 23, 2022

PR adds functions for splitting strings and fixed some spacing in the options of other functions.

- value: "varchar<L2>"
name: "separator"
description: A character used for splitting the string.
return: "List<varchar<L1>>"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't exactly sure how to put the return type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

-
name: regex_string_split
description: >-
Split a string into a list of strings, based on a regular expression pattern. The
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this could use a bit more explanation. I guess the idea is that it works the same as a regular string split, i.e. removing the substrings matched by the regex from the resulting string list. However, I could also imagine someone interpreting it as the regex picking only the split point, such that every character from the original string ends up in one of the returned list elements. Both implementations would be useful, but the one that removes the matched string is more expressive, because you could wrap the regex in a positive lookahead to mimic the other implementation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I updated the description to be more detailed. From what I've seen, the implementation that removes the matched substring is also how a bunch of different SQL dialects do it.

Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (but I'll leave this open for a little while as I'm no longer effectively the only reviewer)

@CLAassistant
Copy link

CLAassistant commented Oct 6, 2022

CLA assistant check
All committers have signed the CLA.

@richtia
Copy link
Member Author

richtia commented Oct 19, 2022

@westonpace could you give this one a look?

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question. Do we want to support a max_split argument? Most languages (Java, Python, JavaScript) support this but I notice it is not in some standard engines (e.g. PostgreSQL, SQL server, and DuckDb).

@westonpace
Copy link
Member

Per the sync meeting notes, since there are a significant # of engines that do not support max_split, that would be a second function (e.g. bounded_string_split or something). So let's move forward with this as is.

@westonpace westonpace merged commit 20a2f14 into substrait-io:main Nov 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants