feat: add functions for splitting strings #346

Merged
merged 2 commits into from
Nov 1, 2022
122 changes: 95 additions & 27 deletions extensions/functions_string.yaml
@@ -101,13 +101,13 @@ scalar_functions:
impls:
- args:
- name: case_sensitivity
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII]
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII ]
required: false
- name: multiline
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED]
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED ]
required: false
- name: dotall
options: [ DOTALL_DISABLED, DOTALL_ENABLED]
options: [ DOTALL_DISABLED, DOTALL_ENABLED ]
required: false
- value: "varchar<L1>"
name: "input"
@@ -120,13 +120,13 @@ scalar_functions:
return: "varchar<L1>"
- args:
- name: case_sensitivity
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII]
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII ]
required: false
- name: multiline
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED]
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED ]
required: false
- name: dotall
options: [ DOTALL_DISABLED, DOTALL_ENABLED]
options: [ DOTALL_DISABLED, DOTALL_ENABLED ]
required: false
- value: "string"
name: "input"
@@ -523,13 +523,13 @@ scalar_functions:
impls:
- args:
- name: case_sensitivity
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII]
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII ]
required: false
- name: multiline
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED]
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED ]
required: false
- name: dotall
options: [ DOTALL_DISABLED, DOTALL_ENABLED]
options: [ DOTALL_DISABLED, DOTALL_ENABLED ]
required: false
- value: "varchar<L1>"
name: "input"
@@ -542,13 +542,13 @@ scalar_functions:
return: i64
- args:
- name: case_sensitivity
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII]
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII ]
required: false
- name: multiline
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED]
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED ]
required: false
- name: dotall
options: [ DOTALL_DISABLED, DOTALL_ENABLED]
options: [ DOTALL_DISABLED, DOTALL_ENABLED ]
required: false
- value: "string"
name: "input"
@@ -620,13 +620,13 @@ scalar_functions:
impls:
- args:
- name: case_sensitivity
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII]
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII ]
required: false
- name: multiline
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED]
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED ]
required: false
- name: dotall
options: [ DOTALL_DISABLED, DOTALL_ENABLED]
options: [ DOTALL_DISABLED, DOTALL_ENABLED ]
required: false
- value: "string"
name: "input"
@@ -637,13 +637,13 @@ scalar_functions:
return: i64
- args:
- name: case_sensitivity
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII]
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII ]
required: false
- name: multiline
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED]
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED ]
required: false
- name: dotall
options: [ DOTALL_DISABLED, DOTALL_ENABLED]
options: [ DOTALL_DISABLED, DOTALL_ENABLED ]
required: false
- value: "varchar<L1>"
name: "input"
@@ -654,13 +654,13 @@ scalar_functions:
return: i64
- args:
- name: case_sensitivity
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII]
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII ]
required: false
- name: multiline
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED]
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED ]
required: false
- name: dotall
options: [ DOTALL_DISABLED, DOTALL_ENABLED]
options: [ DOTALL_DISABLED, DOTALL_ENABLED ]
required: false
- value: "fixedchar<L1>"
name: "input"
@@ -1015,13 +1015,13 @@ scalar_functions:
impls:
- args:
- name: case_sensitivity
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII]
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII ]
required: false
- name: multiline
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED]
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED ]
required: false
- name: dotall
options: [ DOTALL_DISABLED, DOTALL_ENABLED]
options: [ DOTALL_DISABLED, DOTALL_ENABLED ]
required: false
- value: "string"
name: "input"
@@ -1041,13 +1041,13 @@ scalar_functions:
return: "string"
- args:
- name: case_sensitivity
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII]
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII ]
required: false
- name: multiline
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED]
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED ]
required: false
- name: dotall
options: [ DOTALL_DISABLED, DOTALL_ENABLED]
options: [ DOTALL_DISABLED, DOTALL_ENABLED ]
required: false
- value: "varchar<L1>"
name: "input"
@@ -1263,6 +1263,74 @@ scalar_functions:
- value: i32
name: "count"
return: "string"
-
name: string_split
description: >-
Split a string into a list of strings, based on a specified `separator` character.
impls:
- args:
- value: "varchar<L1>"
name: "input"
description: The input string.
- value: "varchar<L2>"
name: "separator"
description: A character used for splitting the string.
return: "List<varchar<L1>>"
Member Author:
Wasn't exactly sure how to put the return type.

Contributor:
Looks good to me.
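
For readers unfamiliar with the intended behavior, here is a minimal Python sketch of what `string_split` describes: splitting on a literal separator and returning a list of strings. The Python function below is hypothetical and is not part of this PR. The YAML does not say how adjacent separators or an empty separator are handled; the sketch simply inherits Python's `str.split` behavior (empty fields are kept), which is an assumption.

```python
from typing import List


def string_split(text: str, separator: str) -> List[str]:
    """Hypothetical reference behavior for `string_split`.

    Splits `text` on every occurrence of the literal `separator` and drops
    the separator itself. Keeping empty fields between adjacent separators
    is an assumption; the YAML description does not specify it.
    """
    return text.split(separator)


print(string_split("a,b,c", ","))  # three fields
print(string_split("a,,b", ","))   # the empty middle field is kept
```

The return value corresponds to the `List<varchar<L1>>` / `List<string>` return types declared in the YAML: one list element per field.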

- args:
- value: "string"
name: "input"
description: The input string.
- value: "string"
name: "separator"
description: A character used for splitting the string.
return: "List<string>"
-
name: regex_string_split
description: >-
Split a string into a list of strings, based on a regular expression pattern. The
Contributor:
I feel like this could use a bit more explanation. I guess the idea is that it works the same as a regular string split, i.e. removing the substrings matched by the regex from the resulting string list. However, I could also imagine someone interpreting it as the regex picking only the split point, such that every character from the original string ends up in one of the returned list elements. Both implementations would be useful, but the one that removes the matched string is more expressive, because you could wrap the regex in a positive lookahead to mimic the other implementation.

Member Author:
Good point. I updated the description to be more detailed. From what I've seen, the implementation that removes the matched substring is also how a bunch of different SQL dialects do it.
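
The two interpretations discussed in this thread can be sketched as follows. This is only an illustration in Python, whose `re` module is not the ICU engine the description refers to, so syntax details may differ; both helper names are hypothetical. It assumes Python 3.7 or later, where `re.split` splits on zero-width matches.

```python
import re
from typing import List


def split_removing_matches(text: str, pattern: str) -> List[str]:
    """Matched substrings are removed from the result (the behavior chosen here)."""
    # Caveat: if `pattern` contains capturing groups, Python's re.split also
    # returns the captured text; the patterns below avoid capturing groups.
    return re.split(pattern, text)


def split_at_lookahead(text: str, pattern: str) -> List[str]:
    """Wrap the pattern in a positive lookahead so it only picks the split point."""
    return re.split(f"(?={pattern})", text)


removed = split_removing_matches("a, b,c", r",\s*")
kept = split_at_lookahead("a,b,c", ",")

print(removed)        # separators are gone
print(kept)           # separators stay attached to the following element
print("".join(kept))  # every original character is preserved
```

The second helper shows why the match-removing variant is the more expressive one: the split-point-only behavior can be recovered from it by wrapping the pattern in a lookahead, but not the other way around.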

regular expression pattern should follow the International Components for Unicode
implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html).

The `case_sensitivity` option specifies case-sensitive or case-insensitive matching.
Enabling the `multiline` option will treat the input string as multiple lines. This makes
the `^` and `$` characters match at the beginning and end of any line, instead of just the
beginning and end of the input string. Enabling the `dotall` option makes the `.` character
match line terminator characters in a string.
impls:
- args:
- name: case_sensitivity
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII ]
required: false
- name: multiline
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED ]
required: false
- name: dotall
options: [ DOTALL_DISABLED, DOTALL_ENABLED ]
required: false
- value: "varchar<L1>"
name: "input"
description: The input string.
- value: "varchar<L2>"
name: "pattern"
description: The regular expression to search for within the input string.
return: "List<varchar<L1>>"
- args:
- name: case_sensitivity
options: [ CASE_SENSITIVE, CASE_INSENSITIVE, CASE_INSENSITIVE_ASCII ]
required: false
- name: multiline
options: [ MULTILINE_DISABLED, MULTILINE_ENABLED ]
required: false
- name: dotall
options: [ DOTALL_DISABLED, DOTALL_ENABLED ]
required: false
- value: "string"
name: "input"
description: The input string.
- value: "string"
name: "pattern"
description: The regular expression to search for within the input string.
return: "List<string>"
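
As a rough illustration of how the three options could affect matching, the sketch below maps them onto Python's `re` flags. This mapping is an assumption made for the example: the YAML specifies ICU regular expressions, and Python's engine only approximates them. The function signature and default option values are hypothetical.

```python
import re
from typing import List


def regex_string_split(
    text: str,
    pattern: str,
    case_sensitivity: str = "CASE_SENSITIVE",
    multiline: str = "MULTILINE_DISABLED",
    dotall: str = "DOTALL_DISABLED",
) -> List[str]:
    """Hypothetical sketch: split `text` on `pattern`, removing the matches."""
    flags = 0
    if case_sensitivity == "CASE_INSENSITIVE":
        flags |= re.IGNORECASE
    elif case_sensitivity == "CASE_INSENSITIVE_ASCII":
        # Assumed mapping: case folding restricted to ASCII letters.
        flags |= re.IGNORECASE | re.ASCII
    if multiline == "MULTILINE_ENABLED":
        # `^` and `$` match at the start and end of every line.
        flags |= re.MULTILINE
    if dotall == "DOTALL_ENABLED":
        # `.` also matches line terminators.
        flags |= re.DOTALL
    return re.split(pattern, text, flags=flags)


# Case sensitivity: only the lowercase "x" matches by default.
print(regex_string_split("aXbxc", "x"))
print(regex_string_split("aXbxc", "x", case_sensitivity="CASE_INSENSITIVE"))

# Multiline: `^-` can only match at a line start after the first line
# when multiline mode is enabled.
print(regex_string_split("x\n-y", r"^-", multiline="MULTILINE_ENABLED"))

# Dotall: `.` must match the newline for `a.b` to match here.
print(regex_string_split("1a\nb2", "a.b", dotall="DOTALL_ENABLED"))
```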

aggregate_functions:
