[BUG] split_record output empty list for empty input string #16453

Closed
ttnghia opened this issue Jul 31, 2024 · 5 comments · Fixed by #16466
Labels: bug (Something isn't working)
ttnghia (Contributor) commented Jul 31, 2024

For an empty input string, split_record outputs nothing (an empty list). This is inconsistent with the case where the input does not contain the split delimiter: there, the output list has one element equal to the input. For example:

input = "abcxyz"
output = split_record(input, "@")
output = ["abcxyz"]

However:

input = ""
output = split_record(input, "@")
output = []

This inconsistent behavior causes a mismatch between cudf and the Java String.split API, which states:

If the expression does not match any part of the input then the resulting array has just one element, namely this string.

Ref: https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#split-java.lang.String-int-
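
For comparison, pandas (and Python's built-in str.split) follows the same convention as Java here: a string with no delimiter match, including the empty string, splits into a single-element list containing that string. A quick illustration (an added example, not taken from the report):

>>> import pandas as pd
>>> ps = pd.Series(['abcxyz', ''])
>>> ps.str.split('@').tolist()   # each row becomes a one-element list holding the original string
[['abcxyz'], ['']]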

ttnghia added the bug label on Jul 31, 2024
ttnghia (Contributor, Author) commented Jul 31, 2024

CC @davidwendt.

davidwendt self-assigned this on Jul 31, 2024
davidwendt (Contributor) commented

The output looks correct to me. It is returning a list with a single empty string and not an empty list.
The input string is empty and so it returns an empty string in the output.

>>> import cudf
>>> import pandas as pd
>>> ps = pd.Series(['','abcdef'])
>>> ps.str.split('@')
0          []
1    [abcdef]
dtype: object
>>> gs = cudf.Series(['','abcdef'])
>>> gs.str.split('@')
0          []
1    [abcdef]
dtype: list

The [] contains an empty string and does not indicate an empty list.

>>> ps.str.split('@')[0]
['']
>>> gs.str.split('@')[0]
['']

This unit test also shows that an empty input string produces a list containing a single empty string:

TEST_F(StringsSplitTest, SplitRecord)
{
  std::vector<char const*> h_strings{" Héllo thesé", nullptr, "are some  ", "tést String", ""};
  auto validity =
    thrust::make_transform_iterator(h_strings.begin(), [](auto str) { return str != nullptr; });
  cudf::test::strings_column_wrapper strings(h_strings.begin(), h_strings.end(), validity);
  auto result =
    cudf::strings::split_record(cudf::strings_column_view(strings), cudf::string_scalar(" "));
  using LCW = cudf::test::lists_column_wrapper<cudf::string_view>;
  LCW expected(
    {LCW{"", "Héllo", "thesé"}, LCW{}, LCW{"are", "some", "", ""}, LCW{"tést", "String"}, LCW{""}},
    validity);
  CUDF_TEST_EXPECT_COLUMNS_EQUAL(result->view(), expected);
}

Note that the last entry is an empty string, and the expected value for that row is LCW{""} and not LCW{}.

ttnghia (Contributor, Author) commented Aug 1, 2024

The output is correct in that case, where the input also contains non-empty strings.
However, when all of the input strings are empty, the output lists are empty:

TEST_F(StringsSplitTest, SplitRecordAllEmpty)
{
  auto input     = cudf::test::strings_column_wrapper({"", "", "", ""});
  auto sv        = cudf::strings_column_view(input);
  auto delimiter = cudf::string_scalar("s");
  auto empty     = cudf::string_scalar("");
  using LCW = cudf::test::lists_column_wrapper<cudf::string_view>;
  LCW expected({LCW{}, LCW{}, LCW{}, LCW{}});
  // The snippet was cut off here; a minimal completion in the style of the earlier test:
  auto result = cudf::strings::split_record(sv, delimiter);
  // Expected to pass under the pre-fix behavior: every output row is an empty list.
  CUDF_TEST_EXPECT_COLUMNS_EQUAL(result->view(), expected);
}
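
The same all-empty case can also be seen from the Python API (a hypothetical sketch, not output captured from this issue; before the fix the rows were reported to come back as empty lists):

import cudf

# All-empty input column; str.split with the default expand=False corresponds
# to the split_record path in libcudf.
gs = cudf.Series(['', '', '', ''])
result = gs.str.split('s')
print(result[0])  # reported pre-fix: []    expected (pandas/Java-style): ['']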

GregoryKimball (Contributor) commented

Thank you @ttnghia for investigating this. We certainly have a bug in split and rsplit with an input column of all empty strings. @davidwendt and I discussed this issue and we agree that fixing the root cause will be too involved to make it in time for 24.08. We looked into some edge-case hacks on the libcudf side and couldn't find a working solution that would cover split, split_records, split_re, rsplit, etc. We plan to address this edge case bug in 24.10.

Is it possible for Spark-RAPIDS to detect the all-empty string column case and avoid calling libcudf?

ttnghia (Contributor, Author) commented Aug 1, 2024

I think we can work around that in the plugin while waiting for the cudf fix in the next release. Thanks Greg.
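
A rough sketch of the kind of plugin-side guard being discussed, written in Python for illustration (the actual Spark-RAPIDS workaround would live in the plugin's Java/Scala code; the helper name and the no-nulls assumption are mine):

import cudf

def split_record_with_guard(gs, delim):
    # Hypothetical guard: if every string in the column is empty, build the
    # [''] lists directly (matching Java String.split and pandas) instead of
    # calling into the pre-fix libcudf split path. Assumes no null rows.
    if len(gs) and (gs.str.len() == 0).all():
        return cudf.Series([['']] * len(gs))
    return gs.str.split(delim)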

rapids-bot pushed a commit that referenced this issue on Aug 13, 2024
Fixes specialized behavior for an all-empty input column in the strings split APIs.
Behavior is verified against pandas `str.split(pat, expand, regex)`:
`pat=None     -- whitespace`
`expand=False -- record APIs`
`regex=True   -- re APIs`

- [x] `split`
- [x] `split` - whitespace
- [x] `rsplit`
- [x] `rsplit` - whitespace
- [x] `split_record`
- [x] `split_record` - whitespace
- [x] `rsplit_record`
- [x] `rsplit_record` - whitespace
- [x] `split_re`
- [x] `rsplit_re`
- [x] `split_record_re`
- [x] `rsplit_record_re`

Closes #16453

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Bradley Dice (https://github.com/bdice)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #16466
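
For context on the pandas reference behavior that the commit message above verifies against, here is how pandas handles an all-empty column for the pattern, whitespace, and regex variants (an added illustration; the regex= keyword requires pandas >= 1.4):

>>> import pandas as pd
>>> ps = pd.Series(['', ''])
>>> ps.str.split('@').tolist()              # pattern split keeps ['']
[[''], ['']]
>>> ps.str.split().tolist()                 # whitespace split (pat=None) yields []
[[], []]
>>> ps.str.split('@', regex=True).tolist()  # regex split also keeps ['']
[[''], ['']]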