[BUG] `title` function documentation does not match implementation #8596
Comments
jlowe added the bug, libcudf, and strings labels Jun 23, 2021
This is designed after the Pandas str.title() function.
So it looks like the documentation is incorrect.
jlowe changed the title from "[BUG] `title` function capitalizes characters that are not preceded by whitespace" to "[BUG] `title` function documentation does not match implementation" Jun 23, 2021
Thanks @davidwendt, I updated the summary accordingly so this tracks fixing the documentation. I'll file a separate feature request for a way to capitalize only after whitespace, which is what we need for Spark.
Filed #8597 to track the new behavior we need for emulating Spark's |
rapids-bot bot pushed a commit that referenced this issue Jun 28, 2021
Closes #8596 #8597

Adds a `sequence_type` parameter to control how words are delimited within a string when processing the `title()` logic.

```
std::unique_ptr<column> title(
  strings_column_view const& input,
  string_character_types sequence_type = string_character_types::ALPHA,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
```

The default `ALPHA` type preserves the original behavior, which matches the Pandas `str.title()`: the first character after a non-ALPHA character is upper-cased and the rest are lower-cased. Likewise, specifying `ALPHANUM` treats a sequence of alphanumeric characters as a word and upper-cases only after a non-ALPHANUM character (e.g. whitespace). The `string_character_types` type of the `sequence_type` parameter is declared in the `char_types.hpp` header, so these types can be reused in this API.

The following usage should satisfy the #8597 feature request.

```
cudf::strings_column_view view(input);  // input strings column
auto result = cudf::strings::title(view, cudf::string_character_types::ALPHANUM);
```

The `ALPHANUM` set contains all alphanumeric character types, which leaves out only `SPACE`, the set of whitespace characters. The gtests for `cudf::strings::title()` have been updated to include test cases using `ALPHANUM`.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Mark Harris (https://github.com/harrism)
- Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
- Nghia Truong (https://github.com/ttnghia)

URL: #8602
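To make the word-delimiting rule described in the commit message concrete, here is a minimal host-side sketch of how a `sequence_type` selection could drive title-casing. It is an illustration only, not the libcudf implementation: `char_class`, `classify`, and `title_model` are hypothetical names, the flags are a simplified stand-in for `cudf::string_character_types`, and only ASCII input is handled.

```cpp
#include <cctype>
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-in for cudf::string_character_types (ASCII only).
enum class char_class : std::uint8_t { ALPHA = 1, DIGIT = 2, ALPHANUM = ALPHA | DIGIT };

// Classify one ASCII character into the model's flags (0 means "neither").
std::uint8_t classify(unsigned char c)
{
  if (std::isalpha(c)) return static_cast<std::uint8_t>(char_class::ALPHA);
  if (std::isdigit(c)) return static_cast<std::uint8_t>(char_class::DIGIT);
  return 0;
}

// Title-case `s`, treating a run of characters that match `sequence_type` as one word.
// The first character of each word is upper-cased and the rest are lower-cased;
// characters outside `sequence_type` are left unchanged and end the current word.
std::string title_model(std::string s, char_class sequence_type = char_class::ALPHA)
{
  auto const mask = static_cast<std::uint8_t>(sequence_type);
  bool in_word    = false;
  for (char& c : s) {
    auto const uc      = static_cast<unsigned char>(c);
    bool const matches = (classify(uc) & mask) != 0;
    if (matches) { c = static_cast<char>(in_word ? std::tolower(uc) : std::toupper(uc)); }
    in_word = matches;
  }
  return s;
}
```

In this model, `title_model("n2VIDIA")` returns "N2Vidia" while `title_model("n2VIDIA", char_class::ALPHANUM)` returns "N2vidia". Note that under `ALPHANUM` any non-alphanumeric character still starts a new word, so `title_model("hello-world 2nd", char_class::ALPHANUM)` returns "Hello-World 2nd"; whitespace is one example of a delimiter, not the only one.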
Describe the bug
The libcudf strings `title()` documentation states that the first character after each occurrence of spaces is capitalized, but the implementation is capitalizing after non-alphabetic characters instead.

Steps/Code to reproduce bug
Call `title()` on a string column that contains the string "n2VIDIA". The expected output is "N2vidia" but instead the result is "N2Vidia".

Expected behavior
`title()` should capitalize after whitespace rather than any non-alphabetic character. If the current behavior of `title()` is considered correct then the documentation needs to be updated and there needs to be a mechanism that can capitalize only after whitespace.
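The mismatch described above can be reproduced without a GPU by modeling both rules on a plain `std::string`. The following is a rough sketch under stated assumptions: both function names are hypothetical, neither is libcudf code, and only ASCII input is handled.

```cpp
#include <cctype>
#include <string>

// Documented rule (hypothetical model): upper-case the first character of the
// string and any character that follows whitespace; lower-case everything else.
std::string title_after_whitespace(std::string s)
{
  bool start_of_word = true;
  for (char& c : s) {
    auto const uc = static_cast<unsigned char>(c);
    c             = static_cast<char>(start_of_word ? std::toupper(uc) : std::tolower(uc));
    start_of_word = std::isspace(uc) != 0;
  }
  return s;
}

// Observed rule (hypothetical model): upper-case an alphabetic character that
// follows a non-alphabetic one (or starts the string); lower-case other letters.
std::string title_after_non_alpha(std::string s)
{
  bool prev_alpha = false;
  for (char& c : s) {
    auto const uc       = static_cast<unsigned char>(c);
    bool const is_alpha = std::isalpha(uc) != 0;
    if (is_alpha) { c = static_cast<char>(prev_alpha ? std::tolower(uc) : std::toupper(uc)); }
    prev_alpha = is_alpha;
  }
  return s;
}
```

For the input "n2VIDIA", `title_after_whitespace` yields "N2vidia" (the output expected from the documentation), while `title_after_non_alpha` yields "N2Vidia" (the output reported in this issue), because the digit `2` ends the first alphabetic run.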