[BUG] title function documentation does not match implementation #8596

Closed
jlowe opened this issue Jun 23, 2021 · 3 comments · Fixed by #8602
Labels: bug, doc, libcudf, Spark, strings


jlowe commented Jun 23, 2021

Describe the bug
The libcudf strings title() documentation states that the first character after each occurrence of spaces is capitalized, but the implementation is capitalizing after non-alphabetic characters instead.

Steps/Code to reproduce bug
Call title() on a string column that contains the string "n2VIDIA". The expected output is "N2vidia" but instead the result is "N2Vidia".

Expected behavior
title() should capitalize after whitespace rather than any non-alphabetic character. If the current behavior of title() is considered correct then the documentation needs to be updated and there needs to be a mechanism that can capitalize only after whitespace.
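
For illustration, here is a minimal pure-Python sketch of the two behaviors described above. It does not call libcudf. `capitalize_after_whitespace` is a hypothetical helper written for this example, and Python's built-in `str.title()` is used only because it happens to give the same result as the reported behavior for this input.

```python
def capitalize_after_whitespace(text: str) -> str:
    """Upper-case the first character of each whitespace-delimited word and
    lower-case the rest. Hypothetical helper, not a libcudf or pandas API."""
    out = []
    at_word_start = True
    for ch in text:
        if ch.isspace():
            # Whitespace is copied through and marks the start of a new word.
            out.append(ch)
            at_word_start = True
        else:
            out.append(ch.upper() if at_word_start else ch.lower())
            at_word_start = False
    return "".join(out)


# Built-in title-casing capitalizes after any uncased character, such as "2".
builtin_result = "n2VIDIA".title()
# The whitespace-only variant leaves "v" lower-case because "2" is not whitespace.
whitespace_result = capitalize_after_whitespace("n2VIDIA")
print(builtin_result, whitespace_result)
```

The first value corresponds to the reported result ("N2Vidia") and the second to the output expected from the documentation ("N2vidia").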

jlowe added the bug, libcudf, and strings labels Jun 23, 2021
jlowe added the Spark label Jun 23, 2021
davidwendt self-assigned this Jun 23, 2021
davidwendt commented Jun 23, 2021

This is designed after the Pandas str.title() function.

>>> import pandas as pd
>>> s = pd.Series(["n2vidia"])
>>> s.str.title()
0    N2Vidia

So it looks like the documentation is incorrect.

jlowe changed the title from "[BUG] title function capitalizes characters that are not preceded by whitespace" to "[BUG] title function documentation does not match implementation" Jun 23, 2021

jlowe commented Jun 23, 2021

Thanks @davidwendt, I updated the summary accordingly so this tracks fixing the documentation. I'll file a separate feature request for a way to capitalize only after whitespace, which is what we need for Spark.

jlowe added the doc label Jun 23, 2021

jlowe commented Jun 23, 2021

Filed #8597 to track the new behavior we need for emulating Spark's initcap behavior.

rapids-bot bot pushed a commit that referenced this issue Jun 28, 2021
Closes #8596  #8597

Adds a `sequence_type` parameter to help control how words are delimited within a string when processing the `title()` logic.
```
std::unique_ptr<column> title(
  strings_column_view const& input,
  string_character_types sequence_type = string_character_types::ALPHA,
  rmm::mr::device_memory_resource* mr  = rmm::mr::get_current_device_resource());
```

The default `ALPHA` type preserves the original behavior, which matches pandas `str.title()`: the first character after a non-`ALPHA` character is upper-cased and the rest are lower-cased. Specifying `ALPHANUM` instead treats a sequence of alphanumeric characters as a word and upper-cases only after a non-`ALPHANUM` character (e.g. whitespace).
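
The effect of the two settings can be sketched in plain Python. This is an emulation under stated assumptions, not the libcudf implementation: `title_by_sequence` is a hypothetical function, and the predicates `str.isalpha` and `str.isalnum` stand in for the `ALPHA` and `ALPHANUM` character types.

```python
from typing import Callable


def title_by_sequence(text: str, in_sequence: Callable[[str], bool]) -> str:
    """Title-case `text`, treating each run of characters accepted by
    `in_sequence` as one word. Hypothetical emulation, not a libcudf API."""
    out = []
    capitalize = True  # the next in-sequence character starts a word
    for ch in text:
        if in_sequence(ch):
            out.append(ch.upper() if capitalize else ch.lower())
            capitalize = False
        else:
            # Characters outside the sequence type are copied unchanged
            # and act as word delimiters.
            out.append(ch)
            capitalize = True
    return "".join(out)


# Assumption: str.isalpha approximates ALPHA, str.isalnum approximates ALPHANUM.
alpha_result = title_by_sequence("n2VIDIA", str.isalpha)
alphanum_result = title_by_sequence("n2VIDIA", str.isalnum)
print(alpha_result, alphanum_result)
```

With the alphabetic predicate the digit acts as a delimiter, so the letter after it is upper-cased. With the alphanumeric predicate the digit is part of the word, so only the first character is upper-cased.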

The `string_character_types` type of the `sequence_type` parameter is declared in the `char_types.hpp` header, so its existing values can be reused in this API.

The following usage should satisfy the #8597 feature request.
```
cudf::strings_column_view view(input);  // input strings column
auto result = cudf::strings::title(view, cudf::string_character_types::ALPHANUM);
```
The `ALPHANUM` set contains all alphanumeric character types, leaving out only `SPACE`, the set of whitespace characters.

The gtests for `cudf::strings::title()` have been updated to include test-cases using `ALPHANUM`.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
  - Nghia Truong (https://github.com/ttnghia)

URL: #8602