Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Add support for string 'nan', 'inf' & '-inf' values while type-casting to float #9613

Merged
merged 7 commits into from
Nov 8, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
81 changes: 77 additions & 4 deletions python/cudf/cudf/core/column/string.py
Original file line number Diff line number Diff line change
Expand Up @@ -97,6 +97,69 @@ def str_to_boolean(column: StringColumn):
cudf.dtype("timedelta64[ns]"): str_cast.int2timedelta,
}

_NAN_INF_VARIATIONS = [
"nan",
"NAN",
"Nan",
"naN",
"nAN",
"NAn",
"nAn",
"-inf",
"-INF",
"-InF",
"-inF",
"-iNF",
"-INf",
"-iNf",
"+inf",
"+INF",
"+InF",
"+inF",
"+iNF",
"+INf",
"+Inf",
"+iNf",
"inf",
"INF",
"InF",
"inF",
"iNF",
"INf",
"iNf",
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would all of these really happen?
If so, there are a couple of missing combinations: nAn, -iNf, +iNf, iNf
If not, would suggest perhaps only supporting NaN, nan, NAN; -Inf, -inf, -INF; +Inf, +inf, +INF; Inf, inf, INF

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe helpful:

def _proc_inf_strings(col):

Copy link
Contributor Author

@galipremsagar galipremsagar Nov 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would all of these really happen? If so, there are a couple of missing combinations: nAn, -iNf, +iNf, iNf If not, would suggest perhaps only supporting NaN, nan, NAN; -Inf, -inf, -INF; +Inf, +inf, +INF; Inf, inf, INF

Thanks for catching this @davidwendt. The string to float conversion happens via NumPy machinery. FWIW, it appears that NumPy converts all strings to lower case equivalent and then performs the type-cast which makes it easier to handle these kinds of combinations, for ex:

>>> s
array(['213', 'nan', 'NAN', 'INF'], dtype=object)
>>> s.astype('float')
array([213.,  nan,  nan,  inf])

I had these combinations added to match this behavior, if you think performing a .lower operation and then replacing is a better alternative we can switch to that as well. But I suppose libcudf only supports NaN, Inf values as opposed to nan, inf which rules out the .lower operation for us.

Did some benchmarks with a subset and comprehensive set:

# With a subset

In [4]: s = cudf.Series(['1', 'nan', "Nan", "Inf", "-INF", "3246734628374", "324772364823", "324234.234234", "32483463567"] * 1000000)

In [5]: %timeit s.astype('float64')
17.3 ms ± 612 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)



# All combinations

In [2]: s = cudf.Series(['1', 'nan', "Nan", "Inf", "-INF", "3246734628374", "324772364823", "324234.234234", "32483463567"] * 1000000)

In [3]: %timeit s.astype('float64')
18 ms ± 419 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I think it is reasonable for libcudf to support all uppercase for this as documented for std::stof here: https://en.cppreference.com/w/cpp/string/basic_string/stof
This way, you could call .upper() which should be faster than .replace().

This change is not difficult so let me know if you want me to open a PR for this.
Since 21.12 is now in burndown, we could also go ahead with this PR as is and make the change to libcudf and change cudf to use .upper() in the next release as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since 21.12 is now in burndown, we could also go ahead with this PR as is and make the change to libcudf and change cudf to use .upper() in the next release as well.

This sounds good to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe helpful:

def _proc_inf_strings(col):

I think both these code-paths should be unified a little bit, since David will be adding the all uppercase support in next release, I've added a TODO to merge these two paths.

_LIBCUDF_SUPPORTED_NAN_INF_VARIATIONS = [
"NaN",
"NaN",
"NaN",
"NaN",
"NaN",
"NaN",
"NaN",
"-Inf",
"-Inf",
"-Inf",
"-Inf",
"-Inf",
"-Inf",
"-Inf",
"Inf",
"Inf",
"Inf",
"Inf",
"Inf",
"Inf",
"Inf",
"Inf",
"Inf",
"Inf",
"Inf",
"Inf",
"Inf",
"Inf",
"Inf",
]


def _is_supported_regex_flags(flags):
return flags == 0 or (
Expand Down Expand Up @@ -5201,21 +5264,31 @@ def as_numerical_column(
self, dtype: Dtype, **kwargs
) -> "cudf.core.column.NumericalColumn":
out_dtype = cudf.dtype(dtype)

string_col = self
if out_dtype.kind in {"i", "u"}:
if not libstrings.is_integer(self).all():
if not libstrings.is_integer(string_col).all():
raise ValueError(
"Could not convert strings to integer "
"type due to presence of non-integer values."
)
elif out_dtype.kind == "f":
if not libstrings.is_float(self).all():
# TODO: Replace this `replace` call with a
# case-insensitive method once following
# issue is fixed: https://github.com/rapidsai/cudf/issues/5217
old_values = cudf.core.column.as_column(_NAN_INF_VARIATIONS)
new_values = cudf.core.column.as_column(
_LIBCUDF_SUPPORTED_NAN_INF_VARIATIONS
)
string_col = libcudf.replace.replace(
string_col, old_values, new_values
)
if not libstrings.is_float(string_col).all():
raise ValueError(
"Could not convert strings to float "
"type due to presence of non-floating values."
)

result_col = _str_to_numeric_typecast_functions[out_dtype](self)
result_col = _str_to_numeric_typecast_functions[out_dtype](string_col)
return result_col

def _as_datetime_or_timedelta_column(self, dtype, format):
Expand Down
2 changes: 2 additions & 0 deletions python/cudf/cudf/core/tools/numeric.py
Original file line number Diff line number Diff line change
Expand Up @@ -244,6 +244,8 @@ def _proc_inf_strings(col):
"""Convert "inf/infinity" strings into "Inf", the native string
representing infinity in libcudf
"""
# TODO: This can be handled by libcudf in
# future see StringColumn.as_numerical_column
col = libstrings.replace_multi(
col, as_column(["+", "inf", "inity"]), as_column(["", "Inf", ""]),
)
Expand Down
14 changes: 13 additions & 1 deletion python/cudf/cudf/tests/test_string.py
Original file line number Diff line number Diff line change
Expand Up @@ -183,7 +183,19 @@ def test_string_astype(dtype):
):
data = ["1", "2", "3", "4", "5"]
elif dtype.startswith("float"):
data = ["1.0", "2.0", "3.0", "4.0", "5.0"]
data = [
"1.0",
"2.0",
"3.0",
"4.0",
None,
"5.0",
"nan",
"-INF",
"NaN",
"inF",
"NAn",
]
elif dtype.startswith("bool"):
data = ["True", "False", "True", "False", "False"]
elif dtype.startswith("datetime64"):
Expand Down