[BUG] Problems converting String dtype Series with "nan" to Float (ValueError) #7488

dmitra79 · 2021-03-02T22:02:49Z

A Series of strings should be convertible to a Series of floats, even if some entries are NaN. Instead, a ValueError is thrown.

The following works fine in Pandas
'''
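import numpy as np
import pandas as pd
x = pd.Series(['1.1', '2.3', '', '3'])
x[x == ''] = np.nan  # np.NAN in the original report; that alias was removed in NumPy 2.0
x.astype('float64')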
'''
but throws an error in cudf:
'''
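import cudf
import numpy as np
x = cudf.Series(['1.1', '2.3', '', '3'])
x[x == ''] = np.nan  # np.NAN in the original report; that alias was removed in NumPy 2.0
x.astype('float64')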
'''

ValueError: Could not convert strings to float type due to presence of non-floating values.

Environment overview (please complete the following information)
cudf.__version__ = '0.15.0'
installed with conda

 ***OS Information***
 DISTRIB_ID=Ubuntu
 DISTRIB_RELEASE=18.04
 DISTRIB_CODENAME=bionic
 DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"
 NAME="Ubuntu"
 VERSION="18.04.5 LTS (Bionic Beaver)"
 ID=ubuntu
 ID_LIKE=debian
 PRETTY_NAME="Ubuntu 18.04.5 LTS"
 VERSION_ID="18.04"
 HOME_URL="https://www.ubuntu.com/"
 SUPPORT_URL="https://help.ubuntu.com/"
 BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
 PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
 VERSION_CODENAME=bionic
 UBUNTU_CODENAME=bionic
 Linux ctusbamibi-gpu01 4.15.0-136-generic #140-Ubuntu SMP Thu Jan 28 05:20:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

 Architecture:        x86_64
 CPU op-mode(s):      32-bit, 64-bit
 Byte Order:          Little Endian
 CPU(s):              40
 On-line CPU(s) list: 0-39
 Thread(s) per core:  2
 Core(s) per socket:  10
 Socket(s):           2
 NUMA node(s):        2
 Vendor ID:           GenuineIntel
 CPU family:          6
 Model:               85
 Model name:          Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
 Stepping:            7
 CPU MHz:             1000.232
 CPU max MHz:         2201.0000
 CPU min MHz:         1000.0000
 BogoMIPS:            4400.00
 Virtualization:      VT-x
 L1d cache:           32K
 L1i cache:           32K
 L2 cache:            1024K
 L3 cache:            14080K
 NUMA node0 CPU(s):   0-9,20-29
 NUMA node1 CPU(s):   10-19,30-39

kkraus14 · 2021-03-03T00:29:02Z

@dmitra79 this is a design decision of cudf. In your above example:

import pandas as pd

x = pd.Series(['1.1', '2.3', '', '3'])

Pandas by default creates this Series as an object dtype. This means arbitrary Python objects can be used in the Series. So when you try to set the empty string to a np.nan value, Pandas is happy to have that Python object in the Series.

Unfortunately, this isn't possible for us to handle on the GPU, so when someone does:

import cudf

x = cudf.Series(['1.1', '2.3', '', '3'])

We create this Series as a string dtype. When trying to set the empty string to np.nan, it's ambiguous whether we should typecast the np.nan to a string or typecast the Series to a float dtype. For this reason, we raise an error and this is our intended behavior.
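The pandas side of this explanation can be checked on the CPU. The following is a minimal sketch, not taken from the thread: it passes `dtype=object` explicitly (an assumption made here so the example does not depend on the default string dtype of the installed pandas version) and shows that an object Series holds a Python float NaN next to Python strings, and that the cast to float64 then succeeds.

```python
import numpy as np
import pandas as pd

# dtype=object is set explicitly here; the thread describes it as the pandas default.
x = pd.Series(['1.1', '2.3', '', '3'], dtype=object)

# An object Series stores arbitrary Python objects, so a float NaN
# can sit alongside the remaining strings.
x[x == ''] = np.nan

print(x.dtype)      # the Series is still object dtype
print(type(x[0]))   # a Python str
print(type(x[2]))   # a Python float (the NaN)

# Each element is converted individually, so the cast succeeds.
y = x.astype('float64')
print(y)
```

A single-dtype string column has no equivalent slot for a float object, which is the constraint described above for the GPU.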

I would also recommend upgrading to a newer cuDF, as v0.18 was recently released and many of these behaviors and error messages have been improved over the last couple of releases.

dmitra79 · 2021-03-03T00:38:28Z

@kkraus14 The error is thrown not when the values are set to NaN, but when the Series is cast to float (x.astype('float64')), which should be pretty unambiguous. Also, this pandas-like code worked in previous versions of cudf without error (I ran into this problem while trying to use an older piece of code).

kkraus14 · 2021-03-03T00:45:02Z

Apologies, I may have misinterpreted. What is the result of x before the .astype('float64') call? I'm guessing there's a string value of nan that we possibly don't handle in typecasting in which case we should address it.

dmitra79 · 2021-03-03T01:21:54Z

For this code:

x = cudf.Series(['1.1', '2.3', '', '3'])
x[x == ''] = np.nan
print(x)
print(x[2])
x.astype('float64')
x

The output is:

0    1.1
1    2.3
2    nan
3      3
dtype: object
nan

The error is:

<ipython-input-47-e771a2ea29ff> in <module>
      3 print(x)
      4 print(x[2])
----> 5 x.astype('float64')
      6 x

~/anaconda3/envs/mindsynchro/lib/python3.8/site-packages/cudf/core/series.py in astype(self, dtype, copy, errors)
   2130         except Exception as e:
   2131             if errors == "raise":
-> 2132                 raise e
   2133             elif errors == "warn":
   2134                 import traceback

~/anaconda3/envs/mindsynchro/lib/python3.8/site-packages/cudf/core/series.py in astype(self, dtype, copy, errors)
   2122             return self.copy(deep=copy)
   2123         try:
-> 2124             data = self._column.astype(dtype)
   2125 
   2126             return self._copy_construct(

~/anaconda3/envs/mindsynchro/lib/python3.8/site-packages/cudf/core/column/column.py in astype(self, dtype, **kwargs)
    892             return self.as_string_column(dtype, **kwargs)
    893         else:
--> 894             return self.as_numerical_column(dtype, **kwargs)
    895 
    896     def as_categorical_column(self, dtype, **kwargs):

~/anaconda3/envs/mindsynchro/lib/python3.8/site-packages/cudf/core/column/string.py in as_numerical_column(self, dtype, **kwargs)
   4570         elif out_dtype.kind == "f":
   4571             if not cpp_is_float(self).all():
-> 4572                 raise ValueError(
   4573                     "Could not convert strings to float "
   4574                     "type due to presence of non-floating values."

ValueError: Could not convert strings to float type due to presence of non-floating values.

dmitra79 · 2021-03-03T01:29:25Z

PS. I just checked with cudf 0.18 (installed from scratch in a new environment) - same issue

kkraus14 · 2021-03-03T01:32:00Z

Thanks for the reproducer. We'll look into this.

In the meantime, I would suggest using None instead of nan as that would work via null values. Then if you want NaN float values you can use a fillna(np.nan) after typecasting to float64.
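A minimal sketch of this workaround follows. The helper name `empty_strings_to_float` is hypothetical, and the demonstration uses a pandas Series so that it runs without a GPU. Per the suggestion above, the same three calls (assigning None, `astype`, `fillna`) are expected to apply to a `cudf.Series`, but that is not verified here.

```python
import numpy as np
import pandas as pd

def empty_strings_to_float(series):
    """Cast a string Series to float64, treating '' as a missing value.

    Hypothetical helper: written against the pandas-style Series API.
    """
    s = series.copy()             # leave the caller's Series untouched
    s[s == ''] = None             # use a null instead of np.nan
    out = s.astype('float64')     # nulls survive the cast as missing values
    return out.fillna(np.nan)     # turn any remaining nulls into NaN floats

x = pd.Series(['1.1', '2.3', '', '3'], dtype=object)
result = empty_strings_to_float(x)
print(result)
```

In pandas the final `fillna` is a no-op because the cast already yields NaN. It is kept because the suggestion above distinguishes null values from NaN floats in cuDF.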

dmitra79 · 2021-03-03T02:10:42Z

Thanks for the suggestion of using "None"! That really helps!

kkraus14 · 2021-03-15T15:42:27Z

@davidwendt do you happen to know if there's a way to convert a string to a NaN float value, i.e. to convert something like "nan" or "NaN"?

davidwendt · 2021-03-15T15:53:48Z

Yes, if a string is "NaN" then cudf::strings::to_floats() will convert it to a float NaN value.
Note that it does not accept "nan", only "NaN".
Likewise, it looks for "Inf" and "-Inf" to assign infinity and -infinity.
Looks like I should add these to the doxygen for to_floats.

kkraus14 · 2021-03-15T15:58:30Z

Looks like Pandas allows case-insensitive conversion for nan, inf, and -inf, so we should handle the same from the Python side and replace with NaN, Inf, and -Inf respectively.
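The normalization proposed here can be sketched in plain Python. This is only an illustration of the idea on a list of strings, not the code that was eventually merged into cuDF: `normalize_special_floats` and `_CANONICAL` are hypothetical names, and the canonical spellings are the ones quoted in the comments above.

```python
# Canonical spellings that, per the comments above, the string-to-float
# conversion recognizes. The mapping itself is an illustrative assumption.
_CANONICAL = {'nan': 'NaN', 'inf': 'Inf', '-inf': '-Inf'}

def normalize_special_floats(values):
    """Rewrite any casing of nan/inf/-inf to its canonical spelling.

    Other strings and None (null) entries are passed through unchanged.
    """
    out = []
    for v in values:
        if v is None:
            out.append(None)
        else:
            # Look the value up case-insensitively; fall back to the original.
            out.append(_CANONICAL.get(v.lower(), v))
    return out

raw = ['1.1', 'nan', 'NAN', 'inf', '-INF', None, '3']
normalized = normalize_special_floats(raw)
print(normalized)
# ['1.1', 'NaN', 'NaN', 'Inf', '-Inf', None, '3']
```

Variants such as "+inf" or "infinity" are deliberately left out because the thread does not say how they should be treated.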

github-actions · 2021-04-14T16:06:01Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

…ing to `float` (#9613)

Fixes: #7488

This PR adds support for the strings `nan`, `inf` & `-inf` and their differently cased variations when type-casting from a string column to `float` dtype.

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- David Wendt (https://github.com/davidwendt)
- https://github.com/brandon-b-miller

URL: #9613

dmitra79 added the Needs Triage and bug labels Mar 2, 2021

harrism added the Python label Mar 2, 2021

kkraus14 removed the Needs Triage label Mar 3, 2021

kkraus14 closed this as completed Mar 3, 2021

kkraus14 reopened this Mar 3, 2021

kkraus14 changed the title ~~[BUG] Problems converting Series with NAN to Float (ValueError)~~ [BUG] Problems converting String dtype Series with "nan" to Float (ValueError) Mar 3, 2021

github-actions bot added the inactive-30d label Apr 14, 2021

kkraus14 removed the inactive-30d label Apr 14, 2021

github-actions bot added the inactive-30d label May 14, 2021

galipremsagar self-assigned this Sep 22, 2021

galipremsagar removed the inactive-30d label Sep 22, 2021

galipremsagar mentioned this issue Nov 4, 2021

[REVIEW] Add support for string 'nan', 'inf' & '-inf' values while type-casting to float #9613

Merged

rapids-bot bot closed this as completed in #9613 Nov 8, 2021
