[BUG] Problems converting String dtype Series with "nan" to Float (ValueError) #7488

Closed
dmitra79 opened this issue Mar 2, 2021 · 12 comments · Fixed by #9613
Labels: bug (Something isn't working), Python (Affects Python cuDF API)

Comments

@dmitra79

dmitra79 commented Mar 2, 2021

A Series of strings should be convertible to a Series of floats, even if some entries are NaN. Instead, a ValueError is raised.

The following works fine in pandas:
'''
import numpy as np
import pandas as pd

x = pd.Series(['1.1', '2.3', '', '3'])
x[x == ''] = np.nan  # replace empty strings with NaN
x.astype('float64')
'''
but raises an error in cudf:
'''
import cudf
import numpy as np

x = cudf.Series(['1.1', '2.3', '', '3'])
x[x == ''] = np.nan
x.astype('float64')  # raises ValueError
'''

ValueError: Could not convert strings to float type due to presence of non-floating values.

Environment overview
cudf.__version__ = '0.15.0'
Installed with conda

 ***OS Information***
 DISTRIB_ID=Ubuntu
 DISTRIB_RELEASE=18.04
 DISTRIB_CODENAME=bionic
 DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"
 NAME="Ubuntu"
 VERSION="18.04.5 LTS (Bionic Beaver)"
 ID=ubuntu
 ID_LIKE=debian
 PRETTY_NAME="Ubuntu 18.04.5 LTS"
 VERSION_ID="18.04"
 HOME_URL="https://www.ubuntu.com/"
 SUPPORT_URL="https://help.ubuntu.com/"
 BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
 PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
 VERSION_CODENAME=bionic
 UBUNTU_CODENAME=bionic
 Linux ctusbamibi-gpu01 4.15.0-136-generic #140-Ubuntu SMP Thu Jan 28 05:20:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

 Architecture:        x86_64
 CPU op-mode(s):      32-bit, 64-bit
 Byte Order:          Little Endian
 CPU(s):              40
 On-line CPU(s) list: 0-39
 Thread(s) per core:  2
 Core(s) per socket:  10
 Socket(s):           2
 NUMA node(s):        2
 Vendor ID:           GenuineIntel
 CPU family:          6
 Model:               85
 Model name:          Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
 Stepping:            7
 CPU MHz:             1000.232
 CPU max MHz:         2201.0000
 CPU min MHz:         1000.0000
 BogoMIPS:            4400.00
 Virtualization:      VT-x
 L1d cache:           32K
 L1i cache:           32K
 L2 cache:            1024K
 L3 cache:            14080K
 NUMA node0 CPU(s):   0-9,20-29
 NUMA node1 CPU(s):   10-19,30-39
@dmitra79 dmitra79 added the Needs Triage (Need team to review and classify) and bug (Something isn't working) labels Mar 2, 2021
@harrism harrism added the Python (Affects Python cuDF API) label Mar 2, 2021
@kkraus14 kkraus14 removed the Needs Triage (Need team to review and classify) label Mar 3, 2021
@kkraus14
Collaborator

kkraus14 commented Mar 3, 2021

@dmitra79 this is a design decision of cudf. In your above example:

import pandas as pd

x = pd.Series(['1.1', '2.3', '', '3'])

Pandas by default creates this Series as an object dtype. This means arbitrary Python objects can be used in the Series. So when you try to set the empty string to a np.nan value, Pandas is happy to have that Python object in the Series.

Unfortunately, this isn't possible for us to handle on the GPU, so when someone does:

import cudf

x = cudf.Series(['1.1', '2.3', '', '3'])

We create this Series as a string dtype. When trying to set the empty string to np.nan, it's ambiguous whether we should typecast the np.nan to a string or typecast the Series to a float dtype. For this reason, we raise an error and this is our intended behavior.

I would also recommend upgrading to a newer cuDF, as v0.18 was recently released and many of these behaviors and error messages have been improved over the last couple of releases.
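
A minimal sketch of the pandas side of this explanation, assuming pandas and NumPy are installed. The explicit `dtype=object` is an addition made here to pin the object-dtype behavior described above, and the name `as_float` is only illustrative.

```python
import math

import numpy as np
import pandas as pd

# Assumption: dtype=object is spelled out so the Series holds arbitrary
# Python objects, as described in the comment above.
x = pd.Series(["1.1", "2.3", "", "3"], dtype=object)

# The empty string is replaced by a float NaN object, not by a string.
x[x == ""] = np.nan
print(x.dtype)     # the Series is still object dtype
print(type(x[2]))  # the replaced entry is a Python/NumPy float

# Casting works because every element is either a numeric string or a float.
as_float = x.astype("float64")
print(as_float)
```

Because the NaN is stored as a float object rather than as text, the later cast never has to parse a "nan" string.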

@kkraus14 kkraus14 closed this as completed Mar 3, 2021
@dmitra79
Author

dmitra79 commented Mar 3, 2021

@kkraus14 The error is raised not when the values are set to NaN, but when the Series is cast to float (x.astype('float64')), which should be pretty unambiguous. Also, this pandas-like code worked in previous versions of cudf without error (I ran into this problem trying to use an older piece of code).

@kkraus14 kkraus14 reopened this Mar 3, 2021
@kkraus14
Collaborator

kkraus14 commented Mar 3, 2021

Apologies, I may have misinterpreted. What does x look like before the .astype('float64') call? I'm guessing there's a string value of nan that we possibly don't handle in typecasting, in which case we should address it.

@dmitra79
Author

dmitra79 commented Mar 3, 2021

For this code:

x = cudf.Series(['1.1', '2.3', '', '3'])
x[x == ''] = np.nan
print(x)
print(x[2])
x.astype('float64')
x

The output is:

0    1.1
1    2.3
2    nan
3      3
dtype: object
nan

The error is:

<ipython-input-47-e771a2ea29ff> in <module>
      3 print(x)
      4 print(x[2])
----> 5 x.astype('float64')
      6 x

~/anaconda3/envs/mindsynchro/lib/python3.8/site-packages/cudf/core/series.py in astype(self, dtype, copy, errors)
   2130         except Exception as e:
   2131             if errors == "raise":
-> 2132                 raise e
   2133             elif errors == "warn":
   2134                 import traceback

~/anaconda3/envs/mindsynchro/lib/python3.8/site-packages/cudf/core/series.py in astype(self, dtype, copy, errors)
   2122             return self.copy(deep=copy)
   2123         try:
-> 2124             data = self._column.astype(dtype)
   2125 
   2126             return self._copy_construct(

~/anaconda3/envs/mindsynchro/lib/python3.8/site-packages/cudf/core/column/column.py in astype(self, dtype, **kwargs)
    892             return self.as_string_column(dtype, **kwargs)
    893         else:
--> 894             return self.as_numerical_column(dtype, **kwargs)
    895 
    896     def as_categorical_column(self, dtype, **kwargs):

~/anaconda3/envs/mindsynchro/lib/python3.8/site-packages/cudf/core/column/string.py in as_numerical_column(self, dtype, **kwargs)
   4570         elif out_dtype.kind == "f":
   4571             if not cpp_is_float(self).all():
-> 4572                 raise ValueError(
   4573                     "Could not convert strings to float "
   4574                     "type due to presence of non-floating values."

ValueError: Could not convert strings to float type due to presence of non-floating values.

@dmitra79
Author

dmitra79 commented Mar 3, 2021

PS. I just checked with cudf 0.18 (installed from scratch in a new environment) - same issue

@kkraus14
Collaborator

kkraus14 commented Mar 3, 2021

Thanks for the reproducer. We'll look into this.

In the meantime, I would suggest using None instead of nan as that would work via null values. Then if you want NaN float values you can use a fillna(np.nan) after typecasting to float64.
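
The suggested workaround can be sketched roughly as follows. This is an illustration under stated assumptions rather than a verified cuDF recipe: the `xdf` alias and the fallback to pandas when cudf cannot be imported are additions made here so the snippet also runs on a CPU-only machine.

```python
import math

import numpy as np

try:
    import cudf as xdf  # the library this issue is about
except ImportError:
    import pandas as xdf  # assumption: CPU fallback so the sketch still runs

x = xdf.Series(["1.1", "2.3", "", "3"])

# Mark the empty string as a null instead of assigning np.nan.
x[x == ""] = None

# Nulls pass through the string-to-float cast untouched.
y = x.astype("float64")

# Optionally turn the nulls into NaN float values afterwards.
y = y.fillna(np.nan)

# Bring the result to the host for inspection (cudf only).
result = y.to_pandas() if hasattr(y, "to_pandas") else y
print(result)
```

The key point is that the missing entry never becomes the text "nan", so the string-to-float conversion only sees parseable values and nulls.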

@dmitra79
Author

dmitra79 commented Mar 3, 2021

Thanks for the suggestion of using "None"! That really helps!

@kkraus14 kkraus14 changed the title from "[BUG] Problems converting Series with NAN to Float (ValueError)" to "[BUG] Problems converting String dtype Series with "nan" to Float (ValueError)" Mar 3, 2021
@kkraus14
Collaborator

@davidwendt do you happen to know if there's a way to convert a string to a NaN float value? I.E. convert something like "nan" or "NaN"?

@davidwendt
Contributor

Yes, if a string is "NaN" then cudf::strings::to_floats() will convert that to a float nan value.
Also, it does not accept "nan". Only "NaN".
Likewise, it looks for "Inf" and "-Inf" to assign infinity and -infinity.
Looks like I should add these to the doxygen for to_floats.

@kkraus14
Collaborator

Looks like Pandas allows case-insensitive conversion for nan, inf, and -inf, so we should handle the same from the Python side and replace with NaN, Inf, and -Inf respectively.
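
A pure-Python illustration of that normalization idea, assuming only the three spellings named in this thread need handling. The helper name `normalize_special_floats` is hypothetical, and this is not the code that landed in cuDF; it only shows the mapping step that would run before a strict, case-sensitive parser.

```python
import math

# Canonical spellings that, per the comments above, the strict parser accepts.
_CANONICAL = {"nan": "NaN", "inf": "Inf", "-inf": "-Inf"}


def normalize_special_floats(values):
    """Map case-insensitive nan/inf/-inf spellings to NaN/Inf/-Inf.

    Hypothetical helper: None (null) entries and all other strings
    are passed through unchanged.
    """
    normalized = []
    for value in values:
        if value is None:
            normalized.append(None)
        else:
            normalized.append(_CANONICAL.get(value.lower(), value))
    return normalized


raw = ["1.1", "nan", "NAN", "INF", "-inf", "3"]
normalized = normalize_special_floats(raw)
print(normalized)

# Python's float() stands in for the downstream parser in this sketch.
converted = [float(value) for value in normalized]
print(converted)
```

Doing the replacement on the Python side keeps the lower-level parser strict while still matching the case-insensitive behavior users expect from pandas.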

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@galipremsagar galipremsagar self-assigned this Sep 22, 2021
@rapids-bot rapids-bot bot closed this as completed in #9613 Nov 8, 2021
rapids-bot bot pushed a commit that referenced this issue Nov 8, 2021
…ing to `float` (#9613)

Fixes: #7488 

This PR adds support for the strings `nan`, `inf`, and `-inf`, and their case variations, when type-casting from a string column to a `float` dtype.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - https://github.com/brandon-b-miller

URL: #9613
5 participants