ExtensionBlock.is_numeric is always False #22290

Closed
jschendel opened this issue Aug 12, 2018 · 3 comments · Fixed by #22345

@jschendel
Member

Code Sample/Problem Description

Currently ExtensionBlock.is_numeric always returns False. This can be problematic for extension arrays that are numeric, as this is used under the hood in places to filter to numeric columns in a DataFrame. I'll be using IntegerArray as an example, but this in principle applies to any numeric extension array, e.g. DecimalArray in the testing suite, an extension array for units/uncertainties, etc.

Setup:

In [2]: df = pd.DataFrame({'group': list('aaabbb'),
   ...:                    'val1': IntegerArray([0, 1, 2, np.nan, 3, 4]),
   ...:                    'val2': np.arange(6)})
   ...:                    

In [3]: df
Out[3]: 
  group val1  val2
0     a    0     0
1     a    1     1
2     a    2     2
3     b  NaN     3
4     b    3     4
5     b    4     5

The IntegerArray column is ignored by DataFrame._get_numeric_data():

In [4]: df._get_numeric_data()
Out[4]: 
   val2
0     0
1     1
2     2
3     3
4     4
5     5

As a result, some numeric routines, such as DataFrame.corr, ignore the IntegerArray column:

In [5]: df.corr()
Out[5]: 
      val2
val2   1.0

Likewise, groupby uses ExtensionBlock.is_numeric to filter to numeric columns for some operations, leading to the IntegerArray column being ignored, even if explicitly requested:

In [6]: df.groupby('group').sum()
Out[6]: 
       val2
group      
a         3
b        12

In [7]: df.groupby('group')['val1', 'val2'].sum()
Out[7]: 
       val2
group      
a         3
b        12

Expected Output

I'd expect ExtensionBlock.is_numeric to return True when appropriate, and for behavior to be consistent with non-extension numeric dtypes.

My first impression is that this should be an attribute of the ExtensionArray or ExtensionDtype class that defaults to False, with numeric implementations setting the attribute to True, and ExtensionBlock.is_numeric would read the value from there.
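A minimal sketch of that idea, using plain Python stand-ins rather than the real pandas classes. Every name here (BaseDtype, NullableIntDtype, TextDtype, Block, numeric_column_names, and the is_numeric attribute on the dtype) is hypothetical and only illustrates the proposed lookup; it is not the pandas implementation.

```python
class BaseDtype:
    """Hypothetical stand-in for ExtensionDtype (not the pandas class)."""

    # Default: extension dtypes are treated as non-numeric unless they opt in.
    is_numeric = False


class NullableIntDtype(BaseDtype):
    """Hypothetical numeric dtype that opts in by overriding the flag."""

    is_numeric = True


class TextDtype(BaseDtype):
    """Hypothetical non-numeric dtype that keeps the default."""


class Block:
    """Hypothetical stand-in for ExtensionBlock."""

    def __init__(self, name, dtype):
        self.name = name
        self.dtype = dtype

    @property
    def is_numeric(self):
        # Read the flag from the dtype instead of hardcoding False.
        return self.dtype.is_numeric


def numeric_column_names(blocks):
    """Keep only blocks that report themselves as numeric."""
    return [block.name for block in blocks if block.is_numeric]


blocks = [Block("group", TextDtype()), Block("val1", NullableIntDtype())]
print(numeric_column_names(blocks))  # ['val1']
```

With the flag living on the dtype, a third-party numeric array would only need to set one attribute to be picked up by any code path that filters on the block's is_numeric.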

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 0370740
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.29-galliumos
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0.dev0+456.g0370740
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jschendel jschendel added Numeric Operations Arithmetic, Comparison, and Logical operations ExtensionArray Extending pandas with custom dtypes or arrays. labels Aug 12, 2018
@jschendel jschendel added this to the 0.24.0 milestone Aug 12, 2018
@jbrockmendel
Member

I think instead of hardcoding is_xyz attributes on the block subclasses they should be inferred like in Index. Both for correctness (i.e. this Issue) and so we can share code between Block and Index, both of which are array-like.

@TomAugspurger
Contributor

On my WIP SparseArray branch, I get around this by adding

    @property
    def _is_numeric(self):
        return False

to ExtensionDtype. And then ExtensionBlock looks for that.

We could also infer this from ExtensionDtype.kind and exclude kinds like O, S, and U.
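A rough sketch of kind-based inference, assuming the dtype exposes a NumPy-style one-character kind code. The helper name kind_is_numeric and the exact exclusion set are assumptions for illustration: O, S, and U come from the comment above, while m, M, and V (timedelta, datetime, void) are added here on the assumption that they should not count as numeric either.

```python
# NumPy-style kind codes: b=bool, i=signed int, u=unsigned int, f=float,
# c=complex, m=timedelta, M=datetime, O=object, S=bytes, U=unicode, V=void.
# Assumption: everything outside this exclusion set is treated as numeric.
NON_NUMERIC_KINDS = frozenset("OSUmMV")


def kind_is_numeric(kind):
    """Hypothetical helper: infer numeric-ness from a one-character kind code."""
    return kind not in NON_NUMERIC_KINDS


for kind in "iufOSU":
    print(kind, kind_is_numeric(kind))
```

An exclusion list follows the wording of the suggestion; an allow-list of known numeric kinds would be the more conservative choice if unknown kind codes should default to non-numeric.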

@jreback
Contributor

jreback commented Aug 13, 2018

so for integer I made 2 changes (these are not pushed yet). These provide dispatch to blocks as needed for all ops.

diff --git a/pandas/core/dtypes/common.py b/pandas/core/dtypes/common.py
index b8cbb4150..a40b54919 100644
--- a/pandas/core/dtypes/common.py
+++ b/pandas/core/dtypes/common.py
@@ -1487,7 +1487,8 @@ def is_numeric_dtype(arr_or_dtype):
     if arr_or_dtype is None:
         return False
     tipo = _get_dtype_type(arr_or_dtype)
-    return (issubclass(tipo, (np.number, np.bool_)) and
+    return (issubclass(tipo, (np.number, np.bool_)) or
+            getattr(tipo, 'is_numeric_dtype', False) and
             not issubclass(tipo, (np.datetime64, np.timedelta64)))
 
 
diff --git a/pandas/core/arrays/integer.py b/pandas/core/arrays/integer.py
index c12611706..38ff5bfad 100644
--- a/pandas/core/arrays/integer.py
+++ b/pandas/core/arrays/integer.py
@@ -37,6 +37,10 @@ class _IntegerDtype(ExtensionDtype):
     type = None
     na_value = np.nan
 
+    @property
+    def is_numeric_dtype(self):
+        return True
+
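
One thing worth checking in the first hunk is operator precedence: in Python, `and` binds tighter than `or`, so `A or B and not C` parses as `A or (B and not C)`. The sketch below reduces the return expression to three booleans to show the difference. The function and parameter names are hypothetical, and the grouped variant is only one possible reading of the intent, not the change that was merged.

```python
def numeric_as_written(is_np_number, has_flag, is_datetimelike):
    """Same boolean shape as the patched return: A or B and not C."""
    return is_np_number or has_flag and not is_datetimelike


def numeric_grouped(is_np_number, has_flag, is_datetimelike):
    """Explicit grouping: the datetime-like exclusion applies to both branches."""
    return (is_np_number or has_flag) and not is_datetimelike


# np.timedelta64 is a subclass of np.signedinteger (hence of np.number), so it
# corresponds to is_np_number=True together with is_datetimelike=True.
print(numeric_as_written(True, False, True))  # True
print(numeric_grouped(True, False, True))     # False
```

Under the ungrouped form, a timedelta64 type would pass the np.number check and never reach the exclusion, which is why the grouping matters here.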
