ENH: nullable Float32/64 ExtensionArray #34307

Merged
35 commits merged into pandas-dev:master on Sep 30, 2020

Conversation

@jorisvandenbossche (Member) commented May 22, 2020

As proposed at #32265 (comment), this is adding a float extension dtype and array (for now avoiding/ignoring the NaN-discussion), using the masked array approach following our other nullable dtypes.

Brief overview:

  • This PR adds a FloatingArray and Float32Dtype / Float64Dtype (similar to the nullable integer dtypes)

  • I am using the "Float64" pattern with a capital "F" as the name for the dtype, to distinguish it from the string name of numpy's "float64" dtype (consistent with Int64 etc.)

  • For now I only added Float32 and Float64 (we can see later if we want to add others like float16)

  • Upon conversion of input (both in construction and in operations such as setitem), any np.nan is coerced to NA; we can later add an option / config for this (see the sketch after this list)

  • The added tests are for now mostly copied from the existing integer/boolean tests, and adapted for Float

  • It directly uses some of the masked ops I implemented earlier for IntegerArray, which means that eg sum already shows a similar perf benefit:

    a = np.random.randn(1_000_000) 
    a[np.random.randint(0, 1_000_000, 100)] = np.nan 
    s1 = pd.Series(a)
    s2 = pd.Series(a, dtype="Float64")
    
    In [6]: np.allclose(s1.sum(), s2.sum()) 
    Out[6]: True
    
    # numpy float series
    In [7]: %timeit s1.sum()   
    2.73 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # nullable float series
    In [8]: %timeit s2.sum()   
    875 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • Still to do:

    • Add whatsnew + docs
    • Division with Int64 dtype can now result in nullable Float Dtype instead of numpy float array
  • Other todo's but that can probably be left for follow-up PRs:

    • Update the convert_dtypes method to also return nullable floats (one question here: what about floats that are all "integer"-like? I think we want to keep returning nullable int for that, but we might want to add a parameter controlling that behaviour?)
    • Some more arithmetical operations with IntegerArray/BooleanArray can now return a FloatingArray instead of numpy float array
    • Ufuncs on IntegerArray/BooleanArray can return FloatingArray, as can ufuncs on a mixture of IntegerArray/FloatingArray
    • Consolidate some tests that are common for integer/boolean/float
    • Deduplicate more things between the Array classes and move to BaseMaskedArray (eg the arithmetic / comparison ops can probably be moved with a bit of refactoring)
      • __from_arrow__
      • ufuncs
      • coerce_to_array, but this will probably require some refactoring
    • formatting options
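As referenced in the coercion bullet above, a minimal sketch of the np.nan -> pd.NA behaviour (illustrative only, based on the description in this PR):

import numpy as np
import pandas as pd

# on construction, np.nan in the input becomes pd.NA in the nullable array
arr = pd.array([0.1, np.nan, 0.3], dtype="Float64")
assert arr[1] is pd.NA

# the same coercion applies on setitem
arr[0] = np.nan
assert arr[0] is pd.NA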

def test_divide_by_zero(self, zero, negative):
    # TODO pending NA/NaN discussion
    # https://github.com/pandas-dev/pandas/issues/32265/
    a = pd.array([0, 1, -1, None], dtype="Float64")
Contributor:

For this and other tests, shouldn't you also test dtype Float32?
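A hedged sketch of what the reviewer suggests, parametrizing over both nullable float widths (test name and structure are illustrative, not taken from the PR):

import pandas as pd
import pytest

@pytest.mark.parametrize("dtype", ["Float32", "Float64"])
def test_construction_with_na(dtype):
    # the same assertions run for both nullable float dtypes
    a = pd.array([0, 1, -1, None], dtype=dtype)
    assert a.dtype.name == dtype
    assert a[3] is pd.NA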

tm.assert_extension_array_equal(result, expected)


def test_cross_type_arithmetic():
Contributor:

Do you need to add tests for arithmetic with pd.NA?

a = pd.array([0, 0, 0, 1, 1, 1, None, None, None], dtype="Float64")
b = pd.array([0, 1, None, 0, 1, None, 0, 1, None], dtype="Float64")
result = a ** b
expected = pd.array([1, 0, None, 1, 1, 1, 1, None, None], dtype="Float64")
Member:

Is it also worth testing cases where Float is raised to an Int power and Int to a Float power (the latter case should I think become Float)?
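For illustration, the cross-dtype power cases being asked about (the expectation that an Int base with a Float exponent comes back as Float64 is the reviewer's suggestion, not verified output from this PR):

import pandas as pd

f = pd.array([1.0, 2.0, None], dtype="Float64")
i = pd.array([1, 2, None], dtype="Int64")

res_float_base = f ** i  # Float raised to an Int power
res_int_base = i ** f    # Int raised to a Float power (suggested to become Float64)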

def test_rpow_one_to_na(self):
    # https://github.com/pandas-dev/pandas/issues/22022
    # https://github.com/pandas-dev/pandas/issues/29997
    arr = pd.array([np.nan, np.nan], dtype="Float64")
Member:

If later NaN and NA become distinguishable in this context, could it make sense to use None or NA explicitly in these tests? (It wouldn't actually matter in this specific case, but still.)

@pep8speaks commented May 29, 2020

Hello @jorisvandenbossche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-30 08:18:27 UTC

@jorisvandenbossche (Member Author) commented May 29, 2020

@dsaxton @Dr-Irv thanks for the comments!
For your comments on test_arithmetic.py: that's mostly because I copied those tests from the integer tests, where it was done like that as well. Not that that's a reason not to improve it, of course ;) but I am completely refactoring those arithmetic tests for integer (#34454), and if that goes through, I will do the same here.

And we indeed also need to start having tests where those dtypes are mixed (eg Int64 with Float64)

@jorisvandenbossche (Member Author)

cc @pandas-dev/pandas-core

tm.assert_extension_array_equal(result, expected)

# reversed
a = a[1:] # Can't raise integers to negative powers.
@dsaxton (Member) commented May 29, 2020

Shouldn't this be possible, with the result now being a FloatingArray? Do we also want to test raising float values (other than NaN) to a FloatingArray power, maybe including fractional values in the exponent?

if not all(isinstance(t, _FloatingDtype) for t in dtypes):
    return None
np_dtype = np.find_common_type(
    [t.numpy_dtype for t in dtypes], []  # type: ignore
Member:

what error was this producing?

Member Author:

@WillAyd This was added in #33607.
The problem is that, because of the `if not all(isinstance(t, _FloatingDtype) for t in dtypes): return None` check, we know that if we end up here we have a list of all _FloatingDtype dtypes (which have a numpy_dtype attribute), but mypy doesn't know this.
A cast(List[_FloatingDtype], dtypes) might help?
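A sketch of how that cast could look, with _FloatingDtype replaced by an illustrative stand-in class so the snippet is self-contained (not the code that was merged):

from typing import List, Optional, cast

import numpy as np


class _FloatingDtype:
    # stand-in for the dtype class in the diff above, for illustration only
    def __init__(self, numpy_dtype: np.dtype) -> None:
        self.numpy_dtype = numpy_dtype


def _get_common_dtype(dtypes: list) -> Optional[np.dtype]:
    if not all(isinstance(t, _FloatingDtype) for t in dtypes):
        return None
    # the early return above guarantees every entry is a _FloatingDtype,
    # but mypy cannot infer that; the cast replaces the `# type: ignore`
    float_dtypes = cast(List[_FloatingDtype], dtypes)
    return np.find_common_type([t.numpy_dtype for t in float_dtypes], [])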

@TomAugspurger (Contributor) left a comment

Are you planning to deduplicate with IntegerArray before or after merging this? I noted a couple of places, but there's plenty more that could be done.

Comment on lines 607 to 610
.. versionchanged:: 1.0.0

Now uses :attr:`pandas.NA` as its missing value,
rather than :attr:`numpy.nan`.
Contributor:

Don't need this versionchanged.

scalars = to_numeric(strings, errors="raise")
return cls._from_sequence(scalars, dtype, copy)

_HANDLED_TYPES = (np.ndarray, numbers.Number)
Contributor:

Need to think about ops involving a mix of our extension arrays, something like

np.add(FloatingArray, IntegerArray)

How feasible is an __array_ufunc__ on a base class of FloatingArray & IntegerArray?

Member Author:

Yes, right now you get a TypeError when calling numpy ufuncs on a mixture of integer / floating arrays. Noted that as a follow-up item.
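For illustration, the kind of mixed-dtype ufunc call being discussed; the TypeError is the behaviour described at the time of this PR (a sketch, not a statement about later versions):

import numpy as np
import pandas as pd

int_arr = pd.array([1, 2, None], dtype="Int64")
float_arr = pd.array([0.5, 1.5, None], dtype="Float64")

# at the time of this PR, mixing masked array types in a ufunc raised;
# returning a FloatingArray instead is noted as a follow-up item
try:
    print(np.add(int_arr, float_arr))
except TypeError as exc:
    print(f"TypeError: {exc}")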

Comment on lines 393 to 394
result = self._data.astype(dtype.numpy_dtype, copy=False)
return type(self)(result, mask=self._mask, copy=False)
Contributor:

Why are these both copy=False? Shouldn't the first be copy=copy?

Member Author:

Good catch, we currently also do this incorrectly for IntegerDtype ..

Member Author:

Fix for the integer case -> #34931 (can copy that over here once the other is merged)
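For illustration, a standalone sketch of the copy semantics being discussed (a hypothetical helper, not the code from #34931): honor the caller's copy argument for the data, and only share the mask when the data was not actually copied.

import numpy as np


def astype_masked_sketch(data: np.ndarray, mask: np.ndarray, new_dtype, copy: bool = True):
    # honor `copy` instead of hard-coding copy=False for the data conversion
    new_data = data.astype(new_dtype, copy=copy)
    # share the mask only if astype returned the same data object
    new_mask = mask if new_data is data else mask.copy()
    return new_data, new_mask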

@@ -319,6 +331,9 @@ def array(
elif inferred_dtype == "integer":
return IntegerArray._from_sequence(data, copy=copy)

elif inferred_dtype == "floating":
Contributor:

Maybe handle mixed-integer-float too?

In [11]: pd._libs.lib.infer_dtype([1., np.nan, 2])
Out[11]: 'mixed-integer-float'

In [12]: pd._libs.lib.infer_dtype([1., 2])
Out[12]: 'mixed-integer-float'

@jorisvandenbossche (Member Author)

@TomAugspurger @WillAyd regarding sharing code with the other masked arrays (it's not only with integer, but also with boolean): yes, this certainly needs to happen.

There are plenty of things that should be shared (some will need a bit of refactoring). See also the last todo item in the top post. I have noted some more things to be shared, and I also already started deduplicating some things I noticed while preparing this PR that could easily be done beforehand: #34187, #34308 (there are other things that could be done prior as well, I think).
(and I also started refactoring the arithmetic tests, with one of the goals to make it easier to share them: #34454)

But about doing those in this PR: I have a slight preference not to. First, because it won't actually make the diff of this PR smaller, but rather more complex (as it would involve moving things from integer.py to one of the other files), so it won't necessarily make this PR easier to review.
And second, because some of those need a bit of refactoring to properly share things. That will certainly be for the better, but it might be cleaner / clearer to do it in separate, more focused PRs (eg rewriting the ufunc machinery, or the integer/float arithmetic ops).

(and I am also committed to actually doing such follow-ups, as I already started doing some PRs prior to this)

For things that are a straight copy-paste and require no changes, those can certainly be done here (or in a precursor PR).

@TomAugspurger (Contributor) commented Jun 1, 2020 via email

@jreback (Contributor) left a comment

I will look soon

@jbrockmendel (Member)

%timeit s1.sum()

Out of curiosity, what is driving the improved perf here?

@jorisvandenbossche (Member Author)

@jbrockmendel it's mostly due to avoiding the "nan.." versions (eg nansum is quite a bit slower than sum; see also #34773).

I think there are mainly two reasons the nanops versions are slower: 1) you need to calculate the mask every time, and 2) the approach nanops takes (copy the array + putmask to fill in a "neutral" value, eg 0 for sum) seems slower than the newer where argument of numpy ufuncs.

E.g. for

a = np.random.randn(1_000_000)  
a[np.random.randint(0, 1_000_000, 100)] = np.nan

the nansum basically is:

In [14]: %%timeit 
    ...: mask = np.isnan(a) 
    ...: temp = a.copy() 
    ...: np.putmask(temp, mask, 0) 
    ...: np.sum(temp) 
2.34 ms ± 48.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

which is also I suppose more or less what numpy does:

In [15]: %timeit np.nansum(a)      
2.15 ms ± 55.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

whereas with a known mask + where argument:

In [16]: %timeit np.sum(a, where=~mask) 
813 µs ± 9.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

(Now, how well the different approaches work also depends on the number of missing values, I think.)

Also, floating-point-wise the two approaches are not identical and result in slightly different numbers (np.nansum and the multi-line nanops implementation above give exactly the same result, but the where approach differs after 14 significant digits).

@jreback (Contributor) left a comment

needs some performance testing to ensure this didn't break anything on regular float arrays (as there are some changes in the shared code)


.. warning::

Experimental: the new floating data types are currently experimental, and its
Contributor:

is there a doc section we can link to?

Member Author:

> is there a doc section we can link to?

Not yet; we can probably put more or less the same content as in this whatsnew somewhere in the user guide.

Contributor:

IMO, we should consolidate https://pandas.pydata.org/pandas-docs/dev/user_guide/integer_na.html and https://pandas.pydata.org/pandas-docs/dev/user_guide/boolean.html into a single "nullable data types" page and add it there (not in this PR).

Member Author:

Indeed, that has been on my personal to-do list, but I never got to it. In general we should have a page about data types (and can still have some subpages if needed).


if isinstance(
-    other, (ABCDataFrame, ABCSeries, ABCIndexClass, IntegerArray)
+    other,
+    (ABCDataFrame, ABCSeries, ABCIndexClass, IntegerArray, FloatingArray),
Contributor:

we should create a superclass NumericArray I think for Integer / Floating
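A minimal sketch of the suggested NumericArray superclass, assuming it serves purely as a shared isinstance target for IntegerArray / FloatingArray (the actual refactor was left for follow-up PRs):

from pandas.core.arrays.masked import BaseMaskedArray


class NumericArray(BaseMaskedArray):
    """Shared base class for IntegerArray and FloatingArray (sketch only)."""
    # shared arithmetic / comparison helpers could live here, so checks like
    # the one above become isinstance(other, NumericArray)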


mask = None

-if isinstance(other, (BooleanArray, IntegerArray)):
+if isinstance(other, (BooleanArray, IntegerArray, FloatingArray)):
Contributor:

same comment as above about creating NumericArray

Member Author:

I used BaseMaskedArray here because we basically want to check for any masked array we now have (including BooleanArray).

@@ -83,6 +83,8 @@ def ensure_float(arr):
float_arr : The original array cast to the float dtype if
possible. Otherwise, the original array is returned.
"""
if is_extension_array_dtype(arr.dtype):
return arr.to_numpy(dtype="float64", na_value=np.nan)
Contributor:

if/elif

@@ -494,7 +495,7 @@ def _cython_operation(
    else:
        values = ensure_int_or_float(values)
elif is_numeric and not is_complex_dtype(values):
-    values = ensure_float64(values)
+    values = ensure_float64(ensure_float(values))
Contributor:

hmm, we really need to do this twice?

Member Author:

I did it this way because we need ensure_float to convert an EA FloatingArray into a numpy float array, and the cython ensure_float64 is written to expect a numpy array, not an EA.
If you already have a float64 ndarray, ensure_float is a no-op.

Contributor:

how about you make the same mod in ensure_float64 then, much more clear

Member Author:

ensure_float64 is a cython function that is used in many places, so changing it might affect performance in many cases and needs to be done carefully; I would rather keep that for a different PR. I also think it might be cleaner to leave those ensure_<dtype> functions dealing only with numpy arrays (in addition, ensure_float64 is implemented as a templated method for many different dtypes, so it's not that straightforward to add a custom check only for float).

@jreback (Contributor) left a comment

a few comments, but much of this needs to share code

base = None
type: Type

def __repr__(self) -> str:
Contributor:

should be in the BaseMaskedArray class

Member Author:

> should be in the BaseMaskedArray class

There would be nothing else using it, as the Integer/Boolean dtypes also have custom implementations. It needs some refactoring to be able to share.

def _is_numeric(self) -> bool:
return True

@cache_readonly
Contributor:

same

Member Author:

done

from pandas.core.arrays._arrow_utils import pyarrow_array_to_numpy_and_mask

pyarrow_type = pyarrow.from_numpy_dtype(self.type)
if not array.type.equals(pyarrow_type):
Contributor:

ideally also push to the super-class (looks doable if you use self.construct_array_type)
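A hedged sketch of that suggestion, written as a standalone function so it is self-contained: the generic logic dispatches via construct_array_type() instead of hard-coding FloatingArray (the function name is hypothetical; pyarrow_array_to_numpy_and_mask is the internal helper already used in the diff above):

import pyarrow

from pandas.core.arrays._arrow_utils import pyarrow_array_to_numpy_and_mask


def masked_array_from_arrow(pandas_dtype, array):
    # generic __from_arrow__ logic shared across the masked dtypes (sketch)
    array_class = pandas_dtype.construct_array_type()
    chunks = array.chunks if isinstance(array, pyarrow.ChunkedArray) else [array]

    results = []
    for arr in chunks:
        data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=pandas_dtype.type)
        # the returned mask flags valid values, hence the inversion below
        results.append(array_class(data.copy(), ~mask, copy=False))

    return array_class._concat_same_type(results)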

@jorisvandenbossche (Member Author)

Thanks for the reviews! Pushed some updates.

Regarding the comments about things that could/should be shared: I copied a few additional properties that can be shared verbatim, but everything else will require some refactoring; see my long comment above on why I think it is better to do that in separate PRs: #34307 (comment)

jorisvandenbossche and others added 3 commits September 30, 2020 08:48
Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
@jorisvandenbossche (Member Author)

Any remaining comments here?

@jreback (Contributor) commented Sep 30, 2020

will look again

@jreback (Contributor) left a comment

OK, good to get this in. As indicated, refactoring to share code is a good next set of PRs. Also, pushing these dtypes into all of the numeric tests via a fixture is really needed (maybe do this first). Please open follow-on issues for these.

results = []
for arr in chunks:
data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=self.type)
float_arr = FloatingArray(data.copy(), ~mask, copy=False)
Contributor:

do we actually need to copy?

# return the same object for copy=False
return self.copy() if copy else self
# if we are astyping to another nullable masked dtype, we can fastpath
if isinstance(dtype, BaseMaskedDtype):
Contributor:

elif



@register_extension_dtype
class Float32Dtype(FloatingDtype):
Contributor:

we have almost no support for float16 (e.g. no algos support) and float128 is platform specific (IIRC)


.. ipython:: python

# the default numpy float64 dtype
Contributor:

these could be non-comments (e.g. regular text just before the block itself)

@jreback merged commit 0846dc1 into pandas-dev:master Sep 30, 2020
@jreback (Contributor) commented Sep 30, 2020

thanks @jorisvandenbossche, really nice! Sorry for the long time to review; as always, shorter PRs are much easier (but I get it!)

Labels: Enhancement, NA - MaskedArrays (related to pd.NA and nullable extension arrays)