-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: nullable Float32/64 ExtensionArray #34307
Conversation
def test_divide_by_zero(self, zero, negative): | ||
# TODO pending NA/NaN discussion | ||
# https://github.com/pandas-dev/pandas/issues/32265/ | ||
a = pd.array([0, 1, -1, None], dtype="Float64") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for this and other tests, shouldn't you also test dtype Float32
?
tm.assert_extension_array_equal(result, expected) | ||
|
||
|
||
def test_cross_type_arithmetic(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to add tests for arithmetic with pd.NA
?
a = pd.array([0, 0, 0, 1, 1, 1, None, None, None], dtype="Float64") | ||
b = pd.array([0, 1, None, 0, 1, None, 0, 1, None], dtype="Float64") | ||
result = a ** b | ||
expected = pd.array([1, 0, None, 1, 1, 1, 1, None, None], dtype="Float64") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it also worth testing cases where Float is raised to an Int power and Int to a Float power (the latter case should I think become Float)?
def test_rpow_one_to_na(self): | ||
# https://github.com/pandas-dev/pandas/issues/22022 | ||
# https://github.com/pandas-dev/pandas/issues/29997 | ||
arr = pd.array([np.nan, np.nan], dtype="Float64") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If later NaN and NA are distinguishable in this context could it make sense to use None or NA explicitly in these tests (it wouldn't actually matter in this specific case but still)?
Hello @jorisvandenbossche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-09-30 08:18:27 UTC |
@dsaxton @Dr-Irv thanks for the comments! And we indeed also need to start having tests where those dtypes are mixed (eg Int64 with Float64) |
cc @pandas-dev/pandas-core |
tm.assert_extension_array_equal(result, expected) | ||
|
||
# reversed | ||
a = a[1:] # Can't raise integers to negative powers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be possible, with the result now being a FloatingArray? Do we also want to test raising float values (other than NaN) to a FloatingArray power, maybe including fractional values in the exponent?
pandas/core/arrays/floating.py
Outdated
if not all(isinstance(t, _FloatingDtype) for t in dtypes): | ||
return None | ||
np_dtype = np.find_common_type( | ||
[t.numpy_dtype for t in dtypes], [] # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what error was this producing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd This was added in #33607.
The problem is that by doing the if not all(isinstance(t, _FloatingDtype) for t in dtypes): return None
, we know that by ending up here, we have a list of all FloatingDtype dtypes (which have a numpy_dtype
attribute). But mypy doesn't know this.
A cast(List[_FloatingDtype], dtypes)
might help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you planning to deduplicate with IntegerArray before or after merging this? I noted a couple places, but there's plenty more that could be done.
pandas/core/arrays/floating.py
Outdated
.. versionchanged:: 1.0.0 | ||
|
||
Now uses :attr:`pandas.NA` as its missing value, | ||
rather than :attr:`numpy.nan`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need this versionchanged.
scalars = to_numeric(strings, errors="raise") | ||
return cls._from_sequence(scalars, dtype, copy) | ||
|
||
_HANDLED_TYPES = (np.ndarray, numbers.Number) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to think about ops involving a mix of our extension arrays something like
np.add(FloatingArray, IntegerArray)
How feasible is an __array_ufunc__
on a base class of FloatingArray & IntegerArray?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, right now you get a TypeError when calling numpy ufuncs on a mixture of integer / floating arrays. Noted that as a follow-up item.
pandas/core/arrays/floating.py
Outdated
result = self._data.astype(dtype.numpy_dtype, copy=False) | ||
return type(self)(result, mask=self._mask, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these both copy=False
? Shouldn't the first be copy=copy
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, we currently also do incorrectly for IntegerDtype ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix for the integer case -> #34931 (can copy that over here once the other is merged)
pandas/core/construction.py
Outdated
@@ -319,6 +331,9 @@ def array( | |||
elif inferred_dtype == "integer": | |||
return IntegerArray._from_sequence(data, copy=copy) | |||
|
|||
elif inferred_dtype == "floating": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe handle mixed-integer-float
too?
In [11]: pd._libs.lib.infer_dtype([1., np.nan, 2])
Out[11]: 'mixed-integer-float'
In [12]: pd._libs.lib.infer_dtype([1., 2])
Out[12]: 'mixed-integer-float'
@TomAugspurger @WillAyd regarding sharing code with the other masked arrays (it's not only with integer, but also with boolean): yes, this certainly needs to happen. There are plenty of things that should be shared (some will need a bit of refactoring). See also the last todo item in the top post. I have noted some more things to be shared, and I also actually already started deduplicating some things that I noticed while preparing this PR that could easily be done prior to this PR: #34187, #34308 (there are also other things that could be done prior, I think). But about doing those in this PR: I have a slight preference to not do this. First, because it won't actually make the diff of this PR smaller, but rather more complex (as it would involve moving things from (and I am also commited to actually doing such follow-ups, like I already started doing some PRs prior to this) For things that are straight copy-paste and require no changes, those can certainly be done here (or in a pre-cursor PR). |
I'm happy to have the deduplication done as followups. That's easier for me
to understand than abstracting things ahead of time.
…On Mon, Jun 1, 2020 at 7:23 AM Joris Van den Bossche < ***@***.***> wrote:
@TomAugspurger <https://github.com/TomAugspurger> @WillAyd
<https://github.com/WillAyd> regarding sharing code with the other masked
arrays (it's not only with integer, but also with boolean): yes, this
certainly needs to happen.
There are plenty of things that should be shared (some will need a bit of
refactoring). See also the last todo item in the top post. I have noted
some more things to be shared, and I also actually already started
deduplicating some things that I noticed while preparing this PR that could
easily be done prior to this PR: #34187
<#34187>, #34308
<#34308> (there are also other
things that could be done prior, I think).
(and I also started refactoring the arithmetic tests, with one of the
goals to make it easier to share them: #34454
<#34454>)
But about doing those in *this* PR: I have a slight preference to not do
this. First, because it won't actually make the diff of this PR smaller,
but rather more complex (as it would involve moving things from integer.py
to one of the other files). So it won't necessarily make it easier to
review this PR.
And second, because some of those need a bit of refactoring to properly
share things. That will certainly be for the better, but it might be
cleaner / clearer to do this in separate, more focused PRs (eg rewriting
the ufunc machinery, rewriting the integer/float_arithmetic_ops, .. I
would rather do those in separate PRs).
(and I am also commited to actually doing such follow-ups, like I already
started doing some PRs prior to this)
For things that are straight copy-paste and require no changes, those can
certainly be done here (or in a pre-cursor PR).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#34307 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOITHDGMYPAFVKC6RBV3RUOMUBANCNFSM4NHWOC7Q>
.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i will look soon
Out of curiosity, what is driving the improved perf here? |
@jbrockmendel it's mostly due to avoiding using the "nan.." versions (eg I think there are mostly two reasons the nanops versions are slower: 1) you need to calculate the mask every time, and 2) the approach nanops takes (copy array + putmask to fill in "neutral" value, eg 0 for sum) seems slower compared to the newer Eg for
the nansum basically is:
which is also I suppose more or less what numpy does:
whereas with a known mask + where argument:
(now, it also depends on the number of values that are missing how well the different approaches work, I think) Also, floating-point-wise, both approaches are also not the same, and result in slightly different numbers ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs some performance testing to ensure that didn't break anything on regular float arrays (as there are some changes in the shared code)
doc/source/whatsnew/v1.1.0.rst
Outdated
|
||
.. warning:: | ||
|
||
Experimental: the new floating data types are currently experimental, and its |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a doc section can link to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a doc section can link to?
Not yet, can probably put more or less the same content as in this whatsnew somewhere in the user guide
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, we should consolidate https://pandas.pydata.org/pandas-docs/dev/user_guide/integer_na.html and https://pandas.pydata.org/pandas-docs/dev/user_guide/boolean.html into a single "nullable data types" page and add it there (not in this PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, that has been on my personal to do list, but never got to it. In general we should have a page about data types (and can still have some subpages if needed)
pandas/core/arrays/boolean.py
Outdated
|
||
if isinstance( | ||
other, (ABCDataFrame, ABCSeries, ABCIndexClass, IntegerArray) | ||
other, | ||
(ABCDataFrame, ABCSeries, ABCIndexClass, IntegerArray, FloatingArray), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should create a superclass NumericArray I think for Integer / Floating
pandas/core/arrays/integer.py
Outdated
|
||
mask = None | ||
|
||
if isinstance(other, (BooleanArray, IntegerArray)): | ||
if isinstance(other, (BooleanArray, IntegerArray, FloatingArray)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comments above about creating NumericArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used here BaseMaskedArray because we basically want to check for any masked array we now have (including BooleanArray)
pandas/core/dtypes/common.py
Outdated
@@ -83,6 +83,8 @@ def ensure_float(arr): | |||
float_arr : The original array cast to the float dtype if | |||
possible. Otherwise, the original array is returned. | |||
""" | |||
if is_extension_array_dtype(arr.dtype): | |||
return arr.to_numpy(dtype="float64", na_value=np.nan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if/elif
@@ -494,7 +495,7 @@ def _cython_operation( | |||
else: | |||
values = ensure_int_or_float(values) | |||
elif is_numeric and not is_complex_dtype(values): | |||
values = ensure_float64(values) | |||
values = ensure_float64(ensure_float(values)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, we really need to do this twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it this way because we need to have ensure_float
to convert a EA FloatingArray into a numpy float array, and the cython ensure_float64
is written to expect a numpy array, not an EA.
If you actually have already a float64 ndarray, the ensure_float
is a no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about you make the same mod in ensure_float64 then, much more clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ensure_float64
is a cython function that is used in many places, so doing something like that might affect performance in many cases, so needs to be done carefully, so I would rather want to keep that for a different PR. I also think it might be clearn to leave those ensure_<dtype>
to deal with only numpy arrays (in addition, ensure_float64
is also implemented as a template-ized method for many different dtypes, so it's not that straightforward to add a custom check only for float).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few comments, but much of this needs to share code
base = None | ||
type: Type | ||
|
||
def __repr__(self) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should e in the BaseMaskedArray class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be in the BaseMaskedArray class
There would be nothing else using it, as the Integer/Boolean also have custom implementations. It needs some refactor to be able to share.
pandas/core/arrays/floating.py
Outdated
def _is_numeric(self) -> bool: | ||
return True | ||
|
||
@cache_readonly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
from pandas.core.arrays._arrow_utils import pyarrow_array_to_numpy_and_mask | ||
|
||
pyarrow_type = pyarrow.from_numpy_dtype(self.type) | ||
if not array.type.equals(pyarrow_type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally also push to the super-class (looks doable if you use self.construct_array_type
Thanks for the reviews! Pushed some updates. Regarding the comments about things that could/should be shared: I copied a few additional properties that can be shared verbatim, but for all others this will require some refactor, and see my long comment about that above why I think it is better to do that in separate PRs: #34307 (comment) |
Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
Any remaining comments here? |
will look again |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok good to get this in. as indicated refactoring to share code is a good next set of PRs. Also really pushing this dtypes into all of the numeric tests via a fixture is needed (maybe this first). pls open followon issues for these
results = [] | ||
for arr in chunks: | ||
data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=self.type) | ||
float_arr = FloatingArray(data.copy(), ~mask, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we actually need to copy?
# return the same object for copy=False | ||
return self.copy() if copy else self | ||
# if we are astyping to another nullable masked dtype, we can fastpath | ||
if isinstance(dtype, BaseMaskedDtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elif
|
||
|
||
@register_extension_dtype | ||
class Float32Dtype(FloatingDtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have almost no support for float16 (e.g. no algos support) and float128 is platform specific (IIRC)
|
||
.. ipython:: python | ||
|
||
# the default numpy float64 dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be non-comments (e.g. just before the block itself)
thanks @jorisvandenbossche really nice, sorry for the long time to review, as always shorter PRs are much easier (but get it!) |
As proposed at #32265 (comment), this is adding a float extension dtype and array (for now avoiding/ignoring the NaN-discussion), using the masked array approach following our other nullable dtypes.
Brief overview:
This PR adds a
FloatingArray
andFloat32Dtype
/Float64Dtype
(similar as for the nullable integer dtypes)I am using the "Float64" pattern with capital "F" as name for the dtype to distinguish the string name of the "float64" dtype from numpy (consistent with Int64 etc)
For now I only added Float32 and Float64 (we can see later if we want to add others like float16)
Upon conversion of input (both in construction as in operations such as setitem), any
np.nan
is coerced to NA (we can see later to add an option / config for this)The added tests are for now mostly copied from the existing integer/boolean tests, and adapted for Float
It directly uses some of the masked ops I implemented earlier for IntegerArray, which means that eg
sum
already shows a similar perf benefit:Still to do:
Other todo's but that can probably be left for follow-up PRs:
convert_dtypes
method to also return nullable floats (one question here: what with floats that are all "integer"-like? I think we want to keep returning nullable int for that, but we might want to add a parameter controlling that behaviour?)__from_arrow__
coerce_to_array
, but this will probably require some refactoring