Values dtype #192
Comments
A very timely question. I'm running into this exact problem with float vs. UFloat arrays (uncertainties), specifically in the underlying test case from pandas/tests/extension/base/constructors.py and dtype='pint[meter]':
Pandas wants to construct the Series by filling in the missing values. One of the big promises of EAs is consistent pd.NA handling. I wonder whether it's appropriate to broach that with pint's NumPy facet, or if there's a better way to get Pandas to stay in the PandasArray realm and not touch the NumPy facet at all. The codepath that gets us there is the constructors test mentioned above.
So I created the special case for PlainQuantity and fixed a few erroneous np.nan vs. pd.NA problems in the test cases (following the patterns of pandas/tests/extension/test_integer.py and pandas/tests/extension/test_floating.py, which use pd.NA instead of np.nan for missing data), and lots more test cases are working. But... pandas/core/internals/managers.py has this code in fast_xs:
And in this case it creates an empty array by way of PintArray._from_sequence using an empty list. With no visible master_scalar, this defaults to creating an array of floats, not UFloats. This is where a dtype with a magnitude type would come in handy...
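To make that concrete, here is a minimal sketch (assuming a pint registry that wraps uncertainties' UFloat values plus the UFloat support being discussed in this thread; it is not the actual pint-pandas code): with data present there is a first scalar to inspect, but an empty sequence gives `_from_sequence` nothing to infer the magnitude type from, so it falls back to plain floats.

```python
# Sketch only; assumes pint/pint-pandas patched for UFloat magnitudes.
import pandas as pd
import pint_pandas  # registers the pint[...] extension dtype
from uncertainties import ufloat

# With data present, the first element ("master scalar") reveals that the
# magnitudes are UFloats, so an object-backed array can be built.
s = pd.Series([ufloat(1.0, 0.1), ufloat(2.0, 0.2)], dtype="pint[meter]")

# The fast_xs-style path: Pandas asks the extension array to build itself
# from an empty sequence.  With nothing to inspect, the magnitude dtype
# defaults to float64 rather than UFloat objects.
empty = s.values._from_sequence([], dtype=s.dtype)
print(empty.dtype)  # still pint[meter], but the backing magnitudes are floats
```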
So to set this up in pint-pandas, off the top of my head it would need:
I think a module flag like 'cache_values_dtype' could be used so that those who don't need this don't see the values dtype, as it makes using .astype quite wordy.
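As a purely hypothetical illustration of that wordiness concern (neither the 'pint[m][int]' spelling nor a 'cache_values_dtype' flag exists today), compare how .astype reads with and without the values dtype in the string:

```python
# Hypothetical sketch only: the bracketed values dtype and the module flag
# below are proposals from this issue, not existing pint-pandas features.
import pandas as pd
import pint_pandas  # registers the pint[...] extension dtype

s = pd.Series([1, 2, 3], dtype="pint[m]")

# Today a unit conversion via astype only needs the unit:
s_mm = s.astype("pint[mm]")

# If the dtype string also carried the magnitude dtype, the same conversion
# would need the longer spelling (hypothetical syntax):
#   s_mm = s.astype("pint[mm][float64]")
# A module flag such as pint_pandas.cache_values_dtype (hypothetical) could
# hide the extra bracket from users who don't need it.
```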
Just to mention I've got things back to this point:
and my own test cases are passing. I'll update my PRs shortly for review.
Commits to pint and pint-pandas are pushed. For the record, here are the small adjustments I've also made to pandas:
I'll create a Pandas PR when there's enough chain on the gears from the Pint and Pint-Pandas PRs.
The Pandas PR was submitted last week: pandas-dev/pandas#53970
Pandas 2.1 is going to be much better for Pint-Pandas. All my tests now work with no mods needed to Pandas, and I've reverted much of the complexity I originally thought I needed to handle uncertainties. The simplifications were possible because I re-tested an initial assumption about the compatibility/incompatibility of np.nan and pd.NA. Here's a comment I pulled from the Pandas sources:
From that I learned that the safe way to think about it is "you can do what you want, but you should know that when Pandas does what it wants, it's going to canonicalize NA values to the na_value of the EA dtype". So... pick your NA dtype carefully and be ready for it to show up. And from what I have seen, Pandas is now well-behaved as it manages the interplay between various types of arrays feeding PintArray (IntegerArray and FloatingArray, which can both have pd.NA values, and ndarray, which can have np.nan) and the operation of PintArray functions. I don't think Pint-Pandas needs to implement a second type in brackets.
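A small check of that canonicalization behaviour (a sketch; the exact NA you see back depends on what backs the PintArray):

```python
# Sketch: whatever NA goes in, the NA that Pandas introduces on its own comes
# back as the na_value of the extension dtype.
import numpy as np
import pandas as pd
import pint_pandas  # registers the pint[...] extension dtype

s = pd.Series([1.0, np.nan, 3.0], dtype="pint[m]")

# Reindexing forces Pandas to create missing entries of its own choosing.
print(s.reindex([0, 1, 2, 3]))
```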
Two PintArrays can have the same dtype, e.g. 'pint[m]', but have their data stored with different dtypes, e.g. ints and floats. This used to cause issues when np.array was used for the values, since you couldn't store nan in an int np.array, but it has mostly been fixed since defaulting to PandasArrays for the values. I think there are still minor issues relating to converting and inferring the dtype of the values.
Should the dtype contain the values dtype, e.g. 'pint[m][int]'?
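For illustration, a sketch of the situation (assuming PintArray's constructor accepts a NumPy array of magnitudes together with a 'pint[m]' dtype string, as the tests do):

```python
# Both arrays report the same extension dtype, but the magnitudes underneath
# are stored with different dtypes, which 'pint[m]' alone does not reveal.
import numpy as np
from pint_pandas import PintArray

ints = PintArray(np.array([1, 2, 3], dtype="int64"), dtype="pint[m]")
floats = PintArray(np.array([1.0, 2.0, 3.0]), dtype="pint[m]")

print(ints.dtype == floats.dtype)   # True: both are pint[m]
print(type(ints[0].magnitude))      # likely an integer magnitude...
print(type(floats[0].magnitude))    # ...versus a float one
```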