Constructor path unexpectedly changes outcome #205

Open
rwijtvliet opened this issue Sep 11, 2023 · 2 comments

rwijtvliet commented Sep 11, 2023

Example

import pandas as pd
import pint
import pint_pandas
import numpy as np
units1 = pd.Series([3.0, np.nan]).astype('pint[MWh]')
units2 = pd.Series([3.0, np.nan], dtype='pint[MWh]')
units3 = pd.Series([3, np.nan], dtype='pint[MWh]')

Inconsistency 1: representation of NaN.

The NaN element at index position 1 is represented differently in units1 than in units2 and units3:

>>> units1[1]
<Quantity(nan, 'megawatt_hour')>
>>> units2[1]
<Quantity(<NA>, 'megawatt_hour')>
>>> units3[1]
<Quantity(<NA>, 'megawatt_hour')>

>>> type(units1[1].m)
<class 'numpy.float64'>
>>> type(units2[1].m)
<class 'pandas._libs.missing.NAType'>
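
The difference between the two missing-value markers can be reproduced with pandas and NumPy alone, without pint. The snippet below is only a minimal sketch of how np.nan and pd.NA behave as plain scalars; the variable names are made up for illustration and nothing here is pint-pandas API.

```python
import numpy as np
import pandas as pd

nan_value = np.float64("nan")  # what units1[1].m holds
na_value = pd.NA               # what units2[1].m and units3[1].m hold

# pandas treats both markers as "missing" ...
both_missing = bool(pd.isna(nan_value)) and bool(pd.isna(na_value))

# ... but only NaN is an actual float.
nan_is_float = isinstance(nan_value, float)
na_is_float = isinstance(na_value, float)

# NaN compares unequal to itself, while comparing pd.NA propagates NA.
nan_equals_itself = bool(nan_value == nan_value)
na_comparison = na_value == 1

# pd.NA cannot be used in a boolean context.
try:
    bool(na_value)
    na_truthiness_error = None
except TypeError as exc:
    na_truthiness_error = exc
```

Code that checks for missing data with pd.isna should work with either marker; code that relies on float arithmetic or truthiness may not.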

Inconsistency 2: impact of int

Getting the magnitude also gives inconsistent results. Surprisingly, the difference here is between units1/units2 on one side and units3 on the other:

>>> units1.pint.m
0    3.0
1    NaN
dtype: float64
>>> units2.pint.m
0    3.0
1    NaN
dtype: float64
>>> units3.pint.m
0       3
1    <NA>
dtype: object

Notice how the last one is a Series of dtype object.
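
As a possible workaround on the user side, an object-dtype magnitude Series can be normalized to float64 after the fact. This is a sketch only: as_float_magnitudes is a hypothetical helper, not part of pint-pandas, and the input below is a plain pandas stand-in for the output of units3.pint.m shown above.

```python
import numpy as np
import pandas as pd


def as_float_magnitudes(magnitudes: pd.Series) -> pd.Series:
    """Return the magnitudes as float64, mapping any missing marker to NaN.

    Hypothetical helper for illustration; not part of pint-pandas.
    """
    cleaned = [np.nan if pd.isna(value) else float(value) for value in magnitudes]
    return pd.Series(cleaned, index=magnitudes.index, dtype="float64")


# Stand-in for `units3.pint.m`: an object Series holding an int and pd.NA.
object_magnitudes = pd.Series([3, pd.NA], dtype=object)
float_magnitudes = as_float_magnitudes(object_magnitudes)
print(float_magnitudes.dtype)
```

This discards the distinction between integer and float magnitudes, so it only makes sense when float64 output is what the caller wants.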

Versions

Tested with the following versions of (pandas, pint, pint-pandas):

  • (1.5.3, 0.22, 0.4)

  • (1.5.3, 0.22, 0.5)

  • (2.1.0, 0.22, 0.5)

All give the same result.

@andrewgsavage
Collaborator

You're seeing the effects of the underlying data being stored in different ways. You can inspect it with
units3.values.data
This happens because the data is passed through pd.array in __init__, which ensures that the data array can store some form of NaN.
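
The effect of pd.array on the two inputs from the example can be checked without pint. This is a minimal sketch of pandas' own type inference, assuming a pandas version with nullable float support (1.2 or later); it does not show what pint-pandas does internally beyond the pd.array call mentioned above.

```python
import numpy as np
import pandas as pd

# Same values as in the issue: one list starts with a float, the other with an int.
float_backed = pd.array([3.0, np.nan])
int_backed = pd.array([3, np.nan])

# pd.array infers a nullable extension dtype from the non-missing values.
print(float_backed.dtype)
print(int_backed.dtype)

# In both cases the np.nan input is stored as a masked (missing) element.
print(pd.isna(float_backed[1]), pd.isna(int_backed[1]))
```

On the pandas versions listed in the issue this should report Float64 for the first array and Int64 for the second, which is the int/float split behind the second inconsistency.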

If I understand correctly, you expect all of these to behave the same. Previously the data was always converted to float, so ints and other types could not be stored, which prevented these issues. However, people wanted to store other dtypes.

Including the data array dtype, e.g. 'dtype: pint[m][int]', would help with these issues (or at least make it clearer why it behaves oddly), but it hasn't been that necessary so far.

@MichaelTiemannOSC
Collaborator

In the Pandas world there's a long-running thread about resolving NA vs. np.nan for null values. I did quite a lot of work for a long time on a branch where I used NA as the na_value instead of np.nan, and it works great (and I believe it also works for both Float64 and Int64). To make this work, Pint simply needs to cast int64 arrays to Int64 and float64 arrays to Float64. I reversed that when I started working on adding uncertainties to Pint and PintPandas, because Uncertainties is very hardwired to using np.nan as its na_value, and I found it easier to align all numeric types to np.nan (notwithstanding the problem you show). I would not be surprised if it were actually easy to add a behavioral flag (NA_VALUE) to set the value Pint and PintPandas use and have it just work. But I don't have time to develop/test that. (Still waiting for my uncertainties changes to make it through.)
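
The int64 to Int64 and float64 to Float64 casts mentioned above can be sketched in plain pandas. The helper to_nullable is hypothetical and only illustrates the idea; it is not code from the branch described in this comment, and the NA_VALUE flag itself is not implemented here.

```python
import numpy as np
import pandas as pd


def to_nullable(values: pd.Series) -> pd.Series:
    """Cast NumPy-backed int64/float64 data to the pandas nullable dtypes.

    Hypothetical helper for illustration; other dtypes are returned unchanged.
    """
    if values.dtype == np.dtype("int64"):
        return values.astype("Int64")
    if values.dtype == np.dtype("float64"):
        return values.astype("Float64")
    return values


int_series = to_nullable(pd.Series([1, 2], dtype="int64"))
float_series = to_nullable(pd.Series([1.0, np.nan], dtype="float64"))

print(int_series.dtype, float_series.dtype)
# The NaN from the float64 input is still reported as missing after the cast.
print(pd.isna(float_series[1]))
```

With both integer and float data in nullable dtypes, a single missing-value marker can be used regardless of whether the input started as ints or floats.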

andrewgsavage mentioned this issue Aug 5, 2024