Consider support for pandas #401

tomchor · 2016-07-12T16:41:10Z

I'd like to suggest as an enhancement that support for Pandas be considered. I see the need for this in several places over the internet (this one, this one, this one and this one for example) and I have also discussed this need in personal communications.

Some points:

I know that, with the huge amount of functionality that pandas has, full support is very hard, but maybe the main functions (mean, std, etc.) and multiplication/division functionality could be supported.
However, given that Numpy is already supported, an expansion to pandas wouldn't have to start from scratch, so there's that.
Furthermore, since I have a lot of interest in a Pandas-with-units extension for my own personal projects, I could help out with the development once we agree on how it should be done (as I'm sure other people would also like to do).

Some examples of how it could work would be:

import numpy as np
import pandas as pd
import pint
ureg = pint.UnitRegistry()
df = pd.DataFrame(np.random.randn(6,3))
df = df * (ureg.m, ureg.km, ureg.cm)
df.sum().sum()
df.mean().sum()

Currently, probably because of the numpy support, df.sum().sum() works, but df.mean().sum() doesn't. Furthermore, in this way, each element has its own unit:

                       0                          1                          2
0   -1.14025689919 meter    2.09836283028 kilometer  0.979569146473 centimeter
1    1.00550215217 meter   0.447878034084 kilometer  0.831879097545 centimeter
2   -1.37479008756 meter  0.0440848611227 kilometer    1.2712599879 centimeter
3   0.250507523208 meter   -1.12502135888 kilometer  0.262470368675 centimeter
4  -0.329479372063 meter  -0.201112807862 kilometer   1.10957183866 centimeter
5      1.049932299 meter   0.109837949121 kilometer  -1.42261917121 centimeter

Maybe a better way would be something like

<QuantitySet(
          0         1         2
0 -1.140257  2.098363  0.979569
1  1.005502  0.447878  0.831879
2 -1.374790  0.044085  1.271260
3  0.250508 -1.125021  0.262470
4 -0.329479 -0.201113  1.109572
5  1.049932  0.109838 -1.422619
'meter', 'kilometer', 'centimeter')>
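
As a rough, pandas-free illustration of why per-column units matter for something like df.sum().sum(): summing meter, kilometer and centimeter columns only makes sense after converting to a common unit. The sketch below uses a hand-written TO_METER table and a total_in_meters helper, both made up for this example (pint would normally supply the conversion factors), so it is only meant to show the bookkeeping, not how pint or pandas do it.

```python
# Hypothetical conversion factors to meters; a real implementation would ask pint.
TO_METER = {"meter": 1.0, "kilometer": 1000.0, "centimeter": 0.01}

# One list of values per column, keyed by that column's unit.
columns = {
    "meter": [1.0, 2.0],
    "kilometer": [0.5, 0.25],
    "centimeter": [100.0, 50.0],
}


def total_in_meters(cols):
    # Sum each column in its own unit, then convert the column total to meters.
    return sum(sum(values) * TO_METER[unit] for unit, values in cols.items())


print(total_in_meters(columns))  # 3 m + 750 m + 1.5 m = 754.5
```

Storing one unit per column, rather than one per element, is what makes the single conversion per column possible here.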

hgrecco · 2016-07-20T03:03:58Z

This is indeed a very nice feature to have. I am not sure what would be the best way, so I would be very happy to have you lead this development. But as a side note, there are always two options:

1) putting a foreign object (in this case a pandas dataset) inside a Quantity object
Pro: you are in control
Con: Every element in the dataset must have the same unit.
2) putting a Quantity Object inside pandas. In the
Pro: Unit can be per row/column
Con: you need to make sure that pandas is doing the right thing with your objects.
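
To make the trade-off between the two options a bit more tangible, here is a minimal sketch in plain Python. UniformTable, Cell and column_units are hypothetical names invented for this illustration; they are not pint or pandas classes, and the real objects are considerably richer.

```python
from dataclasses import dataclass


@dataclass
class UniformTable:
    """Option 1 (hypothetical): one unit wraps the whole table."""
    rows: list
    unit: str


@dataclass
class Cell:
    """Option 2 (hypothetical): each cell carries its own unit."""
    value: float
    unit: str


def column_units(table, col):
    # With per-cell units nothing stops a column from mixing units,
    # so the container (or the user) has to check.
    return {row[col].unit for row in table}


uniform = UniformTable(rows=[[1.0, 2.0], [3.0, 4.0]], unit="meter")
per_cell = [
    [Cell(1.0, "meter"), Cell(2.0, "second")],
    [Cell(3.0, "meter"), Cell(4.0, "second")],
]

print(uniform.unit)               # meter
print(column_units(per_cell, 0))  # {'meter'}
print(column_units(per_cell, 1))  # {'second'}
```

In the first design the unit is trivially consistent but shared by every element; in the second, units can differ per column, at the cost of having to verify what the outer container does with each cell.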

tomchor · 2016-07-20T12:27:37Z

Unfortunately right now I lack the technical knowledge to actively program this feature (I'm not a professional programmer and I'm kind of new to developing packages, class-oriented programming etc.), but I'll do my best to come up with a prototype.

In my opinion the possibility to have different columns with different units is extremely useful, so I'd think that, whatever we do, we should make this feature available. That said, I think that finding a way of putting Quantity objects inside pandas without modifying/monkey-patching pandas is going to be very tricky (again, I'm not an expert so I might be wrong!). But simply using a Quantity object to hold an entire DataFrame would mean that every column has the same unit, so I'd not go that way either.

There is a possible third option, which is to create a new class for pint that holds a DataFrame and a dict of units for it. I called it QuantitySet in the short example in my question. The idea is for it to be just like Quantity, but instead of holding a Numpy array and a unit, it holds a Pandas DataFrame and a dict of units. (Just like pandas has DataFrame and Series, pint would have QuantitySet and Quantity.)

Of course, that would make Pint less simple, and simplicity is one of the beauties of your package, so I'm not sure you'd be willing to do this.

Here's a little snippet that I made to illustrate what I mean. I'm not saying it should be programmed like this; I made this just as a quick viability check:

import pandas as pd

def _with_units(data, units):
    """
    units: dict
        dictionary with the names of each column (key) and their pint unit (val)
    """
    data = data.copy()
    cols = data.columns
    # Label each column with its unit, or '<?>' when no unit is given
    unts = [ '<{}>'.format(units[c]) if c in units else '<?>' for c in cols ]
    # list() is needed because zip() returns an iterator on Python 3
    columns = pd.MultiIndex.from_tuples(list(zip(cols, unts)))
    data.columns = columns
    return data

# Monkey-patch DataFrame so that df.with_units(units) is available
pd.DataFrame.with_units = _with_units

class myData(object):
    """
    Attempt to create a pint/pandas support
    """
    watermark = '\n<QuantitySet>'
    def __init__(self, df, dic):
        # Keep a private copy of the data plus the column -> unit mapping
        self.df = df.copy()
        self.dic = dic
    def __getitem__(self, item):
        return self.df.with_units(self.dic).__getitem__(item)
    def __repr__(self):
        return self.df.with_units(self.dic).__repr__()+self.watermark
    def __str__(self):
        return self.df.with_units(self.dic).__str__()+self.watermark

For me, this returns:

print(mydata)
                                       u                v
                        <meter / second> <meter / second>
%Y-%m-%d-%H-%M-%S.%f                                     
2013-03-01 00:00:00.000   -1.5411300e+00    1.2202900e+00
2013-03-01 00:00:00.100   -1.4924000e+00    1.1553200e+00
2013-03-01 00:00:00.200   -1.5086400e+00    1.1390800e+00
2013-03-01 00:00:00.300   -1.4111800e+00    1.1065900e+00
2013-03-01 00:00:00.400   -1.4924000e+00    1.0741000e+00
2013-03-01 00:00:00.500   -1.4924000e+00    1.1065900e+00
2013-03-01 00:00:00.600   -1.6710800e+00    1.1715600e+00
...                                  ...              ...
2013-03-01 00:29:59.300   -1.0863000e+00    1.0416100e+00
2013-03-01 00:29:59.400   -1.0700600e+00    1.0416100e+00
2013-03-01 00:29:59.500   -1.0700600e+00    1.0741000e+00
2013-03-01 00:29:59.600   -1.1025500e+00    1.0741000e+00
2013-03-01 00:29:59.700   -1.1187900e+00    1.0903500e+00
2013-03-01 00:29:59.800   -1.1675200e+00    1.1065900e+00
2013-03-01 00:29:59.900   -1.2162500e+00    1.1228300e+00

[18000 rows x 2 columns]
<QuantitySet>

Any thoughts?

khaeru · 2017-02-17T04:16:20Z

Hi—I ran into this issue tonight.

…there are always two options:
…
2. putting a Quantity Object inside pandas
Pro: Unit can be per row/column
Con: you need to make sure that pandas is doing the right thing with your objects.

FWIW, this would be workable, but something about Quantity seems to give some trouble:

>>> import pandas as pd
>>> import pint
>>> ureg = pint.UnitRegistry()
>>> df = pd.DataFrame([[ureg('5 m'), 1.2, 'a'], [ureg('10 m'), 3.4, 'b']])
>>> df
          0    1  2
0   5 meter  1.2  a
1  10 meter  3.4  b
>>> df.dtypes
0     object
1    float64
2     object
dtype: object

Assignment does not work as expected:

>>> df.loc[0, 0] = ureg('15 m')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-3d760c092671> in <module>()
----> 1 df.loc[0, 0] = ureg('15 m')

/home/khaeru/.local/lib/python3.5/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
    139             key = com._apply_if_callable(key, self.obj)
    140         indexer = self._get_setitem_indexer(key)
--> 141         self._setitem_with_indexer(indexer, value)
    142 
    143     def _has_valid_type(self, k, axis):

/home/khaeru/.local/lib/python3.5/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
    536                 else:
    537 
--> 538                     if len(labels) != len(value):
    539                         raise ValueError('Must have equal len keys and value '
    540                                          'when setting with an iterable')

/usr/lib/python3/dist-packages/pint/quantity.py in __len__(self)
   1079 
   1080     def __len__(self):
-> 1081         return len(self._magnitude)
   1082 
   1083     def __iter__(self):

TypeError: len() of unsized object

or

>>> df.loc[:, 0] = ureg('15 m')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-cdabd0d4d93c> in <module>()
----> 1 df.loc[:, 0] = ureg('15 m')

/home/khaeru/.local/lib/python3.5/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
    139             key = com._apply_if_callable(key, self.obj)
    140         indexer = self._get_setitem_indexer(key)
--> 141         self._setitem_with_indexer(indexer, value)
    142 
    143     def _has_valid_type(self, k, axis):

/home/khaeru/.local/lib/python3.5/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
    536                 else:
    537 
--> 538                     if len(labels) != len(value):
    539                         raise ValueError('Must have equal len keys and value '
    540                                          'when setting with an iterable')

/usr/lib/python3/dist-packages/pint/quantity.py in __len__(self)
   1079 
   1080     def __len__(self):
-> 1081         return len(self._magnitude)
   1082 
   1083     def __iter__(self):

TypeError: len() of unsized object

One must instead do things like:

>>> df.loc[:, 0] = pd.Series([ureg('15 m')] * 2)
>>> df
          0    1  2
0  15 meter  1.2  a
1  15 meter  3.4  b

…so basic pandas functionality is not really available.

After a bit of digging, this error ultimately seems to happen because pandas.types.inference.is_list_like():

def is_list_like(arg):
    return (hasattr(arg, '__iter__') and
            not isinstance(arg, string_and_binary_types))

…returns True for pint.Quantity, which has:

    def __iter__(self):
        # Allow exception to propagate in case of non-iterable magnitude
        it_mag = iter(self.magnitude)
        return iter((self.__class__(mag, self._units) for mag in it_mag))

Perhaps Quantity could set, or unset, __iter__, depending on whether it is in fact iterable? This seems to be what pandas expects. But the documentation seems to suggest it should be adequate to have __iter__ = None to indicate that iteration is not available; so perhaps pandas is wrong in expecting hasattr(arg, '__iter__') to mean the same thing. Not sure…
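
The difference between "has an __iter__ attribute" and "can actually be iterated" can be reproduced without pandas or pint. In the sketch below, ScalarQuantity and OptedOut are hypothetical stand-ins (not the real pint.Quantity), is_list_like is a simplified copy of the check quoted above, and can_iterate is a made-up helper that simply tries iter().

```python
from collections.abc import Iterable


class ScalarQuantity:
    """Hypothetical stand-in for pint.Quantity holding a scalar magnitude."""

    def __init__(self, magnitude, units):
        self.magnitude = magnitude
        self.units = units

    def __iter__(self):
        # Mirrors the quoted method: iter() on a scalar magnitude raises TypeError.
        it_mag = iter(self.magnitude)
        return iter(ScalarQuantity(mag, self.units) for mag in it_mag)

    def __len__(self):
        return len(self.magnitude)


class OptedOut:
    """Hypothetical class that disables iteration with __iter__ = None."""

    __iter__ = None


def is_list_like(arg):
    # Simplified version of the pandas check quoted above.
    return hasattr(arg, "__iter__") and not isinstance(arg, (str, bytes))


def can_iterate(arg):
    # Ask Python to actually build an iterator instead of checking the attribute.
    try:
        iter(arg)
    except TypeError:
        return False
    return True


q = ScalarQuantity(15, "meter")
print(is_list_like(q), can_iterate(q))                    # True False
print(is_list_like(OptedOut()), can_iterate(OptedOut()))  # True False
print(isinstance(OptedOut(), Iterable))                   # False
```

So a hasattr-based check reports both objects as list-like even though neither can be iterated, whereas collections.abc.Iterable does honor __iter__ = None. It still cannot detect the scalar-magnitude case, since that only fails once __iter__ is called.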

Hope this is helpful somewhat.

dalito · 2017-02-17T08:30:40Z

Supporting pandas well requires (IMHO) solving the pandas issue "Allow custom metadata to be attached to panel/df/series?".

Bernhard10 · 2017-08-01T11:06:02Z

Pandas has an open enhancement proposal for support of units (via pint and/or other libraries), but it was last active in 2015:
pandas-dev/pandas#10349

@andrewgsavage

684: Add pandas support r=hgrecco a=znicholls

This pull request adds pandas support to pint (hence is related to #645, #401 and pandas-dev/pandas#10349). An example can be seen in `example-notebooks/basic-example.ipynb`.

It's a little bit hackish, feedback would be greatly appreciated by me and @andrewgsavage. One obvious example is that we have to run all the interface tests with `pytest` to fit with `pandas` test suite, which introduces a dependency for the CI and currently gives us this awkward testing setup (see the alterations we had to make to `testsuite`). This also means that our code coverage tests are fiddly too.

If you'd like us to squash the commits, that can be done. If pint has a linter, it would be good to run that over this pull request too as we're a little bit all over the place re style.

Things to discuss:

- [x] general feedback and changes
- [x] test setup, especially need for pytest for pandas tests and hackish way to get around automatic discovery
- [x] squashing/rebasing
- [x] linting/other code style (related to #664 and #628: we're happy with whatever, I've found using an automatic linter e.g. black and/or flake8 has made things much simpler in other projects)
- [x] including notebooks in the repo (if we want to, I'm happy to put them under CI so we can make sure they run)
- [x] setting up the docs correctly

Co-authored-by: Zebedee Nicholls <zebedee.nicholls@climate-energy-college.org>
Co-authored-by: andrewgsavage <andrewgsavage@gmail.com>

tomchor mentioned this issue Aug 1, 2017

ENH: unit of measurement / physical quantities pandas-dev/pandas#10349

Closed

wisp3rwind mentioned this issue Jun 4, 2018

Pandas support using new API #645

Closed

znicholls mentioned this issue Aug 29, 2018

Add pandas support #684

Merged

hgrecco closed this as completed Dec 3, 2019
