Consider support for pandas #401

Closed
tomchor opened this issue Jul 12, 2016 · 5 comments

Comments

@tomchor

tomchor commented Jul 12, 2016

I'd like to suggest, as an enhancement, that support for Pandas be considered. I see the need for this in several places over the internet (this one, this one, this one and this one for example) and I have also discussed this need in personal communications.

Some points:

  • I know that, with the huge amount of functionality that pandas has, full support is very hard, but maybe the main functions (mean, std, etc.) and multiplication/division functionality could be supported
  • However, given that Numpy is already supported, an expansion to pandas wouldn't have to start from scratch, so there's that
  • Furthermore, since I have a lot of interest in a Pandas-with-units extension for my own personal projects, I could help out with the development once we agree on how it should be done (as I'm sure other people would also like to do)

Some examples of how it could work would be:

import numpy as np
import pandas as pd
import pint

ureg = pint.UnitRegistry()
df = pd.DataFrame(np.random.randn(6, 3))
df = df * (ureg.m, ureg.km, ureg.cm)
df.sum().sum()
df.mean().sum()

Currently, probably because of the numpy support, df.sum().sum() works, but df.mean().sum() doesn't. Furthermore, in this way, each element has its own unit:

                       0                          1                          2
0   -1.14025689919 meter    2.09836283028 kilometer  0.979569146473 centimeter
1    1.00550215217 meter   0.447878034084 kilometer  0.831879097545 centimeter
2   -1.37479008756 meter  0.0440848611227 kilometer    1.2712599879 centimeter
3   0.250507523208 meter   -1.12502135888 kilometer  0.262470368675 centimeter
4  -0.329479372063 meter  -0.201112807862 kilometer   1.10957183866 centimeter
5      1.049932299 meter   0.109837949121 kilometer  -1.42261917121 centimeter

Maybe a better way would be something like

<QuantitySet(
          0         1         2
0 -1.140257  2.098363  0.979569
1  1.005502  0.447878  0.831879
2 -1.374790  0.044085  1.271260
3  0.250508 -1.125021  0.262470
4 -0.329479 -0.201113  1.109572
5  1.049932  0.109838 -1.422619
'meter', 'kilometer', 'centimeter')>
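
To make the idea of one unit per column a bit more concrete, here is a minimal pure-Python sketch. It is only an illustration under stated assumptions: `ColumnUnits`, `column_sum`, `total_in_meters` and the `TO_METER` table are hypothetical names, not part of pint or pandas, and a real implementation would delegate conversions to pint's unit registry rather than a hand-written table.

```python
# Hypothetical conversion factors to meters (assumption for this sketch only;
# a real implementation would ask pint for these).
TO_METER = {"meter": 1.0, "kilometer": 1000.0, "centimeter": 0.01}


class ColumnUnits:
    """Toy container: plain numeric columns plus one unit name per column."""

    def __init__(self, columns, units):
        self.columns = columns  # dict: column name -> list of floats
        self.units = units      # dict: column name -> unit name

    def column_sum(self, name):
        """Sum a single column; the result keeps that column's unit."""
        return sum(self.columns[name]), self.units[name]

    def total_in_meters(self):
        """Sum all columns after converting each one to meters."""
        return sum(
            sum(values) * TO_METER[self.units[name]]
            for name, values in self.columns.items()
        )


qs = ColumnUnits(
    {0: [1.0, 2.0], 1: [0.5, 0.25], 2: [50.0, 150.0]},
    {0: "meter", 1: "kilometer", 2: "centimeter"},
)
print(qs.column_sum(1))
print(qs.total_in_meters())
```

Because the numbers stay as plain floats and the units live beside them, reductions within a column need no unit handling at all, and only reductions across columns have to convert to a common unit first.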
@hgrecco
Owner

hgrecco commented Jul 20, 2016

This is indeed a very nice feature to have. I am not sure what would be the best way, so I would be very happy to have you lead this development. But as a side note, there are always two options:

  1. putting a foreign object (in this case a pandas dataset) inside a Quantity object
     • Pro: you are in control
     • Con: Every element in the dataset must have the same unit.
  2. putting a Quantity Object inside pandas
     • Pro: Unit can be per row/column
     • Con: you need to make sure that pandas is doing the right thing with your objects.

@tomchor
Author

tomchor commented Jul 20, 2016

Unfortunately right now I lack the technical knowledge to actively program this feature (I'm not a professional programmer and I'm kind of new to developing packages, class-oriented programming etc.), but I'll do my best to come up with a prototype.

In my opinion the possibility to have different columns with different units is extremely useful, so I'd think that, whatever we do, we should make this feature available. That said, I think that finding a way of putting Quantity objects inside pandas without modifying/monkey-patching pandas is going to be very tricky (again, I'm not an expert so I might be wrong!). But simply using a Quantity object to hold an entire DataFrame would mean that every column has the same unit, so I'd not go that way either.

There is a possible third option, which is to create a new class for pint that holds a DataFrame and a dict of units for it. I called it QuantitySet in the short example in my question. The idea is for it to be just like Quantity, but instead of holding a Numpy array and a unit, it holds a Pandas DataFrame and a dict of units. (Just like pandas has DataFrame and Series, pint would have QuantitySet and Quantity.)

Of course, that would make Pint less simple, and simplicity is one of the beauties of your package, so I'm not sure you'd be willing to do this.

Here's a little snippet that I made to illustrate what I mean. I'm not saying it should be programmed like this; I made this just as a quick viability check:

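import pandas as pd

def _with_units(data, units):
    """
    Return a copy of data whose column labels carry a second level with units.

    units: dict
        dictionary with the names of each column (key) and their pint unit (val)
    """
    data = data.copy()
    cols = data.columns
    # Columns that have no entry in `units` are labelled '<?>'
    unts = ['<{}>'.format(units[c]) if c in units else '<?>' for c in cols]
    # Materialize the zip, since from_tuples may need a sequence rather than an iterator
    data.columns = pd.MultiIndex.from_tuples(list(zip(cols, unts)))
    return data

# Attach as a DataFrame method (monkey-patch, for this viability check only)
pd.DataFrame.with_units = _with_units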

class myData(object):
    """
    Attempt to create a pint/pandas support
    """
    watermark = '\n<QuantitySet>'
    def __init__(self, df, dic):
        self.df = df.copy()
        self.dic = dic
    def __getitem__(self, item):
        return self.df.with_units(self.dic).__getitem__(item)
    def __repr__(self):
        return self.df.with_units(self.dic).__repr__()+self.watermark
    def __str__(self):
        return self.df.with_units(self.dic).__str__()+self.watermark

For me, this returns

print(mydata)
                                       u                v
                        <meter / second> <meter / second>
%Y-%m-%d-%H-%M-%S.%f                                     
2013-03-01 00:00:00.000   -1.5411300e+00    1.2202900e+00
2013-03-01 00:00:00.100   -1.4924000e+00    1.1553200e+00
2013-03-01 00:00:00.200   -1.5086400e+00    1.1390800e+00
2013-03-01 00:00:00.300   -1.4111800e+00    1.1065900e+00
2013-03-01 00:00:00.400   -1.4924000e+00    1.0741000e+00
2013-03-01 00:00:00.500   -1.4924000e+00    1.1065900e+00
2013-03-01 00:00:00.600   -1.6710800e+00    1.1715600e+00
...                                  ...              ...
2013-03-01 00:29:59.300   -1.0863000e+00    1.0416100e+00
2013-03-01 00:29:59.400   -1.0700600e+00    1.0416100e+00
2013-03-01 00:29:59.500   -1.0700600e+00    1.0741000e+00
2013-03-01 00:29:59.600   -1.1025500e+00    1.0741000e+00
2013-03-01 00:29:59.700   -1.1187900e+00    1.0903500e+00
2013-03-01 00:29:59.800   -1.1675200e+00    1.1065900e+00
2013-03-01 00:29:59.900   -1.2162500e+00    1.1228300e+00

[18000 rows x 2 columns]
<QuantitySet>

Any thoughts?

@khaeru
Contributor

khaeru commented Feb 17, 2017

Hi—I ran into this issue tonight.

…there are always two options:

2. putting a Quantity Object inside pandas
Pro: Unit can be per row/column
Con: you need to make sure that pandas is doing the right thing with your objects.

FWIW, this would be workable, but something about Quantity seems to give some trouble:

>>> import pandas as pd
>>> import pint
>>> ureg = pint.UnitRegistry()
>>> df = pd.DataFrame([[ureg('5 m'), 1.2, 'a'], [ureg('10 m'), 3.4, 'b']])
>>> df
          0    1  2
0   5 meter  1.2  a
1  10 meter  3.4  b
>>> df.dtypes
0     object
1    float64
2     object
dtype: object

Assignment does not work as expected:

>>> df.loc[0, 0] = ureg('15 m')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-3d760c092671> in <module>()
----> 1 df.loc[0, 0] = ureg('15 m')

/home/khaeru/.local/lib/python3.5/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
    139             key = com._apply_if_callable(key, self.obj)
    140         indexer = self._get_setitem_indexer(key)
--> 141         self._setitem_with_indexer(indexer, value)
    142 
    143     def _has_valid_type(self, k, axis):

/home/khaeru/.local/lib/python3.5/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
    536                 else:
    537 
--> 538                     if len(labels) != len(value):
    539                         raise ValueError('Must have equal len keys and value '
    540                                          'when setting with an iterable')

/usr/lib/python3/dist-packages/pint/quantity.py in __len__(self)
   1079 
   1080     def __len__(self):
-> 1081         return len(self._magnitude)
   1082 
   1083     def __iter__(self):

TypeError: len() of unsized object

or

>>> df.loc[:, 0] = ureg('15 m')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-cdabd0d4d93c> in <module>()
----> 1 df.loc[:, 0] = ureg('15 m')

/home/khaeru/.local/lib/python3.5/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
    139             key = com._apply_if_callable(key, self.obj)
    140         indexer = self._get_setitem_indexer(key)
--> 141         self._setitem_with_indexer(indexer, value)
    142 
    143     def _has_valid_type(self, k, axis):

/home/khaeru/.local/lib/python3.5/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
    536                 else:
    537 
--> 538                     if len(labels) != len(value):
    539                         raise ValueError('Must have equal len keys and value '
    540                                          'when setting with an iterable')

/usr/lib/python3/dist-packages/pint/quantity.py in __len__(self)
   1079 
   1080     def __len__(self):
-> 1081         return len(self._magnitude)
   1082 
   1083     def __iter__(self):

TypeError: len() of unsized object

One must instead do things like:

>>> df.loc[:, 0] = pd.Series([ureg('15 m')] * 2)
>>> df
          0    1  2
0  15 meter  1.2  a
1  15 meter  3.4  b

…so basic pandas functionality is not really available.

After a bit of digging, this error ultimately seems to happen because pandas.types.inference.is_list_like():

def is_list_like(arg):
    return (hasattr(arg, '__iter__') and
            not isinstance(arg, string_and_binary_types))

…returns True for pint.Quantity, which has:

    def __iter__(self):
        # Allow exception to propagate in case of non-iterable magnitude
        it_mag = iter(self.magnitude)
        return iter((self.__class__(mag, self._units) for mag in it_mag))

Perhaps Quantity could set, or unset, __iter__, depending on whether it is in fact iterable? This seems to be what pandas expects. But the documentation seems to suggest it should be adequate to have __iter__ = None to indicate that iteration is not available; so perhaps pandas is wrong in expecting hasattr(arg, '__iter__') to mean the same thing. Not sure…
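
The mismatch described above can be reproduced without pandas or pint. The sketch below is a minimal illustration under stated assumptions: `ScalarLike`, `NotIterable`, `naive_is_list_like` and `can_iterate` are hypothetical stand-ins, not the actual pandas or pint code. It shows that a `hasattr(arg, '__iter__')` check reports an object as list-like even when iterating it fails, and even when the class sets `__iter__ = None`, whereas actually calling `iter()` gives the right answer in both cases.

```python
from collections.abc import Iterable


class ScalarLike:
    """Hypothetical stand-in for a Quantity: always defines __iter__ and
    __len__, but both fail when the magnitude is a plain scalar."""

    def __init__(self, magnitude):
        self.magnitude = magnitude

    def __iter__(self):
        # Raises TypeError if the magnitude is not iterable (e.g. a float)
        return iter(self.magnitude)

    def __len__(self):
        # Raises TypeError if the magnitude has no length (e.g. a float)
        return len(self.magnitude)


class NotIterable:
    # Setting a special method to None marks the operation as unavailable
    __iter__ = None


def naive_is_list_like(arg):
    """Attribute-based check, in the spirit of the function quoted above."""
    return hasattr(arg, "__iter__") and not isinstance(arg, (str, bytes))


def can_iterate(arg):
    """Behaviour-based check: try to obtain an iterator."""
    try:
        iter(arg)
    except TypeError:
        return False
    return not isinstance(arg, (str, bytes))


scalar = ScalarLike(15.0)
vector = ScalarLike([1.0, 2.0])

print(naive_is_list_like(scalar), can_iterate(scalar))   # True False
print(naive_is_list_like(vector), can_iterate(vector))   # True True
print(naive_is_list_like(NotIterable()))                  # True
print(isinstance(NotIterable(), Iterable))                # False
```

Either side could close the gap: the wrapper class could avoid exposing `__iter__` for scalar magnitudes, or the caller could test behaviour rather than attribute presence.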

Hope this is helpful somewhat.

@dalito
Contributor

dalito commented Feb 17, 2017

Supporting pandas well requires (IMHO) solving "Allow custom metadata to be attached to panel/df/series?".

@Bernhard10

Pandas has an open enhancement proposal for support of units (via pint and/or other libraries), but it was last active in 2015:
pandas-dev/pandas#10349

@znicholls znicholls mentioned this issue Aug 29, 2018
6 tasks
bors bot added a commit that referenced this issue Sep 6, 2018
684: Add pandas support r=hgrecco a=znicholls

This pull request adds pandas support to pint (hence is related to #645, #401 and pandas-dev/pandas#10349).

An example can be seen in `example-notebooks/basic-example.ipynb`.

It's a little bit hackish, so feedback would be greatly appreciated by me and @andrewgsavage. One obvious example is that we have to run all the interface tests with `pytest` to fit with the `pandas` test suite, which introduces a dependency for the CI and currently gives us this awkward testing setup (see the alterations we had to make to `testsuite`). This also means that our code coverage tests are fiddly too.

If you'd like us to squash the commits, that can be done.

If pint has a linter, it would be good to run that over this pull request too as we're a little bit all over the place re style.

Things to discuss:

- [x]  general feedback and changes
- [x] test setup, especially need for pytest for pandas tests and hackish way to get around automatic discovery
- [x] squashing/rebasing
- [x] linting/other code style (related to #664 and #628: we're happy with whatever, I've found using an automatic linter e.g. black and/or flake8 has made things much simpler in other projects)
- [x] including notebooks in the repo (if we want to, I'm happy to put them under CI so we can make sure they run)
- [x] setting up the docs correctly

Co-authored-by: Zebedee Nicholls <zebedee.nicholls@climate-energy-college.org>
Co-authored-by: andrewgsavage <andrewgsavage@gmail.com>
@hgrecco hgrecco closed this as completed Dec 3, 2019