Consider support for pandas #401
Comments
This is indeed a very nice feature to have. I am not sure what would be the best way, so I would be very happy to have you lead this development. But as a side note, there are always two options:
|
Unfortunately, right now I lack the technical knowledge to actively program this feature (I'm not a professional programmer and I'm fairly new to developing packages, class-oriented programming, etc.), but I'll do my best to come up with a prototype. In my opinion, the possibility of having different columns with different units is extremely useful, so whatever we do, we should make this feature available.
That said, I think that finding a way of putting Quantity objects inside pandas without modifying or monkey-patching pandas is going to be very tricky (again, I'm not an expert, so I might be wrong!). But simply using a Quantity object to hold an entire DataFrame would mean that every column has the same unit, so I wouldn't go that way either.
There is a possible third option, which is to create a new class for pint that holds a DataFrame and a dict of units for it. I called it QuantitySet in the short example in my question. The idea is for it to be just like Quantity, but instead of holding a NumPy array and a unit, it holds a pandas DataFrame and a dict of units. (Just as pandas has DataFrame and Series, pint would have QuantitySet and Quantity.) Of course, that would make Pint less simple, and simplicity is one of the beauties of your package, so I'm not sure you'd be willing to do this.
Here's a little snippet that I made to illustrate what I mean. I'm not saying it should be programmed like this; I made it just as a quick viability check:
For me, this returns
Any thoughts? |
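To make the QuantitySet idea above concrete, here is a minimal sketch. It is not the snippet from the comment above and is not part of pint; the class layout, the `convert` method, and its `factor` parameter are assumptions made purely for illustration. Units are stored as plain strings to keep the example free of a pint dependency; a real version would presumably store pint units and ask the registry for conversion factors.

```python
import pandas as pd


class QuantitySet:
    """Hypothetical container: a DataFrame of magnitudes plus one unit per column."""

    def __init__(self, frame, units):
        # Every column must have a unit, otherwise the pairing is ambiguous.
        missing = set(frame.columns) - set(units)
        if missing:
            raise ValueError(f"no unit given for columns: {sorted(missing)}")
        self.frame = frame
        self.units = dict(units)

    def __getitem__(self, column):
        # Return the plain magnitudes together with that column's unit.
        return self.frame[column], self.units[column]

    def convert(self, column, new_unit, factor):
        # Illustration only: the caller supplies the conversion factor.
        # A real implementation would obtain it from a unit registry.
        converted = self.frame.copy()
        converted[column] = converted[column] * factor
        return QuantitySet(converted, {**self.units, column: new_unit})


qs = QuantitySet(
    pd.DataFrame({"length": [5.0, 10.0], "time": [1.2, 3.4]}),
    {"length": "meter", "time": "second"},
)
qs_cm = qs.convert("length", "centimeter", 100.0)
```

Because the magnitudes stay in an ordinary float DataFrame, all of pandas keeps working on `qs.frame`; the cost is that the units travel beside the data rather than inside it, so every operation that changes units has to go through the wrapper.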
Hi—I ran into this issue tonight.
FWIW, this would be workable, but something about
>>> import pandas as pd
>>> import pint
>>> ureg = pint.UnitRegistry()
>>> df = pd.DataFrame([[ureg('5 m'), 1.2, 'a'], [ureg('10 m'), 3.4, 'b']])
>>> df
0 1 2
0 5 meter 1.2 a
1 10 meter 3.4 b
>>> df.dtypes
0 object
1 float64
2 object
dtype: object
Assignment does not work as expected:
>>> df.loc[0, 0] = ureg('15 m')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-36-3d760c092671> in <module>()
----> 1 df.loc[0, 0] = ureg('15 m')
/home/khaeru/.local/lib/python3.5/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
139 key = com._apply_if_callable(key, self.obj)
140 indexer = self._get_setitem_indexer(key)
--> 141 self._setitem_with_indexer(indexer, value)
142
143 def _has_valid_type(self, k, axis):
/home/khaeru/.local/lib/python3.5/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
536 else:
537
--> 538 if len(labels) != len(value):
539 raise ValueError('Must have equal len keys and value '
540 'when setting with an iterable')
/usr/lib/python3/dist-packages/pint/quantity.py in __len__(self)
1079
1080 def __len__(self):
-> 1081 return len(self._magnitude)
1082
1083 def __iter__(self):
TypeError: len() of unsized object
or
>>> df.loc[:, 0] = ureg('15 m')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-37-cdabd0d4d93c> in <module>()
----> 1 df.loc[:, 0] = ureg('15 m')
/home/khaeru/.local/lib/python3.5/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
139 key = com._apply_if_callable(key, self.obj)
140 indexer = self._get_setitem_indexer(key)
--> 141 self._setitem_with_indexer(indexer, value)
142
143 def _has_valid_type(self, k, axis):
/home/khaeru/.local/lib/python3.5/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
536 else:
537
--> 538 if len(labels) != len(value):
539 raise ValueError('Must have equal len keys and value '
540 'when setting with an iterable')
/usr/lib/python3/dist-packages/pint/quantity.py in __len__(self)
1079
1080 def __len__(self):
-> 1081 return len(self._magnitude)
1082
1083 def __iter__(self):
TypeError: len() of unsized object
One must instead do things like:
>>> df.loc[:, 0] = pd.Series([ureg('15 m')] * 2)
>>> df
0 1 2
0 15 meter 1.2 a
1 15 meter 3.4 b
…so basic pandas functionality is not really available. After a bit of digging, this error ultimately seems to happen because
def is_list_like(arg):
    return (hasattr(arg, '__iter__') and
            not isinstance(arg, string_and_binary_types))
…returns
def __iter__(self):
    # Allow exception to propagate in case of non-iterable magnitude
    it_mag = iter(self.magnitude)
    return iter((self.__class__(mag, self._units) for mag in it_mag))
Perhaps
Hope this is helpful somewhat. |
Supporting pandas well requires (IMHO) solving the pandas issue "Allow custom metadata to be attached to panel/df/series?". |
Pandas has an open enhancement proposal for support of units (via pint and /or other libraries), but it was last active in 2015: |
684: Add pandas support r=hgrecco a=znicholls
This pull request adds pandas support to pint (hence is related to #645, #401 and pandas-dev/pandas#10349). An example can be seen in `example-notebooks/basic-example.ipynb`.
It's a little bit hackish; feedback would be greatly appreciated by me and @andrewgsavage. One obvious example is that we have to run all the interface tests with `pytest` to fit with the `pandas` test suite, which introduces a dependency for the CI and currently gives us this awkward testing setup (see the alterations we had to make to `testsuite`). This also means that our code coverage tests are fiddly too.
If you'd like us to squash the commits, that can be done. If pint has a linter, it would be good to run that over this pull request too, as we're a little bit all over the place regarding style.
Things to discuss:
- [x] general feedback and changes
- [x] test setup, especially the need for pytest for pandas tests and the hackish way to get around automatic discovery
- [x] squashing/rebasing
- [x] linting/other code style (related to #664 and #628: we're happy with whatever; I've found that using an automatic linter, e.g. black and/or flake8, has made things much simpler in other projects)
- [x] including notebooks in the repo (if we want to, I'm happy to put them under CI so we can make sure they run)
- [x] setting up the docs correctly
Co-authored-by: Zebedee Nicholls <zebedee.nicholls@climate-energy-college.org>
Co-authored-by: andrewgsavage <andrewgsavage@gmail.com>
I'd like to suggest, as an enhancement, that support for pandas be considered. I see the need for this in several places over the internet (this one, this one, this one and this one, for example), and I have also discussed this need in personal communications.
Some points:
Some examples of how it could work would be:
Currently, probably because of the numpy support, df.sum().sum() works, but df.mean().sum() doesn't. Furthermore, in this way, each element has its own unit:
Maybe a better way would be something like