
ENH: Aggregate to data with error range #6898

Closed
toddrjen opened this issue Apr 17, 2014 · 6 comments
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays.

Comments

@toddrjen
Contributor

As mentioned in issue #6897, working with data that has error ranges is nearly universal in science, as well as in many other fields. There are Python packages, such as uncertainties, for working with this sort of data. However, pandas has no built-in tools for creating or working with data with error ranges, leaving users to create their own columns or a separate pandas object to hold the error ranges (see, e.g., #5638), or to create and use uncertainties objects manually.

I think it would be very helpful to have an aggregate method that aggregates data into data with an error range (such as an uncertainties array). By default it could use the mean for the center values and the sem (standard error of the mean) or std for the error ranges, but users should be able to supply their own functions for computing the center values and/or the error ranges.
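A minimal sketch of what such an API could look like. The name agg_with_error and its signature are hypothetical, not an existing pandas method; it simply pairs a "value" aggregation with an "error" aggregation over one index level:

```python
import pandas as pd

def agg_with_error(df, level, center="mean", error="sem"):
    # Hypothetical helper: collapse the given index level, producing a
    # "value" block (default: mean) and an "error" block (default:
    # standard error of the mean) for each remaining group.
    grouped = df.groupby(level=level)
    return pd.concat(
        {"value": grouped.agg(center), "error": grouped.agg(error)},
        axis=1,
    )

# Two experiments (index level 0), three trials each (index level 1)
ind = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]])
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=ind)
print(agg_with_error(df, level=0))
```

Passing callables instead of the string names "mean"/"sem" would let users plug in their own center/error functions, as suggested above.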

@jreback
Contributor

jreback commented Apr 17, 2014

Can you show a sample use case and implementation (even if slow)?

@toddrjen
Contributor Author

So, say someone has 15 experiments (index level 1), each with 20 experimental conditions (index level 2), and records 100 trials per condition (columns). The person wants to publish this data, so they need the mean and error range: they have to collapse along the trials, computing both the mean and the standard error. With this approach, the result would have experiment as the index and condition as the columns.

Here is a simple example (using a lambda for the implementation):

import pandas as pd
import numpy as np
from scipy import stats
from uncertainties import ufloat

# 15 experiments x 20 conditions in the index, 100 trials in the columns
ind = pd.MultiIndex.from_product([np.arange(15), np.arange(20)])
df = pd.DataFrame(np.random.randn(15 * 20, 100), index=ind, columns=np.arange(100))

# Collapse the 100 trials of each row into a single ufloat
# (mean +/- standard error), then pivot condition out into the columns
res = df.apply(lambda x: ufloat(np.mean(x), stats.sem(x)), axis=1).unstack()

This becomes much more important for more complicated analyses. Doing manipulations of data with many-level multi-indexes becomes much, much harder if you also have to manage a second error table, column, or index. I can give an example for that as well, but it will be longer.

@jreback
Contributor

jreback commented Apr 22, 2014

Using uncertainties makes all of your data object dtype, negating most of pandas' efficiencies. Instead, something like this would work (the multi-level slicing requires master/0.14, coming soon):

In [20]: ind = pd.MultiIndex.from_product([np.arange(5), np.arange(2)])

In [21]: cols = pd.MultiIndex.from_product([np.arange(5), ['value','error']])

In [22]: df = pd.DataFrame(np.random.randn(5*2,10), index=ind, columns=cols).sort_index().sort_index(axis=1)

In [25]: df
Out[25]: 
            0                   1                   2                   3                   4          
        error     value     error     value     error     value     error     value     error     value
0 0  1.684247 -0.768990  1.745643 -0.460112  0.547230  1.204622 -0.645565  0.767882  1.038075 -0.004924
  1 -1.038735  1.268667  0.288511 -0.056458  0.052893 -0.181397 -0.416198 -0.117648  1.092671 -0.085161
1 0 -1.027876 -0.504794  1.145330  0.149904 -1.735783 -1.292422  0.111824  1.213310 -0.165664 -1.644664
  1  0.356636  1.076804 -2.442231 -0.694032 -0.531767 -0.177785  0.911135 -0.477786  0.677379  1.758926
2 0  1.720729  0.170775  0.348073 -1.441842  1.377164 -1.434962 -1.332751 -0.681837 -0.169488 -0.847964
  1 -1.260312 -0.000384  0.333589  0.338253 -0.871582 -0.813060 -0.056995 -0.653637 -0.937449  1.143176
3 0 -1.457335 -1.102507  0.691152 -2.469394  0.615936  1.310255  1.306816 -0.035045  0.435257  1.455832
  1  1.855440  0.923589 -1.061110  0.995526  0.126394 -0.579312 -1.445212 -1.391565  1.575050  0.071588
4 0 -0.155716  0.917270 -0.257610 -1.180983  1.356626 -0.077675  0.973249 -0.418510 -0.607244 -0.927557
  1 -1.305623  0.737657 -0.891516  0.893158  1.387652 -1.825456  1.406268 -0.827154  0.147286 -1.361848

[10 rows x 10 columns]

In [23]: df.loc[:,(slice(None),'error')]
Out[23]: 
            0         1         2         3         4
        error     error     error     error     error
0 0  1.684247  1.745643  0.547230 -0.645565  1.038075
  1 -1.038735  0.288511  0.052893 -0.416198  1.092671
1 0 -1.027876  1.145330 -1.735783  0.111824 -0.165664
  1  0.356636 -2.442231 -0.531767  0.911135  0.677379
2 0  1.720729  0.348073  1.377164 -1.332751 -0.169488
  1 -1.260312  0.333589 -0.871582 -0.056995 -0.937449
3 0 -1.457335  0.691152  0.615936  1.306816  0.435257
  1  1.855440 -1.061110  0.126394 -1.445212  1.575050
4 0 -0.155716 -0.257610  1.356626  0.973249 -0.607244
  1 -1.305623 -0.891516  1.387652  1.406268  0.147286

[10 rows x 5 columns]

In [24]: df.loc[:,0]
Out[24]: 
        error     value
0 0  1.684247 -0.768990
  1 -1.038735  1.268667
1 0 -1.027876 -0.504794
  1  0.356636  1.076804
2 0  1.720729  0.170775
  1 -1.260312 -0.000384
3 0 -1.457335 -1.102507
  1  1.855440  0.923589
4 0 -0.155716  0.917270
  1 -1.305623  0.737657

[10 rows x 2 columns]

@toddrjen
Contributor Author

Yes, that is exactly the problem with the current situation. The point of this issue is to improve it by supporting values with uncertainties in a more integrated, reliable, and useful way.

Your proposal works fine for simple situations at the end of an analysis. But if you want to do manipulations, it becomes much more difficult, and with a many-level MultiIndex it becomes extremely difficult. Under this proposal, such manipulations would be no harder than they are for plain values.

If you want to do mathematics, such as adding or multiplying two DataFrames, your proposal is also far more difficult. Mathematical operations on means and on standard errors follow different rules; the rules for combining them, called error propagation, are handled automatically by the uncertainties package, but under your proposal they would need to be looked up and coded explicitly. With uncertainties, the math is just an operation on the DataFrame; under your proposal you would need to split out the mean and error columns, apply different operations to each, and recombine them. That is possible in pandas, but it is much more difficult than just df1 * df2.
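To make the error-propagation point concrete, here is the standard rule for a product of two independent quantities, coded by hand (a sketch of what the uncertainties package automates; the function name mul_with_error is hypothetical):

```python
import math

def mul_with_error(v1, e1, v2, e2):
    # Propagate independent 1-sigma errors through a product:
    # relative errors add in quadrature.
    value = v1 * v2
    error = abs(value) * math.sqrt((e1 / v1) ** 2 + (e2 / v2) ** 2)
    return value, error

# (3.0 +/- 0.3) * (4.0 +/- 0.4): each factor has 10% relative error,
# so the product has sqrt(0.1**2 + 0.1**2) ~= 14.1% relative error.
print(mul_with_error(3.0, 0.3, 4.0, 0.4))
```

Sums use a different rule (absolute errors add in quadrature), which is exactly why splitting value and error into separate columns forces the user to look up and apply the right formula for every operation.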

Since working with errors is almost universal in science, I think having strong, built-in support for it in pandas is important.

@jreback jreback added this to the Someday milestone Apr 23, 2014
@jreback
Contributor

jreback commented Apr 23, 2014

@toddrjen it's a nice idea.

I'm not sure how efficiently the uncertainties package handles these kinds of operations. The values are going to be represented as object dtype by pandas/numpy, so I'm not sure how efficient this would be. You might want to ask the author / investigate this.

If uncertainties could be integrated as a pseudo-dtype into numpy (or perhaps some hotspots could be cythonized), that might help.

So this would need some performance tests to determine feasibility.
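The object-dtype concern can be illustrated directly. A minimal sketch, assuming NumPy is installed (the slowdown figures in the comments are typical, not measured here):

```python
import numpy as np

# 100,000 float64 values stored natively vs. boxed as Python objects
floats = np.arange(100_000, dtype=np.float64)
objs = floats.astype(object)

print(floats.dtype)  # float64: contiguous machine doubles, vectorized ops
print(objs.dtype)    # object: an array of pointers to Python float objects

# Arithmetic on the object array falls back to per-element Python-level
# calls, typically an order of magnitude or more slower than the float64
# path, and the boxed values use several times the memory.
```

An array of ufloat instances would be stored the same way as the object array above, which is why the performance question matters before building this into pandas.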

@mroeschke mroeschke added ExtensionArray Extending pandas with custom dtypes or arrays. and removed Dtype Conversions Unexpected or buggy dtype conversions labels Apr 11, 2021
@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022
@mroeschke
Member

Yeah, this would be best implemented by a third-party library exposing uncertainties as an ExtensionArray dtype. Closing.
