
ENH: Aggregate to data with error range #6898

Closed
toddrjen opened this issue Apr 17, 2014 · 6 comments
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays.

Comments

@toddrjen
Contributor

As mentioned in issue #6897, working with data that has error ranges is nearly universal in science, as well as in many other fields. There are Python packages, such as uncertainties, for working with this sort of data. However, pandas has no built-in tools for creating or working with data with error ranges, leaving users to create their own columns or a separate pandas object to hold the error ranges (see, e.g., #5638), or to create and use uncertainties objects manually.

I think it would be very helpful to have an aggregate method that aggregates data into data with an error range (such as an uncertainties array). By default it could use the mean for the center values and the sem (standard error of the mean) or std for the error ranges, but users should be able to supply their own functions for computing the center values and/or the error ranges.
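A minimal sketch of what such an API could look like. The name agg_with_error and its signature are hypothetical, not an existing pandas method; it simply pairs a "value" aggregation with an "error" aggregation over one index level:

```python
import pandas as pd

def agg_with_error(df, level, center="mean", error="sem"):
    # Hypothetical helper: collapse the given index level, producing a
    # "value" block (default: mean) and an "error" block (default:
    # standard error of the mean) for each remaining group.
    grouped = df.groupby(level=level)
    return pd.concat(
        {"value": grouped.agg(center), "error": grouped.agg(error)},
        axis=1,
    )

# Two experiments (index level 0), three trials each (index level 1)
ind = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]])
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=ind)
print(agg_with_error(df, level=0))
```

Passing callables instead of the string names "mean"/"sem" would let users plug in their own center/error functions, as suggested above.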

@jreback
Contributor

jreback commented Apr 17, 2014

Can you show a sample use case and implementation (even if slow)?

@toddrjen
Contributor Author

So, say someone has 15 experiments (index level 1), each with 20 experimental conditions (index level 2), and records 100 trials per condition (columns). The person wants to publish this data, so they need the mean and error range: they have to collapse along the trials, computing both the mean and the standard error. With this approach, the result would have experiment as the index and condition as the columns.

Here is a simple example (using a lambda for the implementation):

import pandas as pd
import numpy as np
from scipy import stats
from uncertainties import ufloat

# 15 experiments x 20 conditions in the index, 100 trials in the columns
ind = pd.MultiIndex.from_product([np.arange(15), np.arange(20)])
df = pd.DataFrame(np.random.randn(15 * 20, 100), index=ind, columns=np.arange(100))

# Collapse the 100 trials of each row into a single ufloat
# (mean +/- standard error), then pivot condition out into the columns
res = df.apply(lambda x: ufloat(np.mean(x), stats.sem(x)), axis=1).unstack()

This becomes much more important for more complicated analyses. Doing manipulations of data with many-level multi-indexes becomes much, much harder if you also have to manage a second error table, column, or index. I can give an example for that as well, but it will be longer.

@jreback
Contributor

jreback commented Apr 22, 2014

Using uncertainties makes all of your data object dtype, negating most of pandas' efficiencies. Instead, something like this would work (the multi-level slicing requires master/0.14, coming soon):

In [20]: ind = pd.MultiIndex.from_product([np.arange(5), np.arange(2)])

In [21]: cols = pd.MultiIndex.from_product([np.arange(5), ['value','error']])

In [22]: df = pd.DataFrame(np.random.randn(5*2,10), index=ind, columns=cols).sort_index().sort_index(axis=1)

In [25]: df
Out[25]: 
            0                   1                   2                   3                   4          
        error     value     error     value     error     value     error     value     error     value
0 0  1.684247 -0.768990  1.745643 -0.460112  0.547230  1.204622 -0.645565  0.767882  1.038075 -0.004924
  1 -1.038735  1.268667  0.288511 -0.056458  0.052893 -0.181397 -0.416198 -0.117648  1.092671 -0.085161
1 0 -1.027876 -0.504794  1.145330  0.149904 -1.735783 -1.292422  0.111824  1.213310 -0.165664 -1.644664
  1  0.356636  1.076804 -2.442231 -0.694032 -0.531767 -0.177785  0.911135 -0.477786  0.677379  1.758926
2 0  1.720729  0.170775  0.348073 -1.441842  1.377164 -1.434962 -1.332751 -0.681837 -0.169488 -0.847964
  1 -1.260312 -0.000384  0.333589  0.338253 -0.871582 -0.813060 -0.056995 -0.653637 -0.937449  1.143176
3 0 -1.457335 -1.102507  0.691152 -2.469394  0.615936  1.310255  1.306816 -0.035045  0.435257  1.455832
  1  1.855440  0.923589 -1.061110  0.995526  0.126394 -0.579312 -1.445212 -1.391565  1.575050  0.071588
4 0 -0.155716  0.917270 -0.257610 -1.180983  1.356626 -0.077675  0.973249 -0.418510 -0.607244 -0.927557
  1 -1.305623  0.737657 -0.891516  0.893158  1.387652 -1.825456  1.406268 -0.827154  0.147286 -1.361848

[10 rows x 10 columns]

In [23]: df.loc[:,(slice(None),'error')]
Out[23]: 
            0         1         2         3         4
        error     error     error     error     error
0 0  1.684247  1.745643  0.547230 -0.645565  1.038075
  1 -1.038735  0.288511  0.052893 -0.416198  1.092671
1 0 -1.027876  1.145330 -1.735783  0.111824 -0.165664
  1  0.356636 -2.442231 -0.531767  0.911135  0.677379
2 0  1.720729  0.348073  1.377164 -1.332751 -0.169488
  1 -1.260312  0.333589 -0.871582 -0.056995 -0.937449
3 0 -1.457335  0.691152  0.615936  1.306816  0.435257
  1  1.855440 -1.061110  0.126394 -1.445212  1.575050
4 0 -0.155716 -0.257610  1.356626  0.973249 -0.607244
  1 -1.305623 -0.891516  1.387652  1.406268  0.147286

[10 rows x 5 columns]

In [24]: df.loc[:,0]
Out[24]: 
        error     value
0 0  1.684247 -0.768990
  1 -1.038735  1.268667
1 0 -1.027876 -0.504794
  1  0.356636  1.076804
2 0  1.720729  0.170775
  1 -1.260312 -0.000384
3 0 -1.457335 -1.102507
  1  1.855440  0.923589
4 0 -0.155716  0.917270
  1 -1.305623  0.737657

[10 rows x 2 columns]

@toddrjen
Contributor Author

Yes, that is exactly the problem with the current situation. The point of this issue is to improve it by supporting values with uncertainties in a more integrated, reliable, and useful way.

Your proposal works fine for simple situations at the end of an analysis. But if you want to do manipulations, it becomes much more difficult, and with a many-level MultiIndex it becomes extremely difficult. Under this proposal, such manipulations would be no harder than they are for plain values.

If you want to do mathematics, such as adding or multiplying two DataFrames, your proposal is also far more difficult. Mathematical operations on means and on standard errors follow different rules; the rules for combining them, called error propagation, are handled automatically by the uncertainties package, but under your proposal they would need to be looked up and coded explicitly. With uncertainties, the math is just an operation on the DataFrame; under your proposal you would need to split out the mean and error columns, apply different operations to each, and recombine them. That is possible in pandas, but it is much more difficult than just df1 * df2.
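To make the error-propagation point concrete, here is the standard rule for a product of two independent quantities, coded by hand (a sketch of what the uncertainties package automates; the function name mul_with_error is hypothetical):

```python
import math

def mul_with_error(v1, e1, v2, e2):
    # Propagate independent 1-sigma errors through a product:
    # relative errors add in quadrature.
    value = v1 * v2
    error = abs(value) * math.sqrt((e1 / v1) ** 2 + (e2 / v2) ** 2)
    return value, error

# (3.0 +/- 0.3) * (4.0 +/- 0.4): each factor has 10% relative error,
# so the product has sqrt(0.1**2 + 0.1**2) ~= 14.1% relative error.
print(mul_with_error(3.0, 0.3, 4.0, 0.4))
```

Sums use a different rule (absolute errors add in quadrature), which is exactly why splitting value and error into separate columns forces the user to look up and apply the right formula for every operation.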

Since working with errors is almost universal in science, I think having strong, built-in support for it in pandas is important.

@jreback jreback added this to the Someday milestone Apr 23, 2014
@jreback
Contributor

jreback commented Apr 23, 2014

@toddrjen it's a nice idea.

I'm not sure how efficiently the uncertainties package handles these kinds of operations. The values are going to be represented as object dtype by pandas/numpy, so I'm not sure how efficient this would be. You might want to ask the author / investigate this.

If uncertainties could be integrated as a pseudo-dtype into numpy (or perhaps some hotspots could be cythonized), that might help.

So this would need some performance tests to determine feasibility.
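The object-dtype concern can be illustrated directly. A minimal sketch, assuming NumPy is installed (the slowdown figures in the comments are typical, not measured here):

```python
import numpy as np

# 100,000 float64 values stored natively vs. boxed as Python objects
floats = np.arange(100_000, dtype=np.float64)
objs = floats.astype(object)

print(floats.dtype)  # float64: contiguous machine doubles, vectorized ops
print(objs.dtype)    # object: an array of pointers to Python float objects

# Arithmetic on the object array falls back to per-element Python-level
# calls, typically an order of magnitude or more slower than the float64
# path, and the boxed values use several times the memory.
```

An array of ufloat instances would be stored the same way as the object array above, which is why the performance question matters before building this into pandas.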

@mroeschke mroeschke added ExtensionArray Extending pandas with custom dtypes or arrays. and removed Dtype Conversions Unexpected or buggy dtype conversions labels Apr 11, 2021
@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022
@mroeschke
Member

Yeah, this would be best implemented by a third-party library exposing uncertainties as an ExtensionArray dtype. Closing.
