ENH: Aggregate to data with error range #6898
Comments
Can you show a sample use case and implementation (even if slow)?
So, say someone has 15 experiments (multi-index level 1), each with 20 experimental conditions (multi-index level 2), and they record 100 trials for each condition (columns). The person wants to publish this data, so they need to get the mean and error range for the data. So they need to collapse the trials for each condition, getting both the mean and the standard error. With this approach, the user could use experiment as index and condition as column. Here is a simple example (using a lambda for the implementation):

```python
import pandas as pd
import numpy as np
from scipy import stats
from uncertainties import ufloat

# 15 experiments x 20 conditions, with 100 trials per condition as columns
ind = pd.MultiIndex.from_product([np.arange(15), np.arange(20)])
df = pd.DataFrame(np.random.randn(15 * 20, 100), index=ind, columns=np.arange(100))

# collapse the trials into one ufloat (mean +/- standard error) per row,
# then pivot so experiments are rows and conditions are columns
res = df.apply(lambda x: ufloat(np.mean(x), stats.sem(x)), axis=1).unstack()
```

This becomes much more important for more complicated analyses. Doing manipulations of data with many-level multi-indexes becomes much, much harder if you also have to manage a second error table, column, or index. I can give an example for that as well, but it will be longer.
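(For reference, a small follow-up sketch: the combined result above can be pulled back apart when needed, e.g. for plotting or export. `nominal_value` and `std_dev` are attributes of the `uncertainties` objects; the `centers`/`errors` names below are just illustrative.)

```python
# Split the ufloat result from the example above back into separate
# center and error tables (nominal_value / std_dev are ufloat attributes).
centers = res.applymap(lambda v: v.nominal_value)  # 15 x 20 table of means
errors = res.applymap(lambda v: v.std_dev)         # 15 x 20 table of standard errors
```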
Using …
Yes, that is the problem with the current situation. The idea of this issue is to improve the current situation by creating values with uncertainties in a more integrated, reliable, and useful way.

Your proposal works fine for simple situations at the end of the analysis. But if you want to do manipulations, it becomes much more difficult, and with a many-level multi-index it becomes extremely difficult. Under this proposal, these manipulations would be no more difficult than they are for single values.

If you want to do mathematics, such as adding or multiplying two dataframes, your proposal is also far more difficult. Mathematical operations on means and mathematical operations on standard errors are different. There are mathematical rules for handling this, called error propagation, that are handled automatically by the uncertainties package, but that under your proposal would need to be looked up and coded explicitly. Also, using uncertainties just involves doing operations on the dataframe, while under your proposal you would need to split out the mean and error columns, do different mathematical operations on each, then recombine them. You can do this in pandas, but it is much more difficult than just `df1 * df2` (see the sketch below).

Since working with errors is almost universal in science, I think having strong, built-in support for it in pandas is important.
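To make the error-propagation point concrete, here is a minimal sketch contrasting the two approaches. The toy DataFrames and the `means1`/`sems1` names are hypothetical; the only external dependency assumed is the third-party `uncertainties` package.

```python
import numpy as np
import pandas as pd
from uncertainties import ufloat

# --- with uncertainties: values carry their own error and propagate it ---
df1 = pd.DataFrame([[ufloat(2.0, 0.1), ufloat(4.0, 0.3)]])
df2 = pd.DataFrame([[ufloat(3.0, 0.2), ufloat(5.0, 0.4)]])
prod = df1 * df2          # error propagation happens element-wise
print(prod.iloc[0, 0])    # 6.0+/-0.5

# --- without uncertainties: means and errors live in separate tables ---
means1, sems1 = pd.DataFrame([[2.0, 4.0]]), pd.DataFrame([[0.1, 0.3]])
means2, sems2 = pd.DataFrame([[3.0, 5.0]]), pd.DataFrame([[0.2, 0.4]])
prod_mean = means1 * means2
# for a product of independent quantities, relative errors add in quadrature
prod_sem = prod_mean * np.sqrt((sems1 / means1) ** 2 + (sems2 / means2) ** 2)
print(prod_mean.iloc[0, 0], prod_sem.iloc[0, 0])   # 6.0 0.5
```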
@toddrjen it's a nice idea, though I'm not sure how efficient it would be. If this could be integrated as a pseudo-dtype into numpy (or perhaps by cythonizing some hotspots) that might help. So it would need some performance tests to determine feasibility.
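For what it's worth, a first performance test of that kind could be as simple as the following sketch (the sizes and variable names are arbitrary; `unumpy.uarray` is the `uncertainties` helper for building an array of ufloats):

```python
# Rough timing comparison: object-dtype ufloat arithmetic vs plain float64.
import timeit
import numpy as np
import pandas as pd
from uncertainties import unumpy

n = 10_000
plain = pd.Series(np.random.randn(n))
uncertain = pd.Series(unumpy.uarray(np.random.randn(n), 0.1 * np.ones(n)))

print(timeit.timeit(lambda: plain * plain, number=10))          # fast, vectorized
print(timeit.timeit(lambda: uncertain * uncertain, number=10))  # Python-object loop
```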
Yeah, this would be best implemented by a third-party library using uncertainties as an ExtensionArray (EA) dtype. Closing.
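Purely for illustration, a rough, untested skeleton of what such a third-party extension dtype might look like is sketched below. This is an assumption about the approach, not an existing library or a pandas-endorsed design; only the minimal `ExtensionArray` interface is filled in, and a real implementation would also need arithmetic ops, reductions, NA handling, and performance work.

```python
import numpy as np
from pandas.api.extensions import (
    ExtensionArray,
    ExtensionDtype,
    register_extension_dtype,
    take,
)
from uncertainties.core import AffineScalarFunc  # base class of ufloat scalars


@register_extension_dtype
class UncertainDtype(ExtensionDtype):
    """Hypothetical dtype whose scalars are uncertainties ufloat objects."""

    name = "uncertain"
    type = AffineScalarFunc

    @classmethod
    def construct_array_type(cls):
        return UncertainArray


class UncertainArray(ExtensionArray):
    """Minimal object-backed storage for ufloat values."""

    def __init__(self, values):
        self._data = np.asarray(values, dtype=object)

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        return cls(scalars)

    @classmethod
    def _from_factorized(cls, values, original):
        return cls(values)

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            return self._data[item]
        return type(self)(self._data[item])

    def __len__(self):
        return len(self._data)

    @property
    def dtype(self):
        return UncertainDtype()

    @property
    def nbytes(self):
        return self._data.nbytes

    def isna(self):
        # None or NaN placeholders count as missing; ufloats compare equal to themselves
        return np.array([x is None or x != x for x in self._data], dtype=bool)

    def take(self, indices, allow_fill=False, fill_value=None):
        return type(self)(
            take(self._data, indices, allow_fill=allow_fill, fill_value=fill_value)
        )

    def copy(self):
        return type(self)(self._data.copy())

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([arr._data for arr in to_concat]))
```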
As mentioned in issue #6897, working with data with error ranges is pretty much universal in science, as well as in many other fields. There are Python packages, like `uncertainties`, for working with this sort of data. However, pandas has no built-in tools for creating or working with data with error ranges, leaving users to create their own columns or a separate pandas object to hold error ranges (see, e.g., #5638) or to manually create and use `uncertainties` objects.

I think it would be very helpful if there were an aggregate method that would aggregate data to data with an error range (such as an `uncertainties` array). By default, it could use `mean` to get the center values and `sem` (standard error of the mean) or `std` to get the error ranges, but it would probably be good for users to be able to specify their own functions for calculating the center values and/or the error ranges. A minimal sketch of such a helper is shown below.
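The following sketch shows the kind of helper being proposed, building on the example in the comments. The name `agg_with_error` and its signature are hypothetical, not an existing or planned pandas API, and it assumes the third-party `uncertainties` package.

```python
import numpy as np
import pandas as pd
from scipy import stats
from uncertainties import ufloat


def agg_with_error(df, center=np.mean, spread=stats.sem, axis=1):
    """Collapse one axis of `df` into ufloat(center, spread) values.

    `center` and `spread` are user-replaceable, e.g. spread=np.std.
    """
    return df.apply(lambda x: ufloat(center(x), spread(x)), axis=axis)


# 15 experiments x 20 conditions, 100 trials each (as in the comment above)
ind = pd.MultiIndex.from_product([np.arange(15), np.arange(20)])
df = pd.DataFrame(np.random.randn(15 * 20, 100), index=ind)

# aggregate the trials, then pivot conditions out to columns
res = agg_with_error(df).unstack()
```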