
ENH: Standard Error of the Mean (sem) aggregation method #6897

Closed
toddrjen opened this issue Apr 17, 2014 · 15 comments · Fixed by #7133
Labels: API Design, Numeric Operations (Arithmetic, Comparison, and Logical operations)

Comments

@toddrjen (Contributor)

A very common operation when working with data is computing the error range of the data; in scientific research, reporting error ranges is required.

There are two main ways to do this: the standard deviation and the standard error of the mean. Pandas has an optimized std aggregation method for both DataFrame and groupby. However, it has no equivalent standard-error method, so users who want to compute error ranges have to fall back on the unoptimized scipy method.

Since computing error ranges is such a common operation, I think it would be very useful to have an optimized sem method like there is for std.
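For context, the quantity being requested is just the sample standard deviation divided by the square root of the sample size. A minimal sketch with made-up data (the series values here are purely illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative data, made up for this sketch
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

std = s.std()                # sample standard deviation (ddof=1)
sem = std / np.sqrt(len(s))  # standard error of the mean

print(f"std = {std:.4f}, sem = {sem:.4f}")  # std = 1.5811, sem = 0.7071
```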

@jtratner (Contributor)

Does statsmodels do this?

@toddrjen (Contributor, Author)

Not as far as I can find. And I don't think it really belongs in statsmodels: in my opinion this is a pretty basic data-wrangling task, like computing a mean or standard deviation, not the more advanced statistical modeling provided by statsmodels.

@jreback (Contributor)

jreback commented Apr 22, 2014

can u point to the scipy method?

@jorisvandenbossche (Member)

http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.stats.sem.html

@toddrjen What do you mean by an 'optimized' method? std is already optimized, so you don't have to rely on an 'unoptimized' scipy.stats method; you can just do: df.std() / np.sqrt(len(df))

And by the way, scipy.stats.sem is not that 'unoptimized'. In fact, it is even faster, since it does not do e.g. the extra NaN-checking that pandas does:

In [2]: s = pd.Series(np.random.randn(1000))

In [7]: from scipy import stats

In [8]: stats.sem(s.values)
Out[8]: 0.031635197968083853

In [9]: s.std() / np.sqrt(len(s))
Out[9]: 0.031635197968083832

In [11]: %timeit stats.sem(s.values)
10000 loops, best of 3: 46.2 µs per loop

In [12]: %timeit s.std() / np.sqrt(len(s))
10000 loops, best of 3: 85.7 µs per loop


But of course, the question still remains: do we provide a shortcut to this functionality in the form of a sem method, or do we just expect our users to divide the std themselves?

@jreback (Contributor)

jreback commented May 5, 2014

would be code-bloat IMHO, closing

thanks for the suggestion.

if you disagree, pls comment.

@jreback jreback closed this as completed May 5, 2014
@cpcloud (Member)

cpcloud commented May 5, 2014

@jreback i don't think this is code bloat relative to the alternative:

You can't really use scipy.stats.sem because it doesn't handle nans:

In [19]: from scipy.stats import sem

In [20]: df = DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'c'])

In [21]: df
Out[21]:
        a       b       c
0  1.1658  0.2184 -2.0823
1  0.5625 -0.5034  0.7028
2 -0.8424  0.1333 -1.1065
3  0.9335 -0.6088  1.4308
4 -0.1027 -0.1888 -0.5816
5 -0.5202  0.3210 -0.9942
6 -0.8666  0.8711 -0.5691
7 -0.7701 -2.1855 -0.4302
8  1.0664 -1.2672  0.7117
9 -0.7530 -0.8466  0.0194

[10 rows x 3 columns]

In [22]: sem(df[df > 0])
Out[22]: array([ nan,  nan,  nan])

Okay, so let's try it with scipy.stats.mstats.sem:

In [26]: from scipy.stats.mstats import sem as sem

In [27]: sem(df[df > 0])
Out[27]:
masked_array(data = [-- -- --],
             mask = [ True  True  True],
       fill_value = 1e+20)

That's hardly what I would expect here, and masked arrays are almost as fun as recarrays. I'm +1 on reopening this.

Here's what it would take to get the desired result from scipy:

In [32]: Series(sem(np.ma.masked_invalid(df[df > 0])),index=df.columns)
Out[32]:
a    0.1321
b    0.1662
c    0.2881
dtype: float64

In [33]: df[df > 0].std() / sqrt(df[df > 0].count())
Out[33]:
a    0.1321
b    0.1662
c    0.2881
dtype: float64

@jreback (Contributor)

jreback commented May 5, 2014

no, but isn't this just `s.std() / np.sqrt(len(s))`? and even that's 'arbitrary' in my book

not an issue with the code bloat per se, but the definition

@cpcloud (Member)

cpcloud commented May 5, 2014

agreed, that's really simple. i was just making a point about the NaN handling: you can't just divide by len because that counts NaNs. not a huge deal
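To make the len-vs-count distinction concrete, a small sketch with made-up data (the values and the column are purely illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative column containing a missing value
a = pd.Series([1.0, 2.0, np.nan, 4.0])

n_total = len(a)     # 4 -- counts the NaN row
n_valid = a.count()  # 3 -- NaN-aware count

# a.std() skips NaN by default, so the denominator must too
sem_wrong = a.std() / np.sqrt(n_total)  # divides by too many observations
sem_right = a.std() / np.sqrt(n_valid)  # NaN-aware denominator
```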

@jreback (Contributor)

jreback commented May 5, 2014

not averse to this, but it just seems so simple that a user could do it themselves (and I might want a different definition); that said, if this is pretty 'standard' then it would be ok

@cpcloud (Member)

cpcloud commented May 5, 2014

every science institution i've ever worked in (just 3 really so not a whole lot of weight there) has used sem at some point (even if just to get a rough idea of error ranges). i see your point about different definitions, maybe other folks want to chime in

@jreback (Contributor)

jreback commented May 5, 2014

ok...will reopen for consideration in 0.15 then

@jreback jreback reopened this May 5, 2014
@jreback jreback added this to the 0.15.0 milestone May 5, 2014
@toddrjen (Contributor, Author)

toddrjen commented May 5, 2014

I have also been at three different institutions, and they also all used SEM. And I have seen it on hundreds of papers, presentations, and posters.

@jreback (Contributor)

jreback commented May 5, 2014

@toddrjen

ok... that's fine then, pls submit a PR! (needs to go in core/nanops.py, with some updating in core/ops.py)

@toddrjen (Contributor, Author)

Pull request submitted: #7133

@jreback jreback modified the milestones: 0.14.1, 0.15.0 May 15, 2014
@jennykathambi90
Pandas now has a df.sem() method (and series.sem()).
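For reference, the built-in method matches the manual computation discussed above; a quick check with made-up data (the frame here is purely illustrative):

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 8.0]})

built_in = df.sem()                      # per-column standard error of the mean
manual = df.std() / np.sqrt(df.count())  # equivalent NaN-aware manual computation
```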
