ENH: Implement "standard error of the mean" #7133

toddrjen · 2014-05-15T13:54:42Z

As discussed in issue #6897, here is a pull request implementing "standard error of the mean" (sem) calculations in pandas. This is an extremely common but very simple statistic. It is used pretty universally in scientific data analysis.

sem is implemented in nanops, generic, and groupby. Unit tests are included and the docs have been updated.
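For readers unfamiliar with the statistic, here is a minimal sketch of what sem computes: the sample standard deviation (normalized by N-1) divided by the square root of the number of observations. It uses only the Python standard library, and the name `sem_sketch` is illustrative; it is not the code added by this PR.

```python
import math
import statistics


def sem_sketch(values):
    """Standard error of the mean: sample std (N-1) / sqrt(N)."""
    n = len(values)
    if n < 2:
        # The sample standard deviation is undefined for fewer than 2 points.
        return float("nan")
    return statistics.stdev(values) / math.sqrt(n)


print(sem_sketch([1.0, 2.0, 3.0, 4.0]))
```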

I have also included an improved .gitignore, based on matplotlib's but including everything from the old .gitignore (hopefully better organized now, though).

jreback · 2014-05-15T15:58:20Z

@toddrjen will take a look at this for 0.14.1

I may cherry-pick your .gitignore now though...thanks

jreback · 2014-05-15T16:06:08Z

picked this: 592a537

thanks!

cpcloud · 2014-05-15T22:17:15Z

pandas/core/groupby.py

+        For multiple groupings, the result index will be a MultiIndex
+        """
+        if ddof == 1:
+            return (self._cython_agg_general('std') /


any reason not to just call self.std() here?

do you have a test for when self._cython_agg_general fails (in this case)? if not can you add one?

toddrjen · 2014-05-16T13:29:34Z

@cpcloud I have implemented your suggestions, I think:

in groupby, there is now a separate _count_sqrt method, while the sem method just calls the std and _count_sqrt methods. This leaves the decision on whether to optimize up to the std and _count_sqrt methods (which have different rules).
I have also implemented a general test in test_groupby that tests _cython_agg_general for all optimized operations and compares them to an expected result. It just uses a simple numeric test dataframe; it doesn't test non-numeric data or data with nan values, but it should be sufficient to make sure basic operations are working.

toddrjen · 2014-05-16T13:31:36Z

pandas/core/groupby.py

@@ -144,8 +144,15 @@ def _last(x):
        return _last(x)


-def _count_compat(x, axis=0):
-    return x.size
+def _count_compat(x, axis=None):


I made a slight change here since the "axis" argument was not doing anything, and was behaving inconsistently to how it normally behaves (axis=0 is along axis 0, axis=None is flat array). I can revert this if it is an issue.

it's not doing anything but is for compat (or maybe @cpcloud removed that issue)

do the tests pass if you remove that parameter? i think there's a multiindex groupby test that was failing when i tried to take it out

toddrjen · 2014-05-20T12:43:07Z

I have implemented all the requested changes:

nanvar and nansem now both use a common private function to get the element count. I have also implemented unit tests for all the nan* functions in nanops, including nanvar and nansem, in a new test_nanops.py unit test file.
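A rough sketch of the shared-count idea, using plain Python lists instead of ndarrays. The names `_count_valid`, `nanvar_sketch` and `nansem_sketch` are hypothetical and the real nanops helpers may differ in signature and behavior; the point is only that both statistics take the NaN-excluding count from one place.

```python
import math


def _count_valid(values, ddof):
    """Return (n, n - ddof), counting only non-NaN entries."""
    n = sum(1 for v in values if not math.isnan(v))
    return n, n - ddof


def nanvar_sketch(values, ddof=1):
    n, dof = _count_valid(values, ddof)
    if dof <= 0:
        return float("nan")
    clean = [v for v in values if not math.isnan(v)]
    mean = sum(clean) / n
    return sum((v - mean) ** 2 for v in clean) / dof


def nansem_sketch(values, ddof=1):
    n, _ = _count_valid(values, ddof)
    if n == 0:
        return float("nan")
    # sqrt(NaN) is NaN, so an undefined variance propagates to the result.
    return math.sqrt(nanvar_sketch(values, ddof)) / math.sqrt(n)
```

Because the count is computed in one helper, a change to how missing values are counted would apply to both functions at once.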

I have removed count_sqrt. grouper.sem is now implemented using grouper.std and grouper.count directly. Also, as suggested, I removed std from cython_agg and implemented grouper.std as sqrt(grouper.var). This also allowed a little simplification of cython_agg. It is private API so I don't know if the change needs to be documented.
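To illustrate the layering described above (std derived from var, sem derived from std and count), here is a small stand-alone sketch. `GroupedSketch` is a made-up class over plain Python lists, not the pandas groupby object, and it ignores NaN handling and the cython fast path.

```python
import math
from collections import defaultdict


class GroupedSketch:
    """Toy grouper: maps each key to the list of its values."""

    def __init__(self, keys, values):
        self._groups = defaultdict(list)
        for key, value in zip(keys, values):
            self._groups[key].append(value)

    def count(self):
        return {k: len(vals) for k, vals in self._groups.items()}

    def var(self, ddof=1):
        out = {}
        for k, vals in self._groups.items():
            n = len(vals)
            if n - ddof <= 0:
                out[k] = float("nan")
                continue
            mean = sum(vals) / n
            out[k] = sum((x - mean) ** 2 for x in vals) / (n - ddof)
        return out

    def std(self, ddof=1):
        # std is just the square root of var.
        return {k: math.sqrt(v) for k, v in self.var(ddof).items()}

    def sem(self, ddof=1):
        # sem is std divided by sqrt(count), computed per group.
        counts = self.count()
        return {k: s / math.sqrt(counts[k]) for k, s in self.std(ddof).items()}


grouped = GroupedSketch(["a", "a", "b", "b", "b"], [1.0, 3.0, 2.0, 4.0, 6.0])
print(grouped.sem())
```

With this structure, any optimization only needs to live in `var` and `count`; `std` and `sem` pick it up without their own special cases.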

I have also updated the docs to refer to 0.14.1 rather than 0.14.0.

All unit tests are passing now.

The new unittests not related to sem are in their own commits (f5bbd07 and 5771791). Since they are just additional unittests and don't change any behavior, they could be cherry-picked for 0.14.0. The grouper.std changes are also in their own commit (1b7e239), but that change is probably too invasive for 0.14.0.

jreback · 2014-05-20T13:06:12Z

from a quick skim looks ok. this will be considered for 0.14.1. nothing except critical bugs will be included in 0.14.0 at this point.

jreback · 2014-05-30T14:42:42Z

ok, you can put release notes in v0.14.1 whatsnew. when you are ready for review, pls ping.

toddrjen · 2014-06-03T10:08:39Z

Okay, it seems to be good now. Please take a look.

toddrjen · 2014-06-03T11:15:33Z

I fixed up the nanops unit tests compared to what I had before. This identified a bunch of small bugs, which I fixed, as well as places where support for pandas objects in nanops could be fairly easily improved. Support for pandas objects is not 100%, but it is much, much better than it was.

toddrjen · 2014-06-03T11:16:53Z

doc/source/v0.14.1.txt

@@ -80,6 +84,7 @@ Bug Fixes
 - Bug in ``Float64Index`` which didn't allow duplicates (:issue:`7149`).
 - Bug in ``DataFrame.replace()`` where truthy values were being replaced
  (:issue:`7140`).
+- Various small bugs in ```pandas.core.nanops```


Does this need to be more detailed?

see my general comments, needs to link to a separate issue where each bug is laid out independently (if possible) with an example.

jreback · 2014-06-03T11:23:04Z

create a new issue for all of the 'small bugs'. This should be done on top of this PR (or completely independently as possible). Mark and show all cases in the PR, and annotate the code and tests as much as possible.

It really is better to have small independent PR's rather than large multi-issue PR's. When things are interrelated its ok, but the bigger they get the harder to review, comment and fix.

Going to need a perf test on this. As well as a squash to a small number of commits, ideally have separate issues in separate commits (this is equivalent to PR's on top of each other).

jreback · 2014-06-03T11:23:41Z

doc/source/release.rst

@@ -48,6 +48,15 @@ analysis / manipulation tool available in any language.
 pandas 0.14.1
 -------------

+**Release date:** (not yet released)
+


all of this now goes in v0.14.1.txt, don't edit this file at all

jreback · 2014-06-03T11:38:06Z

@toddrjen this PR is doing way too much. you need to break this down into much simpler parts. remove all things associated with object dtype conversions, and Panel and higher dim implementations. They can be done separately and on top of each other. You are introducing way too much logic in nanops that is ONLY for ndarray/scalar types. checking for apply is simply nonsense here. This makes it impossible to follow. and there is no separation between functions at all. This is making bugs in the future MUCH MORE LIKELY.

So in order to accept this you need to split it into parts:

1. sem implementation
2. changes for any inf handling of object dtypes
3. changes for ndim > 2 handling. Need justification for why you are doing this.

pls create separate issues for 2 and 3.

toddrjen · 2014-06-03T11:38:46Z

pandas/tests/test_nanops.py

+        arr_float1_nan = self.arr_float1_nan
+        arr_nan_float1 = self.arr_nan_float1
+
+        while targ0.ndim:


This is simpler than the other case, so I thought a while loop would be clearer than recursion.

toddrjen · 2014-06-03T11:52:06Z

I have reduced this to the stuff related to nanops, plus the previously-suggested simplification of groupby's std.

jreback · 2014-06-03T11:55:37Z

@toddrjen ok that looks good (just need the fix on the release notes ; move to v0.14.1.txt). that and a perf check and i think its ok to go in.

Then can spin off the bug fixes to another PR (they are wanted, but need some explicit tests for them that validate)

toddrjen · 2014-06-03T12:07:20Z

I have already split the fixes into another branch and removed the pandas-related stuff from nanops there, but I will worry about that after this is merged. The documentation changes you requested have also been implemented.

jreback · 2014-06-03T12:11:58Z

pandas/tests/test_groupby.py

+               ('median', np.median),
+               ('std', np.std),
+               ('var', np.var),
+               ('sem', lambda x: nanops.nansem(x.values)),


can you test this with the scipy sem instead? (as this really doesn't test anything); need to skip if scipy not installed (not the whole test just that function)
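One way to read this request, sketched with stdlib stand-ins: build the list of reference functions as usual, and add the sem entry only when scipy can be imported, so the other comparisons still run without it. `scipy.stats.sem` is a real function; everything else here (`reference_functions`, `candidate_functions`, `run_checks`) is hypothetical and is not the test code in this PR.

```python
import math
import statistics


def reference_functions():
    """(name, reference) pairs; sem is included only if scipy is present."""
    refs = [("var", statistics.variance), ("std", statistics.stdev)]
    try:
        from scipy.stats import sem as scipy_sem
    except ImportError:
        pass  # skip only the sem comparison, not the whole test
    else:
        refs.append(("sem", scipy_sem))
    return refs


def candidate_functions():
    """Hand-written implementations standing in for the code under test."""
    def var(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    def std(values):
        return math.sqrt(var(values))

    def sem(values):
        return std(values) / math.sqrt(len(values))

    return {"var": var, "std": std, "sem": sem}


def run_checks(values):
    candidates = candidate_functions()
    checked = []
    for name, ref in reference_functions():
        assert math.isclose(candidates[name](values), float(ref(values)),
                            rel_tol=1e-9)
        checked.append(name)
    return checked


print(run_checks([1.0, 2.0, 3.0, 4.0]))
```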

jreback · 2014-06-03T16:21:45Z

just need perf check to make sure nothing changed, see here: https://github.com/pydata/pandas/wiki/Performance-Testing

jorisvandenbossche · 2014-06-03T20:11:03Z

pandas/core/generic.py

@@ -3794,7 +3794,8 @@ def mad(self,  axis=None, skipna=None, level=None, **kwargs):

        @Substitution(outname='variance',
                      desc="Return unbiased variance over requested "
-                           "axis\nNormalized by N-1")
+                           "axis.\n\nNormalized by N-1 by default."
+                           "This can be changed using the ddof argument")


Small detail: a space is needed after "default.", and can you wrap ddof in single backticks (like `ddof`)?
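The missing space comes from Python's implicit concatenation of adjacent string literals, which joins them with nothing in between. A short illustration using the strings from the diff above; the variable names and the "after" wording are only a guess at the fix being asked for:

```python
# Adjacent string literals are concatenated with no separator.
desc_before = ("Return unbiased variance over requested "
               "axis.\n\nNormalized by N-1 by default."
               "This can be changed using the ddof argument")

# Suggested shape of the fix: trailing space after "default." and
# backticks around ddof.
desc_after = ("Return unbiased variance over requested "
              "axis.\n\nNormalized by N-1 by default. "
              "This can be changed using the `ddof` argument")

print(desc_before)
print(desc_after)
```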

toddrjen · 2014-06-04T06:31:49Z

What, specifically, do you need in terms of results from the performance
test? I get a pretty big table, should I post that here? Or do you just
want to know if there is any change outside of a specific range? If so,
what range?

jreback · 2014-06-04T12:29:08Z

post anything > ratio of say 1.2 (post the table format, but just those entries); if something is < 0.8 you can post as well (meaning it sped up)

toddrjen · 2014-06-04T16:29:01Z

Here are the results with values outside the range you specified:

| Test name | head `ms` | base `ms` | ratio |
|---|---|---|---|
| panel_shift_minor | 0.1006 | 0.4030 | 0.2497 |
| dataframe_reindex | 1.2967 | 1.6870 | 0.7686 |
| stat_ops_series_std | 0.7694 | 0.3263 | 2.3578 |
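For reference, the ratio column is consistent with head time divided by base time, so values above 1 mean the PR branch is slower. A small sketch of the filtering requested earlier (report ratios above 1.2 or below 0.8), using the three rows from the table plus one made-up row that is clearly marked; `flag_outliers` is a hypothetical helper, not part of the perf tooling.

```python
# (test name, head ms, base ms) as posted in the table above.
ROWS = [
    ("panel_shift_minor", 0.1006, 0.4030),
    ("dataframe_reindex", 1.2967, 1.6870),
    ("stat_ops_series_std", 0.7694, 0.3263),
    # Hypothetical row, NOT from the PR, showing an entry that is filtered out.
    ("hypothetical_unchanged", 1.0000, 1.0000),
]


def flag_outliers(rows, low=0.8, high=1.2):
    """Return (name, ratio, label) for rows whose head/base ratio is out of band."""
    flagged = []
    for name, head_ms, base_ms in rows:
        ratio = head_ms / base_ms
        if ratio > high:
            flagged.append((name, ratio, "slower"))
        elif ratio < low:
            flagged.append((name, ratio, "faster"))
    return flagged


for name, ratio, label in flag_outliers(ROWS):
    print(f"{name}: {ratio:.4f} ({label})")
```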

jreback · 2014-06-04T16:31:35Z

ok, can you repeat the stat_ops_series_std a couple of times to see if it changes (time is < 1ms), so sometimes suspect. and if not, see if you can figure out the reason for the increase?

toddrjen · 2014-06-05T09:40:38Z

I ran it again with 20 reps and 10 burnin and now it is at .9868, so the previous result seems to just be noise.
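As a rough illustration of why more repetitions and a burn-in help with sub-millisecond timings, here is a sketch built on the standard `timeit` module. It is not the perf harness used for the numbers above; `time_ms` and its parameters are hypothetical, and taking the minimum over repetitions is just one common way to reduce noise.

```python
import timeit


def time_ms(func, reps=20, burnin=10):
    """Time func in milliseconds: discard burn-in calls, keep the best rep."""
    for _ in range(burnin):
        func()  # warm caches and lazy imports; these calls are not timed
    timings = timeit.repeat(func, number=1, repeat=reps)
    return min(timings) * 1000.0


elapsed = time_ms(lambda: sum(range(1000)))
print(f"{elapsed:.4f} ms")
```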

I just rebased on the most recent master and unit tests are passing, so I think this is ready to merge.

jreback · 2014-06-05T10:08:47Z

Looks ok, pls squash down to a smaller number of commits and i'll merge; ping when ready

toddrjen · 2014-06-05T10:21:49Z

I have squashed the commits.

ENH: Implement "standard error of the mean"

jreback · 2014-06-05T10:25:48Z

thanks!

toddrjen mentioned this pull request May 15, 2014

ENH: Standard Error of the Mean (sem) aggregation method #6897

Closed

jreback added API Design labels May 15, 2014

jreback added this to the 0.14.1 milestone May 15, 2014

cpcloud reviewed May 15, 2014

toddrjen reviewed May 16, 2014

toddrjen reviewed Jun 3, 2014

jreback reviewed Jun 3, 2014

toddrjen reviewed Jun 3, 2014

jreback reviewed Jun 3, 2014

jorisvandenbossche reviewed Jun 3, 2014
toddrjen added 3 commits June 5, 2014 12:21

implement additional tests for groupby apply methods

79f41cc

simplify groupby's std method

ec9a09c

add sem to nanops and pandas object apply methods

2121b22

jreback added a commit that referenced this pull request Jun 5, 2014

Merge pull request #7133 from toddrjen/sem

8fed790

ENH: Implement "standard error of the mean"

jreback merged commit 8fed790 into pandas-dev:master Jun 5, 2014

toddrjen deleted the sem branch June 5, 2014 10:28
