Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Implement "standard error of the mean" #7133

Merged
merged 3 commits into from
Jun 5, 2014
Merged

ENH: Implement "standard error of the mean" #7133

merged 3 commits into from
Jun 5, 2014

Conversation

toddrjen
Copy link
Contributor

closes #6897

As discussed in issue #6897, here is a pull request implemented "standard error of the mean" (sem) calculations in pandas. This is an extremely common but very simple statistical test. It is used pretty universally in scientific data analysis.

sem is implemented in nanops, generic, and groupby. Unit tests are included and the docs have been updated.

I have also included an improved .gitignore, based on matplotlib's but including everything from the old .gitignore (hopefully better organized now, though).

@jreback
Copy link
Contributor

jreback commented May 15, 2014

@toddrjen will take a look at this for 0.14.1

I may cherry-pick your .gitignore now though...thanks

@jreback
Copy link
Contributor

jreback commented May 15, 2014

picked this: 592a537

thanks!

For multiple groupings, the result index will be a MultiIndex
"""
if ddof == 1:
return (self._cython_agg_general('std') /
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason not just call self.std() here?

do you have a test for when self._cython_agg_general fails (in this case)? if not can you add one?

@toddrjen
Copy link
Contributor Author

@cpcloud I have implemented your suggestsions, I think:

  1. in groupby, there is now a separate _count_sqrt method, while the sem method just calls the std and _count_sqrt methods. This leaves the decision on whether to optimize up to the std and _count_sqrt methods (which have different rules).
  2. I have also implemented a general test in test_groupby that tests _cython_agg_general for all optimized operations and compares them to an expected result. It just uses a simple numeric test dataframe, it doesn't test non-numeric data or data with nan values, but to make sure basic operations are working it should be sufficient.

@@ -144,8 +144,15 @@ def _last(x):
return _last(x)


def _count_compat(x, axis=0):
return x.size
def _count_compat(x, axis=None):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a slight change here since the "axis" argument was not doing anything, and was behaving inconsistently to how it normally behaves (axis=0 is along axis 0, axis=None is flat array). I can revert this if it is an issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not doing anything but is for compat (or maybe @cpcloud removed that issue)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do the tests pass if you remove that parameter? i think there's a multiindex gruopby test that was failing iwhen i triedt o take it out

@toddrjen
Copy link
Contributor Author

I have implemented all the requested changes:

nanvar and nansem now both use a common private function to get the element count. I have also implemented unit tests for all the nan* functions in nanops, including nanvar and nansem, in a new test_nanops.py unit test file.

I have removed count_sqrt. grouper.sem is now implemented using grouper.std and grouper.count directly. Also, as suggested, I removed std from cython_agg and implemented grouper.std as sqrt(grouper.var). This also allowed a little simplification of cython_agg. It is private API so I don't know if the change needs to be documented.

I have also updated the docs to refer to 0.14.1 rather than 0.14.0.

All unit tests are passing now.

The new unittests not related to sem are in their own commits (f5bbd07 and 5771791). Since they are just additional unittests and don't change any behavior, they could be cherry-picked for 0.14.0. The grouper.std changes are also in their own commit (1b7e239), but that change is probably too invasive for 0.14.0.

@jreback
Copy link
Contributor

jreback commented May 20, 2014

for a quick skim looks ok. this will be considered for 0.14.1. nothing except critical bugs will be included in 0.14.0 at this point.

@jreback
Copy link
Contributor

jreback commented May 30, 2014

ok, you can put release notes in v0.14.1 whatsnew. when you are ready for review, pls ping.

@toddrjen
Copy link
Contributor Author

toddrjen commented Jun 3, 2014

Okay, it seems to be good now. Please take a look.

@toddrjen
Copy link
Contributor Author

toddrjen commented Jun 3, 2014

I fixed up the nanops unit tests compared to what I had before. This identified a bunch of small bugs, which I fixed, as well as places where support for pandas objects in nanops could be fairly easily improved. Support for pandas objects is not 100%, but it is much, much better than it was.

@@ -80,6 +84,7 @@ Bug Fixes
- Bug in ``Float64Index`` which didn't allow duplicates (:issue:`7149`).
- Bug in ``DataFrame.replace()`` where truthy values were being replaced
(:issue:`7140`).
- Various small bugs in ```pandas.core.nanops```
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be more detailed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my general comments, needs to link to a separate issue where each bug is laid out independtly (if possible) with an example.

@jreback
Copy link
Contributor

jreback commented Jun 3, 2014

create a new issue for all of the 'small bugs'. This should be done on top of this PR (or completely independently as possible). Mark and show all cases in the PR, and annotate the code and tests as much as possible.

It really is better to have small independent PR's rather than large multi-issue PR's. When things are interrelated its ok, but the bigger they get the harder to review, comment and fix.

Going to need a perf test on this. As well as a squash to a small number of commits, ideally have separate issues in separate commits (this is equivalent to PR's on top of each other).

@@ -48,6 +48,15 @@ analysis / manipulation tool available in any language.
pandas 0.14.1
-------------

**Release date:** (not yet released)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of this now goes in v0.14.1.txt, don't edit this file at all

@jreback
Copy link
Contributor

jreback commented Jun 3, 2014

@toddrjen this PR is doing way too much. you need to break this down into much simpler parts. remove all things associated with object dtype conversions, and Panel and higher dim implementations. They can be done separately and on top of each other. You are introducing way too much logic in nanops that is ONLY for ndarray/scalar types. checking for apply is simply nonsense here. This makes it impossible to follow. and their is no separation between functions at all. This is making bugs in the future MUCH MORE LIKELY.

So in order to accept this you need to split it into parts:

    1. sem implm
    1. changes for any inf handing object dtypes
    1. changes for ndim > 2 handing. Need justification for why you are doing this.

pls create separate issues for 2 and 3.

arr_float1_nan = self.arr_float1_nan
arr_nan_float1 = self.arr_nan_float1

while targ0.ndim:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is simpler than the the other case, so I thought a while loop would be clearer than recursion.

@toddrjen
Copy link
Contributor Author

toddrjen commented Jun 3, 2014

I have reduced this to the stuff related to nanops, plus the previously-suggested simplification of groupby's std.

@jreback
Copy link
Contributor

jreback commented Jun 3, 2014

@toddrjen ok that looks good (just need the fix on the release notes ; move to v0.14.1.txt). that and a perf check and i think its ok to go in.

Then can spin off the bug fixes to another PR (they are wanted, but need some explict tests for them that validate)

@toddrjen
Copy link
Contributor Author

toddrjen commented Jun 3, 2014

I have already split the fixes into another branch and removed the pandas-related stuff from nanops there, but I will worry about that after this is merged. The documentation changes you requested have also been implemented.

('median', np.median),
('std', np.std),
('var', np.var),
('sem', lambda x: nanops.nansem(x.values)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you test this with the scipy sem instead? (as this really doesn't test anything); need to skip if scipy not installed (not the whole test just that function)

@jreback
Copy link
Contributor

jreback commented Jun 3, 2014

just need perf check to make sure nothing changed, see here: https://github.com/pydata/pandas/wiki/Performance-Testing

@@ -3794,7 +3794,8 @@ def mad(self, axis=None, skipna=None, level=None, **kwargs):

@Substitution(outname='variance',
desc="Return unbiased variance over requested "
"axis\nNormalized by N-1")
"axis.\n\nNormalized by N-1 by default."
"This can be changed using the ddof argument")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small detail: a space is needed after default. and can you wrap ddof in single backticks (like ddof).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@toddrjen
Copy link
Contributor Author

toddrjen commented Jun 4, 2014

What, specifically, do you need in terms of results from the performance
test? I get a pretty big table, should I post that here? Or do you just
want to know if there is any change outside of a specific range? If so,
what range?

@jreback
Copy link
Contributor

jreback commented Jun 4, 2014

post anything > ratio of say 1.2 (post the table format, but just those entries); if something is < 0.8 you can post as well (meaning it sped up)

@toddrjen
Copy link
Contributor Author

toddrjen commented Jun 4, 2014

Here are the results with values outside the range you specified:

Test name head ms base ms ratio
panel_shift_minor 0.1006 0.4030 0.2497
dataframe_reindex 1.2967 1.6870 0.7686
stat_ops_series_std 0.7694 0.3263 2.3578

@jreback
Copy link
Contributor

jreback commented Jun 4, 2014

ok, can you repeat the stat_ops_series_std a couple of times to see if it changes (time is < 1ms), so sometimes suspect. and if not, see if you can figure out the reason for the increase?

@toddrjen
Copy link
Contributor Author

toddrjen commented Jun 5, 2014

I ran it again with 20 reps and 10 burnin and now it is at .9868, so the previous result seems to just be noise.

I just rebased on the most recent master and unit tests are passing, so I think this is ready to merge.

@jreback
Copy link
Contributor

jreback commented Jun 5, 2014

Looks ok, pls squash down to a smaller number of commits and i;ll merge; ping when ready

@toddrjen
Copy link
Contributor Author

toddrjen commented Jun 5, 2014

I have squashed the commits.

jreback added a commit that referenced this pull request Jun 5, 2014
ENH: Implement "standard error of the mean"
@jreback jreback merged commit 8fed790 into pandas-dev:master Jun 5, 2014
@jreback
Copy link
Contributor

jreback commented Jun 5, 2014

thanks!

@toddrjen toddrjen deleted the sem branch June 5, 2014 10:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Standard Error of the Mean (sem) aggregation method
4 participants