ENH: Implement "standard error of the mean" #7133
@toddrjen will take a look at this for 0.14.1 I may cherry-pick your |
picked this: 592a537 thanks! |
        For multiple groupings, the result index will be a MultiIndex
        """
        if ddof == 1:
            return (self._cython_agg_general('std') /
any reason not to just call `self.std()` here?
do you have a test for when `self._cython_agg_general` fails (in this case)? if not, can you add one?
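The relationship behind this suggestion can be sketched as follows: a grouped standard error is the grouped standard deviation divided by the square root of the grouped count. This is a minimal illustration using the public groupby API, not the PR's internal code; the frame and column names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b", "b", "b"],
    "value": [1.0, 2.0, 4.0, 3.0, 5.0, 7.0, 9.0],
})
grouped = df.groupby("key")["value"]

# sem = std / sqrt(n), computed per group from the public std() and count().
manual_sem = grouped.std(ddof=1) / np.sqrt(grouped.count())

# The built-in groupby sem should agree with the manual calculation.
builtin_sem = grouped.sem(ddof=1)
```

Both `manual_sem` and `builtin_sem` are Series indexed by group key, so they can be compared element by element.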
@cpcloud I have implemented your suggestions, I think:
@@ -144,8 +144,15 @@ def _last(x):
     return _last(x)


-def _count_compat(x, axis=0):
-    return x.size
+def _count_compat(x, axis=None):
I made a slight change here since the "axis" argument was not doing anything, and was behaving inconsistently to how it normally behaves (axis=0 is along axis 0, axis=None is flat array). I can revert this if it is an issue.
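For reference, the usual NumPy convention mentioned here (`axis=0` reduces along the first axis, `axis=None` reduces over the flattened array) looks like this. The sketch uses `np.sum` only to show the convention; it is not the `_count_compat` code from the diff.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

# axis=0 reduces along the first axis, giving one result per column.
per_column = np.sum(arr, axis=0)

# axis=None reduces over the flattened array, giving a single scalar.
flat_total = np.sum(arr, axis=None)

# .size ignores any notion of axis: it is always the total element count,
# which matches the axis=None reading rather than the axis=0 one.
element_count = arr.size
```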
it's not doing anything but is there for compat (or maybe @cpcloud removed that issue)
do the tests pass if you remove that parameter? i think there's a multiindex groupby test that was failing when i tried to take it out
I have implemented all the requested changes:
I have removed I have also updated the docs to refer to 0.14.1 rather than 0.14.0. All unit tests are passing now. The new unittests not related to |
for a quick skim looks ok. this will be considered for 0.14.1. nothing except critical bugs will be included in 0.14.0 at this point.
ok, you can put release notes in v0.14.1 whatsnew. when you are ready for review, pls ping.
Okay, it seems to be good now. Please take a look.
I fixed up the nanops unit tests compared to what I had before. This identified a bunch of small bugs, which I fixed, as well as places where support for pandas objects in nanops could be fairly easily improved. Support for pandas objects is not 100%, but it is much, much better than it was.
@@ -80,6 +84,7 @@ Bug Fixes
 - Bug in ``Float64Index`` which didn't allow duplicates (:issue:`7149`).
 - Bug in ``DataFrame.replace()`` where truthy values were being replaced
   (:issue:`7140`).
+- Various small bugs in ```pandas.core.nanops```
Does this need to be more detailed?
see my general comments; this needs to link to a separate issue where each bug is laid out independently (if possible) with an example.
create a new issue for all of the 'small bugs'. This should be done on top of this PR (or as independently as possible). Mark and show all cases in the PR, and annotate the code and tests as much as possible. It really is better to have small independent PRs rather than large multi-issue PRs. When things are interrelated it's ok, but the bigger they get the harder they are to review, comment on and fix. Going to need a perf test on this, as well as a squash to a small number of commits; ideally have separate issues in separate commits (this is equivalent to PRs on top of each other).
@@ -48,6 +48,15 @@ analysis / manipulation tool available in any language.
pandas 0.14.1
-------------

**Release date:** (not yet released)

all of this now goes in v0.14.1.txt, don't edit this file at all
@toddrjen this PR is doing way too much. you need to break this down into much simpler parts. remove all things associated with object dtype conversions, and Panel and higher dim implementations. They can be done separately and on top of each other. You are introducing way too much logic in nanops that is ONLY for ndarray/scalar types. checking for So in order to accept this you need to split it into parts:
pls create separate issues for 2 and 3. |
arr_float1_nan = self.arr_float1_nan
arr_nan_float1 = self.arr_nan_float1

while targ0.ndim:
This is simpler than the other case, so I thought a while loop would be clearer than recursion.
I have reduced this to the stuff related to nanops, plus the previously-suggested simplification of groupby's std.
@toddrjen ok that looks good (just need the fix on the release notes; move to v0.14.1.txt). that and a perf check and i think it's ok to go in. Then can spin off the bug fixes to another PR (they are wanted, but need some explicit tests for them that validate)
I have already split the fixes into another branch and removed the pandas-related stuff from nanops there, but I will worry about that after this is merged. The documentation changes you requested have also been implemented.
('median', np.median),
('std', np.std),
('var', np.var),
('sem', lambda x: nanops.nansem(x.values)),
can you test this against the scipy `sem` instead? (as this really doesn't test anything); need to skip if scipy is not installed (not the whole test, just that function)
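A check along these lines might look like the sketch below. It compares a small NumPy-based reference against `scipy.stats.sem` and returns `None` when SciPy is unavailable. The helper names `sem_reference` and `compare_with_scipy` are hypothetical, and a real test suite would use its framework's skip mechanism rather than the `try`/`except` stand-in shown here.

```python
import numpy as np


def sem_reference(values, ddof=1):
    """Standard error of the mean: std(ddof) / sqrt(n). Hypothetical helper."""
    arr = np.asarray(values, dtype=float)
    return float(np.std(arr, ddof=ddof) / np.sqrt(arr.size))


def compare_with_scipy(values, ddof=1):
    """Return True/False for agreement with SciPy, or None if SciPy is missing."""
    try:
        from scipy.stats import sem as scipy_sem
    except ImportError:
        # Stand-in for "skip just this function" when SciPy is not installed.
        return None
    expected = scipy_sem(np.asarray(values, dtype=float), ddof=ddof)
    return bool(np.isclose(sem_reference(values, ddof=ddof), expected))


outcome = compare_with_scipy([1.0, 2.0, 3.0, 4.0])
```

Comparing against an independent implementation avoids the problem raised in the comment, where the expected value comes from the same function that is under test.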
just need perf check to make sure nothing changed, see here: https://github.com/pydata/pandas/wiki/Performance-Testing |
@@ -3794,7 +3794,8 @@ def mad(self, axis=None, skipna=None, level=None, **kwargs):

 @Substitution(outname='variance',
               desc="Return unbiased variance over requested "
-              "axis\nNormalized by N-1")
+              "axis.\n\nNormalized by N-1 by default."
+              "This can be changed using the ddof argument")
Small detail: a space is needed after `default.`, and can you wrap `ddof` in single backticks (like `ddof`)?
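For context on what the docstring describes, here is a small sketch of how `ddof` changes the normalization, using `np.var` with made-up numbers: the divisor is `N - ddof`, so `ddof=1` gives the N-1 normalization and `ddof=0` divides by N.

```python
import numpy as np

# Illustrative data only: mean is 5, squared deviations sum to 32.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# ddof=1 divides by N - 1 (the "unbiased" estimate described in the docstring).
sample_variance = np.var(data, ddof=1)

# ddof=0 divides by N (the population variance).
population_variance = np.var(data, ddof=0)
```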
Fixed
What, specifically, do you need in terms of results from the performance |
post anything > ratio of say 1.2 (post the table format, but just those entries); if something is < 0.8 you can post as well (meaning it sped up) |
Here are the results with values outside the range you specified:
|
ok, can you repeat the |
I ran it again with 20 reps and 10 burnin and now it is at 0.9868, so the previous result seems to just be noise. I just rebased on the most recent master and unit tests are passing, so I think this is ready to merge.
Looks ok, pls squash down to a smaller number of commits and i'll merge; ping when ready
I have squashed the commits.
ENH: Implement "standard error of the mean"
thanks! |
closes #6897
As discussed in issue #6897, here is a pull request implementing "standard error of the mean" (sem) calculations in pandas. This is an extremely common but very simple statistical test. It is used pretty universally in scientific data analysis.
sem is implemented in nanops, generic, and groupby. Unit tests are included and the docs have been updated.
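As a rough illustration of the quantity itself (not the PR's nanops code), the standard error of the mean is the sample standard deviation divided by the square root of the number of valid observations. The function name `sem_sketch` and its NaN handling are assumptions made for this example.

```python
import numpy as np


def sem_sketch(values, ddof=1):
    """Standard error of the mean, skipping NaNs. Hypothetical illustration."""
    arr = np.asarray(values, dtype=float)
    valid = arr[~np.isnan(arr)]  # drop missing values before counting
    n = valid.size
    if n - ddof <= 0:
        # Not enough observations for the requested degrees of freedom.
        return float("nan")
    return float(np.std(valid, ddof=ddof) / np.sqrt(n))


result = sem_sketch([1.0, 2.0, 3.0, 4.0, np.nan])
```

With the NaN dropped, four observations remain, so the result is the `ddof=1` standard deviation of `[1, 2, 3, 4]` divided by 2.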
I have also included an improved .gitignore, based on matplotlib's but including everything from the old .gitignore (hopefully better organized now, though).