Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: correlation function accepts method being a callable #22684

Merged
merged 1 commit into from
Sep 26, 2018

Conversation

shadiakiki1986
Copy link
Contributor

@shadiakiki1986 shadiakiki1986 commented Sep 13, 2018

  • other than the listed strings for the method argument, accept a callable for generic correlation calculations
  • 
    
  • closes #xxxx (no issue opened for PR)
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry (not sure if I should do this or the person merging this PR)

@pep8speaks
Copy link

pep8speaks commented Sep 13, 2018

Hello @shadiakiki1986! Thanks for updating the PR.

Comment last updated on September 26, 2018 at 02:15 Hours UTC

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea. Does this close an issue?

We'll need a release note in 0.24.0.txt

@@ -14,7 +14,7 @@ lxml
matplotlib
nbsphinx
numexpr
openpyxl=2.5.5
openpyxl==2.5.5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is autogenerated. #22689

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok should I roll this edit back?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you should.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

.. ipython:: python

# histogram intersection
histogram_intersection = lambda a, b: np.minimum( np.true_divide(a, a.sum()), np.true_divide(b, b.sum())).sum()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this pep8 compliant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in new commit

@@ -6652,10 +6652,12 @@ def corr(self, method='pearson', min_periods=1):

Parameters
----------
method : {'pearson', 'kendall', 'spearman'}
method : {'pearson', 'kendall', 'spearman', callable}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'll be `}, or callable on the outside of the options

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved it to outside of the options

* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation
* callable: callable with input two numpy 1d-array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"callable expecting two 1d ndarrays and returning a float" (does the callable get ndarrays or series?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ndarrays

@@ -789,6 +789,41 @@ def test_corr_invalid_method(self):
with tm.assert_raises_regex(ValueError, msg):
s1.corr(s2, method="____")

def test_corr_callable_method(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I care less about testing this exact way of computing the correlation, and more about ensure that the method is dispatched to.

Would it be possible to define a very simple "correlation" function that just returns something like the index of the columns? So the correlation of the nth or and mth column would be like (n + m). Not sure if that's possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok will re-write the test tomorrow

Copy link
Contributor Author

@shadiakiki1986 shadiakiki1986 Sep 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test now includes a simpler correlation function. It is not possible to identify the nth/mth column as in your example because the correlation function itself does not know about the dataframe as a whole but only as each series on its own. The correlation function I chose is a simple 1 if exact equality else 0

@shadiakiki1986
Copy link
Contributor Author

This doesn't fix any issue, but I added a note in the 0.24.0 release notes

@gfyoung gfyoung added Enhancement Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Sep 14, 2018
@@ -17,6 +17,10 @@ New features

- ``ExcelWriter`` now accepts ``mode`` as a keyword argument, enabling append to existing workbooks when using the ``openpyxl`` engine (:issue:`3441`)


- :meth:`DataFrame.corr` and :meth:`Series.corr` now accept a callable for generic calculation methods of correlation, e.g. histogram intersection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use your PR as the issue number. Also, no new line above this sentence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @jreback if you have a chance.

This provides a nice, generic alternative to #21925.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments.

* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation
* callable: callable with input two 1d ndarrays
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an example in Examples

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -764,6 +764,9 @@ def nancorr(a, b, method='pearson', min_periods=None):


def get_corr_func(method):
if callable(method):
return method

if method in ['kendall', 'spearman']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

elif

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation
* callable: callable with input two 1d ndarray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how to doc-string this signature here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I just leave it as is?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably fine as is. The type would be Callable[[ndarray, ndarray], float], but I'm not sure how familiar people are with typing yet.

@jreback jreback added this to the 0.24.0 milestone Sep 18, 2018
@jreback
Copy link
Contributor

jreback commented Sep 18, 2018

@TomAugspurger yep agreed

@codecov
Copy link

codecov bot commented Sep 20, 2018

Codecov Report

Merging #22684 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22684      +/-   ##
==========================================
+ Coverage   92.18%   92.19%   +<.01%     
==========================================
  Files         169      169              
  Lines       50819    50821       +2     
==========================================
+ Hits        46850    46852       +2     
  Misses       3969     3969
Flag Coverage Δ
#multiple 90.6% <100%> (ø) ⬆️
#single 42.37% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/nanops.py 95.14% <100%> (+0.01%) ⬆️
pandas/core/series.py 93.74% <100%> (ø) ⬆️
pandas/core/frame.py 97.2% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a936399...dbfd95f. Read the comment docs.


>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],
... columns=['dogs', 'cats'])
>>> df.corr(method = histogram_intersection)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is failing our doctests. Is there an issue with the output?

Also, pep8: no spaces around the =

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

>>> s1 = pd.Series([1, 0, 2, 1])
>>> s2 = pd.Series([2, 3, 0, 1])
>>> s1.corr(s2, method = histogram_intersection)
0.416667
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will need to round this, or write it out at full precision.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation
* callable: callable with input two 1d ndarray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably fine as is. The type would be Callable[[ndarray, ndarray], float], but I'm not sure how familiar people are with typing yet.

@shadiakiki1986 shadiakiki1986 force-pushed the corr-method-callable branch 2 times, most recently from df06de9 to dc87331 Compare September 22, 2018 07:29
# simple correlation example
# returns 1 if exact equality, 0 otherwise
my_corr = lambda a, b: 1. if (a == b).all() else 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use result= and expected= here, rather than expected_1 and such. its much easier to follow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@shadiakiki1986 shadiakiki1986 force-pushed the corr-method-callable branch 2 times, most recently from 60b3498 to 4a69a70 Compare September 26, 2018 02:15
- other than the listed strings for the `method` argument, accept a callable for generic correlation calculations
@jreback jreback merged commit a393675 into pandas-dev:master Sep 26, 2018
@jreback
Copy link
Contributor

jreback commented Sep 26, 2018

thanks @shadiakiki1986 nice change!

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants