DataFrame.corrwith has unintuitive behavior #22328

Closed
dsaxton opened this issue Aug 13, 2018 · 3 comments · Fixed by #22375


@dsaxton
Member

dsaxton commented Aug 13, 2018

import numpy as np
import pandas as pd

np.random.seed(2357)

df1 = pd.DataFrame(np.random.normal(size=(100, 3)))
df2 = pd.DataFrame(np.random.normal(size=(100, 3)))

# not clear what this means
df1.corrwith(df2)
# simply renaming columns produces nans
df2.columns = ["a", "b", "c"]
df1.corrwith(df2)
# this should arguably be the same as df1.corr()
df1.corrwith(df1)

Problem description

The docstring for corrwith describes its behavior as:

Compute pairwise correlation between rows or columns of two DataFrame
objects.

My interpretation is that corrwith computes all pairwise correlations between the two DataFrames along the given axis. (It is also defined for Series input, which the docstring should probably state more explicitly.) That is, for axis=0, df1.corrwith(df2) would be functionally equivalent to

df2.apply(lambda x: df1.corrwith(x))

However, as implemented, the method returns a single Series that appears to give only specific pairwise correlations, and how the pairs are chosen does not appear to be documented.

Defining corrwith recursively as above seems both more useful (it retains all the information from the current implementation) and more interpretable. It also fixes the apparent bug that occurs when the method tries to align columns, and it offers an easy way to close issue #21925, since the method ultimately falls back on Series.corr, which permits control of the correlation type. I can start work on a pull request if others agree that this would be preferred.

Expected Output

# df1.corrwith(df2) should be equal to this in my opinion
df2.apply(lambda x: df1.corrwith(x))
# this yields the ordinary correlation matrix as expected, although it is slower than just df1.corr()
df1.apply(lambda x: df1.corrwith(x))
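
To make the claim about retaining information concrete, here is a minimal sketch comparing the two forms. It assumes both frames share the default integer column labels, and the variable names (aligned, all_pairs, diagonal) are illustrative only:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(2357)
df1 = pd.DataFrame(rng.normal(size=(100, 3)))
df2 = pd.DataFrame(rng.normal(size=(100, 3)))

# Current behavior: one correlation per column label shared by both frames.
aligned = df1.corrwith(df2)

# Proposed behavior: every column of df1 against every column of df2.
# Rows are labelled by df1's columns, columns by df2's columns.
all_pairs = df2.apply(lambda col: df1.corrwith(col))

# With identical labels, the diagonal of the all-pairs matrix
# holds the same values as the current output.
diagonal = pd.Series(np.diag(all_pairs.values), index=all_pairs.index)
print(np.allclose(diagonal.values, aligned.values))
```

The off-diagonal entries are the extra correlations that the current Series output does not report.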

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: 3.7.1
pip: 18.0
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

> appears to give specific pairwise correlations, and how the pairs are chosen does not appear to be documented.

The pairs are aligned by label (row or column labels, depending on the axis), like most pandas methods.

In [9]: df1.columns = [1, 2, 3]

In [10]: df1.corrwith(df2)
Out[10]:
0         NaN
1    0.046158
2   -0.028837
3         NaN
dtype: float64

A correction to the docs indicating that (possibly with an example too) would be most welcome.
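
As a rough idea of what such a docs example might look like, the sketch below shows label alignment with partially overlapping column names, plus one way to pair columns by position instead. The frame names and column labels are made up for illustration, and relabelling a copy is just one possible workaround, not something the docstring prescribes:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
left = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
right = pd.DataFrame(rng.normal(size=(50, 3)), columns=["b", "c", "d"])

# Only "b" and "c" exist in both frames; "a" and "d" come back as NaN.
by_label = left.corrwith(right)
print(by_label)

# To pair columns by position instead, give both frames the same labels first.
right_relabelled = right.copy()
right_relabelled.columns = left.columns
by_position = left.corrwith(right_relabelled)
print(by_position)
```

Passing drop=True to corrwith omits the unmatched labels from the result instead of reporting them as NaN.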

@TomAugspurger
Contributor

I don't think changing to df2.apply(lambda x: df1.corrwith(x)) is worth breaking backwards compatibility, but adding that to the docs or cookbook seems useful.
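
A cookbook-style entry could look something like the sketch below. The helper name cross_corr and its signature are hypothetical (not part of pandas); it builds on Series.corr so that the correlation method can be chosen, which is the request in #21925:

```python
import numpy as np
import pandas as pd


def cross_corr(left, right, method="pearson"):
    """Hypothetical helper: correlate every column of left with every column of right.

    Rows of the result are labelled by left's columns, columns by right's.
    Column labels are never matched against each other, so renaming is harmless.
    """
    return right.apply(
        lambda right_col: left.apply(
            lambda left_col: left_col.corr(right_col, method=method)
        )
    )


rng = np.random.RandomState(2357)
df1 = pd.DataFrame(rng.normal(size=(100, 3)))
df2 = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

# Renamed columns no longer produce NaN, because nothing is aligned by column label.
cross = cross_corr(df1, df2)
print(cross)

# Correlating a frame with itself gives the ordinary correlation matrix.
self_corr = cross_corr(df1, df1)
print(np.allclose(self_corr.values, df1.corr().values))
```

Series.corr also accepts method="spearman" and method="kendall", though those may require SciPy to be installed. Like the apply-based one-liner, this is slower than df1.corr() for the self-correlation case.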

@dsaxton
Member Author

dsaxton commented Aug 14, 2018

Agreed on the backwards compatibility. I can work on a PR that closes this (at least the docstring part) along with #21925.

@jreback jreback added this to the 0.24.0 milestone Aug 17, 2018
@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Dec 2, 2018
@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Dec 30, 2018