DataFrame.corrwith has unintuitive behavior #22328

dsaxton · 2018-08-13T23:50:25Z

import numpy as np
import pandas as pd

np.random.seed(2357)

df1 = pd.DataFrame(np.random.normal(size=(100, 3)))
df2 = pd.DataFrame(np.random.normal(size=(100, 3)))

# not clear what this means
df1.corrwith(df2)
# simply renaming the columns produces NaNs
df2.columns = ["a", "b", "c"]
df1.corrwith(df2)
# this should arguably be the same as df1.corr()
df1.corrwith(df1)

Problem description

The docstring for corrwith describes its behavior as:

Compute pairwise correlation between rows or columns of two DataFrame
objects.

My interpretation of this is that corrwith computes all pairwise correlations between the two DataFrames along the given axis. (It is also defined for Series input, which the docstring should state more explicitly.) That is, for axis=0, df1.corrwith(df2) would be functionally equivalent to

df2.apply(lambda x: df1.corrwith(x))

However, as implemented the method returns a single Series which appears to give specific pairwise correlations, and how the pairs are chosen does not appear to be documented.

Defining corrwith recursively as above seems more useful and more interpretable: all the information from the current implementation is retained, and the apparent bug caused when the method tries to align columns goes away. It also offers an easy way of closing issue #21925, since the method ultimately falls back on Series.corr, which permits control of the correlation type. I can start working on a pull request if others agree that this would be preferred.
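As a rough illustration of the rank-correlation point, here is a minimal sketch that avoids any dependence on a correlation-type argument. It relies only on the fact that Spearman correlation is Pearson correlation computed on ranks. The names `rng`, `rank_corr`, `combined`, `full`, and `check`, and the `left_`/`right_` prefixes, are made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(2357)
df1 = pd.DataFrame(rng.normal(size=(100, 3)))
df2 = pd.DataFrame(rng.normal(size=(100, 3)))

# Spearman correlation is Pearson correlation on ranks, so ranking both
# frames first gives rank correlations for the label-aligned column pairs.
rank_corr = df1.rank().corrwith(df2.rank())

# Cross-check against DataFrame.corr(method="spearman") on a combined frame.
# The prefixes are illustrative and only keep the column labels distinct.
combined = pd.concat([df1.add_prefix("left_"), df2.add_prefix("right_")], axis=1)
full = combined.corr(method="spearman")
check = pd.Series({i: full.loc["left_%d" % i, "right_%d" % i] for i in df1.columns})

print(rank_corr)
print(check)
```

The two printed Series should agree for continuous data without ties; this is a workaround sketch, not a description of how corrwith is implemented.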

Expected Output

# df1.corrwith(df2) should be equal to this in my opinion
df2.apply(lambda x: df1.corrwith(x))
# this yields the ordinary correlation matrix as expected, although it is slower than just df1.corr()
df1.apply(lambda x: df1.corrwith(x))

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: 3.7.1
pip: 18.0
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


TomAugspurger · 2018-08-14T16:52:25Z

appears to give specific pairwise correlations, and how the pairs are chosen does not appear to be documented.

The pairs are chosen by label: columns (or rows, depending on axis) are aligned by label, as in most pandas methods.

In [9]: df1.columns = [1, 2, 3]

In [10]: df1.corrwith(df2)
Out[10]:
0         NaN
1    0.046158
2   -0.028837
3         NaN
dtype: float64

A correction to the docs indicating that (possibly with an example too) would be most welcome.
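A small sketch of the kind of example the docs could carry, assuming two frames whose column labels only partly overlap. The labels and the names `rng` and `result` are chosen for illustration. Shared labels are correlated column against column, and labels present in only one frame come back as NaN.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(2357)
df1 = pd.DataFrame(rng.normal(size=(100, 3)), columns=[1, 2, 3])
df2 = pd.DataFrame(rng.normal(size=(100, 3)), columns=[0, 1, 2])

# Columns are paired by label: only labels 1 and 2 exist in both frames.
result = df1.corrwith(df2)

# Each matched entry equals the plain Series correlation of the two columns.
for label in (1, 2):
    print(label, result.loc[label], df1[label].corr(df2[label]))

# Labels found in only one frame (0 and 3) are NaN in the result.
print(result.loc[[0, 3]])

# drop=True omits the unmatched labels instead of reporting NaN.
print(df1.corrwith(df2, drop=True))
```

The order of the labels in the result may differ between pandas versions, so the sketch looks entries up by label rather than by position.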

TomAugspurger · 2018-08-14T16:54:35Z

I don't think changing to df2.apply(lambda x: df1.corrwith(x)) is worth breaking backwards compatibility, but adding that to the docs or cookbook seems useful.
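A possible shape for such a cookbook entry, as a sketch only. The column names and the variables `rng`, `cross`, `full`, and `self_cross` are invented for the example. It builds the full cross-correlation matrix by applying corrwith to each column of the second frame, then checks the result against np.corrcoef.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(2357)
df1 = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x0", "x1", "x2"])
df2 = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

# Every column of df1 against every column of df2.
# Rows are df1's columns, columns are df2's columns.
cross = df2.apply(lambda col: df1.corrwith(col))
print(cross)

# Cross-check: the off-diagonal block of the joint correlation matrix.
full = np.corrcoef(df1.values, df2.values, rowvar=False)
print(np.allclose(cross.values, full[:3, 3:]))

# Applied to a single frame, the same pattern reproduces df1.corr(),
# although more slowly.
self_cross = df1.apply(lambda col: df1.corrwith(col))
print(np.allclose(self_cross.values, df1.corr().values))
```

Because corrwith receives a Series here, no column-label alignment between the two frames is involved; only the row index has to match.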

dsaxton · 2018-08-14T21:21:13Z

Agreed on the backwards compatibility. I can work on a PR that closes this (at least the docstring part) along with #21925.

TomAugspurger added the Docs, Effort Low, and good first issue labels on Aug 14, 2018

dsaxton mentioned this issue Aug 16, 2018

ENH: Enable DataFrame.corrwith to compute rank correlations #22375

Merged

jreback added this to the 0.24.0 milestone Aug 17, 2018

jreback modified the milestones: 0.24.0, Contributions Welcome Dec 2, 2018

jreback modified the milestones: Contributions Welcome, 0.24.0 Dec 30, 2018

jreback closed this as completed in #22375 Dec 31, 2018
