import numpy as np
import pandas as pd

np.random.seed(2357)
df1 = pd.DataFrame(np.random.normal(size=(100, 3)))
df2 = pd.DataFrame(np.random.normal(size=(100, 3)))

# not clear what this means
df1.corrwith(df2)

# simply renaming columns produces nans
df2.columns = ["a", "b", "c"]
df1.corrwith(df2)

# this should arguably be the same as df1.corr()
df1.corrwith(df1)
Problem description
The docstring for corrwith describes its behavior as:
Compute pairwise correlation between rows or columns of two DataFrame
objects.
My interpretation of this would be that corrwith finds all pairwise correlations between the two DataFrames along the given axis (it's also defined for Series input which should probably be a bit more explicit in the docstring). That is (for axis=0) df1.corrwith(df2) would be functionally equivalent to
df2.apply(lambda x: df1.corrwith(x))
However, as implemented the method returns a single Series which appears to give specific pairwise correlations, and how the pairs are chosen does not appear to be documented.
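To make the pairing concrete, here is a small sketch of what the current behavior appears to be: the two frames are aligned on their column labels, and each entry of the result is the correlation between the same-named columns. The column names `x`, `y`, `z` and the variables `paired`, `manual`, and `unpaired` are made up for illustration, and the NaN behavior reflects the pandas versions I am aware of rather than a documented guarantee.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(2357)
# Illustrative frames that share the (hypothetical) labels x, y, z.
df1 = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x", "y", "z"])
df2 = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x", "y", "z"])

# With shared labels, each entry pairs the same-named columns.
paired = df1.corrwith(df2)
manual = pd.Series({c: df1[c].corr(df2[c]) for c in df1.columns})

# With no shared labels there is nothing to pair, so the entries are NaN.
renamed = df2.rename(columns={"x": "a", "y": "b", "z": "c"})
unpaired = df1.corrwith(renamed)
```

Under this reading, the NaNs after renaming are a consequence of label alignment rather than of the data itself.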
Defining corrwith recursively as above seems more useful (you still retain all the information from the current implementation) and interpretable, fixes the apparent bug caused when the method tries to align columns, and also allows for an easy way of closing issue #21925 (since ultimately the method falls back on Series.corr which permits control of the correlation type). I can start to work on a pull request if others agree that this would be preferred.
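As a rough sketch of the proposal (not an actual pandas API), the helper below builds the full matrix of pairwise column correlations by falling back on `Series.corr`. The name `cross_corr` and its `method` argument are hypothetical. The diagonal of the result reproduces what `corrwith` currently returns for frames with matching labels, which is the sense in which no information is lost.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(2357)
df1 = pd.DataFrame(rng.normal(size=(100, 3)))
df2 = pd.DataFrame(rng.normal(size=(100, 3)))


def cross_corr(left, right, method="pearson"):
    """Hypothetical helper: correlate every column of left with every column of right.

    Rows of the result are labelled by left's columns, columns by right's.
    """
    return right.apply(
        lambda col: left.apply(lambda x: x.corr(col, method=method))
    )


full = cross_corr(df1, df2)

# The diagonal holds the same-label pairs that corrwith reports today.
diag = pd.Series(np.diag(full.values), index=df1.columns)

# Correlating a frame with itself gives the ordinary correlation matrix.
self_corr = cross_corr(df1, df1)
```

Because each entry comes from `Series.corr`, passing a different `method` would switch the correlation type; the rank-based options may need SciPy installed.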
Expected Output
# df1.corrwith(df2) should be equal to this in my opinion
df2.apply(lambda x: df1.corrwith(x))

# this yields the ordinary correlation matrix as expected, although it is slower than just df1.corr()
df1.apply(lambda x: df1.corrwith(x))
I don't think changing to df2.apply(lambda x: df1.corrwith(x)) is worth breaking backwards compatibility, but adding that to the docs or cookbook seems useful.
Output of pd.show_versions()
pandas: 0.23.0
pytest: 3.7.1
pip: 18.0
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None