Merge pull request #4950 from snth/pairwise

Pairwise versions for rolling_cov, ewmcov and expanding_cov
pandas-dev · Mar 28, 2014 · aa166bf
2 parents 1ff776a + 1fcb94e
commit aa166bf
Showing 4 changed files with 295 additions and 177 deletions.
diff --git a/doc/source/computation.rst b/doc/source/computation.rst
@@ -59,6 +59,19 @@ The ``Series`` object has a method ``cov`` to compute covariance between series
 Analogously, ``DataFrame`` has a method ``cov`` to compute pairwise covariances
 among the series in the DataFrame, also excluding NA/null values.
 
+.. _computation.covariance.caveats:
+
+.. note::
+
+    Assuming the missing data are missing at random, this results in an unbiased
+    estimate of the covariance matrix. However, for many applications
+    this estimate may not be acceptable because the estimated covariance matrix
+    is not guaranteed to be positive semi-definite. This could lead to
+    estimated correlations having absolute values which are greater than one,
+    and/or a non-invertible covariance matrix. See `Estimation of covariance
+    matrices <http://en.wikipedia.org/w/index.php?title=Estimation_of_covariance_matrices>`_
+    for more details.
+
 .. ipython:: python
 
    frame = DataFrame(randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])
@@ -99,6 +112,12 @@ correlation methods are provided:
 
 All of these are currently computed using pairwise complete observations.
 
+.. note::
+
+    Please see the :ref:`caveats <computation.covariance.caveats>` associated
+    with this method of calculating correlation matrices in the 
+    :ref:`covariance section <computation.covariance>`.
+
 .. ipython:: python
 
    frame = DataFrame(randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])
@@ -325,11 +344,14 @@ Binary rolling moments
 two ``Series`` or any combination of ``DataFrame/Series`` or
 ``DataFrame/DataFrame``. Here is the behavior in each case:
 
-- two ``Series``: compute the statistic for the pairing
+- two ``Series``: compute the statistic for the pairing.
 - ``DataFrame/Series``: compute the statistics for each column of the DataFrame
-  with the passed Series, thus returning a DataFrame
-- ``DataFrame/DataFrame``: compute statistic for matching column names,
-  returning a DataFrame
+  with the passed Series, thus returning a DataFrame.
+- ``DataFrame/DataFrame``: by default compute the statistic for matching column
+  names, returning a DataFrame. If the keyword argument ``pairwise=True`` is
+  passed, compute the statistic for each pair of columns instead, returning a
+  ``Panel`` whose ``items`` are the dates in question (see :ref:`the next section
+  <stats.moments.corr_pairwise>`).
 
 For example:
 
@@ -340,20 +362,42 @@ For example:
 
 .. _stats.moments.corr_pairwise:
 
-Computing rolling pairwise correlations
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Computing rolling pairwise covariances and correlations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-In financial data analysis and other fields it's common to compute correlation
-matrices for a collection of time series. More difficult is to compute a
-moving-window correlation matrix. This can be done using the
-``rolling_corr_pairwise`` function, which yields a ``Panel`` whose ``items``
-are the dates in question:
+In financial data analysis and other fields it's common to compute covariance
+and correlation matrices for a collection of time series. Often one is also
+interested in moving-window covariance and correlation matrices. This can be
+done by passing the ``pairwise`` keyword argument, which in the case of
+``DataFrame`` inputs will yield a ``Panel`` whose ``items`` are the dates in
+question. In the case of a single DataFrame argument the ``pairwise`` argument
+can even be omitted:
+
+.. note::
+
+    Missing values are ignored and each entry is computed using the pairwise
+    complete observations.  Please see the :ref:`covariance section
+    <computation.covariance>` for :ref:`caveats
+    <computation.covariance.caveats>` associated with this method of
+    calculating covariance and correlation matrices.
 
 .. ipython:: python
 
-   correls = rolling_corr_pairwise(df, 50)
+   covs = rolling_cov(df[['B','C','D']], df[['A','B','C']], 50, pairwise=True)
+   covs[df.index[-50]]
+
+.. ipython:: python
+
+   correls = rolling_corr(df, 50)
    correls[df.index[-50]]
 
+.. note::
+
+    Prior to version 0.14 this was available through ``rolling_corr_pairwise``,
+    which is now deprecated and is simply syntactic sugar for calling
+    ``rolling_corr(..., pairwise=True)``. It is likely to be removed in a
+    future release.
+
 You can efficiently retrieve the time series of correlations between two
 columns using ``ix`` indexing:
 

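Illustration of the caveat in the note added to ``computation.rst`` above (a covariance matrix estimated from pairwise complete observations is not guaranteed to be positive semi-definite). This is a minimal plain-Python sketch, not code from the pull request: the toy data and the helpers ``pairwise_cov`` and ``det3`` are hypothetical and only stand in for the pairwise-complete computation the docs describe.

```python
import math

NAN = math.nan

# Toy data (an assumption for illustration): each pair of columns
# overlaps on a different set of three rows.
data = {
    "a": [1.0, 2.0, 3.0, NAN, NAN, NAN, 1.0, 2.0, 3.0],
    "b": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, NAN, NAN, NAN],
    "c": [NAN, NAN, NAN, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0],
}


def pairwise_cov(x, y):
    """Sample covariance over the rows where both x and y are observed."""
    pairs = [(u, v) for u, v in zip(x, y)
             if not (math.isnan(u) or math.isnan(v))]
    n = len(pairs)
    mean_x = sum(u for u, _ in pairs) / n
    mean_y = sum(v for _, v in pairs) / n
    return sum((u - mean_x) * (v - mean_y) for u, v in pairs) / (n - 1)


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


cols = list(data)
# Every entry uses only the rows that are complete for that pair.
cov = [[pairwise_cov(data[i], data[j]) for j in cols] for i in cols]

# A positive semi-definite matrix cannot have a negative determinant.
determinant = det3(cov)

# Correlation implied by dividing a covariance by the diagonal variances.
implied_corr_ab = cov[0][1] / math.sqrt(cov[0][0] * cov[1][1])

print(cov)
print(determinant, implied_corr_ab)
```

Here ``a`` and ``b`` move together on their shared rows, as do ``b`` and ``c``, while ``a`` and ``c`` move in opposite directions on theirs. No single complete data set could produce all three relationships at once, so the assembled matrix has a negative determinant and the implied correlation between ``a`` and ``b`` exceeds one.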
diff --git a/doc/source/v0.14.0.txt b/doc/source/v0.14.0.txt
@@ -183,6 +183,19 @@ These are out-of-bounds selections
 
    Because of the default `align` value changes, coordinates of bar plots are now located on integer values (0.0, 1.0, 2.0 ...). This is intended to place bar plots on the same coordinates as line plots. However, bar plots may differ unexpectedly when you manually adjust the bar location or drawing area, such as by using `set_xlim`, `set_ylim`, etc. In these cases, please modify your script to match the new coordinates.
 
+- A ``pairwise`` keyword was added to the statistical moment functions
+  ``rolling_cov``, ``rolling_corr``, ``ewmcov``, ``ewmcorr``,
+  ``expanding_cov``, ``expanding_corr`` to allow the calculation of moving
+  window covariance and correlation matrices (:issue:`4950`). See
+  :ref:`Computing rolling pairwise covariances and correlations
+  <stats.moments.corr_pairwise>` in the docs.
+
+  .. ipython:: python
+
+    df = DataFrame(np.random.randn(10,4),columns=list('ABCD'))
+    covs = rolling_cov(df[['A','B','C']], df[['B','C','D']], 5, pairwise=True)
+    covs[df.index[-1]]
+
 
 MultiIndexing Using Slicers
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~