Implement Base.cor for DataFrame #583

farrellm · 2014-04-22T07:20:35Z

I think Base.cor has a well-defined meaning for DataFrames that is distinct from the cor of the associated array: in particular, the correlation of the columns with NA handling, e.g.,

function corna(x, y)
    # keep only the rows where both x and y are observed
    b = !(isna(x) | isna(y))
    cor(x[b], y[b])
end

import Base.cor
function cor(df::AbstractDataFrame)
    [corna(x[2], y[2]) for x=eachcol(df), y=eachcol(df)]
end

I realize NA handling is tricky, but could we add something like this? (and probably Base.cov at the same time) Thanks.

nalimilan · 2014-04-22T08:43:34Z

I think the NA handling should be a keyword argument. FWIW, R supports several behaviors wrt. NAs:

return NA if an NA is present -- there's also a variant returning an error
use rows where none of the variables is NA ("complete cases") -- also with a variant returning an error
use rows where no NA is present for the pair of variables considered (so a different set of rows may be used for each computed correlation)

I think R offers too many options, but the three ways presented above can be useful.
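For readers on current Julia, the three behaviors listed above can be sketched as follows. This is only an illustration under stated assumptions: `missing` stands in for NA, columns are passed as a plain vector of vectors, and the names `cor_columns`, `both_present`, and the `handling` keyword are hypothetical rather than part of any package.

```julia
using Statistics

# Rows in which neither x nor y is missing.
both_present(x, y) = .!(ismissing.(x) .| ismissing.(y))

# Hypothetical helper; `handling` selects one of the three behaviors:
#   :propagate - a pair with any missing value yields missing
#   :complete  - use only rows where every column is observed
#   :pairwise  - use rows where both columns of the pair are observed
function cor_columns(cols::AbstractVector{<:AbstractVector}; handling::Symbol = :propagate)
    n = length(cols)
    if handling == :complete
        keep = reduce((a, b) -> a .& b, [.!ismissing.(c) for c in cols])
        cols = [c[keep] for c in cols]
    elseif handling ∉ (:propagate, :pairwise)
        throw(ArgumentError("unknown handling: $handling"))
    end
    result = Matrix{Union{Missing, Float64}}(undef, n, n)
    for i in 1:n, j in 1:n
        x, y = cols[i], cols[j]
        if handling == :pairwise
            keep = both_present(x, y)
            x, y = x[keep], y[keep]
        end
        # Any missing value still present at this point propagates.
        result[i, j] = any(ismissing, x) || any(ismissing, y) ?
            missing : cor(Float64.(x), Float64.(y))
    end
    return result
end

cols = [[1.0, 2.0, 3.0, 4.0, missing],
        [2.0, 4.0, 6.0, missing, 10.0],
        [1.0, 0.0, 1.0, 0.0, 1.0]]
cor_columns(cols)                        # entries involving columns 1 or 2 are missing
cor_columns(cols; handling = :complete)  # uses rows 1 to 3 for every pair
cor_columns(cols; handling = :pairwise)  # rows differ from pair to pair
```

Note that the `:pairwise` result is not guaranteed to be positive semidefinite, since each entry may be computed from a different set of rows.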

farrellm · 2014-04-22T14:54:30Z

Sounds reasonable. Are there any existing functions with this sort of behavior that I could mimic?

johnmyleswhite · 2014-04-22T14:54:34Z

We used to have this and removed it since it didn't work very well. I'm not totally convinced we should add it, since it encourages people to use a DataFrame when they should be using a DataMatrix, but maybe we should relax that rule. I'd say we should flag this for review after we finally finish defining and cleaning up the core functionality for DataFrames.

nalimilan · 2014-04-22T16:04:13Z

For a concrete use case, I'm currently working (in R) with a data frame storing different types of variables, most of which are numeric, but a few are categorical (with a different theoretical status, e.g. some identify the country to which observations belong). I'm computing correlations between pairs of numeric variables, skipping the categorical ones. Copying these variables to a matrix wouldn't be very practical.
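The use case above (correlate the numeric variables, skip the categorical ones) might look like this in current Julia. To keep the sketch self-contained, a NamedTuple of vectors stands in for the data frame; the column names and the helper `numeric_names` are made up for illustration.

```julia
using Statistics

# A NamedTuple of vectors stands in for a data frame here.
table = (country = ["FR", "FR", "DE", "DE"],
         income  = [1.0, 2.0, 3.0, 5.0],
         age     = [40, 30, 20, 10])

# Hypothetical helper: names of the columns whose element type is a real number.
numeric_names(tbl) = [k for k in keys(tbl) if eltype(tbl[k]) <: Real]

vars = numeric_names(table)   # the categorical `country` column is skipped
# Assemble only the numeric columns into a matrix and correlate its columns.
C = cor(hcat((Float64.(table[k]) for k in vars)...))
```

The order of `vars` gives the row and column order of `C`, which is exactly the name information the later discussion worries about losing.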

farrellm · 2014-04-22T17:20:43Z

Sounds reasonable to force conversion to DataMatrix (e.g., to force the types to be homogeneous). But then I would still argue for basically the same functionality, just for DataMatrix instead of DataFrame, e.g.,

function cor(df::DataMatrix)
    [corna(df[:, a], df[:, b]) for a=1:size(df, 2), b=1:size(df, 2)]
end

FabianSchuetze · 2017-07-21T14:32:05Z

I wanted to ask whether we can indeed implement a function calculating the correlation for dataframes. So far, I use NamedArrays to do the following:

using NamedArrays

"""
Calculates the correlation among the series in a dataframe
"""
function corr(df::DataFrame)
  varNames = [i for i in names(df)]
  Statistics = NamedArray(zeros( length(varNames), length(varNames)))
  setnames!(Statistics, [string(i) for i in varNames], 2)
  setnames!(Statistics, [string(i) for i in varNames], 1)
  Statistics.array = round(cor(Array(df)),2)
  return Statistics
end

However, I wonder whether this feature can be implemented directly in DataFrames, as I think calculating the correlation among series of data is a natural thing to do.

I hope this thread is the right location for asking such a question.

nalimilan · 2017-07-22T15:23:52Z

I think we should implement something like this, either as cor or as a more general pairwise function which would take cor as its first argument. Adding a dependency on NamedArrays just for this isn't great, though.
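A minimal sketch of the "more general pairwise function" idea, taking the statistic as its first argument. The name `pairwise_apply` is hypothetical, and the input is assumed to be a vector of equally long columns without missing values.

```julia
using Statistics

# Hypothetical helper: evaluate f on every pair of columns.
# Entry [i, j] of the result is f(cols[i], cols[j]).
function pairwise_apply(f, cols::AbstractVector{<:AbstractVector})
    return [f(x, y) for x in cols, y in cols]
end

cols = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [1.0, 0.0, 1.0]]
R = pairwise_apply(cor, cols)   # correlation matrix
S = pairwise_apply(cov, cols)   # same helper, different statistic
```

The design point is that NA handling and naming of the result would be decided once in the generic function instead of once per statistic.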

ararslan · 2017-07-22T18:50:16Z

Rather than using NamedArrays, we could just return a DataFrame with a column of names.

dmbates · 2017-07-22T19:34:08Z

@ararslan A correlation matrix is symmetric and definitely a matrix. It doesn't have the tabular, potentially heterogeneous column structure of a DataFrame. If a NamedArray is not sufficiently lightweight, then perhaps a struct with a Matrix and a Vector{Symbol} of column names would be appropriate. A specialized show method could be added and getindex delegated to the matrix.

ararslan · 2017-07-22T19:43:33Z

> A correlation matrix is symmetric and definitely a matrix. It doesn't have the tabular, potentially heterogeneous column structure of a DataFrame

Right. I was just trying to think of a way to preserve name information without adding a dependency.

> perhaps a struct with a Matrix and Vector{Symbol} of column names would be appropriate

That seems doable, we'd just have to come up with an appropriate API. If we want the type to be <:AbstractMatrix, we'll have to implement a fair number of functions for it to adhere to the expected interface.
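To make the struct suggestion concrete, here is a rough sketch in current Julia. The type name `NamedSquareMatrix` and its fields are hypothetical. It implements only `size` and integer `getindex`, which is enough for a read-only `AbstractMatrix`, plus lookup by name; a specialized `show` method and anything beyond read access are left out.

```julia
# Hypothetical type: a matrix plus the column names it was computed from.
struct NamedSquareMatrix{T} <: AbstractMatrix{T}
    data::Matrix{T}
    names::Vector{Symbol}
end

# Minimal read-only AbstractArray interface.
Base.size(m::NamedSquareMatrix) = size(m.data)
Base.getindex(m::NamedSquareMatrix, i::Int, j::Int) = m.data[i, j]

# Lookup by name, delegated to the wrapped matrix.
function Base.getindex(m::NamedSquareMatrix, a::Symbol, b::Symbol)
    i = findfirst(==(a), m.names)
    j = findfirst(==(b), m.names)
    (i === nothing || j === nothing) && throw(KeyError((a, b)))
    return m.data[i, j]
end

m = NamedSquareMatrix([1.0 0.3; 0.3 1.0], [:income, :age])
m[:income, :age]   # same element as m[1, 2]
```

Generic functions such as `sum` work through the two interface methods, but slicing by name, `setindex!`, and name-preserving operations would each need more code, which is where the overlap with NamedArrays starts.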

dmbates · 2017-07-22T20:55:15Z

> That seems doable, we'd just have to come up with an appropriate API. If we want the type to be <:AbstractMatrix, we'll have to implement a fair number of functions for it to adhere to the expected interface.

By which time much of the NamedArrays package would have been replicated. :-) The classic conundrum.

nalimilan · 2017-07-22T21:25:10Z

> By which time much of the NamedArrays package would have been replicated. :-) The classic conundrum.

Yes, we should definitely avoid reinventing something which already exists for the sake of avoiding a dependency. I think the root issue here is that there's a competition between NamedArrays and AxisArrays, so neither of these packages is a standard dependency of core packages yet. But we should make a decision at some point, as it's absurd that e.g. one cannot get frequency tables or pivot tables as matrices using StatsBase or DataFrames. It would also be useful in Distances.jl to be able to give names to observations and/or variables in the input matrix, and to get a named matrix as the output.

rofinn · 2017-07-23T05:06:25Z

FWIW, I'd prefer that each package implement their own cor methods with their respective types.

Examples:

cor(::DataFrame) -> DataMatrix
cor(::NamedArray) -> NamedArray
cor(::AxisArray) -> AxisArray

Any other combinations should probably be left up to third party packages which can implement things like cor(::DataFrame) -> NamedArray.

ararslan · 2017-07-23T05:11:29Z

The problem with returning a DataArray as the correlation matrix computed from a DataFrame is that you lose all name information, which makes the result far less useful unless you know the order of the columns going into it and you're sure that the order doesn't change at any point in the function that computes the correlation.

rofinn · 2017-07-23T05:15:17Z

Yes, but that seems like an appropriate result given that a correlation matrix doesn't make sense as tabular data and DataFrames probably shouldn't know about NamedArrays or AxisArrays. I think it's a reasonable requirement (or assumption) that cor won't change the ordering from the source DataFrame (although there should probably be a docstring and test to confirm that assumption). The bigger concern with my suggestion is that we'd need to define a new API in StatsBase for specifying the return type (kind of like parse or convert) to avoid dispatch collisions.
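One way to read the "specify the return type, like parse or convert" idea is to dispatch on a leading type argument. The sketch below is only a guess at what such an API could look like; `cor_as` is a hypothetical name, and plain `Matrix` and `Dict` stand in for the array packages under discussion.

```julia
using Statistics

# Hypothetical API: the first argument selects the container of the result.
cor_as(::Type{Matrix}, X::AbstractMatrix) = cor(X)

# Keyed by pairs of names, so lookups do not depend on column positions.
function cor_as(::Type{Dict}, X::AbstractMatrix, names::AbstractVector{Symbol})
    C = cor(X)
    return Dict((names[i], names[j]) => C[i, j] for i in axes(C, 1), j in axes(C, 2))
end

X = [1.0 2.0; 2.0 1.0; 3.0 0.0]
cor_as(Matrix, X)              # plain matrix, names dropped
d = cor_as(Dict, X, [:a, :b])  # d[(:a, :b)] is the correlation of the two columns
```

Each package could then add its own method for its own container type without colliding on `cor(::DataFrame)`.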

ararslan · 2017-07-23T05:22:39Z

I agree that it's a reasonable assumption that names won't be permuted. But it's quite annoying if you want to get the correlation between two variables by name and you know their names but not their positions.

I wouldn't be too opposed to adding a dependency on NamedArrays or AxisArrays, but the choice of which may prove difficult. I think I've been hearing more about AxisArrays recently (mostly from Jeff) than I have about NamedArrays.

rofinn · 2017-07-23T05:47:13Z

I guess my only concern is that if AxisArrays adds DataFrames as a dependency (e.g., adding an AxisArray constructor which takes a DataFrame) we'll get a circular dependency. Since we're not sure which type to support why not just leave these kinds of interactions to a third-party package?

ararslan · 2017-07-23T05:52:43Z

I think it's well outside of the scope of AxisArrays to add a dependency on DataFrames, so I don't think we have to worry about that. Really my only concern here is bloat, since adding a dependency indirectly adds its dependencies as dependencies, but we already do have a kind of absurd number of dependencies here, so... what's one more? ¯\_(ツ)_/¯

nalimilan · 2017-07-23T09:46:13Z

Yes, I think the DataFrames -> NamedArrays/AxisArrays dependency is the most logical and useful one. I don't see why NamedArrays/AxisArrays would depend on DataFrames.

The idea of passing the expected return type to cor doesn't make a lot of sense IMHO: we should always return an array with names. The question is just to choose between NamedArrays and AxisArrays, and that's a question that needs to be solved for the whole JuliaStats ecosystem at some point anyway. The lack of a standard package for this has made all PRs requiring this kind of feature derail, just like this one or the frequency tables support in StatsBase.

ararslan · 2017-07-23T18:35:44Z

AxisArrays seems somewhat more complex than NamedArrays, but it has the Tim Holy Blessing™, which IMO says a lot about the quality and ongoing maintenance of the package. (That is of course not to disparage NamedArrays in any way, I'm just a Tim Holy fanboy.)

Figuring out what our use of any kind of array-with-names package would look like would likely take some rather involved design discussion. It would feel a bit random if cor returned an AxisArray (for example) but nothing else in the package was formulated in terms of AxisArrays. And once named tuples land in Base, that's another named thingamabob to consider, since then a DataFrame can just be formulated as a named tuple of Vector{T?}s. But it may be difficult to formulate a correlation matrix as a named tuple of whatevers.

andreasnoack · 2017-07-23T18:55:56Z

https://github.com/JuliaStats/DataFrames.jl/blob/master/src/statsmodels/statsmodel.jl is already a bit special here, so maybe a solution could be a DataFramesStats package, or maybe just the batteries-included Stats package we have been talking about for glms and cor based on DataFrames. Such a package could have a lot of dependencies, including either NamedArrays or AxisArrays.

ararslan · 2017-07-23T18:58:40Z

> maybe a solution could be a DataFramesStats package

You mean like StatsModels? 😉

nalimilan · 2017-07-23T21:36:57Z

Indeed, the modeling-related features are supposed to move to StatsModels or StreamModels, and they are really not related to cor. I'm not sure what other features could go with it in a separate package.

Anyway, what I'm saying is that at some point one named arrays package should become a standard dependency of the JuliaStats ecosystem, and even be loaded by default via using Stats. This is an essential feature that former R users are going to miss all the time. Then it won't be an issue that DataFrames depends on it (waiting for optional dependencies to be implemented). (BTW, NamedTuple will never replace named arrays, just like Tuple cannot replace arrays, so that's quite a different debate.)

andreasnoack · 2017-07-24T01:34:59Z

> You mean like StatsModels? 😉

I'm not sure because my understanding is that StatsModels is mainly handling all the infrastructure related to formulas and doesn't aim to provide a full set of statistics methods for DataFrames (I guess it would also be DataTables at this point in time).

ararslan · 2017-07-24T01:59:01Z

Yes, true

FabianSchuetze · 2017-07-24T15:53:52Z

I experimented a bit with NamedArrays and AxisArrays and compared their usefulness for generating correlation matrices. I used the following code:

using NamedArrays, AxisArrays, DataFrames

df = DataFrame([rand(1000) for i in 1:10]);

"""
Calculates the correlation among the series in a dataframe and returns
a NamedArray
"""
function corNamedArray(df::DataFrame)
  varNames = [i for i in names(df)]
  Statistics = NamedArray(zeros( length(varNames), length(varNames)))
  setnames!(Statistics, [string(i) for i in varNames], 2)
  setnames!(Statistics, [string(i) for i in varNames], 1)
  Statistics.array = round(cor(Array(df)),2)
  return Statistics
end

"""
Calculates the correlation among the series in a dataframe and returns
an AxisArray
"""
function corAxisArrays(df::DataFrame)
  varNames = [i for i in names(df)]
  Statistics = AxisArray(round(cor(Array(df)),2),
                         Axis{:variable}(varNames),
                         Axis{:variables}(varNames))
  return Statistics
end

StatNamedArray = corNamedArray(df)
StatAxisArray = corAxisArrays(df)

The time for the computation of each function, corNamedArray and corAxisArrays, measured by @time is virtually identical. AxisArrays are a bit smaller than NamedArrays (Base.summarysize(StatAxisArray) = 928 vs Base.summarysize(StatNamedArray) = 1586). One can access elements of the AxisArray with Symbols, while NamedArrays are accessed with Strings (StatAxisArray[:x1, :x1] vs StatNamedArray["x1", "x1"]). So far, I prefer working with AxisArrays, as it resembles how I access series in DataFrames.

I haven't figured out how I can write only the upper triangle of the correlation matrix in either a NamedArray or an AxisArray. Does somebody know how to do that? I find reading only the upper triangle visually appealing, and if only one of NamedArrays or AxisArrays were capable of doing that, I would prefer working with that package.
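For a plain matrix, the upper-triangle question can be approached with the standard library alone, as sketched below in current Julia. Whether the wrapper survives inside a NamedArray or an AxisArray is not covered here; the data are made up.

```julia
using LinearAlgebra, Statistics

X = [1.0 2.0 0.5; 2.0 1.0 0.1; 3.0 0.0 0.9; 4.0 1.5 0.2]
C = round.(cor(X), digits = 2)

# View of C in which entries below the diagonal read as zero.
U = UpperTriangular(C)

# Alternative for printing: blank strings below the diagonal.
shown = [i <= j ? string(C[i, j]) : "" for i in axes(C, 1), j in axes(C, 2)]
```

`UpperTriangular` does not copy the data, while `shown` is a matrix of strings meant only for display.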

nalimilan · 2017-07-25T11:34:08Z

I don't think these packages are stabilized yet. We should compare the potentials of their respective designs, not only their current features. Also, AxisArrays supports strings in addition to symbols (and NamedArrays could probably support symbols if we want).

bkamins · 2019-07-25T01:24:44Z

Closing this, as this functionality should not live in DataFrames.jl. Now that we have Tables.jl, if we add cor functionality, it should apply to any type that follows that interface.

If someone really wants to do it in DataFrames.jl it is relatively easy to do using the core functionality we provide, see https://github.com/bkamins/JuliaCon2019-DataFrames-Tutorial/blob/master/DataFramesIntroduction.ipynb.

(reopen if you disagree)

farrellm changed the title ~~Implement Base.cor for DataFrame~~ Implement Base.cor for DataMatrix Apr 22, 2014

nalimilan changed the title ~~Implement Base.cor for DataMatrix~~ Implement Base.cor for DataFrame Oct 1, 2016

Nosferican mentioned this issue Nov 21, 2017

Comparison of a dataframe and a number #1281

Closed

bkamins mentioned this issue Jan 15, 2019

DataFrames.jl roadmap #1678

Closed

31 tasks

bkamins closed this as completed Jul 25, 2019
