Implement Base.cor for DataFrame #583

Closed
farrellm opened this issue Apr 22, 2014 · 28 comments

Comments

@farrellm

I think Base.cor has a well-defined meaning for DataFrames that is distinct from the cor of the associated array: namely, the correlation of the columns with NA handling, e.g.,

function corna(x, y)
    # keep only the rows where both x and y are observed
    b = !(isna(x) | isna(y))
    cor(x[b], y[b])
end

import Base.cor
function cor(df::AbstractDataFrame)
    [corna(x[2], y[2]) for x=eachcol(df), y=eachcol(df)]
end

I realize NA handling is tricky, but could we add something like this? (and probably Base.cov at the same time) Thanks.

@nalimilan
Member

I think the NA handling should be a keyword argument. FWIW, R supports several behaviors wrt. NAs:

  • return NA if an NA is present (there's also a variant that throws an error instead)
  • use only rows where none of the variables is NA ("complete cases"), again with a variant that throws an error
  • use rows where no NA is present for the pair of variables considered (so each computed correlation may use a different set of rows)

I think R offers too many options, but the three ways presented above can be useful.
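
As a rough illustration of the three behaviors listed above, here is a minimal sketch in current Julia, where `missing` stands in for NA. The function name `cor_missing` and the `policy` keyword are made up for this example and are not part of DataFrames or Base. As a simplification, the first policy returns a single `missing` for the whole result rather than marking individual entries.

```julia
using Statistics

# Hypothetical helper (not an existing API): correlation matrix of the
# columns of `X` under three NA policies. `missing` stands in for NA.
function cor_missing(X::AbstractMatrix; policy::Symbol = :propagate)
    p = size(X, 2)
    if policy == :propagate
        # Simplification: any missing value makes the whole result missing.
        return any(ismissing, X) ? missing : cor(Float64.(X))
    elseif policy == :complete
        # Keep only the rows where every variable is observed.
        keep = [!any(ismissing, row) for row in eachrow(X)]
        return cor(Float64.(X[keep, :]))
    elseif policy == :pairwise
        # Use a possibly different set of rows for each pair of variables.
        C = Matrix{Float64}(undef, p, p)
        for i in 1:p, j in 1:p
            keep = .!(ismissing.(X[:, i]) .| ismissing.(X[:, j]))
            C[i, j] = cor(Float64.(X[keep, i]), Float64.(X[keep, j]))
        end
        return C
    else
        throw(ArgumentError("unknown policy: $policy"))
    end
end

X = [1.0 2.0 missing; 2.0 4.1 1.0; 3.0 5.9 2.0; 4.0 8.2 4.0; missing 1.0 3.0]
complete = cor_missing(X; policy = :complete)  # uses rows 2 to 4 only
pairwise = cor_missing(X; policy = :pairwise)  # rows differ per pair
```

With the pairwise policy the resulting matrix is not guaranteed to be positive semidefinite, which is one reason to make the choice an explicit keyword argument.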

@farrellm
Author

Sounds reasonable. Are there any existing functions with this sort of behavior that I could mimic?

@johnmyleswhite
Contributor

We used to have this and removed it since it didn't work very well. I'm not totally convinced we should add it, since it encourages people to use a DataFrame when they should be using a DataMatrix, but maybe we should relax that rule. I'd say we should flag this for review after we finally finish defining and cleaning up the core functionality for DataFrames.

@nalimilan
Member

For a concrete use case, I'm currently working (in R) with a data frame storing different types of variables, most of which are numeric, but a few are categorical (with a different theoretical status, e.g. some identify the country to which observations belong). I'm computing correlations between pairs of numeric variables, skipping the categorical ones. Copying these variables to a matrix wouldn't be very practical.

@farrellm
Author

It sounds reasonable to force conversion to DataMatrix (e.g., to force the types to be homogeneous). But then I would still argue for basically the same functionality, just for DataMatrix instead of DataFrame, e.g.,

function cor(df::DataMatrix)
    [corna(df[:, a], df[:, b]) for a=1:size(df, 2), b=1:size(df, 2)]
end

@farrellm farrellm changed the title Implement Base.cor for DataFrame Implement Base.cor for DataMatrix Apr 22, 2014
@nalimilan nalimilan changed the title Implement Base.cor for DataMatrix Implement Base.cor for DataFrame Oct 1, 2016
@FabianSchuetze

I wanted to ask whether we can indeed implement a function calculating the correlation for dataframes. So far, I use NamedArrays to do the following:

using NamedArrays

"""
Calculates the correlation among the series in a dataframe
"""
function corr(df::DataFrame)
  varNames = [i for i in names(df)]
  # square NamedArray, labelled with the variable names on both dimensions
  Statistics = NamedArray(zeros( length(varNames), length(varNames)))
  setnames!(Statistics, [string(i) for i in varNames], 2)
  setnames!(Statistics, [string(i) for i in varNames], 1)
  Statistics.array = round(cor(Array(df)),2)
  return Statistics
end

However, I wonder whether this feature can be implemented directly in DataFrames, as I think calculating the correlation among series of data is a natural thing to do.

I hope this thread is the right place to ask such a question.

@nalimilan
Member

I think we should implement something like this, either as cor or as a more general pairwise function which would take cor as its first argument. Adding a dependency on NamedArrays just for this isn't great, though.
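
To make the more general variant concrete, here is a minimal sketch of a function that takes `cor` (or any other two-argument function) as its first argument. The name `pairwise_sketch` is hypothetical, and a NamedTuple of equal-length vectors stands in for a data frame so that the example needs no package beyond the standard library.

```julia
using Statistics

# Hypothetical helper: apply `f` to every pair of columns.
# `cols` is a NamedTuple of equal-length vectors standing in for a data frame.
function pairwise_sketch(f, cols::NamedTuple)
    vars = collect(keys(cols))                          # column names, in order
    M = [f(cols[a], cols[b]) for a in vars, b in vars]  # one entry per pair
    return vars, M
end

cols = (x = [1.0, 2.0, 3.0, 4.0],
        y = [2.0, 4.0, 6.0, 8.0],
        z = [4.0, 3.0, 2.0, 1.0])

vars, M = pairwise_sketch(cor, cols)
```

Returning the names next to a plain matrix sidesteps the question of which named-array package to depend on, at the cost of leaving the name lookup to the caller.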

@ararslan
Member

Rather than using NamedArrays, we could just return a DataFrame with a column of names.

@dmbates
Contributor

dmbates commented Jul 22, 2017

@ararslan A correlation matrix is symmetric and definitely a matrix. It doesn't have the tabular, potentially heterogeneous column structure of a DataFrame. If a NamedArray is not sufficiently lightweight then perhaps a struct with a Matrix and Vector{Symbol} of column names would be appropriate. A specialized show method could be added and getindex delegated to the matrix.
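
A minimal sketch of such a struct might look like the following. The type name `NamedCorMatrix` and its field names are assumptions made for this illustration, and the type is deliberately not a subtype of AbstractMatrix.

```julia
using Statistics

# Hypothetical type: a matrix plus the column names it was computed from.
struct NamedCorMatrix
    values::Matrix{Float64}
    names::Vector{Symbol}
end

# Positional indexing is delegated to the wrapped matrix.
Base.getindex(c::NamedCorMatrix, i::Integer, j::Integer) = c.values[i, j]

# Name-based indexing looks up the positions first.
function Base.getindex(c::NamedCorMatrix, a::Symbol, b::Symbol)
    i = findfirst(isequal(a), c.names)
    j = findfirst(isequal(b), c.names)
    (i === nothing || j === nothing) && throw(KeyError((a, b)))
    return c.values[i, j]
end

# Specialized display: one labelled row per variable, rounded to 2 digits.
function Base.show(io::IO, c::NamedCorMatrix)
    println(io, "NamedCorMatrix with variables ", join(c.names, ", "))
    for (i, name) in enumerate(c.names)
        println(io, rpad(string(name), 8),
                join(round.(c.values[i, :]; digits = 2), "  "))
    end
end

data = [1.0 2.0; 2.0 1.0; 3.0 0.5]
c = NamedCorMatrix(cor(data), [:a, :b])
```

Here `c[:a, :b]` and `c[1, 2]` return the same entry, so a correlation can be looked up by name without knowing the column order.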

@ararslan
Member

A correlation matrix is symmetric and definitely a matrix. It doesn't have the tabular, potentially heterogeneous column structure of a DataFrame

Right. I was just trying to think of a way to preserve name information without adding a dependency.

perhaps a struct with a Matrix and Vector{Symbol} of column names would be appropriate

That seems doable, we'd just have to come up with an appropriate API. If we want the type to be <:AbstractMatrix, we'll have to implement a fair number of functions for it to adhere to the expected interface.

@dmbates
Contributor

dmbates commented Jul 22, 2017

That seems doable, we'd just have to come up with an appropriate API. If we want the type to be <:AbstractMatrix, we'll have to implement a fair number of functions for it to adhere to the expected interface.

By which time much of the NamedArrays package would have been replicated. :-) The classic conundrum.

@nalimilan
Member

By which time much of the NamedArrays package would have been replicated. :-) The classic conundrum.

Yes, we should definitely avoid reinventing something which already exists for the sake of avoiding a dependency. I think the root issue here is that there's a competition between NamedArrays and AxisArrays, so neither of these packages is a standard dependency of core packages yet. But we should make a decision at some point, as it's absurd that e.g. one cannot get frequency tables or pivot tables as matrices using StatsBase or DataFrames. It would also be useful in Distances.jl to be able to give names to observations and/or variables in the input matrix, and to get a named matrix as the output.

@rofinn
Member

rofinn commented Jul 23, 2017

FWIW, I'd prefer that each package implement their own cor methods with their respective types.

Examples:

  • cor(::DataFrame) -> DataMatrix
  • cor(::NamedArray) -> NamedArray
  • cor(::AxisArray) -> AxisArray

Any other combinations should probably be left up to third party packages which can implement things like cor(::DataFrame) -> NamedArray.

@ararslan
Member

The problem with returning a DataArray as the correlation matrix computed from a DataFrame is that you lose all name information, which makes the result far less useful unless you know the order of the columns going into it and you're sure that the order doesn't change at any point in the function that computes the correlation.

@rofinn
Member

rofinn commented Jul 23, 2017

Yes, but that seems like an appropriate result given that a correlation matrix doesn't make sense as tabular data and DataFrames probably shouldn't know about NamedArrays or AxisArrays. I think it's a reasonable requirement (or assumption) that cor won't change the ordering from the source DataFrame (although there should probably be a docstring and test to confirm that assumption). The bigger concern with my suggestion is that we'd need to define a new API in StatsBase for specifying the return type (kind of like parse or convert) to avoid dispatch collisions.

@ararslan
Member

I agree that it's a reasonable assumption that names won't be permuted. But it's quite annoying if you want to get the correlation between two variables by name and you know their names but not their positions.

I wouldn't be too opposed to adding a dependency on NamedArrays or AxisArrays, but the choice of which may prove difficult. I think I've been hearing more about AxisArrays recently (mostly from Jeff) than I have about NamedArrays.

@rofinn
Member

rofinn commented Jul 23, 2017

I guess my only concern is that if AxisArrays adds DataFrames as a dependency (e.g., adding an AxisArray constructor which takes a DataFrame) we'll get a circular dependency. Since we're not sure which type to support why not just leave these kinds of interactions to a third-party package?

@ararslan
Member

ararslan commented Jul 23, 2017

I think it's well outside of the scope of AxisArrays to add a dependency on DataFrames, so I don't think we have to worry about that. Really my only concern here is bloat, since adding a dependency indirectly adds its dependencies as dependencies, but we already do have a kind of absurd number of dependencies here, so... what's one more? ¯\_(ツ)_/¯

@nalimilan
Member

Yes, I think the DataFrames -> NamedArrays/AxisArrays dependency is the most logical and useful one. I don't see why NamedArrays/AxisArrays would depend on DataFrames.

The idea of passing the expected return type to cor doesn't make a lot of sense IMHO: we should always return an array with names. The question is just to choose between NamedArrays and AxisArrays, and that's a question that needs to be solved for the whole JuliaStats ecosystem at some point anyway. The lack of a standard package for this has made all PRs requiring this kind of feature derail, just like this one or the frequency tables support in StatsBase.

@ararslan
Member

AxisArrays seems somewhat more complex than NamedArrays, but it has the Tim Holy Blessing™, which IMO says a lot about the quality and ongoing maintenance of the package. (That is of course not to disparage NamedArrays in any way, I'm just a Tim Holy fanboy.)

Figuring out what our use of any kind of array-with-names package would look like would likely take some rather involved design discussion. It would feel a bit random if cor returned an AxisArray (for example) but nothing else in the package was formulated in terms of AxisArrays. And once named tuples land in Base, that's another named thingamabob to consider, since then a DataFrame can just be formulated as a named tuple of Vector{T?}s. But it may be difficult to formulate a correlation matrix as a named tuple of whatevers.

@andreasnoack
Member

https://github.com/JuliaStats/DataFrames.jl/blob/master/src/statsmodels/statsmodel.jl is already a bit special here, so maybe a solution could be a DataFramesStats package, or maybe just the batteries-included Stats package we have been talking about for glms and cor based on DataFrames. Such a package could have a lot of dependencies, including either NamedArrays or AxisArrays.

@ararslan
Member

maybe a solution could be a DataFramesStats package

You mean like StatsModels? 😉

@nalimilan
Member

Indeed, the modeling-related features are supposed to move to StatsModels or StreamModels, and they are really not related to cor. I'm not sure what other features could go with it in a separate package.

Anyway, what I'm saying is that at some point one named arrays package should become a standard dependency of the JuliaStats ecosystem, and even be loaded by default via using Stats. This is an essential feature that former R users are going to miss all the time. Then it won't be an issue that DataFrames depends on it (waiting for optional dependencies to be implemented). (BTW, NamedTuple will never replace named arrays, just like Tuple cannot replace arrays, so that's a quite different debate.)

@andreasnoack
Member

You mean like StatsModels? 😉

I'm not sure because my understanding is that StatsModels is mainly handling all the infrastructure related to formulas and doesn't aim to provide a full set of statistics methods for DataFrames (I guess it would also be DataTables at this point in time).

@ararslan
Member

Yes, true

@FabianSchuetze

I experimented a bit with NamedArrays and AxisArrays and compared their usefulness for generating correlation matrices. I used the following code:

using NamedArrays, AxisArrays, DataFrames

# ten columns of 1000 random numbers each
df = DataFrame([rand(1000) for i in 1:10]);

"""
Calculates the correlation among the series in a dataframe and returns
a NamedArray
"""
function corNamedArray(df::DataFrame)
  varNames = [i for i in names(df)]
  # square NamedArray, labelled with the variable names on both dimensions
  Statistics = NamedArray(zeros( length(varNames), length(varNames)))
  setnames!(Statistics, [string(i) for i in varNames], 2)
  setnames!(Statistics, [string(i) for i in varNames], 1)
  Statistics.array = round(cor(Array(df)),2)
  return Statistics
end

"""
Calculates the correlation among the series in a dataframe and returns
an AxisArray
"""
function corAxisArrays(df::DataFrame)
  varNames = [i for i in names(df)]
  # the two axes carry the same labels but need distinct axis names
  Statistics = AxisArray(round(cor(Array(df)),2),
                         Axis{:variable}(varNames),
                         Axis{:variables}(varNames))
  return Statistics
end

StatNamedArray = corNamedArray(df)
StatAxisArray = corAxisArrays(df)

The computation time of the two functions corNamedArray and corAxisArrays, as measured by @time, is virtually identical. AxisArrays are a bit smaller than NamedArrays (Base.summarysize(StatAxisArray) = 928 vs Base.summarysize(StatNamedArray) = 1586). One can access elements of the AxisArray with Symbols, while NamedArrays are accessed with Strings (StatAxisArray[:x1, :x1] vs StatNamedArray["x1", "x1"]). So far, I prefer working with AxisArrays as it resembles how I access series in DataFrames.

I haven't figured out how I can show only the upper triangle of the correlation matrix in either a NamedArray or an AxisArray. Does somebody know how to do that? I find reading only the upper triangle visually appealing, and if only one of NamedArrays or AxisArrays were capable of doing that, I would prefer working with that package.
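
For a plain Matrix, keeping only the upper part of a correlation matrix can be sketched with the standard library alone, as below. Whether a NamedArray or an AxisArray keeps its names when treated this way is not something this sketch addresses; the variable names are placeholders.

```julia
using LinearAlgebra, Statistics

X = [1.0 2.0 0.5; 2.0 1.0 1.5; 3.0 4.0 1.0; 4.0 3.0 2.5]
C = cor(X)

# Wrapper view: entries below the diagonal read as zero.
U = UpperTriangular(C)

# Alternative for display: blank out the lower triangle with `missing`
# and round the remaining entries to two digits.
D = [i <= j ? round(C[i, j]; digits = 2) : missing
     for i in axes(C, 1), j in axes(C, 2)]
```

The `missing` variant tends to read better when printed, since zeros below the diagonal could be mistaken for actual zero correlations.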

@nalimilan
Member

I don't think these packages are stabilized yet. We should compare the potentials of their respective designs, not only their current features. Also, AxisArrays supports strings in addition to symbols (and NamedArrays could probably support symbols if we want).

@bkamins bkamins mentioned this issue Jan 15, 2019
31 tasks
@bkamins
Member

bkamins commented Jul 25, 2019

Closing this, as this functionality should not live in DataFrames.jl. Now that we have Tables.jl, if we add cor functionality it should apply to any type that follows that interface.

If someone really wants to do it in DataFrames.jl it is relatively easy to do using the core functionality we provide, see https://github.com/bkamins/JuliaCon2019-DataFrames-Tutorial/blob/master/DataFramesIntroduction.ipynb.

(reopen if you disagree)

@bkamins bkamins closed this as completed Jul 25, 2019
9 participants