Implement Base.cor for DataFrame #583
Comments
I think the NA handling should be a keyword argument. FWIW, R supports several behaviors wrt. NAs:
I think R offers too many options, but the three ways presented above can be useful. |
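A minimal sketch of what keyword-controlled missing-value handling could look like, written against current Julia's `missing` rather than the `NA` of the time. The three modes are an assumption modeled on the `"everything"`, `"complete.obs"` and `"pairwise.complete.obs"` values of R's `use` argument; the list referred to above is not preserved here. The names `cor_with_missing`, `pairwise_cor`, `propagate_cor` and the `use` keyword are hypothetical, not an existing API.

```julia
using Statistics

# Hypothetical helper: correlate two vectors using only the positions
# where both values are present.
function pairwise_cor(x::AbstractVector, y::AbstractVector)
    keep = [!(ismissing(a) || ismissing(b)) for (a, b) in zip(x, y)]
    return cor(Float64.(x[keep]), Float64.(y[keep]))
end

# Hypothetical helper: return `missing` if either vector contains a missing value.
function propagate_cor(x::AbstractVector, y::AbstractVector)
    (any(ismissing, x) || any(ismissing, y)) && return missing
    return cor(Float64.(x), Float64.(y))
end

# Column-by-column correlation with the missing-value policy chosen by keyword.
function cor_with_missing(X::AbstractMatrix; use::Symbol = :pairwise)
    p = size(X, 2)
    if use === :everything
        # Missing values propagate into every entry that touches them.
        return [propagate_cor(X[:, a], X[:, b]) for a in 1:p, b in 1:p]
    elseif use === :complete
        # Listwise deletion: drop every row that has any missing value.
        rows = [!any(ismissing, X[i, :]) for i in 1:size(X, 1)]
        return cor(Float64.(X[rows, :]))
    elseif use === :pairwise
        # Pairwise deletion: each entry uses the rows complete for that pair.
        return [pairwise_cor(X[:, a], X[:, b]) for a in 1:p, b in 1:p]
    else
        throw(ArgumentError("unknown `use` option: $use"))
    end
end
```

Note that pairwise deletion can produce a matrix that is not positive semidefinite, since different entries are computed from different subsets of rows.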
Sounds reasonable. Are there any existing functions with this sort of behavior that I could mimic? |
We used to have this and removed it since it didn't work very well. I'm not totally convinced we should add it, since it encourages people to use a DataFrame when they should be using a DataMatrix, but maybe we should relax that rule. I'd say we should flag this for review after we finally finish defining and cleaning up the core functionality for DataFrames. |
For a concrete use case, I'm currently working (in R) with a data frame storing different types of variables, most of which are numeric, but a few are categorical (with a different theoretical status, e.g. some identify the country to which observations belong). I'm computing correlations between pairs of numeric variables, skipping the categorical ones. Copying these variables to a matrix wouldn't be very practical. |
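The use case above can be sketched without any package dependency. This assumes a NamedTuple of equal-length vectors as a stand-in for a data frame; the column names and values are made up for illustration.

```julia
using Statistics

# Hypothetical column table: a NamedTuple of vectors stands in for a data frame.
table = (
    country = ["FR", "DE", "FR", "IT", "DE"],
    income  = [1.0, 2.0, 3.0, 4.0, 5.0],
    age     = [2.0, 4.1, 5.9, 8.2, 9.8],
    score   = [5.0, 3.0, 4.0, 1.0, 2.0],
)

# Keep only the columns whose element type is numeric.
numeric_names = [name for name in keys(table) if eltype(table[name]) <: Real]

# Correlate every pair of numeric columns, skipping the categorical ones.
C = [cor(table[a], table[b]) for a in numeric_names, b in numeric_names]
```

The result is a plain matrix, so the mapping from positions back to variable names lives only in `numeric_names`.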
Sounds reasonable to force conversion to DataMatrix (e.g., to force the types to be homogeneous). But then I would still argue for basically the same functionality, just for DataMatrix instead of DataFrame, e.g.,
    function cor(df::DataMatrix)
        [corna(df[:, a], df[:, b]) for a=1:size(df, 2), b=1:size(df, 2)]
    end |
I wanted to ask whether we can indeed implement a function calculating the correlation for dataframes. So far, I use
However, I wonder whether this feature can be implemented directly in I hope this thread is the right location for asking such a question |
I think we should implement something like this, either as |
Rather than using |
@ararslan A correlation matrix is symmetric and definitely a matrix; it doesn't have the tabular, potentially heterogeneous column structure of a DataFrame. If a NamedArray is not sufficiently lightweight then perhaps a struct with a Matrix and Vector{Symbol} of column names would be appropriate. A specialized show method could be added and getindex delegated to the matrix. |
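A rough sketch of the lightweight struct suggested above, assuming only the Statistics standard library. The type name `NamedCorMatrix`, its field names, and the constructor `named_cor` are hypothetical and chosen for illustration.

```julia
using Statistics

# Hypothetical type: a correlation matrix that remembers its column names.
struct NamedCorMatrix
    values::Matrix{Float64}
    names::Vector{Symbol}
end

# Positional indexing is delegated to the underlying matrix.
Base.getindex(c::NamedCorMatrix, i::Integer, j::Integer) = c.values[i, j]

# Lookup by name, so callers need not know column positions.
function Base.getindex(c::NamedCorMatrix, a::Symbol, b::Symbol)
    i = findfirst(==(a), c.names)
    j = findfirst(==(b), c.names)
    (i === nothing || j === nothing) && throw(KeyError((a, b)))
    return c.values[i, j]
end

Base.size(c::NamedCorMatrix) = size(c.values)

# Specialized display: a header row of names, then one labeled row per variable.
function Base.show(io::IO, ::MIME"text/plain", c::NamedCorMatrix)
    println(io, join(vcat("", string.(c.names)), '\t'))
    for (i, name) in enumerate(c.names)
        cells = string.(round.(c.values[i, :]; digits = 3))
        println(io, join(vcat(string(name), cells), '\t'))
    end
end

# Build from a plain matrix plus names, both supplied by the caller.
named_cor(X::AbstractMatrix, names::Vector{Symbol}) = NamedCorMatrix(cor(X), names)
```

With this, `c[:a, :b]` and `c[1, 2]` return the same entry, which addresses lookup by name without pulling in a named-array package. It is not a substitute for the richer indexing those packages provide.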
Right. I was just trying to think of a way to preserve name information without adding a dependency.
That seems doable, we'd just have to come up with an appropriate API. If we want the type to be |
By which time much of the |
Yes, we should definitely avoid reinventing something which already exists for the sake of avoiding a dependency. I think the root issue here is that there's a competition between NamedArrays and AxisArrays, so none of these packages is a standard dependency of core packages yet. But we should make a decision at some point, as it's absurd that e.g. one cannot get frequency tables or pivot tables as matrices using StatsBase or DataFrames. It would also be useful in Distances.jl to be able to give names to observations and/or variables in the input matrix, and to get a named matrix as the output. |
FWIW, I'd prefer that each package implement their own Examples)
Any other combinations should probably be left up to third party packages which can implement things like |
The problem with returning a |
Yes, but that seems like an appropriate result given that a correlation matrix doesn't make sense as tabular data and DataFrames probably shouldn't know about NamedArrays or AxisArrays. I think it's a reasonable requirement (or assumption) that |
I agree that it's a reasonable assumption that names won't be permuted. But it's quite annoying if you want to get the correlation between two variables by name and you know their names but not their positions. I wouldn't be too opposed to adding a dependency on NamedArrays or AxisArrays, but the choice of which may prove difficult. I think I've been hearing more about AxisArrays recently (mostly from Jeff) than I have about NamedArrays. |
I guess my only concern is that if AxisArrays adds DataFrames as a dependency (e.g., adding an |
I think it's well outside of the scope of AxisArrays to add a dependency on DataFrames, so I don't think we have to worry about that. Really my only concern here is bloat, since adding a dependency indirectly adds its dependencies as dependencies, but we already do have a kind of absurd number of dependencies here, so... what's one more? ¯\_(ツ)_/¯ |
Yes, I think the DataFrames -> NamedArrays/AxisArrays dependency is the most logical and useful one. I don't see why NamedArrays/AxisArrays would depend on DataFrames. The idea of passing the expected return type to |
AxisArrays seems somewhat more complex than NamedArrays, but it has the Tim Holy Blessing™, which IMO says a lot about the quality and ongoing maintenance of the package. (That is of course not to disparage NamedArrays in any way, I'm just a Tim Holy fanboy.) Figuring out what our use of any kind of array-with-names package would look like would likely take some rather involved design discussion. It would feel a bit random if |
https://github.com/JuliaStats/DataFrames.jl/blob/master/src/statsmodels/statsmodel.jl is already a bit special here so maybe a solution could be a |
You mean like StatsModels? 😉 |
Indeed, the modeling-related features are supposed to move to StatsModels or StreamModels, and they are really not related to Anyway, what I'm saying is that at some point one named arrays package should become a standard dependency of the JuliaStats ecosystem, and even be loaded by default via |
I'm not sure because my understanding is that |
Yes, true |
I experimented a bit with
The time for the computation of each function I haven't figured out how I can write only the upper diagonal of the correlation matrix in either a |
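One way to keep only the upper triangle, sketched with plain Julia containers since the comment above does not say which container was tried. The helper name `upper_cor_pairs` and the field names `var1`, `var2`, `r` are hypothetical.

```julia
using Statistics

# Hypothetical helper: return one entry per unordered pair of columns,
# i.e. the strict upper triangle of the correlation matrix.
function upper_cor_pairs(X::AbstractMatrix, names::Vector{Symbol})
    C = cor(X)
    p = size(C, 1)
    return [(var1 = names[i], var2 = names[j], r = C[i, j])
            for i in 1:p for j in (i + 1):p]
end
```

For `p` columns this yields `p * (p - 1) / 2` rows, each a NamedTuple, which could then be collected into whatever tabular or named container is preferred.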
I don't think these packages are stabilized yet. We should compare the potentials of their respective designs, not only their current features. Also, AxisArrays supports strings in addition to symbols (and NamedArrays could probably support symbols if we want). |
Closing this as this functionality should not live in DataFrames.jl. After Tables.jl if we add If someone really wants to do it in DataFrames.jl it is relatively easy to do using the core functionality we provide, see https://github.com/bkamins/JuliaCon2019-DataFrames-Tutorial/blob/master/DataFramesIntroduction.ipynb. (reopen if you disagree) |
I think Base.cor has a well defined meaning for DataFrames that is distinct from the cor of the associated array. In particular, the correlation of the columns with NA handling, eg,
I realize NA handling is tricky, but could we add something like this? (and probably Base.cov at the same time) Thanks.